# Probabilistic models underpin principled AI

PROWLER.io’s platform uses autonomous AI agents that make decisions based on mathematical principles. Probabilistic modelling underpins these principles, enabling us to compartmentalise the decision-making system in ways that offer long-term benefits.

Agents built using principled methods are more interpretable and explainable: we can open them up and explain their actions by examining what losses and utilities they expect. Then we can improve their decision-making by amending their — and the model’s — underlying assumptions. This is an essential part of how PROWLER.io develops what we believe are best practices for genuinely principled AI.

The first step in applying a probabilistic model is to write down our assumptions (e.g., “the response data are linearly related to the stimulus data”). To use a concrete example: a self-driving car might assume “it takes twice as long to stop in the wet”. Our assumptions about complicated situations may need to be abstract, but they are always explicit.

Once we’ve made our modelling assumptions, we add in data, and the job of the probabilistic modelling team is to compute the answer: the posterior distribution. This is a set of plausible explanations for the data, given our assumptions.
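As a minimal sketch of what "computing the posterior" means, take the linear assumption above in its simplest form: responses equal an unknown slope times the stimulus, plus Gaussian noise. The data values, the known noise level and the prior width below are illustrative assumptions, chosen so that the posterior has a closed form and no sampling is needed.

```python
import math

def slope_posterior(xs, ys, noise_sd=1.0, prior_sd=10.0):
    """Posterior over the slope w in y = w * x + noise.

    Assumes noise ~ N(0, noise_sd^2) and prior w ~ N(0, prior_sd^2),
    so the posterior is Gaussian too. Returns (mean, standard deviation).
    """
    precision = 1.0 / prior_sd**2 + sum(x * x for x in xs) / noise_sd**2
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_sd**2) / precision
    return mean, math.sqrt(1.0 / precision)

# Made-up stimulus/response pairs, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

mean, sd = slope_posterior(xs, ys)
print(f"plausible slopes: {mean:.2f} +/- {sd:.2f}")
```

The result is not a single best slope but a distribution over slopes: the "set of plausible explanations" for the data, given the stated assumptions.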

An entire research field – computational statistics – is devoted to this difficult problem. It uses algorithms such as Markov Chain Monte Carlo (MCMC) or Variational Bayes to compute the posterior. To decide on an action, we have to know the probability of different outcomes and the losses associated with them. Since the posterior distribution contains plausible explanations for the data, it allows for plausible probabilistic explanations of what will happen next. This means we know when we’re predicting well and when we’re merely guessing, which is vital in decision-making.
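In symbols, and using notation that is our own shorthand rather than anything fixed by the text (parameters $\theta$, observed data $D$, a future outcome $y^\ast$), the posterior and the prediction it supports can be written as:

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(y^\ast \mid D) = \int p(y^\ast \mid \theta)\, p(\theta \mid D)\, d\theta
$$

A wide predictive distribution $p(y^\ast \mid D)$ is the model signalling that it is merely guessing; a narrow one signals that it is predicting well.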

The use of probabilistic models underpins three principles:

1. the model should be distinct from the algorithm used to fit that model
2. the posterior distribution of the model should be kept separate from the losses associated with outcomes
3. making modelling assumptions overt and explicit allows us to assess the validity of those assumptions.

### Distinct models and algorithms

Building a probabilistic model requires three things: data, assumptions about how the data originate, and number crunching. Precisely how that number crunching occurs shouldn’t affect the answer, any more than solving an equation with a pencil rather than a computer should change its solution.

But in machine learning, how we crunch the numbers can matter. When optimising a deep neural network, the choice of optimisation method makes a difference: choosing Stochastic Gradient Descent over the Adam optimiser, for instance, can result in solutions that generalise better to unseen data (Wilson et al. 2017). We’re not sure why, and I find this troublesome. If we can’t attribute improvements to particular choices, it becomes harder to build and refine models.

Separating the model and the algorithm addresses this problem. Any correctly implemented MCMC algorithm, run for long enough, targets the same posterior, so the choice of MCMC algorithm makes no difference to the answer. It might affect how rapidly we reach it, or how much computing power it requires, but the answer itself remains unchanged. This is helpful because it allows us to move forward, building on and refining the model, secure in the knowledge that there are no unintended interactions between model and algorithm.
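To illustrate, here is a small sketch under assumed conditions: a coin with unknown bias, a uniform prior, and made-up data of 7 heads and 3 tails. Two Metropolis samplers with deliberately different proposal rules (a local random walk and an independent uniform draw) are run on the same model. The function names, step size and sample counts are hypothetical choices for this example.

```python
import math
import random

HEADS, TAILS = 7, 3  # illustrative data, not from any real experiment

def log_post(theta):
    """Unnormalised log posterior of the coin's bias under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return HEADS * math.log(theta) + TAILS * math.log(1.0 - theta)

def metropolis_mean(propose, n_samples=40000, burn_in=2000, seed=0):
    """Estimate the posterior mean with a Metropolis sampler.

    `propose(rng, theta)` must be symmetric or independent-uniform, so the
    acceptance probability reduces to the ratio of posterior densities.
    """
    rng = random.Random(seed)
    theta, total = 0.5, 0.0
    for i in range(n_samples + burn_in):
        candidate = propose(rng, theta)
        log_ratio = log_post(candidate) - log_post(theta)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = candidate
        if i >= burn_in:
            total += theta
    return total / n_samples

def random_walk(rng, theta):
    return theta + rng.gauss(0.0, 0.2)  # small local steps

def independent_uniform(rng, theta):
    return rng.random()  # ignores the current state entirely

mean_rw = metropolis_mean(random_walk, seed=0)
mean_ind = metropolis_mean(independent_uniform, seed=1)
exact = (HEADS + 1) / (HEADS + TAILS + 2)  # mean of the Beta(8, 4) posterior

print(f"random walk: {mean_rw:.3f}, independent: {mean_ind:.3f}, exact: {exact:.3f}")
```

The two samplers differ in efficiency, but both estimates should agree with each other and with the exact posterior mean up to Monte Carlo error, because the answer is determined by the model and not by the sampler.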

### Separate models and losses

Before we can make a decision, we first need to work out which outcomes are probable – using a model – and then specify our preferences for those outcomes in terms of losses. MacKay (2003) writes: “Decision theory is trivial, aside from the computational details”. This means that for a given model of the world and a given loss function, writing down a mathematical description of what to do is easy; it’s the number crunching that’s hard. Decision theory involves taking our current model of the world – once the number crunching is done – and only then choosing actions that minimise expected loss. The model itself should never depend on the loss function.

To use a simple analogy, a doctor must first diagnose a patient before deciding on treatment. The innate efficacy of any treatment is unrelated to what is wrong with the particular patient, though it may influence the doctor’s decision. The doctor analyses the symptoms (the data) and combines them with her assumptions (about how the human body works) to arrive at a diagnosis (the model’s outcome). She then decides on treatment, accounting for efficacy, cost and side effects (the loss function).

Similarly, a self-driving car must first analyse its current state and environment before it can drive. The intrinsic effectiveness of any manoeuvre (e.g. braking, turning) should not affect one’s belief in the current state. The car analyses the road conditions and traffic around it (the data) and combines them with its assumptions (like stopping distances and expected behaviours of other users) to predict the results of manoeuvres (the model’s outcomes). Only then does it decide to turn, accelerate, select a lane etc., while accounting for safety, comfort, ETA and other aspects of its loss function.
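The separation can be sketched in a few lines. In this hypothetical example the car's beliefs about the road (the model's output) and its loss table (its preferences) are two independent inputs, and the decision rule simply minimises expected loss. All state names, action names, probabilities and loss values are invented for illustration.

```python
def expected_loss(action, beliefs, losses):
    """Average the loss of `action` over the model's beliefs about the state."""
    return sum(p * losses[action][state] for state, p in beliefs.items())

def best_action(beliefs, losses):
    """Choose the action with the smallest expected loss."""
    return min(losses, key=lambda action: expected_loss(action, beliefs, losses))

# Preferences: braking early costs a little comfort in any weather;
# braking late is free in the dry but costly in the wet.
losses = {
    "brake_early": {"dry": 1.0, "wet": 1.0},
    "brake_late": {"dry": 0.0, "wet": 10.0},
}

# Beliefs come from the model alone and never look at the loss table.
uncertain = {"dry": 0.7, "wet": 0.3}
confident_dry = {"dry": 0.95, "wet": 0.05}

print(best_action(uncertain, losses))      # brake_early
print(best_action(confident_dry, losses))  # brake_late
```

Changing the loss table changes the decision without touching the beliefs, and updating the beliefs changes the decision without touching the losses, which is exactly the separation described above.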

Some machine learning methods don’t respect these separations: reinforcement learning methods often learn functions that map directly from observations of the system to the effectiveness of actions (i.e. “end-to-end learning”). In such methods, the model is neither explicitly specified nor separated from the decision-making process. A self-driving car built this way is a black box that goes straight from sensory information to manoeuvres, with no explanation of what will happen next and why. This is problematic in terms of accountability and makes diagnosing problems tricky. In our medical analogy, it’s like having a machine that maps from symptoms to treatments, without any understanding in between.

Probabilistic model-based reinforcement learning methods, such as the PILCO framework (Deisenroth and Rasmussen 2011), are different. A model is specified explicitly, and decisions are made based on what the model predicts will happen next. If the model predicts that an action will lead to a lower expected loss than the alternatives, then the action is taken.

### Overt and explicit assumptions

The above separations are principled because they provide probabilistic models with explicit assumptions that can regularly be re-evaluated, criticised and amended, a process known in the literature as Box’s loop (Blei 2014).

In contrast, neural network models are usually amended by changing their architecture (the width and depth of the model), but it’s unclear how these changes affect the results. Moreover, it’s currently accepted that deep neural networks must be massive to be effective – but this runs up against a basic statistical principle: that a model’s complexity grows with its number of parameters, and that more complex models are more prone to overfitting. Worse, the model’s effectiveness is dependent on the network’s parameter initialisations: it’s not at all clear what assumptions one is putting into the model by changing these initialisations. Because of the strong interplay between the model structures and the algorithms, it’s hard to nail down exactly what is being assumed.

Probabilistic models, on the other hand, have a built-in model evidence that allows us to compare models and assess underlying assumptions in order to see whether amending them makes sense. Since the assumptions are explicit, we can thoroughly examine how they affect the model and modify them as needed. If a bad decision is made, we can ask the decision-maker “why did you decide this?” and ask the probabilistic model “why did you conclude that this would happen?”. Then we are free to adjust the assumptions that are at the foundation of the whole system.
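As a hedged illustration of comparing assumptions through the model evidence, consider two hypothetical models of a sequence of coin flips: one assumes the coin is fair, the other assumes an unknown bias with a uniform prior. Both evidences have closed forms, so the comparison needs nothing beyond the standard library. The function names and the data counts are invented for this example.

```python
import math

def evidence_fair(heads, tails):
    """Probability of the observed sequence if the coin is assumed fair."""
    return 0.5 ** (heads + tails)

def evidence_unknown_bias(heads, tails):
    """Probability of the observed sequence under a uniform prior on the bias.

    Integrating theta^heads * (1 - theta)^tails over [0, 1] gives
    heads! * tails! / (heads + tails + 1)!.
    """
    return (math.factorial(heads) * math.factorial(tails)
            / math.factorial(heads + tails + 1))

for heads, tails in [(9, 1), (5, 5)]:
    ratio = evidence_unknown_bias(heads, tails) / evidence_fair(heads, tails)
    favoured = "unknown bias" if ratio > 1.0 else "fair coin"
    print(f"{heads} heads, {tails} tails: evidence ratio {ratio:.2f}, favours {favoured}")
```

With lopsided data the evidence favours the unknown-bias assumption; with balanced data it favours the simpler fair-coin assumption, even though the more flexible model could also explain the result. The evidence thus gives a quantitative answer to whether amending an assumption is worthwhile.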

Many practitioners in the machine learning community adhere to principled AI. I am particularly grateful for the influence of conversations with N.D. Lawrence, P.J. Diggle, and C.E. Rasmussen, as well as the writings of D.J.C. MacKay and E.T. Jaynes.

### References

Blei, D.M. “Build, compute, critique, repeat: Data analysis with latent variable models.” Annual Review of Statistics and Its Application, 2014.

Deisenroth, M. and Rasmussen, C.E. “PILCO: A model-based and data-efficient approach to policy search.” ICML, 2011.

MacKay, D.J.C. “Information theory, inference and learning algorithms.” Cambridge University Press, 2003.

Wilson, A.C. et al. “The Marginal Value of Adaptive Gradient Methods in Machine Learning.” https://arxiv.org/pdf/1705.08292.pdf, 2017.