*by Vincent Dutordoir (Machine Learning Researcher)*

*In keeping with our commitment to open engagement with the AI community, PROWLER.io is keen to develop and share tools that can move artificial intelligence toward more principled approaches. In this, the first in a series of occasional tutorials, we're proud to introduce GPFlow: an open source Gaussian Process library that enables probabilistic modelling – using TensorFlow for its core computations and python as a front end.*

*GPflow is intuitive, concise and capable of exploiting GPU hardware. It allows faster workflows and prototyping for complex models than TensorFlow alone. For demonstration purposes, we will use it here to solve a fairly classic, well-understood problem:*

# Mixture Density Networks in GPflow

In this tutorial, we'll implement a Mixture Density Network (MDN) (Bishop, 1994) in GPflow. This post is very similar to a blogpost from 2015. But instead of directly using TensorFlow, we'll use GPflow, a probabilistic modelling framework built on top of TensorFlow. It is typically used for building Gaussian Process-based models but the framework contains tons of useful methods and classes that can be used to quickly prototype a wide variety of ML algorithms. Excellent for doing research!

This post is organised as follows: we start by giving a quick motivation of why MDNs can be useful; we then go over a GPflow implementation of the model and use it for a couple of toy experiments.

The full source code can be found on github.

## Conditional Density Estimation

Imagine for a moment that we're interested in performing regression on the following dataset

At first sight, this dataset doesn't look too scary, both input and output are single-dimensional, and the data has a clear sinusoidal pattern. However, we notice that a single input \(\mathbf x \) can correspond to multiple output values \(\mathbf y \). For example, for \(\mathbf x=0 \) the output can be one of the values \(\mathbf \{-1.5, -3/4, 0, 0.8, 1.5\} \). Typical regression algorithms such as Linear Regression, Gaussian processes regression, Multilayer Perceptrons (MLPs), etc. would struggle with this, as they predict only one value for every input.

To model a dataset like this, we will use Conditional Density Estimation (CDE) models. CDE models deal with inferring \(\mathbf p(f(x)|x) \) instead of calculating only the expectation \(\mathbf E[f(x) | x] \). Modelling the complete distribution \(\mathbf p(f(x)|x) \) is typically harder, but it reveals much more interesting properties, such as the modes, outlier boundaries and samples. A real-world example where CDE is required is the modelling of taxi drop-offs conditioned on the pick-up location. We would expect a taxi drop-off location to be multi-modal as passengers need to go to different places (e.g. airport/city centre/suburbs) and the density depends on the starting location.

## Mixture Density Models

Mixture density networks (MDNs) are a parametric class of models that allow for conditional density estimation. They are built out of 2 parts: a neural net and a Mixture of Gaussians (MoG). The neural net is responsible for producing the characteristics of the MoG. In practice, given that the MoG consists of \(\mathbf M \) Gaussians, the neural net will output a collection of \(\mathbf M \) means, variances and weights \(\mathbf \{\mu_m, \sigma_m^2, \pi_m\}_{m=1}^M \). The produced means, variances and weights are used to define the conditional probability distribution function

$$ p(Y = y\,|\,X = x) = \sum_{m=1}^{M} \pi_{m}(x)\,\mathcal{N}\big(y\, \left|\,\mu_{m}(x), \sigma_{m}^2(x)\big)\right. $$

Each of the parameters \(\mathbf \pi_{m}(x), \mu_{m}(x), \sigma_{m}(x) \) of the distribution will be determined by the neural network, as a function of the input \(\mathbf x \).

We train the MDN's neural net by optimizing the model's likelihood

$$ \mathcal{L} \equiv {argmax} _{\Theta} \prod_{n=1}^N p(Y = y_n\,|\,X = x_n), $$

where \(\mathbf \Theta \) collects the neural nets weights and biases and \(\mathbf \{x_n, y_n\}_{n=1}^N \) are our training dataset.

## GPflow MDN implementation

GPflow doesn't reinvent the wheel: most of what you will see is just plain Python/TensorFlow code. We choose to use GPflow, however, because it provides us with some functionality to easily define a model. Once we have a GPflow model, we can specify its objective function, parameters, dataset. This extra layer of abstraction makes interacting with the model, such as optimizing or performing inference, much easier.

We begin by importing the required packages from TensorFlow and GPflow:

```
import tensorflow as tf
from scipy.stats import norm
```

```
import gpflow
from gpflow.models.model import Model
from gpflow.params.dataholders import Minibatch, DataHolder
from gpflow.params import Parameter, ParamList
from gpflow.training import AdamOptimizer, ScipyOptimizer
from gpflow.decors import params_as_tensors, autoflow
float_type = gpflow.settings.float_type
```

We continue by creating an `MDN`

class which inherits from GPflow's `model`

class. We now need to do the following:

- Store the feature and target matrices (X, Y) as
`DataHolders`

. - Define our model's parameters using GPflow's
`Parameter`

and`ParamList`

objects. - Define the objective function. In the
`_build_likelood`

function we need to specify our model's objective function. When we optimize the model the negative of this function will be minimised.

```
class MDN(Model):
def __init__(self, X, Y, inner_dims=[10, 10,], activation=tf.nn.tanh, num_mixtures=5):
Model.__init__(self)
self.Din = X.shape[1]
self.dims = [self.Din, ] + list(inner_dims) + [3 * num_mixtures]
self.activation = activation
self.X = DataHolder(X)
self.Y = DataHolder(Y)
self._create_network()
def _create_network(self):
Ws, bs = [], []
for dim_in, dim_out in zip(self.dims[:-1], self.dims[1:]):
init_xavier_std = (2.0 / (dim_in + dim_out)) ** 0.5
Ws.append(Parameter(np.random.randn(dim_in, dim_out) * init_xavier_std))
bs.append(Parameter(np.zeros(dim_out)))
self.Ws, self.bs = ParamList(Ws), ParamList(bs)
@params_as_tensors
def _eval_network(self, X):
for i, (W, b) in enumerate(zip(self.Ws, self.bs)):
X = tf.matmul(X, W) + b
if i < len(self.bs) - 1:
X = self.activation(X)
pis, mus, sigmas = tf.split(X, 3, axis=1)
pis = tf.nn.softmax(pis) # make sure they normalize to 1
sigmas = tf.exp(sigmas) # make sure std. dev. are positive
return pis, mus, sigmas
@params_as_tensors
def _build_likelihood(self):
pis, mus, sigmas = self._eval_network(self.X)
Z = (2 * np.pi)**0.5 * sigmas
log_probs_mog = (-0.5 * (mus - self.Y)**2 / sigmas**2) - tf.log(Z) + tf.log(pis)
log_probs = tf.reduce_logsumexp(log_probs_mog, axis=1)
return tf.reduce_sum(log_probs)
@autoflow((float_type, [None, None]))
def eval_network(self, X):
pis, mus, sigmas = self._eval_network(X)
return pis, mus, sigmas
```

#### Notes

- Given that we are dealing with a MoG, the neural net output must comply with the following restrictions: \(\mathbf \sum_{m=1}^{M} \pi_{m}(x) = 1 \), \(\mathbf \pi_m \ge 0 \) and \(\mathbf \sigma_m\ \forall\ m \). We achieve this by applying the
`softmax`

operator to the \(\mathbf \pi \)'s and by taking the`exp`

to the \(\mathbf \sigma \)'s. - We use the "Xavier" initialisation for the neural network's weights. (Glorot and Bengio, 2010).
- Instead of calculating the pdf of the Gaussians, we work with the
`log`

of the pdf and use`tf.reduce_logsumexp`

. This is mainly for numerical stability.

#### GPflow insides

- Notice that we store vanilla numpy arrays in
`Parameter`

and`DataHolder`

objects. The`@params_as_tensors`

decorator will make sure that these variables are transformed into TensorFlow tensors once we are inside the decorated method. - The
`@autoflow`

decorator specifies the type and shape of a method's input variables. It will ensure that the graph, constructed inside the decorated method, will be executed inside a TF session and that the output is returned as a`np.array`

when the method is called. This decorator allows one to execute TF graphs without having to worry about managing any`tf.session`

objects or creating any placeholders or other TF related objects.

## Experiments

Let's see how our model works in practice. We will start with modelling the sinusoidal dataset we showed earlier. We do this by initialising a new instance of our MDN model. We further specify the dataset \(\mathbf (X, Y) \), the number of hidden units of the MDN's neural net and the number of mixture components \(\mathbf M \).

`model = MDN(X, Y, inner_dims=[100, 100], num_mixtures=25)`

MDN instances are aware of their objective function. We can therefore directly start with minimising the model. GPflow will make sure that only the variables stored as `Parameters`

will be optimized. For the MDN, the only parameters are the weights and the biases of the neural net; we stored them in `ParamLists`

`self.Ws`

and `self.bs`

.

We are using the `ScipyOptimizer`

, which is a wrapper around scipy's L-BFGS optimization algorithm. GPflow, however, supports all other TensorFlow optimizers, such as `Adam`

, `Adagrad`

, `Adadelta`

, ... as well.

`ScipyOptimizer().minimize(model, maxiter=15000)`

To evaluate the validity of our model we will draw the posterior density. We also plot \(\mathbf \mu(x) \) of the optimized neural net. Remember that for every \(\mathbf x \) the neural net will output, among other, \(\mathbf M \) means \(\mathbf \mu_m(x) \). They determine the location of the Gaussians. We plot all \(\mathbf M \) of those means and use their corresponding mixture weight \(\mathbf \pi_m(X) \) to determine their size. Larger dots will have more impact in the Gaussian ensemble.

#### Let's try another dataset

This dataset is known as the two half moons:

The only difference in the MDN's setup is that we reduce the number of mixture components.

```
model = MDN(X, Y, inner_dims=[100, 100], num_mixtures=5)
ScipyOptimizer().minimize(model, maxiter=10000)
```

We again evaluate the model by plotting the posterior density

## A quick glance at other generally useful GPflow functionality

#### 1. Minibatching

Let's wrap up by giving you a look at some of the other things you can do with GPflow. For example, if our dataset grows so large that training the neural net over the complete dataset becomes infeasible, we can simply use `Minibatches`

instead of `DataHolders`

.

```
class MDN_minibatching(Model):
def __init__(self, X, Y, *args, **kwargs):
MDN.__init__(X, Y, *args, **kwargs)
self.X = Minibatch(X, batch_size=1000, seed=0)
self.Y = Minibatch(Y, batch_size=1000, seed=0)
```

This minor change can have a huge impact on performance. It will make sure that every time we use `self.X`

and `self.Y`

inside our model's methods, we are actually working with a random subset of `1000`

points from our initial dataset.

#### 2. Constrained parameters

Imagine for a second we wish to constrain our neural net biases (or for that matter any another parameter) to always be positive. In plain TensorFlow, imposing such constraints isn't straightforward at all. In GP models, however, we are often faced with this problem. For this reason, GPflow implements a series of `Transforms`

that can be imposed on `Parameter`

objects. For example, declaring a parameter as follows will make sure that its optimized value is positive.

`bias_always_postive = Parameter(np.zeros(5), transform=gpflow.transforms.positive)`

#### 3. Fixing parameters

Sometimes we don't want to optimize all the model's parameters, or we want to interleave optimising a neural net's weights and biases. GPflow provides functionality to set specific parameters untrainable. For example,

```
# Only optimize the biases for 1000 iterations
model.Ws.set_trainable(False)
model.bs.set_trainable(True)
ScipyOptimizer().minimize(model, maxiter=1000)
# Continue, but only optimize the weights now
model.Ws.set_trainable(True)
model.bs.set_trainable(False)
ScipyOptimizer().minimize(model, maxiter=1000)
```

#### 4. Printing

Printing all variables of the model can also be a useful feature. Apart from their value, we also get the shape, prior, transform, etc.

`print(model)`

*We hope this gives you a taste of the power and flexibility of GPFlow. If you'd like to explore further, please check out how we use it to implement Natural Gradients or just dive right into the GitHub **repositories and complete documentation.*