Mixture Density Networks in GPflow

A Tutorial

back to our blogs

Mixture Density Networks in GPflow

A Tutorial

Vincent Dutordoir (Machine Learning Researcher)

By Vincent Dutordoir (Machine Learning Researcher)

In keeping with our commitment to open engagement with the AI community, PROWLER.io is keen to develop and share tools that can move artificial intelligence toward more principled approaches. In this, the first in a series of occasional tutorials, we're proud to introduce GPFlow: an open source Gaussian Process library that enables probabilistic modelling – using TensorFlow for its core computations and python as a front end.

GPflow is intuitive, concise and capable of exploiting GPU hardware. It allows faster workflows and prototyping for complex models than TensorFlow alone.  For demonstration purposes, we will use it here to solve a fairly classic, well-understood problem:


Mixture Density Networks in GPflow

In this tutorial, we'll implement a Mixture Density Network (MDN) (Bishop, 1994) in GPflow. This post is very similar to a blogpost from 2015. But instead of directly using TensorFlow, we'll use GPflow, a probabilistic modelling framework built on top of TensorFlow. It is typically used for building Gaussian Process-based models but the framework contains tons of useful methods and classes that can be used to quickly prototype a wide variety of ML algorithms. Excellent for doing research!

This post is organised as follows: we start by giving a quick motivation of why MDNs can be useful; we then go over a GPflow implementation of the model and use it for a couple of toy experiments.

The full source code can be found on github.


Conditional Density Estimation

Imagine for a moment that we're interested in performing regression on the following dataset


At first sight, this dataset doesn't look too scary, both input and output are single-dimensional, and the data has a clear sinusoidal pattern. However, we notice that a single input \( x \) can correspond to multiple output values \( y \). For example, for \( x=0 \) the output can be one of the values \( \{-1.5, -3/4, 0, 0.8, 1.5\} \). Typical regression algorithms such as Linear Regression, Gaussian processes regression, Multilayer Perceptrons (MLPs), etc. would struggle with this, as they predict only one value for every input.

To model a dataset like this, we will use Conditional Density Estimation (CDE) models. CDE models deal with inferring \( p(f(x)|x) \) instead of calculating only the expectation \( E[f(x) | x] \). Modelling the complete distribution \( p(f(x)|x) \) is typically harder, but it reveals much more interesting properties, such as the modes, outlier boundaries and samples. A real-world example where CDE is required is the modelling of taxi drop-offs conditioned on the pick-up location. We would expect a taxi drop-off location to be multi-modal as passengers need to go to different places (e.g. airport/city centre/suburbs) and the density depends on the starting location.


Mixture Density Models

Mixture density networks (MDNs) are a parametric class of models that allow for conditional density estimation. They are built out of 2 parts: a neural net and a Mixture of Gaussians (MoG). The neural net is responsible for producing the characteristics of the MoG. In practice, given that the MoG consists of \( M \) Gaussians, the neural net will output a collection of \( M \) means, variances and weights \( \{\mu_m, \sigma_m^2, \pi_m\}_{m=1}^M \). The produced means, variances and weights are used to define the conditional probability distribution function $$ p(Y = y\,|\,X = x) = \sum_{m=1}^{M} \pi_{m}(x)\,\mathcal{N}\big(y\, \left|\,\mu_{m}(x), \sigma_{m}^2(x)\big)\right. $$ Each of the parameters \( \pi_{m}(x), \mu_{m}(x), \sigma_{m}(x) \) of the distribution will be determined by the neural network, as a function of the input \( x \).

We train the MDN's neural net by optimizing the model's likelihood $$ \mathcal{L} \equiv {argmax} _{\Theta} \prod_{n=1}^N p(Y = y_n\,|\,X = x_n), $$ where \( \Theta \) collects the neural nets weights and biases and \( \{x_n, y_n\}_{n=1}^N \) are our training dataset.


GPflow MDN implementation

GPflow doesn't reinvent the wheel: most of what you will see is just plain Python/TensorFlow code. We choose to use GPflow, however, because it provides us with some functionality to easily define a model. Once we have a GPflow model, we can specify its objective function, parameters, dataset. This extra layer of abstraction makes interacting with the model, such as optimizing or performing inference, much easier.

We begin by importing the required packages from TensorFlow and GPflow:

import tensorflow as tf from scipy.stats import norm
import gpflow from gpflow.models.model import Model from gpflow.params.dataholders import Minibatch, DataHolder from gpflow.params import Parameter, ParamList from gpflow.training import AdamOptimizer, ScipyOptimizer from gpflow.decors import params_as_tensors, autoflow

float_type = gpflow.settings.float_type

We continue by creating an MDN class which inherits from GPflow's model class. We now need to do the following:

  1. Store the feature and target matrices (X, Y) as DataHolders.
  2. Define our model's parameters using GPflow's Parameter and ParamList objects.
  3. Define the objective function. In the _build_likelood function we need to specify our model's objective function. When we optimize the model the negative of this function will be minimised.
class MDN(Model):

def __init__(self, X, Y, inner_dims=[10, 10,], activation=tf.nn.tanh, num_mixtures=5):
    self.Din = X.shape[1]
    self.dims = [self.Din, ] + list(inner_dims) + [3 * num_mixtures]
    self.activation = activation

    self.X = DataHolder(X)
    self.Y = DataHolder(Y)

def _create_network(self):
    Ws, bs = [], []
    for dim_in, dim_out in zip(self.dims[:-1], self.dims[1:]):
        init_xavier_std = (2.0 / (dim_in + dim_out)) ** 0.5
        Ws.append(Parameter(np.random.randn(dim_in, dim_out) * init_xavier_std))

    self.Ws, self.bs = ParamList(Ws), ParamList(bs)

def _eval_network(self, X):
    for i, (W, b) in enumerate(zip(self.Ws, self.bs)):
        X = tf.matmul(X, W) + b
        if i < len(self.bs) - 1:
            X = self.activation(X)

    pis, mus, sigmas = tf.split(X, 3, axis=1)
    pis = tf.nn.softmax(pis)  # make sure they normalize to 1
    sigmas = tf.exp(sigmas)   # make sure std. dev. are positive
    return pis, mus, sigmas

def _build_likelihood(self):
    pis, mus, sigmas = self._eval_network(self.X)
    Z = (2 * np.pi)**0.5 * sigmas
    log_probs_mog = (-0.5 * (mus - self.Y)**2 / sigmas**2) - tf.log(Z) + tf.log(pis)
    log_probs = tf.reduce_logsumexp(log_probs_mog, axis=1)
    return tf.reduce_sum(log_probs)

@autoflow((float_type, [None, None]))
def eval_network(self, X):
    pis, mus, sigmas = self._eval_network(X)
    return pis, mus, sigmas


  • Given that we are dealing with a MoG, the neural net output must comply with the following restrictions: \( \sum_{m=1}^{M} \pi_{m}(x) = 1 \), \( \pi_m \ge 0 \) and \( \sigma_m\ \forall\ m \). We achieve this by applying the softmax operator to the \( \pi \)'s and by taking the exp to the \( \sigma \)'s.
  • We use the "Xavier" initialisation for the neural network's weights. (Glorot and Bengio, 2010).
  • Instead of calculating the pdf of the Gaussians, we work with the log of the pdf and use tf.reduce_logsumexp. This is mainly for numerical stability.


GPflow insides

  • Notice that we store vanilla numpy arrays in Parameter and DataHolder objects. The @params_as_tensors decorator will make sure that these variables are transformed into TensorFlow tensors once we are inside the decorated method.
  • The @autoflow decorator specifies the type and shape of a method's input variables. It will ensure that the graph, constructed inside the decorated method, will be executed inside a TF session and that the output is returned as a np.array when the method is called. This decorator allows one to execute TF graphs without having to worry about managing any tf.session objects or creating any placeholders or other TF related objects.



Let's see how our model works in practice. We will start with modelling the sinusoidal dataset we showed earlier. We do this by initialising a new instance of our MDN model. We further specify the dataset \( (X, Y) \), the number of hidden units of the MDN's neural net and the number of mixture components \( M \).

model = MDN(X, Y, inner_dims=[100, 100], num_mixtures=25)

MDN instances are aware of their objective function. We can therefore directly start with minimising the model. GPflow will make sure that only the variables stored as Parameters will be optimized. For the MDN, the only parameters are the weights and the biases of the neural net; we stored them in ParamLists self.Ws and self.bs.

We are using the ScipyOptimizer, which is a wrapper around scipy's L-BFGS optimization algorithm. GPflow, however, supports all other TensorFlow optimizers, such as Adam, Adagrad, Adadelta, ... as well.

ScipyOptimizer().minimize(model, maxiter=15000)

To evaluate the validity of our model we will draw the posterior density. We also plot \( \mu(x) \) of the optimized neural net. Remember that for every \( x \) the neural net will output, among other, \( M \) means \( \mu_m(x) \). They determine the location of the Gaussians. We plot all \( M \) of those means and use their corresponding mixture weight \( \pi_m(X) \) to determine their size. Larger dots will have more impact in the Gaussian ensemble.

Sinus Posterior

Let's try another dataset

This dataset is known as the two half moons:


The only difference in the MDN's setup is that we reduce the number of mixture components.

model = MDN(X, Y, inner_dims=[100, 100], num_mixtures=5)
ScipyOptimizer().minimize(model, maxiter=10000)

We again evaluate the model by plotting the posterior density

Moon Posterior

A quick glance at other generally useful GPflow functionality

1. Minibatching

Let's wrap up by giving you a look at some of the other things you can do with GPflow. For example, if our dataset grows so large that training the neural net over the complete dataset becomes infeasible, we can simply use Minibatches instead of DataHolders.

class MDN_minibatching(Model):

def __init__(self, X, Y, *args, **kwargs):
    MDN.__init__(X, Y, *args, **kwargs)
    self.X = Minibatch(X, batch_size=1000, seed=0)
    self.Y = Minibatch(Y, batch_size=1000, seed=0)

This minor change can have a huge impact on performance. It will make sure that every time we use self.X and self.Y inside our model's methods, we are actually working with a random subset of 1000 points from our initial dataset.


2. Constrained parameters

Imagine for a second we wish to constrain our neural net biases (or for that matter any another parameter) to always be positive. In plain TensorFlow, imposing such constraints isn't straightforward at all. In GP models, however, we are often faced with this problem. For this reason, GPflow implements a series of Transforms that can be imposed on Parameter objects. For example, declaring a parameter as follows will make sure that its optimized value is positive.

bias_always_postive = Parameter(np.zeros(5), transform=gpflow.transforms.positive)

3. Fixing parameters

Sometimes we don't want to optimize all the model's parameters, or we want to interleave optimising a neural net's weights and biases. GPflow provides functionality to set specific parameters untrainable. For example,

# Only optimize the biases for 1000 iterations
ScipyOptimizer().minimize(model, maxiter=1000)
# Continue, but only optimize the weights now
ScipyOptimizer().minimize(model, maxiter=1000)

4. Printing

Printing all variables of the model can also be a useful feature. Apart from their value, we also get the shape, prior, transform, etc.

                  class prior transform  trainable       shape  fixed_shape  \
MDN/Ws/item0  Parameter  None    (none)       True    (1, 100)         True   
MDN/Ws/item1  Parameter  None    (none)       True  (100, 100)         True   
MDN/Ws/item2  Parameter  None    (none)       True   (100, 15)         True   
MDN/bs/item0  Parameter  None    (none)      False      (100,)         True   
MDN/bs/item1  Parameter  None    (none)      False      (100,)         True   
MDN/bs/item2  Parameter  None    (none)      False       (15,)         True   

MDN/Ws/item0  [[0.22289049473895722, 0.179433371751925, 0.20...  
MDN/Ws/item1  [[-0.004023774440209315, 0.08002442237043737, ...  
MDN/Ws/item2  [[-0.26038974215137717, 0.02974131665807722, 0...  
MDN/bs/item0  [-0.12444686819532082, -0.30115627447772236, -...  
MDN/bs/item1  [-0.014312818190414486, -0.18312268363430753, ...  
MDN/bs/item2  [-0.15871162461316238, 0.025424580905848693, 0...

We hope this gives you a taste of the power and flexibility of GPFlow. If you'd like to explore further, please check out how we use it to implement Natural Gradients or just dive right into the GitHub repositories and complete documentation.