If we are going to apply machine learning to science, then clearly we need a way of quantifying how good we think our predictions are. Even before we reach that stage, though, we need a way to decide which model is best and, for any given model, which settings, or hyper-parameters, provide the most believable predictions. This was illustrated in my previous blog by the figure showing how the predicted stress/extension relation varies as we vary the length-scale hyper-parameter. For every length scale the curve passes exactly through the data, so how do we decide which is best?

There are a number of approaches to this; I’ll discuss just two, which are quite different and provide a nice reminder that, as powerful as machine learning is, it has quirks. It is worth noting that these validation methods apply regardless of the particular form of machine learning, so they are as valuable for checking neural network predictions as they are for Gaussian processes.

First of all, in this blog I’ll discuss the conceptually simpler of the two: cross-validation. In essence, we split the data into two sets: one is used to train the algorithm, the other to measure how good its predictions are. Since there are many different ways of splitting a data set into a training set and a validation set, I’ll discuss what is known as Leave One Out cross-validation (LOOCV). This is a dry but very descriptive name! All the data points but one are used to train the algorithm, and the trained algorithm is then asked to predict the output at the point that was left out; the closeness of the prediction to the observation is a measure of how good the fit is. This is repeated so that every data point is left out in turn, and the overall measure of fit is simply the average over all of the LOOCV attempts.
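As a concrete sketch of the procedure, here is roughly what LOOCV might look like for a Gaussian process with a squared-exponential kernel. The function names (`rbf_kernel`, `loocv_score`), the choice of mean squared error as the score, and the jitter term are my own illustrative choices, not code from this series:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def loocv_score(x, y, length_scale, variance, noise=1e-6):
    """Mean squared LOOCV error for GP regression with given hyper-parameters."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i          # leave point i out
        K = rbf_kernel(x[mask], x[mask], length_scale, variance)
        K += noise * np.eye(mask.sum())        # small jitter for stability
        k_star = rbf_kernel(x[[i]], x[mask], length_scale, variance)
        mean = k_star @ np.linalg.solve(K, y[mask])  # posterior mean at x[i]
        errors.append((y[i] - mean[0]) ** 2)
    return np.mean(errors)
```

A lower score means the left-out points were predicted more closely; one could equally well use the log predictive likelihood of each left-out point instead of the squared error.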

We can use this to help guide the choice of hyper-parameters by repeating the process for a range of the length scales and variances introduced in a previous blog. We then look for the pair of values which gives the best averaged match …

*Since I wrote what comes next, I have discovered a bug in my code. My next blog will correct this, but since this is a blog and not an article for peer-review, I will preserve my mistakes below for posterity!*

… , which sounds simple, but gives rise to two challenges.

The first is that searching for the best match means finding a maximum on a two-dimensional surface (or a higher-dimensional one as we add more hyper-parameters). Finding a maximum on a surface that is likely to be quite complex is difficult! It is easy to find a local maximum, but much more challenging to find the global one. This problem crops up in a multitude of numerical calculations and has been known about for a long time, so although it isn’t easy, there are lots of established approaches to help.
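The crudest of those established approaches is an exhaustive grid search: evaluate the score on a grid of hyper-parameter pairs and keep the best. Here is a minimal sketch; the toy score function (with a single interior peak) is purely hypothetical, standing in for a real averaged LOOCV score:

```python
import numpy as np

def grid_search(score_fn, length_scales, variances):
    """Evaluate score_fn on every (length_scale, variance) pair; return the best."""
    best = None
    for ls in length_scales:
        for var in variances:
            s = score_fn(ls, var)
            if best is None or s > best[0]:
                best = (s, ls, var)
    return best

# Hypothetical score with one interior maximum, for illustration only
score = lambda ls, var: -(np.log(ls) - 0.5) ** 2 - np.log(var) ** 2

best_score, best_ls, best_var = grid_search(
    score,
    length_scales=np.logspace(-2, 2, 41),
    variances=np.logspace(-2, 2, 41),
)
```

A grid search is guaranteed to find the best point on the grid, but the cost grows exponentially with the number of hyper-parameters, which is why gradient-based or global optimisers are usually preferred in practice.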

The second challenge is best illustrated in a contour plot of the log likelihood for my stress-extension data as I vary the two hyper-parameters:

The colour scale is such that bright yellow represents the highest values whilst dark blue represents the lowest. What you might be able to see is that there is no maximum! Even when I increase the length scale to much higher values, the log likelihood continues to increase, so LOOCV predicts that an infinite length scale is optimal. Apparently this is a common problem with the cross-validation approach, although I have not yet found an explanation of why. In the next blog, I’ll discuss a different approach to finding the optimum values for the hyper-parameters, which is less intuitive but appears to be more robust.