Optimising models

In my previous blog, I discussed how to use an intuitive method, Leave One Out Cross Validation (LOOCV), for determining which parameters work best for a given machine learning algorithm. When I wrote that blog, I was surprised to find that it did not seem to work very well for finding an optimum fit. Something I have learnt over many years of working with numerical analysis is that if the results are surprising, it is well worth checking that they are correct! I’ll now revisit the LOOCV technique and present a much more satisfying outcome.

Recall that in this method we train on all but one data point and then compare the trained model’s prediction of the stress with the observed stress at the point left out. We repeat the process, leaving out every data point in turn, and optimise by minimising the accumulated deviation between predictions and observations. If we vary the length and variance hyperparameters of the Gaussian process (GP), we obtain a contour map of the cross-validation score, which enables us to identify the best parameters. This approach to optimisation, iterating over a regularly spaced grid of length and variance values, is not very efficient, but it serves to illustrate the process.
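As a minimal sketch of this grid search, the code below assumes a zero-mean GP with a squared-exponential kernel (consistent with the length and variance hyperparameters discussed here); the data, grid ranges, and noise term are illustrative stand-ins, not the blog's actual stress/extension dataset:

```python
import numpy as np

def rbf_kernel(x1, x2, length, variance):
    """Squared-exponential covariance between two 1-D input arrays."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length) ** 2)

def loocv_error(x, y, length, variance, noise=1e-4):
    """Sum of squared leave-one-out prediction errors for the GP."""
    total = 0.0
    for i in range(len(x)):
        keep = np.delete(np.arange(len(x)), i)      # train on all but point i
        K = rbf_kernel(x[keep], x[keep], length, variance) + noise * np.eye(len(keep))
        k_star = rbf_kernel(x[i:i + 1], x[keep], length, variance)
        mu = k_star @ np.linalg.solve(K, y[keep])   # GP posterior mean at x[i]
        total += (y[i] - mu[0]) ** 2                # deviation from observation
    return total

# Illustrative stand-in for the stress/extension data.
x = np.linspace(0.0, 10.0, 20)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=x.size)

# Regularly spaced grid over the two hyperparameters, as in the contour map.
lengths = np.linspace(0.5, 4.0, 30)
variances = np.linspace(0.5, 4.0, 30)
errors = np.array([[loocv_error(x, y, l, v) for v in variances] for l in lengths])
i, j = np.unravel_index(errors.argmin(), errors.shape)
print(f"best length = {lengths[i]:.2f}, best variance = {variances[j]:.2f}")
```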

The dark red patch in the left-hand figure shows where the fit is best, and the right-hand figure shows the corresponding fit to the data, with the confidence interval shaded in grey.

An alternative way to assess how good a model is uses the log likelihood, based solely on the probability of the predictions. This gives rise to two competing terms: one measures the model complexity, the other the quality of the data fit.
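For a zero-mean GP with covariance matrix $K_\theta$ built from the length and variance hyperparameters, this is the standard log marginal likelihood (the notation below is mine, not taken from the original post):

```latex
\log p(\mathbf{y} \mid X, \theta)
  = \underbrace{-\tfrac{1}{2}\,\mathbf{y}^{\top} K_\theta^{-1}\,\mathbf{y}}_{\text{data fit}}
    \;\underbrace{-\,\tfrac{1}{2}\log \lvert K_\theta \rvert}_{\text{complexity penalty}}
    \;-\; \tfrac{n}{2}\log 2\pi
```

where $n$ is the number of data points; short length scales and large variances are penalised through the determinant term, while the quadratic term rewards close agreement with the data. Below is the map of likelihood as I vary the length scale and variance in my Gaussian process for my stress/extension data.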

Although the contour plot looks different, the maximum likelihood is located in a similar, though not identical, place. Since the two methods are statistically different ways of determining the best fit, it is not too surprising that the exact results differ; in practice, however, we would find it difficult to distinguish between the predictions of the two validation techniques.

For comparison, the optimised hyperparameters are:

Hyperparameter    LOOCV    Log likelihood
length            2        1.8
variance          2.4      2.7
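As a hedged sketch of how the likelihood surface can be computed, the function below evaluates the log marginal likelihood over the same illustrative data and grid as the earlier LOOCV sketch (a Cholesky factorisation is used for numerical stability). The values in the table above come from the real stress/extension data, which this stand-in does not reproduce:

```python
import numpy as np

def log_marginal_likelihood(x, y, length, variance, noise=1e-4):
    """Log marginal likelihood of a zero-mean GP with an RBF kernel."""
    d = x[:, None] - x[None, :]
    K = variance * np.exp(-0.5 * (d / length) ** 2) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    return (-0.5 * y @ alpha                    # data-fit term
            - np.log(np.diag(L)).sum()          # complexity term, 0.5 * log|K|
            - 0.5 * len(x) * np.log(2 * np.pi)) # normalising constant

# Same illustrative data and hyperparameter grid as the LOOCV sketch.
x = np.linspace(0.0, 10.0, 20)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=x.size)
grid = [(l, v) for l in np.linspace(0.5, 4.0, 30)
               for v in np.linspace(0.5, 4.0, 30)]
best_l, best_v = max(grid, key=lambda p: log_marginal_likelihood(x, y, *p))
print(f"max-likelihood length = {best_l:.2f}, variance = {best_v:.2f}")
```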