Optimising models

In my previous blog, I discussed how to use an intuitive method, Leave One Out Cross Validation (LOOCV), for determining which parameters work best for a given machine learning algorithm. When I wrote that blog, I was surprised to find that it did not seem to work very well for finding an optimum fit. Something I have learnt over many years of working with numerical analysis is that if the results are surprising, it is well worth checking that they are correct! I’ll now revisit the LOOCV technique and present a much more satisfying outcome. Recall that in this method we train on all but one data point and then compare the trained model’s prediction for the stress with the observed stress at the point left out. We repeat this, leaving out every data point in turn, and optimise by minimising the total deviation between predictions and observations. If we vary the length and variance hyperparameters in the GP, we obtain a contour map of this goodness of fit, which enables us to identify the best parameters. This approach to optimisation, iterating over a regularly spaced grid of length and variance values, is not very efficient, but it serves to illustrate the process.

The dark red patch in the left figure shows where the fit is best and the figure on the right shows the corresponding fit to the data, including the confidence interval, shaded in grey.
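To make the grid search concrete, here is a minimal sketch in Python. It assumes a scikit-learn GP with an RBF kernel whose hyperparameters are held fixed at each grid point; the stand-in data and the grids are hypothetical, not the actual stress/extension measurements.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.model_selection import LeaveOneOut

# Hypothetical stand-in data for the extension (x) and stress (y) measurements.
x = np.linspace(0.0, 10.0, 20).reshape(-1, 1)
y = np.sin(x).ravel()

lengths = np.linspace(0.5, 5.0, 10)     # grid of length-scale values
variances = np.linspace(0.5, 5.0, 10)   # grid of signal-variance values
loo = LeaveOneOut()

scores = np.zeros((len(lengths), len(variances)))
for i, ell in enumerate(lengths):
    for j, var in enumerate(variances):
        # Fix the hyperparameters at this grid point (no internal optimisation).
        kernel = ConstantKernel(var, constant_value_bounds="fixed") * \
                 RBF(ell, length_scale_bounds="fixed")
        sq_err = 0.0
        for train_idx, test_idx in loo.split(x):
            gp = GaussianProcessRegressor(kernel=kernel)
            gp.fit(x[train_idx], y[train_idx])
            pred = gp.predict(x[test_idx])
            sq_err += (pred[0] - y[test_idx][0]) ** 2
        scores[i, j] = sq_err / len(x)   # mean squared LOO deviation

best_i, best_j = np.unravel_index(np.argmin(scores), scores.shape)
print("best length:", lengths[best_i], "best variance:", variances[best_j])
```

The `scores` array is exactly what gets plotted as a contour map: the smallest LOO deviation marks the preferred hyperparameters.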

An alternative way to determine how good a model is uses the log likelihood, based solely on the probability of the predictions. This gives rise to two competing terms: one measures the model complexity, the other the quality of the data fit. Below is the map of this likelihood as I vary the length scale and variance in my Gaussian process for my stress/extension data.

Although the contour plot looks different, the maximum likelihood is located in a similar, but not identical, place. Since the two methods are statistically different ways of determining the best fit, it is not too surprising that the exact results differ; even so, we would find it difficult to distinguish between the predictions of the two validation techniques.

For comparison, the optimised hyperparameters are:

Hyperparameter    LOOCV    Log likelihood
length            2        1.8
variance          2.4      2.7
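To make the log-likelihood approach concrete, here is a minimal sketch that evaluates the GP log marginal likelihood directly from its standard closed form over a grid of length and variance values. The squared-exponential kernel, the noise level and the stand-in data are assumptions for illustration, not the actual stress/extension data; the data-fit and complexity terms mentioned above appear explicitly.

```python
import numpy as np

def sq_exp_kernel(x1, x2, length, variance):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length ** 2)

def log_marginal_likelihood(x, y, length, variance, noise=1e-2):
    """Standard GP log marginal likelihood: data-fit term + complexity term + constant."""
    K = sq_exp_kernel(x, x, length, variance) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y via Cholesky
    data_fit = -0.5 * y @ alpha                            # rewards fitting the data
    complexity = -np.sum(np.log(np.diag(L)))               # -0.5 log|K|, penalises complexity
    constant = -0.5 * len(x) * np.log(2.0 * np.pi)
    return data_fit + complexity + constant

# Hypothetical stand-in data; sweep a grid as in the contour map above.
x = np.linspace(0.0, 10.0, 20)
y = np.sin(x)
grid = [(l, v) for l in np.linspace(0.5, 5.0, 10) for v in np.linspace(0.5, 5.0, 10)]
best = max(grid, key=lambda lv: log_marginal_likelihood(x, y, *lv))
print("best (length, variance):", best)
```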

Machine learning: preparing to go underneath the hood

In my previous blog, I showed machine learning predictions for the stress/extension data. In my next blog, I’ll start exploring what is happening within the algorithm. The particular machine learning algorithm that I’m using is known as a Gaussian process, or GP for short from now on. Rasmussen, in his freely available book, argues that there is a reasonably close relationship between GPs and many of the other approaches, including neural networks. I believe this is not a universally accepted viewpoint, but I suspect that for those of us who are likely to remain just users of machine learning, the arguments might be rather too technical to follow! Either way, I find GPs to be one of the more accessible routes into machine learning.

Today, though, I just want to cover some background on linear regression, since this lets me introduce some of the language and terms that are just as important in machine learning. A useful starting point is to remind ourselves of the Gaussian distribution and how it is used, often without us thinking about it, to determine the line of best fit to data in linear regression, the simplest form of machine learning and one used well before the term became commonplace:

$$y = mx + c$$

If we believe that this equation describes our data, and our measurements are free from experimental noise, then our data would fit perfectly onto the straight line. Of course, all measurements have some noise, which means that our belief is now that the actual measurement, let’s call it yobs, if we measured it repeatedly at the same value of x, would have a distribution of values with a mean of y = mx + c and some spread about that mean. The most common distribution is the Gaussian, which says that the probability that we observe a particular value, yobs, is given by

$$p(y_{\mathrm{obs}}) \propto \exp\!\left(-\frac{\left(y_{\mathrm{obs}} - (mx + c)\right)^2}{2\sigma^2}\right)$$

One attraction of the Gaussian distribution is that it is characterised by just one additional parameter, the variance σ².

You might find it helpful to see what a Gaussian looks like:

[Figure: a Gaussian curve, peaked at the mean y = mx + c, with its width set by the variance]

So we see that the probability is highest at the expected value of y = mx + c, as required. The width of the peak is determined by the variance: the greater the variance, the wider the peak and the more likely we are to observe data further away from the mean. Since p(yobs) is a probability, we also need a prefactor that ensures that the probability of observing any possible value is 1. This leads to

$$p(y_{\mathrm{obs}}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\left(y_{\mathrm{obs}} - (mx + c)\right)^2}{2\sigma^2}\right)$$
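If you want to check that the prefactor really does normalise the distribution, here is a tiny sketch that evaluates the Gaussian numerically and confirms it integrates to (approximately) one; the mean and variance values are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_pdf(y, mean, variance):
    """Normalised Gaussian probability density."""
    return np.exp(-(y - mean) ** 2 / (2.0 * variance)) / np.sqrt(2.0 * np.pi * variance)

mean, variance = 2.0, 0.5                      # arbitrary stand-ins for mx + c and sigma^2
y = np.linspace(mean - 10.0, mean + 10.0, 10001)
print(np.trapz(gaussian_pdf(y, mean, variance), y))   # ~1.0
```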

Since we have a probability of observing a particular yobs at each x at which we take a measurement, we need to introduce a joint probability, which is just the probability of observing y1 at x1, y2 at x2 and so on. We usually assume that the noise that affects the measurement at one point is not related, or is uncorrelated, to the noise at another point. The joint probability that two or more independent events occur is the product of the probability of each individual event, so that

$$p(y_1, y_2, \ldots, y_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\left(y_i - (mx_i + c)\right)^2}{2\sigma^2}\right)$$

Now that we have specified a belief about how our data behave and encoded it in a probability distribution function, we need to find the line of best fit, which means we need a measure of goodness of fit. The most common measure is the maximum likelihood estimate (MLE). I’ll briefly discuss this in the context of linear regression but, helpfully, it is also widely used in machine learning to optimise a model against training data. The MLE corresponds to the parameters, which are just m and c for linear regression, that maximise what is called the log-likelihood; in other words, we search for the m and c that maximise the logarithm of the joint probability. Why maximise the logarithm of the joint probability rather than the joint probability itself? The simplest answer is that taking the logarithm simplifies the mathematics enormously: firstly, the logarithm of a product is the sum of the logarithms, and secondly, the logarithm of an exponential is just whatever is inside the exponential. In mathematical terms, the log-likelihood for a joint Gaussian probability distribution of independent events is given by

$$\log p(y_1, \ldots, y_N) = -\frac{N}{2}\log\!\left(2\pi\sigma^2\right) - \sum_{i=1}^{N} \frac{\left(y_i - (mx_i + c)\right)^2}{2\sigma^2}$$

So now we just need to maximise this with respect to the parameters m and c, which we can do using calculus. I won’t go any further into the mathematics; there are plenty of resources online that describe the process in detail.
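If you would like to see the idea in action without the calculus, here is a minimal sketch that finds m and c by minimising the negative log-likelihood numerically and compares the result with an ordinary least-squares fit. The synthetic data and the fixed noise variance are assumptions made purely for illustration; for Gaussian noise the two answers should coincide.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic noisy data around a known straight line.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y_obs = 1.5 * x + 0.7 + rng.normal(0.0, 1.0, size=x.size)

def neg_log_likelihood(params, sigma2=1.0):
    """Negative of the joint Gaussian log-likelihood, with sigma^2 assumed known."""
    m, c = params
    resid = y_obs - (m * x + c)
    return 0.5 * np.sum(resid ** 2) / sigma2 + 0.5 * x.size * np.log(2.0 * np.pi * sigma2)

mle = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print("MLE (m, c):        ", mle.x)
print("least squares (m, c):", np.polyfit(x, y_obs, 1))   # same answer for Gaussian noise
```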

Hopefully that has set the scene for my next blog when I’ll explore GPs.