In my previous blog, I showed machine learning predictions for the stress/extension data. In my next blog, I’ll start exploring what is happening within the algorithm. The particular machine learning algorithm that I’m using is known as a Gaussian process, or GP for short. Rasmussen, in his freely available book, discusses what he claims are reasonably close relationships between GPs and many other approaches, including neural networks. I believe that this is not a universally accepted viewpoint, but I suspect that for those of us who are likely to remain just users of machine learning, the arguments might be rather too technical to follow! Either way, I find GPs to be one of the more accessible routes into machine learning.

Today, though, I just want to cover some background about linear regression, since this lets me introduce some of the language and terms that are just as important in machine learning as in linear regression. A useful starting point is to remind ourselves about the Gaussian distribution and how it is used, often without us thinking about it, to determine the line of best fit to data using linear regression, the simplest form of machine learning and one that was in use well before the term became commonplace. The straight line is described by

$$y = mx + c$$

If we believe that this equation describes our data, and our measurements are free from experimental noise, then our data would fit perfectly onto the straight line. Of course, all measurements have some noise. Our belief then becomes that the actual measurement, let’s call it *y*_{obs}, if we measure it repeatedly at the same value of *x*, will have a mean of *y* = *mx* + *c* and some distribution about that mean. The most common choice of distribution is the Gaussian, which says that the probability that we observe a particular value, *y*_{obs}, is proportional to

$$p(y_{\mathrm{obs}}) \propto \exp\left[-\frac{(y_{\mathrm{obs}} - mx - c)^{2}}{2\sigma^{2}}\right]$$

One attraction of the Gaussian distribution is that it is characterised by just one additional parameter, the variance *σ*^{2}.

You might find it helpful to see what a Gaussian looks like:

[Figure: a Gaussian distribution for *y*_{obs}, peaked at *y* = *mx* + *c*.]

So we see that the probability is highest at the expected value of *y* = *mx* + *c*, as required. The width of the peak is determined by the variance: the greater the variance, the wider the peak and the more likely we are to observe data further away from the mean. Since *p*(*y*_{obs}) is a probability, we also need a prefactor that ensures that the total probability of observing any possible value is 1. This leads to

$$p(y_{\mathrm{obs}}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left[-\frac{(y_{\mathrm{obs}} - mx - c)^{2}}{2\sigma^{2}}\right]$$
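As a rough numerical check of these statements, here is a minimal Python sketch of a normalised Gaussian. The function name `gaussian_pdf` and the values of *m*, *c*, *x* and the variance are made up purely for illustration.

```python
import math


def gaussian_pdf(y_obs, mean, variance):
    """Normalised Gaussian probability density for an observation y_obs."""
    prefactor = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return prefactor * math.exp(-((y_obs - mean) ** 2) / (2.0 * variance))


# Hypothetical straight-line parameters and measurement position.
m, c, x = 2.0, 1.0, 3.0
mean = m * x + c  # expected value of y at this x

# The density is highest at the mean and falls away either side of it.
peak = gaussian_pdf(mean, mean, variance=1.0)
off_peak = gaussian_pdf(mean + 1.0, mean, variance=1.0)

# A larger variance gives a lower, wider peak.
wide_peak = gaussian_pdf(mean, mean, variance=4.0)

# Crude check that the prefactor makes the total probability equal to 1
# (rectangle rule over the mean plus or minus 10 standard deviations).
step = 0.001
area = sum(
    gaussian_pdf(mean - 10.0 + i * step, mean, 1.0) * step for i in range(20001)
)

print(peak, off_peak, wide_peak, area)
```

The printed area should come out very close to 1, which is exactly what the prefactor is there to guarantee.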

Since we have a probability of observing a particular *y*_{obs} at each *x* at which we take a measurement, we need to introduce a *joint probability*, which is just the probability of observing *y*_{1} at *x*_{1}, *y*_{2} at *x*_{2} and so on. We usually assume that the noise that affects the measurement at one point is not related, or is *uncorrelated*, to the noise at another point. The joint probability that two or more independent events occur is the product of the probability of each individual event, so that for *N* measurements

$$p(y_{1}, y_{2}, \ldots, y_{N}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left[-\frac{(y_{i} - mx_{i} - c)^{2}}{2\sigma^{2}}\right]$$
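To make the product concrete, here is a small sketch that multiplies the individual Gaussian probabilities together for a handful of points. The data values, the variance and the function names are invented for illustration only, and `math.prod` needs Python 3.8 or later.

```python
import math


def gaussian_pdf(y_obs, mean, variance):
    """Normalised Gaussian probability density for an observation y_obs."""
    prefactor = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return prefactor * math.exp(-((y_obs - mean) ** 2) / (2.0 * variance))


def joint_probability(xs, ys, m, c, variance):
    """Product of the individual densities, assuming uncorrelated noise."""
    return math.prod(gaussian_pdf(y, m * x + c, variance) for x, y in zip(xs, ys))


# Made-up measurements scattered about the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

# A line close to the data gives a far larger joint probability
# than a line with the wrong slope.
good = joint_probability(xs, ys, m=2.0, c=1.0, variance=0.04)
poor = joint_probability(xs, ys, m=1.0, c=1.0, variance=0.04)
print(good, poor)
```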

Now that we have specified a belief about how our data behaves and encoded it within a *probability distribution function*, we need to find the line of best fit, which means we need a measure of goodness of fit. The most common approach is *maximum likelihood estimation* (MLE). I’ll briefly discuss this in the context of linear regression, but, helpfully, it is also widely used in machine learning to optimise the model based on training data. The maximum likelihood estimate corresponds to the parameters, which are just *m* and *c* for linear regression, that maximise what is called the log-likelihood. In other words, we search for the *m* and *c* that maximise the logarithm of the joint probability. Why maximise the logarithm of the joint probability and not just the joint probability? The simplest answer is that taking the logarithm simplifies the mathematics enormously, and because the logarithm always increases with its argument, the same *m* and *c* maximise both. Firstly, the logarithm of a product is the sum of the logarithms and, secondly, the logarithm of an exponential is just whatever is inside the exponential. In mathematical terms, the log-likelihood for a joint Gaussian probability distribution function for *N* independent events is given by

$$\ln p(y_{1}, y_{2}, \ldots, y_{N}) = -\frac{N}{2}\ln\left(2\pi\sigma^{2}\right) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{N}\left(y_{i} - mx_{i} - c\right)^{2}$$
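The two logarithm rules can be checked numerically. The sketch below, again with made-up data and hypothetical function names, computes the Gaussian log-likelihood directly as a sum and compares it with the logarithm of the joint probability built up the long way as a product.

```python
import math


def log_likelihood(xs, ys, m, c, variance):
    """Gaussian log-likelihood: a constant term minus a scaled sum of squares."""
    n = len(xs)
    sum_sq = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2.0 * math.pi * variance) - sum_sq / (2.0 * variance)


def log_of_product(xs, ys, m, c, variance):
    """Logarithm of the joint probability, computed as an explicit product."""
    product = 1.0
    for x, y in zip(xs, ys):
        density = math.exp(-((y - (m * x + c)) ** 2) / (2.0 * variance))
        product *= density / math.sqrt(2.0 * math.pi * variance)
    return math.log(product)


# Made-up measurements scattered about the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

direct = log_likelihood(xs, ys, m=2.0, c=1.0, variance=0.04)
long_way = log_of_product(xs, ys, m=2.0, c=1.0, variance=0.04)

# The two routes agree to within floating-point rounding.
print(direct, long_way)
```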

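So now we just need to maximise this with respect to the parameters *m* and *c*, which we can do using calculus. Notice that *m* and *c* only appear in the sum of squared differences between the data and the line, so maximising the log-likelihood is the same as minimising that sum, which is the familiar method of least squares. I won’t go any further into the mathematics, as there are plenty of resources online that describe the process in detail.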
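For a straight line the calculus can be done by hand: setting the derivatives of the log-likelihood with respect to *m* and *c* to zero gives the standard least-squares result, *m* = Σ(*x*_{i} − x̄)(*y*_{i} − ȳ) / Σ(*x*_{i} − x̄)^{2} and *c* = ȳ − *m* x̄, where x̄ and ȳ are the means of the data. Here is a minimal sketch of that result, using invented data and a hypothetical helper called `fit_line`, which also checks that nudging the fitted parameters lowers the log-likelihood.

```python
import math


def log_likelihood(xs, ys, m, c, variance):
    """Gaussian log-likelihood for the straight line y = m*x + c."""
    n = len(xs)
    sum_sq = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2.0 * math.pi * variance) - sum_sq / (2.0 * variance)


def fit_line(xs, ys):
    """Closed-form m and c that maximise the Gaussian log-likelihood."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx
    c = y_mean - m * x_mean
    return m, c


# Made-up measurements scattered about the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
variance = 0.04  # assumed noise variance; it does not change the best m and c

m_hat, c_hat = fit_line(xs, ys)
best = log_likelihood(xs, ys, m_hat, c_hat, variance)

# Moving either parameter away from the fitted value reduces the log-likelihood.
nudged_m = log_likelihood(xs, ys, m_hat + 0.05, c_hat, variance)
nudged_c = log_likelihood(xs, ys, m_hat, c_hat - 0.05, variance)
print(m_hat, c_hat, best, nudged_m, nudged_c)
```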

Hopefully that has set the scene for my next blog when I’ll explore GPs.