There are plenty of good books on the mathematics behind machine learning, so I want to present it a little differently. What I’m going to do is describe the mechanics of machine learning, specifically Gaussian processes (GPs), starting from just one data point. From a machine learning, or even a more traditional data fitting, perspective this might seem strange: clearly we can learn nothing of value from one data point. What it will allow us to do, though, is keep the mathematics at introductory undergraduate level and, I believe, illustrate the underlying mechanics of machine learning. We also won’t be able to look at optimisation. In the same way that there are an infinite number of equally possible straight lines that can be fitted through one data point, there are an infinite number of equally possible predictions that a machine learning algorithm can make. Hopefully, though, this way of gently easing ourselves into machine learning will provide some additional insight beyond the standard approaches written for those more used to thinking in terms of multidimensional matrix algebra.
A Gaussian process addresses the question: how do I maximise the flexibility of my fit without having an infinite number of parameters, or rather as many parameters as data points? One way to think about it is that we want to be able to learn and predict behaviour for any possible, unknown, mathematical description that could fit the data.
There are two key assumptions of standard GPs. The first is that the value of a function, or of a measurement, taken at some value of our independent variable will be close to the value of the measurement if we take it at a nearby value of the independent variable. In other words, we assume that what we are measuring varies smoothly with the independent variable. The second is that, if we believe the data is free from noise and we measure y = f(x) at some point x, then we want our machine learning algorithm to return exactly y if we input the value x. This may seem trivial, but what we do not want to do is simply look up whether we have taken a measurement at x previously and return the value of that measurement.
For the first assumption, I believe that more sophisticated GPs than my discussion will cover are able to deal with discontinuities in data, such as are found when we measure some property of a material as it undergoes a transition. For the second assumption, a belief that the data is noisy can be built into a GP; I’ll discuss how that is done in a later blog.
In my previous blog I introduced the joint Gaussian distribution function for the probability of independent events. This can be extended to the case of events that are not independent but related. This is of interest because of the first assumption that I mentioned above: if two measurements taken close to each other are expected to be closely related, then independence is no longer a valid assumption. To account for this whilst retaining a Gaussian structure we introduce a function, called the covariance function, that measures how strongly knowledge of y measured at x determines the likely value of y* measured at x*. The covariance function is often referred to as the kernel. In the language of machine learning, x and the corresponding y(x) represent our training data. The covariance function depends on the distance between the training point, x, and the point at which we want to predict the outcome, x*. One of the most commonly used is the squared exponential:
$$k(x, x^*) = \sigma_f^2 \exp\left(-\frac{(x - x^*)^2}{2 l^2}\right)$$
The parameter, l, is known as a length (even though it will not generally have dimensions of length, since it must have the same dimensions as the independent variable x). It controls how much influence the data measured at x has on the predicted value at x*. The other parameter, σf, sets the variance and measures confidence in our prediction. These two parameters characterise relationships between data rather than directly characterising the data itself. For this reason, the parameters are often referred to as hyper-parameters.
You might have noticed that the above function has the same structure as the Gaussian function, which is potentially confusing: this is not the reason that we refer to this technique as a Gaussian process! The name Gaussian process refers to the joint probability distribution, the form of which is always Gaussian, which I’ll cover in an upcoming blog. Other covariance functions, which do not have the structure of a Gaussian, can also be used; I’ll discuss those in a future blog too.
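To make the roles of the two hyper-parameters concrete, here is a minimal Python sketch of the squared exponential kernel. It assumes the commonly used form k(x, x*) = σf² exp(−(x − x*)² / (2 l²)); the function name `squared_exponential` and its argument names are my own choices for illustration.

```python
import math


def squared_exponential(x, x_star, length=1.0, sigma_f=1.0):
    """Squared exponential covariance between the points x and x_star.

    Assumed form: sigma_f**2 * exp(-(x - x_star)**2 / (2 * length**2)).
    """
    return sigma_f**2 * math.exp(-((x - x_star) ** 2) / (2.0 * length**2))


# The covariance is largest when the two points coincide and decays
# as the distance between them grows relative to the length.
for distance in (0.0, 0.5, 1.0, 3.0):
    print(distance, squared_exponential(0.0, distance, length=1.0))
```

Increasing `length` makes the covariance decay more slowly with distance, while `sigma_f` only rescales it.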
With the covariance function defined, the most probable value of y* then depends on it in a very simple way when we only have the one training point:
$$y^* = \frac{k(x^*, x)}{k(x, x)}\, y$$
We see immediately that this ensures that if our query coincides with the training point, i.e., x* = x, then y* = y, which was one of our requirements. You can also see that the mean predicted value does not depend on the variance, which makes sense. The variance tells us about our confidence in the prediction, not the prediction itself. We’ll return to the variance later: one of the most powerful features of GP machine learning compared to other techniques is that it not only predicts the most likely value but also provides a measure of our confidence in that value. This is particularly valuable when we are deciding where we need to undertake more measurements in order to improve the predictive capability.
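The two properties just described can be checked numerically. The sketch below assumes the single-point prediction takes the form y* = [k(x*, x) / k(x, x)] y with the squared exponential kernel; the training values `x_train` and `y_train` are hypothetical numbers chosen only for illustration.

```python
import math


def kernel(a, b, length, sigma_f):
    """Squared exponential covariance (assumed form, see lead-in)."""
    return sigma_f**2 * math.exp(-((a - b) ** 2) / (2.0 * length**2))


def predict_mean(x_star, x_train, y_train, length=1.0, sigma_f=1.0):
    """Most probable y* at x_star given a single training point."""
    weight = kernel(x_star, x_train, length, sigma_f) / kernel(
        x_train, x_train, length, sigma_f
    )
    return weight * y_train


# Hypothetical training point, for illustration only.
x_train, y_train = 2.0, 0.5

# Querying at the training point returns the training value.
print(predict_mean(x_train, x_train, y_train))

# Changing sigma_f leaves the mean prediction unchanged, because it
# cancels in the ratio of the two kernel values.
print(predict_mean(3.0, x_train, y_train, sigma_f=1.0))
print(predict_mean(3.0, x_train, y_train, sigma_f=5.0))
```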
For those of you who prefer visualisation to understand equations, below are three different predictions based on the squared exponential kernel for the stress-extension relation when my training data corresponds to just one data point. In each I’ve used a fixed value for the variance of 1, and varied the length.
The results are quite intuitive. For l = 10 extension units, we are saying that the influence of our one data point extends a long way either side, so the stress is predicted to be more or less constant. For l = 1, we see it falls away more quickly and for l = 0.1, more quickly still. You will notice that in the latter two cases, the prediction falls away to zero stress, which might seem a little strange, but this brings us to an important concept in GPs, and machine learning in general.
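The same behaviour can be reproduced with a few lines of Python. This is only a sketch: the training point (`x_train`, `y_train`) and the query grid are hypothetical values, not the data behind the predictions described above, and the mean is assumed to be y* = y exp(−(x* − x)² / (2 l²)), which is what the single-point prediction reduces to for the squared exponential kernel.

```python
import math


def predict_mean(x_star, x_train, y_train, length):
    """Single-point GP mean with a squared exponential kernel and zero prior."""
    return y_train * math.exp(-((x_star - x_train) ** 2) / (2.0 * length**2))


# Hypothetical training point and query grid, for illustration only.
x_train, y_train = 2.0, 0.5
x_query = [1.0, 1.5, 2.0, 2.5, 3.0]

predictions = {
    length: [predict_mean(x, x_train, y_train, length) for x in x_query]
    for length in (10.0, 1.0, 0.1)
}

for length, values in predictions.items():
    print(f"l = {length:>4}:", [round(v, 3) for v in values])
```

With l = 10 the predicted values stay close to `y_train` across the whole grid, whereas with l = 0.1 they drop to essentially zero at every query point other than the training point itself.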
We could, if we wished, include a belief that the stress is non-zero for all extensions greater than 1. In the absence of data, we might believe that the relation between stress and extension is of the form
as I discussed previously. (At this point I should apologise for my symbol choice. The σ in the above refers to stress and has nothing to do with the variance parameter, σf.)
Such a belief is known as a prior. So why haven’t I included it? It turns out that once we have many data points on which to train our GP, the prediction for both the mean and the variance of y*, which is called the posterior, becomes insensitive to any prior belief. In machine learning language, the data overwhelms the prior. For this reason, it is common to set the prior mean equal to zero and let the data do the talking.
Of course, the predictive value of a GP really depends on having many more data points than one, so the fact that the predictions are obviously very poor is just a consequence of the machine learning algorithm having very little from which to learn.
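For comparison, here is a sketch of how a non-zero prior could enter the single-point prediction. It assumes the standard form y* = m(x*) + [k(x*, x) / k(x, x)] (y − m(x)), where m is the prior mean function. The function `prior_mean` below is a purely hypothetical straight line chosen for illustration; it is not the stress-extension relation referred to above, and the training point is likewise made up.

```python
import math


def kernel(a, b, length=1.0, sigma_f=1.0):
    """Squared exponential covariance (assumed form)."""
    return sigma_f**2 * math.exp(-((a - b) ** 2) / (2.0 * length**2))


def prior_mean(x):
    """Hypothetical prior belief about y(x); illustrative only."""
    return 0.2 * (x - 1.0)


def predict_mean_with_prior(x_star, x_train, y_train, length=1.0):
    """Single-point GP mean when the prior mean is not zero."""
    weight = kernel(x_star, x_train, length) / kernel(x_train, x_train, length)
    return prior_mean(x_star) + weight * (y_train - prior_mean(x_train))


# Hypothetical training point, for illustration only.
x_train, y_train = 2.0, 0.5

# At the training point the prediction still returns the measured value.
print(predict_mean_with_prior(x_train, x_train, y_train))

# Far from the training point the prediction reverts to the prior
# belief rather than to zero.
print(predict_mean_with_prior(10.0, x_train, y_train), prior_mean(10.0))
```

Setting `prior_mean` to return zero everywhere recovers the zero-prior prediction used in the rest of this post.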