In this blog, I’ll look at how a Gaussian Process builds up predictions when we have more than one data point to learn from. As before, I’ve assumed that the data is free from noise. The figures below show the predictions for two and three data points, and explore what happens when we vary the length scale hyperparameter used in the squared exponential kernel.
Previously I introduced the covariance function, which describes how much the prediction of the value of a function at one point, the test point, is influenced by its value at another point, the training point. If we have multiple training points, this idea is extended so that the mean prediction of the function, $y_*$, at $x_*$ is given by

$$y_* = \sum_i \left(K^{-1} y\right)_i \, k_{i*}$$
Apologies if this looks a little daunting! The summation is over all of the training data points, so the subscript i runs from 1 to the number of training points. The sum adds up the contribution, or influence if you like, that the measured value at each training point has on the prediction. This influence depends on what we have learnt about how much the training data points are related to each other, which is encoded in K, and on how far our test point is from each of the training points. Let’s explore in more detail what each of the factors corresponds to.
$K^{-1}$ is the inverse of the matrix $K$ that contains the covariances between the training points. In other words, the element in the mth row and the nth column of $K$ is given by our choice of the squared exponential function to measure covariance, which I introduced in my previous blog, evaluated between the training points $x_m$ and $x_n$:

$$K_{mn} = k(x_m, x_n)$$
As a consequence, the diagonal elements of the matrix are equal to one, since each training point is at zero distance from itself.
y is a column vector in which each element corresponds to one of the measured values of our training data set.
The product $K^{-1} y$ is also a vector, and the subscript i above means the ith component of that vector.
The final factor, $k_{i*}$, is the covariance between the ith training point and the test point, so it effectively measures how much influence each training point has on the predicted value of the function at the test point. As would be expected, the further $x_*$ is from $x_i$ the less influence that training point has, with the extent of influence also being determined by the hyperparameter length scale $l$. It is given by the same squared exponential function, now evaluated between $x_i$ and $x_*$:

$$k_{i*} = k(x_i, x_*)$$
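To make the pieces concrete, here is a minimal Python sketch of the summation using NumPy. The training values, the test point and the function name `sq_exp_kernel` are made up for illustration. The sketch also assumes the common form exp(-(x - x')^2 / (2 l^2)) for the squared exponential, which may differ by a constant factor in the exponent from the convention used in the previous blog.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale):
    """Squared exponential covariance between two sets of 1-D inputs.

    Assumed convention: exp(-(x - x')^2 / (2 l^2)), with unit amplitude.
    """
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return np.exp(-((xa - xb) ** 2) / (2.0 * length_scale ** 2))

# Made-up, noise-free training data and test point (illustration only).
x_train = np.array([0.0, 1.0, 2.5])
y_train = np.array([0.5, -0.3, 1.2])
x_star = 1.7
length_scale = 1.0

# Covariances between the training points.
K = sq_exp_kernel(x_train, x_train, length_scale)
# The vector K^{-1} y.
weights = np.linalg.inv(K) @ y_train
# The covariances k_{i*} between each training point and the test point.
k_star = sq_exp_kernel(x_train, [x_star], length_scale)[:, 0]

# Add up the contribution of each training point to the prediction.
y_star = 0.0
for i in range(len(x_train)):
    y_star += weights[i] * k_star[i]

print("diagonal of K:", np.diag(K))
print("mean prediction at x* =", x_star, "is", y_star)
```

The explicit loop mirrors the summation term by term. It is written for readability rather than speed.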
I find it difficult to interpret the formula for $y_*$ intuitively, even though I follow the mathematical manipulations that lead to it. I can however rewrite it, using the fact that $K$ and therefore $K^{-1}$ is symmetric, as

$$y_* = y^T K^{-1} k_*$$
Here $y^T$ is now a row vector but still with each element corresponding to one of the measured values of our training data set. Technically it is known as the transpose of the column vector $y$, hence the superscript symbol T. $k_*$ is a column vector with each element corresponding to one of the covariance relations between the training data and the test data, in other words it contains each of the $k_{i*}$.
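The matrix form translates almost directly into code. The sketch below is one possible way of writing it, again assuming the exp(-(x - x')^2 / (2 l^2)) convention for the kernel and using made-up data for two and three training points. The names `sq_exp_kernel` and `predict_mean` are hypothetical helpers, not part of any library.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale):
    """Squared exponential covariance, assumed form exp(-(x - x')^2 / (2 l^2))."""
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return np.exp(-((xa - xb) ** 2) / (2.0 * length_scale ** 2))

def predict_mean(x_train, y_train, x_test, length_scale):
    """Noise-free mean prediction y* = y^T K^{-1} k* at each test point."""
    K = sq_exp_kernel(x_train, x_train, length_scale)
    # One column of k_star per test point.
    k_star = sq_exp_kernel(x_train, x_test, length_scale)
    # Solve K a = k* instead of forming K^{-1} explicitly.
    return np.asarray(y_train, dtype=float) @ np.linalg.solve(K, k_star)

# Made-up data sets with two and three training points (illustration only).
datasets = {
    "two points": (np.array([0.0, 1.0]), np.array([0.5, -0.3])),
    "three points": (np.array([0.0, 1.0, 2.5]), np.array([0.5, -0.3, 1.2])),
}
x_test = np.linspace(-1.0, 3.5, 10)

predictions = {}
for label, (x_train, y_train) in datasets.items():
    for length_scale in (0.3, 1.0, 3.0):
        y_pred = predict_mean(x_train, y_train, x_test, length_scale)
        predictions[(label, length_scale)] = y_pred
        print(label, "l =", length_scale, np.round(y_pred, 3))
```

Printing the predictions for several length scales gives a numerical counterpart to the figures: the training data are the same, but the curve between and beyond the points changes with the length scale.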
What we can now see, with the relation in this form, is that if $x_*$ happens to coincide with one of our training points, $x_j$ say, then the column vector $k_*$ is identical to the jth column of the matrix $K$, so we can write

$$K^{-1} k_* = \delta_j$$
$\delta_j$ is a column vector in which all the elements except the jth are zero, and the jth element is equal to one. As a consequence

$$y_* = y^T \delta_j = y_j$$
In words then, the predicted value is identical to the measured value when $x_*$ coincides with one of our training points. So even if we can’t intuitively interpret the prediction when we have many training points, we can at least be assured that it satisfies one of the requirements that we placed on our GP if the data is noiseless!
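This property is easy to check numerically. The following sketch places the test point on top of one of three made-up training points and confirms that the prediction reproduces the measured value. As before, the kernel convention exp(-(x - x')^2 / (2 l^2)) and the helper name `sq_exp_kernel` are assumptions made for illustration.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale):
    """Squared exponential covariance, assumed form exp(-(x - x')^2 / (2 l^2))."""
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return np.exp(-((xa - xb) ** 2) / (2.0 * length_scale ** 2))

# Made-up, noise-free training data (illustration only).
x_train = np.array([0.0, 1.0, 2.5])
y_train = np.array([0.5, -0.3, 1.2])
length_scale = 1.0
K = sq_exp_kernel(x_train, x_train, length_scale)

# Put the test point exactly on the training point with index j.
j = 1
k_star = sq_exp_kernel(x_train, [x_train[j]], length_scale)[:, 0]

# k* should then be the jth column of K.
print("k* equals column j of K:", np.allclose(k_star, K[:, j]))

# K^{-1} k* should be a vector of zeros with a one in position j.
delta = np.linalg.solve(K, k_star)
print("K^{-1} k* =", np.round(delta, 12))

# The prediction should equal the measured value y_j.
y_star = y_train @ delta
print("prediction:", y_star, "measured:", y_train[j])
```

In floating point arithmetic the agreement is only up to rounding error, which is why the comparison uses a tolerance rather than exact equality.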
One of the challenges of machine learning is the calculation of the inverse of the matrix K. Its size depends on the number of data points we use for training, which we would like to be as large as possible. The problem, the curse of dimensionality, is worsened when we are interested in data that depends on not one but many independent variables. For example, we might measure the stress-extension relationship as the extension, the degree of cross-linking and the temperature all vary. Imagine we take measurements at ten different values of each: that’s $10^3$ = 1000 data points. Each data point requires a covariance measure with every other data point, which means the computer needs to invert a matrix with 1000 rows and 1000 columns, a million elements in all, which is no trivial task. Much of the effort of those developing algorithms for machine learning is focussed on how to tackle this problem and how to reduce the dimensionality so that problems become computationally tractable.
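To get a feel for the sizes involved, the sketch below builds the covariance matrix for a grid of ten values of each of three variables. Everything specific here is an assumption for illustration: the inputs are rescaled to lie between 0 and 1, the measured values are replaced by a made-up smooth function, and a single length scale is shared by all three variables. It also shows one common way of reducing the work, which is to solve the linear system for $K^{-1} y$ directly rather than forming the inverse.

```python
import itertools
import numpy as np

# Ten values of each of three variables, rescaled to [0, 1] (assumed).
levels = np.linspace(0.0, 1.0, 10)
X = np.array(list(itertools.product(levels, levels, levels)))  # shape (1000, 3)

# Made-up smooth response standing in for the measured values.
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 - 0.5 * X[:, 2]

# A single shared length scale, assumed for simplicity.
length_scale = 0.1

# Squared distance between every pair of data points, then the kernel.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2.0 * length_scale ** 2))  # shape (1000, 1000)

# Solve K a = y directly; this avoids forming K^{-1} explicitly.
alpha = np.linalg.solve(K, y)

print("number of data points:", X.shape[0])
print("shape of K:", K.shape)
print("largest residual of K a = y:", np.abs(K @ alpha - y).max())
```

A matrix of this size is still handled comfortably by a laptop, but the number of elements grows as the square of the number of data points, so adding a fourth variable with ten values would already give a matrix with 10000 rows and columns.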
The next question we will need to consider is how we optimise the model. In other words, how do we choose our hyperparameters in order to enable predictions? As can be seen from the figures, different length scales give very different predictions, so which one is the best? In order to answer that question we need to define what we mean by best. Recall that for linear regression models, where we are optimising fitting parameters, we typically use the method of least squares, which effectively minimises the distance between the data and the fit taking into account all the data points. Our GP though has been chosen to ensure that the predictions pass through all the data points, so we will need something a little different.
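A quick numerical experiment illustrates why least squares cannot help here. The sketch below uses made-up training data, the assumed kernel convention exp(-(x - x')^2 / (2 l^2)) and the hypothetical helpers `sq_exp_kernel` and `predict_mean`. For each of three length scales it computes the sum of squared differences between the predictions and the training data, along with the prediction at a point away from the data.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale):
    """Squared exponential covariance, assumed form exp(-(x - x')^2 / (2 l^2))."""
    xa = np.asarray(xa, dtype=float).reshape(-1, 1)
    xb = np.asarray(xb, dtype=float).reshape(1, -1)
    return np.exp(-((xa - xb) ** 2) / (2.0 * length_scale ** 2))

def predict_mean(x_train, y_train, x_test, length_scale):
    """Noise-free mean prediction y* = y^T K^{-1} k* at each test point."""
    K = sq_exp_kernel(x_train, x_train, length_scale)
    k_star = sq_exp_kernel(x_train, x_test, length_scale)
    return np.asarray(y_train, dtype=float) @ np.linalg.solve(K, k_star)

# Made-up, noise-free training data and a test point away from the data.
x_train = np.array([0.0, 1.0, 2.5])
y_train = np.array([0.5, -0.3, 1.2])
x_star = np.array([1.7])

results = {}
for length_scale in (0.3, 1.0, 3.0):
    # Least-squares measure evaluated at the training points.
    fitted = predict_mean(x_train, y_train, x_train, length_scale)
    sum_sq = float(np.sum((fitted - y_train) ** 2))
    # Prediction between the training points.
    y_star = float(predict_mean(x_train, y_train, x_star, length_scale)[0])
    results[length_scale] = (sum_sq, y_star)
    print("l =", length_scale, "sum of squares =", sum_sq, "y* =", y_star)
```

The sum of squares is zero to within rounding error for every length scale, while the predictions away from the data differ, so this measure gives us no way to prefer one length scale over another.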