Still reading? Good. In the second of my two part blog I’ll introduce non-parametric learning. The most important thing to understand about non-parametric learning is that it is not non-parametric. Now that we’ve cleared that up …
Thanks to those of you who voted following my previous blog. You’ll find the results at the end of this blog. So what’s going on in the three figures I posted earlier?
Figure 1 is the simplest and, as, probably, you’ve already guessed, each data point is joined to its neighbours by straight lines. From experience we tend to think that this it is not very likely that if we took measurements between the existing data that they would fall on these lines. We would expect the data to vary more smoothly.
Figure 3 is generated from the best fit of the relationship between stress and extension that arises from some simple assumptions and application of the idea of entropy, as I mentioned here. In terms of learning about physics, this representation could be considered to provide the most learning, albeit that we are learning that our model is too simplistic. By just comparing the shapes of the data with the curve, we can infer that we need a model with more parameters to describe physics not included in our simple model. This is parametric physics learning.
If we aren’t attempting to fit a physics relationship but believe that our data is representative of an underlying trend, what options are there?
Figure 2 is generated using a “smoothing spline”. This is a neat way of attempting to find a way to interpolate data based on the data alone rather than any beliefs about what might cause a particular relation between extension and stress. A smoothing spline is an extension of the cubic splines which is a type of non-parametric learning, The cubic spline describes locally the curve at each data point as a cubic equation. In this case, in contrast to the physics approach, we do not impose a global relationship. This means that knowing the value of the data of the first point tells us nothing about the value at, say, the 10th point of measurement. The physics approach would enable us to make this inference, but as we can see from the figure 2, in some cases, it wouldn’t be a very good prediction!
You may be wondering how we can define a cubic for each individual data point. A cubic equation has four parameters, so we have four unknowns and only one data point. To find the other unknowns, we add assumptions such as the slope and curvature of the curve connecting adjacent points are equal, which guarantees the appearance of smoothness. The math behind this is cumbersome to write down and involves a great deal of repetitious calculations, which is why it only became popular with the advent of computers.
A smoothing cubic spline extends the idea of a cubic spline so it can deal with noisy data, that is data that varies about the expected mean. Without this, the cubic spline can become quite spiky when the data is noisy, so a smoothed spline relaxes the demand that the curve pass through every data point and instead looks for a compromise curve that is smooth but never too far from the data. This requires the introduction of another (non?) parameter, unsurprisingly called the smoothing parameter. When this is zero it reproduces a cubic spline fit that passes through all the data points, when it is one it fits a straight line through the data. The best choice of smoothing parameter requires us to introduce some arbitrary measure of what good looks like, but statisticians have come up with ways of quantifying and measuring this, a topic of a future blog.
In what way, then, is this approach non-parametric? In parametric learning, the number of parameters is dictated by the model we have chosen to fit our data. For the stress-extension entropy model, we have just one parameter. We believe that more data will either improve the fit to the model or further support our view that the model needs to be more sophisticated, but we do not believe that more data will necessarily require more parameters. In non-parametric learning the number of parameters is determined by the amount of data and how much we wish to smooth the data. The parameters should be viewed as having no meaning outside of their use in describing how the data behaves. In other words, we cannot extract any physical meaning from the parameters.
So which one is preferable? This comes down to asking ourselves the questions: what do we want to learn and what do we want to do with our new knowledge. If our goal is to determine the physical laws that govern rubber elasticity and how those laws can be most elegantly represented mathematically, figure 2 is an important step. If our goal is to predict what will happen, but don’t care why, for extension values that we have not measured then figure 3 is preferable. This is the essence of the difference between machine learning and physics. Machine learning works on the basis that the data is everything and we learn everything we need to know from the data itself. Physics on the other hand is continually searching for an underlying description of the universe based on laws that we discover guided by observations.
As for your votes, the view was unanimous: Figure 2. Is this telling us that machines have learnt to think like humans, or that we have biased their learning outcomes with our own preconceived notions?