In this two-part blog, I’ll introduce the two different approaches to machine learning, both of which can be applied to the classification and regression problems discussed in my previous blog here. It seems to me, at least at this early stage of trying to make sense of AI, that the difference between parametric and non-parametric learning strikes at the heart of the difference between machine learners and theoretical physicists. One of my aims is to understand this difference so that we can start to think about how to bring the two worlds together.
In this first part, I’ll provide a personal view of parametric learning from a physics rather than a machine learning perspective, and touch on where this approach, which we physicists feel so comfortable with, starts to show its drawbacks.
I’ll be using one of my favourite examples to illustrate. If you want to know more about rubber band thermodynamics, there are plenty of resources that are easy to find. The rubber band heat engine is one that I’ve used in my class, and there is a lovely, pre-internet-era, half-page paper that describes, in the most concise terms I’ve ever come across, simple experiments for discovering entropy in rubber bands.
In theoretical physics, we start with laws that have proven themselves to be robust. My personal favourite is the second law of thermodynamics, which tells us that the entropy, or disorder, of an isolated system always increases or stays the same but cannot decrease, a seemingly abstract statement with enormous predictive consequences. Why does a rubber band resist being stretched more when you heat it up? Entropy. You might wonder why I’m using this example rather than something from a more exotic area of physics. I could talk about the weird and wonderful world of Bose-Einstein condensates or even the Higgs boson, but I still marvel that entropy can be found in the behaviour of something so commonplace.
Anyway, back to physics. On top of laws we usually have to add some assumptions. This might be something like: inside our rubber band, the long chain molecules can be described as a series of connected rods where the joints between each rod can rotate freely. This apparently rather crude description, which reduces a complicated molecular structure to something so simple, turns out to be surprisingly insightful, in part because it simplifies the math significantly. Our assumptions lead to parameters, in this case the number, or density, of crosslinks, which are the points inside our rubber where the long chain molecules are chemically tied to other long chain molecules. Our prediction about how the rubber responds when we stretch it, known as the stress/strain relation, depends on the crosslink density. If we measure the stress/strain relation and fit the prediction to the data, the crosslink density becomes a fitting parameter. This is parametric learning and it is pretty much how physicists like to construct and test their theories.
For data such as stress vs strain, we really only need a computer to automate the process of finding a parametric line of best fit. As I mentioned earlier, this is a primitive form of machine learning. Although this insight gave me the confidence to explore the subject further, it would be doing all the brilliant minds working on the subject a great disservice to imply that AI is simply ever more complex versions of y = mx + c. But before I start talking about what it is that these brilliant minds are doing, let me finish with a few more words on parametric models.
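To make the line of best fit concrete, here is a minimal sketch in Python using NumPy. The data points are made up for illustration and lie exactly on y = 2x + 1; the function name fit_line is my own choice, not anything standard.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares fit of y = m*x + c; returns the pair (m, c)."""
    # np.polyfit with deg=1 returns the coefficients highest power first.
    m, c = np.polyfit(x, y, deg=1)
    return float(m), float(c)


# Made-up, noise-free data lying on y = 2x + 1 (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, c = fit_line(x, y)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
```

The two numbers m and c are all the algorithm learns. The straight-line shape itself was our choice, made before the computer saw any data.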
Parametric models are well-defined but suffer from the fact that in choosing a model, we impose an expected structure on our data, so all the algorithm can do is determine which set of parameters provides the best fit for that particular model. Take for example the simplest theory for how a rubber responds in an experiment when we apply a force and measure how much the rubber is stretched. It predicts how the stress, which we usually give the symbol σ and which is found by dividing the force by the cross-sectional area, depends on the extension, λ. In its standard textbook form, with the stress defined using the cross-sectional area of the unstretched sample, it reads:
σ = kTρ(λ − 1/λ²)
Here k is Boltzmann’s constant, which always crops up in models built on the second law of thermodynamics, T is the temperature of our measurement and ρ is the crosslink density. We can use linear regression by plotting σ on the y-axis and λ − 1/λ² on the x-axis, in which case our slope is kTρ, so we can determine the crosslink density in this way. We can compare this with what we expect based on knowledge of how we made the rubber.
For many rubbers this relation turns out to be too simplistic. It works well at low extensions, overestimates the stress at intermediate extensions and underestimates it at higher extensions, but the algorithm will find a line of best fit even if it isn’t very good. We will probably guide the fit so that it gives more weight to the region of data where we expect it to work best. This is known as using prior information, an important concept in Bayesian machine learning, to improve the quality of our learning. What the algorithm cannot tell us is whether the equation we are using just needs refining or whether we need to start with a completely different model. It can tell us how good the fit is, but only once we have imposed somewhat arbitrary choices about what constitutes a good fit. Where the mismatches between theory and data occur often provides clues about how to build a better description, and we go back to the theoretical drawing board to figure out where we went wrong.
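The fitting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the standard form of the simplest model, σ = kTρ(λ − 1/λ²), fits a straight line through the origin, and uses synthetic, noise-free data generated from the model itself. The function names, the assumed crosslink density and the weighting scheme are all hypothetical choices of mine.

```python
import numpy as np

K_B = 1.380649e-23  # Boltzmann's constant in J/K


def strain_measure(lam):
    """x-axis quantity for the simplest rubber model: lambda - 1/lambda**2."""
    lam = np.asarray(lam, dtype=float)
    return lam - 1.0 / lam**2


def fit_crosslink_density(lam, stress, temperature, weights=None):
    """Weighted least-squares fit of stress = slope * x through the origin.

    Returns (rho, residuals), where rho = slope / (k T).
    """
    x = strain_measure(lam)
    stress = np.asarray(stress, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    # Closed-form weighted least squares for a line through the origin.
    slope = np.sum(w * x * stress) / np.sum(w * x * x)
    residuals = stress - slope * x
    return slope / (K_B * temperature), residuals


# Synthetic data for illustration only: generated from the model, no noise.
T = 300.0          # temperature in kelvin (assumed)
rho_true = 1.0e26  # crosslinks per cubic metre (assumed value)
lam = np.linspace(1.1, 3.0, 20)
stress = K_B * T * rho_true * strain_measure(lam)

# Give more weight to low extensions, where the model is expected to work best.
weights = np.where(lam < 1.5, 1.0, 0.1)

rho_fit, residuals = fit_crosslink_density(lam, stress, T, weights)
print(f"fitted crosslink density: {rho_fit:.3e} per cubic metre")
print(f"largest residual: {np.max(np.abs(residuals)):.3e} Pa")
```

Because the synthetic data come from the model itself, the residuals here are essentially zero. With real measurements, the sign and size of the residuals at each extension are where the clues about the model's shortcomings would show up.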
If our model is a very poor fit, we cannot use it to make useful new predictions, but there are plenty of situations where we need to make predictions and cannot wait for someone to find the right model. The theory for rubber elasticity was first proposed in the 1940s by Hubert James and Eugene Guth. Over 70 years later, theoreticians are still attempting to refine their model! In the second part, I’ll explore non-parametric machine learning, a different way of describing and predicting data.