Learning new physics with AI

It has been quite a while since I last posted. I was reminded to write something by a nice remark about the blog from a colleague at Microsoft whom I met through an Engineering and Physical Sciences Research Council event. He also pointed me to a fascinating blog by Neil Dalchau at Microsoft on how to use machine learning to pin down parameters in models that can only be solved numerically. I was very excited by this because, as it happens, this has become a focus of our own research.

Before I discuss that in a future blog though, I want to share what I think is an exciting example of where machine learning can help us learn more from existing data than we previously thought possible.

First, recall that I’m fascinated by the physics behind structural evolution that leads to images like these:

Ultimately we want to know how a structure relates to some property of the material, which might be its strength or how it absorbs light, depending on the intended application. Ideally, we want a succinct way to characterise these structures. A very powerful experimental approach involves shining a beam of some sort at the sample (neutrons are particularly good if the mixture is different polymers, but X-rays or light can also be used) and observing how the beam is affected as it exits the material. A significant piece of information we obtain is a length scale, which tells us something about the size of the structures inside our material. The problem is that the length scale is not unique to a given structure. As an example, the two structures above have the same length scale according to a scattering experiment, as shown in the figure below, but just by looking at the images above we can see that they are different. On the left, the structure corresponds to droplets of one type of polymer in a matrix of another, whilst we refer to the structure on the right as co-continuous.

How neutrons “see” the two structures above. The horizontal axis represents the angle by which some neutrons are scattered away from the direct beam, most of which passes through the sample without being affected. The peak position tells us the size of the structures inside the material. The further to the right the peak is, the smaller the structures, but in this case the peaks are in the same place, telling us that both structures have the same characteristic size even though they are otherwise very different.

By borrowing tools from cosmology, scientists in the 1990s proposed some extra measures that can characterise the structure. These characteristics, known as Minkowski functionals, are volume, surface area, curvature and connectivity. Volume represents the relative volume occupied by each of the two phases. Surface area is the total contact area between the two phases. Curvature is a measure of how curved the structures are – small spheres have a much higher curvature than large spheres. Connectivity, also called the Euler measure, tells us how easy it is to trace a path from one part of the structure that is occupied by one of the polymers to another that is occupied by the same polymer without having to pass through regions occupied by the other polymer. A droplet structure has low connectivity – if you are inside one droplet you cannot trace a path to another droplet without passing through the matrix of the other polymer. An ideal co-continuous structure has perfect connectivity – you can reach anywhere that is also occupied by the same polymer without having to pass through a region occupied by the other polymer.
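For the more computationally minded, below is a minimal sketch of how rough versions of these measures can be estimated from a binary two-dimensional image using numpy and scipy. The estimators are deliberately simplified stand-ins rather than the full Minkowski machinery, and the random array is just a placeholder for a real microstructure image.

```python
import numpy as np
from scipy import ndimage

def rough_structure_measures(image):
    """Crude estimates of Minkowski-style measures for a binary 2D image.

    image: 2D array of 0s and 1s, where 1 marks one of the polymer phases.
    Returns the volume fraction, an estimate of the interface length and a
    simple connectivity proxy (the number of disconnected phase-1 regions).
    """
    # Volume: the fraction of pixels occupied by phase 1.
    volume_fraction = image.mean()

    # Surface area (an interface length in 2D): count neighbouring pixel
    # pairs whose values differ, horizontally and vertically.
    interface_length = (np.abs(np.diff(image, axis=0)).sum()
                        + np.abs(np.diff(image, axis=1)).sum())

    # Connectivity proxy: a droplet structure breaks into many separate
    # labelled regions, an ideal co-continuous one into very few.
    _, n_regions = ndimage.label(image)

    return volume_fraction, interface_length, n_regions

# Placeholder "microstructure": a random binary field.
rng = np.random.default_rng(0)
demo = (rng.random((128, 128)) > 0.5).astype(int)
print(rough_structure_measures(demo))
```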

The challenge with experimentally determining Minkowski functionals is that it remains difficult to obtain three-dimensional images of internal structures. Confocal microscopy is one tool for doing this, but it is limited to structures of the order of microns, whilst many mixtures have structures that are much smaller. There have been amazing developments in X-ray and neutron tomography which do enable such images to be obtained, but it is still much more time-consuming and costly than a scattering experiment.

It would be really nice if we could learn more from the scattering than just the length scale. The question we asked ourselves was: is there a relation between the scattering and the Minkowski functionals? A strong motivation for this is that we know the volume can be obtained from the scattering through a reasonably simple mathematical relation, so perhaps the other functionals are also hidden in the scattering data.
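For the curious, one standard relation of this kind is the scattering invariant for an ideal two-phase system with sharp interfaces,

$$ Q = \int_0^{\infty} q^{2} I(q)\, \mathrm{d}q \;\propto\; (\Delta\rho)^{2}\, \phi\,(1-\phi), $$

where $I(q)$ is the scattered intensity at wavevector $q$, $\Delta\rho$ is the scattering contrast between the two phases and $\phi$ is the volume fraction of one of them, so the integrated intensity pins down $\phi$ (up to the symmetry between $\phi$ and $1-\phi$).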

To answer this question we trained some machine learning algorithms on pairs of scattering data and Minkowski functionals, both calculated from the same microstructure image. We did this for approximately 200 pairs with different microstructures. We then asked the trained algorithm to predict the Minkowski functionals for 200 scattering data sets that it had not previously seen, and compared the ML predictions with the direct calculations. The results were considerably more impressive than I anticipated!

Machine learning predictions for the Minkowski connectivity or Euler measure based on scattering data compared to the observed values. The match is excellent apart from at the two extremes of the values for the connectivity.
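To give a feel for the mechanics of such a train-and-test loop (though not our actual pipeline), here is a minimal scikit-learn sketch. The arrays are random placeholders with made-up shapes, and the random forest regressor is just one convenient choice of algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Placeholder data: each row of `scattering` stands in for one scattering
# curve (intensity sampled at 50 wavevectors) and `euler` for the
# connectivity (Euler measure) computed directly from the same image.
rng = np.random.default_rng(1)
scattering = rng.random((400, 50))
euler = rng.random(400)

# Train on half of the pairs and hold the rest back as unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    scattering, euler, test_size=0.5, random_state=1)

model = RandomForestRegressor(n_estimators=200, random_state=1)
model.fit(X_train, y_train)

# Compare the ML predictions with the directly calculated functionals.
predicted = model.predict(X_test)
print("R^2 on unseen scattering data:", model.score(X_test, y_test))
```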

We’ve since refined our algorithm further and the early indications are that we can do even better at the extreme values, which is a reminder that whilst machine learning is a black box technique, how the question is framed and how the data is prepared remain crucial to success.

Bias

Google were questioned in the US Congress about whether their search algorithms are biased against conservative leaning websites, an accusation that they, unsurprisingly, deny. A New York Times opinion piece questioned the strength of the US legislative response to bias in AI algorithms. Amazon has been caught up in claims of racial and gender bias in tools it has created for facial recognition.

The headlines rarely discuss what bias means. To tackle the problem of bias, and to understand just how complex the issue could become in machine learning, we first need to understand its origins. The Amazon controversy is one of the clearer examples. Researchers from MIT claimed that the Amazon software mis-identified the gender of female faces and of darker skinned faces more frequently than that of lighter skinned male faces. Whether this bias arises because the data used to train the algorithm is not representative, or whether it is a consequence of the model itself, is not clear. Perhaps it is the machine learning algorithm picking out the bias that is built into both analogue and digital photography?

Even if the data itself is representative and without systematic error, bias still exists in the models generated from the data. To remove bias, we need a way to quantify it. Statistical bias is a measure of how far the average prediction is from the observed value. Bias differs from random variations, which are measured through the variance, which tells us how much the predictions vary about the average prediction. An algorithm that has a high but equal rate of misclassification of the gender of male and female faces has a high variance but a low bias. An algorithm that misclassifies the gender of female faces at a statistically significant higher rate has a high bias even if, on average over both female and male faces, it correctly classifies at a high rate. The challenge in statistical learning is what is called the bias-variance trade-off. Models with low bias have a high variance, so although they might misclassify male and female faces equally, they do so at the price of too much misclassification overall. Much of the fine-tuning of machine learning is about finding the optimal balance between bias and variance, which is usually taken to be where the model makes the most correct predictions overall.
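The distinction is easiest to see numerically. Below is a minimal sketch with made-up classifier scores for two hypothetical groups, scored against a 'true' value of 1.0: the first group is wrong in both directions (high variance, low bias), the second is systematically wrong in one direction (high bias, low variance).

```python
import numpy as np

def bias_and_variance(predictions, observed):
    """Statistical bias and variance of a set of repeated predictions.

    Bias: how far the average prediction sits from the observed value.
    Variance: how much the predictions scatter about their own average.
    """
    predictions = np.asarray(predictions, dtype=float)
    bias = predictions.mean() - observed
    variance = predictions.var()
    return bias, variance

# Hypothetical scores against a true value of 1.0.
group_a = [1.3, 0.7, 1.2, 0.8, 1.1, 0.9]   # scattered but centred: low bias
group_b = [0.7, 0.6, 0.8, 0.7, 0.6, 0.8]   # consistently too low: high bias

print("Group A bias, variance:", bias_and_variance(group_a, observed=1.0))
print("Group B bias, variance:", bias_and_variance(group_b, observed=1.0))
```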

To try to understand statistical bias in machine learning, I’m now working on a classification problem focussed on my favourite topic, microstructures. The question I’m asking is: can machine learning classify microstructures when it only ‘sees’ a subset of the image data? More on this to follow …

Machine learning in physics: when and why?

Yesterday, after a talk I gave on our progress in using machine learning to emulate phase separation, I was asked whether I was a convert to the machine learning approach. This led to a discussion about when and why machine learning might have a role to play in physics. This is an important question. The danger of machine learning is that it can be used to predict anything, and its black box nature makes it difficult to interrogate the integrity of the predictions. This is why validation and cross-checking are essential, but before reaching that stage, I think there is a fairly straightforward way to evaluate whether ML may have a role.

ML works well when you have a lot of data but lack a model to describe that data, or when you don't want to be constrained by a particular choice of model. The data might be generated by solving equations or it might be experimental. Establish whether you are trying to do interpolation or extrapolation. Extrapolation, trying to predict the output when the input falls outside the range of the training inputs, should be done with considerable caution, and is probably best avoided! As I've written previously, unlike many other machine learning algorithms, Gaussian processes provide not just a prediction but also a measure of the uncertainty of the prediction. This uncertainty typically becomes very large when extrapolating, which is an immediate sign that the prediction should not be taken too seriously.
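A minimal scikit-learn sketch of that behaviour, on a toy sine-wave data set rather than anything physical, is below: the predictive uncertainty stays small inside the training range and grows rapidly outside it.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Train a GP on points confined to 0 <= x <= 5, then query it both inside
# that range (interpolation) and well outside it (extrapolation).
rng = np.random.default_rng(2)
x_train = np.linspace(0.0, 5.0, 20).reshape(-1, 1)
y_train = np.sin(x_train).ravel() + 0.05 * rng.standard_normal(20)

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), alpha=1e-3)
gp.fit(x_train, y_train)

x_query = np.array([[2.5], [5.0], [8.0], [15.0]])
mean, std = gp.predict(x_query, return_std=True)
for x, m, s in zip(x_query.ravel(), mean, std):
    # The predictive standard deviation grows once x leaves the training
    # range, flagging that the prediction should not be trusted.
    print(f"x = {x:5.1f}   prediction = {m:7.3f}   uncertainty = {s:.3f}")
```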

The Lorentz Center, Leiden

I’ve spent the past week at the wonderful Lorentz Center at the University of Leiden.

It’s great to know that there are still places that really appreciate that science progresses best in an environment that encourages discussion and collaboration amongst scientists with different but complementary backgrounds, and, of course, unlimited coffee or tea! I’ve been finding out about the state of the art in applications of machine learning to soft matter. The field is very much in its infancy, so lots of exciting innovations are taking place. Most of the early successes are in using ML to learn force fields for molecular dynamics simulations, as illustrated nicely in a recent paper from one of the presenters:

https://www.nature.com/articles/s41467-018-06169-2

They have figured out how to constrain the ML predictions to be physically realistic, in this case ensuring energy conservation. Such constraints are not only pleasing from a physics viewpoint, they also help with the challenge that the quantity of data we have to work with in physics is often much less than in other applications of machine learning.

Emulating phase separating mixtures

As I’ve discussed previously, phase separation, or de-mixing, has fascinated me for a long time. It is one of the most ubiquitous examples of material self-assembly, occurring frequently in complex fluids and living systems. Many technologically important metallic alloys derive their strength not from the raw components but from how those components are arranged after demixing has taken place. Below are three snapshots from a two-dimensional model for phase separation of two polymers, with the concentration of each component colour coded in red and blue. Experimentally, we typically observe these structures with length scales ranging from many tens of nanometres to tens of micrometres.

[Slideshow: three snapshots from the two-dimensional phase-separation model, with the two components colour coded in red and blue.]

Polymer blends with structures similar to these have found use in applications ranging from structural aerospace components for the latest generation of aircraft to flexible solar cells. Despite the technological relevance and the fascinating underpinning physics, the prediction of how microstructure evolves has failed to keep pace with synthetic and formulation advances. The interplay of non-equilibrium statistical physics, diffusion, rheology and hydrodynamics gives rise to multiple processes with overlapping time and length scales.

My interest is in whether the data-driven approach of machine learning (ML) provides an alternative empirical paradigm to physics/hypothesis-based modelling, enabling predictions to be made using observations alone, even in the absence of physical knowledge. 

A first step in machine learning is to develop model emulators, also known as surrogate models. This is where we train a machine learning algorithm with data from a computational model, or a simulation, and then use the emulator to explore, more efficiently, what happens as we move around parameter space. 

This is something we are currently working on, and here is an output from an early attempt in one dimension. A small error in the coding can often lead to some surprisingly visually appealing images:

[Figure: an early one-dimensional emulation attempt, as rendered by the buggy plotting code.]

In this case, it was just the graphical representation part of the code that contained the bug. The underlying data shows the match between the full computational solution to the equations I am working with in blue and the emulated data in red. The match, although not entirely obvious from this picture, is surprisingly good, especially as we have a great deal of work to do in order to optimise the hyper-parameters of the Gaussian process.

AI in Science

Today, I’m attending the launch of the Engineering and Physical Science Research Council’s network on Artificial and Augmented Intelligence for Automated Investigation for Science Discovery. Some great talks already; one of particular note was by Gábor Csányi, who has done some fascinating work on training machine learning algorithms to learn atomistic force fields for use in molecular dynamics simulations. In MD, we usually represent atoms, or even groups of atoms, as a single object, so that we don’t worry about the electrons and the nucleus. Newton’s laws of motion are then used to predict how the atoms move in response to other atoms. The tricky part is to figure out the potential that dictates how atoms interact with each other. This is determined by quantum mechanical principles, and it requires us to approximate the result of a full quantum mechanical calculation, one that includes all the electrons, by some simpler function. A favourite is the Lennard-Jones potential, which has a short-range repulsion and a long-range attraction. In many cases, though, summarising a complex quantum mechanical prediction into such a potential may be too crude.
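For anyone who has not met it, the Lennard-Jones form is simple enough to write down in a couple of lines; the parameter values below are arbitrary illustrative numbers in reduced units.

```python
import numpy as np

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair potential: steep short-range repulsion from the
    r**-12 term, weaker long-range attraction from the r**-6 term.

    epsilon sets the depth of the attractive well and sigma the separation
    at which the potential crosses zero (arbitrary reduced units here)."""
    return 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)

# The minimum (the most favourable separation) sits at r = 2**(1/6) * sigma,
# where the potential equals -epsilon.
r = np.linspace(0.95, 3.0, 6)
print(np.round(lennard_jones(r), 3))
print("well depth:", lennard_jones(2.0 ** (1.0 / 6.0)))
```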

Since in a molecular dynamics simulation we don’t care too much about the precise functional form of the potential, an alternative is to use something more flexible than a Lennard-Jones, which is where the Gaussian processes that I’ve been discussing come into play. Essentially, Gábor and his team train GPs on quantum mechanical data to find a GP representation of the potential between atoms. The really nice aspect of their work is that they use physics to constrain the possible predictions that the GP can make. One example is the requirement that the simulations are rotationally invariant, so that the properties of the substance do not depend on how we look at the sample.

Optimising models

In my previous blog, I discussed how to use an intuitive method, Leave One Out Cross Validation, for determining what parameters work best for a given machine learning algorithm. When I wrote that blog, I was surprised to find that it did not seem to work very well for finding an optimum fit. Something I have learnt over many years of working with numerical analysis is that if the results are surprising, it is well worth checking that they are correct! I’ll now revisit the LOOCV technique and present a much more satisfying outcome. Recall that in this method we train on all but one data point and then compare the trained model’s prediction for the stress with the observed stress at the point left out. We then optimise by minimising the deviation between the two, after repeating the LOOCV process with every data point left out in turn. If we vary the length and variance hyper-parameters in the GP, we obtain a contour map of the model likelihood which enables us to identify the best parameters. This approach to optimisation, iterating over a regularly spaced grid of length and variance values, is not very efficient, but it serves to illustrate the process.
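For those who want to see the brute-force grid spelt out, here is a scikit-learn sketch of the idea (not the code used for the figures below). The stress/extension arrays are placeholders, the LOOCV score is taken as the mean squared deviation, and the kernel hyper-parameters are held fixed at each grid point so that the GP does not silently re-optimise them.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.model_selection import LeaveOneOut

# Placeholder stress/extension data standing in for the measurements.
extension = np.linspace(1.0, 3.0, 10).reshape(-1, 1)
stress = (extension - 1.0 / extension ** 2).ravel()

def loocv_error(length, variance):
    """Mean squared leave-one-out error for fixed GP hyper-parameters."""
    kernel = (ConstantKernel(variance, constant_value_bounds="fixed")
              * RBF(length, length_scale_bounds="fixed"))
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(extension):
        gp = GaussianProcessRegressor(kernel=kernel, optimizer=None)
        gp.fit(extension[train_idx], stress[train_idx])
        prediction = gp.predict(extension[test_idx])
        errors.append((prediction[0] - stress[test_idx][0]) ** 2)
    return np.mean(errors)

# Brute-force scan over a regularly spaced grid of the two hyper-parameters.
lengths = np.linspace(0.5, 4.0, 8)
variances = np.linspace(0.5, 4.0, 8)
scores = np.array([[loocv_error(l, v) for v in variances] for l in lengths])
best = np.unravel_index(np.argmin(scores), scores.shape)
print("best length:", lengths[best[0]], "best variance:", variances[best[1]])
```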

The dark red patch in the left figure shows where the fit is best and the figure on the right shows the corresponding fit to the data, including the confidence interval, shaded in grey.

An alternative way to determine how good a model is is to use the log likelihood, based solely on the probability of the predictions. This gives rise to two competing terms: one measures the model complexity, the other the quality of the data fit. Below is the map of the likelihood as I vary the length scale and variance in my Gaussian process for my stress/extension data.
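For completeness, the two terms can be seen explicitly in the standard expression for the log marginal likelihood of a zero-mean GP (the textbook form, not anything specific to my code):

$$ \log p(\mathbf{y} \mid X) = -\tfrac{1}{2}\, \mathbf{y}^{T} K^{-1} \mathbf{y} \;-\; \tfrac{1}{2} \log \lvert K \rvert \;-\; \tfrac{n}{2} \log 2\pi , $$

where the first term rewards fitting the data, the second penalises model complexity through the determinant of the covariance matrix K, and the last is just a normalisation constant for n data points.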

Although the contour plot looks different, the maximum likelihood is located in a similar, but not identical, place. Since the two methods are statistically different ways of determining the best fit, it is not too surprising that the exact results differ; in practice, however, we would find it difficult to distinguish between the predictions of the two validation techniques.

For comparison, the optimised hyper-parameters are:

Hyper-parameter    LOOCV    Log likelihood
length             2        1.8
variance           2.4      2.7

Likelihoods and complexity

If we are going to apply machine learning to science, then clearly we need a way of quantifying how good we think our predictions are. Even before we reach that stage, though, we need a way to decide which model is best and, for any given model, which settings, or hyper-parameters, provide the most believable predictions. This was illustrated in my previous blog in the figure showing how the predictions for the stress/extension relation vary as we vary the length-scale hyper-parameter. For all length scales the curve passes exactly through the data, so how do we decide which is best?

There are a number of approaches to this; I’ll discuss just two, which are quite different and provide a nice reminder that, as powerful as machine learning is, it has its quirks. It is worth noting that these validation methods are applicable regardless of the particular form of machine learning, so they are as valuable for checking neural network predictions as they are for Gaussian processes.

First of all, in this blog I’ll discuss the conceptually simpler approach, cross-validation. In essence, we split the data into two sets: one is used to train the algorithm, the other is used to measure how good the predictions are. Since there are many different ways of splitting a data set into a training and a validation set, I’ll discuss what is known as Leave One Out cross validation (LOOCV). This is a dry but very descriptive name! All the data points but one are used to train the algorithm, and the trained algorithm is then asked to predict the output at the point that has been left out; the closeness of the prediction to the observation is a measure of how good the fit is. This can be repeated so that every data point is left out in turn, with the overall measure of fit quality being just an average over all of the LOOCV attempts.

We can use this to help guide the choice of hyper-parameters by repeating the process for a range of length scales and variances, introduced in a previous blog. We then look for the pair of values which give the best averaged match …

Since I wrote what comes next, I have discovered a bug in my code. My next blog will correct this, but since this is a blog and not an article for peer-review, I will preserve my mistakes below for posterity!

… , which sounds simple, but gives rise to two challenges.

The first of which is that searching for the best match means finding a maximum point on a two-dimensional surface (or higher as we add more hyper-parameters). Finding a maximum on a surface that is likely to be quite complex is difficult! It is easy to find a local maximum, but much more challenging to find the global one. This is a problem that exists in a multitude of numerical calculations, and has been known about for a long time, so although it isn’t easy, there are lots of established approaches to help.

The second challenge is best illustrated in a contour plot of the log likelihood for my stress-extension data as I vary the two hyper-parameters:

[Figure: contour map of the LOOCV log likelihood for the stress-extension data as the two hyper-parameters are varied.]

The colour scale is such that bright yellow represents the highest values whilst dark blue represents the lowest. What you might be able to see is that there is no maximum! Even when I increase the length scale to much higher values, the log likelihood continues to increase. So according to this method, LOOCV predicts that an infinite length scale is optimal. Apparently this is a common problem with the cross-validation approach, although I have not yet found an explanation as to why. In the next blog, I’ll discuss a different approach to finding the optimum values for the hyper-parameters, which is less intuitive, but appears to be more robust.

Moving beyond one data point

In this blog, I’ll look at how a Gaussian Process builds up predictions when we have more than one data point to learn from. Below are the predictions for two and three data points. As before I’ve assumed that the data is free from noise. The figures below explore what happens when we vary the length scale hyper parameter used in the squared exponential kernel for two and three data points.

[Slideshow: predictions from two training points as the length scale is varied.]

[Slideshow: predictions from three training points as the length scale is varied.]

Previously I introduced the covariance function which relates how much the prediction of the value of a function at one point, the test point, is influenced by its value at another point, the training point. If we have multiple training points, then this idea is extended so that the mean prediction of the function, y* , at x*, is given by

$$ y^{*} = \sum_{i} \left(K^{-1}\mathbf{y}\right)_{i} \, k_{i*} $$

Apologies if this looks a little daunting! The summation is over all of the training data points: the subscript i runs from 1 to the number of training points, so the summation adds up the contribution, or influence if you like, that the measured value at each point has on the prediction. This influence depends on what we have learnt about how much the training data points are related to each other, which is encoded in K, and on how far our test point is from each of the training points. Let’s explore in more detail what each of the factors corresponds to.

K⁻¹ is the inverse of the matrix K that contains the covariances between the training points. In other words, the element in the mth row and nth column of the matrix is given by our choice of the squared exponential function to measure covariance, which I introduced in my previous blog:

$$ K_{mn} = \exp\left(-\frac{(x_m - x_n)^2}{2l^2}\right) $$

As a consequence the diagonal elements of the matrix are equal to one.

y is a column vector in which each element corresponds to one of the measured values of our training data set.

The product K⁻¹y is also a vector, and the subscript i above means the ith component of that vector.

The final factor corresponds to the covariance between each of the training points and the test point, so it effectively measures how much influence each training point has on the predicted value of the function at the test point. As would be expected, the further x* is from x_i the less influence it has, with the extent of that influence also set by the length-scale hyper-parameter l:

$$ k_{i*} = \exp\left(-\frac{(x_i - x^{*})^2}{2l^2}\right) $$

I find it difficult to interpret the formula for y* intuitively, even though I follow the mathematical manipulations that lead to it. I can however rewrite it as

$$ y^{*} = \mathbf{y}^{T} K^{-1} \mathbf{k}_{*} $$

Here yᵀ is now a row vector, but still with each element corresponding to one of the measured values of our training data set; technically it is the transpose of the column vector y, hence the superscript T. k* is a column vector with each element corresponding to one of the covariance relations between the training data and the test point; in other words, it contains each of the k_i*.

What we can now see, with the relation in this form, is that if x* happens to coincide with one of our training points, x_j say, then the column vector k* is identical to the jth column of the matrix K, so we can write

$$ \mathbf{k}_{*} = K \boldsymbol{\delta}_{j} $$

δ_j is a column vector in which all the elements except the jth are zero, and the jth element is equal to one. As a consequence

$$ y^{*} = \mathbf{y}^{T} K^{-1} K \boldsymbol{\delta}_{j} = \mathbf{y}^{T} \boldsymbol{\delta}_{j} = y_{j} $$

In words then, the predicted value is identical to the measured value, when x* coincides with one of our training points. So even if we can’t intuitively interpret the prediction when we have many training points, we can at least be assured that it satisfies one of the requirements that we placed on our GP if the data is noiseless!
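A few lines of numpy make the matrix form concrete and confirm that the prediction reproduces the training values. The numbers are arbitrary illustrative data; note that the linear system is solved directly rather than by forming the inverse of K explicitly, which is the numerically sensible route.

```python
import numpy as np

def squared_exponential(xa, xb, length=1.0):
    """Covariance between points under the squared exponential kernel."""
    return np.exp(-(xa - xb) ** 2 / (2.0 * length ** 2))

# Training data: a handful of (x, y) pairs, assumed noise-free.
x_train = np.array([1.0, 1.5, 2.2, 3.0])
y_train = np.array([0.2, 0.9, 1.4, 1.1])

# K holds the covariances between every pair of training points.
K = squared_exponential(x_train[:, None], x_train[None, :])

def predict_mean(x_star):
    """Mean prediction y* = y^T K^{-1} k*, written as a linear solve."""
    k_star = squared_exponential(x_train, x_star)
    return y_train @ np.linalg.solve(K, k_star)

# A test point that coincides with a training point returns the measured
# value exactly (up to rounding); other points are genuine predictions.
print(predict_mean(2.2))
print(predict_mean(2.6))
```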

One of the challenges of machine learning is the calculation of the inverse of the matrix K. Its size depends on the number of data points we use for training, which we would like to be as large as possible. The problem, the curse of dimensionality, is worsened when our interest is in data measured at not one but many independent variables. For example, we might measure the stress-extension relationship as the extension, the degree of cross-linking and the temperature all vary. Imagine we take measurements at ten different values of each: that’s 10³ data points. Each data point requires a covariance measure with every other data point, which means the computer needs to invert a matrix that is 1000 elements by 1000 elements, no trivial task. Much of the effort of those developing algorithms for machine learning is focussed on how to tackle this problem and how to reduce the dimensionality so that problems become computationally tractable.

The next question we will need to consider is how to optimise the model. In other words, how do we choose our hyper-parameters in order to enable predictions? As can be seen from the figures, different length scales give very different predictions, so which one is the best? In order to answer that question we need to define what we mean by best. Recall that for linear regression models, where we are optimising fitting parameters, we typically use the method of least squares, which effectively minimises the distance between the data and the fit taking into account all the data points. Our GP, though, has been chosen to ensure that the predictions pass through all the data points, so we will need something a little different.

Understanding machine learning with just one data point

There are plenty of good books on the mathematics behind machine learning, so I want to present it a little differently. What I’m going to do is describe the mechanics of machine learning, specifically Gaussian processes, starting from just one data point. From a machine learning, or even a more traditional data-fitting, perspective this might seem strange: clearly we can learn nothing of value from one data point. What it will allow us to do, though, is keep the mathematics at introductory undergraduate level and, I believe, illustrate the underlying mechanics of machine learning. We also won’t be able to look at optimisation: in the same way that there are an infinite number of equally possible straight lines that can be fitted through one data point, there are an infinite number of equally possible predictions that a machine learning algorithm can make. Hopefully, though, this way of gently easing ourselves into machine learning will provide some additional insight beyond the standard approaches written for those more used to thinking in terms of multidimensional matrix algebra.

A Gaussian process addresses the question of how to maximise the flexibility of a fit without having an infinite number of parameters, or at least as many parameters as data points. One way to think about it is that the idea is to be able to learn and predict behaviour for any possible, unknown, mathematical description that could fit the data.

There are two key assumptions in standard GPs. The first is that the value of a function, or of a measurement, taken at some value of our independent variable will be close to the value of the measurement if we take it at a nearby value of the independent variable. In other words, we assume that what we are measuring varies smoothly with the independent variable. The second is that, if we believe the data is free from noise and we measure y = f(x) at some point x, then we want our machine learning algorithm to return exactly y if we input the value x. This may seem trivial, but what we do not want to do is simply look up whether we have taken a measurement at x previously and return the value of that measurement.

For the first assumption, I believe that GPs more sophisticated than those my discussion will cover are able to deal with discontinuities in data, such as are found when we measure some property of a material as it undergoes a transition. For the second assumption, a belief that the data is noisy can be built into a GP; I’ll discuss how that is done in a later blog.

In my previous blog I introduced the joint Gaussian distribution function for the probability of independent events. This can be extended to the case of events that are not independent but related. The reason this is of interest is because of the first assumption I mentioned above: if two measurements taken close to each other are expected to be closely related, then independence is no longer a valid assumption. To account for this whilst retaining a Gaussian structure we introduce a function, called the covariance function, that measures how much knowledge of y, measured at x, determines the likely value of y*, measured at x*. The covariance function is often referred to as the kernel. In the language of machine learning, x and the corresponding y(x) represent our training data. The covariance function depends on the distance between the training point, x, and the point at which we want to predict the outcome, x*. One of the most commonly used is the squared exponential:

$$ k(x, x^{*}) = \sigma_f^{2} \exp\left(-\frac{(x - x^{*})^2}{2l^2}\right) $$

The parameter, l, is known as a length (even though generally it will not have dimensions of length, since it must have the same dimensions as the independent variable x). It controls how much influence the data measured at x has on the predicted value at x*. The other parameter, σ_f, is a variance and measures confidence in our prediction. These two parameters characterise relationships between data rather than directly characterising the data itself. For this reason, the parameters are often referred to as hyper-parameters.

You might have noticed that the above function has the same structure as the Gaussian function, which is potentially confusing: this is not the reason that we refer to this technique as a Gaussian process! The name Gaussian process refers to the joint probability distribution, the form of which is always Gaussian, which I’ll cover in an upcoming blog. Other covariance functions which do not have the structure of a Gaussian can also be used; I’ll also discuss those in a future blog.

With the covariance function defined, the most probable value of y* then depends on it in a very simple way when we only have the one training point:

$$ y^{*} = y \, \exp\left(-\frac{(x - x^{*})^2}{2l^2}\right) $$

We see immediately that this ensures that if our query coincides with the training point, i.e., x* = x, then y* = y, which was one of our requirements. You can also see that the mean predicted value does not depend on the variance, which makes sense. The variance tells us about our confidence in the prediction, not the prediction itself. We’ll return to the variance later: one of the most powerful features of GP machine learning compared to other techniques is that it not only predicts the most likely value but also provides a measure of our confidence in that value. This is particularly valuable when we are deciding where we need to undertake more measurements in order to improve the predictive capability.
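This one-training-point prediction also fits in a couple of lines of numpy; the single (extension, stress) pair and the length scales below are arbitrary illustrative values, not real data.

```python
import numpy as np

# One training point: an arbitrary illustrative (extension, stress) pair.
x_train, y_train = 2.0, 1.5

def predict(x_star, length):
    """Mean prediction from a single noise-free training point:
    y* = y * exp(-(x - x*)**2 / (2 * l**2))."""
    return y_train * np.exp(-(x_train - x_star) ** 2 / (2.0 * length ** 2))

extensions = np.linspace(1.0, 5.0, 9)
for length in (10.0, 1.0, 0.1):
    # A long length scale keeps the prediction close to the training value
    # everywhere; a short one lets it decay quickly towards zero far from
    # the data.
    print(f"l = {length:4}:", np.round(predict(extensions, length), 2))
```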

For those of you who prefer visualisation to equations, below are three different predictions based on the squared exponential kernel for the stress-extension relation when my training data corresponds to just one data point. In each I’ve used a fixed value of 1 for the variance, and varied the length.

 

[Slideshow: predictions from a single training point for length scales of 10, 1 and 0.1.]

The results are quite intuitive. For l = 10 extension units, we are saying that the influence of our one data point extends a long way either side, so the stress is predicted to be more or less constant. For l = 1, we see it falls away more quickly and for l = 0.1, more quickly still. You will notice that in the latter two cases, the prediction falls away to zero stress, which might seem a little strange, but this brings us to an important concept in GPs, and machine learning in general.

We could, if we wished, include a belief that the stress is non-zero for all extensions greater than 1. In the absence of data, we might believe that the relation between stress and extension is of the form

[Equation: the stress-extension relation discussed in an earlier blog.]

as I discussed previously. (At this point I should apologise for my symbol choice: the σ in the above refers to stress and has nothing to do with the variance, σ_f.)

Such a belief is known as a prior. So why haven’t I included it? It turns out that once we have many data points on which to train our GP, the prediction for both the mean and the variance of y*, which is called the posterior, becomes insensitive to any prior belief. In machine learning language, the data overwhelms the prior. For this reason, it is common to set the prior belief equal to zero and let the data do the talking.

Of course, the predictive value of a GP really depends on having many more data points than one, so the fact that the predictions here are obviously very poor is just a consequence of the machine learning algorithm having very little from which to learn.