# Learning new physics with AI

It has been quite a while since I last posted. I was reminded to write something by a nice remark about the blog from a colleague at Microsoft whom I met through an Engineering and Physical Sciences Research Council event. He also pointed me to a fascinating blog by Neil Dalchau at Microsoft on how to use machine learning to pin down parameters in models that can only be solved numerically. I was very excited by this because, as it happens, this has become a focus of our own research.

Before I discuss that in a future blog though, I want to share what I think is an exciting example of where machine learning can help us learn more from existing data than we previously thought possible.

First, recall that I’m fascinated by the physics behind structural evolution that leads to images like these:

Ultimately we want to know how a structure relates to some property of the material, which might be its strength or how it absorbs light, depending on the intended application. Ideally, we want a succinct way to characterise these structures. A very powerful experimental approach involves shining a beam of some sort (neutrons are particularly good if the mixture consists of different polymers, but X-rays or light can also be used), and observing how the beam is affected as it exits the material. A significant piece of information we obtain is a length scale, which tells us something about the size of the structures inside our material. The problem is that the length scale is not unique to a given structure. As an example, the two structures above have the same length scale according to a scattering experiment, as shown in the figure below, but just by looking at the images above we can see that they are different. On the left, the structure corresponds to droplets of one type of polymer in a matrix of another, whilst we refer to the structure on the right as co-continuous.

By borrowing tools from cosmology, scientists in the 1990s proposed some extra measures that can characterise the structure. These characteristics, known as Minkowski functionals, are volume, surface area, curvature and connectivity. Volume represents the relative volume occupied by each of the two phases. Surface area is the total contact area between the two phases. Curvature is a measure of how curved the structures are: small spheres have a much higher curvature than large spheres. Connectivity, also called the Euler measure, tells us how easy it is to trace a path from one part of the structure that is occupied by one of the polymers to another that is occupied by the same polymer without having to pass through regions occupied by the other polymer. A droplet structure has low connectivity: if you are inside one droplet you cannot trace a path to another droplet without passing through the matrix of the other polymer. An ideal co-continuous structure will have perfect connectivity: you can reach anywhere that is also occupied by the same polymer without having to pass through a region occupied by the other polymer.
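To make these measures concrete, here is a minimal sketch of their two-dimensional analogues (area fraction, perimeter and Euler characteristic) computed for a binary image. It is an illustration only, not the calculation used in our work: the function name `minkowski_2d` and the two tiny test images are made up for this example, and each pixel is assumed to be a closed unit square, so diagonal neighbours count as connected.

```python
import numpy as np

def minkowski_2d(image):
    """Return (area fraction, perimeter, Euler characteristic) of a 2D binary image.

    True pixels belong to the phase of interest and are treated as closed
    unit squares (an assumption of this sketch).
    """
    phase = np.asarray(image, dtype=bool)
    p = np.pad(phase, 1, constant_values=False)  # empty border simplifies counting

    # Area: number of phase pixels (faces), reported as a fraction of the image.
    faces = int(phase.sum())
    area_fraction = faces / phase.size

    # Perimeter: pixel edges that separate a phase pixel from a non-phase pixel.
    perimeter = int((p[:-1, :] != p[1:, :]).sum() + (p[:, :-1] != p[:, 1:]).sum())

    # Euler characteristic: vertices - edges + faces of the pixel complex.
    edges = int((p[:-1, :] | p[1:, :]).sum() + (p[:, :-1] | p[:, 1:]).sum())
    vertices = int((p[:-1, :-1] | p[:-1, 1:] | p[1:, :-1] | p[1:, 1:]).sum())
    euler = vertices - edges + faces
    return area_fraction, perimeter, euler

# Two isolated "droplets" (hypothetical toy image).
droplets = np.zeros((7, 7), dtype=bool)
droplets[1:3, 1:3] = True
droplets[4:6, 4:6] = True

# A single connected ring enclosing a hole (hypothetical toy image).
ring = np.zeros((5, 5), dtype=bool)
ring[1:4, 1:4] = True
ring[2, 2] = False

print("droplets:", minkowski_2d(droplets))
print("ring:    ", minkowski_2d(ring))
```

In two dimensions the Euler characteristic counts connected pieces minus enclosed holes, so the two droplets give 2 and the ring gives 0, even though both images contain the same number of phase pixels and have the same perimeter.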

The challenge with experimentally determining Minkowski functionals is that it remains difficult to obtain three-dimensional images of internal structures. Confocal microscopy is one tool for doing this, but it is limited to structures of the order of microns, whilst many mixtures have structures that are much smaller. There have been amazing developments in X-ray and neutron tomography which do enable such images to be obtained, but these techniques are still much more time-consuming and costly than scattering experiments.

It would be really nice if we could learn more from the scattering than just the length scale. The question we asked ourselves was: is there a relation between the scattering and the Minkowski functionals? A strong motivation for this is that we know the volume can be obtained from the scattering through a reasonably simple mathematical relation, so perhaps the other functionals are also hidden in the scattering data.
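The post does not spell out which relation is meant. One standard candidate, assumed here rather than taken from the post, is the scattering invariant for an ideal two-phase system with sharp interfaces, where $\phi$ is the volume fraction of one phase, $\Delta\rho$ is the scattering contrast between the two phases and $I(q)$ is the scattered intensity in absolute units:

$$
Q = \int_0^\infty q^2 \, I(q) \, \mathrm{d}q = 2\pi^2 \, (\Delta\rho)^2 \, \phi \, (1 - \phi)
$$

Because $Q$ depends on the structure only through $\phi(1-\phi)$, the volume fraction can be read off from the integrated scattering without knowing anything else about the morphology.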

To answer this question we trained some machine learning algorithms on pairs of scattering data and Minkowski functionals, both calculated from the same microstructure image. We did this for approximately 200 pairs with different microstructures. We then tested the algorithms by asking them to predict the Minkowski functionals for 200 scattering data sets that they had not previously seen, and compared the ML predictions with the direct calculations. The results were considerably more impressive than I anticipated!
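The train-then-test workflow can be sketched as follows. Everything in this example is a stand-in: the curves are made-up Gaussian peaks rather than real scattering data, the two targets are simply the peak position and width rather than Minkowski functionals, and kernel ridge regression is just one convenient algorithm, not necessarily the one we used. Only the shape of the procedure (train on 200 pairs, predict for 200 unseen curves, compare) mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.linspace(0.0, 2.5, 50)  # hypothetical scattering-vector grid

def synthetic_pairs(n):
    """Made-up (curve, target) pairs; NOT real scattering data or functionals."""
    peak = rng.uniform(0.5, 2.0, size=n)
    width = rng.uniform(0.1, 0.4, size=n)
    curves = np.exp(-0.5 * ((q[None, :] - peak[:, None]) / width[:, None]) ** 2)
    targets = np.column_stack([peak, width])
    return curves, targets

def rbf(a, b, length):
    """Gaussian similarity between every curve in a and every curve in b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length**2)

X_train, Y_train = synthetic_pairs(200)  # pairs used for training
X_test, Y_test = synthetic_pairs(200)    # pairs the model has never seen

# Kernel ridge regression on mean-centred targets.
Y_mean = Y_train.mean(axis=0)
K = rbf(X_train, X_train, length=2.0)
weights = np.linalg.solve(K + 1e-3 * np.eye(len(K)), Y_train - Y_mean)
Y_pred = rbf(X_test, X_train, length=2.0) @ weights + Y_mean

# Compare predictions with the directly known targets on the unseen set.
rmse = np.sqrt(((Y_pred - Y_test) ** 2).mean(axis=0))
baseline = np.sqrt(((Y_mean - Y_test) ** 2).mean(axis=0))
print("held-out RMSE per target:", rmse)
print("RMSE of always predicting the training mean:", baseline)
```

Comparing against the trivial "predict the mean" baseline is a quick sanity check that the model has learnt something from the curves rather than just the average of the targets.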

We’ve since refined our algorithm further and the early indications are that we can do even better at the extreme values, which is a reminder that whilst machine learning is a black box technique, how the question is framed and how the data is prepared remain crucial to success.

# Optimising models

In my previous blog, I discussed how to use an intuitive method, Leave One Out Cross Validation (LOOCV), for determining which parameters work best for a given machine learning algorithm. When I wrote that blog, I was surprised to find that it did not seem to work very well for finding an optimum fit. Something I have learnt over many years of working with numerical analysis is that if the results are surprising, it is well worth checking that they are correct! I’ll now revisit the LOOCV technique and present a much more satisfying outcome. Recall that in this method we train on all but one data point and then compare the trained model’s prediction for the stress with the observed stress at the point left out. We repeat this, leaving out every data point in turn, and then optimise by minimising the overall deviation between predictions and observations. If we vary the length and variance hyper-parameters in the Gaussian process (GP), we obtain a contour map of the model likelihood which enables us to identify the best parameters. This approach to optimisation, iterating over a regularly spaced grid of length and variance values, is not very efficient, but serves to illustrate the process.
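A minimal sketch of this grid search is given below. Several things are assumptions made for illustration and are not taken from my actual code: the squared-exponential kernel, the small `noise` term, the use of the summed log predictive probability as the LOOCV score, the grid ranges, and the placeholder stress/extension values (which are not my measured data).

```python
import numpy as np

def sq_exp_kernel(xa, xb, length, variance):
    """Squared-exponential covariance between two sets of 1D inputs."""
    diff = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (diff / length) ** 2)

def gp_predict(x_train, y_train, x_new, length, variance, noise=1e-3):
    """Zero-mean GP predictive mean and variance at x_new (noise is assumed)."""
    K = sq_exp_kernel(x_train, x_train, length, variance) + noise * np.eye(len(x_train))
    k_star = sq_exp_kernel(x_train, x_new, length, variance)
    mean = k_star.T @ np.linalg.solve(K, y_train)
    var = variance + noise - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, var

def loocv_log_prob(x, y, length, variance):
    """Sum of log predictive probabilities, leaving out each point in turn."""
    total = 0.0
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        mean, var = gp_predict(x[keep], y[keep], x[i:i + 1], length, variance)
        total += -0.5 * np.log(2 * np.pi * var[0]) - 0.5 * (y[i] - mean[0]) ** 2 / var[0]
    return total

# Placeholder extension/stress values, NOT the measured data from the post.
x = np.linspace(1.0, 3.0, 8)
y = x - 1.0 / x**2

# Regularly spaced grid of hyper-parameters (ranges chosen arbitrarily).
lengths = np.linspace(0.2, 3.0, 15)
variances = np.linspace(0.5, 5.0, 10)
scores = np.array([[loocv_log_prob(x, y, l, v) for v in variances] for l in lengths])

i_best, j_best = np.unravel_index(np.argmax(scores), scores.shape)
best_length, best_variance = lengths[i_best], variances[j_best]
print("best length:", best_length, "best variance:", best_variance)
```

The `scores` array is what would be drawn as the contour map. Scoring with the predictive probability rather than the squared error alone is what lets the variance hyper-parameter influence the result, since the variance mainly controls how wide the predicted confidence interval is.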

The dark red patch in the left figure shows where the fit is best and the figure on the right shows the corresponding fit to the data, including the confidence interval, shaded in grey.

An alternative way to determine how good a model is uses the log likelihood, based solely on the probability of the predictions. This gives rise to two competing terms: one measures the model complexity, the other the quality of the data fit. Below is the map of likelihood as I vary the length scale and variance in my Gaussian process for my stress/extension data.
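For a Gaussian process the two competing terms can be written down explicitly: the log marginal likelihood is the sum of a data-fit term, -½ yᵀK⁻¹y, a complexity penalty, -½ log|K|, and a constant, -(n/2) log 2π. The sketch below evaluates each term separately. The squared-exponential kernel, the `noise` value and the placeholder data are assumptions for illustration, not the settings or measurements behind the figure.

```python
import numpy as np

def log_marginal_likelihood(x, y, length, variance, noise=1e-3):
    """Return (total, data_fit, complexity) for a zero-mean GP.

    Uses a squared-exponential kernel plus a small noise term (both assumed).
    """
    n = len(x)
    diff = x[:, None] - x[None, :]
    K = variance * np.exp(-0.5 * (diff / length) ** 2) + noise * np.eye(n)

    L = np.linalg.cholesky(K)                        # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    data_fit = -0.5 * y @ alpha                      # -1/2 y^T K^-1 y
    complexity = -np.sum(np.log(np.diag(L)))         # -1/2 log|K|
    constant = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + constant, data_fit, complexity

# Placeholder extension/stress values, NOT the measured data from the post.
x = np.linspace(1.0, 3.0, 8)
y = x - 1.0 / x**2

for length in (0.3, 1.0, 2.0):
    total, fit, penalty = log_marginal_likelihood(x, y, length, variance=2.0)
    print(f"length={length}: total={total:.2f}, data fit={fit:.2f}, complexity={penalty:.2f}")
```

Printing the two terms side by side shows the trade-off directly: a hyper-parameter change that improves one term typically worsens the other, and the maximum of the total marks the balance between them.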

Although the contour plot looks different, the maximum likelihood is located in a similar, but not identical, place. Since the two methods are statistically different ways of determining the best fit, it is not too surprising that the exact results differ. However, we would find it difficult to distinguish between the predictions of the two validation techniques.

For comparison, the optimised hyper-parameters are:

| Hyper-parameter | LOOCV | Log likelihood |
| --- | --- | --- |
| length | 2 | 1.8 |
| variance | 2.4 | 2.7 |

# Likelihoods and complexity

If we are going to apply machine learning to science, then clearly we need a way of quantifying how good we think our predictions are. Even before we reach that stage, though, we need a way to decide which model is best and, for any given model, which settings, or hyper-parameters, provide the most believable predictions. This was illustrated in my previous blog in the figure showing how the predictions for the stress/extension relation vary as we vary the length scale hyper-parameter. For all length scales the curve passes exactly through the data, so how do we decide which is best?

There are a number of approaches to this. I’ll discuss just two, which are quite different and provide a nice reminder that, as powerful as machine learning is, it has quirks. It is worth noting that the methods for validation are applicable regardless of the particular form of machine learning, so they are as valuable for checking neural network predictions as they are for Gaussian processes.

First of all, in this blog I’ll discuss the conceptually simpler approach, cross-validation. In essence, we split the data into two sets: one is used to train the algorithm, the other is used to measure how good the predictions are. Since there are many different ways of splitting a data set into a training and a validation set, I’ll discuss what is known as Leave One Out cross validation (LOOCV). This is a dry but very descriptive name! All the data points but one are used to train the algorithm, and then the trained algorithm is asked to predict the output at the point that has been left out. The closeness of the prediction to the observation is a measure of how good the fit is. This can be repeated so that every data point is left out in turn, with the overall measure of how good the fit is being just an average over all of the LOOCV attempts.
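Because the procedure only needs a way to train a model and a way to ask it for a prediction, it can be written without reference to any particular algorithm. The sketch below does that, using polynomial fits as a deliberately simple stand-in model. The function names and the toy data are made up for this illustration, and the mean squared error is assumed as the measure of closeness.

```python
import numpy as np

def loocv_score(x, y, fit, predict):
    """Average squared error over all leave-one-out attempts.

    fit(x, y) returns a trained model; predict(model, x_new) returns a prediction.
    """
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i           # every point except point i
        model = fit(x[keep], y[keep])           # train without point i
        errors.append((predict(model, x[i]) - y[i]) ** 2)
    return float(np.mean(errors))

def poly_fit(degree):
    """Build a fit function for a polynomial of the given degree."""
    return lambda x, y: np.polyfit(x, y, degree)

# Toy data lying exactly on a parabola (hypothetical, for illustration only).
x = np.linspace(-2.0, 2.0, 7)
y = x**2 + 1.0

straight_line = loocv_score(x, y, poly_fit(1), np.polyval)
quadratic = loocv_score(x, y, poly_fit(2), np.polyval)
print("LOOCV error, straight line:", straight_line)
print("LOOCV error, quadratic:    ", quadratic)
```

Swapping `poly_fit` and `np.polyval` for the training and prediction routines of a Gaussian process or a neural network leaves `loocv_score` unchanged, which is the sense in which the validation method is independent of the form of machine learning.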

We can use this to help guide the choice of hyper-parameters by repeating the process for a range of length scales and variances, introduced in a previous blog. We then look for the pair of values which give the best averaged match …

Since I wrote what comes next, I have discovered a bug in my code. My next blog will correct this, but since this is a blog and not an article for peer-review, I will preserve my mistakes below for posterity!

… , which sounds simple, but gives rise to two challenges.

The first is that searching for the best match means finding a maximum point on a two-dimensional surface (or a higher-dimensional one as we add more hyper-parameters). Finding a maximum on a surface that is likely to be quite complex is difficult! It is easy to find a local maximum, but much more challenging to find the global maximum. This is a problem that exists in a multitude of numerical calculations, and has been known about for a long time, so although it isn’t easy, there are lots of established approaches to help.

The second challenge is best illustrated in a contour plot of the log likelihood for my stress-extension data as I vary the two hyper-parameters:

The colour scale is such that bright yellow represents the highest values whilst dark blue represents the lowest. What you might be able to see is that there is no maximum! Even when I increase the length scale to much higher values, the log likelihood continues to increase. So according to this method, LOOCV predicts that an infinite length scale is optimal. Apparently this is a common problem with the cross-validation approach, although I have not yet found an explanation as to why. In the next blog, I’ll discuss a different approach to finding the optimum values for the hyper-parameters, which is less intuitive, but appears to be more robust.