Learning new physics with AI

It has been quite a while since I last posted. I was reminded to write something by a nice remark about the blog from a colleague at Microsoft whom I met through an Engineering and Physical Sciences Research Council event. He also pointed me to a fascinating blog by Neil Dalchau at Microsoft on how to use machine learning to pin down parameters in models that can only be solved numerically. I was very excited by this because, as it happens, this has become a focus of our own research.

Before I discuss that in a future blog though, I want to share what I think is an exciting example of where machine learning can help us learn more from existing data than we previously thought possible.

First, recall that I’m fascinated by the physics behind structural evolution that leads to images like these:

Ultimately we want to know how a structure relates to some property of the material, which might be its strength or how it absorbs light, depending on the intended application. Ideally, we want a succinct way to characterise these structures. A very powerful experimental approach involves shining a beam of some sort (neutrons are particularly good if the mixture is of different polymers, but X-rays or light can also be used), and observing how the beam is affected as it exits the material. A significant piece of information we obtain is a length scale, which tells us something about the size of the structures inside our material. The problem is that the length scale is not unique to a given structure. As an example, the two structures above have the same length scale according to a scattering experiment, as shown in the figure below, but just by looking at the images above we can see that they are different. On the left, the structure corresponds to droplets of one type of polymer in a matrix of another, whilst we refer to the structure on the right as co-continuous.

How neutrons “see” the two structures above. The horizontal axis represents the angle by which some neutrons are scattered by the structures away from the direct beam, most of which passes through the sample without being affected. The peak position tells us the size of the structures inside the material. The further to the right the peak is, the smaller the structures, but in this case the peaks are in the same place, telling us that both structures have the same size even though they are otherwise very different.
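As a rule of thumb (standard in small-angle scattering, and not specific to this particular data), the characteristic size d is estimated from the peak position, expressed as a scattering wavevector q*, via

$$ d \approx \frac{2\pi}{q^{*}} $$

so a peak further to the right (larger q*) corresponds to smaller structures.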

By borrowing tools from cosmology, scientists in the 1990s proposed some extra measures that can characterise the structure. These characteristics, known as Minkowski functionals, are volume, surface area, curvature and connectivity. Volume represents the relative volume occupied by each of the two phases. Surface area is the total contact area between the two phases. Curvature is a measure of how curved the structures are – small spheres have a much higher curvature than large spheres. Connectivity, also called the Euler measure, tells us how easy it is to trace a path from one part of the structure that is occupied by one of the polymers to another part occupied by the same polymer, without having to pass through regions occupied by the other polymer. A droplet structure has low connectivity – if you are inside one droplet you cannot trace a path to another droplet without passing through the matrix of the other polymer. An ideal co-continuous structure has perfect connectivity – you can reach anywhere that is occupied by the same polymer without having to pass through a region occupied by the other polymer.
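To make two of these measures concrete, here is a minimal numpy sketch (my own toy illustration, in two dimensions for simplicity, and not the code used in our work) that estimates the volume fraction and the interfacial length of a binary image of a structure:

```python
import numpy as np

def volume_fraction(phase):
    """Fraction of the image occupied by phase 1 (the volume functional)."""
    return phase.mean()

def interface_length(phase):
    """Count pixel edges where the two phases meet, a simple proxy for the
    surface-area functional in two dimensions."""
    horizontal = np.sum(phase[:, 1:] != phase[:, :-1])
    vertical = np.sum(phase[1:, :] != phase[:-1, :])
    return int(horizontal + vertical)

# Toy example: one circular droplet of phase 1 in a matrix of phase 0.
x, y = np.meshgrid(np.arange(128), np.arange(128))
droplet = ((x - 64) ** 2 + (y - 64) ** 2 < 20 ** 2).astype(int)

print(volume_fraction(droplet))   # roughly pi * 20**2 / 128**2, about 0.077
print(interface_length(droplet))  # of the order of the droplet circumference
```

Curvature and connectivity need a little more bookkeeping (or a dedicated image analysis library), so I have left them out of the sketch.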

The challenge with experimentally determining Minkowski functionals is that it remains difficult to obtain three dimensional images of internal structures. Confocal microscopy is one tool for doing this, but it is limited to structures of the order of microns, whilst many mixtures have structures that are much smaller. There have been amazing developments in X-ray and neutron tomography which do enable the images to be obtained, but it is still much more time consuming and costly compared to scattering experiments.

It would be really nice if we could learn more from the scattering than just the length scale. The question we asked ourselves was: is there a relation between the scattering and the Minkowski functionals? A strong motivation for this is that we know the volume can be obtained from the scattering through a reasonably simple mathematical relation, so perhaps the other functionals are also hidden in the scattering data.

To answer this question we trained some machine learning algorithms on pairs of scattering data and Minkowski functionals, both calculated from the same microstructure image. We did this for approximately 200 different pairs with different microstructures. We then tested the algorithm by asking it to predict the Minkowski functionals for 200 different scattering data sets that it had not previously seen, and compared the ML predictions with the direct calculations. The results were considerably more impressive than I anticipated!

Machine learning predictions of the Minkowski connectivity, or Euler measure, from scattering data, compared with the directly calculated values. The match is excellent apart from at the two extremes of the connectivity values.

We’ve since refined our algorithm further and the early indications are that we can do even better at the extreme values, which is a reminder that whilst machine learning is a black box technique, how the question is framed and how the data is prepared remain crucial to success.
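We haven't published our code, but in spirit the setup looks something like the sketch below, using a Gaussian process regressor from scikit-learn as a stand-in for whichever algorithm one prefers; the array shapes and the random stand-in data are purely illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Stand-in data with illustrative shapes: 400 microstructures, each described
# by a scattering curve sampled at 64 wavevectors and by its four Minkowski
# functionals (volume, surface area, curvature, connectivity).
rng = np.random.default_rng(0)
scattering = rng.random((400, 64))
minkowski = rng.random((400, 4))

# Train on 200 pairs, then test on 200 scattering curves the model has not seen.
X_train, X_test = scattering[:200], scattering[200:]
y_train, y_test = minkowski[:200], minkowski[200:]

model = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
model.fit(X_train, y_train)
predicted = model.predict(X_test)

# Compare the ML prediction of the connectivity with the direct calculation.
print(np.corrcoef(predicted[:, 3], y_test[:, 3])[0, 1])
```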

Machine learning in physics: when and why?

Yesterday, after a talk I gave on our progress in using machine learning to emulate phase separation, I was asked whether I was a convert to the machine learning approach. This led to a discussion about when and why machine learning might have a role to play in physics. This is an important question. The danger of machine learning is that it can be used to predict anything, and its black box nature makes it difficult to interrogate the integrity of the predictions. This is why validation and cross-checking are essential, but before reaching that stage, I think there is a fairly straightforward way to evaluate whether ML may have a role.

ML works well when you have a lot of data but lack a model to describe that data, or when you don't want to be constrained by a particular choice of model. The data might be generated by solving equations or it might be experimental. Next, establish whether you are trying to do interpolation or extrapolation. Extrapolation, trying to predict the output when the input falls outside the range of the training inputs, should be done with considerable caution, and is probably best avoided! As I've written previously, unlike many other machine learning algorithms, Gaussian processes provide not just a prediction but also a measure of the uncertainty of that prediction. This uncertainty typically becomes very large when extrapolating, which is an immediate sign that the prediction should not be taken too seriously.
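A quick way to see this behaviour (a toy sketch, not tied to any of our data) is to fit a GP on a short interval and then query it both inside and well outside that interval; the reported standard deviation grows rapidly once we leave the training range:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Train only on the interval [0, 5].
X_train = np.linspace(0.0, 5.0, 20).reshape(-1, 1)
y_train = np.sin(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X_train, y_train)

# Interpolation (x = 2.5) versus extrapolation (x = 10.0).
mean, std = gp.predict(np.array([[2.5], [10.0]]), return_std=True)
print(std)  # the reported uncertainty at x = 10 dwarfs that at x = 2.5
```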

The Lorentz Center, Leiden

I’ve spent the past week at the wonderful Lorentz Center at the University of Leiden.

It’s great to know that there are still places that really appreciate that science progresses best by providing an environment that encourages discussion and collaboration amongst scientists with different but complementary backgrounds, and, of course, unlimited coffee or tea! I’ve been finding out about the state of the art in applications of machine learning to soft matter. The field is very much in its infancy, so lots of exciting innovations are taking place. Most of the early successes are in using ML to learn force fields for molecular dynamics simulations, as illustrated nicely in a recent paper from one of the presenters:

https://www.nature.com/articles/s41467-018-06169-2

They have figured out how to constrain the ML predictions to be physically realistic, in this case ensuring energy conservation. Such constraints are not only pleasing from a physics viewpoint, they also help with the challenge that the quantity of data we have to work with in physics is often much less than in other applications of machine learning.

Emulating phase separating mixtures

As I’ve discussed previously, phase separation, or de-mixing, has fascinated me for a long time. It is one of the most ubiquitous examples of material self-assembly, occurring frequently in complex fluids and living systems. Many technologically important metallic alloys derive their strength not from the raw components but from how those components are arranged after demixing has taken place. Below are three snapshots from a two dimensional model for phase separation of two polymers, with the concentration of each component colour coded in red and blue. Experimentally, we typically observe these structures with length scales ranging from many tens of nanometres to tens of micrometres.

[Slideshow: three snapshots from the two-dimensional phase separation model]
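I won't reproduce our own code here, but for readers who want a feel for this kind of simulation, below is a minimal sketch of a Cahn-Hilliard-type model, a standard choice for simulating polymer demixing (the equations, parameters and numerical scheme we actually use differ), evolved with a simple explicit finite-difference scheme:

```python
import numpy as np

def laplacian(field):
    """Finite-difference Laplacian on a periodic square grid with spacing 1."""
    return (np.roll(field, 1, axis=0) + np.roll(field, -1, axis=0) +
            np.roll(field, 1, axis=1) + np.roll(field, -1, axis=1) - 4.0 * field)

# Order parameter phi: +1 where one polymer dominates, -1 for the other.
rng = np.random.default_rng(1)
phi = 0.1 * rng.standard_normal((128, 128))  # small random initial fluctuations

dt, kappa = 0.01, 1.0
for step in range(5000):
    # Chemical potential from a double-well free energy plus a gradient penalty.
    mu = phi ** 3 - phi - kappa * laplacian(phi)
    # Conserved dynamics: the mixture demixes while total composition is conserved.
    phi += dt * laplacian(mu)

# phi now shows a coarsening, co-continuous pattern for this 50:50 mixture.
print(phi.min(), phi.max())
```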

 

Polymer blends with structures similar to these have found use in applications ranging from structural aerospace components for the latest generation of aircraft to flexible solar cells. Despite the technological relevance and the fascinating underpinning physics, the prediction of how microstructure evolves has failed to keep pace with synthetic and formulation advances. The interplay of non-equilibrium statistical physics, diffusion, rheology and hydrodynamics gives rise to multiple processes with overlapping time and length scales.

My interest is in whether the data-driven approach of machine learning (ML) provides an alternative empirical paradigm to physics/hypothesis-based modelling, enabling predictions to be made using observations alone, even in the absence of physical knowledge. 

A first step in machine learning is to develop model emulators, also known as surrogate models. This is where we train a machine learning algorithm with data from a computational model, or a simulation, and then use the emulator to explore, more efficiently, what happens as we move around parameter space. 
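A minimal sketch of the emulator idea is given below; it is entirely schematic, and `expensive_simulation` is just a placeholder for whatever slow model is being emulated:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_simulation(parameter):
    """Placeholder for a slow numerical model, e.g. a phase-field solver."""
    return np.sin(3.0 * parameter) + 0.5 * parameter

# Run the full model at a handful of points in parameter space...
params = np.linspace(0.0, 2.0, 12).reshape(-1, 1)
outputs = expensive_simulation(params).ravel()

# ...and train the emulator on those runs.
emulator = GaussianProcessRegressor(kernel=Matern(nu=2.5))
emulator.fit(params, outputs)

# The emulator now predicts the model output anywhere in parameter space almost
# instantly, together with an uncertainty estimate.
mean, std = emulator.predict(np.array([[1.37]]), return_std=True)
print(mean, std)
```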

This is something we are currently working on, and here is an output from an early attempt in one dimension. A small error in the coding can often lead to some surprisingly visually appealing images:

[Figure: output from an early one-dimensional emulation attempt, rendered with a plotting bug]

In this case, it was just the graphical representation part of the code that contained the bug. The underlying data shows the match between the full computational solution to the equations I am working with in blue and the emulated data in red. The match, although not entirely obvious from this picture, is surprisingly good, especially as we have a great deal of work to do in order to optimise the hyper-parameters of the Gaussian process.

AI in Science

Today, I’m attending the launch of the Engineering and Physical Sciences Research Council’s network on Artificial and Augmented Intelligence for Automated Investigation for Science Discovery. Some great talks already; one of particular note was by Gábor Csányi, who has done some fascinating work on training machine learning algorithms to learn atomistic force fields for use in molecular dynamics simulations. In MD, we usually represent atoms, or even groups of atoms, as a single object, so that we don’t have to worry about the electrons and the nucleus. Newton’s laws of motion are then used to predict how the atoms move in response to other atoms. The tricky part is to figure out the potential that dictates how atoms interact with each other. This is determined by quantum mechanical principles, and requires us to approximate the result of a full quantum mechanical calculation that includes all the electrons by some simpler function. A favourite is the Lennard-Jones potential, which has a short range repulsion and a long range attraction. In many cases, though, summarising a complex quantum mechanical prediction in such a simple potential may be too crude.
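For readers who haven't met it, the Lennard-Jones potential for two atoms a distance r apart is usually written as

$$ V(r) = 4\varepsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right] $$

where ε sets the depth of the attractive well and σ the separation at which the potential crosses zero; the r⁻¹² term supplies the short range repulsion and the r⁻⁶ term the long range attraction.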

Since in a molecular dynamics simulation we don’t care too much about the functional form of the potential, an alternative is to use something more flexible than a Lennard-Jones, which is where the Gaussian processes that I’ve been discussing come into play. Essentially, Gábor and his team train GPs on quantum mechanical data to find a GP representation of the potential between atoms. The really nice aspect of their work is that they use physics to constrain the possible predictions that the GP can make. One example is the requirement that the simulations are rotationally invariant, so that the predicted properties of the substance do not depend on how we look at the sample.

Moving beyond one data point

In this blog, I’ll look at how a Gaussian process builds up predictions when we have more than one data point to learn from. Below are the predictions for two and three data points. As before, I’ve assumed that the data is free from noise. The figures below explore what happens when we vary the length scale hyper-parameter used in the squared exponential kernel for two and three data points.

[Slideshow: GP predictions for two training points at several length scales]

[Slideshow: GP predictions for three training points at several length scales]

Previously I introduced the covariance function, which describes how much the prediction of the value of a function at one point, the test point, is influenced by its value at another point, the training point. If we have multiple training points, then this idea is extended so that the mean prediction of the function, y*, at x*, is given by

$$ y^{*} = \sum_{i=1}^{N} \left(K^{-1}\mathbf{y}\right)_{i}\, k_{i*} $$

Apologies if this looks a little daunting! The summation is over all of the training data points, so the subscript i runs from 1 to the number of training points, N, and the summation adds up the contribution, or influence if you like, that the measured value at each training point has on the prediction. This influence depends on what we have learnt about how much the training data points are related to each other, which is encoded in K, and on how far our test point is from each of the training points. Let’s explore in more detail what each of the factors corresponds to.

K⁻¹ is the inverse of the matrix K that contains the covariances between the training points. In other words, the element in the mth row and nth column of the matrix is given by our choice of the squared exponential function to measure covariance, which I introduced in my previous blog:

$$ K_{mn} = \exp\!\left(-\frac{(x_m - x_n)^{2}}{2 l^{2}}\right) $$

As a consequence the diagonal elements of the matrix are equal to one.

y is a column vector in which each element corresponds to one of the measured values of our training data set.

The product K⁻¹y is also a vector, and the subscript i above means the ith component of that vector.

The final factor is the covariance between each of the training points and the test point, so it effectively measures how much influence each training point has on the predicted value of the function at the test point. As we would expect, the further x* is from xᵢ the less influence it has, with the extent of that influence also set by the length scale hyper-parameter l:

$$ k_{i*} = \exp\!\left(-\frac{(x_i - x^{*})^{2}}{2 l^{2}}\right) $$

I find it difficult to interpret the formula for y* intuitively, even though I follow the mathematical manipulations that lead to it. I can however rewrite it as

$$ y^{*} = \mathbf{y}^{T} K^{-1} \mathbf{k}_{*} $$

Here yᵀ is now a row vector, but still with each element corresponding to one of the measured values of our training data set. Technically it is known as the transpose of the column vector y, hence the superscript T. k* is a column vector with each element corresponding to one of the covariance relations between the training data and the test point; in other words, it contains each of the kᵢ* defined above.

What we can now see, with the relation in this form, is that if x* happens to coincide with one of our training points, xⱼ say, then the column vector k* is identical to the jth column of the matrix K, so we can write

$$ \mathbf{k}_{*} = K\,\boldsymbol{\delta}_{j} $$

δⱼ is a column vector in which all the elements except the jth are zero, and the jth element is equal to one. As a consequence,

$$ y^{*} = \mathbf{y}^{T} K^{-1} K\,\boldsymbol{\delta}_{j} = \mathbf{y}^{T}\boldsymbol{\delta}_{j} = y_{j} $$

In words then, the predicted value is identical to the measured value, when x* coincides with one of our training points. So even if we can’t intuitively interpret the prediction when we have many training points, we can at least be assured that it satisfies one of the requirements that we placed on our GP if the data is noiseless!
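To make the algebra above concrete, here is a small numpy sketch (my own illustration, with made-up training values) that builds K from the squared exponential covariance, forms the vector K⁻¹y, and confirms that the prediction y* = yᵀK⁻¹k* reproduces the measured value when x* coincides with a training point:

```python
import numpy as np

LENGTH = 1.0  # the length scale hyper-parameter l

def covariance(xa, xb):
    """Squared exponential covariance between two sets of points."""
    return np.exp(-(xa[:, None] - xb[None, :]) ** 2 / (2.0 * LENGTH ** 2))

# Three noiseless training points (x_i, y_i); the values are made up.
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([0.5, 1.3, 0.9])

K = covariance(x_train, x_train)      # covariances between the training points
alpha = np.linalg.solve(K, y_train)   # the vector K^(-1) y

def predict_mean(x_star):
    """Mean prediction y* = y^T K^(-1) k*, written as a dot product."""
    k_star = covariance(x_train, np.atleast_1d(x_star)).ravel()
    return float(k_star @ alpha)

print(predict_mean(2.0))  # returns 1.3: the prediction passes through the data
print(predict_mean(2.5))  # an interpolated value between the training points
```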

One of the challenges of machine learning is the calculation of the inverse of the matrix K. Its size depends on the number of data points we use for training, which we would like to be as large as possible. The problem, a version of the curse of dimensionality, is worsened when our interest is in data that depends not on one but on many independent variables. For example, we might measure the stress-extension relationship as the extension, the degree of cross-linking and the temperature all vary. Imagine we take measurements at ten different values of each: that’s 10³ data points. Each data point requires a covariance measure with every other data point, which means the computer needs to invert a matrix that is 1000 by 1000, no trivial task. Much of the effort of those developing algorithms for machine learning is focussed on how to tackle this problem and how to reduce the dimensionality so that problems become computationally tractable.

The next question we will need to consider is how do we optimise the model? In other words, how do we choose our hyper-parameters in order to enable predictions? As can be seen from the figures, different length scales give very different predictions, so which one is the best? In order to answer that question we need to define what we mean by best. Recall that for linear regression models, where we are optimising fitting parameters, we typically use the method of least squares, which effectively minimises the distance between the data and the fit taking into account all the data points. Our GP, though, has been chosen to ensure that the predictions pass through all the data points, so we will need something a little different.

Understanding machine learning with just one data point

There are plenty of good books on the mathematics behind machine learning, so I want to present it a little differently. What I’m going to do is describe the mechanics of machine learning, specifically Gaussian processes, starting from just one data point. From a machine learning, or even a more traditional data fitting, perspective this might seem strange, since clearly we can learn nothing of value from one data point, but it will allow us to keep the mathematics at introductory undergraduate level while still, I believe, illustrating the underlying mechanics of machine learning. We also won’t be able to look at optimisation: in the same way that there are an infinite number of equally plausible straight lines that can be fitted through one data point, there are an infinite number of equally plausible predictions that a machine learning algorithm can make. Hopefully, though, this way of gently easing ourselves into machine learning will provide some additional insight beyond the standard approaches written for those more used to thinking in terms of multidimensional matrix algebra.

A Gaussian process addresses the question: how do I maximise the flexibility of my fitting without needing an infinite number of parameters, or at least without needing as many parameters as there are data points? One way to think about it is that we want to be able to learn and predict behaviour for any possible, unknown, mathematical description that could fit the data.

There are two key assumptions of standard GPs. The first is that the value of a function, or of a measurement, taken at some value of our independent variable will be close to the value taken at a nearby value of the independent variable. In other words, we assume that what we are measuring varies smoothly with the independent variable. The second is that, if we believe the data is free from noise and we measure y = f(x) at some point x, then we want our machine learning algorithm to return exactly y if we input the value x. This may seem trivial, but what we do not want to do is simply look up whether we have taken a measurement at x previously and return the value of that measurement.

For the first assumption, I believe that more sophisticated GPs than my discussion will cover are able to deal with discontinuities in data, such as those found when we measure some property of a material as it undergoes a transition. For the second assumption, a belief that the data is noisy can be built into a GP; I’ll discuss how that is done in a later blog.

In my previous blog I introduced the joint Gaussian distribution function for the probability of independent events. This can be extended to the case of events that are not independent but related. The reason that this is of interest is because of the first assumption that I mentioned above. If two measurements taken close to each other are expected to be closely related, then independence is no longer a valid assumption. To account for this whilst retaining a Gaussian structure, we introduce a function, called the covariance function, that measures how much knowledge of y, measured at x, determines the likely value of y* measured at x*. The covariance function is often referred to as the kernel. In the language of machine learning, x and the corresponding y(x) represent our training data. The covariance function depends on the distance between the training point, x, and the point at which we want to predict the outcome, x*. One of the most commonly used is the squared exponential:

$$ k(x, x^{*}) = \sigma_f^{2}\,\exp\!\left(-\frac{(x - x^{*})^{2}}{2 l^{2}}\right) $$

The parameter, l, is known as a length (even though generally it will not have dimensions of length, since it must have the same dimensions as the independent variable x). It controls how much influence the data measured at x has on the predicted value at x*. The other parameter, σf, sets the variance and measures confidence in our prediction. These two parameters characterise relationships between data rather than directly characterising the data itself. For this reason, they are often referred to as hyper-parameters.

You might have noticed that the above function has the same structure as a Gaussian function, which is potentially confusing: this is not the reason that we refer to this technique as a Gaussian process! The name Gaussian process refers to the joint probability distribution, the form of which is always Gaussian, which I’ll cover in an upcoming blog. Other covariance functions can also be used which do not have the structure of a Gaussian; I’ll also discuss those in a future blog.
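In code, the squared exponential covariance is only a couple of lines; this is just a sketch with my own choice of argument names:

```python
import numpy as np

def squared_exponential(x, x_star, length=1.0, sigma_f=1.0):
    """Covariance between a training point x and a test point x_star.

    length controls how far the influence of a measurement extends, and
    sigma_f**2 is the variance hyper-parameter.
    """
    return sigma_f ** 2 * np.exp(-(x - x_star) ** 2 / (2.0 * length ** 2))

# Nearby points are strongly correlated; distant points are almost independent.
print(squared_exponential(1.0, 1.1))  # close to 1
print(squared_exponential(1.0, 5.0))  # close to 0
```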

With the covariance function defined, the most probable value of y* then depends on it in a very simple way when we only have the one training point:

$$ y^{*} = \exp\!\left(-\frac{(x - x^{*})^{2}}{2 l^{2}}\right) y $$

We see immediately that this ensures that if our query coincides with the training point, i.e., x* = x, then y* = y, which was one of our requirements. You can also see that the mean predicted value does not depend on the variance, which makes sense. The variance tells us about our confidence in the prediction, not the prediction itself. We’ll return to the variance later: one of the most powerful features of GP machine learning compared to other techniques is that it not only predicts the most likely value but also provides a measure of our confidence in that value. This is particularly valuable when we are deciding where we need to undertake more measurements in order to improve the predictive capability.
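Using that kernel, the one-training-point prediction can be explored in a few lines; the stress and extension values below are made up purely for illustration:

```python
import numpy as np

# One training point: a stress y_train measured at extension x_train (made up).
x_train, y_train = 2.0, 1.5

def predict_one_point(x_star, length):
    """Mean GP prediction from a single noiseless training point."""
    return np.exp(-(x_train - x_star) ** 2 / (2.0 * length ** 2)) * y_train

extensions = np.linspace(1.0, 10.0, 200)
for length in (10.0, 1.0, 0.1):
    prediction = predict_one_point(extensions, length)
    # For length = 10 the prediction is almost flat; for 0.1 it collapses to
    # zero as soon as we move away from the training point.
    print(length, prediction.min(), prediction.max())
```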

For those of you that prefer visualisation to understand equations, below are three different predictions based on the squared exponential kernel for the stress extension relation when my training data corresponds to just one data point. In each I’ve used a fixed value for the variance of 1, and varied the length.

 

[Slideshow: one-point GP predictions for the stress-extension relation at three different length scales]

The results are quite intuitive. For l = 10 extension units, we are saying that the influence of our one data point extends a long way either side, so the stress is predicted to be more or less constant. For l = 1, we see it falls away more quickly and for l = 0.1, more quickly still. You will notice that in the latter two cases, the prediction falls away to zero stress, which might seem a little strange, but this brings us to an important concept in GPs, and machine learning in general.

We could, if we wished, include a belief that the stress is non-zero for all extensions greater than 1. In the absence of data, we might believe that the relation between stress and extension is of the form

$$ \sigma = G\left(\lambda - \frac{1}{\lambda^{2}}\right) $$

as I discussed previously, where λ is the extension and G is the single parameter of the model. (At this point I should apologise for my symbol choice: the σ in the above refers to stress and has nothing to do with the variance, σf.)

Such a belief is known as a prior. So why haven’t I included it? It turns out that once we have many data points on which to train our GP, the prediction for both the mean and the variance of y*, which is called the posterior, becomes insensitive to any prior belief. In machine learning language, the data overwhelms the prior. For this reason, it is common to set the prior belief equal to zero and let the data do the talking.

Of course, the predictive value of a GP really depends on having many more data points than one, so the fact that the predictions here are obviously very poor is just a consequence of the machine learning algorithm having very little from which to learn.

Machine learning vs physics learning. A physicist’s view of the machine learning approach.

Cubic splines.

Still reading? Good. In the second of my two part blog I’ll introduce non-parametric learning. The most important thing to understand about non-parametric learning is that it is not non-parametric. Now that we’ve cleared that up …

Thanks to those of you who voted following my previous blog. You’ll find the results at the end of this blog. So what’s going on in the three figures I posted earlier?

Figure 1 is the simplest and, as you’ve probably already guessed, each data point is simply joined to its neighbours by straight lines. From experience, we tend to think it unlikely that measurements taken between the existing data points would fall on these lines; we would expect the data to vary more smoothly.

Figure 3 is generated from the best fit of the relationship between stress and extension that arises from some simple assumptions and application of the idea of entropy, as I mentioned here. In terms of learning about physics, this representation could be considered to provide the most learning, albeit that we are learning that our model is too simplistic. By just comparing the shapes of the data with the curve, we can infer that we need a model with more parameters to describe physics not included in our simple model. This is parametric physics learning.

If we aren’t attempting to fit a physics relationship but believe that our data is representative of an underlying trend, what options are there?

Figure 2 is generated using a “smoothing spline”. This is a neat way of attempting to interpolate data based on the data alone, rather than on any beliefs about what might cause a particular relation between extension and stress. A smoothing spline is an extension of the cubic spline, which is a type of non-parametric learning. The cubic spline describes the curve locally, around each data point, as a cubic equation. In this case, in contrast to the physics approach, we do not impose a global relationship. This means that knowing the value of the data at the first point tells us nothing about the value at, say, the 10th point of measurement. The physics approach would enable us to make this inference, but as we can see from figure 3, in some cases it wouldn’t be a very good prediction!

You may be wondering how we can define a cubic for each individual data point. A cubic equation has four coefficients, so we have four unknowns but far fewer data values to pin them down. To find the remaining unknowns, we add conditions, such as requiring the slope and curvature of the cubics connecting adjacent points to match where they meet, which guarantees the appearance of smoothness. The maths behind this is cumbersome to write down and involves a great deal of repetitive calculation, which is why the approach only became popular with the advent of computers.

A smoothing cubic spline extends the idea of a cubic spline so it can deal with noisy data, that is, data that varies about the expected mean. Without this, the cubic spline can become quite spiky when the data is noisy, so a smoothing spline relaxes the demand that the curve pass through every data point and instead looks for a compromise curve that is smooth but never too far from the data. This requires the introduction of another (non?) parameter, unsurprisingly called the smoothing parameter. When this is zero it reproduces a cubic spline fit that passes through all the data points; when it is one it fits a straight line through the data. The best choice of smoothing parameter requires us to introduce some arbitrary measure of what good looks like, but statisticians have come up with ways of quantifying and measuring this, a topic for a future blog.
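For anyone who wants to experiment, scipy provides both flavours. Note that scipy's smoothing parameter s is not the zero-to-one parameter described above (different libraries use different conventions), but the idea is the same:

```python
import numpy as np
from scipy.interpolate import CubicSpline, UnivariateSpline

# Noisy stress-extension style data, made up for illustration.
rng = np.random.default_rng(0)
extension = np.linspace(1.0, 5.0, 15)
stress = extension - 1.0 / extension ** 2 + 0.1 * rng.standard_normal(15)

# Cubic spline: passes exactly through every data point.
interpolating = CubicSpline(extension, stress)

# Smoothing spline: relaxes that demand; a larger s allows more smoothing.
smoothing = UnivariateSpline(extension, stress, s=0.2)

x_fine = np.linspace(1.0, 5.0, 200)
print(interpolating(x_fine)[:3])
print(smoothing(x_fine)[:3])
```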

In what way, then, is this approach non-parametric? In parametric learning, the number of parameters is dictated by the model we have chosen to fit our data. For the stress-extension entropy model, we have just one parameter. We believe that more data will either improve the fit to the model or further support our view that the model needs to be more sophisticated, but we do not believe that more data will necessarily require more parameters. In non-parametric learning the number of parameters is determined by the amount of data and how much we wish to smooth the data. The parameters should be viewed as having no meaning outside of their use in describing how the data behaves. In other words, we cannot extract any physical meaning from the parameters.

So which one is preferable? This comes down to asking ourselves two questions: what do we want to learn, and what do we want to do with our new knowledge? If our goal is to determine the physical laws that govern rubber elasticity, and how those laws can be most elegantly represented mathematically, then figure 3 is an important step. If our goal is to predict what will happen, but we don’t care why, for extension values that we have not measured, then figure 2 is preferable. This is the essence of the difference between machine learning and physics. Machine learning works on the basis that the data is everything and that we learn everything we need to know from the data itself. Physics, on the other hand, is continually searching for an underlying description of the universe based on laws that we discover guided by observations.

As for your votes, the view was unanimous: Figure 2. Is this telling us that machines have learnt to think like humans, or that we have biased their learning outcomes with our own preconceived notions?