Learning new physics with AI

It has been quite a while since I last posted. I was reminded to write something by a nice remark about the blog from a colleague at Microsoft whom I met through an Engineering and Physical Sciences Research Council event. He also pointed me to a fascinating blog by Neil Dalchau at Microsoft on how to use machine learning to pin down parameters in models that can only be solved numerically. I was very excited by this because, as it happens, this has become a focus of our own research.

Before I discuss that in a future blog though, I want to share what I think is an exciting example of where machine learning can help us learn more from existing data than we previously thought possible.

First, recall that I’m fascinated by the physics behind structural evolution that leads to images like these:

Ultimately we want to know how a structure relates to some property of the material, which might be its strength or how it absorbs light, depending on the intended application. Ideally, we want a succinct way to characterise these structures. A very powerful experimental approach involves shining a beam of some sort (neutrons are particularly good if the mixture is different polymers, but X-rays or light can also be used) and observing how the beam is affected as it exits the material. A significant piece of information we obtain is a length scale, which tells us something about the size of the structures inside our material. The problem is that the length scale is not unique to a given structure. As an example, the two structures above have the same length scale according to a scattering experiment, as shown in the figure below, but just by looking at the images above we can see that they are different. On the left, the structure corresponds to droplets of one type of polymer in a matrix of another, whilst we refer to the structure on the right as co-continuous.

How neutrons “see” the two structures above. The horizontal axis represents the angle by which some neutrons are scattered by the structures away from the direct beam, most of which passes through the sample without being affected. The peak position tells us the size of the structures inside the material. The further to the right the peak is, the smaller the structures, but in this case the peaks are in the same place, telling us that both structures have the same size even though they are otherwise very different.
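As an aside, if you ever want to generate scattering-like data from a simulated two-phase image yourself, a circularly averaged Fourier power spectrum is a common stand-in for the measured structure factor. The sketch below is purely illustrative, assuming numpy and a hypothetical square image array phi; it is not the analysis pipeline used for the figures above.

```python
# Minimal sketch: a scattering-like curve (circularly averaged power spectrum)
# from a hypothetical square 2D two-phase image "phi" (+1 in one phase, -1 in the other).
import numpy as np

def radially_averaged_structure_factor(phi, n_bins=100):
    # Fourier transform of the composition field and its power spectrum
    fk = np.fft.fftshift(np.fft.fft2(phi - phi.mean()))
    s_k = np.abs(fk) ** 2
    # magnitude of the wavevector for every pixel (assumes a square image)
    freqs = np.fft.fftshift(np.fft.fftfreq(phi.shape[0]))
    kx, ky = np.meshgrid(freqs, freqs)
    k_mag = np.sqrt(kx ** 2 + ky ** 2).ravel()
    # average the power spectrum in annular bins of |k|
    edges = np.linspace(0.0, k_mag.max(), n_bins + 1)
    which_bin = np.digitize(k_mag, edges)
    s_flat = s_k.ravel()
    s_of_k = np.array([s_flat[which_bin == b].mean() if np.any(which_bin == b) else np.nan
                       for b in range(1, n_bins + 1)])
    k_centres = 0.5 * (edges[:-1] + edges[1:])
    return k_centres, s_of_k

# made-up image standing in for a real microstructure
rng = np.random.default_rng(0)
phi = np.sign(rng.standard_normal((256, 256)))
k, s = radially_averaged_structure_factor(phi)
```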

By borrowing tools from cosmology, scientists in the 1990s proposed some extra measures that can characterise the structure. These characteristics, known as Minkowski functionals, are volume, surface area, curvature and connectivity. Volume represents the relative volume occupied by each of the two phases. Surface area is the total contact area between the two phases. Curvature is a measure of how curved the structures are – small spheres have a much higher curvature than large spheres. Connectivity, also called the Euler measure, tells us how easy it is to trace a path from one part of the structure that is occupied by one of the polymers to another that is occupied by the same polymer without having to pass through regions occupied by the other polymer. A droplet structure has low connectivity – if you are inside one droplet you cannot trace a path to another droplet without passing through the matrix of the other polymer. An ideal co-continuous structure has perfect connectivity – you can reach anywhere that is also occupied by the same polymer without having to pass through a region occupied by the other polymer.
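For readers who like to experiment, three of the four measures can be estimated from a binary image with standard image-analysis tools. The sketch below is a rough illustration, assuming a reasonably recent version of scikit-image (I believe it provides measure.perimeter and measure.euler_number); the image is a made-up stand-in and the curvature measure is omitted because it needs a more careful surface analysis.

```python
# Rough sketch: estimating Minkowski-style measures from a 2D binary image.
# "structure" is a hypothetical boolean array, True where one polymer sits.
import numpy as np
from skimage import measure

rng = np.random.default_rng(1)
structure = rng.random((256, 256)) > 0.5        # stand-in for a real microstructure image

volume_fraction = structure.mean()                       # relative volume of one phase
surface = measure.perimeter(structure)                   # 2D analogue of the surface area
euler = measure.euler_number(structure, connectivity=2)  # connectivity (Euler) measure
# curvature needs a more careful surface analysis and is omitted here

print(volume_fraction, surface, euler)
```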

The challenge with experimentally determining Minkowski functionals is that it remains difficult to obtain three dimensional images of internal structures. Confocal microscopy is one tool for doing this, but it is limited to structures of the order of microns, whilst many mixtures have structures that are much smaller. There have been amazing developments in X-ray and neutron tomography which do enable such images to be obtained, but it is still much more time consuming and costly than scattering experiments.

It would be really nice if we could learn more from the scattering than just the length scale. The question we asked ourselves was: is there a relation between the scattering and the Minkowski functionals? A strong motivation for this is that we know the volume can be obtained from the scattering through a reasonably simple mathematical relation, so perhaps the other functionals are also hidden in the scattering data.

To answer this question we trained some machine learning algorithms on pairs of scattering data and Minkowski functionals, both calculated from the same microstructure image. We did this for approximately 200 different pairs with different microstructures. We then tested the algorithm by asking it to predict the Minkowski functionals for 200 scattering data sets that it had not previously seen, and compared the ML predictions with the direct calculations. The results were considerably more impressive than I anticipated!

Machine learning predictions for the Minkowski connectivity or Euler measure based on scattering data compared to the observed values. The match is excellent apart from at the two extremes of the values for the connectivity.
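For anyone who wants a feel for this train-and-test workflow, here is a minimal sketch using scikit-learn. The arrays, the choice of a random forest regressor and the 50/50 split are illustrative stand-ins, not our actual pipeline.

```python
# Illustrative sketch: learn to predict a Minkowski measure (one number per
# sample) from a scattering curve, then test on curves the model has not seen.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.random((400, 50))        # 400 hypothetical scattering curves, 50 q-points each
y = rng.random(400)              # 400 hypothetical Euler measures

# half for training, half held back for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(np.corrcoef(y_test, y_pred)[0, 1])   # compare predictions with direct calculations
```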

We’ve since refined our algorithm further and the early indications are that we can do even better at the extreme values, which is a reminder that whilst machine learning is a black box technique, how the question is framed and how the data is prepared remain crucial to success.

Optimising models

In my previous blog, I discussed how to use an intuitive method, Leave One Out Cross Validation, for determining what parameters work best for a given machine learning algorithm. When I wrote that blog, I was surprised to find that it did not seem to work very well for finding an optimum fit. Something I have learnt over many years of working with numerical analysis is that if the results are surprising, it is well worth checking that they are correct! I’ll now revisit the LOOCV technique and present a much more satisfying outcome. Recall that in this method we train on all but one data point and then compare the trained model’s prediction for the stress with the observed stress at the point left out. We then optimise by minimising the deviation between the two, having repeated the LOOCV process with every data point left out in turn. If we vary the length and variance hyper-parameters in the GP, we obtain a contour map of the model likelihood which enables us to identify the best parameters. This approach to optimisation, iterating over a regularly spaced grid of length and variance values, is not very efficient, but serves to illustrate the process.

The dark red patch in the left figure shows where the fit is best and the figure on the right shows the corresponding fit to the data, including the confidence interval, shaded in grey.
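For concreteness, here is a minimal sketch of the grid iteration described above, written with GPy. The data, grid ranges and kernel settings are hypothetical, and the noise variance is simply left at its default; treat it as an illustration of the procedure rather than my actual script.

```python
# Sketch: Leave One Out cross validation over a grid of RBF length scales and
# variances for a GPy regression model; x and y are hypothetical 1D data.
# This brute-force triple loop is slow, but mirrors the grid described above.
import numpy as np
import GPy

x = np.linspace(1.0, 3.0, 20)[:, None]
y = 2.5 * (x - 1.0 / x ** 2) + 0.1 * np.random.randn(*x.shape)

lengths = np.linspace(0.5, 4.0, 20)
variances = np.linspace(0.5, 4.0, 20)
loo_error = np.zeros((len(lengths), len(variances)))

for i, l in enumerate(lengths):
    for j, v in enumerate(variances):
        squared_errors = []
        for leave_out in range(len(x)):
            keep = np.arange(len(x)) != leave_out
            kern = GPy.kern.RBF(input_dim=1, lengthscale=l, variance=v)
            m = GPy.models.GPRegression(x[keep], y[keep], kern)
            # hyper-parameters are held fixed at the grid values (no optimisation)
            mean, _ = m.predict(x[leave_out:leave_out + 1])
            squared_errors.append((mean[0, 0] - y[leave_out, 0]) ** 2)
        loo_error[i, j] = np.mean(squared_errors)

best = np.unravel_index(np.argmin(loo_error), loo_error.shape)
print(lengths[best[0]], variances[best[1]])
```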

An alternative way to determine how good a model is uses the log likelihood, based solely on the probability of the predictions. This gives rise to two competing terms: one measures the model complexity, the other the quality of the data fit. Below is the map of likelihood as I vary the length scale and variance in my Gaussian process for my stress/extension data.

Although the contour plot looks different, the maximum likelihood is located in a similar, but not identical, place. Since the two methods are statistically different ways of determining the best fit, it is not too surprising that the exact results differ; however, we would find it difficult to distinguish between the predictions of the two validation techniques.

For comparison, the optimised hyper-parameters are:

Hyper-parameter    LOOCV    Log likelihood
length             2        1.8
variance           2.4      2.7
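A likelihood map like the one above can be generated in much the same way; the sketch below again uses GPy with made-up data, and the parameter paths (m.rbf.lengthscale and so on) are the ones I believe GPy exposes.

```python
# Sketch: map of the GP log marginal likelihood over the same kind of grid of
# RBF length scales and variances, again for hypothetical 1D x, y data.
import numpy as np
import GPy

x = np.linspace(1.0, 3.0, 20)[:, None]
y = 2.5 * (x - 1.0 / x ** 2) + 0.1 * np.random.randn(*x.shape)
lengths = np.linspace(0.5, 4.0, 20)
variances = np.linspace(0.5, 4.0, 20)

m = GPy.models.GPRegression(x, y, GPy.kern.RBF(input_dim=1))

log_like = np.zeros((len(lengths), len(variances)))
for i, l in enumerate(lengths):
    for j, v in enumerate(variances):
        m.rbf.lengthscale = l                 # hold the kernel at the grid values
        m.rbf.variance = v
        log_like[i, j] = m.log_likelihood()   # balances data fit against complexity

best = np.unravel_index(np.argmax(log_like), log_like.shape)
print(lengths[best[0]], variances[best[1]])
```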

Likelihoods and complexity

If we are going to apply machine learning to science, then clearly we need a way of quantifying how good we think our predictions are. Even before we reach that stage, though, we need a way to decide which model is best and, for any given model, which settings, or hyper-parameters, provide the most believable predictions. This was illustrated in my previous blog in the figure showing how the predictions for the stress/extension relation vary as we vary the length scale hyper-parameter. For all length scales the curve passes exactly through the data, so how do we decide which is best?

There are a number of approaches to this; I’ll discuss just two, which are quite different and provide a nice reminder that, as powerful as machine learning is, it has its quirks. It is worth noting that these validation methods are applicable regardless of the particular form of machine learning, so they are as valuable for checking neural network predictions as they are for Gaussian processes.

First of all, in this blog I’ll discuss the conceptually simpler of the two, cross-validation. In essence, we split the data into two sets: one is used to train the algorithm, the other is used to measure how good the predictions are. Since there are many different ways of splitting a data set into a training and a validation set, I’ll discuss what is known as Leave One Out cross validation (LOOCV). This is a dry but very descriptive name! All the data points but one are used to train the algorithm, and the trained algorithm is then asked to predict the output at the point that has been left out; the closeness of the prediction to the observation is a measure of how good the fit is. This can be repeated so that every data point is left out in turn, with the overall measure of fit quality simply the average over all of the LOOCV attempts.

We can use this to help guide the choice of hyper-parameters by repeating the process for a range of length scales and variances, introduced in a previous blog. We then look for the pair of values which give the best averaged match …

Since I wrote what comes next, I have discovered a bug in my code. My next blog will correct this, but since this is a blog and not an article for peer-review, I will preserve my mistakes below for posterity!

… , which sounds simple, but gives rise to two challenges.

The first is that searching for the best match means finding a maximum point on a two dimensional surface (or higher as we add more hyper-parameters). Finding a maximum on a surface that is likely to be quite complex is difficult! It is easy to find a local maximum, but much more challenging to find the global one. This is a problem that exists in a multitude of numerical calculations, and has been known about for a long time, so although it isn’t easy, there are lots of established approaches to help.
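One of the simplest of those established approaches is a multi-start search: run a local optimiser from several random starting points and keep the best result. Here is a rough sketch with scipy, using a made-up two-parameter objective standing in for the (negative) cross-validation score.

```python
# Sketch: multi-start local optimisation as a cheap guard against local optima.
# "objective" is a made-up stand-in for minus the cross-validation score.
import numpy as np
from scipy.optimize import minimize

def objective(theta):
    l, v = theta
    # a bumpy surface with several local minima, purely for illustration
    return np.sin(3 * l) * np.cos(2 * v) + 0.1 * ((l - 2.0) ** 2 + (v - 2.5) ** 2)

rng = np.random.default_rng(3)
best = None
for _ in range(20):                              # 20 random restarts
    start = rng.uniform(0.1, 5.0, size=2)
    result = minimize(objective, start, method="L-BFGS-B",
                      bounds=[(0.1, 5.0), (0.1, 5.0)])
    if best is None or result.fun < best.fun:
        best = result

print(best.x, best.fun)
```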

The second challenge is best illustrated in a contour plot of the log likelihood for my stress-extension data as I vary the two hyper-parameters:

Contour map of the LOOCV log likelihood as the two hyper-parameters are varied.

The colour scale is such that bright yellow represents the highest values whilst dark blue represents the lowest. What you might be able to see is that there is no maximum! Even when I increase the length scale to much higher values, the log likelihood continues to increase, so according to this method LOOCV predicts that an infinite length scale is optimal. Apparently this is a common problem with the cross-validation approach, although I have not yet found an explanation as to why. In the next blog, I’ll discuss a different approach to finding the optimum values for the hyper-parameters, which is less intuitive, but appears to be more robust.

A new challenge ahead: Automating Science Discovery

The reason I started this blog was to document my progress as I delved into machine learning. One of the primary motivations for doing that was that I was preparing a proposal to the UK’s Engineering and Physical Sciences Research Council for a call for feasibility studies bringing together physical science and artificial intelligence. After an Expression of Interest, an audition(!) that I did from Los Angeles at 3am local time, writing and submitting a full proposal and then attending an interview, I was unsurprisingly thrilled to learn that our bid was successful.

One of the key requirements for the proposal call was that we develop not just the use of AI in the physical sciences but also new AI. Below is the summary of our project describing the particular area of the physical sciences that we will focus on and the challenges we have set ourselves.

“De-mixing is one of the most ubiquitous examples of self-assembly, occurring frequently in complex fluids and living systems. It has enabled the development of multi-phase polymer alloys and composites for use in sophisticated applications including structural aerospace components, flexible solar cells and filtration membranes. In each case, superior functionality is derived from the microstructure, the prediction of which has failed to maintain pace with synthetic and formulation advances. The interplay of non-equilibrium statistical physics, diffusion and rheology causes multiple processes with overlapping time and length scales, which has stalled the discovery of an overarching theoretical framework. Consequently, we continue to rely heavily on trial and error in the search for new materials.”

“Our aim is to introduce a powerful new approach to modelling non-equilibrium soft matter, combining the observation based empiricism of machine learning with the fundamental based conceptualism of physics. We will develop new methods in machine learning by addressing the broader challenge of incorporating prior knowledge of physical systems into probabilistic learning rules, transforming our capacity to control and tailor microstructure through the use of predictive tools. Our goal is to create empirical learning engines, constrained by the laws of physics, that will be trained using microscopy, tomography and scattering data. In this feasibility study, we will focus on proof-of-concept, exploring the temperature / composition parameter space for a model blend, building the foundations for our ambition of using physics informed machine learning to automate and accelerate experimental materials discovery for next generation applications.”

Machine learning: preparing to go underneath the hood

In my previous blog, I showed machine learning predictions for the stress/extension data. In my next blog, I’ll start exploring what is happening within the algorithm. The particular machine learning algorithm that I’m using is known as a Gaussian process, or GP for short from now on. Rasmussen, in his freely available book, discusses what he claims is a reasonably close relationship between GPs and many of the other approaches, including neural networks. I believe that this is not a universally accepted viewpoint, but I suspect that for those of us who are likely to remain just users of machine learning, the arguments might be rather too technical to follow! Either way, I find GPs to be one of the more accessible routes into machine learning.

Today, though, I just want to cover some background about linear regression, since this enables me to introduce some of the language and terms that are just as important in machine learning as they are in linear regression. A useful starting point is to remind ourselves of the Gaussian distribution and how it is used, often without us thinking about it, to determine the line of best fit to data using linear regression, the simplest form of machine learning and one used well before the term became commonplace:

y = mx + c

If we believe that this equation describes our data, and our measurements are free from experimental noise, then our data would fit perfectly onto the straight line. Of course, all measurements have some noise, which now means that our belief is that the actual measurement, let’s call it yobs, if we measure it repeatedly at the same value of x, will have a distribution of values with a mean of y = mx + c, and some distribution about the mean. The most common distribution is the Gaussian function, which says that the probability that we observe a particular value, yobs, is given by,

p(yobs) ∝ exp( −(yobs − (mx + c))² / 2σ² )

One attraction of the Gaussian distribution is that it is characterised by just one additional parameter, the variance σ².

You might find it helpful to see what a Gaussian looks like:

Gaussian curve

So we see that the probability is highest at the expected value y = mx + c, as required. The width of the peak is determined by the variance: the greater the variance, the wider the peak and the more likely we are to observe data further away from the mean. Since p(yobs) is a probability, we also need a prefactor that ensures the total probability over all possible values is 1. This leads to

p(yobs) = (1/√(2πσ²)) exp( −(yobs − (mx + c))² / 2σ² )

Since we have a probability of observing a particular yobs at each x at which we take a measurement, we need to introduce a joint probability, which is just the probability of observing y1 at x1, y2 at x2 and so on. We usually assume that the noise that affects the measurement at one point is not related, or is uncorrelated, to the noise at another point. The joint probability that two or more independent events occur is the product of the probability of each individual event, so that

p(y1, y2, …, yN) = ∏i p(yi) = ∏i (1/√(2πσ²)) exp( −(yi − (mxi + c))² / 2σ² )

Now that we have specified a belief about how our data behaves and encoded it within a probability distribution function, we need to find the line of best fit, which means we need a measure of goodness of fit. The most common approach is maximum likelihood estimation (MLE). I’ll briefly discuss this in the context of linear regression, but, helpfully, it is also widely used in machine learning to optimise the model based on training data. The MLE corresponds to the parameters, which are just m and c for linear regression, that maximise what is called the log-likelihood; in other words, we search for the m and c that maximise the logarithm of the joint probability. Why maximise the logarithm of the joint probability and not just the joint probability? The simplest answer is that by taking the logarithm, we simplify the math enormously. Firstly, the logarithm of a product is the sum of the logarithms, and secondly, the logarithm of an exponential is just whatever is inside the exponential. In mathematical terms, the log-likelihood for a joint Gaussian probability distribution function for independent events is given by

log L = −(N/2) log(2πσ²) − (1/2σ²) ∑i (yi − (mxi + c))²

So now we just need to maximise this with respect to the parameters m and c, which we can do using calculus. I won’t go any further into the mathematics; there are plenty of resources online that describe the process in detail.
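If you would rather see it in action than on paper, the short sketch below maximises the log-likelihood numerically for some made-up straight-line data and checks the answer against an ordinary least-squares fit.

```python
# Sketch: maximum likelihood estimate of m and c for noisy straight-line data,
# found by minimising the negative log-likelihood, then checked against np.polyfit.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 30)
y_obs = 1.7 * x + 0.5 + rng.normal(0.0, 1.0, x.size)   # made-up data with sigma = 1

def negative_log_likelihood(params, sigma=1.0):
    m, c = params
    residuals = y_obs - (m * x + c)
    n = x.size
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + np.sum(residuals ** 2) / (2 * sigma ** 2)

result = minimize(negative_log_likelihood, x0=[0.0, 0.0])
print(result.x)                  # MLE for (m, c)
print(np.polyfit(x, y_obs, 1))   # least squares gives the same slope and intercept
```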

Hopefully that has set the scene for my next blog when I’ll explore GPs.

A first attempt at machine learning

Today, I’ve generated my first results using a machine learning algorithm. I don’t understand the theory behind the model well enough yet to describe it, but will be working on that later this week, so watch this space.

One of the first challenges for anyone wanting to adopt machine learning is the huge choice of algorithms. Probably the most famous are neural networks, but there are numerous others with names such as Support Vector Machines and Reproducing Kernel Hilbert Space methods (a personal favourite, just for the impenetrability of the name). The family of methods that I’ve been trying to understand is known as Gaussian Processes. I’ll talk more about what these are and why I’ve chosen them in a later blog, but for now let’s just look at what they predict, returning to my stress/extension data:

Matlab Gaussian process prediction for the stress/extension data.

You’ll notice that the curve is smoother than the smoothed spline I used previously. Even better is that the output from this model is predictive rather than descriptive, in the sense that it is straightforward to provide the algorithm with new extension values and ask for the most probable stress. That is how I generated the curve above.

Gaussian process models are available in Matlab as part of its Regression Learner app. I’ve found them helpful as a starting point, but a little too restrictive in terms of understanding what is happening underneath the hood of the engine. I assume that other mathematical software packages also have Gaussian process capabilities. If you use Matlab, then feel free to play with the code that I’ve written. You can download the zip file with the .mlx live script function that does the training and testing, the two .mat files with the training data and the .m GP model.

I’ve started to use GPy, which is a Python package. It is more powerful than the Matlab implementation of Gaussian processes, as well as being free. Here is the GPy prediction based on the same training data:

GPy prediction, with optimised hyper-parameters, based on the same training data.

You will notice differences between the predictions of the Matlab and the GPy models. The GPy version seems more faithful to the data but doesn’t predict a stress of zero with an extension ratio of one. There is no particular reason it should. As far as machine learning is concerned the (0,1) data point has no more meaning than any of the other data points and it doesn’t know anything about the underlying physics, so I’m actually surprised that the Matlab version does pass through (0,1). I’ll try to figure out why this is the case! One of the reasons I like the GPy version is that it is easier to explore what is happening within the model, which gives me greater confidence. As an example, I can look at what it predicts before I optimise:

GPy prediction before the hyper-parameters are optimised.

One of the nice features of the GPy package is the blue shaded region, which indicates the range of confidence in the predicted stress values. As you would expect, the range of confidence is much narrower for the optimised case. Again, feel free to download the .py file and the data.
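If you would rather start from a blank script than from my files, the skeleton below is roughly all GPy needs; the data arrays are placeholders for your own x, y columns.

```python
# Sketch: fitting a GPy Gaussian process to x, y data and predicting new values.
# Replace X and Y with your own column vectors of shape (N, 1).
import numpy as np
import GPy

X = np.linspace(1.0, 3.0, 15)[:, None]                          # placeholder extensions
Y = 2.5 * (X - 1.0 / X ** 2) + 0.1 * np.random.randn(*X.shape)  # placeholder stresses

kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(X, Y, kernel)

model.optimize()                        # tune the hyper-parameters by maximum likelihood

X_new = np.linspace(1.0, 3.5, 100)[:, None]
mean, variance = model.predict(X_new)   # most probable stress and its uncertainty

model.plot()                            # shaded band shows the confidence region (needs matplotlib)
```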

In both cases, I encourage you to replace the stress and extension data with any x, y dataset of your own and build your own machine learning Gaussian Process! Perhaps you can upload results in the comments section of the blog?

 

 

Machine learning vs physics learning. A physicist’s view of the machine learning approach.

Cubic splines.

Still reading? Good. In the second of my two part blog I’ll introduce non-parametric learning. The most important thing to understand about non-parametric learning is that it is not non-parametric. Now that we’ve cleared that up …

Thanks to those of you who voted following my previous blog. You’ll find the results at the end of this blog. So what’s going on in the three figures I posted earlier?

Figure 1 is the simplest and, as you’ve probably already guessed, each data point is joined to its neighbours by straight lines. From experience, we tend to think it unlikely that measurements taken between the existing data points would fall on these lines; we would expect the data to vary more smoothly.

Figure 3 is generated from the best fit of the relationship between stress and extension that arises from some simple assumptions and application of the idea of entropy, as I mentioned here. In terms of learning about physics, this representation could be considered to provide the most learning, albeit that we are learning that our model is too simplistic. By just comparing the shapes of the data with the curve, we can infer that we need a model with more parameters to describe physics not included in our simple model. This is parametric physics learning.

If we aren’t attempting to fit a physics relationship but believe that our data is representative of an underlying trend, what options are there?

Figure 2 is generated using a “smoothing spline”. This is a neat way of interpolating data based on the data alone rather than on any beliefs about what might cause a particular relation between extension and stress. A smoothing spline is an extension of the cubic spline, which is a type of non-parametric learning. A cubic spline describes the curve locally at each data point as a cubic equation. In this case, in contrast to the physics approach, we do not impose a global relationship. This means that knowing the value of the data at the first point tells us nothing about the value at, say, the 10th point of measurement. The physics approach would enable us to make this inference, but as we can see from figure 2, in some cases it wouldn’t be a very good prediction!

You may be wondering how we can define a cubic for each individual data point. A cubic equation has four parameters, so we have four unknowns and only one data point. To find the other unknowns, we add assumptions, such as requiring that the slope and curvature of the curves connecting adjacent points match where they meet, which guarantees the appearance of smoothness. The math behind this is cumbersome to write down and involves a great deal of repetitious calculation, which is why the approach only became popular with the advent of computers.

A smoothing cubic spline extends the idea of a cubic spline so that it can deal with noisy data, that is, data that varies about the expected mean. Without this, the cubic spline can become quite spiky when the data is noisy, so a smoothing spline relaxes the demand that the curve pass through every data point and instead looks for a compromise curve that is smooth but never too far from the data. This requires the introduction of another (non?) parameter, unsurprisingly called the smoothing parameter. When this is zero, it reproduces a cubic spline fit that passes through all the data points; when it is one, it fits a straight line through the data. The best choice of smoothing parameter requires us to introduce some arbitrary measure of what good looks like, but statisticians have come up with ways of quantifying and measuring this, a topic for a future blog.
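To make the distinction concrete, here is a short sketch that fits both an interpolating cubic spline and a smoothing spline using scipy. Note that scipy parameterises the smoothing differently from the zero-to-one convention described above (its s is a tolerance on the residuals), so the value used here is purely illustrative.

```python
# Sketch: an interpolating cubic spline versus a smoothing spline on noisy data.
import numpy as np
from scipy.interpolate import CubicSpline, UnivariateSpline

rng = np.random.default_rng(5)
x = np.linspace(1.0, 3.0, 15)
y = 2.5 * (x - 1.0 / x ** 2) + 0.2 * rng.standard_normal(x.size)   # made-up noisy data

interpolating = CubicSpline(x, y)               # passes exactly through every point
smoothing = UnivariateSpline(x, y, k=3, s=0.5)  # trades closeness to the data for smoothness

x_fine = np.linspace(x.min(), x.max(), 200)
y_interp = interpolating(x_fine)
y_smooth = smoothing(x_fine)
```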

In what way, then, is this approach non-parametric? In parametric learning, the number of parameters is dictated by the model we have chosen to fit our data. For the stress-extension entropy model, we have just one parameter. We believe that more data will either improve the fit to the model or further support our view that the model needs to be more sophisticated, but we do not believe that more data will necessarily require more parameters. In non-parametric learning the number of parameters is determined by the amount of data and how much we wish to smooth the data. The parameters should be viewed as having no meaning outside of their use in describing how the data behaves. In other words, we cannot extract any physical meaning from the parameters.

So which one is preferable? This comes down to asking ourselves two questions: what do we want to learn, and what do we want to do with our new knowledge? If our goal is to determine the physical laws that govern rubber elasticity and how those laws can be most elegantly represented mathematically, figure 3 is an important step. If our goal is to predict what will happen for extension values that we have not measured, without caring why, then figure 2 is preferable. This is the essence of the difference between machine learning and physics. Machine learning works on the basis that the data is everything and that we learn everything we need to know from the data itself. Physics, on the other hand, is continually searching for an underlying description of the universe based on laws that we discover guided by observations.

As for your votes, the view was unanimous: Figure 2. Is this telling us that machines have learnt to think like humans, or that we have biased their learning outcomes with our own preconceived notions?

The appeal of data

I’m working on my next blog on non-parametric learning. I hope to publish early next week. In the meantime, take a look at the three graphs below, depicting the same stress/extension data for a rubber, and tell me which you prefer via this link. There is no correct answer; each represents a different view of learning. I’ll describe how I generated each curve in my next blog. Don’t worry if you are not familiar with stress/extension data, you are not alone! The point of graphics is to convey information to the widest possible audience in the most straightforward way possible, so your opinion is still valuable to me. These different ways of graphically describing the same data challenge us to think about what it is we are hoping to know that we didn’t know before, and then to ask what we can do differently now that we know it.

Feel free to add comments to the blog, if you’d like to let me know why you’ve made your particular choice.

Graph number 1:

jointhedots

 

Graph number 2:

smoothspline

 

Graph number 3:

theoreticalphysics

 

Machine learning vs physics learning: a physicist’s view.

In this two-part blog, I’ll introduce the two different approaches to machine learning, both of which can be applied to either the classification or regression problems discussed in my previous blog here. It seems to me, at least at this early stage of trying to make sense of AI, that the difference between parametric and non-parametric learning strikes at the heart of the difference between machine learners and theoretical physicists. Understanding this difference, so we can start to think about how to bring the two worlds together, is one of my aims.

In this first part, I’ll provide a personal view of parametric learning from a physics rather than machine learning perspective and touch on how this approach, that we feel so comfortable with, starts to have drawbacks.

I’ll be using one of my favourite examples to illustrate. If you want to know more about rubber band thermodynamics, there are plenty of resources that are easy to find. The rubber band heat engine is one that I’ve used in my class, and there is a lovely, pre-internet era half page paper that describes, in the most concise terms I’ve ever come across, simple experiments for discovering entropy in rubber bands.

In theoretical physics, we start with laws that have proven themselves to be robust. My personal favourite is the second law of thermodynamics, which tells us that the entropy, or disorder, of an isolated system always increases or stays the same but cannot decrease, a seemingly abstract statement with enormous predictive consequences. Why does a rubber band resist being stretched more when you heat it up? Entropy. You might wonder why I’m using this example rather than something from a more exotic area of physics. I could talk about the weird and wonderful world of Bose-Einstein condensates or even the Higgs boson, but I still marvel that entropy can be found in the behaviour of something so commonplace.

Anyway, back to physics. On top of laws we usually have to add some assumptions. This might be something like: inside our rubber band, the long chain molecules can be described as a series of connected rods where the joints between each rod can rotate freely. This apparently rather crude description, reducing a complicated molecular structure to something so simple, turns out to be surprisingly insightful, in part because it simplifies the math significantly. Our assumptions lead to parameters, in this case the number, or density, of crosslinks, which are the points inside our rubber where the long chain molecules are chemically tied to other long chain molecules. Our prediction for how the rubber responds when we stretch it, known as the stress/strain relation, depends on the crosslink density. If we measure the stress/strain relation and fit the prediction to it, the crosslink density becomes a fitting parameter. This is parametric learning, and it is pretty much how physicists like to construct and test their theories.

For data such as stress vs strain we really only need a computer to automate the process of fitting a parametric line of best fit. As I mentioned earlier, this is a primitive form of machine learning. Although this insight gave me the confidence to explore the subject further, it would be doing all the brilliant minds working on the subject a great disservice to imply that AI is simply an ever more complex version of y = mx + c. But before I start talking about what it is that these brilliant minds are doing, let me finish with a few more words on parametric models.

Parametric models are well-defined but suffer from the fact that in choosing a model, we impose an expected structure on our data, so all the algorithm can do is determine which set of parameters provides the best fit for that particular model. Take, for example, the simplest theory for how a rubber responds in an experiment when we apply a force and measure how much the rubber is stretched. It predicts that the stress, which we usually give the symbol σ and which is found by dividing the force by the cross sectional area, depends on the extension, λ:

σ = kTρ (λ − 1/λ²)
Here k is Boltzmann’s constant, which always crops up in models built on the second law of thermodynamics, T is the temperature of our measurement and ρ is the crosslink density. We can use linear regression by plotting σ on the y-axis and

λ − 1/λ²

on the x-axis, in which case the slope is kTρ, so we can determine the crosslink density in this way. We can compare this with what we expect based on knowledge of how we made the rubber. For many rubbers this relation turns out to be too simplistic. It works well at low extensions, overestimates the stress at intermediate extensions and underestimates it at higher extensions, but the algorithm will find a line of best fit even if it isn’t very good. We will probably guide the fit so that it gives more weight to the region of data where we expect it to work best. This is known as using prior information, an important concept in Bayesian machine learning, to improve the quality of our learning. What the algorithm cannot tell us is whether the equation we are using just needs refining or whether we need to start with a completely different model. It can tell us how good the fit is, if we impose arbitrary choices about what constitutes a good fit. Where the mismatches between theory and data occur often provides clues about how to build a better description, and we go back to the theoretical drawing board to figure out where we went wrong.
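As a concrete example of the fitting step, here is a short sketch that recovers the slope kTρ, and hence the crosslink density, from made-up stress/extension data; the numbers are invented purely to illustrate the procedure.

```python
# Sketch: estimating the slope kT*rho by linear regression of stress against
# (lambda - 1/lambda**2), with invented data purely for illustration.
import numpy as np

k_B = 1.380649e-23          # Boltzmann's constant, J/K
T = 300.0                   # temperature, K

rng = np.random.default_rng(7)
lam = np.linspace(1.05, 2.0, 20)                        # extension ratios
x = lam - 1.0 / lam ** 2                                # quantity plotted on the x-axis
true_slope = 1.0e5                                      # invented kT*rho in Pa
stress = true_slope * x + 2.0e3 * rng.standard_normal(x.size)   # invented noisy stresses

slope, intercept = np.polyfit(x, stress, 1)
rho = slope / (k_B * T)                                 # inferred crosslink density
print(slope, rho)
```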

If our model is a very poor fit we are unable to use our data fitting to make useful new predictions, but there are plenty of situations where we need to make predictions and we cannot wait for someone to find the right model. The theory for rubber elasticity was first proposed in the 1940s by Hubert James and Eugene Guth. Over 70 years later, theoreticians are still attempting to refine their model! In the second part, I’ll explore non-parametric machine learning, a different way of describing and predicting data.

Are you discrete?

In this blog, I’ll look at the two different data types that we work with: discrete or continuous. The distinction between them determines the type of machine learning that we will use. If our data can only have discrete values we seek to classify it. Like The Terminator? You’ll probably be classified into the group of people who also like Total Recall. That is a guess, but any AI that didn’t make that prediction probably isn’t very good.

Whilst classification is hugely important in science as well as many other big data problems, my interest is mostly in regression, which is about describing and predicting data that can take a range of values.

Perhaps a nice physical illustration of the difference between classification and regression is the phase behaviour of substances. Whether a substance is solid, liquid or gas at a given temperature and pressure is a classification problem. Take enough measurements at different pairs of temperature and pressure and an AI algorithm will be able to start constructing the probable boundaries that separate the phases. If, on the other hand, you are interested in the properties of the substance just in the liquid phase, you might, for example, measure the density as you vary the temperature. The density will vary continuously as long as the substance stays in the liquid phase. Describing such continuous variations, such as the density/temperature relation, is an exercise in regression.
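In code the two tasks look almost identical, which is part of the appeal. The sketch below uses scikit-learn with entirely invented temperature, pressure and density values, just to show the shape of each workflow.

```python
# Sketch: the same data-driven workflow used for classification (which phase?)
# and for regression (what density?), with invented numbers throughout.
import numpy as np
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(6)

# classification: phase label (0 = solid, 1 = liquid, 2 = gas) at (T, P) points
TP = rng.uniform([200.0, 1e4], [500.0, 1e7], size=(100, 2))
phase = rng.integers(0, 3, size=100)            # stand-in labels
classifier = SVC().fit(TP, phase)
print(classifier.predict([[300.0, 1e5]]))       # predicted phase at a new (T, P)

# regression: liquid density as a continuous function of temperature
T_liquid = np.linspace(280.0, 360.0, 30)[:, None]
density = 1000.0 - 0.2 * (T_liquid.ravel() - 280.0) + rng.normal(0.0, 0.5, 30)
regressor = GaussianProcessRegressor().fit(T_liquid, density)
print(regressor.predict([[320.0]]))             # predicted density at a new temperature
```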

Whilst in many cases it is obvious, whether a data set is discrete or continuous might also be a choice for you to make. It is worth remembering that no AI-based decision making is free from human choices and/or influence. In the above example of the phase behaviour, I have decided that my substance can only be in one of three phases. There are many more possibilities: different types of solid phases, supercritical fluid phases, liquid crystals, and then there is even the question of whether we really know what we mean by the liquid phase.