In this blog, I’ll look at the two different types of data that we work with: discrete and continuous. The distinction between them determines the type of machine learning that we will use. If our data can only take discrete values, we seek to classify it. Like The Terminator? You’ll probably be classified into the group of people who also like Total Recall. That is a guess, but any AI that didn’t make that prediction probably isn’t very good.
Whilst classification is hugely important in science as well as many other big data problems, my interest is mostly in regression, which is about describing and predicting data that can take a range of values.
Perhaps a nice physical illustration of the difference between classification and regression is the phase behaviour of substances. Whether a substance is solid, liquid or gas at a given temperature and pressure is a classification problem. Take enough measurements at different pairs of temperature and pressure, and an AI algorithm will be able to start constructing the probable boundaries that separate the phases. If, on the other hand, you are interested in the properties of the substance just in the liquid phase, you might, for example, measure the density as you vary the temperature. The density will vary continuously as long as the substance stays in the liquid phase. Describing such continuous variations, such as the density/temperature relation, is an exercise in regression.
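To make the contrast concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the measurements are invented rather than taken from any real substance, and the names `phase_samples`, `classify_phase` and `fit_line` are hypothetical. The classifier is a simple nearest-neighbour lookup and the regression is a straight-line least-squares fit; a real phase diagram or density curve would probably need something more flexible.

```python
from math import dist

# Hypothetical labelled measurements: (temperature, pressure) -> phase.
# The numbers are invented for illustration, not data for a real substance.
phase_samples = [
    ((-20.0, 1.0), "solid"),
    ((-5.0, 1.0), "solid"),
    ((20.0, 1.0), "liquid"),
    ((60.0, 1.0), "liquid"),
    ((120.0, 1.0), "gas"),
    ((150.0, 1.0), "gas"),
]


def classify_phase(point, samples=phase_samples):
    """Classification: return the label of the nearest labelled sample.

    Uses plain Euclidean distance, which assumes temperature and pressure
    are on comparable scales; real data would usually be rescaled first.
    """
    nearest = min(samples, key=lambda sample: dist(sample[0], point))
    return nearest[1]


def fit_line(xs, ys):
    """Regression: least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical liquid-phase densities, generated from an invented linear trend.
temperatures = [10.0, 20.0, 30.0, 40.0, 50.0]
densities = [1000.0 - 0.5 * t for t in temperatures]

intercept, slope = fit_line(temperatures, densities)

predicted_phase = classify_phase((30.0, 1.0))  # a discrete label
predicted_density = intercept + slope * 35.0   # a continuous value
```

The two predictions show the difference in kind: `classify_phase` can only ever return one of the labels it was given, whereas the fitted line returns a number anywhere on a continuous range.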
Whilst in many cases it is obvious whether a data set is discrete or continuous, sometimes it is a choice for you to make. It is worth remembering that no AI-based decision making is free from human choices and influence. In the above example of the phase behaviour, I have decided that my substance can only be in one of three phases. There are many more possibilities: different types of solid phase, supercritical fluid phases, liquid crystals, and then there is even the question of whether we really know what we mean by the liquid phase …