Our Chief Technical Officer Adam Fleming cuts through the confusion when it comes to machine learning in today’s post, explaining exactly what it is and why it’s becoming so useful to businesses.
“…about a 10x multiplier on your valuation” – it’s a slightly tired joke with a fragment of truth. Machine Learning is a subject that’s surrounded by a lot of hype, so I thought it would be useful to spend a little time unpacking it.
Before drilling down further, it’s probably worth saying what machine learning is not…
Machine learning is part of a bigger field: artificial intelligence.
AI is about making a computer do things which, if they were being done by a biological system, would imply intelligence.
Machine learning, specifically, is about creating programs which can make predictions based on data – a process which, if done by a human, would imply learning.
Out with the old programming
In traditional programming, we tell a machine exactly what we want it to do.
That description must be very clear, totally unambiguous, and written in a way that a computer can understand – this is coding.
In traditional programming, all learning aspects have already been done by a human, who then codes the result of their learning.
With machine learning, we use a set of techniques to do the learning for us, but the result will still need to be integrated by a human and turned into a system that’s useful.
At this point, it’s probably worth looking at an example…
Real-world example: value my property
An estate agent wants to build a system for estimating property values – they want to launch the “WeBuyAnyCar” website for the housing market.
They want (and expect) far too much traffic to make it feasible to send a human estimator to every property that comes to their site, so they need an automated way of generating reasonably accurate estimates for property values.
Taking the traditional route
With the traditional approach, we need to write a completely unambiguous definition of how to take information about a property and turn that into a number that’s reasonably close to the actual value of that property on the open market.
In order to come up with that definition, we need to understand how those values are actually produced. So, firstly we talk to some humans that actually do this.
We organise a big event for estate agents and in exchange for a nice lunch, try and gather and distil their combined knowledge into a set of simple rules. A few examples:
- “In area X, a house with 4 bedrooms will normally sell for between £100k and £150k”
- “Adding a garage will usually add about £10k to the price, unless it’s in [THESE] areas, in which case, not having a garage will reduce the value by £15k”
- “No house in area D will ever sell for more than £250k”
- “No house in area E will ever sell for less than £500k”
Now we’re ready to code up this set of rules and create a system that we can test. For a period of time, we pull together examples of properties from all over the area we’re interested in, feed that information into the system, and compare the estimate that we generate with one generated by a real human.
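As a rough sketch of what coding up these rules might look like, the Python below turns the four example rules into a single function. The area names, the figures for area D, the choice of the midpoint of a range as the starting value, and the `GARAGE_EXPECTED_AREAS` set (standing in for the “[THESE]” areas) are all assumptions made for illustration.

```python
# Hypothetical rule tables: every area name and figure here is illustrative.
AREA_BASE_RANGE = {
    # (area, bedrooms) -> (low, high) typical sale range in pounds
    ("X", 4): (100_000, 150_000),
    ("D", 5): (240_000, 280_000),
}
GARAGE_BONUS = 10_000
GARAGE_EXPECTED_AREAS = {"G"}   # stand-in for the "[THESE]" areas
NO_GARAGE_PENALTY = 15_000
AREA_CEILING = {"D": 250_000}   # "never more than" rules
AREA_FLOOR = {"E": 500_000}     # "never less than" rules


def estimate_value(area, bedrooms, has_garage):
    """Return an estimate in pounds, or None if no rule covers the property."""
    base = AREA_BASE_RANGE.get((area, bedrooms))
    if base is None:
        return None  # nobody told us about this kind of property
    low, high = base
    value = (low + high) / 2  # assumption: start from the middle of the range

    if area in GARAGE_EXPECTED_AREAS:
        if not has_garage:
            value -= NO_GARAGE_PENALTY
    elif has_garage:
        value += GARAGE_BONUS

    if area in AREA_CEILING:
        value = min(value, AREA_CEILING[area])
    if area in AREA_FLOOR:
        value = max(value, AREA_FLOOR[area])
    return value
```

Every number in those tables came from a person, and the function has nothing to say about a property that no rule covers.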
As expected, the first version of the system isn’t perfect. We quickly see that there are a number of cases where the system simply gets the values wrong. Debugging through the system, we realise that frequently it’s not a bug as such, but cases where the rules that we’re working with are incorrect, or out of date. For example:
- Property A is being undervalued by 50% compared to the actual for-sale price – we don’t have a specific rule for that area, which happens to be far more valuable than the nearest area which we do have a rule for
- Property B is overvalued by 30% – although it’s only 1 mile from a similar property, it’s in a different school catchment area, affecting all property prices in that area
- All the properties in Area C are now coming out at 30% over-value – a huge construction project has started there since we wrote the program, so values for every house have fallen
Here we’re seeing 2 types of error that are unavoidable with traditionally developed systems.
Some things are down to either missing or incorrect information used when the system was built. Others are due to factors which have changed since the system was built.
In a traditional environment, the only way to resolve these issues is for a human to analyse what’s going wrong, figure out how to solve it, and encode that solution back into the system. In many cases, this may be as simple as tweaking a couple of values to ensure that the system is properly weighting different factors – but in some cases, it may require a radical change to the way the system has been put together.
As time goes on, the system becomes more and more complex and the task of tweaking and maintaining it becomes more and more difficult.
Now, let’s go back to the start, but take a machine learning approach…
The machine learning approach
The fundamental difference here is that we’re not expecting the programmer to learn how to estimate property values, we’re going to build a system that can look at the raw data and produce predictions without explicitly being told how to.
Again, the first step of the process is data collection, but this time, we’re not trying to gather information about how other people do the estimation – no big lunch for estate agents.
This time, we want to know as much as possible about real properties which have actually sold, and how much they sold for. Fortunately, this kind of data is reasonably widely available in several open or cheap(ish) sources.
I’m not going to talk too much about where we get the data from, or how we convert it to a form that’s usable – that’s a whole different subject. For now, let’s assume we have a set of data that includes several factors about properties which have really sold, along with the price they sold for and when those sales took place.
With our dataset in hand, things start to get interesting.
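The idea is that an algorithm will look for relationships between the factors we’ve collected (the inputs) and the sale price (the output). It’s worth taking a minute to think about that, because there are a couple of important implications: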
- There’s an assumption that the output value is dependent on (or related to) input values. For example, if our dataset contained only facts which are at best marginally related to the final purchase price of the house – the colour of the carpets, how long it’s been since the bathroom was replaced, whether there are an odd number of stairs – then any relationships that the machine learning algorithm finds are likely to be similarly only marginally related to the purchase price. Garbage in, garbage out.
- When we apply machine learning to a problem, we’re looking for a general solution. For this to be possible, we need our dataset to be as representative as possible. In other words, it should cover as much of the range of actual values as possible. For example, if our dataset contains only 3-bedroomed houses, then we have very little hope of being accurate for either 1-bedroomed flats or 12-bedroomed mansions.
To sum up, when building the dataset (also known as the “training” set), it’s critical that it:
- includes the main factors that are related to the output value, and
- accurately represents the realistic range of those factors
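One simple way to check the second point is to record the range each factor covers in the dataset, and flag any new property that falls outside it. The sketch below assumes a hypothetical `Sale` record with just two factors and some made-up figures. A real dataset would have many more factors, and a range check is only a crude stand-in for "representative".

```python
from dataclasses import dataclass


@dataclass
class Sale:
    # Hypothetical record layout; real data would carry many more factors.
    bedrooms: int
    floor_area_m2: float
    price: int


def factor_ranges(sales):
    """Return the (min, max) seen in the dataset for each input factor."""
    return {
        "bedrooms": (min(s.bedrooms for s in sales),
                     max(s.bedrooms for s in sales)),
        "floor_area_m2": (min(s.floor_area_m2 for s in sales),
                          max(s.floor_area_m2 for s in sales)),
    }


def factors_out_of_range(ranges, **factors):
    """List the factors of a new property that the dataset does not cover."""
    return [name for name, value in factors.items()
            if not ranges[name][0] <= value <= ranges[name][1]]


# Illustrative data: a dataset containing only 3-bedroomed houses.
sales = [Sale(3, 85.0, 180_000), Sale(3, 92.0, 195_000), Sale(3, 78.0, 170_000)]
ranges = factor_ranges(sales)

mansion_gaps = factors_out_of_range(ranges, bedrooms=12, floor_area_m2=600.0)
typical_gaps = factors_out_of_range(ranges, bedrooms=3, floor_area_m2=80.0)
```

Here `mansion_gaps` names both factors, which is a warning that any estimate for the mansion would be an extrapolation, while `typical_gaps` is empty.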
Once we’ve got our dataset, it’s time to pick an algorithm, but first a couple of technical terms. An algorithm is a process which is applied to data – it’s the way that you turn data into something that can make predictions. A model is the thing that’s output from an algorithm – it’s the thing that gets encoded into the system that you’ll actually use to make the predictions.
The process of creating a model using an algorithm is called training the model.
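To make the distinction concrete, here is a minimal sketch using one of the simplest algorithms available: ordinary least squares with a single input factor. It is chosen purely for illustration, not because it suits property valuation. The function `fit_line` plays the role of the algorithm, the pair of numbers it returns is the model, and the sample figures are invented.

```python
def fit_line(xs, ys):
    """The algorithm: ordinary least squares for a single input factor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept  # the model: just two numbers


def predict(model, x):
    """Use a trained model to turn an input into a prediction."""
    slope, intercept = model
    return slope * x + intercept


# Invented training data: floor area in square metres and sale price in pounds.
floor_areas = [50, 75, 100, 125]
prices = [100_000, 150_000, 200_000, 250_000]

model = fit_line(floor_areas, prices)  # training the model
estimate = predict(model, 90)          # using the model
```

The algorithm is only needed while training. Once it has produced `model`, the two numbers and the one-line `predict` function are all that the live system has to carry around.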
Choosing an algorithm and training a model
This is a huge topic in and of itself which I’m only going to touch on briefly. Which algorithm you choose will be based on a variety of factors including:
- What kind of values are you trying to predict? Are they continuous (like a value in £s or $s or a probability), boolean (a patient does or doesn’t have cancer), or categorical (a given petal comes from a certain species of plant)?
- Will you need to explain why a decision was made? Some algorithms will produce models which lend themselves well to explanation – others involve extremely complex or abstract mathematical concepts which cannot really be explained in terms that people outside the field would understand.
- What algorithms are your team familiar with? There are a huge number of algorithms in the field – and that number is growing.
- How well does your data represent the space that you’re trying to model? If you have lots of data which covers the space thoroughly, that would suggest one class of algorithm. If you have data which is less representative, then your model will need to generalise more – and that may suggest a different class of algorithm.
- How much time/money do you have to develop the model? Some algorithms are very resource intensive – both in human and computer terms.
In all honesty, it’s relatively common to try a variety of different algorithms to try and figure out which one performs best for your specific problem.
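A sketch of that trial-and-error loop, assuming two toy candidate algorithms (a "predict the mean price" baseline and a least-squares line) and a small invented dataset. Each candidate is trained on the same data and scored on examples it has not seen, and the one with the lowest error wins.

```python
def train_mean(data):
    """Baseline algorithm: always predict the average training price."""
    mean_price = sum(price for _, price in data) / len(data)
    return lambda area: mean_price


def train_line(data):
    """Least-squares line through (floor area, price) points."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda area: slope * area + intercept


def mean_abs_error(model, data):
    """Average size of the miss, in pounds, on the given examples."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)


# Invented figures: (floor area in square metres, sale price in pounds).
train = [(50, 100_000), (70, 140_000), (90, 185_000), (110, 215_000)]
held_out = [(60, 122_000), (100, 198_000)]

candidates = {"mean baseline": train_mean, "least-squares line": train_line}
scores = {name: mean_abs_error(algorithm(train), held_out)
          for name, algorithm in candidates.items()}
best = min(scores, key=scores.get)
```

Keeping a trivial baseline in the mix is a useful habit: if a sophisticated algorithm cannot beat "always guess the average", something is wrong with the data or the setup.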
Glossing over all the detail, once an algorithm has been selected, it’s time to train your model. As we mentioned above, training a model means that we’re running our dataset through our selected algorithm. The output is a model (a specific type of program) which maps the inputs to the outputs.
The hope is that, if it’s reasonably accurate for the data it’s been trained on, it’ll be similarly accurate for data that it hasn’t seen. That second part is the kicker – we can train a system to perform perfectly on all data we’ve got, but if it can’t generalise well, it’s a failure.
But how do we know if a model is going to generalise well? It’s standard practice when training a system to hold back a part of the dataset to use for testing – as long as the data that you hold back is still representative, you can be confident that the result is representative of the wider set of possible data.
To summarise – we split our dataset into the data that we’re going to use for training (the training set) and the data that we’re going to use for testing (the test set). We train our model using the training set, and then run the factors from the test set through to see what values it predicts – and we compare those values to the values that we know are actually true. If the values that we predict are close enough to the true values, then we can say that our model represents that data well and can be used for predictions in the real world. This rarely happens first time around, and it’s common to get into a loop of tweaking settings and trying again.
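The split-train-test cycle can be sketched in a few lines. Everything here is an assumption made for illustration: the data is synthetic (generated with a fixed random seed), the "algorithm" is a toy that learns a single price-per-square-metre rate, and a 25% test set is just one common choice.

```python
import random


def split(records, test_fraction, seed):
    """Shuffle a copy of the data and hold back a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def train(records):
    """Toy algorithm: learn one price-per-square-metre rate."""
    total_price = sum(price for _, price in records)
    total_area = sum(area for area, _ in records)
    return total_price / total_area


def mean_abs_pct_error(rate, records):
    """Average percentage gap between predicted and true prices."""
    errors = [abs(rate * area - price) / price for area, price in records]
    return 100 * sum(errors) / len(errors)


# Synthetic sales: roughly 2,000 pounds per square metre, plus noise.
rng = random.Random(0)
sales = [(area, 2000 * area + rng.uniform(-10_000, 10_000))
         for area in range(40, 140, 5)]

training_set, test_set = split(sales, test_fraction=0.25, seed=1)
model = train(training_set)                       # the model never sees test_set
test_error = mean_abs_pct_error(model, test_set)  # judged only on unseen data
```

Whether `test_error` counts as "close enough" is a business decision rather than a mathematical one. An estate agent might tolerate a few percent, for example.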
If this all sounds a little experimental and time-consuming, that’s because it is.
You could accurately liken the process of building your dataset, selecting your algorithm, tweaking the inputs to the algorithm, training your model, and then testing it, to looking for a fragment of a needle in several haystacks.
That’s why, even though many of the machine learning algorithms have been around for decades, they’re only now coming to a point of being useful. With the advent of truly massive datasets, combined with cheap and plentiful computing power and massive advances in automation, we can suddenly throw huge volumes of data and computation at problems which previously were simply too big to be addressed.
Machine learning for estate agents
Let’s get back to our example.
Assuming all has gone well, our machine learning team have successfully converted data, caffeine and computing power into a model that predicts values with a reasonable degree of accuracy.
The good news is, the heavy lifting is pretty much done – at least for now. Typically, a machine learning model is much cheaper to run than it is to train – and frequently, it’ll be expressed in a very concise form which can easily be encoded in an application which can be linked to a website, for example.
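To illustrate how concise a trained model can be, the sketch below assumes the model is a straight line described by two numbers. The values are invented, and a real model would typically have more parameters. The training side saves the numbers as JSON, and the website side only has to load them and do one multiplication and one addition per estimate.

```python
import json

# Training side: hypothetical parameters produced by an earlier training run.
trained_model = {"slope": 2000.0, "intercept": 5000.0}
payload = json.dumps(trained_model)  # small enough to ship in a config file


# Website side: no training code or training data is needed, only the numbers.
def load_model(text):
    """Rebuild a prediction function from the saved parameters."""
    params = json.loads(text)
    return lambda floor_area_m2: (params["slope"] * floor_area_m2
                                  + params["intercept"])


estimate = load_model(payload)
valuation = estimate(100)  # one multiplication and one addition per request
```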
We can now launch WeBuyAnyHouse, and for the first 2-3 months everything is fine.
Over time though, we’ll inevitably start to see problems; let’s refer back to that traditional approach and the problems we faced.
First, valuations were wrong because something hadn’t been captured by our rules – a new area, school catchment areas and so on. With machine learning, if we’ve built our dataset representatively and we’ve taken care to ensure the inclusion of all factors, this kind of error should be rare.
Second, we had errors caused by circumstantial change since the model was built. Unfortunately, this is something we’ll still face with machine learning solutions – our model is based on data it’s seen, and it’s static.
Fortunately, as long as our algorithm selection is still valid, it’s usually a case of simply updating the dataset and re-training the model. Frequently, this kind of thing can also be automated. It’s pretty common to have a separate system to retrain the model periodically, specifically to work around this kind of problem where there’s a constantly moving factor which isn’t directly included in the dataset – the “heat” of the local housing market in this case.
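A periodic retraining job might look something like the sketch below. It assumes a toy price-per-square-metre model, a one-year sliding window and a handful of invented sales. The key point is that each run trains only on recent sales, so a shift in the local market works its way into the model without anyone editing rules.

```python
from datetime import date, timedelta


def recent_sales(sales, today, window_days=365):
    """Keep only sales inside the sliding window ending today."""
    cutoff = today - timedelta(days=window_days)
    return [sale for sale in sales if sale[0] >= cutoff]


def train(sales):
    """Toy algorithm: learn one price-per-square-metre rate."""
    total_price = sum(price for _, _, price in sales)
    total_area = sum(area for _, area, _ in sales)
    return total_price / total_area


def retrain(sales, today):
    """What a scheduled job would call, for example once a month."""
    return train(recent_sales(sales, today))


# Invented records: (sale date, floor area in square metres, price in pounds).
# Prices per square metre drop in 2024, as if a construction project had begun.
sales = [
    (date(2022, 3, 1), 100, 200_000),
    (date(2022, 6, 1), 80, 160_000),
    (date(2024, 2, 1), 100, 150_000),
    (date(2024, 4, 1), 60, 90_000),
]

stale_rate = train(sales)                       # trained once, on everything
fresh_rate = retrain(sales, date(2024, 6, 1))   # trained on the last year only
```

The window length is a trade-off. A shorter window reacts faster to change but leaves less data to learn from.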
There are cases where it emerges that enough has changed within the system we’re modelling that we have to go almost back to basics to generate an updated model. Fortunately, in a machine learning environment, this going back to a clean slate is far less traumatic (and expensive) than it would be in a traditional environment.
In fact, in many cases it’s reasonable to continue to experiment and produce separate and different models constantly – even once a model has been produced and gone into production – because the effort to swap one model out and another in can be minimal.
Machine learning definitely follows the old axiom that any sufficiently advanced technology looks like witchcraft.
It’s easy to get buried in the hype, but the basic fact is that machine learning is really just a way of automatically finding and encoding relationships between input data and an output value.
The main things you need are a strong dataset and an idea of the factors that might affect your output value. Algorithm selection, model training and testing can (and should) be done iteratively. Once you’ve got a model that works, it’s critical that you continue to test, retrain or even completely revise it.
I’ve deliberately simplified some things here and glossed over others, but I wanted to show you that, broken down, machine learning isn’t all that scary – it’s incredibly useful, and is finding new uses as people become used to these concepts.
Hopefully I’ve given enough of a framework to identify the main challenges and tasks involved in developing a machine learning system. As with pretty much any kind of witchcraft, the devil is in the details.