There is a lot of hype about automated machine learning at the moment, with multiple vendors boasting that they are the best at it. Yet many of the data scientists I have talked to remain skeptical, and that puzzles me. The market always evolves; I have no doubt we will get to a point where even data scientists find a way to leverage Auto ML. But what about today?
Automated Machine Learning, or Auto ML, is all about creating, training and testing machine learning models with as little human intervention as possible. Twenty years ago, the CRISP-DM approach was created, and its steps are well-known and standard. It is hard to deny that some of those steps, like data preparation or the selection and training of models, could be automated, since they are so repetitive.
Every organization leveraging machine learning should explore Auto ML, because human effort takes time, and time is costly. Imagine if an Auto ML tool could test out multiple models and select the best one, instead of the team picking the one it thinks will likely give the best results and simply running with it. When a single experiment takes weeks, decision-makers must make educated guesses. The objective of Auto ML, then, is to automate as many of these steps as possible.
Sounds great: let's not get caught up in the religious debate over whether everything can be automated, and instead look at Auto ML as a way to speed up the data science process and produce a better end product. In some circumstances, that means a better-performing model. In others, it allows a team to run more experiments and assess their potential value at a faster pace, thus increasing productivity.
Are there potential problems with this?
For sure, we can accept the premise that the more standard data science problems can be automated; this is not a massive leap, as tools do this today (see the table below for a sample). The process goes like this: we start with some data, assuming it is reasonably clean and representative of the problem we are trying to solve. We choose a model, train it on the training set, and evaluate it on the test set. If performance is decent, we deploy the model to production. We may add some hyperparameter optimization steps to get the best performance. All of this is very repetitive.
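The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: it assumes scikit-learn, a synthetic dataset, and made-up candidate models and parameter grids.

```python
# Sketch of the automated loop: try several model families, tune
# hyperparameters, and keep the best performer on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for "reasonably clean, representative" data.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate models and illustrative hyperparameter grids.
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
]

best_model, best_score = None, -1.0
for model, grid in candidates:
    search = GridSearchCV(model, grid, cv=3)   # hyperparameter optimization
    search.fit(X_train, y_train)               # train on the training set
    score = search.score(X_test, y_test)       # evaluate on the test set
    if score > best_score:
        best_model, best_score = search.best_estimator_, score

print(type(best_model).__name__, round(best_score, 3))
```

An Auto ML tool essentially runs a much larger, smarter version of this loop, then hands the winner off for deployment.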
As you probably suspect, it is all about the data. Things get more complicated when the data is not perfectly aligned with the problem (does this ever happen?). An expert or data scientist must infuse domain or other unique knowledge to pre-process or filter the data before we can move to the modelling step. We want to present the data to the model in a way that accounts for external factors that no level of algorithmic sophistication could know about. It is no surprise that 80% of data scientists' work is in data preparation.
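What that expert step might look like in code: a small, hand-written transformation applied before anything automated runs. The column names and the domain rules here are entirely hypothetical, just to show the shape of the idea.

```python
# Sketch of infusing domain knowledge before the automated modelling step.
import pandas as pd

def expert_preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply expert-written rules before handing data to Auto ML."""
    # Hypothetical domain rule: negative readings are sensor errors, not data.
    df = df[df["reading"] >= 0].copy()
    # Hypothetical domain knowledge: weekends behave differently, encode that.
    df["is_weekend"] = pd.to_datetime(df["timestamp"]).dt.dayofweek >= 5
    return df

raw = pd.DataFrame({
    "timestamp": ["2021-03-06", "2021-03-08"],  # a Saturday and a Monday
    "reading": [4.2, -1.0],
})
clean = expert_preprocess(raw)
```

No model search, however exhaustive, would know that negative readings are sensor glitches; that knowledge has to come from a human.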
For example, back in 2012, an app called Street Bump was launched in Boston, aiming to crowdsource the location of potholes in the city. At first, it appeared that most potholes were in affluent neighbourhoods. After some analysis, the team realized that most of the app's users came from affluent areas. Understanding the context, where the data comes from, how it was collected, and a whole slew of other factors is essential to making sure your model will work well.
So Guided Auto ML is about exactly this: allowing experts to interact at specific points in the data pipeline. Adding these strategic points of interaction is not a bad idea, because data changes over time and data pipelines get reused (hopefully) by others, perhaps on different data (same format, but different geographies, for instance). Guided Auto ML is a combination of workflow software (for the interactions) and Auto ML.
One tool that takes this approach is Knime.
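One way to picture the idea, independent of any particular tool, is an automated pipeline with named checkpoints where an expert hook can inspect, adjust, or approve the data before the run continues. The stage names, the toy stages, and the review callback below are all illustrative assumptions.

```python
# Sketch of Guided Auto ML: automated stages with expert checkpoints between them.
from typing import Callable

def run_pipeline(data, stages, review: Callable[[str, object], object]):
    """Run automated stages, pausing at each named checkpoint for review."""
    for name, stage in stages:
        data = stage(data)
        data = review(name, data)  # expert may inspect or adjust here
    return data

# Toy automated stages standing in for real pipeline steps.
stages = [
    ("clean", lambda d: [x for x in d if x is not None]),
    ("scale", lambda d: [x / max(d) for x in d]),
]

def expert_review(stage_name, data):
    # In a real workflow tool this would surface a UI or approval step;
    # here the expert simply waves the data through.
    return data

result = run_pipeline([3, None, 6], stages, expert_review)
```

The automation does the repetitive work; the checkpoints are where the Street Bump-style context checks happen.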
An article in HBR by Ahmed Abbasi, Brent Kitchens and Faizan Ahmad last October suggests that this is the preferred approach because of various biases. They call it augmented Auto ML, or AugML. Their recommendations focus on how the data can take your experiment sideways. Automation will never replace expert knowledge about the context and how the data is presented to the model, so we should bake that knowledge in.