Automated Machine Learning and Its Tools

Posted on May 1st, 2019 by Anam Haq

Most of us know how tedious, time-consuming and error-prone it can be to design a good machine learning pipeline for a specific dataset or problem. To arrive at such a machine learning (ML) pipeline, a researcher has to run extensive experiments at each stage. These stages are as follows:

Preprocessing and cleaning the data.
Selection and transformation of relevant features.
Selection of an appropriate family of classification or regression models.
Optimization of the chosen model's hyper-parameters.
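
To make the four stages concrete, here is a minimal sketch of a hand-built pipeline in scikit-learn. The dataset, the particular steps (median imputation, scaling, PCA, logistic regression) and the tiny parameter grid are assumptions chosen only for illustration; they are exactly the kind of choices an Auto-ML tool is meant to make for the user.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset bundled with scikit-learn (an assumption, not from the post).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipeline = Pipeline([
    # Stage 1: preprocessing and cleaning.
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    # Stage 2: feature transformation.
    ("reduce", PCA()),
    # Stage 3: the chosen model family.
    ("model", LogisticRegression(max_iter=1000)),
])

# Stage 4: hyper-parameter optimization over a small, hand-picked grid.
param_grid = {
    "reduce__n_components": [5, 10],
    "model__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(test_accuracy, 3))
```

Every name in the step list and every value in the grid had to be picked by a person here, which is where the tedium comes from.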

The success of a machine learning pipeline relies heavily on the choice of algorithms at each stage. At each of these steps, the expert has to pick the appropriate algorithm or set of algorithms, along with the hyper-parameters of both the preprocessing steps and the ML model. For a non-expert, making these choices is hard, which limits the use of ML models or leads to poor predictive results. The rapid growth of machine learning applications has created a demand for off-the-shelf machine learning methods that can be used easily and without expert knowledge.

The idea behind Auto-ML (automated machine learning) is to make selecting the best algorithms (those suited to the given data) at each stage of an ML pipeline easier and independent of human input.

So, what does a typical Auto-ML model look like to a non-expert?

Something like the block diagram, where the Auto-ML model looks like a black box.

Well, it looks like a black box :p In reality, however, it is not. Various optimization schemes run inside this block to help determine the appropriate preprocessing and ML algorithms, along with a suitable set of parameters.
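
To give a feel for what runs inside the box, here is a rough sketch of a search over both model families and their hyper-parameters, scored by cross-validation. Note that this is plain random search, used here only because it fits in a few lines; it is not the optimization scheme of any particular Auto-ML tool. The `search_space` dictionary, the three model families and the budget of 12 trials are all hypothetical choices.

```python
import random

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = random.Random(0)

# Hypothetical search space: each model family maps to a function that
# samples one hyper-parameter configuration for it.
search_space = {
    "knn": lambda: KNeighborsClassifier(n_neighbors=rng.randint(1, 15)),
    "tree": lambda: DecisionTreeClassifier(
        max_depth=rng.randint(1, 8), random_state=0
    ),
    "forest": lambda: RandomForestClassifier(
        n_estimators=rng.choice([10, 25, 50]), random_state=0
    ),
}

best_score, best_model = -1.0, None
for _ in range(12):  # the search budget (an assumption)
    family = rng.choice(sorted(search_space))   # pick a model family
    candidate = search_space[family]()          # sample its hyper-parameters
    score = cross_val_score(candidate, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_model = score, candidate

print(type(best_model).__name__, round(best_score, 3))
```

Real tools replace the random sampling with something smarter that learns from the trials already run, but the loop of "propose a configuration, evaluate it, keep the best" is the same.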

Progress in Auto-ML started with one of the first well-known tools to implement the idea, “Auto-WEKA”, a package that integrates with WEKA. The algorithm Auto-WEKA uses to suggest pipelines is known as SMAC (sequential model-based algorithm configuration).

Since then, various other automated tools have become available. Some of the popular ones are:

Auto-sklearn (Auto scikit-learn): Implemented in Python on top of the very popular scikit-learn library. It is similar to Auto-WEKA in that it uses the same optimization scheme, i.e., SMAC.
TPOT: Also implemented in Python on top of scikit-learn, it calls itself a data science assistant. It represents pipelines as trees and uses genetic programming to find the right machine learning pipeline for the user.
H2O: Mostly used for big data held on cloud computing systems. Its main features are analytics and visualization of big datasets.

So far we have been discussing the idealized version of Auto-ML, one that can handle the selection of appropriate algorithms and parameters at each stage. In reality, however, this will take longer to achieve. All the Auto-ML tools available to date have some limitations. Here we focus only on Auto-WEKA and Auto-sklearn:

From our experiments we have found the following, starting with Auto-WEKA:

Auto-WEKA does not provide any suggestion regarding preprocessing and cleaning of data. It also does not provide any information regarding whether to use feature transformation algorithms or not.
The results displayed after running Auto-WEKA are misleading, as they show reclassification results (testing on the learning set), which in most cases are overly optimistic. To get reliable results, one should take the pipeline suggested by Auto-WEKA and rebuild it in WEKA using the Explorer or Experimenter mode.
Selection of classifiers is restricted to the models available within WEKA.
To run Auto-WEKA, the user has to specify a time limit, and the papers describing the tool suggest allocating at least 24 hours to obtain a reasonable ML pipeline.
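
The point about reclassification results is easy to reproduce outside WEKA. The sketch below is an illustration in scikit-learn, not Auto-WEKA's own evaluation code: it scores a decision tree once on the data it was trained on and once with 10-fold cross-validation. The synthetic dataset and its 20% label-noise setting are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberately noisy labels (illustrative assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0)

# Reclassification: fit and score on the very same samples.
resubstitution_accuracy = model.fit(X, y).score(X, y)

# Cross-validation: every sample is scored by a model that never saw it.
cv_accuracy = cross_val_score(model, X, y, cv=10).mean()

print("reclassification:", round(resubstitution_accuracy, 3))
print("10-fold CV:      ", round(cv_accuracy, 3))
```

An unpruned tree can memorize its training set, noise included, so the first number says very little about how the model will behave on new data. The cross-validated number is the one to trust.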

Now, on to the pitfalls of Auto-sklearn:

Auto-sklearn is built in Python on the scikit-learn (sklearn) library, which demands more awareness from the user: categorical data must be encoded appropriately by hand (WEKA does this automatically), so Auto-sklearn cannot handle raw categorical information. Any data with categorical information has to be transformed first using one-hot encoding or another encoding scheme.
Selection of classifiers is restricted to the models available within scikit-learn.
Auto-sklearn by default builds an ensemble of size 50 from the models it evaluates, rather than returning a single model. However, this setting can be changed, and a smaller ensemble size can be enforced.
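
As a small illustration of the encoding step mentioned above, here is one way to one-hot encode a categorical column with scikit-learn's `OneHotEncoder` before handing the data to any scikit-learn based tool. The toy `colours` column is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical feature with three levels.
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at prediction time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(colours).toarray()

# Column order follows the sorted category names.
categories = list(encoder.categories_[0])
print(categories)
print(encoded)
```

Each row ends up with a single 1 in the column of its category, so the model only ever sees numbers.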

In terms of flexibility, Auto-sklearn is more flexible than Auto-WEKA. It is much easier to add custom evaluation metrics and new algorithms, and it also offers preprocessing related to feature transformations (PCA etc.). However, the data needs some preprocessing, as mentioned earlier. Also, keep in mind that Auto-WEKA and Auto-sklearn rely solely on the ML algorithms provided by their host environments.

Both of these Auto-ML tools, Auto-WEKA and Auto-sklearn, have drawbacks and loopholes that should be investigated and resolved. We cannot yet say that the concept of Auto-ML has been fully achieved, but hopefully, in the coming years, we will see more sophisticated and comprehensive Auto-ML tools that make the use of machine learning more common.