I’ve long heard that “a watched pot never boils,” but when I am heavily invested in the outcome of a process, I still tend to monitor it intently. In my impatience I begin to wonder whether my attention is worse than useless, and is actually impeding its progress! With machine learning models, that may well be true. Let me explain.
Stop Attending to Model Training
We data scientists can be excitable creatures when it comes to watching our models retrain after we manually tweak their high-level parameters (or hyperparameters). I secretly cheer on the model at times, hoping for a much-improved result. After the dopamine rush at the moment of convergence, I’ll quickly assess the results and kick off another training run with an even more refined set of hyperparameters. While you obviously can’t slow down a training run just by watching it, you can slow down a whole project by paying too much attention to individual runs.
This raises the question: “Can data scientists be replaced with unemotional, hyper-multitasking, artificially intelligent agents?” I personally believe data scientists are here to stay, and that the real question is better put as, “Have data scientists developed ways to reliably automate model tuning?” The answer is a resounding yes. Let’s explore hyperparameters in more detail and survey some ways to find good sets of them.
What Are Model Hyperparameters?
Hyperparameters are the high-level “knobs” or “levers” of a model. For example, key hyperparameters of a Random Forest are the number of decision trees in the forest and the maximum depth of a given tree. For a Deep Neural Network, the most recent star of machine learning, examples would be the number of layers, the size of each layer, and the learning rate. Hyperparameters stay fixed during a single training run but are adjusted between the many passes of the time-consuming model training process, with each pass typically containing multiple sub-runs for Cross-Validation. Adjusting the hyperparameters after each training run to get the best outcome on a chosen scoring metric is an iterative process referred to as hyperparameter optimization, or model tuning. This process is what we seek to automate using the techniques described below.
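Before turning to those techniques, here is a minimal sketch of what these knobs look like in code. It uses scikit-learn purely for illustration, and the particular parameter values are arbitrary placeholders, not recommendations.

```python
# A single training "pass" with fixed hyperparameters, scored by Cross-Validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Two key Random Forest "knobs": the number of trees and the maximum tree depth.
model = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=0)

# Each pass contains multiple sub-runs for Cross-Validation (cv=5 here).
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Mean CV accuracy: %.3f" % scores.mean())
```

Model tuning is the loop wrapped around this: change n_estimators and max_depth, re-run, compare scores, and repeat.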
Grid Search Optimization
One of the simplest methods is a brute-force march through a pre-determined list of possible combinations. One creates a list (or grid) of candidate values for each hyperparameter and loops through all the possible combinations to see which is optimal. Grid Search is easy to program and works well for a modest number of dimensions (hyperparameters) and a relatively fast-training model, but its cost grows exponentially as the problem is scaled up, as happens in many real-world cases. The cost in time can easily be estimated for a particular granularity (detail of the grid), dimensionality (number of hyperparameters), and probe time (training run time), so short experiments can quickly tell you whether this simple-to-program search method will suffice.
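As a sketch of what this looks like in practice, here is Grid Search with scikit-learn’s GridSearchCV, one common implementation; the candidate values below are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A pre-determined grid of candidate values for each hyperparameter.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, 16, None],
}

# Brute-force march: 3 x 4 = 12 combinations, each cross-validated 5 times.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Adding a third hyperparameter with five candidate values would multiply those 12 training runs to 60, which is the exponential growth at work.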
Random Search Optimization
Grid Search has a major drawback: it is much less useful if it doesn’t have time to run to completion. It actually spends most of its time evaluating points in the boundary regions near the edges of the search space rather than near the center, where you’d like it to concentrate. So, instead of exhaustively searching through every combination of hyperparameters, what if we randomly sample values? We can even sample from non-uniform probability distributions for continuously valued hyperparameters, thereby guiding more samples toward the center of the region of interest. Random sampling has been shown to usually produce better results in less time than Grid Search. In addition, Random Search allows a tuning budget to be set independently of the size of the problem (e.g., the number of hyperparameters); that is, you set how many iterations you want and can stop at any time. Lastly, it is even easier to program.
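A hedged sketch of the same setup using scikit-learn’s RandomizedSearchCV; the distributions are simple ones chosen for illustration.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Sample hyperparameters from distributions instead of a fixed grid.
# A non-uniform choice such as scipy.stats.loguniform could be used instead.
param_distributions = {
    "n_estimators": randint(50, 500),   # integer-valued hyperparameter
    "max_features": uniform(0.1, 0.9),  # continuous-valued hyperparameter
}

# The budget (n_iter) is set independently of the number of hyperparameters,
# and the search can be stopped after any iteration.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=25, cv=5,
                            scoring="accuracy", random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```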
Gradient Search
A major drawback of Grid Search and Random Search is that they ignore the results you get! That is, they don’t use previous search results to decide where to look next. A Gradient Search pays attention to your recent results and, say you are trying to minimize error, will try to move you in the most downhill direction. It can overshoot, so it has to be able to back up and change direction (at right angles) when one line of search plays out, which leads to Conjugate Gradient Search algorithms and the like. Another major concern is getting stuck in a local minimum while looking for the elusive global minimum.
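A toy illustration of the idea, using scipy’s Conjugate Gradient optimizer on a made-up, smooth two-dimensional “error surface”; real hyperparameter surfaces are rarely this well behaved, and the function and starting points here are invented for the example.

```python
import numpy as np
from scipy.optimize import minimize

# A made-up, smooth "error surface" over two continuous hyperparameters.
def error_surface(params):
    x, y = params
    return (x - 1.0) ** 2 + 10.0 * (y + 0.5) ** 2 + np.sin(3.0 * x)

# Conjugate Gradient search: each step uses local slope information, and
# successive search directions are chosen so they do not undo earlier progress.
result = minimize(error_surface, x0=[0.0, 0.0], method="CG")
print(result.x, result.fun)

# Starting from a different point may settle into a different (local) minimum,
# the classic weakness of purely downhill methods.
result2 = minimize(error_surface, x0=[2.5, 0.0], method="CG")
print(result2.x, result2.fun)
```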
Multi-Start Search
There are usually many different minima, each with its own “basin of attraction” from which it draws points, much like the different ponds, lakes, and oceans on a continent gather raindrops depending on where they first fall. One way to find a collection of local minima that compete for the global title is to restart a good local search algorithm many times from many different initial conditions; this approach is known as Multi-Start Search.
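One hedged sketch of the strategy, reusing the toy error surface from the previous example (redefined here so the snippet stands alone):

```python
import numpy as np
from scipy.optimize import minimize

# The same made-up "error surface" over two continuous hyperparameters.
def error_surface(params):
    x, y = params
    return (x - 1.0) ** 2 + 10.0 * (y + 0.5) ** 2 + np.sin(3.0 * x)

rng = np.random.default_rng(0)

# Restart a good local search from many random initial points, then keep
# the best local minimum found: the Multi-Start strategy.
results = [minimize(error_surface, x0=rng.uniform(-5, 5, size=2), method="CG")
           for _ in range(20)]
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)
```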
Sequential Model-Based Optimization (SMBO)
The most sophisticated way to capture and use search result information is to build up a model of the underlying score surface, that is, to form a picture of the surface and interrogate that model to figure out where the optimum point is most likely to be. This method was actually used by John Elder1 decades ago (along with a similar method by Cox & John2) to beat all then-existing global search algorithms on a large suite of two-dimensional test functions.
Recently, the idea has been rediscovered and called Sequential Model-Based Optimization (SMBO). SMBO is similar to Grid Search and Random Search in that it is a black-box technique that is not as dependent on a smooth surface as Gradient Search and its variants. The method comes in multiple forms depending on the underlying model that is trained to generate the next hyperparameter set (see the Resources section below for links to technical details3,4). SMBO is the most intelligent, and often the best, method for hyperparameter optimization, providing savings in time and, ultimately, project costs.
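To make the loop concrete, here is a minimal toy sketch of SMBO using a Gaussian Process surrogate and an Expected Improvement acquisition function (one common choice among several); the one-dimensional error surface and all values are invented for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# A made-up one-dimensional "error surface" over a single hyperparameter.
def error_surface(x):
    return np.sin(3.0 * x) + 0.3 * (x - 1.0) ** 2

rng = np.random.default_rng(0)
X_obs = rng.uniform(-2, 4, size=(3, 1))   # a few initial probes
y_obs = error_surface(X_obs).ravel()

candidates = np.linspace(-2, 4, 500).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(15):
    # 1. Fit a surrogate model of the score surface to everything seen so far.
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)

    # 2. Expected Improvement: balance exploiting low predicted error against
    #    exploring regions where the surrogate is still uncertain.
    best = y_obs.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    # 3. Probe the most promising point and add the result to the history.
    x_next = candidates[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, error_surface(x_next[0]))

print("Best value found:", X_obs[np.argmin(y_obs)].item(), "score:", y_obs.min())
```

Each probe here stands in for a full (and expensive) training run, which is why spending a little computation deciding where to probe next can pay for itself many times over.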
Conclusion
By using the powerful tools available for hyperparameter optimization to automate the tuning of machine learning models2, data scientists can attend to other technical aspects of a project while models train, making the whole effort more efficient. This is a win-win situation: data scientists can engage in more creative analysis and problem solving, while project owners benefit from reduced risk to the project timeline and increased confidence in the overall success of the project. Let the pots boil!