Introduction
Have you ever built a predictive model only to discover that your best result was not good enough to deploy? The threshold for “good” usually depends on the situation. It could mean, “Will we make a profit if we deploy this model?” or “Are we likely to enjoy a more favorable outcome by implementing this model?” You likely employed one or more of the most popular algorithms, such as decision trees, logistic regression, or even random forests, only to discover that none of them met your performance requirement when evaluated on unseen data. Now, after investing considerable time, you are only 90% of the way there.
If this describes your situation, then trying Keras sequential models could be your next best step, especially when combined with the other best practices I describe below. With Keras, you may quickly find that extra 10% of performance without a huge investment of time and resources. Ideally, you won’t have to cut your losses, take no action, and settle for status quo operations. In my own experience, using Keras has more often than not delivered enough incremental model improvement to validate my business and analytic intuition and to push my models across the finish line into deployment and profitability.
Tips and Tricks
Before jumping into Keras, make sure you are getting the most out of your current modeling framework. Implementing several solid tips and tricks may be sufficient to increase model performance on out-of-sample data. One way to boost performance is to gather more and higher-quality inputs through judicious feature engineering and refinement of the available raw data. Feature engineering works best when guided by expert knowledge of cause-and-effect relationships; in the absence of such knowledge, you risk identifying false patterns through the “vast search effect”. Regularization techniques such as ridge and lasso regression (as implemented in glmnet), and variable selection and reduction techniques such as Boruta, can improve performance on unseen data by reducing the effective degrees of freedom in the model and thereby reducing the chance of overfitting. This strengthens confidence that the model will perform as expected when deployed. If regularization and feature engineering are insufficient, you can try neural networks, which can capture a rich set of potential nonlinear relationships. While neural networks are renowned for their potential for additional accuracy, they are notorious for being black boxes whose results are difficult to interpret.
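If you want to try the regularization route before moving to neural networks, the sketch below illustrates lasso-style variable selection. It is a minimal example using scikit-learn’s LassoCV on synthetic data (rather than R’s glmnet mentioned above); features whose coefficients shrink to exactly zero are effectively dropped from the model.

```python
# A minimal sketch of lasso regularization for reducing effective degrees of
# freedom, using scikit-learn on synthetic data (an assumption; the text
# above mentions glmnet, an R package).
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                      # 20 candidate features
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=500)

# LassoCV chooses the penalty strength by cross-validation; coefficients
# shrunk to exactly zero mark features the model effectively discards.
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)
kept = np.flatnonzero(model[-1].coef_ != 0)
print("features retained:", kept)
```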
But what does a client mean when they demand interpretability? A critic of black box models decries the lack of a functional or mechanistic connection between inputs and output. To them, interpretability means traversing the branches of a decision tree to read out a rule, or inspecting the sign and magnitude of regression coefficients to know the weight and direction of each factor. However, a more pragmatic definition of the interpretability many users seek is: can I easily poke the model (i.e., change the inputs) and observe the output response? If the answer is “yes”, then the user has the feeling of interpretability, and that may be all that matters!
Keras for Neural Networks
Keras sequential models may provide the 5% to 10% performance boost needed to deploy a model and achieve success. As neural networks, Keras sequential models take advantage of input interactions and non-linearities, with the added benefit of an easy-to-implement, modifiable building-block structure. As stated on the Keras website, “Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research.” While Keras includes deep learning modules for image classification and natural language processing, the focus here is on sequential models, which serve the needs of both regression and classification problems. With Keras, the user has great flexibility in regularizing the model to prevent overfitting by controlling the training duration (via the number of epochs) and adjusting the node dropout rate within each hidden layer of the network. Note that dropout is short for randomly “dropping out”, or omitting, both hidden and visible nodes within a neural network during model training. In addition, the user can easily modify the structure by adding or subtracting layers and by specifying the number of nodes in each layer. With so many tunable features to choose from, don’t lose sight of the end goal: building a model that performs well on out-of-sample (unseen) data. Here are some practical guidelines that worked well in my tests, followed by a short code sketch that ties them together.
- Choose a validation split fraction of 20% to 30% along with early stopping. By tracking model performance in real time, I could quickly learn how many epochs were needed to train the model without overfitting. When I did not have a good choice of model structure (number of layers and nodes per layer), I could fail early and move on.
- Choose an initial model structure of moderate complexity (not too simple, not too complex). After reading the discussion here, I experimented with three hidden layers, each having between 4 and 128 nodes. The best starting number of nodes will depend on the number of inputs and the number of cases in your data set.
- Vary dropout rate while keeping structure constant. With the structure set (above), I tried varying the dropout rate within hidden layers between 20% and 80% and discovered that 40% to 50% generally worked best. A structure with more nodes per layer generally benefitted from a higher dropout rate. This result was not surprising; a more complex model with more free parameters requires more regularization via dropout to generalize well out of sample.
- Final model build stage. Once I was happy with the structure of my Keras model and the amount of training, I removed the validation split and trained on all available training data with the number of epochs determined from the tests above. Then, I validated my model using a final holdout set. In my application, the holdout set included only data collected after the most recent training data. This gave me the best chance of producing a model that will perform well on unseen data. For data where the time stamp is less important, be disciplined in setting aside a holdout dataset that you will use only for final (or near-final) model validation.
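To tie these guidelines together, here is a minimal, self-contained sketch on synthetic data. The layer sizes, dropout rate, patience, and epoch budget are illustrative assumptions you would tune for your own problem; the pattern is what matters: tune with a validation split and early stopping, then refit on all training data for the epoch count you found.

```python
# A minimal sketch of the guidelines above: three hidden layers, ~40% dropout,
# a 20% validation split with early stopping, then a final refit on all
# training data. Data, layer sizes, and settings are illustrative assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)    # synthetic binary target

def build_model(n_nodes=32, dropout_rate=0.4):
    model = keras.Sequential([
        keras.Input(shape=(10,)),
        layers.Dense(n_nodes, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(n_nodes, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(n_nodes, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(2, activation="softmax"),        # class probabilities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Tuning stage: validation split plus early stopping to find the epoch count.
model = build_model()
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)
history = model.fit(X, y, validation_split=0.2, epochs=200,
                    callbacks=[early_stop], verbose=0)
best_epochs = int(np.argmin(history.history["val_loss"])) + 1

# Final build stage: retrain on all training data for that many epochs, then
# evaluate on a separate holdout set (not shown here).
final_model = build_model()
final_model.fit(X, y, epochs=best_epochs, verbose=0)
```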
Other Considerations
An added benefit of using Keras for classification is the confidence obtained from target label predictions. The output layer of a Keras sequential classification model typically uses a softmax activation, so the predicted probabilities across all possible class labels sum to one. In one application, I used Keras for binary classification and utilized the class probabilities to assign confidence to model predictions. For instance, class probabilities of {0.95, 0.05} suggested more confidence in a 0 versus 1 outcome than probabilities of {0.55, 0.45}. These probabilities can help guide post-processing decisions. For instance, you might have the option to preferentially act on certain cases over others; in that scenario, prioritize cases having extreme class probabilities (nearly 100% 0 or 1 in binary classification).
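As a small illustration, with made-up softmax outputs standing in for a trained model’s predictions, here is one way to turn class probabilities into a priority order for action:

```python
# A minimal sketch of ranking cases by prediction confidence. The probability
# values are hypothetical softmax outputs, not results from a real model.
import numpy as np

probs = np.array([[0.95, 0.05],     # very confident it is class 0
                  [0.55, 0.45],     # nearly a coin flip
                  [0.20, 0.80]])    # fairly confident it is class 1

predicted_class = probs.argmax(axis=1)    # hard label for each case
confidence = probs.max(axis=1)            # how decisive each prediction is
priority_order = np.argsort(-confidence)  # act on the most confident cases first
print(priority_order)                     # -> [0 2 1]
```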
Let’s say you followed the steps I advise above and have implemented Keras to solve your regression or classification problem. It performs better than your baseline models (decision trees, logistic regression, etc.), but you are still a few percentage points short of the finish line. Now what? First, consider your choice of loss function. For binary classification, categorical cross entropy is a good choice, while mean absolute error (MAE) works well for regression. In Keras, an optimizer must also be set, and Adam (a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments) has proven to be a good choice in my experiments with sequential modeling. Trying multiple loss functions and optimizers is relatively straightforward and should be attempted first. If that does not prove fruitful, ensembling is another good option: multiple Keras models may be built on bootstrap samples (bagging) and their predictions averaged. Some amount of individual model overfit is acceptable in ensembles, since weak learners can combine to generalize well. Ensembling and its role within the Keras sequential modeling framework remains an area of active research. Learn more about the power and promise of model ensembles here.
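Here is a minimal sketch of that kind of bagging ensemble, using synthetic regression data and the MAE/Adam choices discussed above; the number of bagged models and the network size are illustrative assumptions.

```python
# A minimal sketch of bagging Keras models: train several networks on bootstrap
# resamples and average their predictions. Data and settings are illustrative.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.3, size=1000)   # regression target

def build_model():
    model = keras.Sequential([
        keras.Input(shape=(8,)),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mae")   # Adam + MAE, as noted above
    return model

n_bags = 5
predictions = []
for _ in range(n_bags):
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap resample
    member = build_model()
    member.fit(X[idx], y[idx], epochs=20, verbose=0)
    predictions.append(member.predict(X, verbose=0).ravel())

ensemble_prediction = np.mean(predictions, axis=0)   # average across the bag
```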
Building a Keras model is not difficult, though ensuring it is high-performing can be daunting. The model building process requires domain expertise and some intuition regarding the interplay of multiple hyperparameters (number of layers and nodes, number of epochs, etc.). While I have set out some guidelines for tuning these hyperparameters, you may discover my suggested processes insufficient to meet your needs. If you still find yourself a bit short of the finish line, consider wrapping a global search around Keras hyperparameters. Dr. Elder’s Global Rd Optimization when Probes are Expensive (GROPE) algorithm is a great candidate for performing that search. Here, obtaining a single probe is “expensive” because it involves training and evaluating a Keras neural network.
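GROPE itself is not shown here, but the sketch below illustrates the “expensive probe” that any global optimizer would call repeatedly: a function mapping candidate hyperparameters to a validation loss. The hyperparameter names, ranges, and regression setup are assumptions for illustration.

```python
# A minimal sketch of an "expensive probe" for global hyperparameter search:
# each call builds and trains a Keras model and returns its validation loss,
# which the surrounding optimizer (GROPE or any other search) tries to minimize.
from tensorflow import keras
from tensorflow.keras import layers

def probe(n_layers, n_nodes, dropout_rate, X_train, y_train):
    model = keras.Sequential([keras.Input(shape=(X_train.shape[1],))])
    for _ in range(int(n_layers)):
        model.add(layers.Dense(int(n_nodes), activation="relu"))
        model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(1))                        # regression output, as an example
    model.compile(optimizer="adam", loss="mae")
    early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                               restore_best_weights=True)
    history = model.fit(X_train, y_train, validation_split=0.2, epochs=50,
                        callbacks=[early_stop], verbose=0)
    return min(history.history["val_loss"])           # score the optimizer minimizes

# Example call: probe(n_layers=3, n_nodes=32, dropout_rate=0.4, X_train=X, y_train=y)
```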
Closing Remarks
I am not advocating Keras sequential modeling as the “go-to” method for all your analytical needs. More easily implemented algorithms such as regression, decision trees, and random forests should be tried first. However, if other modeling techniques leave you short of the finish line, consider Keras. The Keras neural network structure has the potential to extract complex input interactions and nonlinearities without anyone needing to hypothesize or understand the underlying mechanics. Keras may be the missing ingredient that propels your project across the deployment finish line.