Top 3 Lessons Learned While Drinking from the Data Science Firehose


Sam Ballerini

Date Published: November 30, 2018

I had one hard requirement during my job search: wherever I ended up, I wanted to drink from the “data science firehose.” I wanted to work alongside seasoned data scientists with diverse skillsets and an unadulterated passion for solving problems with data. I wanted to leave the office after my first day asking myself, “How in the world am I going to keep up with these people?” And that’s exactly what I’ve gotten at Elder Research.

I would like to share three lessons I’ve learned as a “new” data scientist at Elder Research. My purpose is twofold:

  1. To give an inside look at the rigor of Elder Research’s problem solving approach
  2. To show prospective employees what they can expect from a career as a data scientist

Each of these lessons stems from a recent engagement in which the client wanted to predict whether a particular product would pass a regulatory audit. Our data included a small number of products with test results from a five-year period and a much larger population of products without test results.

Lesson #1: Data Exploration Can Make or Break a Model

I can’t stress enough how important data exploration was on this project, and by data exploration, I mean more than just building histograms and scatterplots. Data exploration is the loosely defined process of discovering things in your data that make you go, “Hmm…” It could be finding a group of observations (products, in our case) that look eerily similar, a variable with a lot of missing values, or anything else that stands out.

As we sifted through the data, we came to a sudden realization: more than half of the 20,000 products were exact duplicates! It wasn’t initially evident because they all had unique IDs, but once we dropped those IDs, the rest of each product’s information was identical. This was an important discovery because an analytical model will overweight duplicate data in the decisions it makes.
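The duplicate check described above can be sketched in a few lines of pandas. This is an illustrative example, not the project’s actual data; the column names (`product_id`, `weight`, `category`) are hypothetical stand-ins.

```python
# Hypothetical sketch of the duplicate check: rows that differ only
# in their ID column look unique until the ID is dropped.
import pandas as pd

products = pd.DataFrame({
    "product_id": ["A1", "A2", "A3", "A4"],
    "weight":     [10.0, 10.0, 12.5, 10.0],
    "category":   ["X",  "X",  "Y",  "X"],
})

# Naive check: every row looks unique because product_id differs
naive_dupes = products.duplicated().sum()  # 0 duplicates found

# Drop the ID before checking: three of the four rows collapse into one
true_dupes = products.drop(columns="product_id").duplicated().sum()  # 2

# Deduplicate on every column except the ID
deduped = products.drop_duplicates(subset=products.columns.drop("product_id"))
```

Checking `duplicated()` both with and without identifier columns is a cheap habit that would have caught this issue immediately.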

In addition to unraveling the duplicate records, we spent significant time understanding our variables with missing values. Were they missing at random or by design? Does “missing” mean something other than “we don’t know the value of this variable”? In our case, it did: each product that was audited could have up to five tests. Certain tests were not applied to some products because those products lacked the components required by all five tests. Products with fewer than five tests were missing values for the tests that didn’t apply to them; the missingness wasn’t random. This was a key discovery for the construction of our target variable, the thing we’re trying to predict.
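The missing-by-design pattern above can be made explicit by deriving, for each product, how many tests it was actually eligible for. The sketch below is illustrative; the test columns and values are hypothetical, with `NaN` standing in for “test did not apply.”

```python
# Hypothetical sketch: when NaN means "test not applicable" rather than
# "value unknown," the count of non-missing tests is itself a feature.
import numpy as np
import pandas as pd

audits = pd.DataFrame({
    "product_id": ["A", "B", "C"],
    "test_1": [0.91, 0.88, 0.95],
    "test_2": [0.85, np.nan, 0.90],
    "test_3": [np.nan, np.nan, 0.87],
})

test_cols = ["test_1", "test_2", "test_3"]

# Number of tests each product was eligible for (non-missing columns)
audits["n_tests"] = audits[test_cols].notna().sum(axis=1)
```

A derived count like `n_tests` matters later: a product eligible for more tests has more opportunities to fail, so raw failure counts are not comparable across products.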

Lesson #2: Choose Your Target Wisely

The client provided us with a binary target variable, failure, that indicated a product’s test result. After choosing our candidate input predictors of failure and gathering them into a unified table, we searched for a relationship between them and the target. Despite our best efforts, we couldn’t find a meaningful relationship, so we revisited the target variable to better understand how it was defined.

For each product, failure was defined relative to how a manufacturer claimed that their product would perform on each test. If the test measurement was a set percentage below the product’s claimed measurement, the product was deemed a failure. In other words, a product that passed 4 out of 5 tests with room to spare in 2002 was treated no differently than a product that passed 1 out of 5 tests by a small margin in 1992. Not only was the target not as granular as it should have been, but it did not account for changes in the tests over time (the average test measurement in 1992 may be very different from that of 2002 due to changes in product engineering or regulations). Going back to our missing test results discovery, we realized that products eligible for all five tests had more opportunities to fail than those with just three tests. With this in mind, we engineered a target variable to capture the degree of product failure (or how close it was to passing) while adjusting for the time between tests and the number of tests.
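One way to capture the degree of failure while adjusting for drift over time is to measure each product’s shortfall relative to its claimed value and then standardize within test year. The sketch below is a simplified illustration of that idea, not the project’s actual formula; all column names and values are hypothetical, and the per-test-count adjustment discussed above is omitted for brevity.

```python
# Hedged sketch of a continuous, time-adjusted failure target.
# "shortfall" = how far the measurement fell below the claim;
# standardizing within year makes 1992 and 2002 shortfalls comparable.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year":     [1992, 1992, 2002, 2002],
    "claimed":  [1.00, 1.00, 1.20, 1.20],
    "measured": [0.80, 0.95, 1.10, 1.25],
})

# Relative shortfall (positive = worse than claimed)
df["shortfall"] = (df["claimed"] - df["measured"]) / df["claimed"]

# Standardize within each year to adjust for changes in tests over time
df["target"] = df.groupby("year")["shortfall"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)
```

A continuous target like this preserves the distinction between a product that barely passed and one that passed with room to spare, which a binary flag throws away.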

When the initial approach did not deliver the desired results, we pushed forward to address the root of the problem, an ill-defined target variable, and circled back with our client to make sure they understood and agreed with our decision to establish a more granular one. Using the revised definition of failure (our target) produced a predictable relationship, and we moved on to the next step in the process.

Lesson #3: “Cutting-edge” Algorithms Don’t Always Win

At Elder Research, we pride ourselves on staying at the cutting edge of data science. When the latest techniques boast impressive performance, we read the papers, implement the methods, and test them on different datasets. In this particular engagement, the data lent itself to a technique called semi-supervised learning, which leverages large amounts of unlabeled data to improve predictions on labeled data.

Semi-supervised learning is a compromise between the two main types of machine learning, supervised and unsupervised learning:

  • Supervised Learning: In this scenario the model (learner) discovers relationships between the predictor variables and the target, or label. You can think of a supervised learner as a student preparing for an exam with a pretest and an answer key: the questions in the pretest are our predictors, and the answers in the answer key are our labels. When it comes time to take the exam, the student draws on his knowledge of the pretest to answer new, never-before-seen questions.
  • Unsupervised Learning: In this scenario the student has the pretest, but he doesn’t have the answer key, so the exam is much more difficult.
  • Semi-supervised Learning: In this scenario the student has many pretests and an answer key for only one of them. This student can learn from both the labeled (answered) and the unlabeled pretest questions to prepare for the exam. It is not as advantageous as having all the observations labeled, but far more useful than ignoring the unlabeled observations. Typically, as was the case for our client engagement, the latter far outnumber the former.
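The labeled/unlabeled setup above can be illustrated with scikit-learn’s self-training wrapper, one common semi-supervised technique (not necessarily the one used on this engagement). By sklearn convention, unlabeled observations are marked with -1. The data here is synthetic and purely illustrative.

```python
# Minimal semi-supervised sketch: a self-training classifier learns from
# a small labeled set plus a much larger unlabeled set (labels = -1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Hide ~90% of the labels to mimic a large untested product population
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) > 0.1] = -1  # -1 marks unlabeled rows

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)  # trains on labeled rows, then pseudo-labels the rest
preds = model.predict(X)
```

Self-training iteratively assigns confident pseudo-labels to unlabeled rows and refits, which is one concrete way the “unanswered pretests” in the analogy end up contributing to the final model.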

Our semi-supervised models sought to learn from both tested (labeled) and untested (unlabeled) products. We tackled the problem with a handful of diverse tools: we established a baseline model with semi-supervised learning techniques and challenged it with a deep learning approach. Despite the challenger model’s unparalleled performance on many of the canonical machine learning datasets, it was a semi-supervised learning technique that came out on top.

I’m not saying that the deep learning hype is unwarranted. We’ve seen deep learning perform extremely well for many of our clients. At a recent NVIDIA conference, I saw many examples of successful deep learning applications. From mortality risk prediction in critical care centers to disaster recovery resource allocation, deep learning is taking the world by storm. But, as John Elder often puts it, “every dog has its day”; on this problem, it was the semi-supervised dog’s day!