Surfing requires a combination of skill, balance, strength, and awareness. A surfer only has so much control over where they are headed. It’s less about a specific destination, and more about catching the wave and seeing where it takes you.
Solving problems with data is (surprisingly) a lot like surfing. If the data and the problem's goal do not match, it is like pointing a surfboard straight toward the shore: you likely won't end up where you want to go. So, like deciding which wave to ride, how do you know if you've picked the right problem?
Riding the Wave Where It Takes You
On a recent trip to Hawaii, I had the opportunity to take surfing lessons. Against my better judgment, I found myself on a longboard, falling time after time, but having a blast. I expected that the hardest part as a novice would be standing up. Actually, that turned out to be pretty easy. The novice mistake I kept making — I finally realized — was that when a wave came, I would point my board straight toward shore. The wave would carry me for a bit as it approached and passed underneath my surfboard, but eventually it would "break away" and leave me standing still in the water.
My instructor's guidance was (seemingly) straightforward: "look for the wave, and ride it where it takes you." However, it takes skill and experience to change direction with the wave, and to identify the right wave to ride. Successfully applying Data Science to solve business problems requires a similar kind of agility and expertise.
Letting the Data Lead
In Data Science, we most often work with “found,” or observational datasets; these are data that have been collected for purposes other than analytics. These datasets have limitations, and often biases, that may inhibit their applicability to solving problems that generate business value. While data transformations and feature engineering can often generate meaningful information from data, even these tools have limits. Ultimately, even experienced Data Scientists must go where the data take them.
Data Scientists with limited experience will often select a problem only to find that the data they have selected, collected, or have available are insufficient to answer the questions at hand. Although listening only to the data is one of John Elder's Top 10 Data Mining Mistakes, failing to listen to the data at all is like trying to surf straight to shore — the data, like the wave, may not be going that way.
Going With the Data to Escalating Degrees
In a recent client engagement, a large dataset was collected with the goal of predicting fraudulent logins to a web portal. There are several transactions that can take place when a user accesses this web portal (e.g., change of service), and establishing the sequence of transactions in a session can be difficult. By identifying fraud earlier in the session, our client could minimize damage to their business and reputation.
In the existing process, fraud investigators would label (as fraud-related) whichever transaction was closest to the confirmed fraudulent event. During the Data Discovery phase of our engagement, we performed a thorough analysis of the available data to determine its suitability for building a predictive model. Most importantly, we flagged an unintended consequence of the current approach to fraud labeling: no login transactions were labeled as fraudulent. In more technical terms, we lacked a target variable to train a supervised learning algorithm. Based on extensive experience building fraud detection models, we recommended two options to resolve this dilemma:
- Collect more login data and instruct investigators to label fraudulent logins as such. In the interim, we could use clustering analysis to determine whether there were login patterns that correlated with labeled fraudulent transactions.
- Build a predictive model for the fraud labeled on other transactions.
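To make the dilemma concrete, here is a minimal sketch of the kind of check that reveals a missing target variable. The transaction records and type names below are made up for illustration; they are not the client's data.

```python
from collections import Counter

# Toy transaction log: (session_id, transaction_type, fraud_label).
# Illustrative data only -- not the client's actual records.
transactions = [
    ("s1", "login", 0), ("s1", "change_of_service", 1),
    ("s2", "login", 0), ("s2", "payment", 0),
    ("s3", "login", 0), ("s3", "change_of_service", 1),
]

# Count fraud labels per transaction type. If no "login" rows are
# labeled fraudulent, there is no target variable with which to train
# a supervised login-fraud model -- exactly the dilemma above.
fraud_by_type = Counter(t for _, t, y in transactions if y == 1)
login_frauds = fraud_by_type.get("login", 0)
print(fraud_by_type)
print("labeled fraudulent logins:", login_frauds)
```

Because the investigators always labeled the transaction nearest the confirmed fraud, every fraud label lands on a non-login type, and the login count comes back zero.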
Rather than riding the login data wave (Option 1) that would not take them where they wanted to go, our client decided on Option 2, and we built a fraud model for the labeled fraudulent cases they had. Among the top 1% of cases the model scored as most likely to be fraudulent, it identified actual fraud three times as effectively as the baseline process. While it was not the predictive model they had hoped to build from the outset, it still was a useful and valuable model outcome.
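The "three times more effective in the top 1%" figure is a lift measurement. A hedged sketch of how such a number can be computed, using synthetic scores and labels (the scoring function and data here are assumptions, not the client's model):

```python
import random

def lift_at_top(scores, labels, fraction=0.01):
    """Lift = fraud rate in the top-scored fraction / overall fraud rate."""
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k        # precision in top k
    base_rate = sum(labels) / len(labels)               # baseline fraud rate
    return top_rate / base_rate

# Synthetic example: 1,000 cases, 2% fraud, model scores fraud higher.
random.seed(0)
labels = [1] * 20 + [0] * 980
scores = [random.uniform(0.5, 1.0) if y else random.uniform(0.0, 0.7)
          for y in labels]
print(round(lift_at_top(scores, labels), 1))
```

A lift of 3 at the top 1% means an investigator working that top slice finds three times as much actual fraud as one working cases under the baseline process.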
In another recent client engagement, a very large, non-rectangular dataset recording user navigation through a complicated software tool needed to be wrangled into shape and analyzed. It had accumulated for years, capturing the keystrokes of users who volunteered to be tracked during their sessions, but had never before been examined. (This is arguably one definition of Big Data.) While attempting to match the dataset against other data assets of the client, we found unaccountable anomalies that eventually led to a discovery: a large percentage of the tracked users were not licensed users! Given the business processes in place, this was supposed to be impossible, so it became a minor crisis, immediately more important to address than the customer segmentation goal that originally drove the project. The data discovery phase thus led to an extraordinarily valuable, though not pleasant, finding that redirected the project. After the licensing issue was successfully addressed, the original goal was tackled successfully as well.
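The cross-check that surfaced the anomaly amounts to an anti-join between two data assets. A minimal sketch with made-up user IDs (the real matching involved far messier keys):

```python
# Hypothetical sketch: compare user IDs seen in the tracking log
# against the licensed-user list. All IDs are invented for illustration.
tracked_users = {"u001", "u002", "u003", "u004", "u005"}
licensed_users = {"u001", "u003"}

# Set difference: tracked users with no matching license record.
unlicensed = tracked_users - licensed_users
share = len(unlicensed) / len(tracked_users)
print(sorted(unlicensed))
print(f"{share:.0%} of tracked users are unlicensed")
```

In practice this kind of check runs during data discovery, before any modeling, which is exactly when the crisis above was caught.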
Tsunami (Story from Dr. John Elder)
Years ago, Elder Research was asked to help a startup firm with a very challenging project: to use infrared waves shone through the skin, and reflected off an individual's bloodstream, to diagnose dangerous diseases, such as cancer (or, in other configurations, to measure glucose levels useful for regulating diabetes). We were able to help the firm improve their classification ability quite a bit, but a huge challenge was negating the unique characteristics of each individual's skin. That is, to get at the common blood chemistry internally, we had to fight past all the unique properties of skin color, elasticity, reflectivity, etc. created by each person's combination of race, age, sex, propensity to be outdoors, and so on. Meeting that challenge required gathering a great deal of sample data from real-world people in many cities, and many passes of sophisticated analysis and brainstorming.
One of the client's officers eventually had the brilliant idea: "this negative is so bad it must be a positive." Long story short, they formed a new startup and used the same technology for biometrics; that is, showing that each person has a skin reflectivity signature — like a fingerprint, or an iris pattern — that is unique, and can be used as a simple, non-invasive, inexpensive identifier. A few years later, when I took my younger children to Disney World, I proudly showed them the little red light that we put our thumbs over to get into the park and told their confused faces that Daddy had helped build that.
So, in this last case, the data wave was so big it created a new company.
Chasing the Perfect (Data) Wave
Like a surfer riding a wave wherever it takes them, Data Scientists and the stakeholders they support must be agile, and adjust their expectations based on the peculiarities of the data they have available. Maybe that means answering a different question, taking intentional steps to improve data quality, or waiting for more examples of an outcome to appear. As in surfing, so too in Data Science: knowing which leads to chase and when to hang back is a key skill that comes with experience.
Like surfers chasing the perfect wave (but not finding it), Data Scientists yearn for perfect data, but know that it doesn’t exist. Instead of waiting on perfection, we determine what is valuable to our clients in the data they have, and carefully extract it through a robust and rigorous modeling process based on years of practice. We see the data, and follow it where it takes us.