Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question. Also known as data cleaning or “munging”, legend has it that this wrangling costs analytics professionals as much as 80% of their time, leaving only 20% for exploration and modeling. Why does it take so long to wrangle the data so that it is usable for analytics?
Why is Data Wrangling Necessary?
Wrangling produces analytic data sets that usually have one of four structures. The first is the most common for data science:
Analytic Base Table (ABT): The ABT is used for machine learning. Each row in the table represents a unique entity (such as a person, product, claim, vehicle, batch, etc.) with columns containing information (inputs) about that entity for a specific point in time: its attributes, history, and relationship with other entities.
If the ABT will be used for supervised machine learning, it must include outcomes that post-date its inputs. Supervised machine learning analyzes the table looking for consistent patterns that are predictive of the desired outcomes.
De-normalized Transactions: Transactional information is used for organizational business operations such as:
- Presentation of prior contact with a customer, including notes and the actions made during past calls, to address concerns on a current call
- A line item in a particular order, including information about the complete order and detailed product information
- A dental record of a root canal and the historical and current x-rays for comparison, to ensure the tooth is ready for a crown
For analytical purposes, these transactions are summarized for online analytical processing (OLAP) data marts used by business intelligence for managers. Alternatively, this transactional structure is the precursor to the ABT.
Time Series: One or more attributes about a specific entity over time. For standard time series analysis, the observations must be divided into consistent increments of time (seconds, weeks, years, etc.). Often the entity and its trended attribute are aggregates over time, such as the monthly average of the S&P 500 index.
Document Library: A consistent corpus of documents, predominantly text, for analysis by text mining.
It is easy to underestimate the difficulty of producing these analytic data structures, especially when an organization has invested heavily in its data infrastructure over many years.
Analytics team leaders, customer champions, and other project stakeholders are often astonished by the estimate from the data scientists of how long it will take to wrangle the data before modeling can begin.
Some stakeholders, such as the users of the corresponding OLAP data mart, may think they should just try to do this themselves using their favorite dashboard data. The essential difference is that the OLAP query answers questions for aggregates of transactions, whereas an analytical model must make a unique prediction for each transaction.
Data Wrangling Challenges
There are many challenges with preparing data for use in the modeling process:
Clarifying the use case:
Because the data required depends completely on the question you are trying to answer, a data mart view is rarely adequate. To clarify the use case one must consider these questions:
- What subset of entities are relevant?
- Are you predicting the likelihood of an event or estimating a future amount?
- Within what time period and under what conditions?
- If the model results will drive an action, do you know the causal relationship between that action and the desired outcome?
Obtaining access to the data:
It is easiest if the data scientist or analyst secures permission to directly access the relevant data. Otherwise, they must submit precise instructions for “scrubbed” data, and hope their request will be granted and executed correctly.
Navigating these policy boundaries is difficult and time consuming.
Resolving key entities:
There must be no ambiguity about what entity is relevant. Although this sounds easy, it isn’t.
Take for example the entity “customer”. Who is a customer, exactly?
There may be a Customer ID in a convenient table, but if a customer leaves and returns, they may get a new Customer ID. Are John Smith, John R. Smith, and J.R. Smith the same customer?
If a person has a business account and a personal account, are they two customers or one? Can a person have multiple accounts, attempting to mask their identity?
Are there other members of the household who use the same account? What about hierarchical organizations of customers? The list goes on and on.
Identifying the extent and relationships of source data:
This is where data warehousing best practices help considerably, especially if there are appropriate views already constructed. Without this luxury, discovering how the natural key structures of the entities tie together can require significant data exploration, and then these implied rules must be verified by the data owners.
Inevitably, analytic models need to know what the data looked like on a specific date in the past–so the data stores must support “historical snapshots”. Also, if the meaning of the data (i.e., metadata) has changed over time, the confidence in the analysis may drop off sharply.
Securing relevant treatment data:
Since the purpose of a model is to decide which entities should be treated (that is, acted upon), the history of treatments taken is paramount. The treatment depends on the use case — anything from which customers to call based on their orders, to a decision whether to install a downhole pump to prevent a gas well from freezing—the possibilities are endless!
But, usually the treatment was conceived and implemented outside the domain of the other operational data—in documents and spreadsheets, for example. These dispersed files need to be fully compiled and integrated into the ABT. Eventually, new processes should be established to store subsequent treatment records in the master data source for use by future analytics projects.
Avoiding selection bias:
Too often ignored until model failures occur, selection bias is a significant problem for data science.
Selection bias remediation can be a difficult task. It is important to make sure the training sample data is representative of the implementation sample. Finding the right weights for bootstrapped sampling (an important technique to generate an unbiased sample) may require building a model just for this purpose.
Before the supervised learning phase, raw model inputs need to be transformed into features suitable for machine learning. This is the “art” of data science.
What should be nominal, what should be continuous? How should nominal levels be combined? What deterministic relationships need to be encoded? And, don’t forget that date discipline is paramount to prevent leaks from the future.
Safeguards are required to prevent features from inheriting leaks beyond the “snapshot date” for an observation. Also, every entity for a model build that will be scored in production must have an equivalent history as other entities.
Finally, the target outcome duration, or window of time, must be equal for all observations used to train a model.
Exploring the data:
Correlation analysis should be completed before starting a model build. Before exploring the relationship to the ultimate outcome, it is very helpful to remove redundancy in the data. Skip this step and you will pay for it later.
Often, in wide files, a set of columns are highly related, meaning that they contain redundant information, which only serves to make feature and model selection more difficult.
Although such redundancy might be evidenced by a high correlation coefficient, sometimes it is not. For example, if there are two columns for color, one with Spanish names for the color and one with English names, they contain identical and redundant information.
Usually the redundancy is more subtle. Variable importance metrics will report that both related inputs are important, but will naturally find it very difficult to choose a balance between the two in the model. Failure to remedy this situation before modeling leads to confusing (and overly complex) models.
Is Data Wrangling Worth the Effort?
These data wrangling challenges only compound other project delays, such as competing work priorities, funding delays, and personnel changes. Proper data wrangling represents a daunting amount of work and time. Some may question if it is worth the effort.
It can be difficult to defend the work during this period as sometimes there is little to show for the hard labor, unlike the cascade of results that ensue during the modeling phase. At other times, there are vital discoveries made during the wrangling phase that are extremely valuable – sometimes redirecting the entire project.
Consider this analogy; the foundation to a skyscraper is expensive and takes a lot of time before the above-ground building starts, but this solid foundation is necessary for the building to survive and serve its purpose for future generations.
Once the code and data infrastructure foundation are in place for data wrangling, it will deliver results quickly (in many cases, almost instantly), for as long as the use case is relevant!
On the other hand, skipping essential data wrangling steps will result in erroneous models that mar the reputation of analytics within the organization, and stifle analytic innovation.