You’ve probably heard it before – analytics professionals working directly with data spend as much as 80% of their time on data preparation, leaving only 20% for actual analytics and modeling. There are several common terms for the activities making up this 80%, including data “cleaning,” “wrangling,” or “munging,” with perhaps the highest-profile example being “data janitor work,” as discussed in The New York Times.
The consensus seems to be that this work is undesirable, a necessary evil we must endure to get to the “cool” parts of data science. The practitioners quoted in the Times article lament the countless hours they pour into data prep, and the author entices the reader with the possibility of automating the process.
Anyone who works in predictive analytics would welcome the chance to cut down on prep work, but what are the downsides of adopting this attitude in the practice of data science?