
Data Science, Statistics, and the "Method of Moments"

Peter Bruce

August 31, 2018


I got my introduction to statistics via resampling, working with Julian Simon, an early resampling pioneer. Demonstrating this "brute force" computer method to my father, I saw that he was vaguely offended by its inelegance.

He launched into an explanation that involved the "method of moments" and a lot of equations. "Method of moments" stuck with me in a poetic way, though I had no idea what he was talking about; to this day the subject is at the bottom of my list of jolly conversational topics. (The first through fourth "moments" in statistics are the mean, variance, skewness and kurtosis.) It seems to epitomize the classical approach to statistics and probability (i.e., lots of math and theory). Pafnuty Chebyshev introduced the "Method of Moments" in 1887 (see photo at right).
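For readers who, like me back then, would rather see the four moments computed than derived, here is a minimal Python sketch. It assumes the divide-by-n (population) forms of the central moments, and the function names `sample_moments` and `exponential_rate_by_moments` are made up for this illustration. The second function shows the simplest possible flavor of the method of moments: assuming the data follow an exponential distribution, it picks the rate whose theoretical mean (1/rate) equals the sample mean.

```python
def sample_moments(data):
    """Return (mean, variance, skewness, kurtosis) for a list of numbers.

    Assumption: divide-by-n central moments; skewness and kurtosis are the
    standardized third and fourth central moments (kurtosis is not "excess").
    """
    n = len(data)
    mean = sum(data) / n

    def central(k):
        # k-th central moment: the average of (x - mean) ** k
        return sum((x - mean) ** k for x in data) / n

    variance = central(2)
    skewness = central(3) / variance ** 1.5
    kurtosis = central(4) / variance ** 2
    return mean, variance, skewness, kurtosis


def exponential_rate_by_moments(data):
    """Method-of-moments estimate of an exponential rate.

    Assumption: the data are exponential, so the theoretical mean is 1/rate.
    Setting the sample mean equal to 1/rate gives rate = 1 / sample mean.
    """
    return len(data) / sum(data)


# Hypothetical sample, purely for illustration
sample = [1.2, 0.4, 2.7, 0.9, 3.1, 0.3, 1.8]
print(sample_moments(sample))
print(exponential_rate_by_moments(sample))
```

With more parameters to estimate, the same idea extends by matching more moments, one equation per parameter.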

Statistics in Data Science Programs

As data science grows in importance in university programs, so does the prevalence of statistics - it's almost always part of a data science program. And it is striking how often such programs include statistics courses that look like they could date from a century ago. One top-tier American university's data science program has a course that features topics such as hypothesis testing, one-sample methods, two-sample methods, and ANOVA.

So, what is it about data science that calls for something different?

The science of statistics arose initially out of the need to measure things - especially physical and mental attributes of people. But what fills most standard statistics books is the machinery of statistical inference (hypothesis tests, confidence intervals) that arose a century ago from the need to quantify the uncertainty inherent in relatively small samples.

Data science, by contrast, is not faced with a shortage of data and consequent small samples. So all that inference machinery (one-sample tests, two-sample tests, t-tests, F-tests, chi-square tests, goodness-of-fit, etc.) is mostly unneeded.

But multivariate statistical modeling - regression, principal components, clustering - is very useful for making predictions, reducing dimensionality and segmenting data. It's just that most software implementations of these modeling procedures, even in Python, come with unnecessary inference information in the output.

Figure 1. Python regression output - statistical inference metrics

Not a big deal, but you have to know to ignore it, or at least not get confused by it! Here at Statistics.com, we retain some of the inferential machinery in our foundational statistics courses, for our students learning "pure statistics" in our programs. But in those introductory courses, we also provide the appropriate connections with, and distinctions from, data science. For example, we teach the R-sq metric in regression, but place it in the context of statistics for research, where we want to know how well the model fits the sample of data. For data science analytics, we point out that predictive accuracy with new data is a more appropriate metric for the regression model.
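The distinction between fit to the sample and accuracy on new data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the model is a one-predictor least-squares line, and the helper names (`fit_line`, `r_squared`, `rmse`) and the 150/50 split are hypothetical choices, not material from our courses. R-sq is computed on the data used to fit the model; RMSE is computed on held-out records the model never saw.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of variance in ys explained by the line (fit to this sample)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def rmse(xs, ys, slope, intercept):
    """Root mean squared prediction error, in the units of y."""
    sq_errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Simulated data (an assumption for illustration): y = 2x + 1 plus noise
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1.5) for x in xs]

# Fit on the first 150 records, hold out the last 50 as "new" data
train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

slope, intercept = fit_line(train_x, train_y)
print("R-sq on training sample:", r_squared(train_x, train_y, slope, intercept))
print("RMSE on holdout data:   ", rmse(test_x, test_y, slope, intercept))
```

The first number answers the research question of how well the model fits the sample; the second answers the data science question of how far off predictions are likely to be on new records.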






About the Author

Peter Bruce is Founder and President of The Institute for Statistics Education at Statistics.com. Previously he taught statistics at the University of Maryland, and served in the U.S. Foreign Service. He is a co-author of Data Mining for Business Analytics, with Galit Shmueli and Nitin R. Patel (Wiley, 3rd ed. 2016; also JMP version 2017, R version 2018, Python version 2019; plus translations into Korean and, forthcoming, Chinese), Introductory Statistics and Analytics (Wiley, 2015), and Practical Statistics for Data Scientists, with Andrew Bruce (O'Reilly, 2016). His blogs on statistics are featured regularly in Scientific American online.