Model Validation and Reproducibility of Results


John F. Elder

Date Published:
December 4, 2020

Arguably the most vital modeling phase is validation: a model has to work on new, never-before-seen data or it is worthless. The problem is much greater than most researchers are aware. Most experiments (indeed, most published scientific papers based on inducing results from data) are believed to be irreproducible; that is, an independent team following the same procedures cannot obtain similar results. (See, for instance, the landmark paper by J. P. A. Ioannidis (2005), “Why Most Published Research Findings Are False”, PLoS Med 2(8): e124.) In fact, in 2015 Science magazine picked as one of its “breakthroughs of the year” a study which revealed that only 36 of 100 major psychology papers could be reproduced when diligent researchers, in cooperation with the original authors, ran the experiments a second time (Science 28 Aug 2015, “Estimating the reproducibility of psychological science”, Vol. 349, Issue 6251). If refereed articles in the most prestigious scientific journals have such a poor reproducibility rate, how much worse is it for models created with even less formal and competitive validation and verification procedures?

Elder Research is a world leader in model validation. We go beyond following best practices in model validation (such as cross-validation and bootstrapping); our deep experience has led us to invent techniques that are now essential to cutting-edge practice. Good models have many enemies. There are many ways that inducing knowledge from sample data can go wrong, as summarized in my popular “Top Ten Data Mining Mistakes” (Chapter 20 in my book). In fact, data science’s greatest strength, its ability to discover previously unknown relationships in data to predict or classify a target (output) variable, can also be its greatest weakness: when not constrained properly, a modeling routine can go wild and find spurious correlations that hold for the training data it sees, but not for the out-of-sample (OOS) data it doesn’t see. That problem, over-fit, is fairly well known, but the related problem of over-search is less widely understood. There, the final model may be simple, but because the modeling procedure considered so many combinations of factors (potentially billions of hypotheses), the chance of finding something by luck alone becomes much too large to ignore. In 1995, I invented a technique to solve this problem or, more precisely, to measure its influence so it could be controlled. Target Shuffling is a re-sampling method that calibrates the “interestingness measure” of a model (such as its p-value or R²) to a probability of its being “real”, that is, of holding up out of sample. This tool has enabled Elder Research to build reliable models in domains previously believed to be dominated by noise, and the Target Shuffling capability has been adopted by leading tool vendors, such as KNIME and CART, who have added it to their software as a stand-alone module.
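To make the idea concrete, here is a minimal sketch of target shuffling applied to an over-search problem. It is an illustration under stated assumptions, not Elder Research's production implementation: the function names are hypothetical, the "modeling procedure" is simply a search for the single candidate feature most correlated with the target, and the interestingness measure is the largest absolute Pearson correlation. The key step is to shuffle the target, breaking any real relationship with the inputs, and then rerun the entire search, so that the luck introduced by searching is included in the baseline.

```python
import numpy as np


def best_abs_correlation(X, y):
    """Search every column of X; return the largest |Pearson r| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = (Xc.T @ yc) / denom
    return float(np.max(np.abs(r)))


def target_shuffle_p_value(X, y, n_shuffles=200, seed=0):
    """Estimate how often the same search does as well on a shuffled target.

    Hypothetical helper: returns the observed score and the fraction of
    shuffled runs that match or beat it (with an add-one correction so the
    estimate is never exactly zero).
    """
    rng = np.random.default_rng(seed)
    observed = best_abs_correlation(X, y)
    null_scores = np.array([
        best_abs_correlation(X, rng.permutation(y))  # rerun the full search
        for _ in range(n_shuffles)
    ])
    p = (1 + np.sum(null_scores >= observed)) / (1 + n_shuffles)
    return observed, float(p)


# Illustrative synthetic data (an assumption of this sketch, not real results):
# 50 cases and 500 candidate features, none of which relates to y_noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 500))
y_noise = rng.normal(size=50)
y_signal = 2.0 * X[:, 0] + rng.normal(size=50)  # feature 0 truly matters

for name, y in [("pure noise", y_noise), ("real signal", y_signal)]:
    score, p = target_shuffle_p_value(X, y)
    print(f"{name}: best |r| = {score:.2f}, shuffled runs doing as well = {p:.3f}")
```

With this many candidates, the best feature found in pure noise can look impressive when judged as if it were the only hypothesis tested. Comparing it against the shuffled runs shows how often a search of the same size produces an equally strong result by luck alone, which is the quantity the naive measure ignores.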

Lastly, Elder Research and its subsidiary are world leaders in educating other researchers and practitioners in best practices, with an emphasis on model validation. We teach, from our vast field experience and our research work, how to see danger signs of model weakness at each stage of the CRISP-DM process: data gathering, feature design, sampling, modeling, evaluation, and deployment. Our deployed success rate, meaning the share of our models that are actually used in the field, is over 90%. Anything short of 100% is not ideal, but many human factors, not just technical ones, affect acceptance of a model and must be attended to. This rate is astonishingly high, by a factor of 3x to 5x, compared to our industry, and is due to our long experience and careful study of all the factors needed for implementation success.

