Top Mistakes When Backtesting Investment Strategies


John F. Elder

Date Published: February 16, 2018

A market index provides a tough hurdle to beat for any investment strategy. Holding an index is almost always better than following a strategy that systematically picks a subset of it in space or time (that is, stock picking or market timing). The cost of a predetermined index is low, since no thought is required, and the long-term results of the major market indices over the last century have been impressive. So, the argument goes: “Why waste your money paying for expensive managers who may only beat the market by luck?”

John Bogle, founder of The Vanguard Group, the world’s largest mutual fund group, implemented the first index fund — one of history’s simplest billion-dollar ideas. It was astonishingly recently — in 1975 — even though the Dow Jones Industrial Average was almost a century old by then. Indices like the Dow and the S&P 500 were invented to illustrate how the overall market was performing. The Dow sums the price of 30 large public US companies. It does not take into account their size (market capitalization) unlike the S&P, so it is not as good an indicator of market performance, but it’s easier to calculate. Hard to believe that for 90 years you could see how the Dow was doing but couldn’t easily invest in it!

When a hedge fund strategy is introduced, the most important thing everyone looks at is its backtest: how it would have done in the past. Obviously, no marketed fund’s backtest ever looks bad! But how representative are those past results of what will happen going forward? That is the essential question, and it’s an extremely hard one to answer. I’m going to share some wisdom from Vanguard, and expand on it, about mistakes to look for in backtests. Of course, you have to remember that Vanguard “has a dog in the hunt.” When they provide their useful critique, they are quietly saying “probably no hedge fund really works, you know!”

In Vanguard’s concise 2-page document, “A quick guide to evaluating back-tested investment strategies,” which is linked from some articles, ads, and blogs, Vanguard supplies three categories of questions to ask of an investment strategy:

  • What is its Rationale?
  • What is the Empirical evidence?
  • What are its Implementation results?

First, let’s look at the actual results (the “true backtest”) of the market. Many big events of the last century, such as the Great Depression, WWII, and the Global Financial Crisis of 2008, are naturally reflected in the stock market’s performance (Figure 1), though some primarily financial events seem to have had more impact than world-shaking global conflagrations! Note that, as a rule, stocks have an overall upward slope over time, and the average long-term rate of return is about 9% per year, a surprisingly strong number. Figure 1 uses a log scale, so a constant rate of return appears as a straight line; on a linear scale, the graph would be dominated by the most recent years. (That is why a log scale is the best way to plot long-run investment results.)

Figure 1. Log scale of annual stock market return based on S&P composite index.
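To see why a constant rate of return plots as a straight line on a log scale, here is a minimal Python sketch. The 9% rate comes from the text; the starting value of 1.0, the 100-year horizon, and the function name are assumptions chosen only for illustration.

```python
import math

def log_wealth_path(rate, years, start=1.0):
    """Log10 of wealth at the end of each year under a constant annual return."""
    return [math.log10(start * (1.0 + rate) ** t) for t in range(years + 1)]

path = log_wealth_path(rate=0.09, years=100)

# Year-over-year steps of the log series. If they are all equal,
# the points lie on a straight line when plotted against time.
steps = [later - earlier for earlier, later in zip(path, path[1:])]
print(round(min(steps), 6), round(max(steps), 6))
```

Every step equals log10(1.09), so the log of wealth rises by the same amount each year. On a linear scale, the same series would curve sharply upward and the early decades would be squashed flat.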

The stock market’s compounded growth is pretty amazing, although there have also been periods of real terror. In the midst of a down period, there’s no way to know how long it will last. Since we are all subject to fear and greed, we individual investors usually make investment decisions that do worse than the index. Studies repeatedly show that the average equity investor’s return over time is much worse than the S&P 500’s. As shown in Figure 2, from Dalbar’s 2014 Quantitative Analysis of Investor Behavior (QAIB) Report, the average investor holding equities for one year obtained a 6% return, while the S&P 500 yielded a 14% return. Even for those investing in equities for 30 years, the S&P 500 returned 11% per year while the investors achieved only 4% per year. Compounded over 30 years, the S&P returned about seven times as much money as the individuals received! It’s a powerful and sad measure of the cost of humanity’s loss aversion: we tend not to stick it out through the tough periods and therefore aren’t invested if and when the subsequent strong rise occurs. Emotionally, we can’t handle the stress.

Figure 2. Average equity Investor performance over different periods of years compared to the S&P 500 Index (from Dalbar)
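The 30-year comparison is simple compounding arithmetic, and it can be checked in a few lines. This sketch uses only the 11% and 4% annual figures quoted above; the function and variable names are made up for illustration.

```python
def growth_factor(annual_return, years):
    """Multiple of the starting stake after compounding for `years` years."""
    return (1.0 + annual_return) ** years

index_growth = growth_factor(0.11, 30)     # S&P 500 figure quoted in the text
investor_growth = growth_factor(0.04, 30)  # average-investor figure quoted in the text
ratio = index_growth / investor_growth

print(f"index: {index_growth:.1f}x  investor: {investor_growth:.1f}x  ratio: {ratio:.1f}")
```

The ratio comes out at roughly seven. A gap of seven percentage points per year looks modest in any single year, but it compounds into a very large difference over three decades.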

The baseline performance of a broad-market index is incredibly challenging to beat. If the index is your competitor, you are already facing a very tough opponent. Conversely, it’s a great default investment strategy if you’re not going to be a full-time investment professional. At Elder Research, the default retirement investment option, if you don’t set one for your profit sharing, combines this index wisdom with the solid advice to invest in stocks while you’re young and shift to bonds as you age and your “investment horizon” shrinks. Our default is therefore to put you in an age-adjusted “Target-Date Retirement Fund” (a type of fund first introduced in the 1990s).

Well, you may find index funds boring. And mutual funds – which are more numerous than stocks – are no different, but are more expensive than indices because their managers do so much hard work picking what they think will be winners and losers. When bad times happen, they’re all exposed, all sectors become correlated, and they all go down hard. Alternative investments, like hedge funds, look especially attractive then. Alternatives do unusual things, and aren’t correlated with the broad markets. So, you find yourself starting to look at some intriguing backtests….

Vanguard wants to help you here. Keep you from jumping over the edge. Take a minute to look at the dangers. And, frankly, their succinct 2-pager does a great job of capturing common-sense points as well as some subtle dangers most people miss. It’s officially only available to investment professionals, but I’ll summarize it here.

Vanguard’s advice for evaluating a new investment system:

  1. Is there a sensible rationale for the strategy?
    • Data snooping bias – Was the strategy planned before backtesting? Does it have a reasonable explanation, or is it just a random finding after the fact?
    • Redundant theme – Is it derived from a strategy in the literature? If so, why might it still work (since any “pockets of inefficiency” would be pounced on and thereby killed)?
  2. Is there extensive empirical support for the strategy?
    • Look-ahead bias – Is its information publicly available at trading time? For example, is the model based on closing prices from different time zones?
    • Time-period bias – Why did the backtest start and stop when it did? How sensitive are the results to different time periods, such as low versus high volatility markets, growth periods, low versus high interest rates, etc.?
    • Overconfidence and confirmation bias – Has the strategy been vetted by independent, unbiased third parties?
    • Selection (multi-testing) bias – How many strategies were tested? If more than one, were adjustments made to account for the fact that standard statistical significance levels apply only when a single test is conducted? (This is excellent; most researchers miss this subtle but essential issue. It is why target shuffling was invented; one must calibrate the statistical test to account for all of the things one has looked at to avoid overconfidence.)
    • Overfitting bias – Given that some degree of random noise influences strategy results over any period of time, how was the strategy constructed to avoid having too many parameters, which raises the risk of overfitting? How do you know it did not simply fit the training data well? Was it tested out of sample?
  3. Was there real-world implementation testing?
    • Representativeness bias – Consider how the strategy has done since it was first published, as well as how it has performed in other markets or asset classes. For instance, does it work in bonds as well as stocks? Is the concept behind the strategy robust?
    • Implementation costs – Have the costs of implementing the strategy been overlooked? Confirm whether commissions, bid/ask spreads, expenses, and market impact were accounted for.
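The selection (multi-testing) point in the list above is easy to underestimate, so here is a rough Python sketch of the idea behind target shuffling. Everything in it is an assumption made for illustration: the returns are pure noise, the 100 candidate "strategies" are random in-or-out signals with no skill, and the function names are hypothetical. It is a toy version of the technique, not a production test.

```python
import random
import statistics

def strategy_score(signal, returns):
    """Mean per-period return when invested only in periods where signal is 1."""
    return statistics.fmean(s * r for s, r in zip(signal, returns))

def best_of_search(signals, returns):
    """Score every candidate strategy and keep the best score (the 'search')."""
    return max(strategy_score(sig, returns) for sig in signals)

def target_shuffle_p_value(signals, returns, n_shuffles, rng):
    """Share of shuffles whose best score matches or beats the real best score."""
    observed = best_of_search(signals, returns)
    shuffled = list(returns)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between signals and returns
        if best_of_search(signals, shuffled) >= observed:
            hits += 1
    return observed, hits / n_shuffles

rng = random.Random(0)
n_periods, n_strategies = 250, 100
returns = [rng.gauss(0.0, 0.01) for _ in range(n_periods)]   # pure noise, no edge
signals = [[rng.randint(0, 1) for _ in range(n_periods)]     # random, skill-free rules
           for _ in range(n_strategies)]

single = strategy_score(signals[0], returns)
best, p_value = target_shuffle_p_value(signals, returns, n_shuffles=100, rng=rng)
print(f"first strategy: {single:.5f}  best of {n_strategies}: {best:.5f}  "
      f"shuffle p-value: {p_value:.2f}")
```

The best of many skill-free strategies will usually look far better than a typical one, even though nothing real was found. The shuffle step reruns the entire search on scrambled returns, so the resulting p-value reflects how good a "best" result is obtainable by luck alone. Note that shuffling also destroys any time-series structure in the returns, which a serious test would need to handle more carefully.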

In the midst of sharing this advice, in a paid article/ad that recently appeared in many places, a Vanguard executive offered this (too clever) counterexample:

“What one person might consider research another might call data mining, and there can be a fine line between the two. But however you think about it, there is a difference between finding a random anomaly and identifying a viable rules-based strategy. As a fun example of this, a few years ago my colleagues Joel Dickson and Chuck Thomas ran a hypothetical simulation that compared the performance of the S&P 500 Index with an equity portfolio that had an equally weighted combination of all stocks with tickers that began with S, M, A, R, or T. As Figure 3 shows, this simple, rules-based strategy did very well over a long period of time. However, let’s be honest, there is no sound reason to justify why it would be a good idea to pursue this strategy in the future.”

Figure 3. Annualized return of S.M.A.R.T. Beta strategy from Dec. 31, 1994, to Oct. 31, 2013

This hypothetical portfolio was created to mock a competing investment class, a kind of pseudo-index called SMART Beta. Beta measures a strategy’s sensitivity to the market’s moves (closely related to, though not the same as, its correlation with the market), and Alpha is its excess return compared with the market. Other vendors are offering indices that are tilted toward volatility, or exposure to foreign markets, etc.; their “beta” is meant to be “smart” rather than entirely passive like the plain vanilla index.

In Figure 3, Vanguard is trying to demonstrate that of course you can (somehow) come up with a strategy that beats the S&P 500 index, but just because it exists doesn’t mean it will keep doing so in the future. This one is nonsensical in origin, so of course it won’t continue to outperform, they imply. But their example is terrible! First, they are mocking a portfolio that does so well for so long that I’ll bet some people will be convinced by their figure to invest in it! (Over the ~18 years it would have compounded to about 5x the index’s return.) One critique (and the only one I saw others mention in the comments sections online where this ad/article ran) is that their example equal-weights the stocks, whereas the S&P index capitalization-weights them. That is not a fair comparison, since equal weighting invests more in the smaller and more volatile stocks, which tends to improve one’s return during a rising market.

But the killer offense, that trumps all others, is survivor bias. Note the word “current” in their description above of how the portfolio was created. They used all existing securities in the current index and went backwards in time! Vanguard broke their own rules by picking present day members of the S&P 500 for their portfolio 18 years previously; but, of course, investors back then would have had no idea what stocks would comprise the far future S&P 500 index. This issue – survivor bias – is an important problem in investment models and many other domains. It can be fatal to a statistical model so we have to force ourselves to think about where it might be and how to eliminate it. It is better to brainstorm fiercely about possible flaws in a model and kill it early if we have to, rather than after it has been deployed, when it is much more expensive to do so.

Why would the current members of the S&P have such a great advantage over the actual ones historically? Because a stock with poor performance either fails or is replaced in the S&P by a new or more successful stock. The current members of the S&P are going to be all-stars, looking back historically.
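A small simulation makes the size of this effect concrete. In the Python sketch below, every stock has exactly the same odds, so no stock is truly better than another. All of the parameters (1,000 stocks, 18 years, an 8% mean return with 30% volatility, keeping the top half as "today's members") are invented for illustration and are not taken from the Vanguard example.

```python
import random

def simulate_growth(n_stocks, n_years, rng, mean=0.08, vol=0.30):
    """Final growth multiple of each stock; every stock has identical odds."""
    growth = []
    for _ in range(n_stocks):
        wealth = 1.0
        for _ in range(n_years):
            # Floor at zero: a stock can be wiped out but cannot go negative.
            wealth *= max(0.0, 1.0 + rng.gauss(mean, vol))
        growth.append(wealth)
    return growth

def annualized(growth_multiple, n_years):
    """Convert a total growth multiple into an equivalent yearly return."""
    return growth_multiple ** (1.0 / n_years) - 1.0

rng = random.Random(42)
n_years = 18
growth = simulate_growth(n_stocks=1000, n_years=n_years, rng=rng)

# What an investor could actually buy at the start: every stock, equally weighted.
universe = sum(growth) / len(growth)

# Hindsight portfolio: only the half that finished strongest ("today's members").
survivors = sorted(growth, reverse=True)[: len(growth) // 2]
hindsight = sum(survivors) / len(survivors)

print(f"full starting universe: {annualized(universe, n_years):.1%} per year")
print(f"hindsight survivors:    {annualized(hindsight, n_years):.1%} per year")
```

The hindsight portfolio beats the full starting universe even though the stocks are statistically identical. The only "skill" involved is knowing the ending membership list in advance, which no investor at the start date could have had.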

A friend who used to be a hedge fund manager (and is now a missionary in Nicaragua doing amazing work) sent me this article as a cool research result. I saw the flaw immediately, probably because I’ve made (and seen) so many errors, and have made something of a writing career out of hunting down and cataloguing them. But no one else online has noticed it yet. I’ve sent this example to many people without suggesting there might be an error, and no one has noticed. When I suggested there might be a problem, only one person so far has diagnosed it right away: Colin Thomas here at ERI. So it’s not easy to recognize these errors in the wild. But if we can, it can save us a great deal of heartache.

So what could have helped us notice? Since the SMART beta portfolio represents more than 20% of the S&P 500, it should have been a red flag that this subset of stocks crushed the total return of the S&P 500 index. If it were an honest research result, the majority of the index would have had to do truly horribly for the minority subset to do so well by comparison. And with so many stocks (500) over so long a period (18 years), it’s just not plausible that randomness could allow for such divergence.
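This sanity check can be turned into back-of-the-envelope arithmetic. The Python sketch below rests on loud assumptions: it treats the index as a buy-and-hold basket, gives the subset a 20% starting share of it, reads the rough “5x” figure mentioned earlier as a ratio of ending values, and uses a made-up index growth of 4x. The real portfolio was equal-weighted against a cap-weighted index, so this is only a rough plausibility test, not a reconstruction of Vanguard’s numbers.

```python
def implied_rest_growth(index_growth, subset_growth, subset_weight):
    """Growth multiple the rest of the index must have had, for a buy-and-hold split.

    Solves: index_growth = subset_weight * subset_growth
                           + (1 - subset_weight) * rest_growth
    """
    return (index_growth - subset_weight * subset_growth) / (1.0 - subset_weight)

index_growth = 4.0                  # hypothetical: index ends at 4x its starting value
subset_growth = 5.0 * index_growth  # assumption: subset ends at 5x the index's ending value
subset_weight = 0.20                # assumption: subset's starting share of the index

rest = implied_rest_growth(index_growth, subset_growth, subset_weight)
print(f"implied growth multiple for the other 80%: {rest:.2f}")
```

Under these assumptions, the other 80% of the index would have had to end with essentially nothing, which is not plausible. When a simple accounting identity produces an absurd answer like this, the inputs deserve suspicion before the strategy deserves money.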

It is beyond ironic that Vanguard’s excellent advice on how to judge backtesting is accompanied by an example that violates the most important principle of all: The strategy has to be possible, at the time, to implement!

I have some interesting stories I’ll share in a later note, of hedge fund strategies (live and not-yet-live) gone wrong, that have been kicked up in memory by this exercise. Meanwhile, remember to seek as many “outside eyes” to critique your work as you can! It’s the best way to get it right.
