Limits of Backtesting

Last week's column illustrates some of the care needed in running and interpreting backtests. We missed a big clue about the volatility of the strategy discussed: the low volatility was just an illusion. Backtesting blends of Workshop strategies has real limitations, but it can be useful as long as those limitations are kept in mind at all times.

By Ann Coleman (TMF AnnC)
November 16, 2000

Last week's column was a really good example of how careful one must be in examining blended strategies and backtested returns. I missed something big when evaluating the strategy sent in by a reader -- two things, actually. One was the significance of the fact that the returns were only for January starts. The other was that the volatility for the other start months was TERRIBLE.

I guess I was a bit blinded by the numbers. Huge returns and low volatility are the Holy Grail of investing. But last week's numbers were just as illusory as the visions seen by Arthur's knights.

If you recall, we are looking for a blend of mechanical investing strategies to use for the real-money portfolio that will be starting at the end of the year. One way of selecting such a blend of strategies is to test them using Jamie Gritton's Backtest Engines, which can show us how various strategies and combinations of strategies have performed over the past 14 years. The idea is to look for a blend of strategies that have different strengths and weaknesses so that, in theory anyway, when one strategy has a bad stretch, another will take up the slack.

The blend we discussed last week had great statistics, a CAGR (compound annual growth rate) of 52% and a GSD (geometric standard deviation, a measure of volatility) of 18%, but that was only for portfolios started in January. Three of the four strategies in the blend were annual strategies, which are most susceptible to variations in start months. The same blend started in other months had wildly different numbers, with October showing a CAGR of 42% and a GSD of 51%! That's a big difference, although one has to expect a good bit of variation since we are only dealing with 14 years' worth of data.
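For readers who want to see where those statistics come from, here is a minimal sketch, in Python, of how CAGR and GSD are typically computed from a series of annual returns. The function names are mine, not the Backtest Engines', and the formulas are the usual definitions: CAGR is the geometric average annual growth rate, and GSD is the standard deviation of the logarithms of annual growth, converted back to a percentage.

    import math

    def cagr(annual_returns):
        # Compound annual growth rate from annual returns expressed as
        # decimals (0.52 means 52%).
        growth = 1.0
        for r in annual_returns:
            growth *= (1.0 + r)
        return growth ** (1.0 / len(annual_returns)) - 1.0

    def gsd(annual_returns):
        # Geometric standard deviation: the sample standard deviation of
        # log annual growth, exponentiated and returned as a decimal.
        logs = [math.log(1.0 + r) for r in annual_returns]
        mean = sum(logs) / len(logs)
        var = sum((x - mean) ** 2 for x in logs) / (len(logs) - 1)
        return math.exp(math.sqrt(var)) - 1.0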

If you want to see the entire test for yourself, go to the Backtest Engines All-Months Summary page and enter the following code in the URL window: BLkc30n1213ps30n1213bw20n1212rq20n0312. Before you hit Run, change "1998" to "1999."

Now, I should have noticed this. Luckily, we have an active community that is always ready to point out (gently or otherwise) the error of my ways. (See Related Links above for the relevant discussion threads.) I'm not sure why I missed it, especially since there was a big, big clue right there on the original test result. It was the GSD(M) number. GSD(M) is a measure of annualized monthly volatility.

In a strategy with relatively stable returns, the GSD(M) figure will be close to the GSD number. For January starts of the blend discussed last week, the GSD was that lovely, low 18%, but the GSD(M) was 33%, almost twice as high. When the two are very different, GSD(M) gives a better measure of actual volatility. We will discuss GSD(M) more next week. It's a relatively new and potentially very valuable addition to our measures of volatility, and it will need to be considered whenever we are testing potential portfolio blends.
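Here is a similarly hedged sketch of GSD(M), again assuming the usual definition: take the standard deviation of log monthly returns, scale by the square root of 12 to annualize it, then convert back. A blend whose year-end totals happen to line up neatly can still swing wildly month to month, and this is the number that catches it.

    def gsd_m(monthly_returns):
        # Annualized monthly volatility: standard deviation of log monthly
        # growth, scaled by sqrt(12), then converted back to a decimal.
        # (Uses the math module imported in the earlier sketch.)
        logs = [math.log(1.0 + r) for r in monthly_returns]
        mean = sum(logs) / len(logs)
        var = sum((x - mean) ** 2 for x in logs) / (len(logs) - 1)
        return math.exp(math.sqrt(var) * math.sqrt(12.0)) - 1.0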

But let's talk just a bit more about this whole idea of testing blends. We need to keep clearly in mind that the process is an imperfect guide at best. Its main virtue is that it's better than nothing -- and that's only true when you understand its limitations. Anyone who expects that the CAGR from a backtest is guaranteed or even that there is a high probability of such returns in the future would be better off with nothing.

Right now, for example, few of the major Workshop strategies are working well, and almost any blended portfolio would be showing losses over the past several weeks, if not for the whole year. By blending strategies that pick both large- and small-cap stocks, growth and value stocks, and stocks from different industries, we hope to mitigate some of the year-to-year volatility our strategies are prone to -- even if we can't protect against market meltdowns or short-term rough patches. (Any guesses as to which we are experiencing now?)

Backtesting blends is subject to the same problems that affect backtesting single strategies, and it may be even more sensitive to data mining and multiple-hypothesis problems. Blends must be designed well, based on what we know about the types of stocks selected by the underlying screens -- not used simply because they have great numbers.

Backtesting is probably more useful for rejecting strategies than for accepting them. Rather than using the backtested numbers as the final determinant for our next portfolio, I look at backtesting as a kind of hurdle that a blend has to jump before being considered further. If we test a blend of strategies and it offers no improvement over the individual strategies, then I feel fairly safe in rejecting it as a candidate. Hopefully, we will get suggestions for several well-designed blends that have a higher ratio of returns to volatility than the underlying strategies, and from those blends, the community will select the final mix.
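To make that hurdle concrete, here is one illustrative way to encode it. The ratio of CAGR to GSD used below is my own stand-in for "returns to volatility," not a formula prescribed by the Workshop, and the function name is hypothetical.

    def passes_hurdle(blend_cagr, blend_gsd, strategy_stats):
        # strategy_stats is a list of (cagr, gsd) pairs for the individual
        # strategies. The blend clears the hurdle only if its return-to-
        # volatility ratio beats the best of its components.
        blend_ratio = blend_cagr / blend_gsd
        best_single = max(c / g for c, g in strategy_stats)
        return blend_ratio > best_single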

Backtesting can also be useful in testing small variations. Say we were trying to decide whether to use a monthly or quarterly version of our Relative Strength strategy. Given that we had little else on which to base such a decision and that trading costs were not a big issue, I would have no trouble turning such a decision over to a mechanical backtester. In a case like that, it might even prove better than tossing a coin, although I wouldn't bet on it.

In short, no one should expect that the backtesting process provides anything other than a best guess at how such a blend might work in the future, and if the markets of the future are very different from those of the past 14 years, all bets are off.

Fool on and prosper!