Foolish Workshop
Thinking Outside the Sample
How NOT to get trapped in a data mine

By Todd Beaird (TMF Synchronicity)

"Got a little ol' trick, to get rich quick
if you dig dig dig with a shovel and a pick
Heigh ho! Heigh ho!"
--from Tom Waits' interpretation of "Heigh Ho, The Dwarves' Marching Song"

DUSSELDORF, GERMANY (Oct. 7, 1999) -- Hello from the Continent! This is my first time in a non-English-speaking country, and it's a little disorienting. Then again, many of you who've been trying to follow the statistical analysis of the various Workshop screens would claim that we're speaking a foreign language. Hopefully, our "translation" of some of the statistical issues involved will help you tell a good screen from a Disney fantasy.

Let's start with a recap of last week. We looked at the results from some overall group (be it our 1,000 grapefruit throwers or the S&P 500) and compared them to the results from a "screen," a protocol for selecting members of the group based on certain characteristics. If the screen appears to identify group members who are better-than-average performers, we need to ask: Can we rely on this screening protocol to identify future outstanding performers, or are those great results simply due to chance?

Since even unlikely events are possible, we defined a "good screen" as one whose terrific results are not just whims of Fate. We want to find a screen that works because of a sound underlying reason. Those are the ones that keep working.

It is important to realize what this means. If we set a confidence level of 95%, and if we choose our portfolio of stocks or our grapefruit-throwing teams by pulling names out of a hat, we will still get teams or portfolios that perform well enough to pass that confidence-level test about 1 time in 20.

So, the first issue is the one we touched on at the end of the last article. We go through a lot of screens in the workshop. We might go through 19 screens that "should" work but don't, before stumbling upon one that does. Does this mean that it will work going forward? That twentieth screen might be "statistically significant" at a 95% confidence level, but intuitively, we would (and should) be skeptical. How do we know that the screen isn't simply due to chance? We can use a higher confidence level, but nothing will ever eliminate the possibility that we are seeing the laws of probability at work.
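For the mathematically inclined, here's why that skepticism is justified: if each backtest has a 5% chance of passing on luck alone, the chance that at least one of 20 independent junk screens passes is 1 - 0.95^20, or about 64%. The short Python simulation below is my own toy illustration (random "returns" with no real edge, not actual Workshop data), but it makes the point.

```python
import random
import statistics

def random_screen_passes(n_picks=30, confidence_t=1.70):
    """One junk 'screen': 30 random returns with no real edge at all."""
    returns = [random.gauss(0, 0.20) for _ in range(n_picks)]
    mean = statistics.mean(returns)
    stderr = statistics.stdev(returns) / n_picks ** 0.5
    # One-sided t-test; ~1.70 is the 95% cutoff for 29 degrees of freedom.
    return mean / stderr > confidence_t

trials = 2000
hits = sum(
    any(random_screen_passes() for _ in range(20))   # run 20 junk screens
    for _ in range(trials)
)

print(f"At least one of 20 junk screens 'passed' in {hits / trials:.0%} of trials")
print(f"Theory predicts about {1 - 0.95 ** 20:.0%}")
```

Run it a few times and the fraction bounces around two-thirds, which is exactly why a lone "95% significant" result after 20 tries deserves a raised eyebrow.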

Chance isn't the only way to land on outstanding backtest results: there are also data mining and curve fitting. These terms are often used interchangeably, and are frequently invoked to criticize mechanical screening techniques. Moe Chernick's excellent article on Characteristics of a Good Screen shows how to spot some of the signs of data mining and curve fitting. But what exactly are they, and can you spot them mathematically?

Remember our group of 1,000 grapefruit throwers? Imagine that we gathered a LOT of data on those people in addition to their skill at tossing large citrus. Aside from standard factors such as height and weight and age, we asked them a zillion other questions about things like their religious affiliation, favorite breakfast cereal, and whether they think Elvis is still alive.

If we spent weeks looking through that database, we'd find many correlations that are probably spurious but show up anyway. Maybe we'll find that Presbyterian Cheerios-eaters who play table tennis do much better than average at tossing grapefruit. There might be some rational reason for this (maybe our group of 1,000 included a church table tennis team that works out together by tossing softballs around, and they have persuaded each other to stay in better health by avoiding sugary cereals), but more likely, it's just a random correlation that we found by spending a lot of time "mining" our data. Although a statistical test such as a t-test will often filter these out, there's always that 1-in-20 or 1-in-100 chance (depending on what level of confidence we set) that such a correlation will get by.
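Here is a rough sketch of that mining process in code -- my own toy example with 50 invented yes/no traits and purely random throwing distances, not anything from a real survey. Even though none of the traits matter, a few of them will "pass" a t-test at 95% confidence just by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_throwers, n_traits = 1000, 50

# Throwing distance in feet: pure chance, unrelated to anything below.
distance = rng.normal(40, 8, size=n_throwers)

# Fifty made-up yes/no traits (Cheerios eater? thinks Elvis lives? ...).
traits = rng.integers(0, 2, size=(n_traits, n_throwers))

false_alarms = 0
for trait in traits:
    yes_group = distance[trait == 1]
    no_group = distance[trait == 0]
    t_stat, p_value = stats.ttest_ind(yes_group, no_group)
    if p_value < 0.05:          # "significant" at 95% confidence
        false_alarms += 1

print(f"{false_alarms} of {n_traits} meaningless traits passed the t-test")
# Expect roughly 50 * 0.05 = 2 or 3 spurious "discoveries" per run.
```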

At the risk of upsetting the mathematicians out there, I view data mining as "discovering" correlations that, in reality, have no cause-and-effect relationship. The Super Bowl Indicator (if a team that belonged to the original NFL -- before the merger with the AFL -- wins the Super Bowl, the market goes up) is an example. There is a definite statistical correlation, but is there any rational reason why the success of the Pittsburgh Steelers or New York Giants (old NFL teams) should have any effect on the market?

Curve fitting is essentially the same concept: tweaking your protocol to make your results better by adding or adjusting variables in ways that don't make sense. The variable itself may be a reasonable one (e.g., EPS growth greater than X%), but it might be set at a value that just happens to fit your data (EPS growth > 18.23%, for example). Why 18.23%? Maybe that was precisely the cutoff that excluded a really bad stock (with EPS growth of 18.2%) and caught a really good stock (with EPS growth of 18.25%). This is one reason why small sample sizes are suspect. A single stock having an exceptional year can really skew the average.
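To make that concrete, here is a hypothetical illustration (the stocks and numbers below are invented, not a real backtest): let the computer sweep every possible EPS-growth cutoff and keep whichever one maximizes the backtested average return. The "best" cutoff lands in the sliver between the one dog at 18.2% and the one monster at 18.25%, which tells you it's fitting those two stocks, not any underlying truth.

```python
# Hypothetical backtest: (EPS growth %, next-year return %) for 8 stocks.
stocks = [
    (12.0,    5.0),
    (15.5,    8.0),
    (18.20, -35.0),   # one dog sitting just below the "magic" cutoff
    (18.25, 182.0),   # one monster year sitting just above it
    (22.0,   10.0),
    (25.0,   -2.0),
    (30.0,   12.0),
    (35.0,    6.0),
]

def avg_return_of_picks(cutoff):
    """Average return of every stock whose EPS growth is at least `cutoff`."""
    picked = [ret for growth, ret in stocks if growth >= cutoff]
    return sum(picked) / len(picked) if picked else float("-inf")

# Sweep cutoffs in 0.01% steps and keep whichever one fits THIS data best.
candidates = [c / 100 for c in range(1000, 3600)]
best_cutoff = max(candidates, key=avg_return_of_picks)

print(f"Best-fitting rule: EPS growth >= {best_cutoff:.2f}%")
print(f"In-sample average return: {avg_return_of_picks(best_cutoff):.1f}%")
# The "optimal" cutoff is driven almost entirely by two stocks; there is
# no reason to expect anything near 18% to work on next year's data.
```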

When you're building a screen, look for criteria that are indicators of good performance. The problem is that you can easily be guilty of data mining and/or curve fitting without even knowing it as you search for those criteria. Using any criteria that caught Goodyear Tire (NYSE: GT) in 1991 (182% return) will do wonders for a backtest that doesn't have a high number of data points. Such a screen might pass that 95% (or higher) "confidence" threshold even if it is composed solely of companies that own blimps. What to do?

The best way to avoid this issue is through "out-of-sample" testing. In our grapefruit example, we would go to another state or another part of the world, gather another 1,000 people, have them fill out that questionnaire about height and weight and breakfast cereals and Elvis, and then have them toss some grapefruit. Then we'd see how well the Cheerios-eating, Presbyterian table tennis players do. With stocks, we'd like to test our screens in both good and bad markets, over different time periods, or on different sets of stocks.
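Here is a minimal sketch of out-of-sample testing, again with made-up data rather than a real Workshop backtest: we "discover" the best EPS-growth cutoff on one randomly generated market (where growth and returns are genuinely unrelated), then check whether it still shows an edge on a second market it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_market(n=200):
    """Made-up universe: EPS growth and next-year return, with NO real link."""
    growth = rng.uniform(0, 40, n)      # EPS growth, %
    ret = rng.normal(10, 30, n)         # next-year return, %
    return growth, ret

def best_cutoff(growth, ret, min_picks=20):
    """Curve-fit a growth cutoff: keep whichever maximizes average return."""
    cutoffs = np.arange(0, 40, 0.25)
    scores = [ret[growth >= c].mean() if (growth >= c).sum() >= min_picks
              else -np.inf
              for c in cutoffs]
    return cutoffs[int(np.argmax(scores))]

def edge(growth, ret, cutoff):
    """Screen's average return minus the whole group's average return."""
    return ret[growth >= cutoff].mean() - ret.mean()

# "Discover" a screen on one batch of data...
g_in, r_in = fake_market()
cutoff = best_cutoff(g_in, r_in)

# ...then test it, untouched, on a second batch it has never seen.
g_out, r_out = fake_market()

print(f"Chosen rule: EPS growth >= {cutoff:.2f}%")
print(f"In-sample edge:     {edge(g_in, r_in, cutoff):+.1f}% vs. the group")
print(f"Out-of-sample edge: {edge(g_out, r_out, cutoff):+.1f}% vs. the group")
# Growth and returns were never related, so the in-sample edge is pure
# curve fitting; out-of-sample it is as likely to be negative as positive.
```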

If our screen is the result of data mining, then it probably WON'T hold up when used on a whole new set of data. If the screen is still significant at 95% confidence (or better), then you can sing its praises.

One last note. There are NO guarantees. Even the most logical, firmly grounded, and well-tested screens may be due to nothing more than blind luck. Out of the millions of possible screens out there, one of them's gonna Lotto. However, I feel a lot better following a thoroughly tested and rational screen that's beaten the index than I would following an underperforming strategy (actively managed mutual funds come to mind here). And remember that many people who put money in the market aren't following any strategy at all!

I hope that these articles have helped many of you understand what we're looking at statistically. I also hope that I haven't offended too many mathematicians out there. (If so, blame it all on my editor. Just kidding, Ann!) Next week I'll still be in Europe and away from the Workshop, but will return the following week. Until then, Fool On!


Meanwhile, drop on by the Foolish Workshop message board and voice your opinion of these rules.
