Foolish Workshop
Ask Not for Whom the Bell Curve Tolls
By Todd Beaird (TMF Synchronicity)
DES PLAINES, IL (Sept. 30, 1999) - Last week we were tossing grapefruit everywhere, trying to get at the slippery concepts of a normal distribution curve (also called a bell curve) and standard deviation. This week we'll look a little closer at these ideas and show how they apply to our mechanical stock screens.
Remember that a normal distribution curve shows us how frequently each value occurs. For most naturally occurring phenomena, about two-thirds of all values fall within one standard deviation of the average (either higher or lower), and about 95% of all values fall within two standard deviations.
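Here's a quick Python sketch that checks that rule of thumb by simulating a pile of normally distributed values and counting how many land within one and two standard deviations of the mean. The mean and standard deviation below are arbitrary, chosen just for illustration.

```python
# Sanity-check the two-thirds / 95% rule of thumb with simulated data.
import random

random.seed(1)
avg, sd = 100.0, 10.0
values = [random.gauss(avg, sd) for _ in range(100_000)]

within_1sd = sum(abs(v - avg) <= sd for v in values) / len(values)
within_2sd = sum(abs(v - avg) <= 2 * sd for v in values) / len(values)

print(f"Within 1 SD: {within_1sd:.1%}")   # roughly 68%
print(f"Within 2 SD: {within_2sd:.1%}")   # roughly 95%
```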
Grapefruit Distance
Looking back at our grapefruit chucking example, we determined that the average value, or mean distance the fruit flew, was 60 feet. To find the standard deviation, we looked at the point on the graph (reproduced below) where the shoulders of the curve shifted from "more out than down" to "more down than out."
This point (determined mathematically, not by eye) defines the standard deviation for a bell curve. Run a line down from that point and find the value. Subtract that value from the mean to find the standard deviation. Double that value and you will find the point on the graph that corresponds to two standard deviations, i.e., the point at which the line shifts from "more down than out" back to "more out than down."
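If you'd rather skip the graph-reading, the same two numbers can be computed straight from the raw data. Here's a short Python sketch; the throw distances below are made up purely for illustration.

```python
# Compute the mean and standard deviation directly from raw throw data,
# rather than eyeballing the shoulders of the curve. Distances are invented.
from statistics import mean, pstdev

distances = [52, 61, 58, 73, 60, 66, 48, 59, 71, 55]  # feet (hypothetical)

avg = mean(distances)    # the center of the bell curve
sd = pstdev(distances)   # population standard deviation

print(f"Mean distance: {avg:.1f} ft")
print(f"Standard deviation: {sd:.1f} ft")
print(f"One SD above the mean: {avg + sd:.1f} ft")
```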
Every normal distribution is defined by its mean and standard deviation. From these two numbers we can determine the probability that a person chosen at random will throw a grapefruit more than 60 feet (50% of all people), more than 75 feet (about 16%), greater than 90 feet (about 2.5%), or over 105 feet (fat chance). (It's actually less than 0.20% of all people, or about the expense ratio of Vanguard's index fund. I think "fat chance" is close enough.)
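Here's a quick sketch that reproduces those probabilities from nothing but the mean and standard deviation. A standard deviation of 15 feet is assumed, since 75 feet sits one standard deviation above the 60-foot mean.

```python
# Tail probabilities for the grapefruit distribution: mean 60 ft, SD 15 ft
# (the SD is inferred from the example above, not measured data).
from statistics import NormalDist

throws = NormalDist(mu=60, sigma=15)

for cutoff in (60, 75, 90, 105):
    tail = 1 - throws.cdf(cutoff)   # fraction of people who throw farther
    print(f"Farther than {cutoff} ft: {tail:.2%}")

# Prints roughly 50%, 16%, 2.3%, and 0.13% -- "fat chance" indeed.
```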
Imagine that we have to select potential candidates for the U.S. Olympic Grapefruit Tossing Team. Problem is, we only have a few weeks to get the team together and we're not about to let every Tom, Dick, and Jane try out. How can we identify people who are likely to be much better at grapefruit tossing than average so we can invite them, and only them, to try out?
Many people would come up with some preconceived notions of what makes a good grapefruit thrower (based on talks with analysts in the grapefruit-throwing business, perhaps), scout out those types of people ("Look, that guy is tall and muscular with large hands -- bet he can throw a grapefruit a mile"), and ask them to try out. That might work fine, but a Foolish Workshop regular would do things a little differently.
Rather than guess, we would analyze great citrus throwers of the past, looking for common traits. (Maybe long arms are far more important than big hands.) Then we would take a largish group of people (like the thousand we looked at last week), use those traits to screen for applicants that meet those criteria, and see how well they throw. If the screen is successful at selecting good grapefruit tossers, we continue to use it, and maybe even look for ways to improve it.
Switching from grapefruit throwing to stock screening, we are looking for better-than-average returns. To build a strategy, we look for traits that are associated with stocks that do better than average (high relative strength ranking or earnings per share growth, for example). Then we go back in time and select stocks that score high in those traits -- we "screen" for those specific criteria.
Just as in the grapefruit example, we try out those stock screens. Because we are using objective, quantifiable criteria to determine our stocks ("choose the five stocks with the highest relative strength over the past 26 weeks"), we can look at how this strategy would have worked in the past, as far back as we can get good data.
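To make that concrete, here's a minimal sketch of what such a screen might look like in Python. The tickers, relative strength scores, and earnings growth figures are all invented for illustration; they aren't drawn from any real screen or data source.

```python
# A hypothetical screen: rank a universe of stocks by 26-week relative
# strength and keep the top five. All data below is made up.
stocks = [
    {"ticker": "AAA", "rel_strength_26wk": 92, "eps_growth": 0.35},
    {"ticker": "BBB", "rel_strength_26wk": 88, "eps_growth": 0.10},
    {"ticker": "CCC", "rel_strength_26wk": 97, "eps_growth": 0.42},
    {"ticker": "DDD", "rel_strength_26wk": 71, "eps_growth": 0.55},
    {"ticker": "EEE", "rel_strength_26wk": 85, "eps_growth": 0.05},
    {"ticker": "FFF", "rel_strength_26wk": 99, "eps_growth": 0.20},
    {"ticker": "GGG", "rel_strength_26wk": 64, "eps_growth": 0.60},
]

# Sort by relative strength, highest first, and take the top five.
picks = sorted(stocks, key=lambda s: s["rel_strength_26wk"], reverse=True)[:5]

for s in picks:
    print(s["ticker"], s["rel_strength_26wk"])
```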
When backtesting, we run the screen using only the data available to an investor at the beginning of each test period. Then we shift forward in time to see how those stocks performed. If we graphed the results from this process, we would get another bell curve -- not a perfect one, because we don't usually have enough data points to fill in all the gaps, but close enough to work with.
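In code, that backtest is just a loop: screen with the data known at the start of each period, then measure how the picks did over the period that followed. The sketch below fakes both the "as of" data and the subsequent returns, so treat it as a skeleton rather than a real backtest.

```python
# Skeleton of a point-in-time backtest. Each pass screens using only data
# dated at the start of the period, then records the picks' return over
# that period. Scores and returns are randomly generated stand-ins.
import random

random.seed(3)
periods = range(10)                  # ten hypothetical holding periods
universe = ["AAA", "BBB", "CCC", "DDD", "EEE"]

rel_strength = {p: {t: random.random() for t in universe} for p in periods}
next_return = {p: {t: random.gauss(0.03, 0.10) for t in universe} for p in periods}

screen_returns = []
for p in periods:
    # Screen using only information available at the start of the period...
    picks = sorted(universe, key=lambda t: rel_strength[p][t], reverse=True)[:2]
    # ...then shift forward and record how those picks actually did.
    screen_returns.append(sum(next_return[p][t] for t in picks) / len(picks))

print(" ".join(f"{r:+.1%}" for r in screen_returns))
```

Each pass through the loop produces one data point -- one period's return -- and the pile of those data points is what forms the bell curve described above.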
We can compare these results to the average returns for all stocks, and ask, "How likely is it that this performance resulted from luck?"
This is slightly different from our backtest above. We already know that the returns are far higher than average. Now we want to know: "What is the probability that the results from a group of five stocks (for example) will do X% better than average over a 10-year period?" Instead of one data point (one person making one throw), we're taking one bell curve (the curve that represents all the returns from our screen) and comparing it to another bell curve (which represents the results for the market as a whole during the same time period).
The most common test for this is called a T-test. Basically, we want the bell curve from our screen to 1) have a higher average than the market, 2) be based on lots of data points (outperforming in one year is no big deal; outperforming over 38 years is a different story), and 3) have a fairly low standard deviation. If you have two screens that return 20% per year, you'd rather have one with two-thirds of its returns ranging between 15% and 25% (SD = 5%) than one with a range from -30% to +70% (SD = 50%).
Once we have our data, we can run our T-test to determine the probability that our results were just a fluke. If that probability is less than 5% (luck alone would produce results this good less than one time in 20), we can say that our screen is statistically significant at a 95% confidence level. Roughly speaking, that means the high returns are very likely due to some underlying cause rather than to the random scatter of the data. Most studies use a minimum confidence level of 90%, and often demand a confidence level as high as 99%.
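For the curious, a statistics package will do this comparison in a couple of lines. The sketch below runs SciPy's two-sample t-test on made-up yearly returns for a screen and for the market; the numbers are invented, not actual Workshop results.

```python
# Compare a screen's yearly returns to the market's with a two-sample t-test.
# Both return series below are invented; plug in your own backtest output.
from scipy.stats import ttest_ind

screen_returns = [0.22, 0.15, 0.31, 0.18, 0.25, 0.09, 0.27, 0.20, 0.16, 0.24]
market_returns = [0.11, 0.08, 0.14, 0.02, 0.12, -0.05, 0.10, 0.09, 0.07, 0.13]

stat, p_value = ttest_ind(screen_returns, market_returns)

print(f"t-statistic: {stat:.2f}")
print(f"p-value: {p_value:.3f}")

# If p_value < 0.05, the difference in average returns clears the 95%
# confidence bar described above.
```

SciPy's default here assumes the two groups have similar variances; passing equal_var=False switches to the more conservative Welch version of the test.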
These numbers don't tell us everything. A confidence level of 95% may sound impressive, but remember, we go through a LOT of screen ideas. If a screen has a 1 in 20 chance of being a fluke and you're trying your 21st screen of the week, then what?
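The arithmetic behind that worry is simple enough to check: even if every screen you test is pure luck, the odds that at least one of them clears the 95% bar climb quickly with the number of screens you try.

```python
# If every screen tested is really just luck, how likely is it that at least
# one of them still looks "statistically significant" at the 95% level?
for n_screens in (1, 5, 10, 21):
    p_at_least_one_fluke = 1 - 0.95 ** n_screens
    print(f"{n_screens:2d} screens tried: "
          f"{p_at_least_one_fluke:.0%} chance of at least one false alarm")

# With 21 screens, there's roughly a 66% chance of at least one fluke "winner."
```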
There are a few other issues that concern us, such as the ever-present possibilities of datamining and curve-fitting. Next week I'll be coming to you from Europe (courtesy of my day job, unfortunately), and we'll go over those concepts, as well as ways to avoid them. Can you say "out-of-sample testing"?