<FOOLISH FOUR PORTFOLIO>
Coal or Diamonds?
Part 3 of Data Mining
by Ann Coleman (TMF AnnC)
Reston, VA (May 12, 1999) -- Last month, an article critical of the Foolish Four appeared in the Financial Analysts Journal. The main criticism implied that the Foolish Four was not a valid investment strategy because it was the result of "data mining," a concept we have been exploring since Monday.
One thing I need to make clear is that data mining is a charge that is often leveled as though it were the ultimate proof of invalidity. This strikes me as one of those good ideas that has gotten out of hand. Data mining, in its classic definition, i.e., searching through a large number of variables for correlations and then developing an investment strategy based on those correlations, is certainly something investors should be warned about.
But the practice of searching for correlations is not useless. As Bob Price (TMF Sandy), co-discoverer of the RP method along with Elan Caspi, said, carry this vendetta against data mining to its logical conclusion and you've pretty much trashed the scientific method.
The example of bad data mining that McQueen and Thorley use in their article is an interesting one -- apparently a study of variables in the United Nations database "revealed" a high correlation between the Standard & Poor's 500 Index and butter production in Bangladesh. (The Bangladesh Butter Stock Indicator Newsletter should be starting soon!) And we have all seen studies that seem to indicate that certain medical conditions are related to certain foods or additives, only to find out later that such correlations vanished when studied a second time.
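For readers who like to tinker, here is a minimal sketch of why this kind of search produces "discoveries." Everything in it is an assumption made for illustration: the "market" and all 1,000 candidate indicators are pure random noise, and the names `pearson` and `mine_for_correlation` are made up for this example. It picks the indicator that best matches the market in one 20-year period, then checks the same indicator against a second, fresh period.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mine_for_correlation(n_candidates=1000, n_years=20, seed=1999):
    """Search unrelated noise series for the best match to a noise 'market'.

    Returns (correlation in the search period, correlation of that same
    winning series in a second, fresh period).
    """
    rng = random.Random(seed)
    # Hypothetical market returns for two separate periods: pure noise.
    market_first = [rng.gauss(0, 1) for _ in range(n_years)]
    market_second = [rng.gauss(0, 1) for _ in range(n_years)]

    best = None
    for _ in range(n_candidates):
        # Each candidate "indicator" is also noise, unrelated to the market.
        first = [rng.gauss(0, 1) for _ in range(n_years)]
        second = [rng.gauss(0, 1) for _ in range(n_years)]
        r_first = pearson(first, market_first)
        if best is None or abs(r_first) > abs(best[0]):
            best = (r_first, pearson(second, market_second))
    return best


r_in, r_out = mine_for_correlation()
print(f"Best correlation found by searching: {r_in:+.2f}")
print(f"Same indicator, second period:       {r_out:+.2f}")
```

With enough candidates, the search will nearly always turn up an impressive-looking correlation in the first period even though nothing real is there. The second-period figure is just one more random draw, which is why a result like this tends to vanish when studied a second time.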
But what about studies such as those on the long-term effects of smoking on heart disease? When that link was first reported, it was nothing more than a statistical correlation found by searching through a huge database looking for relationships between heart disease and various factors. I remember thinking: I can see a relationship between smoking and lung cancer, but heart disease? But that particular correlation held up in study after study, and eventually some of the causative mechanisms became clear.
While data mining can lead to unreliable conclusions, searching through a database for correlations can also be a valuable tool for finding less obvious truths. So how do you tell the difference?
Well, McQueen and Thorley describe that quite well. In the section of their report called "Too much Digging/Too Little Confidence," they list some warning signs: "The most straightforward warning sign of data mining, if the information is acknowledged by the investigator, is a high number of variables." They also warn of wording such as "noted" or "noticed" and go on to say, "One wonders what else has been tried. Did researchers look at the S&P 500 before the Dow? Did they unsuccessfully search over industries, starting date, P/E, sales levels, number of analysts, length of dividend history, prior returns, book-to-market ratio? We do not know, and that is the point."
Here again the authors are incomprehensibly wondering about something that they could have discovered with a simple phone call or e-mail. I think the problem is that they read The Motley Fool Investment Guide, a book written strictly for the general public, as though it were written for scholarly publication.
As we discussed yesterday, I know of no reason to assume that the original correlation between high yield and price appreciation for Dow stocks was the result of anything other than an obvious conjecture, which proved to be true when studied. I would welcome any information to the contrary, though.
As for our own work, the RP formula that we now use was certainly the result of a study that deliberately looked for a ratio between price and yield that provided the highest returns. The idea was that there might be an optimal ratio between the two factors that were used in Beating the Dow. You can read about The Birth of RP in our archives. Although only two factors were studied (this time -- we don't know about Beating the Dow), the idea of varying the ratio is characteristic of the kind of data mining that McQueen and Thorley are quite rightfully warning us against, and a practice that one would have to look closely at before accepting the results as worthy of consideration.
Here is why I am comfortable with the RP as a stock selection mechanism:
1. The correlation between the RP ratio and higher return was originally found using yearly data only (i.e., portfolios starting in January). When monthly data became available, the correlation held up for portfolios starting in each month over a 37-year time span.
2. The pattern of the results is a fairly smooth curve -- exactly what you would find if the relationship were a valid one. When TMF Sandy (Bob Price) went looking for a correlation between the ratio of dividend and price (later simplified mathematically to yield and price since yield is dividend/price), he varied the ratio in tiny increments, then studied the results. What he found was a bell curve, with the peak right at a ratio of 1/1.5. If the relationship were simply the result of chance, one would have expected the results to be scattered: i.e., a ratio of 1.46 might produce good results but 1.47 would produce much lower returns. Instead, the returns increased fairly smoothly as the ratio approached 1/1.5 and decreased fairly smoothly as that peak was passed. The peak was also a very sharp one -- sticking out well above the rest of the crowd.
These two factors are strong indications that the association is a true one, not the result of a statistical fluke. Sometimes you find a diamond mixed in with all those tons of coal.
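The smoothness argument in the second point can be turned into a rough check. The sketch below is only an illustration under stated assumptions: both curves are synthetic (a made-up single-peaked hump and a made-up random scatter, not actual RP returns), and `count_direction_changes` is a hypothetical helper name. It counts how often a series of returns switches between rising and falling as the ratio is stepped in small increments.

```python
import random


def count_direction_changes(values):
    """Count how many times a series switches between rising and falling."""
    signs = []
    for prev, curr in zip(values, values[1:]):
        if curr > prev:
            signs.append(1)
        elif curr < prev:
            signs.append(-1)
        # Equal neighbors are skipped: they signal no direction.
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Hypothetical parameter grid: a ratio stepped in small increments.
ratios = [i / 100 for i in range(100, 201)]  # 1.00, 1.01, ..., 2.00

# Synthetic "smooth" results: a single hump peaking at 1.5 (made up).
smooth_returns = [-(r - 1.5) ** 2 for r in ratios]

# Synthetic "chance" results: unrelated random numbers (made up).
rng = random.Random(1999)
scattered_returns = [rng.random() for _ in ratios]

print("Smooth curve direction changes:   ",
      count_direction_changes(smooth_returns))
print("Scattered curve direction changes:",
      count_direction_changes(scattered_returns))
```

A curve that climbs steadily to one peak and then falls away changes direction only once, while scattered results flip back and forth many times. This is a crude heuristic rather than a statistical test, so it can support the kind of eyeball judgment described above but not replace a proper analysis.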
LOTS more work could be done on the statistical validity. One of my dreams is to have the time to really dig into the statistics, but the correlation is so strong that I personally don't fear that the RP, as a stock selection method, is going to blow up in my face. At worst, it may simply "revert to the mean," meaning the returns won't be any better than a high yield strategy. I can live with that kind of risk.
Many useful and practical things are not proven to academic standards. In fact, about the only thing most academic researchers will admit is "proven" these days is the Efficient Market Theory, which asserts that you can't beat the market over time.
I guess we will just have to see.
Friday we will finish up with the data mining question, and next week we will look at the other objections raised in the journal article: taxes and risk-adjusted returns.
Fool on and prosper!
05/12/99 Close

Stock   Change      Last
------------------------
CAT     -1 5/16     61.88
JPM     +3 13/16   139.13
MMM     -1 1/8      93.88
IP      +  3/8      54.25

          Day      Month    Year     History
FOOL-4    -0.09%   1.30%    30.62%   32.56%
DJIA      -0.23%   1.96%    20.20%   19.72%
S&P 500   +0.62%   2.16%    11.28%   11.55%
NASDAQ    +1.55%   2.50%    18.87%   20.50%

Rec'd      #    Security       In At     Now       Change
12/24/98   24   Caterpillar     43.08     61.88    43.63%
12/24/98    9   JP Morgan      105.51    139.13    31.86%
12/24/98   14   3M              73.57     93.88    27.60%
12/24/98   22   Int'l Paper     43.55     54.25    24.57%

Rec'd      #    Security       In At     Value     Change
12/24/98   24   Caterpillar    1034.00   1485.00   $451.00
12/24/98    9   JP Morgan       949.62   1252.13   $302.51
12/24/98   14   3M             1030.00   1314.25   $284.25
12/24/98   22   Int'l Paper     958.12   1193.50   $235.38

Dividends Received     $29.45
Cash                   $28.26
TOTAL                $5302.59
</FOOLISH FOUR PORTFOLIO>