Don't worry, dear reader, there will be no gratuitous use of poetry in this missive, despite the title – I promise. However, pause again for a moment and consider what baseball, wine, medical diagnosis, university admissions, criminal recidivism and I might have in common.
The answer is that they all represent realms where simple statistical models have outperformed so-called experts. Long-time readers may recall that a few years ago I designed a tactical asset allocation tool based on a combination of valuation and momentum. At first this model worked just fine, generating signals in line with my own bearish disposition. However, after a few months, the model started to output bullish signals. I chose to override the model, assuming that I knew much better than it did (despite the fact that I had both designed it and back-tested it to prove it worked). Of course, much to my chagrin and the amusement of many readers, I spent about 18 months being thrashed in performance terms by my own model. This is only anecdotal (and economist George Stigler once opined "The plural of anecdote is data"), but it sets the scene for the studies to which I now turn.
The first study I want to discuss is a classic in the field. It centres on the diagnosis of whether someone is neurotic or psychotic. A patient suffering psychosis has lost touch with the external world; whereas someone suffering neurosis is in touch with the external world but suffering from internal emotional distress, which may be immobilising. The treatments for the two conditions are very different, so the diagnosis is not one to be taken lightly.
The standard test to distinguish the two is the Minnesota Multiphasic Personality Inventory (MMPI). This consists of around 600 statements with which the patient must express either agreement or disagreement. The statements range from "At times I think I am no good at all" to "I like mechanics magazines". Fairly obviously, those feeling depressed are much more likely to agree with the first statement than those in an upbeat mood. More bizarrely, those suffering paranoia are more likely to enjoy mechanics magazines than the rest of us!
In 1968, Lewis Goldberg1 obtained access to more than 1000 patients' MMPI test responses and final diagnoses as neurotic or psychotic. He developed a simple statistical formula, based on 10 MMPI scores, to predict the final diagnosis. His model was roughly 70% accurate when applied out of sample. Goldberg then gave MMPI scores to experienced and inexperienced clinical psychologists and asked them to diagnose the patient. As Fig.1 shows, the simple quant rule significantly outperformed even the best of the psychologists.
Even when the results of the rules' predictions were made available to the psychologists, they still underperformed the model. This is a very important point: much as we all like to think we can add something to the quant model output, the truth is that very often quant models represent a ceiling in performance (from which we detract) rather than a floor (to which we can add).
Every so often, and always with the aid of a member of the quant team, I publish a quant note in Global Equity Strategy. The last one was based on the little book that beats the market (see Global Equity Strategy, 9 March 2006). Whenever we produce such a note, the standard response from fund managers is to ask for a list of stocks that the model would suggest. I can't help but wonder if the findings above apply here as well. Do the fund managers who receive the lists then pick the ones that they like, much like the psychologists above selectively using the Goldberg rule as an input?
Similar findings were reported by Leli and Filskov2, in the realm of assessing intellectual deficit due to brain damage. They studied progressive brain dysfunction and derived a simple rule based on standard tests of intellectual functioning. This model correctly identified 83% of new (out of sample) cases.
However, groups of inexperienced and experienced professionals working from the same data underperformed the model with only 63% and 58% accuracy respectively (that isn't a typo; the inexperienced did better than the experienced!). When given the output from the model the scores improved to 68% and 75% respectively – still both significantly below the accuracy rate of the model. Intriguingly, the improvement appeared to depend upon the extent of the use of the model.
Dawes3 gives a great example of the impotence of interviews (further bolstering our arguments as to the pointlessness of meeting company managements – The seven sins of fund management, November 2005). In 1979, the Texas legislature required the University of Texas to increase its intake of medical students from 150 to 200. The prior 150 had been selected by first examining the academic credentials of approximately 2200 students, and then selecting the highest 800. These 800 were called for an interview by the admissions committee and one other faculty member. At the conclusion of the interview, each member of the committee ranked the interviewee on a scale of 0 (unacceptable) to 7 (excellent). These rankings were then averaged to give each applicant a score.
The 150 applicants who ended up going to Texas were all in the top 350, as ranked by the interview procedure. When the school was told to add another 50 students, all that were available were those ranked between 700 and 800. 86% of this sample had failed to get into any medical school at all. No one within the academic staff was told which students had come from the first selection and which had come from the second. Robert DeVaul and colleagues4 decided to track the performance of the two groups at various stages – i.e. the end of the second year, the end of the clinical rotation (fourth year) and after their first year of residency.
The results they obtained showed no difference between the two groups at any point in time; they were exactly equal at all stages. For instance, 82% of each group were granted the M.D. degree, and the proportion granted honours was constant etc. The obvious conclusion: the interview served absolutely no useful function at all.
Between October 1977 and May 1987, 1035 convicts became eligible for parole in Pennsylvania. They were interviewed by a parole specialist who assigned them a score on a five point scale based on the prognosis for supervision, risk of future crime, etc. 743 of these cases were then put before a parole board. 85% of those appearing before the board were granted parole, the decisions (bar one) following the recommendation of the parole specialist.
25% of the parolees were recommitted to prison, absconded, or were arrested for another crime within the year. The parole board predicted none of these. Carroll et al5 compared the accuracy of prediction from the parole board's ranking with that of a prediction based on a three-factor model driven by the type of offence, the number of past convictions and the number of violations of prison rules. The parole board's ranking was correlated 6% with recidivism. The three-factor model had a correlation of 22%.
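To put those two correlations on a common footing, here is a minimal arithmetic sketch. It assumes that "correlated 6%" and "a correlation of 22%" mean correlation coefficients of 0.06 and 0.22, and it uses the standard result that the square of a correlation gives the share of variance explained. The function name is purely illustrative.

```python
def variance_explained(correlation: float) -> float:
    """Share of variance explained by a predictor with this correlation (r squared)."""
    return correlation ** 2


# Assumed reading of the figures quoted in the text.
board_r = 0.06   # parole board's ranking versus recidivism
model_r = 0.22   # three-factor model versus recidivism

board_share = variance_explained(board_r)
model_share = variance_explained(model_r)

print(f"Parole board: {board_share:.2%} of variance explained")
print(f"Three-factor model: {model_share:.2%} of variance explained")
print(f"Ratio (model / board): {model_share / board_share:.1f}x")
```

On this reading neither predictor is strong in absolute terms, but the simple model explains many times more of the variation in outcomes than the board's ranking does.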
So far we have tackled some pretty heavy areas of social importance. Now for something lighter. In 1995, a classic quant model was revealed to the world: a pricing equation for Bordeaux wine.
Ashenfelter et al6 computed a simple equation based on just four factors: the age of the vintage, the average temperature over the growing season (April-September), the rain in August and September, and the rain during the months preceding the vintage (October-March). This model could explain 83% of the variation of the prices of Bordeaux wines.
Ashenfelter et al also uncovered that young wines are usually overpriced relative to what one would expect based on the weather and the price of old wines. As the wine matures, prices converge to the predictions of the equation. This implies that "bad" vintages are overpriced when they are young, and "good" vintages may be underpriced.
Fig.3 shows the basic pattern. It shows the price of a portfolio of wines from each vintage relative to the (simple average) price of the portfolio of wines from the 1961, 62, 64 and 66 vintages. The second column gives the value of the benchmark portfolio in GBP. The entries for each of the vintages in the remaining columns are simply the ratios of the prices of the wines in each vintage to the benchmark portfolio. The predicted price from the equation is also shown. Incidentally, this data is from a different sample than the original estimation of the equation, so it amounts to an out of sample test7.
Professor Chris Snijders has been examining the behaviour of models versus purchasing managers8. He has examined purchasing managers at 300 different organizations. The results will not be surprising to those reading this note. Snijders concludes "We find that (a) judgments of professional managers are meagre at best, and (b) certainly not better than the judgments by less experienced managers or even amateurs. Furthermore, (c) neither general nor specific human capital of managers has an impact on their performance, and (d) a simple formula outperforms the average (and the above average) manager even when the formula only has half of the information as compared to the manager."
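For readers who want to see what a four-factor equation of this kind looks like in practice, here is a minimal sketch. It assumes a log-linear form (the log of price as a weighted sum of the four factors). The coefficients, field names and units are illustrative placeholders only; they are not the published estimates and should not be used to price anything.

```python
import math
from dataclasses import dataclass


@dataclass
class Vintage:
    age_years: float               # age of the vintage
    growing_season_temp_c: float   # average temperature, April-September
    harvest_rain_mm: float         # rain in August and September
    winter_rain_mm: float          # rain in the preceding October-March


# Placeholder coefficients for illustration; NOT the estimates from the study.
COEF = {
    "intercept": 0.0,
    "age": 0.02,
    "temp": 0.6,
    "harvest_rain": -0.004,
    "winter_rain": 0.001,
}


def predicted_log_price(v: Vintage, coef: dict = COEF) -> float:
    """Log price as a linear function of the four factors (assumed form)."""
    return (
        coef["intercept"]
        + coef["age"] * v.age_years
        + coef["temp"] * v.growing_season_temp_c
        + coef["harvest_rain"] * v.harvest_rain_mm
        + coef["winter_rain"] * v.winter_rain_mm
    )


def predicted_price_ratio(a: Vintage, b: Vintage, coef: dict = COEF) -> float:
    """Predicted price of vintage a relative to vintage b."""
    return math.exp(predicted_log_price(a, coef) - predicted_log_price(b, coef))


# Two hypothetical vintages that differ only in growing-season temperature.
warm = Vintage(age_years=20, growing_season_temp_c=17.5,
               harvest_rain_mm=100, winter_rain_mm=600)
cool = Vintage(age_years=20, growing_season_temp_c=16.5,
               harvest_rain_mm=100, winter_rain_mm=600)

print(f"Warm vintage priced at {predicted_price_ratio(warm, cool):.2f}x the cool one")
```

The point of the sketch is how little machinery is involved: four inputs, four weights, and one line of arithmetic per vintage.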
Ok enough already, you may cry9. I agree. But, to conclude, let me show you that the range of evidence I've presented here is not somehow a biased selection designed to prove my point.
Grove et al10 consider an impressive 136 studies of simple quant models versus human judgements. The range of studies covered areas as diverse as criminal recidivism to occupational choice, diagnosis of heart attacks to academic performance. Across these studies 64 clearly favoured the model, 64 showed approximately the same result between the model and human judgement, and a mere 8 studies found in favour of human judgements. All of these eight shared one trait in common: the humans had more information than the quant models. If the quant models had the same information it is highly likely they would have outperformed.
Fig.4 shows the aggregate average 'hit' rate across the 136 studies that Grove et al examined. The average person in the study (remember they were all specialists in their respective fields) got 66.5% of the cases they were presented with correct. However, the quant models did significantly better with an average hit ratio of 73.2%.
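The tallies quoted above can be turned into shares with a few lines of arithmetic. The sketch below uses only the counts and hit rates given in the text; the variable names are illustrative.

```python
# Study counts as quoted from Grove et al.
favoured_model = 64
roughly_equal = 64
favoured_human = 8
total = favoured_model + roughly_equal + favoured_human

shares = {
    "model better": favoured_model / total,
    "about the same": roughly_equal / total,
    "human better": favoured_human / total,
}

# Average hit rates (percent) as quoted in the text.
human_hit_rate = 66.5
model_hit_rate = 73.2
gap_points = model_hit_rate - human_hit_rate

print(f"Total studies: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Model advantage: {gap_points:.1f} percentage points")
```

Put another way, human judgement came out ahead in roughly one study in seventeen.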
As Paul Meehl (one of the founding fathers of the importance of quant models versus human judgements) wrote: "There is no controversy in social science which shows such a large body of qualitatively diverse studies coming out so uniformly in the same direction as this one… predicting everything from the outcomes of football games to the diagnosis of liver disease and when you can hardly come up with a half a dozen studies showing even a weak tendency in favour of the clinician, it is time to draw a practical conclusion."
The good news is that in some fields quant models have become far more accepted. For instance, Fig.5 shows the number of states in which the decision to parole has a quant prediction instrument involved. However, in the field of finance most still shy away from an explicit quant process. A few brave souls have gone down this road. Two explicitly behavioural finance groups stand out as using an explicitly quantitative process – LSV and Fuller & Thaler. Fig.6 shows the performance of their funds relative to benchmark since inception. With only one exception all of these funds have delivered pretty significant positive alpha. Of course, this doesn't prove that quant investing is superior; I would need a much larger sample to draw any valid conclusions. But it is a nice illustration of the point I suspect is true.
So why do so few in finance follow an explicit quant process? The most likely answer is overconfidence. We all think that we know better than simple models. The key to the quant model's performance is that it has a known error rate while our error rates are unknown.
The most common response to these findings is to argue that surely a fund manager should be able to use quant as an input, with the flexibility to override the model when required. However, as mentioned above, the evidence suggests that quant models tend to act as a ceiling rather than a floor for our behaviour. Additionally there is plenty of evidence to suggest that we tend to overweight our own opinions and experiences against statistical evidence. For instance, Yaniv and Kleinberger11 have a clever experiment based on general knowledge questions such as: In which year were the Dead Sea scrolls discovered?
Participants are asked to give a point estimate and a 95% confidence interval. Having done this they are then presented with an advisor's suggested answer, and asked for their final best estimate and rate of estimates. Fig.7 shows the average mean absolute error in years for the original answer and the final answer. The final answer is more accurate than the initial guess.
The most logical way of combining your view with that of the advisor is to give equal weight to each answer. However, participants were not doing this (they would have been even more accurate if they had done so). Instead they were putting a 71% weight on their own answer. In over half the trials the weight on their own view was actually 90-100%! This represents egocentric discounting – the weighing of one's own opinions as much more important than another's view.
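The cost of egocentric discounting can be sketched with a little arithmetic. The example below assumes, purely for illustration, that your own estimate and the advisor's are both unbiased, with independent errors of equal size. Under those assumptions the error variance of a weighted average is proportional to w² + (1 − w)², where w is the weight on your own answer, and this is smallest at w = 0.5. The two guesses used are hypothetical.

```python
def combine(own: float, advisor: float, weight_on_own: float) -> float:
    """Weighted average of your own estimate and the advisor's."""
    return weight_on_own * own + (1.0 - weight_on_own) * advisor


def error_variance_multiplier(weight_on_own: float) -> float:
    """Error variance of the combined estimate relative to a single estimate.

    Assumes both estimates are unbiased, with independent, equal-variance errors.
    """
    return weight_on_own ** 2 + (1.0 - weight_on_own) ** 2


# Hypothetical guesses at a year, for illustration only.
own_guess, advisor_guess = 1930.0, 1950.0

for w in (0.5, 0.71, 1.0):
    estimate = combine(own_guess, advisor_guess, w)
    multiplier = error_variance_multiplier(w)
    print(f"weight on own = {w:.2f}: estimate = {estimate:.1f}, "
          f"relative error variance = {multiplier:.4f}")
```

Under these assumptions, equal weighting halves the error variance of a single estimate, a 71% weight on your own view gives up part of that gain, and a weight of 100% throws the advisor's information away entirely.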
Similarly, Simonsohn et al12 showed that in a series of experiments direct experience is frequently much more heavily weighted than general experience, even if the information is equally relevant and objective. They note, "If people use their direct experience to assess the likelihood of events, they are likely to overweight the importance of unlikely events that have occurred to them, and to underestimate the importance of those that have not". In fact, in one of their experiments, Simonsohn et al found that personal experience was weighted twice as heavily as vicarious experience! A two-to-one weighting amounts to roughly 67% on one's own experience, an uncannily close estimate to the 71% obtained by Yaniv and Kleinberger in an entirely different setting.
Grove and Meehl13 suggest many possible reasons for ignoring the evidence presented in this note; two in particular stand out as relevant to the discussion here. Firstly, the fear of technological unemployment. This is obviously an example of a self serving bias. If, say, 18 out of every 20 analysts and fund managers could be replaced by a computer, the results are unlikely to be welcomed by the industry at large. Secondly, the industry has a large dose of inertia contained within it. It is pretty inconceivable for a large fund management house to turn around and say they are scrapping most of the processes they had used for the last 20 years, in order to implement a quant model instead.
Another consideration may be the ease of selling. We find it 'easy' to understand the idea of analysts searching for value, and fund managers rooting out hidden opportunities. However, selling a quant model will be much harder. The term 'black box' will be bandied around in a highly pejorative way. Consultants may question why they are employing you at all, if 'all' you do is turn up and run the model and then walk away again.
It is for reasons like these that quant investing is likely to remain a fringe activity, no matter how successful it may be.
1 Goldberg (1968) Simple models or simple processes? Some research on clinical judgements, American Psychologist, 23
2 Leli and Filskov (1981) Clinical-Actuarial detection and description of brain impairment with the W-B Form 1, Journal of Clinical Psychology, 37
3 Dawes (1989) House of Cards: Psychology and Psychotherapy built on myth, Free Press
4 DeVaul, Jervey, Chappell, Carver, Short, and O'Keefe (1987) Medical school performance of initially rejected students, Journal of American Medical Association, 257
5 Carroll, Winer, Coates, Galegher and Alibrio (1988) Evaluation, Diagnosis, and Prediction in Parole Decision Making, Law and Society Review, 17
6 Ashenfelter, Ashmore and LaLonde (1995) Bordeaux Wine Vintage Quality and the Weather, Chance
7 For those with an interest, Ashenfelter publishes a news letter called liquid asset.
9 For those wondering about the mention of baseball at the beginning, this wasn't a red herring… Michael Lewis' book Moneyball is a great example of the triumph of statistics over judgement.
10 Grove, Zald, Lebow, Snitz, Nelson (2000) Clinical Versus Mechanical Prediction: A meta-analysis, Psychological Assessment, 12
11 Yaniv and Kleinberger (2000) Advice taking in decision making: Egocentric discounting and reputation formation, Organizational Behavior and Human Decision Processes, 83
12 Simonsohn, Karlsson, Loewenstein and Ariely (2004) The tree of experience in the forest of knowledge: Overweighting personal over vicarious experience