Ed. note: This post is an updated version of one originally published in March 2016.
Expert judgment, like the Internet, runs from the sublime to the sordid. Since its halting entry into the Halls of Science with the Delphi studies of the 1960s, expert judgment has remained something of an embarrassment: scientists use it all the time but would rather not talk about it. Now it is poised to become respectable. Here’s my latest take on some leading indicators:
Bamber and Aspinall’s 2013 paper “An expert judgment assessment of future sea level rise from the ice sheets”* was selected in 2016 as one of ten articles to highlight research in Nature Climate Change over the previous five years. Healthy suspicion within the science community was allayed by the “classical model for structured expert judgment,” the hallmark of which is empirical validation with performance-based weighted combinations of experts’ judgments. Exhortations to take on climate uncertainty have appeared in Nature Climate Change, and the National Academy of Sciences advocates the use of structured expert judgment in quantifying the social cost of carbon.
Other recent events also signal expert judgment’s ascendancy. Climate gadfly Judith Curry penned an excellent blog post in 2015 on expert judgment and rational consensus, emphasizing the risks of confusing consensus with certainty. Also in 2015, Australian biologist and biosecurity expert Mark Burgman’s Trusting Judgment hit the bookshelves, with exhaustive reviews of the sordid side of expert judgment. This followed Sutherland and Burgman’s piece in Nature on using experts wisely and Aspinall’s appeal for a “route to more tractable expert advice.”
Building on the pioneering work of Eggstaff et al. on cross-validation, Abigail Colson and I recently demonstrated out-of-sample validity for the set of professional studies conducted after 2006. Elsewhere, highly visible applications of expert judgment appearing in top-tier scientific journals have targeted the Asian carp invasion of Lake Erie and nitrogen runoff in the Chesapeake Bay, both with out-of-sample validation. In 2016, the World Health Organization (WHO) completed a structured expert judgment study of food-borne diseases with empirical validation on a massive scale: 74 experts, distributed over 134 panels averaging 10 experts each, quantified uncertainty in transmission rates of pathogens through food pathways for different regions of the world. A study on the effect of breastfeeding on IQ was just completed.
The world of expert judgment divides into two hemispheres. The science/engineering hemisphere usually works with small numbers (on the order of 10) of carefully selected experts, asks them about uncertain quantities with a continuous range, and propagates the results through numerical models. The psychology hemisphere estimates probabilities of future newsworthy events. Philip Tetlock’s Good Judgment Project was proclaimed the winner of a five-year forecasting tournament organized by the Intelligence Advanced Research Projects Activity, which used the Brier score to evaluate forecasters (a measure disparaged in the classical model for confounding statistical accuracy and informativeness). Drawing from a pool of more than 3,000 experts and skimming off the top 2 percent, Tetlock’s group distilled a small cadre of “superforecasters.” With a small fraction of Tetlock’s resources, Burgman’s “Australian Delphi” method (based on the classical model with Delphi-like add-ons) is said to have made a strong showing, though data and analysis from the tournament have not been released.
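For readers who have not met it, the Brier score mentioned above is simply the mean squared difference between forecast probabilities and realized outcomes; lower is better. A minimal sketch, with invented toy forecasts and without reproducing the tournament’s exact scoring variant:

```python
# Minimal sketch of the (binary) Brier score; the forecasts below are
# invented for illustration only.
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((forecasts - outcomes) ** 2))

# Example: a confident forecaster vs. a hedged one, on the same three events.
print(brier_score([0.9, 0.9, 0.9], [1, 1, 0]))  # about 0.277
print(brier_score([0.6, 0.6, 0.6], [1, 1, 0]))  # about 0.227
```

A single number of this kind rewards being right and being decisive at the same time, which is precisely the confounding of statistical accuracy and informativeness that the classical model objects to.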
In applications of the classical model, experts are typically asked to assess the 5th, 50th, and 95th percentiles both for the continuous quantities of interest and for roughly 10 calibration variables from their field, the true values of which are known post hoc. Experts are scored on statistical accuracy and informativeness. If only 2 of 10 calibration-variable realizations fall within an expert’s 90 percent central confidence bands, the expert receives a low statistical accuracy score. Informativeness is measured as the degree to which an expert’s percentiles are close together. (Proper definitions and data are freely available.) The two scores are negatively correlated, though the WHO data in Figure 1 show that the correlation attenuates as we down-select to statistically more accurate experts.
Figure 1. Rolling Rank Correlations of Informativeness and Statistical Accuracy for Subsets of Successively More Statistically Accurate Experts
Source: Aspinall et al. (2016).
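To make the scoring concrete, here is a minimal Python sketch of the two scores, not the production implementation of the classical model. The bin probabilities, the chi-square approximation for statistical accuracy, and the relative-entropy measure of informativeness follow the published description of the method; taking each variable’s intrinsic range as a given input is a simplification (in practice it is built from all experts’ quantiles and the realization, extended by an overshoot).

```python
# Hedged sketch of classical-model scoring; not the official software.
import numpy as np
from scipy.stats import chi2

# Probability mass an expert implicitly assigns to each interquantile bin.
BIN_PROBS = np.array([0.05, 0.45, 0.45, 0.05])

def statistical_accuracy(quantiles, realizations):
    """P-value that the realizations behave like draws from the expert's quantiles.

    quantiles:    (N, 3) array of 5th/50th/95th percentiles for N calibration variables.
    realizations: (N,) array of the true values, known post hoc.
    """
    # Which interquantile bin (0..3) does each realization fall into?
    bins = np.array([np.searchsorted(q, r) for q, r in zip(quantiles, realizations)])
    s = np.bincount(bins, minlength=4) / len(bins)            # empirical bin frequencies
    mask = s > 0
    kl = np.sum(s[mask] * np.log(s[mask] / BIN_PROBS[mask]))  # relative entropy I(s; p)
    # 2*N*I(s; p) is asymptotically chi-square with 3 degrees of freedom.
    return 1.0 - chi2.cdf(2 * len(bins) * kl, df=3)

def informativeness(quantiles, intrinsic_ranges):
    """Average relative entropy of the expert's distributions against a uniform background."""
    scores = []
    for (q05, q50, q95), (lo, hi) in zip(quantiles, intrinsic_ranges):
        # Uniform-background mass of each bin; assumes lo < q05 <= q50 <= q95 < hi.
        widths = np.diff([lo, q05, q50, q95, hi]) / (hi - lo)
        scores.append(np.sum(BIN_PROBS * np.log(BIN_PROBS / widths)))
    return float(np.mean(scores))

def performance_weight(quantiles, realizations, intrinsic_ranges, cutoff=0.05):
    """Unnormalized weight: statistical accuracy times informativeness, zeroed below a cutoff."""
    sa = statistical_accuracy(quantiles, realizations)
    return sa * informativeness(quantiles, intrinsic_ranges) if sa >= cutoff else 0.0
```

Running statistical_accuracy on the 2-of-10 example above gives a p-value far below the usual 0.05 cutoff, so such an expert would receive essentially zero weight in a performance-based combination.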
Unlike forecasts of newsworthy current events, science/engineering studies do not have access to thousands of experts and years of data per expert panel. Rather, “in-sample” validation looks at performance on the calibration variables, and “cross-validation” initializes the weighting model on subsets of calibration variables and gauges performance on the complementary set. In 2014, Eggstaff and colleagues developed cross-validation for the extensive database of classical model applications. The performance ratios for performance-based and equal weighting in Figure 2 speak for themselves.
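Here is a hedged sketch of that cross-validation loop in the spirit of Eggstaff et al., reusing statistical_accuracy, informativeness, and performance_weight from the previous sketch. The piecewise-uniform expert densities, the grid inversion of the mixture CDF, and the equal-weight fallback are simplifying assumptions for illustration; the published studies use the full classical-model machinery.

```python
# Sketch only: train on subsets of calibration variables, test on the rest.
from itertools import combinations
import numpy as np

def mixture_quantiles(expert_quantiles, weights, intrinsic_range, levels=(0.05, 0.5, 0.95)):
    """5/50/95 percentiles of the weighted mixture of experts' piecewise-uniform distributions."""
    lo, hi = intrinsic_range
    grid = np.linspace(lo, hi, 2001)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    cdf = np.zeros_like(grid)
    for wi, (q05, q50, q95) in zip(w, expert_quantiles):
        edges = np.array([lo, q05, q50, q95, hi])
        cdf += wi * np.interp(grid, edges, [0.0, 0.05, 0.5, 0.95, 1.0])
    return np.interp(levels, cdf, grid)   # invert the mixture CDF on the grid

def pw_ew_ratio(assessments, realizations, intrinsic_ranges, train_size):
    """Average PW/EW combined-score ratio over all training sets of a given size.

    assessments: (E, N, 3) array of E experts' 5/50/95 percentiles for N calibration variables.
    """
    E, N, _ = assessments.shape
    ratios = []
    for train in map(list, combinations(range(N), train_size)):
        test = [i for i in range(N) if i not in train]
        # 1. Weight experts using the training variables only; fall back to equal
        #    weights if no expert passes the cutoff (an illustrative choice).
        w_pw = np.array([performance_weight(assessments[e, train], realizations[train],
                                            intrinsic_ranges[train]) for e in range(E)])
        if w_pw.sum() == 0:
            w_pw = np.ones(E)
        w_ew = np.ones(E)
        # 2. Build each decision maker's quantiles on the held-out variables and score it.
        scores = {}
        for name, w in (("PW", w_pw), ("EW", w_ew)):
            q = np.array([mixture_quantiles(assessments[:, i], w, intrinsic_ranges[i])
                          for i in test])
            scores[name] = (statistical_accuracy(q, realizations[test])
                            * informativeness(q, intrinsic_ranges[test]))
        ratios.append(scores["PW"] / scores["EW"])
    return float(np.mean(ratios))
```

Ratios above one mean the performance-weighted combination beat equal weighting on the held-out variables; that, computed with the real machinery rather than this toy, is what Figure 2 reports, study by study.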
Both hemispheres agree that measuring expert performance and using performance-based combinations pay off.
Figure 2. Performance Weight/Equal Weight (PW/EW) Ratios for 62 Studies
Note: The ratios concern combined scores for statistical accuracy and informativeness, aggregated over all test/training sets within each study.
Source: Cooke (2015).
*At a 2013 RFF event, Bamber, Aspinall, and other experts discussed the ice sheets covering Antarctica and Greenland, which pose both the greatest risk and the greatest uncertainty for future sea-level rise and are considered among the most serious hazards of future climate change. Watch the video.