Researchers can use randomized controlled trials to credibly establish whether a variable of interest causes an observed effect. However, randomized controlled trials are often complex, expensive, and impractical to implement. Recent research from RFF scholars shows that an alternative approach, using machine learning, can help economists link cause and effect without major investments in big experiments.
Economists (and a recent Nobel Prize committee) love field experiments, because randomization can address concerns about unwanted (and unknowable) confounding effects that threaten to mess up an analysis. For example, suppose an economist wants to know the effect of a change in electricity pricing on electricity consumption. The most convincing way to estimate this effect is to randomize consumers into two groups and subject one group (the treatment group) to a change in price and the other group (the control group) to no change. The economist can then compare the electricity consumption of each group to estimate the sensitivity of demand to higher prices. Because consumers were randomly assigned to the two states of the world (one with a price change and one without), we can be confident that the only difference between the two groups is the price change itself. The control group serves as a “counterfactual” for the treatment group’s energy use; in other words, the control group acts as a measure of how much energy the treatment group might have used in the absence of the treatment.
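As a toy illustration of that comparison, the sketch below simulates a hypothetical randomized experiment (all numbers are invented for illustration) and estimates the treatment effect as a simple difference in group means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated experiment: 1,000 households randomized
# into treatment (a price increase) and control (no change).
n = 1000
treated = rng.random(n) < 0.5

# Baseline hourly consumption (kWh) varies across households;
# assume the price change cuts treated households' use by 10%.
baseline = rng.normal(1.5, 0.3, n)
usage = baseline * np.where(treated, 0.90, 1.0) + rng.normal(0, 0.05, n)

# Because assignment is random, a simple difference in means is an
# unbiased estimate of the average treatment effect.
ate = usage[treated].mean() - usage[~treated].mean()
print(f"Estimated effect: {ate:.3f} kWh (true average effect: -0.15 kWh)")
```

Randomization is doing all the work here: because treatment status is independent of everything else about a household, the control group's mean consumption is a valid counterfactual for the treatment group.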
But running randomized experiments, also known as randomized controlled trials (RCTs), can be costly. And due to regulatory constraints around randomization, an unwillingness of utilities to subject consumers to differential treatment, or just plain practical issues, it’s often not feasible to conduct an RCT. So, economists and other researchers who want to understand the effects of new programs or policy changes often rely on other methods in which observational data, rather than experimental data, can help tease out the effect of a policy change.
One new method that’s growing in popularity uses machine learning (ML) to generate counterfactual predictions of what would have happened in the absence of the policy change or the new program. But how well do these new methods fare in estimating causal treatment effects? In our recent working paper, coauthored with RFF University Fellow Casey Wichman, we explore the ability of ML methods to recover causal effects revealed by RCTs.
ML methods have been applied to study the effectiveness of energy efficiency investments in schools, the effect of carbon taxes on emissions in the United Kingdom, the importance of lender access to information in consumer credit markets, and more. In general, this “ML counterfactual prediction” approach goes like this:

- First, estimate an ML model that does a good job of predicting an outcome of interest—say, a household’s energy consumption—using data from before the treatment was implemented.
- Second, use this ML model to predict what that outcome would be in the absence of any treatment.
- Third, calculate the differences between predicted and actual consumption (i.e., the “prediction error”).
- Finally, use standard econometric models to estimate the effect of the treatment or policy change on this prediction error.

Intuitively, if the treatment has no effect and the ML model is accurate, we wouldn’t expect any change in the prediction error after the treatment is implemented. But if the treatment does have an effect (e.g., reduced energy consumption), we would expect a difference between the counterfactual prediction and the actual consumption.
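The steps above can be sketched in a few lines of Python on simulated data. To keep the sketch dependency-free, an ordinary least squares fit stands in for the ML model (the actual research uses flexible ML algorithms); all data and parameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: hourly consumption for one treated household,
# driven by temperature. Treatment begins at hour 500 and reduces
# consumption by 0.2 kWh per hour thereafter.
hours = 1000
temp = 20 + 10 * np.sin(np.arange(hours) / 24)
true_effect = -0.2
post = np.arange(hours) >= 500
usage = 0.5 + 0.05 * temp + true_effect * post + rng.normal(0, 0.05, hours)

# Step 1: fit a predictive model on pre-treatment data only
# (OLS here, standing in for a flexible ML model).
X = np.column_stack([np.ones(hours), temp])
coef, *_ = np.linalg.lstsq(X[~post], usage[~post], rcond=None)

# Step 2: predict counterfactual (no-treatment) consumption for all hours.
counterfactual = X @ coef

# Step 3: the prediction error is actual minus predicted consumption.
error = usage - counterfactual

# Step 4: estimate the treatment effect from the prediction error;
# with one household this reduces to a pre/post comparison of means.
effect = error[post].mean() - error[~post].mean()
print(f"Estimated treatment effect: {effect:.3f} kWh (true: {true_effect})")
```

Before the treatment starts, the prediction error hovers around zero; afterward it shifts by roughly the size of the true effect, which is exactly the signal the final econometric step picks up.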
The whole discussion up to this point presupposes that we all know what ML is and how it works. Let’s now take a step back. ML uses statistical algorithms to find the combination of available data and model structure that explains observed variation well enough to make accurate predictions. ML models allow for flexible, highly nonlinear relationships among variables while guarding against inadvertently overfitting the data. (Overfitting occurs when a statistical model contorts itself to capture idiosyncrasies of one dataset that are not broadly generalizable, leading to inaccurate predictions when applied in other contexts.) The common approach for balancing these two objectives—ensuring flexibility and avoiding overfitting—is called cross-validation. ML techniques are widely used because they have repeatedly succeeded at predicting outcomes; however, their success in estimating causal effects has not yet been vetted.
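To illustrate how cross-validation balances flexibility against overfitting, the hypothetical sketch below uses 5-fold cross-validation to compare polynomial fits of different degrees on simulated data (the true relationship here is quadratic):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: a noisy quadratic relationship.
x = rng.uniform(-2, 2, 80)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(0, 0.5, 80)

def cv_mse(degree, k=5):
    """Mean squared prediction error of a degree-`degree` polynomial fit,
    estimated by k-fold cross-validation on held-out data."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)          # fit on k-1 folds...
        coef = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coef, x[fold])          # ...predict the held-out fold
        errs.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errs)

# A too-rigid model (degree 1) misses the curvature; a too-flexible one
# (degree 12) chases noise. Held-out error flags both failure modes.
scores = {d: cv_mse(d) for d in [1, 2, 5, 12]}
print("CV error by degree:", scores)
```

Because each model is scored on data it never saw during fitting, cross-validation rewards flexibility only up to the point where it starts capturing noise rather than signal.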
Our new working paper looks at the effectiveness of ML approaches in replicating the results of an actual RCT. Specifically, we consider whether it’s possible to reduce electricity consumption during critical peak hours through the use of peak period pricing and information interventions. That experiment—run by Pecan Street Inc. and evaluated in a paper by Burkhart, Gillingham, and Kopalle—finds that a critical peak pricing intervention yielded a 14 percent reduction in peak energy use for experimental households in Austin, Texas. However, three types of information interventions (one passive, one active, and one with recommendations on how to reduce energy use) had no effect on peak consumption. Using the data from Pecan Street, we test the effectiveness of several popular ML algorithms in replicating the experimental results from Burkhart, Gillingham, and Kopalle. Spoiler alert: we find that ML counterfactual prediction approaches perform quite well.
Because the ability to replicate experimental results could depend in important ways on the types of data available, we explore the ability of ML approaches to replicate the experiment in the context of three types of data samples:
- the original experimental data, including both treatment and control households
- a second sample that includes treatment households alongside another set of comparison households from the same neighborhood that were not part of the original experiment
- a sample that includes just the treatment households, which resembles a situation often encountered by researchers
For each of these data samples, we use three different ML approaches to create counterfactual predictions of hourly, household-specific energy consumption before and after pricing changes and information interventions. These ML approaches include XGBoost, Random Forests, and LASSO; an explanation of all three methods can be found in the accessible and free textbook An Introduction to Statistical Learning. For each sample, we also use a standard difference-in-differences framework—or, in the case of the treatment households only, a simple pre-post comparison—to predict the effect of the pricing and information treatments. For one of the ML approaches, we also explore what combination of explanatory variables is necessary to replicate the results of the RCT.
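For intuition, the difference-in-differences logic can be sketched on simulated data. This toy two-period example (all numbers hypothetical) nets out both the baseline gap between groups and the common time trend to isolate the treatment effect:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 500             # households per group
group_gap = 0.4     # treated households use more electricity at baseline
time_trend = -0.1   # consumption falls for everyone in the post period
effect = -0.3       # true treatment effect (kWh)

def sample(group, period):
    """Simulated mean consumption for one group in one period."""
    mean = (1.5 + group_gap * group + time_trend * period
            + effect * group * period)
    return rng.normal(mean, 0.2, n)

y = {(g, p): sample(g, p) for g in (0, 1) for p in (0, 1)}

# Difference-in-differences: the treated group's pre/post change,
# net of the comparison group's pre/post change.
did = (y[1, 1].mean() - y[1, 0].mean()) - (y[0, 1].mean() - y[0, 0].mean())
print(f"DiD estimate: {did:.3f} kWh (true effect: {effect})")
```

Note that a naive pre/post comparison for the treated group alone would conflate the treatment effect with the common time trend; subtracting the comparison group's change removes that trend, which is why a comparison group is so valuable when one is available.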
Each of the ML approaches that we test replicates the experimental treatment effects well, reflecting both the significant reductions in consumption in response to the peak period pricing treatment and the null effects of the three information treatments—even for the data sample that includes just treatment households. We also find that the simple difference-in-differences approach with a comparison group replicates the RCT results well. We find little difference across the three tested ML approaches in terms of replicating the experimental results, despite differences in their predictive accuracy (Figure 1). And we find that ML methods can replicate experimental treatment effects remarkably well, even with relatively little data available.
Figure 1. Effect of Changing Peak Electricity Pricing among Households in Austin, Texas, Estimated Using Alternative Machine Learning Approaches
These methods and results may be particularly useful in analyzing energy demand and understanding how pricing and other policy interventions affect actual consumption behavior. The growing penetration of smart electricity meters has increased the temporal granularity of data that measure energy consumption, which presents valuable opportunities for better policy evaluation and for the implementation of time-varying prices.
In practice, RCTs continue to be the gold standard for program evaluation, as they can estimate causal effects transparently. But given the increasing availability of high-frequency data on electricity use, our work provides some optimistic reassurance: even without an RCT, these new tricks in the econometric toolbox can help us better understand the effects of various policies on energy demand.