Why it’s challenging to evaluate unconditional cash transfer programs

[Disclosure – I was formerly a research analyst at GiveWell, a nonprofit that recommends unconditional cash transfers as a high impact giving opportunity. The views in this post are mine alone.]

Vipul Naik is worried that the research literature on universal basic income (UBI) may become biased because the most prominent field trials are being run by organizations that are already big believers in the intervention.1 Vipul asks “what can opponents of basic income do to better address this bias, and the way it might skew experimental results?”

I’m not an opponent of basic income. In the long run, I think it might be a very important tool for maintaining wellbeing if we ever enter a world of mass technological unemployment.2 I’m agnostic about whether UBI would be a good policy today. I think a UBI designed to mitigate mass technological unemployment in the long run would be very different (and have very different costs and benefits) from a UBI designed to be helpful today.3 So I do share Vipul’s concern about conflating these arguments for a UBI.

In this post, I write about two challenges of evaluating unconditional cash transfer programs like UBI. Both of these issues could be exacerbated by researcher bias, including unconscious researcher bias.4

I also suggest one answer to Vipul’s question about how UBI skeptics could usefully respond to this research program.

Unconditional cash transfer programs are particularly vulnerable to multiple comparison problems, publication bias, p-hacking, and the garden of forking paths

Unconditional cash transfers are an exciting intervention in part because they allow each recipient to spend her benefits on whatever she determines is her greatest need. One recipient can buy medical supplies, another food, and a third business investments without any program administrator having to figure out that their needs differ. If individuals’ priorities vary a lot according to their environments or preferences, this can be a great advantage over, for example, giving a bed net to every individual across a wide region.

However, this very strength makes the effectiveness of cash transfers particularly difficult to evaluate. We can evaluate the effectiveness of an educational intervention by looking at test scores, graduation rates, or other education-related outcomes. We can evaluate the effectiveness of anti-malarial bednets by looking at changes in malaria infection rates, infant mortality rates, and other outcomes we expect to be affected by malaria. It’s always possible that these narrow interventions have large, unexpected secondary benefits on unrelated outcomes, but we often treat such findings with skepticism because they’re so vulnerable to publication bias, p-hacking, and related statistical problems. But if we expect cash transfers to benefit different recipients in completely different ways, there’s no clean distinction between “primary” and “secondary” outcomes of interest. What outcomes, then, should evaluators test?

Here are some strategies researchers have employed to date, each of which has some weaknesses:

  1. There are dozens of individual studies testing the effects of cash transfers on a wide array of potential outcomes of interest. Some arbitrarily chosen examples include studies finding that unconditional cash transfers can: reduce teenage pregnancy and marriage rates; improve adolescent mental health; increase investment in education; increase consumption of nutritious food (1, 2); be invested in microbusinesses at moderate or large real returns (1, 2, 3, 4); reduce child labor and increase school attendance; empower women (but not reduce intimate partner violence); improve the cognitive development of young children (perhaps by increasing their consumption of nutritious food, early stimulation, or use of preventative health care); and reduce child mortality.5 Each of these effects strikes me as plausible, but there are clearly enormous risks of publication bias and multiple comparison problems.
  2. Some studies reduce the number of comparisons they make by focusing on high-level indices that summarize the effects of cash transfers on well-being or combine many variables into one index. For example, the prominent short-term randomized controlled trial of some of GiveDirectly’s programs in Kenya constructed indices of GiveDirectly’s impact on food security, health, education, psychological well-being, and female empowerment, as well as on non-land assets, non-durable expenditures, and monthly revenue. These indices do somewhat reduce multiple comparison problems and p-hacking if the methods for constructing them are preregistered in advance of the study. However, they still suffer from the “garden of forking paths” because there are so many possible ways to construct them. [EDIT 3/21/17: This was an error. If all the methods are preregistered then the garden of forking paths problem is solved. Thanks to Eric Potash for pointing this out.] Moreover, they can be difficult to interpret and sensitive to scaling (e.g. how “good” is it to improve an index of psychological well-being by 0.25 standard deviations?), they may not be closely related to the underlying concepts we really care about, they can fail to pick up unexpected or narrow effects, and they are often reported in conjunction with narrower outcomes so multiple comparison problems remain.
  3. Some studies require subjects to fill out long, detailed surveys that address a very wide array of potential effects and then attempt to use statistical methods to account for multiple comparisons. For example, the aforementioned evaluation of GiveDirectly reported at least ten measures of psychological well-being, eight categories of consumption, and three categories of business activity. If pre-registered, this does reduce publication bias. One challenge with this method is that, in order to pre-register the correct outcome measures, researchers must anticipate the potential benefits to recipients in advance. More fundamental challenges are the disagreement over the correct statistical adjustments for multiple comparison testing and the fact that these adjustments can substantially reduce a study’s power, increasing the “false negative” rate (a sketch of this trade-off follows this list). This type of study may therefore require a very large number of participants and may be very expensive. Moreover, if each participant in a particular study benefits in a different way (e.g. some use cash for long-term investments, others purchase immediate healthcare), it may be very difficult to pick up on these effects.
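
To make that trade-off concrete, here is a minimal sketch, in Python, of how a standard family-wise correction like Holm-Bonferroni works and why it costs power. The outcome names and p-values are entirely hypothetical, not taken from any actual cash transfer study; the point is only that the more outcomes a study tests, the higher the bar each one must clear.

```python
# Hypothetical p-values for the outcomes of a single (made-up) study.
# None of these numbers come from a real evaluation.
p_values = {
    "food_security_index": 0.021,
    "health_index": 0.18,
    "education_index": 0.047,
    "psych_wellbeing_index": 0.009,
    "female_empowerment_index": 0.31,
    "non_land_assets": 0.004,
    "non_durable_expenditure": 0.038,
    "monthly_revenue": 0.12,
}
alpha = 0.05

# Naive approach: test every outcome at alpha. Even if no true effects exist,
# the chance that at least one outcome looks "significant" grows with the
# number of outcomes tested.
naive_hits = [name for name, p in p_values.items() if p < alpha]

def holm(pvals, alpha=0.05):
    """Holm-Bonferroni step-down correction: compare the i-th smallest p-value
    against alpha / (m - i), which controls the family-wise error rate but
    raises the bar each individual outcome must clear (i.e. reduces power)."""
    m = len(pvals)
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    rejected = []
    for i, (name, p) in enumerate(ordered):
        if p < alpha / (m - i):
            rejected.append(name)
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected

print("significant without correction:", naive_hits)               # 5 of 8 outcomes
print("significant with Holm correction:", holm(p_values, alpha))  # 1 of 8 outcomes
```

In this made-up example, five of the eight outcomes clear the naive 0.05 threshold but only one survives the correction; the correction protects against false positives precisely by sacrificing the power described above.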

Overall, I think the tension between the flexibility of cash transfers and the perils of multiple comparisons is a fairly fundamental research challenge, and unfortunately I can’t offer many solutions other than for consumers of research on cash transfers to evaluate reported results with an extra dose of skepticism.

These ideas are not original, but it’s particularly important for researchers in this field to adhere to standard good research practices (which are unfortunately rare in the social sciences):

  • Pre-registering studies (especially the outcomes that will be measured)
  • Attempting to correct for multiple comparisons
  • Refraining from overemphasizing positive results that are cherry-picked ex post
  • Deemphasizing results from small, low-powered studies in which even statistically significant results are very likely to overestimate the effect size (a simulation illustrating this appears after this list)
  • Making heavy use of exploratory (and potentially qualitative) pilot studies in a wide range of communities prior to carrying out evaluations so researchers can fine tune their sense of what outcomes to measure
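
On the low-power point above, here is a small simulation, with purely illustrative numbers, of why the statistically significant results from underpowered studies systematically overstate effects (sometimes called a “Type M” error): when the standard error is large relative to the true effect, only the lucky overestimates cross the significance threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # true benefit, in standard-deviation units (illustrative)
n_per_arm = 50      # a small study: badly underpowered for an effect this size
n_sims = 20_000

estimates, significant = [], []
for _ in range(n_sims):
    treated = rng.normal(true_effect, 1.0, n_per_arm)
    control = rng.normal(0.0, 1.0, n_per_arm)
    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n_per_arm + control.var(ddof=1) / n_per_arm)
    estimates.append(diff)
    significant.append(abs(diff) > 1.96 * se)

estimates = np.array(estimates)
significant = np.array(significant)

print(f"power (share of studies reaching significance):  {significant.mean():.2f}")
print(f"mean estimated effect, all studies:              {estimates.mean():.3f}")
print(f"mean estimated effect, significant studies only: {estimates[significant].mean():.3f}")
# Power here is under 10%, and the studies that do reach significance report
# effects several times larger than the true 0.1: emphasizing only those
# results badly overstates what the transfers accomplished.
```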

Research into cash transfers is unusually vulnerable to the “cult of statistical significance.” There is a large risk that evaluations finding statistically significant evidence merely that cash transfers have a greater-than-zero net benefit to recipients will be incorrectly interpreted as evidence that cash transfers are a relatively cost-effective intervention.

Social science’s overemphasis on “statistical significance” has already been written about widely. It causes many problems, but here I’m specifically concerned about the difference between “statistical significance” and the more important question of “real-world, practical significance.” When we say that the difference between a treatment and control group is “statistically significant,” all that means is that a difference of that magnitude would be unlikely to occur by random chance if the intervention were having no effect at all. At best, this is mild evidence that the intervention is, indeed, having a non-zero effect. But this focus on whether an intervention is having any effect at all can be a distraction from the more important (and more difficult) question of whether we can be confident the effect is big enough to justify the intervention.
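
As a toy illustration (all numbers made up), a result can decisively reject “no effect at all” while the data simultaneously suggest the effect is nowhere near large enough to justify the program’s cost:

```python
# Hypothetical numbers: suppose an evaluation estimates that transfers raise
# monthly consumption by $1.50 with a standard error of $0.50, and suppose a
# cost-effectiveness analysis says the program is only worth running if the
# true effect is at least $5.00 per month.
estimate = 1.50
se = 0.50
practical_threshold = 5.00

z = estimate / se                                    # 3.0
ci_low, ci_high = estimate - 1.96 * se, estimate + 1.96 * se

print(f"z = {z:.1f} -> statistically significant at the 5% level")
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}]")      # [0.52, 2.48]
print("clears the practical threshold?", ci_low >= practical_threshold)
# The data decisively reject "no effect at all", yet the entire confidence
# interval sits below the level at which the program would be worth its cost:
# the result is statistically significant and, at the same time, evidence
# against practical significance.
```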

For certain policy proposals, this distinction between statistical significance and practical significance is only of moderate importance. At times, policy is able to leverage relatively limited resources to affect a very large population. In such cases, research demonstrating that a policy has any benefits at all might be substantial evidence that the policy is worth carrying out.

For cash transfers this is not the case. Research into cash transfers is particularly vulnerable to the cult of statistical significance for two reasons.

  • Demonstrating that cash transfers have at least some non-zero benefit for their recipients is trivial, but demonstrating that these benefits are worth the cost is hard. Almost everybody agrees that people benefit from getting money – including most critics of cash transfers and most skeptics of social welfare programs. If you give someone money, they’ll almost certainly be able to purchase at least a little more of something they want. So as long as you have enough participants in your study, you’ll almost certainly be able to show that, on average, members of the treatment group got more nutritious food, more healthcare or education, more investment, or more of something else desirable. As argued above, studies will likely ask about all of these things and stumble on at least one statistically significant benefit (see the calculation after this list). However, since almost nobody denies that cash must have some benefits, a statistically significant effect on any of these variables does not in itself constitute new evidence in favor of the intervention. If we don’t vigilantly enforce the distinction between statistically significant and practically substantial effects, we’ll end up promoting cash transfers as a successful program merely because people would rather consume more than less.
  • There is no consensus about what constitutes success for cash transfer programs. There is a conceptually easy solution to the above problem for most social programs because most programs are designed to target a few particular outcomes (say, reductions in the incidence of malaria or improvements in students’ test scores). In these cases, we can ask whether the intervention does a better job of improving this outcome than other known interventions. However, as noted above, the strength of cash is its flexibility. If cash works, different communities and different individuals within a community will benefit from cash in different ways, and often in ways that a researcher would not have predicted in advance. Cash may be an effective intervention because it allows some recipients to purchase anti-malarial bed nets and others to purchase school uniforms. But even if it’s successful, it’s very unlikely to reduce malaria as much as an anti-malaria intervention or improve education as much as an educational one. A successful cash transfer pilot could reduce hunger in one community, while an equally successful cash transfer program might have no effect on hunger at all in a community with different needs. There’s no consensus about the bar for success, and success would look different in different communities anyway. In such cases, “statistical significance” is a (pernicious) attractor as a default measure of success. As results from cash transfer evaluations pour in and we don’t know what to expect, it will be tempting to judge their success based on whether they have statistically significant benefits. We need to avoid this temptation.
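
To put a rough number on how close to automatic a “statistically significant benefit” becomes at scale, here is a back-of-the-envelope power calculation. The effect size and sample size are hypothetical; the arithmetic is just the standard normal approximation for a difference in means.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

tiny_effect = 0.02   # a real but trivially small benefit: 2% of a standard deviation
n_per_arm = 50_000   # a very large two-arm study

se = sqrt(2.0 / n_per_arm)       # SE of a difference in means, unit variance per arm
z_true = tiny_effect / se        # roughly 3.2
power = 1.0 - normal_cdf(1.96 - z_true)

print(f"probability of rejecting 'no effect at all': {power:.2f}")  # about 0.89
# At this scale, finding *a* statistically significant benefit is close to a
# foregone conclusion; it tells us almost nothing about whether the transfers
# clear any cost-effectiveness bar.
```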

Can we mitigate the ill effects of the cult of statistical significance on cash transfers research?

Earlier, Vipul asked what skeptics of UBI could do to ensure that evaluations of the intervention are not biased. I don’t have anything approaching a complete answer, but I can think of some relatively low-cost practices that might reduce the likelihood that unconscious bias leads researchers to fall prey to the cult of statistical significance.

  • Potential consumers of this research (including UBI/cash transfer skeptics) should ask people running evaluations to preregister their opinions of what constitutes success (in addition to preregistering their methodology, which GiveDirectly has done in the past). Of course, we shouldn’t expect researchers to have very precise opinions – it’s up to recipients whether to spend their benefits on food, healthcare, education, or what have you. But success thresholds could be disjunctive. Perhaps an intervention would be considered successful if it led to an X% reduction in short-term hunger, a Y% increase in education, or a Z% improvement in long-run income (or some weighted average of the above). Anything along these lines would guard against the (very human) temptation to define success only after being biased by seeing which benefits a particular pilot happened to produce. The temptation will be strong because, as argued above, a pilot will nearly always have some benefits.
  • Studies are traditionally designed to have sufficient power so that they are very likely to reject the null hypothesis of no effect under the assumption that the “true” effect is of a certain size. In other words, researchers try to make sure that if they are right about the “true effect” of the intervention, the study will be precise enough to show that the effect is greater than zero (but not necessarily precise enough to show that the effect is meaningfully large). This type of research design is not good enough because it fails to ask whether a study will be large enough to answer the questions that actually matter for policy purposes. We should instead push for studies with sufficient power so that if the designers’ priors about “true” effect sizes are correct, the studies are very likely to reject the null hypothesis that the effect is beneath the aforementioned “success” thresholds.6 This will require very large, expensive studies, but studies of this size are required to actually demonstrate a high probability that cash transfers are a successful intervention (a rough illustration of the sample sizes involved follows this list).
  • We should push for studies that use control groups who receive a viable intervention approximately as costly as cash. Advocates for cash transfers have often asked other interventions to prove that they are more cost-effective than cash. In parallel, cash should have to demonstrate that it is more cost-effective than other interventions. There would be difficult methodological issues in comparing other interventions to cash because cash does not target a specific outcome. But I think these issues are worth tackling. At the very least, participants could be asked to rate how big a difference they believed the program made and whether they would choose to remain in the same group if the experiment were repeated.
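
To give a sense of the sample sizes the second bullet implies, here is a rough sketch using the standard two-sample approximation. The effect sizes are hypothetical, and the “success threshold” is the kind of pre-registered smallest effect of interest that footnote 6 describes; the only point is how quickly the required sample grows when the null hypothesis moves from “zero effect” to “effect below the threshold.”

```python
Z_ALPHA = 1.645  # one-sided test at the 5% level
Z_BETA = 1.282   # 90% power

def n_per_arm(assumed_effect, null_effect, sd=1.0):
    """Approximate per-arm sample size for a two-arm comparison of means:
    power the study to reject H0 'effect <= null_effect' when the true effect
    equals the designers' assumed effect (all in the same units as sd)."""
    delta = assumed_effect - null_effect
    return 2.0 * ((Z_ALPHA + Z_BETA) * sd / delta) ** 2

assumed_effect = 0.15     # designers' prior about the true effect, in SD units
success_threshold = 0.10  # a pre-registered smallest effect worth caring about

print(f"n per arm to reject 'effect = 0':     {n_per_arm(assumed_effect, 0.0):,.0f}")                # ~760
print(f"n per arm to reject 'effect <= 0.10': {n_per_arm(assumed_effect, success_threshold):,.0f}")  # ~6,850
# Moving the null from zero to the success threshold shrinks the margin from
# 0.15 to 0.05 SD and multiplies the required sample by (0.15 / 0.05)^2 = 9,
# which is why powering against a real success bar implies much larger studies.
```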

——————————————————————–

1. 1) GiveDirectly’s experiment in Kenya. 2) YC Research’s experiment in Oakland.

2. Although some have pointed out the potential for problems in the very long run as population grows or if a UBI program is expanded to include nonhuman sentient beings (such as nonhuman animals or potentially sentient AI). Some discussion here. Thanks to Carl Shulman for pointing this out.

3. A few links follow for those who want to read more, but they’re far from comprehensive and not necessarily the best sources on the given topics. 1) An argument that today’s AI is not displacing enough jobs to justify basic income. 2) An argument that a basic income program designed to handle truly massive technological disemployment would have to be global (unlike the programs proposed to alleviate poverty today). 3) Some evidence (policy brief; academic paper), drawn from an analogy to lotteries, that a basic income under current conditions would lead people to work less, which might harm the overall economy.

4. Despite these difficulties, I do believe there’s strong evidence that cash transfers improve the lives of the global poor (although I think even more effective interventions exist). See GiveWell’s review of the evidence on cash transfers in the developing world for a good summary. (Note – I wrote a previous version of this report when I was a research analyst at GiveWell). I think the big challenge is estimating the magnitude of these benefits and whether they’re as effective as other attempts to help equally poor populations (such as providing antimalarial bed nets).

5. Examples are all taken from studies cited by GiveWell’s report on unconditional cash transfers.

6. This is sometimes called the “smallest effect size of interest.” https://daniellakens.blogspot.com/2016/05/absence-of-evidence-is-not-evidence-of.html; Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023. Thanks to Carl Shulman for making me aware of these citations.
