The Importance of Heterogeneity in Large-Scale Replications

In a large-scale replication effort, Klein et al. (2018, https://doi.org/10.1177/2515245918810225) investigated the variation in replicability and effect size across many different samples and settings. The authors concluded that, for any given effect being studied, heterogeneity across samples and settings does not explain failures to replicate. In the current commentary, we argue that the heterogeneity observed does have implications for replication failures, as well as for statistical power and theory development. We argue that research questions in psychological science should be contextualized, considering how historical, political, or cultural circumstances might affect study results. We discuss how a perspectivist approach to psychological science is a fruitful way to design research that aims to explain effect size heterogeneity.

The data clearly show that effect sizes vary strongly depending on the topic of study. However, heterogeneity in effect sizes within the effect being studied seems to be dismissed as inconsequential. For example, the authors note "heterogeneity across samples does not provide much explanatory power for failures to replicate." In contrast, here we argue that the variability in effect sizes observed in Many Labs 2 (ML2; Klein et al., 2018) is highly consequential and may in fact explain failures to replicate. Additionally, we describe approaches to the research process that should lead to greater insights from large-scale replication projects such as ML2.

Heterogeneity in Many Labs 2
Although the authors report multiple measures of heterogeneity, here we focus on I² for concision and ease of interpretation. I² "describes the percentage of total variation across studies that is due to heterogeneity rather than chance" (Higgins et al., 2003, p. 558). This metric runs from 0% to 100%, with 0% indicating no observed heterogeneity and greater values indicating greater heterogeneity.
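To make the metric concrete, the following minimal sketch computes I² from Cochran's Q using the standard formula I² = 100% × (Q − df) / Q, truncated at zero (Higgins et al., 2003). The function name and the input values in the usage note are hypothetical illustrations; they are not ML2 data, and this is not the ML2 authors' analysis code.

```python
def i_squared(effects, variances):
    """Return I^2 (in percent) from per-study effect sizes and sampling variances.

    Illustrative sketch only: uses fixed-effect inverse-variance weights
    to compute Cochran's Q, then I^2 = 100 * (Q - df) / Q, floored at 0.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")

    # Inverse-variance weights: more precise studies count more.
    weights = [1.0 / v for v in variances]

    # Weighted (fixed-effect) pooled estimate.
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1

    # When Q does not exceed its degrees of freedom, I^2 is set to 0.
    if q <= df:
        return 0.0
    return 100.0 * (q - df) / q
```

For instance, two hypothetical studies with effects 0.1 and 0.5 and a sampling variance of 0.01 each give Q = 8 on 1 degree of freedom, so `i_squared([0.1, 0.5], [0.01, 0.01])` is about 87.5%, whereas identical effects give 0%.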
As the ML2 authors report, 12 of 28 effects (about 43%) exhibited medium or high heterogeneity (see Higgins et al., 2003). Typically, multiple tests of the same phenomenon include different methods and procedures, and therefore heterogeneity should be expected (Higgins et al., 2003). However, considering that great efforts were made to ensure the materials and procedures for all studies in ML2 were identical, it is noteworthy that about 43% of effects still showed substantial heterogeneity. Because materials and procedures were nearly identical, the observed heterogeneity is best read as a lower bound on the heterogeneity to be expected when methods and procedures vary (McShane, Tackett, Bockenholt, & Gelman, 2018).
A compelling way to investigate the implications of this heterogeneity is to examine the range of observed effect sizes, rather than solely relying on heterogeneity statistics (Borenstein, Higgins, Rothstein, & Hedges, 2015).
When viewing Figure 2 in ML2 (p. 470), for example, it is clear that numerous effects, many of which qualified as having successfully replicated, had individual studies whose effects differed in sign. Even for topics where individual studies had entirely or nearly all the same sign, there was still notable variability in effect sizes.
This heterogeneity in effect sizes has clear implications for statistical power. For example, to detect an r = .1 effect size with 80% power, a researcher would need to have about 782 participants' data available for analysis. To detect an r = .5 effect size with 80% power, a researcher would need only about 28 participants. This effect size range was observed in several cases in ML2. This has large implications for study planning (Kenny & Judd, 2019), especially in a world where resources dedicated to the social sciences are decreasing (Lupia, 2014) and large samples are therefore more difficult to obtain.
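The sample sizes above can be approximated with a short calculation. The sketch below uses the Fisher z approximation, n ≈ ((z₁₋α/₂ + z_power) / atanh(r))² + 3, and assumes a two-tailed test at α = .05, which the text does not state. The function name is hypothetical, and because this is an approximation rather than the exact method behind the cited figures, its results land within a few participants of those numbers rather than matching them exactly.

```python
from math import atanh, ceil
from statistics import NormalDist


def n_for_correlation(r, power=0.80, alpha=0.05):
    """Approximate N needed to detect a correlation r (two-tailed test).

    Illustrative sketch using the Fisher z approximation; alpha = .05
    two-tailed is an assumption, not something stated in the commentary.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be nonzero and strictly between -1 and 1")

    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = normal.inv_cdf(power)          # quantile for the desired power

    # Fisher's z transform of r has standard error 1 / sqrt(n - 3).
    return ceil(((z_alpha + z_power) / atanh(abs(r))) ** 2 + 3)


small = n_for_correlation(0.1)  # on the order of 780 participants
large = n_for_correlation(0.5)  # on the order of 30 participants
print(small, large, round(small / large, 1))
```

The ratio between the two results is the point of the example: a fivefold difference in r translates into a required sample that differs by a factor of more than 25.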
Additionally, the substantial heterogeneity in effect sizes observed in ML2 has strong practical implications. For example, when social scientific findings are used to make policy decisions, decision-makers need to have a clear understanding of the expected magnitude of an intervention's effect, compare it to alternative interventions, and choose a path forward with limited time and resources. When significant unexplained heterogeneity is present, there is, by definition, less certainty about the consistency of any given effect. Seemingly small differences in effect sizes can translate to large differences in the real world, especially when small events are repeated and can accumulate over time (Abelson, 1985; Funder & Ozer, 2019). Thus, attempts to explain heterogeneity should be well worth the effort.

A Path Forward
A fruitful path forward is to use a perspectivist approach to psychological research (McGuire, 1989, 2004). A perspectivist approach "assumes that all hypotheses and theories are true, as all are false, depending on the perspective from which they are viewed, and that the purpose of research is to discover which are the crucial perspectives" (McGuire, 2004, p. 173). Consistent with this perspective, we argue that heterogeneity should be embraced, seen as something to be understood, and used to identify needs for theory development (also see McShane et al., 2018).
When heterogeneity is embraced, researchers can propose variables in advance that might explain effect size heterogeneity, measure them in their study, and aim to explain when effect sizes should be smaller, larger, or run in the opposite direction. For example, Goldberg and colleagues (2019a) recently replicated the same study on a highly contentious issue (climate change) across three sampling platforms (Amazon's Mechanical Turk, Prime Panels, and Facebook), and found significant heterogeneity, with effect sizes for the same manipulation ranging from d = 0.42 to d = 0.86 using a mixed design and from approximately zero to d = 0.54 using a between-subjects design. The authors explained that significant differences between samples in education, political ideology, and familiarity with the treatment message likely accounted for differences in effect sizes because more educated, liberal, and familiar participants had higher baseline levels of the dependent variable, thereby leading to ceiling effects or sensitization to the treatment (also see Chandler, Mueller, & Paolacci, 2014; Druckman & Leeper, 2012; Goldberg et al., 2019b).
Instead of concluding that between-sample differences are a product of random noise, embracing heterogeneity as something to be understood may lead to fruitful research questions. For example, although Goldberg et al. (2019a) found that demographics and familiarity with the treatment message were promising explanations for effect size heterogeneity, open questions remain as to whether participants from different sampling platforms (e.g., MTurk vs. Facebook) are different in other fundamental ways, or whether different incentives (e.g., paid vs. unpaid participation) can explain differences in effect sizes.
A similar approach can be used to understand heterogeneity in ML2 and other large-scale replication efforts. For example, although the heterogeneity (I² = 37%) for Zaval et al. (2014) on heat priming is surprising given the many null effects, an alternative explanation is that the relationship between perceptions of heat and public opinion about climate change is highly context-dependent. For instance, Hornsey, Harris, and Fielding (2018) found that the relationship between political ideology and climate change skepticism is emphasized by the political culture in the United States, but this relationship was weak in 24 other countries they examined, thereby pointing to the importance of political culture. A perspectivist approach to the research question would consider whether the findings are limited to certain cultural or political contexts. This aids theory development because it goes beyond asking whether the effect is present or absent and instead asks in which contexts, and under which historical, political, or cultural circumstances, the effect is more or less likely to emerge.
In sum, we must remain skeptical of the claim that knowing the effect being studied is necessarily more important than knowledge about the sample and setting. Heterogeneity in effect sizes across identical experiments should instead be used to inform researchers about the boundary conditions of the theories they are testing, as well as the importance of context.