Most Psychotherapy Research Probably Isn’t Reproducible (But We Can Fix That)

Samantha L. Bernecker, M.S.

May 22, 2016

Papers about reproducibility are filling journals; arguments about reproducibility ricochet through the blogosphere. Concerns about the trustworthiness of published research are not limited to psychology: they extend to the biomedical sciences (Begley & Ioannidis, 2015), political science (Esarey, Stevenson, & Wilson, 2014), and even computer science (LeVeque, Mitchell, & Stodden, 2012).

But only psychotherapy researchers have the distinction of having a mascot for our reproducibility problem: the dodo bird.

Psychotherapy researchers have known for decades that treatment trials fail to replicate upsettingly often. I believe that viewing the dodo bird verdict (i.e., that all psychotherapy treatments produce equal outcomes) through a reproducibility lens and addressing its root causes will improve the state of all psychotherapy research, not just treatment trials. It should also yield answers that satisfy all parties, so that we don’t need to rehash the pun-filled arguments of which we are so weary.

My goals for this article are twofold: first, to convince any remaining skeptics that we should be very concerned about the reproducibility of psychotherapy research, and second, to suggest some actions we can all take to fix it.

Yes, There is a Problem

Though psychotherapy research has produced some body of consistent and useful findings, I believe that a great deal of psychotherapy research would fail to replicate (if it hasn’t already). The dodo bird effect is an example of failed replication: any strong findings in favor of one treatment over another tend to be contradicted by null findings or opposite findings.

But wait, you say, I thought the dodo bird was extinct? Didn’t David Tolin kill it with a really good meta-analysis or something (2010, 2014)?

Yes, some meta-analyses find differential efficacy for some comparisons, but other meta-analyses continue to find no differences between treatments (Hofmann, Asnaani, Vonk, Sawyer, & Fang, 2012). So the meta-analyses themselves are failing to replicate, resulting in a Schrödinger’s dodo bird-type situation.

We can discuss why this happens all we like, going back and forth about sound meta-analytic methods and the inferences we can make from these meta-analyses. But regardless of our stances, we can all agree that there is evidence of a problem here. Over the course of decades, psychotherapy research has produced a body of literature so ambiguous that smart people can come to different conclusions by slicing it in slightly different ways.

Perhaps you are unconcerned that comparative outcome trials aren’t reproducible. Actually, if it were only comparative outcome trials, I wouldn’t be worried either—I would expect trials comparing multi-component treatment packages to produce unstable findings, for reasons discussed below. Accordingly, many researchers have agreed that studies of mechanisms, process, moderators, and individual treatment components are more fruitful.

Yet, there is plenty of reason to believe that these other kinds of psychotherapy studies aren’t reproducible either. First of all, it’s simply more parsimonious to assume that psychotherapy research is like research in all of psychology (and other fields) in its reproducibility problem, being subject to the same pressures.

And second, plenty of examples of failed replications beyond the dodo can be found in the psychotherapy literature. My own anecdotal experience in performing obsessive literature reviews is that lack of reproducibility is the rule rather than the exception.

For example, I was asked to write this article to accompany a paper in which my colleagues and I failed to replicate the finding that attachment style moderates the relative efficacy of cognitive-behavioral therapy and interpersonal psychotherapy (IPT; Bernecker et al., 2016). Having noted during the writing process that the extant literature was difficult to sum up, we embarked on a comprehensive review of the literature on moderators in IPT. We have discovered few consistent results, with even oft-cited findings contradicted by other investigations (Bernecker, Coyne, & Constantino, in preparation).

Another example can be found in a recent meta-analysis of cognitive bias modification (CBM; Cristea, Kok, & Cuijpers, 2015). Though I advocate testing single-ingredient interventions as one remedy to this crisis (see below), that meta-analysis elegantly shows that positive findings for CBM have been driven by publication bias and methodological artifacts, demonstrating that even narrowly-focused interventions will have the same reproducibility problems as multi-component treatments unless we alter our methods and publishing practices.

If you’re still unconvinced that you should be worried about the reproducibility of psychotherapy research, let me present the same argument I make to people who doubt climate change: given the magnitude of the danger here, isn’t it safer to assume that a problem exists and to address it? I wouldn’t want to be on the wrong side of the bet and end up with the scientific equivalent of mass extinctions and underwater cities.

If we want our work to influence practice, training, and policy, we should want to be able to draw reasonably certain conclusions when we synthesize a body of research; we shouldn’t have effects that are so fragile that they blow over in a slight breeze.

In fact, even better, wouldn’t it be nice for individual papers to be trustworthy enough that we can cite them with reasonable certainty that their conclusions are “true,” without having to wait for five or ten more studies addressing the same question?

We Need to Do Something About It—and I Believe We Can

Psychotherapy research has produced valuable findings, but the practices that have led to inconsistencies in the literature have hampered progress. Now that science has recognized the scope of the reproducibility problem and identified its causes, we have the opportunity to make changes that will accelerate research and get us answers to the questions we care about.

Two types of forces combine to produce bias in the published literature: forces that generate variability and forces that select for stronger effect sizes. Due to institutional and personal pressures, small effect sizes are filtered out at various stages of the research process, producing a literature filled with false positives. Perversely, the research practices that result in the most noise and that introduce the most bias reap rewards with publication and accolades and appear to generate the most “favorable” findings, regardless of true effect size (see figure).
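The filtering argument can be made concrete with a small simulation. The sketch below is illustrative only and is not drawn from the cited literature: the true effect (d = 0.2), the sample size (20 per group), the number of studies, and the rule "publish only significant results in the predicted direction" are all assumptions, and the critical value 2.02 approximates the two-sided .05 threshold for 38 degrees of freedom.

```python
import random
import statistics


def simulate_study(true_d, n_per_group, rng):
    """Simulate one two-arm study; return (observed Cohen's d, t statistic)."""
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    mean_diff = statistics.fmean(treated) - statistics.fmean(control)
    pooled_sd = ((statistics.variance(treated) + statistics.variance(control)) / 2) ** 0.5
    d = mean_diff / pooled_sd
    # For two equal-sized groups, t = d * sqrt(n / 2).
    t = d * (n_per_group / 2) ** 0.5
    return d, t


def run_simulation(true_d=0.2, n_per_group=20, n_studies=5000, t_crit=2.02, seed=1):
    """Return (mean d of all studies, mean d of 'published' studies, share published)."""
    rng = random.Random(seed)
    results = [simulate_study(true_d, n_per_group, rng) for _ in range(n_studies)]
    all_d = [d for d, _ in results]
    # Hypothetical filter: only significant, positive results reach the literature.
    published_d = [d for d, t in results if t > t_crit]
    return (
        statistics.fmean(all_d),
        statistics.fmean(published_d),
        len(published_d) / n_studies,
    )


if __name__ == "__main__":
    mean_all, mean_published, share = run_simulation()
    print(f"mean d, all studies:       {mean_all:.2f}")
    print(f"mean d, published studies: {mean_published:.2f}")
    print(f"share of studies published: {share:.1%}")
```

Under these assumptions, a study can clear the filter only if its observed d exceeds roughly 0.64, so the "published" average must be more than three times the true effect, even though every individual study was run honestly.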

Within labs, then, the state of the literature can be remedied:

(1) by reducing selection bias, that is, by ensuring that the effects that make their way into the literature are not “filtered” in a biased direction

(2) by reducing variability

(3) by changing the system. At a systemic level, progress toward these goals can be hastened by changing the institutional and personal incentives that favor demonstration of large effects. Though investigators are to some degree victims of this system, we are also culpable for maintaining it, and we can take responsibility for permanently changing it.

1. Reducing “Selection Bias” in Analyses and Reporting

Questionable research practices (QRPs; John, Loewenstein, & Prelec, 2012) include things like weeding out small and null results during the data collection, analysis, and reporting process. Publication bias prevents small and null effects from entering the literature. As individuals, we can take it upon ourselves to counteract both types of “filters.”

Avoid QRPs, increase transparency, and consider preregistration

The actual behaviors we need to change have been enumerated eloquently elsewhere. Two of my favorite references are Gelman and Loken (2013) and Simmons, Nelson, and Simonsohn (2011).

In short: in confirmatory studies, don’t fish, and don’t allow your data to influence your methods. In exploratory studies, report everything, including the results of different possible analytic strategies.
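A rough simulation shows why fishing matters even when no treatment effect exists. This is a sketch under stated assumptions, not a reconstruction of the analyses in Simmons et al. (2011) or Gelman and Loken (2013): it assumes five uncorrelated outcome measures, 30 patients per arm, a true effect of zero, and a critical value of 2.00 approximating the two-sided .05 threshold for 58 degrees of freedom.

```python
import random
import statistics

T_CRIT = 2.00  # approximate two-sided .05 critical value for df = 58 (assumption)


def t_statistic(group_a, group_b):
    """Pooled-variance t statistic for two groups of equal size."""
    n = len(group_a)
    pooled_var = (statistics.variance(group_a) + statistics.variance(group_b)) / 2
    mean_diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    return mean_diff / (pooled_var * 2 / n) ** 0.5


def null_study(n_outcomes, n_per_group, rng):
    """Return one t statistic per outcome when the true effect is zero."""
    return [
        t_statistic(
            [rng.gauss(0, 1) for _ in range(n_per_group)],
            [rng.gauss(0, 1) for _ in range(n_per_group)],
        )
        for _ in range(n_outcomes)
    ]


def false_positive_rates(n_studies=2000, n_outcomes=5, n_per_group=30, seed=7):
    """Return (rate with one prespecified outcome, rate when any outcome may be reported)."""
    rng = random.Random(seed)
    prespecified = 0
    fished = 0
    for _ in range(n_studies):
        ts = null_study(n_outcomes, n_per_group, rng)
        # Confirmatory approach: only the first, prespecified outcome counts.
        prespecified += abs(ts[0]) > T_CRIT
        # Fishing: report whichever outcome happens to reach significance.
        fished += any(abs(t) > T_CRIT for t in ts)
    return prespecified / n_studies, fished / n_studies


if __name__ == "__main__":
    pre, fish = false_positive_rates()
    print(f"false positives, prespecified outcome: {pre:.1%}")
    print(f"false positives, best of five outcomes: {fish:.1%}")
```

With independent outcomes, the expected rate under fishing is 1 − 0.95^5, about 23%, versus 5% for the single prespecified outcome. Correlated outcomes, as in real trials, would inflate the rate less, but it would still exceed the nominal level.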

Publish everything

Even if we are scrupulous in reporting a “representative sample” of the possible analyses from a data set rather than a biased sample, the literature will remain biased if the findings never see publication. It appears that the problem is not that studies with weak findings are getting rejected from journals; rather, we are not writing them up at all (Franco, Malhotra, & Simonovits, 2014).

It is our responsibility to try to publish everything. Start trying with your favorite journal; you may be surprised at how welcoming they are of null findings, especially if you invoke the cause of reproducibility (the editors and reviewers at Psychotherapy never turned up their noses at our recent paper). If that doesn’t work, try a journal like PLoS ONE that selects articles based on methodological soundness rather than findings (and still has a high impact factor!), and if all else fails, put your paper online in a public repository (Nosek, Spies, & Motyl, 2012). There’s even a journal specifically for publishing null results (www.jasnh.com).

Commit to educating yourself and those around you

I believe that the vast majority of scientists intend to present “real” results, not fraudulent ones, and when we engage in QRPs, it is without understanding the gravity or the impact of our actions (John et al., 2012). Therefore, faculty should not only make sure that they are well-versed in the nuances of avoiding QRPs, but also should commit to implementing required graduate (and undergraduate) training during methods, statistics, and ethics courses. Many graduate students to whom I speak appear only vaguely acquainted with reproducibility issues and have themselves engaged in QRPs, sometimes with the blessings of mentors. Without in-depth training, we risk the next generation perpetuating the problem.

2. Reducing Variability

We can also vastly increase the reproducibility of research by improving the signal-to-noise ratio—and in the process, we can make our research more clinically applicable and easier to disseminate.

In experimental studies, test laser-focused single-component manipulations

Multi-component interventions enable clinicians to flexibly deliver different doses of components to each patient (the responsiveness problem; Stiles, 2009). This probably benefits the patient, but it also washes out effects and makes it impossible to tell what treatment components are accounting for improvement. Tests of single-component interventions (or unitary alterations to existing treatments) will vastly improve precision and reproducibility.

Though these single-component interventions or augmentations should be applied consistently, they needn’t be “rigid” in the sense of being given to patients for whom they are not appropriate; rather, patients for whom they are theoretically appropriate can be identified in advance. Indeed, a happy side effect of this single-component approach is that it will enable flexible combining of ingredients to match each patient’s unique presentation.

Think of yourself as manualizing high-quality training rather than manualizing the intervention ingredients

Implementing evidence-based training methods such as behavior modeling training (Hill & Lent, 2006; Taylor, Russ-Eft, & Chan, 2005) will enhance clinicians’ ability to accurately reproduce the intended behaviors. And subsequent replication and dissemination will be effortless if we thoroughly specify the materials and exercises used to train study clinicians.

In both experimental and observational studies, use focused and reliable measures

I am particularly enamored of the idea of measuring individual target symptoms or mediators rather than aggregate assessments of “disordered-ness” or diagnostic status; this will reduce error and will clarify which elements of the patient’s presentation are changed by the intervention. Moreover, it is more consistent with modern conceptualizations of mental illness (Borsboom & Cramer, 2013), as well as with clinical practice, given that patients often want to focus on specific complaints.
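The link between measurement error and weaker observed effects can be sketched with a short simulation. The numbers here are assumptions for illustration (a true standardized effect of 0.5 and reliabilities of 1.0, 0.9, and 0.5), and the model is the classical one in which an observed score is a true score plus independent random error.

```python
import random
import statistics


def observed_effect(true_d, reliability, n_per_group=20000, seed=3):
    """Observed standardized effect when the outcome is measured with error.

    Within each group, true scores have variance 1. The error variance is set
    so that var(true) / (var(true) + var(error)) equals `reliability`.
    """
    rng = random.Random(seed)
    error_sd = ((1 - reliability) / reliability) ** 0.5

    def scores(group_mean):
        return [
            rng.gauss(group_mean, 1.0) + rng.gauss(0.0, error_sd)
            for _ in range(n_per_group)
        ]

    treated = scores(true_d)
    control = scores(0.0)
    pooled_sd = ((statistics.variance(treated) + statistics.variance(control)) / 2) ** 0.5
    return (statistics.fmean(treated) - statistics.fmean(control)) / pooled_sd


if __name__ == "__main__":
    for reliability in (1.0, 0.9, 0.5):
        d = observed_effect(0.5, reliability)
        print(f"reliability {reliability:.1f}: observed d = {d:.2f}")
```

In this model the observed effect shrinks toward the true effect multiplied by the square root of the reliability, so a measure with reliability 0.5 would show an effect of about 0.35 where a perfectly reliable measure would show 0.5.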

3. Changing the System

As consumers of research and as reviewers, we are part of the system of incentives, so we can (and have begun to!) change the pressures that shift research toward false positives and noisy methods.

As a reviewer, “gatekeep” based on methods, not findings

Our recommendations regarding whether to publish an article should not be based on its results or headline-grabbing potential, but on the soundness of its methods. We should also make it clear to authors that they need to report all outcomes and analytic decisions, and that they will not be penalized for doing so (Simmons et al., 2011).

As a consumer, reward practices that promote reproducibility

We also have power as consumers of research. We should refuse to cite articles that show signs of selective reporting, other kinds of “p-hacking,” or methods that introduce lots of noise. We should instead cite, blog and tweet about:

replications and comments,
work that is methodologically sound, but has mixed or null findings,
and work that has clearly made an effort to reduce noise and has ended up with small effects as a consequence, thus personally rewarding the researchers who engage in this work and demonstrating to journals that this is what gets cited and is thus worth publishing.

Concluding Remarks

I hope that I have galvanized you to take action.

If so, please avail yourself of the references cited here; my thoughts are indebted to many people wiser than I am who have written about the replication crisis (and dodo bird-related issues) and who present more thorough recommendations.

The November 2012 special section of Perspectives on Psychological Science is also an excellent resource (Pashler & Wagenmakers, 2012).

Go forth and promote reproducibility!

About the Author

Samantha L. Bernecker, M.S.

Sam earned her PhD in Clinical Psychology from the University of Massachusetts Amherst in 2017, where she worked under Dr. Michael Constantino. After completing a postdoctoral fellowship at Harvard University with Drs. Ron Kessler, Matt Nock, Pete Gutierrez, and Thomas Joiner, she joined the Boston Consulting Group as a management consultant. She tries to keep a hand in the psychotherapy research world and currently advises several effective altruism organizations that investigate cost-effective mental health interventions.

Citation

Bernecker, S. (2016, May). Most psychotherapy research probably isn't reproducible (but we can fix that). [Web article]. Retrieved from: https://societyforpsychotherapy.org/most-psychotherapy-research-probably-isnt-reproducible

References

Begley, C. G., & Ioannidis, J. P. A. (2015). Reproducibility in science: Improving the standard for basic and preclinical research. Circulation Research, 116, 116-126. doi: 10.1161/CIRCRESAHA.114.303819

Bernecker, S. L., Constantino, M. J., Atkinson, L. R., Bagby, R. M., Ravitz, P., & McBride, C. (2016). Attachment style as a moderating influence on the efficacy of cognitive-behavioral and interpersonal psychotherapy for depression: A failure to replicate. Psychotherapy, 53(1), 22-33. doi: 10.1037/pst0000036

Bernecker, S. L., Coyne, A., & Constantino, M. J. (in preparation). For whom does interpersonal psychotherapy work? A systematic review of moderators of IPT’s efficacy.

Borsboom, D., & Cramer, A. O. J. (2013). Network analysis: An integrative approach to the structure of psychopathology. Annual Review of Clinical Psychology, 9, 91-121. doi: 10.1146/annurev-clinpsy-050212-185608

Cristea, I. A., Kok, R. N., & Cuijpers, P. (2015). Efficacy of cognitive bias modification interventions in anxiety and depression: Meta-analysis. The British Journal of Psychiatry, 206(1), 7-16. doi: 10.1192/bjp.bp.114.146761

Esarey, J., Stevenson, R. T., & Wilson, R. K. (Eds.). (2014). Special issue on replication. [Special issue]. The Political Methodologist, 22(1).

Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502-1505. doi: 10.1126/science.1255484

Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Retrieved from http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf

Hill, C. E., & Lent, R. W. (2006). A narrative and meta-analytic review of helping skills training: Time to revive a dormant area of inquiry. Psychotherapy, 43(2), 154–172. doi: 10.1037/0033-3204.43.2.154

Hofmann, S. G., Asnaani, A., Vonk, I. J. J., Sawyer, A. T., & Fang, A. (2012). The efficacy of cognitive behavioral therapy: A review of meta-analyses. Cognitive Therapy and Research, 36(5), 427–440. doi: 10.1007/s10608-012-9476-1

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. doi: 10.1177/0956797611430953

LeVeque, R. J., Mitchell, I. M., & Stodden, V. (2012). Reproducible research for scientific computing: Tools and strategies for changing the culture. Computing in Science and Engineering, 14, 13-17.

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615-631. doi: 10.1177/1745691612459058

Pashler, H., & Wagenmakers, E.-J. (Eds.). (2012). Replicability in psychological science: A crisis of confidence? [Special issue]. Perspectives on Psychological Science, 7(6), 528-654.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Stiles, W. B. (2009). Responsiveness as an obstacle for psychotherapy outcome research: It's worse than you think. Clinical Psychology: Science and Practice, 16(1), 86-91. doi: 10.1111/j.1468-2850.2009.01148.x

Taylor, P. J., Russ-Eft, D. F., & Chan, D. W. L. (2005). A meta-analytic review of behavior modeling training. Journal of Applied Psychology, 90(4), 692–709. doi: 10.1037/0021-9010.90.4.692

Tolin, D. F. (2010). Is cognitive-behavioral therapy more effective than other therapies? A meta-analytic review. Clinical Psychology Review, 30(6), 710-720. doi: 10.1016/j.cpr.2010.05.003

Tolin, D. F. (2014). Beating a dead dodo bird: Looking at signal vs. noise in cognitive-behavioral therapy for anxiety disorders. Clinical Psychology: Science and Practice, 21(4), 351–362. doi: 10.1111/cpsp.12080