Are you any good as a therapist? Overall, therapists seem to be quite a confident group. A study by Walfish, McAllister, O’Donnell, and Lambert (2012) asked 129 therapists to compare their psychotherapy results to those of their peers. They found that 25% of the therapists estimated that their results were in the upper 10% of all therapists, and not a single one estimated that their results were below average. Obviously these are statistical impossibilities, suggesting that therapists are generally overconfident.
The December 2015 issue of Psychotherapy was a special issue devoted to progress monitoring and feedback in mental health care. As Wampold (2015) reiterated, Clement (1994) asked therapists, “Are you any good?” Clement suggested that therapists had a professional and ethical obligation to systematically evaluate their own outcomes. To his credit, Clement followed his own suggestion—he evaluated himself over his entire career (683 cases).
Decades of psychotherapy research have shown that some therapists perform exceptionally well while others do not (Wampold & Brown, 2005). Do years of experience matter? Likely not. Among other things, Clement (1994) learned from his cases that his results had not improved with experience, and additional evidence suggests that experience does not reliably improve outcomes (e.g., Minami et al., 2009).
Is it the kind of therapy that matters, then? Very unlikely. The specific method of therapy seems to contribute very little to clients’ progress (Wampold & Imel, 2015). Therefore, a therapist telling you that they practice “cognitive behavioral therapy (CBT)” provides no assurance that they are any good.
To assess a therapist’s effectiveness, it is necessary to have a well-developed methodology that addresses multiple measurement challenges such as use of different questionnaires, differences in case mix, and small sample sizes. We and our collaborators have attempted to address these over the years using data from thousands of therapists, culminating in a series of articles (e.g., Minami, Wampold, Serlin, Kircher, & Brown, 2007; Minami, Serlin, Wampold, Kircher, & Brown, 2008a; Minami et al., 2008b; Minami et al., 2009; Minami, Brown, McCulloch, & Bolstrom, 2012; Brown, Simon, Cameron, & Minami, 2015).
The purpose of this brief report is to present the newest results of applying this methodology to the dataset from the ACORN collaboration, which includes over 4000 therapists working in diverse clinical settings treating a wide variety of presenting problems.
Therapists and Clients
We selected therapists who had:
- At least 5 cases with 2 or more assessments, and
- Clients with intake scores that were severe enough to be in a clinical range.
A total of 2,820 therapists met this threshold, with a combined sample size of 162,168 cases.
As is typical of outpatient mental health care, just over 60% of the clients were female and roughly 1/3 of cases were youth under the age of 18. No other demographic variables were available for the entire database due to idiosyncratic reporting requirements across agencies.
While most of the questionnaires were developed as part of the ACORN collaboration, a number of sites used other well-developed measures such as the OQ-45, BASIS-32, PHQ-9, GAD-7, and the ORS. These questionnaires demonstrated very similar factor structures and produced comparable effect sizes.
Factor analyses confirmed that all questionnaires had items that loaded heavily on a common factor, which we refer to as global distress. For these reasons, we were comfortable combining the results across different questionnaires and had confidence that the results would hold with other well-developed questionnaires that are used in measuring psychotherapy.
The methodology has been described at length in prior articles, and thus what follows is a brief summary. The difference between posttreatment and pretreatment (i.e., change score) is standardized to an effect size (Cohen’s d) simply by dividing the change score by the standard deviation of the questionnaire at intake (Cohen, 1988; Becker, 1988).
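As a minimal sketch of this calculation (illustrative function and variable names, not the implementation used in the articles), the pre-post effect size divides the average change by the standard deviation of scores at intake:

```python
def pre_post_effect_size(pre_scores, post_scores):
    """Cohen's d for pre-post change, standardized by the intake SD.

    Assumes higher scores indicate more distress, so improvement
    is pretreatment minus posttreatment.
    """
    n = len(pre_scores)
    mean_pre = sum(pre_scores) / n
    mean_post = sum(post_scores) / n
    # Sample standard deviation of the pretreatment (intake) scores
    var_pre = sum((x - mean_pre) ** 2 for x in pre_scores) / (n - 1)
    sd_pre = var_pre ** 0.5
    return (mean_pre - mean_post) / sd_pre
```

For example, if five clients average a 2-point drop in distress and the intake standard deviation is about 3.16, the effect size is roughly d = 0.63.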
However, it is important to bear in mind that the magnitude of the change score is highly dependent on the pretreatment score. Clients with high levels of distress will tend to report much more change than those with very little distress at intake. For this reason, clients with intake scores that are not considered within a clinical range are commonly excluded from analyses, as we did here. This cutoff also has the advantage of selecting a sample that is more comparable to those found in clinical trials for psychotherapy for various disorders (Minami et al., 2007, 2008b).
The methodology for benchmarking therapists was extended in two ways (Minami et al., 2012):
1. Rather than using raw effect sizes, severity adjusted effect sizes (SAES) were calculated for each case. This approach had the advantage of controlling for differences in case mix (e.g., diagnoses, intake scores) from one therapist to another.
2. The use of hierarchical linear modeling (HLM; also known as multilevel modeling) allowed for estimation of therapists’ performance under the reality that therapists in the database are only a fraction of the therapists out in the world and that their clients also are likely a fraction of their cases. This method thus leads to a more conservative yet reliable estimate of the therapists’ performance, as will be illustrated below.
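The stabilizing effect of HLM can be illustrated with a simplified empirical-Bayes shrinkage formula (a sketch of the general idea, not the exact model fit in the articles; the variance values below are made up for illustration). Each therapist's raw mean effect size is pulled toward the grand mean, and the pull is strongest when the caseload is small:

```python
def shrunken_estimate(raw_mean, n_cases, grand_mean, tau2, sigma2):
    """Empirical-Bayes shrinkage of a therapist's raw mean effect size.

    tau2:   between-therapist variance of true effect sizes (assumed known)
    sigma2: within-therapist, case-level variance (assumed known)
    """
    # Reliability of the raw mean: approaches 1 as the caseload grows
    reliability = tau2 / (tau2 + sigma2 / n_cases)
    # Weighted compromise between the therapist's own data and the grand mean
    return grand_mean + reliability * (raw_mean - grand_mean)

# Hypothetical values: a therapist with a raw mean of 1.5 against a
# grand mean of 0.83 is shrunk heavily with 5 cases, only slightly with 100.
print(shrunken_estimate(1.5, 5, grand_mean=0.83, tau2=0.04, sigma2=1.0))
print(shrunken_estimate(1.5, 100, grand_mean=0.83, tau2=0.04, sigma2=1.0))
```

With only 5 cases the estimate lands near 0.94; with 100 cases it stays near 1.37. This is why, in the table below, the HLM estimates are far less extreme than the raw averages for low-caseload therapists.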
The following table presents the mean and distribution of therapists’ performance in both raw average severity adjusted effect sizes (SAES) and hierarchical linear modeling (HLM)-estimated average SAES.
The table breaks out the results by the number of clients that the therapists have in the database in order to demonstrate the benefits of using HLM to estimate therapists’ effect sizes.
Table 1: Distribution of therapist effect size as a function of sample size
As is apparent, raw SAES calculations show greater spread in performance among therapists when the number of clients per therapist is low, which calls the reliability of those estimates into question. With the same data, HLM produces estimates that are far more consistent regardless of the number of clients. It is theoretically and statistically very unlikely that therapists with fewer clients in the database differ greatly from therapists with more clients. Therefore, we believe that HLM produces more reliable estimates of therapists’ performance.
So, how should the HLM results be interpreted?
To explain this, we will look at the most reliable of the HLM-estimated SAESs: those of therapists with over 100 cases. The mean effect size of d = 0.83 indicates that, on average, a client has roughly an 80% chance that his or her symptoms at the end of treatment will be better than the average client’s symptoms at the beginning of treatment.
Similarly, the effect size of the lowest-performing therapist, d = 0.25, indicates that a client has only a 60% chance of ending treatment with symptoms better than the average client’s symptoms at intake. Not much improvement over a 50/50 chance.
On the other hand, the effect size of the highest-performing therapist, d = 1.64, indicates that a client has roughly a 95% chance of ending treatment with symptoms better than the average client’s symptoms at intake.
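These percentages follow from the standard normal distribution: assuming normally distributed scores, the chance that a randomly chosen treated client ends treatment with less distress than the average client had at intake is Φ(d), the normal cumulative distribution evaluated at the effect size. A quick sketch:

```python
from math import erf, sqrt

def prob_better_than_intake_mean(d):
    """P(a treated client's posttreatment score beats the average
    intake score) under a normal model: Phi(d)."""
    return 0.5 * (1 + erf(d / sqrt(2)))

for d in (0.25, 0.83, 1.64):
    print(f"d = {d:.2f}: {prob_better_than_intake_mean(d):.0%}")
# d = 0.25: 60%
# d = 0.83: 80%
# d = 1.64: 95%
```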
If you were a client, which therapist would you choose?
The story gets worse. At least one study estimated how much clients change in a wait-list condition and found an effect size of d = 0.15 (Minami et al., 2007). This corresponds to a 56% chance of getting better without any treatment, only 4 percentage points less than being treated by the lowest-performing therapist.
Further, as the number of assessments suggests, it is highly unlikely that the better performance of therapists with a large number of clients in the database was attained by using more sessions.
Therefore, if one assumes that the rate of improvement per session is constant, the lowest-performing therapist (d = 0.25) would require more than 3 times as many sessions as the average therapist (0.83/0.25 ≈ 3.3) and as many as 7 times the sessions of the highest-performing therapist (1.64/0.25 ≈ 6.6).
The overall result shows a robust estimate of performance for a large group of therapists. These results also illustrate the pitfalls of trying to evaluate therapist effectiveness using simple averages rather than employing HLM. Finally, the table provides some guidance for therapists trying to interpret their own performance, provided they measure their outcomes.
It goes without saying that benchmarking performance is not an exact science, and there are countless reasons why therapists’ performances differ. Moreover, the results cannot fully explain the performance of any particular therapist with a particular set of clients. Regardless, we are confident that measuring performance, imperfections and all, is better than not measuring it.
As this sample demonstrates, there are substantial differences in therapists’ performance. Fortunately, most therapists produce results comparable to what we would expect from well-conducted clinical trials, an effect size of approximately d = .80. About 25% have results that exceed this benchmark by 10% or more (d ≥ .90), while about 25% fall short by a similar amount (d < .68).
The question “Are you any good?” only makes sense in the context of “Compared to what?” Benchmarking provides therapists with a way to answer this question and to evaluate if their performance is improving over time.
Wampold (2015), summarizing the evidence from the articles in the special issue, concluded that sufficient evidence exists to adopt one of the feedback systems described, or something similar. We wholeheartedly agree and assert, as did Clement (1994), that it is unethical not to monitor progress and provide feedback on performance.
After all, without measuring your outcomes, you have no idea if you’re any good.