The over- and underutilization of ANCOVA

After completing several statistics courses I lived under the illusion that I knew the ins and outs of what I took to be basic statistical analyses. During this course, however, I saw pitfalls in almost all of them and came to realize that applying statistical procedures is not as straightforward as I once thought. One of the most striking examples is the analysis of covariance (ANCOVA): a widely used procedure, seemingly a way to “control” for confounds. I was always impressed by this procedure, until I found out there is a lot more to it than just “controlling” for confounds.

The analysis of covariance (ANCOVA) was developed as an extension of the analysis of variance (ANOVA) to increase statistical power (Porter & Raudenbush, 1987). By including covariates, the variance associated with those covariates is “removed” from the dependent variable (Field, 2009). From the point of view of the manipulation, the error variance in the dependent variable is thereby reduced, and hence statistical power increases (see Figure 1). Given that psychological research is often underpowered (Cohen, 1990), ANCOVA is an important statistical procedure for revealing psychological phenomena and effects.

[Figure 1]
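The power gain in a randomized design can be sketched in a small simulation. This is a minimal illustration, not from the post itself, and all numbers (sample size, effect sizes, error variance) are assumed: adding the covariate to the model absorbs covariate-related variance, shrinking the residual (error) variance against which the group effect is tested.

```python
# Sketch: in a randomized experiment, a covariate reduces error variance.
# All parameters below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200
group = np.repeat([0, 1], n // 2)           # random assignment to two groups
covariate = rng.normal(size=n)              # e.g., a baseline measure
y = 0.5 * group + 1.0 * covariate + rng.normal(size=n)

def residual_ss(X, y):
    """Residual sum of squares of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

intercept = np.ones(n)
rss_anova = residual_ss(np.column_stack([intercept, group]), y)
rss_ancova = residual_ss(np.column_stack([intercept, group, covariate]), y)

# The covariate absorbs part of the error variance, so the ANCOVA
# residual variance is clearly smaller -> more power for the group effect.
print(rss_anova / n, rss_ancova / n)
```

With these assumed values the covariate accounts for roughly half of the within-group variance, which is exactly the “removal” of error variance described above.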

This promising application of ANCOVA, however, only holds when there is no systematic relationship between the grouping variable and the covariate, i.e., the groups must not differ systematically on the covariate. This is an assumption that many researchers today fail to check. As a result, ANCOVA is widely misunderstood and misused (Miller & Chapman, 2001).

The importance of this assumption is illustrated in Figure 2. When group and covariate are related, removing the variance associated with the covariate also alters the grouping variable itself. In other words, what remains of the grouping variable after the covariate-related variance has been removed has poor construct validity, and the results are therefore uninterpretable.

[Figure 2]

The general point is that the legitimacy of ANCOVA depends on the relationship between the grouping variable and the covariate. ANCOVA is justified only when there is no systematic relationship between these variables.

On the one hand, it is quite straightforward to defend this judgement in a randomized experiment: given random assignment, individual characteristics are equally distributed across the groups, so group means should not differ except by chance (see left panel of Figure 3). As a result, including a covariate in a randomized experiment increases statistical power. In this sense, ANCOVA is underutilized. On the other hand, when studying pre-existing groups (i.e., non-random assignment), individual characteristics are not evenly distributed across groups, and hence a relationship between group and covariate can exist. Thus, including a covariate in a non-randomized experiment may alter the grouping variable and lead to unfounded interpretations and conclusions. In this sense, ANCOVA is overutilized.
It is worrisome that ANCOVA is applied more often in non-randomized experiments than in randomized experiments (Keselman et al., 1998). The idea is that researchers want to “control” for pre-existing differences. This idea is incorrect: there simply is no statistical way to “control” for these pre-existing differences. ANCOVA “removes” the variance that is associated with the covariate, but it does not “control” for the covariate.
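The problem with pre-existing groups can be made concrete in a small simulation. Again this is an assumed, illustrative setup, not from the post: the groups differ systematically on the covariate, and the covariate fully drives the outcome, so there is no direct group effect at all.

```python
# Sketch: with non-random groups, "removing" the covariate also removes
# part of the group construct. All numbers are assumed for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 200
group = np.repeat([0, 1], n // 2)
# Non-random assignment: the groups differ systematically on the covariate.
covariate = rng.normal(loc=group * 1.5, scale=1.0)
y = 1.0 * covariate + rng.normal(size=n)    # no direct group effect at all

X_raw = np.column_stack([np.ones(n), group])
X_adj = np.column_stack([np.ones(n), group, covariate])

b_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)
b_adj, *_ = np.linalg.lstsq(X_adj, y, rcond=None)

# The raw group difference is large (it is really the covariate difference);
# the adjusted group coefficient collapses towards zero. Neither number
# describes the original groups as they actually exist.
print(b_raw[1], b_adj[1])
```

The adjusted coefficient is not an estimate of “the group effect with the pre-existing difference controlled for”; it is the effect of whatever remains of the grouping variable after its overlap with the covariate has been stripped away.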

[Figure 3]
We, as researchers, should acknowledge what ANCOVA cannot do (it cannot “control” for pre-existing differences) as well as what it can do (it can increase statistical power). That way we can eliminate unfounded conclusions that result from its misapplication and, most importantly, exploit its real strength: increased statistical power. Used this way, ANCOVA can help reveal real psychological phenomena and effects.

For a good overview of this problem, consult Miller and Chapman (2001). The paper that originally introduced it gives a good example of how the inclusion of a covariate can lead to incorrect conclusions (Lord, 1967).

Regression towards the Mean

If something varies normally between two far extremes, it usually swings back naturally to values in between.

Once upon a time, there was a young man named Francis Galton, who was interested in heredity. One day, when he was looking at the heights of children, he stumbled upon an unexpected consistency between the heights of children and their parents. It turned out that tall parents have tall children, but that these children are, on average, not as tall as their parents. More specifically, the parents deviated more from the average height of all parents than their children did from the average height of all children. The same holds for short parents and their children.

Galton named this effect “Regression towards Mediocrity in Hereditary Stature”, which is now known as “Regression towards the Mean” (RTM) (Bland and Altman, 1994).

RTM occurs whenever the correlation (r) between the outcome measure (y) and the predictor (x) is smaller than one. RTM refers to the finding that, for a given value of x, the predicted value of y is always fewer standard deviations from its mean than x is from its mean.

In Galton’s original study, the outcome measure (y) was the height of the children and the predictor (x) was the height of the parents. For example, if the correlation between the heights of children and parents is .5, then the predicted height of the children will be only half as many standard deviations from the children’s mean as the parents’ height is from the parents’ mean.
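The prediction rule behind this example can be written out in a couple of lines. The numbers are assumed for illustration: in standardized (z-score) units, the predicted outcome is simply r times the predictor, so whenever r < 1 the prediction lies closer to the mean.

```python
# Sketch of the RTM prediction rule (values assumed for illustration):
# predicted z-score of the outcome = r * z-score of the predictor.
r = 0.5                      # correlation between parent and child height
z_parent = 2.0               # parents are 2 SDs above the parental mean
z_child_pred = r * z_parent  # predicted child height in child SD units
print(z_child_pred)          # 1.0 -> only 1 SD above the children's mean
```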


So, what causes regression towards the mean?

First, as we have already stated, RTM will occur when the correlation between predictor and outcome variable is smaller than one. Furthermore, the smaller the correlation, the greater the RTM effect.

Second, the population standard deviation influences the effect of RTM: the bigger the population standard deviation, the bigger the effect.

Third, the effect depends on the cut-off score that is used. For example, in Galton’s data, it matters whether you consider “tall” to mean heights above 1.70 metres or heights above 2.00 metres. The more extreme the cut-off score, the bigger the RTM effect.

The first cause implies that RTM is absent only when there is a perfect correlation between predictor and outcome (r = 1). In practice this never happens, due to unsystematic error. According to classical test theory, an observed score consists of two components: the true score and error (X = T + E). This means that alongside the true score we also measure unsystematic error that is not correlated with anything, so a perfect correlation can never be obtained. In other words: “RTM will occur in any measurement (biological, psychometric, anthropometric, etc.) that is observed with error” (Barnett, van der Pols, and Dobson, 2005).
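A small simulation under classical test theory makes both this point and the cut-off point visible. The setup is assumed for illustration (equal true-score and error variances, so the test–retest correlation is .5): the same people are “measured” twice, a group is selected for being extreme on the first measurement, and their second measurement regresses towards the mean, more strongly for a more extreme cut-off.

```python
# Sketch: RTM from X = T + E. Variances and cut-offs assumed for
# illustration; each measurement is the same true score plus fresh error.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
true = rng.normal(size=n)                  # T, the true scores
x1 = true + rng.normal(size=n)             # first measurement, with error
x2 = true + rng.normal(size=n)             # second measurement, new error

top = x1 > 1.5                             # select an extreme group on x1
print(x1[top].mean(), x2[top].mean())      # second mean is closer to zero

# A more extreme cut-off gives a bigger RTM effect:
extreme = x1 > 2.5
drop_mild = x1[top].mean() - x2[top].mean()
drop_extreme = x1[extreme].mean() - x2[extreme].mean()
print(drop_mild, drop_extreme)
```

Note that nothing changed between the two measurements; the apparent “improvement” of the extreme group is produced entirely by the uncorrelated error terms.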

And, what kind of research is at risk?

RTM occurs in two kinds of research. First, research is prone to RTM when one variable is measured on two occasions. One can think of experiments in which change over time is measured (e.g., the effectiveness of a treatment for depression).

Second, research that measures two variables on one occasion and regresses one variable on the other (e.g., Galton’s study, with the heights of children regressed on the heights of parents).

How cautious should we be?

“One should assume that regression towards the mean has taken place unless the data show otherwise” (Barnett, van der Pols, and Dobson, 2005).

This quote shows that we have to be very cautious with respect to RTM. We hope we have given a good overview of the origins of the problem. Now that you know what RTM is, and how it occurs, there are multiple ways to overcome RTM. So don’t stress, just be alert!

Riet and Tessa


Barnett, A. G., van der Pols, J. C., & Dobson, A. J. (2005). Regression to the mean: What it is and how to deal with it. International Journal of Epidemiology, 34, 215-220.

Bland, J. M., & Altman, D. G. (1994). Some examples of regression towards the mean. British Medical Journal, 309, 780.

Is psychology in crisis? A heated debate

There are many controversies concerning scientific conduct. To spark the debate, three statements were presented and different arguments were given for each. The result was a heated debate, which I will briefly present to you:

Statement 1: A researcher is ethically obliged to “snoop around” in his/her data.
Although controversial, consensus was reached easily: exploratory research (i.e., “snooping around”) is permitted, as long as it is clearly documented as being exploratory.

Statement 2: Recent replication studies that failed to find the original result show that important scientific results are incredible.
Replication seems to be the ultimate way to check the robustness of a result. However, it is very important that the replicating study is an exact replication of the method used in the original study. And although a failed replication may call the scientific claim into question, it improves science as a whole.

Statement 3: The paper of Bem (2011) should never have been published in JPSP.
In short, the article of Bem (2011) concerns ‘psi’, which denotes anomalous processes of information or energy transfer that are currently unexplained in terms of known physical or biological processes.
The main objections against the article, and thus in favour of the statement, concern the methodology of the experiments: procedures were changed during the experiments, different transformations and measurements were used, and no corrections were made for multiple testing (Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011).
Soon it was agreed that the methodology of Bem’s article was flawed. The question remained, however, whether publishing it was the right thing to do. On the one hand, it was argued that science should be self-correcting rather than excluding experiments in advance: through replication and discussion, the robustness of an effect is put to the test. Furthermore, Bem’s article showed how “statistics can mislead and be misused”. It was necessary to spark the debate about the rigor of scientific conduct in psychology, and it can be argued that psychology is better off after this publication. On the other hand, it was argued that it is harmful to the field of psychology, and to science as a whole, when a paper with so many methodological flaws gets published, and that its publication was not needed to bring about a shift in science.

Time ran out and no consensus was reached: should Bem’s article have been published (in order to spark the debate about scientific conduct)? In the end, however, all agreed that the more critical attitude towards science is desirable.

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407-425.
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103, 933-948.
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551-556.
Wagenmakers, E-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100, 426-432.