The independent t-test—violations of assumptions

Some methodologists complain that researchers do not check assumptions before running a test. But checking does not always help much: when there are no clear standards for how to test an assumption, or for how robust a test is against its violation, results can still be interpreted badly.

In the case of the independent t-test, it is often assumed that the test is robust against violation of the normality assumption, at least when both groups have sample sizes of at least 30. But it is becoming clear that even small deviations from normality may lead to very low power (Wilcox, 2011). So when groups do differ, t-tests often fail to detect it. This is especially true for mixed normal distributions, which have a relatively large variance. At the left side of figure 1, two normal distributions are compared and the independent t-test has a power of .90; at the right side of this figure, two mixed normal distributions with variance 10.9 are compared, but here the power is only .28. In such a case a researcher might conclude from a plot that the normality assumption is not violated. The t-test might then fail to reject the null hypothesis of equal means, leading to the conclusion that the groups do not differ. What is missed is that the distribution is actually a mixed normal distribution, and that the difference went undetected because of low power.


Figure 1. Left: two normal distributions whose means differ by 1; the power of the independent t-test is .90. Right: two mixed normal distributions with variance 10.9 whose means differ by 1; the power is now only .28.
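A quick Monte Carlo sketch of this effect follows. The sample size and replication count below are chosen for illustration (they are not taken from Wilcox), and the mixed normal is the classic contaminated normal: 90% N(0, 1) plus 10% N(0, 10), which has variance 10.9.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def mixed_normal(n):
    """90% N(0,1) plus 10% N(0,10): a mixed normal with variance 10.9."""
    outlier = rng.random(n) < 0.10
    return np.where(outlier, rng.normal(0.0, 10.0, n), rng.normal(0.0, 1.0, n))

def power(sampler, n=25, shift=1.0, reps=2000, alpha=0.05):
    """Fraction of replications in which the t-test detects a mean shift of 1."""
    rejections = 0
    for _ in range(reps):
        a, b = sampler(n), sampler(n) + shift
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / reps

power_normal = power(lambda n: rng.normal(0.0, 1.0, n))
power_mixed = power(mixed_normal)
```

Although the mean difference is identical in both settings, the power under the mixed normal collapses, because the inflated variance swamps the effect.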

The main problem with the mixed normal distribution is its heavy tails: it produces relatively many outliers. One solution is to use a 20% trimmed mean, which means that the highest twenty percent of the values is discarded, as is the lowest twenty percent. Yuen’s method combines this trimmed mean with the Winsorized variance (Wilcox, 2011). This is the variance obtained after the highest twenty percent of values is replaced by the highest remaining value, and the lowest twenty percent by the lowest remaining value. Yuen’s method performs better in terms of power, control of the Type I error rate, and accuracy of confidence intervals.
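The two ingredients can be sketched with SciPy's `trim_mean` and `winsorize`; the data below are made up to include one extreme outlier:

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)  # one extreme outlier

# 20% trimmed mean: drop the lowest and highest 20% of values, average the rest
trimmed = stats.trim_mean(x, proportiontocut=0.2)

# Winsorized variance: replace the tail values by the nearest remaining values,
# then take the ordinary sample variance of the result
x_wins = mstats.winsorize(x, limits=(0.2, 0.2))
win_var = np.var(x_wins, ddof=1)
```

The trimmed mean ignores the outlier entirely, and the Winsorized variance stays small where the ordinary sample variance is blown up by it. Recent SciPy versions also expose Yuen's test directly via the `trim` argument of `stats.ttest_ind`.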

Another assumption whose violation often has large consequences is homoscedasticity: the assumption that both groups have equal variances. Levene’s test and an F-test are often used to assess whether this assumption is violated, but both tests are themselves sensitive to violations of the normality assumption. It is therefore recommended to use a variant of the independent t-test that does not assume equal variances, such as Welch’s test. Zimmerman (2004) even suggests that Welch’s test should be used in all cases, including when the distributions are homoscedastic. With Welch’s test the probability of a Type I error is controlled better, and as a result the power is higher in most situations.
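In SciPy, Welch's test is simply the `equal_var=False` variant of `ttest_ind`. The group sizes and variances below are invented to make the groups clearly heteroscedastic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=20)   # small group, small variance
b = rng.normal(loc=0.0, scale=4.0, size=60)   # large group, large variance

student = stats.ttest_ind(a, b, equal_var=True)   # classic Student's t-test
welch = stats.ttest_ind(a, b, equal_var=False)    # Welch's test
```

With unequal group sizes and unequal variances, the pooled-variance Student test lets the Type I error rate drift away from the nominal level; Welch's test instead estimates each variance separately and adjusts the degrees of freedom (Welch–Satterthwaite), keeping the error rate near alpha.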



Wilcox, R. (2011). Modern Statistics for the Social and Behavioral Sciences. New York: CRC Press.

Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of Mathematical and Statistical Psychology, 57, 173-181.

Warning! The relationship may be reversed!

Pia Tio & Riëtte Olthof

One speaks of a Simpson’s paradox when a statistical relationship observed in a group differs from the statistical relationships found in its subgroups (Kievit, Frankenhuis, Waldorp, & Borsboom, 2013). It occurs most often when inferences are drawn from one explanatory level to another, and can be present in both categorical and continuous datasets.

One of the best-known examples concerns the admission rates of the University of California, Berkeley in 1973 (Bickel, Hammel, & O’Connell, 1975). Overall, it seemed that being a man increased your chance of being admitted compared to being a woman. However, if one looks not at the university level but at the faculty level, a different pattern emerges: the admission rate for women is often higher than that for men, which contradicts the apparent bias against women found earlier.

The main cause of Simpson’s paradox is the assumption that a relationship or correlation between a predictor (in Berkeley’s case, the sex of the applicant) and an outcome variable (in Berkeley’s case, admission rate) is due to a direct link between the two variables. What is forgotten is that this apparent relationship can also be due to a link between the predictor and a third variable (in Berkeley’s case, the faculty where admission is sought).

The consequences of ignoring this link between the predictor and a third variable can be far-reaching. Table 1 gives an example of Simpson’s paradox with serious consequences (Suh, 2009). Pooled over all patients, treatment A seems to have a higher mortality rate than treatment B. But there is a link between treatment and a third variable, the presence of a risk factor. Both when the risk factor is absent and when it is present, the mortality rate is higher for treatment B, contradicting the conclusion from the pooled data. This happens because the risk factor is not equally divided over the treatment groups: people with the risk factor more often get treatment A, and people without it more often get treatment B.

Table 1 (Suh, 2009)
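A reversal of this shape can be reproduced with a few lines of Python. The counts below are purely illustrative (they are not Suh's actual data); what matters is that the risk factor is unevenly divided over the treatment arms.

```python
# Hypothetical (deaths, patients) counts per treatment within each stratum.
strata = {
    "no risk factor": {"A": (1, 100),  "B": (9, 300)},   # A: 1.0%,  B: 3.0%
    "risk factor":    {"A": (30, 300), "B": (10, 70)},   # A: 10.0%, B: 14.3%
}

def rate(deaths, total):
    return deaths / total

# Within every stratum, treatment B has the higher mortality rate...
for counts in strata.values():
    assert rate(*counts["B"]) > rate(*counts["A"])

# ...yet pooled over the strata, treatment A looks worse, because most
# high-risk patients received treatment A.
pooled = {
    t: rate(sum(s[t][0] for s in strata.values()),
            sum(s[t][1] for s in strata.values()))
    for t in ("A", "B")
}
```

Here the pooled mortality is 31/400 = 7.75% for A against 19/370 = 5.1% for B, even though B is worse in both subgroups.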

Suh stated that this occurred in studies on the treatment of dementia with antipsychotics. A large study in 2003 concluded that mortality was higher with this treatment than with another, and the U.K. Committee on Safety of Medicines therefore warned clinicians not to prescribe antipsychotics to dementia patients. Suh concluded in 2009 that Simpson’s paradox was involved in the 2003 study and that the mortality rate of patients using the antipsychotics was actually lower than that of the other treatments. So clinicians should not have been warned, but should instead have been encouraged to prescribe the antipsychotics to dementia patients. Because of Simpson’s paradox, patients may have been treated with less favourable medicines and may even have died as a consequence.

The prevalence of Simpson’s paradox in the published literature is unknown. According to a simulation study by Pavlides and Perlman (2009), Simpson’s paradox will be present in 1.69% of cases, but this might be either an overestimation or an underestimation of the incidence in the published literature.

Simpson’s paradox can be revealed through visualisation of the data, especially with scatterplots. There are also tests available that help to detect it: examples are a conditional dependence test for categorical data, and a homoscedasticity check and cluster analyses for continuous data. Moreover, for continuous data, an R package called ‘Simpsons’ can be used (Kievit, Frankenhuis, Waldorp, & Borsboom, 2013). This package warns you when the relationship is reversed in one of the subgroups.
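What such a subgroup check amounts to can be sketched in a few lines of Python (rather than the R package mentioned above), using simulated data: two clusters, each with a negative x-y relationship, placed so that the pooled relationship comes out positive.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two subgroups, each with a clearly negative x-y relationship...
x1 = rng.normal(0.0, 1.0, 200)
y1 = -0.8 * x1 + rng.normal(0.0, 0.5, 200)

# ...but the second group is shifted up and to the right, so the
# between-group trend is positive.
x2 = rng.normal(4.0, 1.0, 200)
y2 = -0.8 * (x2 - 4.0) + 4.0 + rng.normal(0.0, 0.5, 200)

x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])

r_sub1 = np.corrcoef(x1, y1)[0, 1]    # negative within subgroup 1
r_sub2 = np.corrcoef(x2, y2)[0, 1]    # negative within subgroup 2
r_pooled = np.corrcoef(x, y)[0, 1]    # positive over the pooled data
```

A pooled analysis would report a positive relationship, while every subgroup actually shows the opposite; comparing the sign of the subgroup correlations with the pooled correlation is exactly the kind of warning the detection tools give.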


Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398-404.

Kievit, R. A., Frankenhuis, W. E., Waldorp, L. J., & Borsboom, D. (2013). Simpson’s paradox in psychological science: A practical guide. Frontiers in Psychology, 4, 1-12.

Pavlides, M. G., & Perlman, M. D. (2009). How likely is Simpson’s Paradox? The American Statistician, 63, 226-233.

Suh, G. H. (2009). The use of atypical antipsychotics in dementia: rethinking Simpson’s paradox. International Psychogeriatrics, 21, 616–621.