Multilevel modeling is needed when we want to model a hierarchical structure. Because the world is full of hierarchies (students within classes within schools, time points within people, penalties within goalkeepers, clients within therapists and, what we often forget in science, research subjects within experimenters), we quite often, if not almost always, need multilevel models. Through several examples (Clark, 1973; Van Baaren, Holland, Steenaert, & van Knippenberg; Hoeken & Hustinx, 2009) I hope to have shown the necessity of this procedure. However, I have been asked what the assumptions of multilevel models are. Because I did not have a straight answer, I would like to take this opportunity to address this question here.

To understand this, I first have to explain the nature of multilevel modeling briefly. A complete (random intercept, random slope) multilevel model consists of a normal regression equation, y_ij = b_0j + b_1j * X_ij + e_ij, where the subscript i indexes individuals and the subscript j indexes groups. The differences between groups are specified by a second set of equations (level 2) for the intercept and slope parameters, b_0j = γ_00 + u_0j and b_1j = γ_10 + u_1j. Here γ_00 and γ_10 are the average intercept and slope across groups, and u_0j and u_1j are the group-level deviations from those averages. One of the u terms can be dropped to get a fixed intercept, random slope model or a random intercept, fixed slope model (see all three models in the figures below).
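The two-level structure described above can be made concrete with a small simulation. This is only a sketch: the parameter values, the function names `simulate_multilevel` and `ols_per_group`, and the choice of a uniform predictor are assumptions made for illustration. The second function fits a separate ordinary regression per group, which is not a multilevel fit, but it shows that intercepts and slopes differ between groups in the way the level-2 equations describe.

```python
import random
from statistics import mean


def simulate_multilevel(n_groups=50, n_per_group=30, gamma00=2.0, gamma10=0.5,
                        sd_u0=1.0, sd_u1=0.3, sd_e=1.0, seed=1):
    """Simulate (group, x, y) rows from a random intercept, random slope model.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for j in range(n_groups):
        # Level 2: each group gets its own intercept and slope.
        b0j = gamma00 + rng.gauss(0.0, sd_u0)
        b1j = gamma10 + rng.gauss(0.0, sd_u1)
        for _ in range(n_per_group):
            # Level 1: individual outcome within group j.
            x = rng.uniform(0.0, 10.0)
            y = b0j + b1j * x + rng.gauss(0.0, sd_e)
            rows.append((j, x, y))
    return rows


def ols_per_group(rows):
    """Fit a separate least-squares line per group; returns {group: (intercept, slope)}."""
    groups = {}
    for j, x, y in rows:
        groups.setdefault(j, []).append((x, y))
    fits = {}
    for j, pts in groups.items():
        mx = mean(x for x, _ in pts)
        my = mean(y for _, y in pts)
        sxx = sum((x - mx) ** 2 for x, _ in pts)
        sxy = sum((x - mx) * (y - my) for x, y in pts)
        slope = sxy / sxx
        fits[j] = (my - slope * mx, slope)
    return fits


if __name__ == "__main__":
    fits = ols_per_group(simulate_multilevel())
    slopes = [slope for _, slope in fits.values()]
    print("mean slope across groups:", round(mean(slopes), 3))
    print("smallest and largest group slope:", round(min(slopes), 3), round(max(slopes), 3))
```

Setting `sd_u1=0.0` gives the random intercept, fixed slope variant, and setting `sd_u0=0.0` gives the fixed intercept, random slope variant.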


Now to the assumptions. These are the same as in ordinary multiple regression analysis:

  • linear relationships,
  • normal distribution of the residuals,
  • and homoscedasticity.

When the assumption of linearity is violated, we can check for other relations (for instance, by adding the square of a time variable in a longitudinal study). Note that due to the introduction of multiple levels there is now more than one residual term, so normality has to hold at each level. Of course this complicates matters a bit, but it has been shown that multilevel estimation methods are quite robust against violations of the normality assumption at the second level (Maas & Hox, 2004). Another great advantage of multilevel modeling is that heteroscedasticity can be modeled directly, to account for violations of the final assumption (cf. Goldstein, 1995, pp. 48–57).

So to sum up, USE MULTILEVEL!



Goldstein, H. (1995). Multilevel statistical models. London: Edward Arnold; New York: Halsted.


Maas, C. J. M., & Hox, J. J. (2004). The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Computational Statistics & Data Analysis, 46, 427-440.

“The Earth Is Round (p<.05)”

In the context of our presentation about power issues in psychological research, we were very pleased to introduce you to, or remind you of, one of the most brilliant and therefore classic articles in the field of psychological methods: “The Earth Is Round (p<.05)” by Jacob Cohen (1994). In his article, Cohen questions the still dominant procedure of null hypothesis significance testing (NHST). That this procedure is still so widely used is quite remarkable, since we know that:

  • H0 is seldom true
  • we are simplifying reality when we think in those binary terms, like: effect vs no effect
  • we don’t even “get what we want” by applying this procedure

Cohen questions the idea of living by p values alone and suggests that scientists avail themselves of the many tools they can find in the statistical toolbox. He made clear that what many researchers want, and therefore conclude, from significance testing is the probability that a hypothesis is true given the evidence (the data), whereas what one gets from a standard test is the probability of the evidence assuming that H0 is true.
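The gap between these two probabilities can be shown with Bayes' rule. The sketch below is a simplification under stated assumptions: it treats H0 and a single alternative as the only possibilities, takes alpha as the probability of a significant result when H0 is true, and uses made-up prior probabilities and power values. The function name `prob_h0_given_significant` is hypothetical.

```python
def prob_h0_given_significant(prior_h0, alpha, power):
    """Return P(H0 | significant result) via Bayes' rule.

    prior_h0: assumed prior probability that H0 is true
    alpha:    P(significant | H0 true)
    power:    P(significant | H0 false)
    """
    p_sig_and_h0 = alpha * prior_h0
    p_sig_and_h1 = power * (1.0 - prior_h0)
    return p_sig_and_h0 / (p_sig_and_h0 + p_sig_and_h1)


if __name__ == "__main__":
    # Illustrative (assumed) combinations of prior and power, all with alpha = .05.
    for prior_h0, power in [(0.5, 0.8), (0.9, 0.5), (0.9, 0.2)]:
        posterior = prob_h0_given_significant(prior_h0, 0.05, power)
        print(f"P(H0)={prior_h0}, power={power}: P(H0 | significant) = {posterior:.3f}")
```

With the same alpha, the probability that H0 is true after a significant result changes considerably with the prior and the power, which is why p cannot be read as that probability.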

Many people's ignorance of what exactly NHST does is the source of many errors in applying this procedure. One of its most ignored and/or neglected components is power! Whether an effect is called significant depends on the chosen alpha, the sample size and the effect size, and together these determine the power. To illustrate the interrelations between the different components, check the graphical representation on the following website:

The power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is false. Although Cohen questioned NHST in 1994, he spent a huge amount of effort on describing and clarifying this procedure. For instance, his 1988 book describes different power calculations and contains extensive tables of power values for various effect sizes, alphas and sample sizes. A more recent development in the area of power calculation is G*Power 3 (Faul, Erdfelder, Lang, & Buchner, 2007), easy-to-use software that computes the power for many different statistical tests.
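To give a feel for how alpha, sample size and effect size determine power, here is a minimal sketch for one simple case: a two-sided comparison of two equally sized groups, using a normal approximation with a standardized effect size d. The function name `power_two_sample_z` is hypothetical, and this is not how Cohen's tables or G*Power are implemented; the approximation gives values slightly above the exact t-test power for small samples.

```python
from statistics import NormalDist


def power_two_sample_z(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test (normal approximation).

    d:           assumed standardized mean difference between the groups
    n_per_group: number of participants in each group
    alpha:       significance level
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the test statistic under the alternative.
    shift = d * (n_per_group / 2.0) ** 0.5
    # Probability of landing in either rejection region.
    upper = 1.0 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


if __name__ == "__main__":
    for n in (20, 64, 100):
        print(f"d=0.5, n per group={n}: power = {power_two_sample_z(0.5, n):.3f}")
```

When d is set to 0, the function returns alpha, which matches the definition: with no true effect, the only way to reject H0 is a false positive.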

We hope that this exploration of probability space has shown you that the neglect of power is bad science.


Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1994). The earth is round (p<.05). American Psychologist, 49, 997-1003.
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175-191.

Replications, a necessity for the social sciences to stay healthy

Replication is very important for science, and especially for the social sciences. In physics, researchers are able to control many, if not all, variables in an experiment. If a result is then found, it is relatively easy to replicate it by creating the exact same circumstances. In the social sciences this is a lot harder, as there are too many differences between people and their environments that could influence the outcome. Therefore we try to approximate this control and use statistics to argue that the difference found between groups is unlikely to be due to such external factors.

However, this method is probabilistic. Due to this probabilistic nature, an effect (or difference between groups) always has a chance of being "found" while in fact it is not there (a so-called false positive, with the chance usually set at α=0.05). This means that, when there is no true effect, about 1 out of every 20 experiments will still yield a significant result. A single significant finding is therefore weak evidence, and substantial proof for an effect can only be given by showing that it holds up consistently across independent experiments. This of course is only possible by replication.
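The false-positive rate mentioned above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: both groups are drawn from the same normal distribution (so there is no true effect), the standard deviation is treated as known so that a z-test can be used, and the function name `false_positive_rate` and its default values are made up for this example.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_experiments=2000, n_per_group=30, alpha=0.05, seed=42):
    """Share of simulated experiments that are 'significant' although H0 is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference in means when sd = 1 in both groups.
    se = (2.0 / n_per_group) ** 0.5
    significant = 0
    for _ in range(n_experiments):
        # Both groups come from the same population: any difference is chance.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        z = diff / se
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            significant += 1
    return significant / n_experiments


if __name__ == "__main__":
    print("share of significant results without a true effect:", false_positive_rate())
```

The returned share should land close to the chosen alpha. Note that this is the rate among experiments without a true effect, not the share of all published significant results that are false.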

Another problem is the availability of studies that show no effect. These studies have little chance of getting published, and therefore it is hard to get a reliable overview of the number and ratio of studies that have and have not found an effect. This publication bias is also a large danger for meta-analyses, which might otherwise overcome the effect of false positives. The best way to solve this would be to convince journals of the necessity of replication, but this seems to be quite difficult. Luckily there are some journals, like PLOS ONE, that do allow papers with replication studies, and dedicated replication initiatives could also help a lot.

Another way in which I would suggest replication could be possible (even on a large scale) would be (ab)using students. In the bachelor program of psychology at the UvA, and probably in many other (social science) programs, there is a research practical. All students who take part in this practical learn to do research by carrying out a particular study. If this were well coordinated, many replications of one study could be done each year without large additional problems. However, after asking a few fellow students, it seems that the subject of the research practical has not changed since I did it myself (which is 5 years ago). It saddened me to hear this, because this seemed to me a beautiful opportunity for the social sciences to prove their consistency.

Furthermore, if a partnership between the research practicals of different universities were to arise, the consistency of effects across populations (e.g. Dutch vs Germans) could also be established. And if I might dream even further, we could even compare within-country differences with between-country differences. However, without a proper replication plan, this ideal will probably stay in my dreams.


Some people claim that science is self-correcting. I would like to change this claim by stating that science could be self-correcting, but we have not yet reached that state. There are options to improve this (though this obviously and ironically needs some more research) and one of the most important is replication.