Publication Bias

Meta-analysis in statistics refers to a method of combining independent studies to see whether their results disagree and to look for interesting patterns. In an ideal world, all valid results (i.e., results obtained through sound methods and statistics) on the topic under analysis would be at the analyst's disposal. By combining these results, the nature of a statistically significant result can be investigated from a broader perspective. Unfortunately, it is rarely the case that all results are published. This is a serious problem.

In reality, a positive outcome makes it more likely that a study gets published (Mahoney, 1977; Bakker, van Dijk, & Wicherts, 2012). When the scientific community pushes researchers to obtain significant results, motives other than the urge to find the truth can come into play. In the extreme, researchers may resort to anything-goes behaviour (e.g., fraud) to obtain significant results. This would leave us with a heavily biased sample of published research, consisting of significant results that do not correspond to the real world. One can rightly argue that the majority of researchers do not go to these extremes; however, reactions far milder than outright fraud can also have a severe effect on the sample of published research (Simmons, Nelson, & Simonsohn, 2011; John, Loewenstein, & Prelec, 2012). When papers reporting true null results are rejected, and researchers are (unconsciously) encouraged to push their results past a pre-specified significance level, we are left with unreliable publications. This brings us back to meta-analysis: meta-analyzing a biased sample of research is problematic. So how are we to solve this problem? Here I will mention two solutions: (1) one from the perspective of conducting meta-analysis, and (2) one from the perspective of the people involved in the publication process.

First, this problem is not new in psychology (Rosenthal, 1979). Researchers have already developed various ways to improve meta-analysis so that publication bias can be detected: making funnel plots, running fail-safe N analyses, and much more. However, all of these techniques only estimate the likelihood of publication bias; they measure its presence indirectly, so we can never get our hands on the actual size of the bias.
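To give a concrete feel for one of these tools, here is a minimal sketch of Rosenthal's (1979) fail-safe N, which asks how many unpublished null-result studies would have to sit in the file drawer before the combined result loses significance. The z-scores below are made-up illustrative values, not data from any real meta-analysis.

```python
def fail_safe_n(z_scores, z_alpha=1.645):
    """Rosenthal's fail-safe N: the number of unpublished studies
    averaging z = 0 needed to pull the combined result below the
    one-tailed .05 criterion (z_alpha = 1.645)."""
    k = len(z_scores)
    return (sum(z_scores) ** 2) / (z_alpha ** 2) - k

# Hypothetical example: five published studies, all significant.
zs = [2.1, 2.5, 1.8, 2.9, 2.2]
print(round(fail_safe_n(zs), 1))  # roughly 44 hidden null studies
```

A large fail-safe N suggests the combined effect is robust to the file drawer; a small one means only a handful of unpublished null results would overturn it.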
Second, several initiatives have been launched to make psychological science more transparent by making all conducted research available to everyone. One initiative that has been around for a while is a repository where people can upload their non-significant replications, which would otherwise not have been published. A more recent initiative is a website where researchers can publish almost everything they do in their research and make it available for everyone to use and check.

Making analyses more sophisticated and psychological science more transparent will hopefully reduce bias to the point that we can (almost) fully rely on published research again.

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543-554.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524-532.

Mahoney, M. J. (1977). Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive Therapy and Research, 1, 161-175.

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638-641.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

I have the power! Or do I really?



“Knowledge is power”
Francis Bacon

Have a look at the following results and try to explain for yourself why the results of the three replications appear to be so different.

[Table: results of three replications of a five-predictor multiple regression, from Maxwell (2004)]

Given the headline, you might already suspect that power is the issue here. Three additional points: (1) the sample size in all studies is n = 100, (2) all predictors share a medium correlation (r = .30) with the dependent variable and with each other, and (3) G*Power indicates a post hoc power of .87. Does this mean that you were totally wrong? The answer is no.

If you find it difficult to explain the deviant results, you are no exception. The table is taken from a paper by Maxwell (2004) in which he demonstrates that many psychology experiments lack power. How can this be a lack of power if G*Power indicates a statistical power of .87? Well, G*Power does not distinguish (and neither do most of us) between the power to find at least one significant predictor and the power to find any specific predictor. Maxwell ran several simulations and found that the power to find any single specific effect in a multiple regression (n = 100) with five predictors is .26, and that the chance of all five predictors turning out significant is less than .01. With this in mind, the unstable pattern of results is much easier to explain. One might object that a multiple regression with five predictors is an extreme example, but even a 2 x 2 ANOVA with medium effect sizes and n = 40 per cell finds all true effects (two main effects and one interaction) with a probability of only .69.
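Maxwell's distinction is easy to check with a quick Monte Carlo simulation. The sketch below is my own illustration, not Maxwell's code: it draws five predictors and an outcome that all share pairwise correlations of r = .30, fits an ordinary least-squares regression with n = 100, and counts how often a specific predictor, at least one predictor, and all five predictors come out significant at the .05 level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, r, reps = 100, 5, 0.30, 2000

# All six variables (5 predictors + outcome) share pairwise r = .30.
cov = np.full((k + 1, k + 1), r)
np.fill_diagonal(cov, 1.0)

any_sig = all_sig = first_sig = 0
for _ in range(reps):
    data = rng.multivariate_normal(np.zeros(k + 1), cov, size=n)
    X = np.column_stack([np.ones(n), data[:, :k]])  # add intercept
    y = data[:, k]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = n - k - 1
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    p = 2 * stats.t.sf(np.abs(beta / se), df)[1:]  # drop intercept
    first_sig += p[0] < .05
    any_sig += (p < .05).any()
    all_sig += (p < .05).all()

print(f"power for one specific predictor: {first_sig / reps:.2f}")
print(f"power for at least one predictor: {any_sig / reps:.2f}")
print(f"power for all five predictors:    {all_sig / reps:.2f}")
```

The specific-predictor power lands near Maxwell's .26, the at-least-one power is far higher (close to what G*Power reports), and all five predictors are almost never significant at once.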

This is yet another example of how significance tests can be misleading. We have to be aware that this method, the common testing paradigm in psychological research, can be flawed and has to be evaluated with caution (Wagenmakers, 2007). Evaluating confidence intervals, for example, is one way to appreciate the uncertainty that underlies frequentist hypothesis tests. The following figure shows the confidence intervals of the five predictors in the three replications and clarifies that the results do not differ as much as the p-values suggest.

[Figure 1: confidence intervals of the five predictors in the three replications]

The most important lessons to take from this demonstration are: (1) don't be fooled by p-values, (2) consider the confidence intervals, (3) be aware of the uncertainty of results, (4) do not let your theory be dismissed by a Type II error, and (5) publication bias as well as underpowered studies may distort the body of literature, so to be safe one should assume that published effect sizes are overestimated. Consider these five lessons when you plan your experiment, because a lack of power can turn the results of an otherwise excellent experiment into useless data. While it may be disturbing that a 2 x 2 ANOVA needs more than 160 participants just to exceed a power of .69 for finding all medium-sized effects, be aware that sample size is not the only way to increase the power of an experiment. Reducing error variance with covariates and increasing the effect size by strengthening the manipulation are very effective and often more feasible than recruiting ever more participants.

Boris & Alex

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147-163.

Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779-804.

I’m WEIRD, is that a problem?

We as psychology students have two things in common. First, at some point in our lives we chose to study psychology, and second, in our first year we all had to participate in several (psychological) experiments. Do these two commonalities make us different from other people? And if so, do they have implications for the way we view human behaviour? The answer to the first question is yes; the answer to the second, as we will see, is more nuanced.

Most psychology students belong to a group called “WEIRD”: they come from Western, Educated, Industrialized, Rich, and Democratic societies. Although WEIRD people represent only 12% of the human population, they account for 96% of participants in behavioural experiments. What is remarkable is that most of these behavioural studies claim to study “human behaviour”. How valid is it to study human behaviour when you incorporate only a tiny part of humanity into your sample?

When you think about seeing or hearing, important aspects of human behaviour, you would intuitively guess that these are prototypes of behaviour that show little variability across the whole population. However, even for the basic cognitive process of seeing, it turns out that people can interpret visual input differently. Segall (1966) showed that people from small-scale societies interpret visual illusions differently from people in industrialized countries: they were less susceptible to the illusions. A common explanation is that the way our environment is constructed affects how we perceive the world: living in a society full of geometrically shaped buildings shapes your interpretation of visual input.
There are also differences within the industrialized world itself. Various experiments have illustrated striking social differences in the way Asian versus Western people conform to rules, or in the way they judge themselves and the people around them. Even further down the hierarchy, we find differences between Western and non-Western societies, between Americans and non-Americans, and so on. But are all these differences a problem?

The differences themselves are not the problem; the serious problem is that researchers fail to take them into account. As we said before, we psychology students, WEIRD people, have been the subject of a tremendous amount of behavioural research. When researchers interpret and publish their results, they often overlook the differences between people and generalize their results to the whole human population, even though most results were based on a limited and homogeneous WEIRD subject pool. As a result, researchers launch psychological phenomena such as the “fundamental” attribution error into the world as universal human traits, only to find out a couple of years later that the phenomenon is not fundamental at all but a typically Western tendency.

So, there is no problem in being WEIRD, but there is a problem in forgetting that we are WEIRD. When doing research as WEIRD researchers, we should be aware that we are people with possibly different attitudes and behaviours, and that we represent only a tiny part of the whole human population. This means that much of the knowledge about human behaviour gathered over the last century applies only to a small group of people. In the next century, maybe we should focus somewhat more on the other 88% of the world?

Evalyne & Joanne


Warning! The relationship may be reversed!

Pia Tio & Riëtte Olthof

One speaks of a Simpson's paradox when a statistical relationship observed in a group differs from the statistical relationships found in its subgroups (Kievit, Frankenhuis, Waldorp, & Borsboom, 2013). It occurs most often when inferences are drawn from one explanatory level to another, and it can be present in both categorical and continuous datasets.

One of the best-known examples concerns the admission rates of the University of California, Berkeley in 1973 (Bickel, Hammel, & O'Connell, 1975). Overall, it seemed that being a man increased your chance of being admitted compared to being a woman. However, if one looks not at the university level but at the faculty level, a different pattern emerges: there, the admission rate for women is often higher than that for men, which contradicts the apparent bias against women found earlier.

The main cause of Simpson's paradox is the assumption that a relationship or correlation between a predictor (in Berkeley's case, the applicant's sex) and an outcome variable (in Berkeley's case, admission rate) is due to a direct link between the two variables. What is forgotten is that this apparent relationship can also be due to a link between the predictor and a third variable (in Berkeley's case, the faculty where admission is sought).

The consequences of ignoring such a link between the predictor and a third variable can be far-reaching. Table 1 gives an example of Simpson's paradox with serious consequences (Suh, 2009). It seems that treatment A has a higher mortality rate than treatment B. But there is a link between treatment and a third variable, the presence of a risk factor. Both when the risk factor is absent and when it is present, the mortality rate is higher for treatment B, which contradicts the conclusion drawn from the pooled data. This happens because the risk factor is not equally divided over the treatment groups: people with the risk factor more often receive treatment A, and people without it more often receive treatment B.

[Table 1: mortality rates by treatment and presence of risk factor (Suh, 2009)]
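Since Suh's exact counts are not reproduced here, the toy numbers below are invented purely for illustration; they merely recreate the qualitative pattern just described: treatment B is worse within each risk stratum, yet treatment A looks worse in the pooled data because the risk factor is unevenly divided over the treatments.

```python
# Hypothetical (deaths, patients) per treatment x risk stratum;
# NOT Suh's actual data, only the same qualitative structure.
counts = {
    ("A", "no risk"): (2, 100),
    ("A", "risk"):    (30, 100),
    ("B", "no risk"): (4, 100),
    ("B", "risk"):    (12, 30),
}

# Within every stratum, treatment B has the higher mortality...
for stratum in ("no risk", "risk"):
    a = counts[("A", stratum)][0] / counts[("A", stratum)][1]
    b = counts[("B", stratum)][0] / counts[("B", stratum)][1]
    print(f"{stratum:>8}: A = {a:.1%}, B = {b:.1%}")

# ...yet pooled over strata, treatment A looks worse, because most
# high-risk patients received treatment A.
for treat in ("A", "B"):
    d = sum(v[0] for key, v in counts.items() if key[0] == treat)
    n = sum(v[1] for key, v in counts.items() if key[0] == treat)
    print(f"pooled {treat}: {d / n:.1%}")
```

With these numbers the strata give A = 2% vs B = 4% and A = 30% vs B = 40%, while the pooled rates reverse to A = 16% vs B = 12.3%.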

Suh stated that exactly this occurred in studies on the treatment of dementia with antipsychotics. One large study in 2003 concluded that mortality was higher with this treatment than with another, and the U.K. Committee on Safety of Medicines therefore warned clinicians not to prescribe antipsychotics to dementia patients. Suh concluded in 2009 that Simpson's paradox was involved in the 2003 study and that the mortality rate of patients using the antipsychotics was actually lower than that of other treatments. So clinicians should not have been warned off, but rather encouraged to prescribe the antipsychotics to dementia patients. Because of Simpson's paradox, patients may have been treated with less favourable medicines and may even have died as a consequence.

How prevalent Simpson's paradox is in the published literature is unknown. According to a simulation study by Pavlides and Perlman (2009), Simpson's paradox should be present in 1.69% of cases, but this may be either an overestimation or an underestimation of its incidence in the published literature.

Simpson's paradox can be revealed by visualising the data, especially with scatterplots. Some tests are also available to help detect it: examples are a conditional dependence test for categorical data, and a homoscedasticity check and cluster analyses for continuous data. Moreover, for continuous data, the R package ‘Simpsons’ can be used (Kievit, Frankenhuis, Waldorp, & Borsboom, 2013). This package warns you when the relationship is reversed in one of the subgroups.
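For the continuous case, the core of what such tools check can be sketched in a few lines: compare the pooled correlation with the within-subgroup correlations. The data below are simulated purely for illustration (the ‘Simpsons’ package additionally finds the clusters itself, which is the hard part when subgroup membership is unknown).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: within each of three subgroups x and y are
# negatively related, but the group means rise together, so the
# pooled correlation comes out positive -- the continuous form
# of Simpson's paradox.
groups = []
for mean in (0.0, 3.0, 6.0):
    x = rng.normal(mean, 1.0, 200)
    y = mean - 0.8 * (x - mean) + rng.normal(0, 0.5, 200)
    groups.append((x, y))

pooled_x = np.concatenate([g[0] for g in groups])
pooled_y = np.concatenate([g[1] for g in groups])
print("pooled r:", round(np.corrcoef(pooled_x, pooled_y)[0, 1], 2))
for i, (x, y) in enumerate(groups):
    print(f"group {i} r:", round(np.corrcoef(x, y)[0, 1], 2))
```

A scatterplot coloured by subgroup makes the same point visually: one upward-sloping cloud made of three downward-sloping clusters.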


Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187, 398-404.

Kievit, R. A., Frankenhuis, W. E., Waldorp, L. J., & Borsboom, D. (2013). Simpson’s paradox in psychological science: A practical guide. Frontiers in Psychology, 4, 1-12.

Pavlides, M. G., & Perlman, M. D. (2009). How likely is Simpson’s paradox? The American Statistician, 63, 226-233.

Suh, G. H. (2009). The use of atypical antipsychotics in dementia: rethinking Simpson’s paradox. International Psychogeriatrics, 21, 616–621.