Publication bias: how often do researchers check for this?

Researchers (and we, the “upcoming researchers”) have been complaining for years about all the things that seem to be wrong with psychology, or with science in general: researchers don’t share their data, they use the wrong statistical analyses, they come up with their hypotheses after they have seen their data, and they massage their data until something “interesting” pops up. Whenever we upcoming researchers discuss these problems, we always end up talking about publication bias: the tendency of journals and researchers to publish studies with significant results, leaving file drawers full of non-significant (but at least as interesting) studies.

Because of publication bias, people read only a small portion of the articles on a specific effect and start to believe the effect is real, even though it might not actually exist. In the late 1990s, for instance, articles were published supporting the hypothesis that reboxetine was effective in treating major depressive disorder. It was not until 2010, when a meta-analysis looked into the possible presence of publication bias, that researchers discovered that not only was the drug ineffective, it was potentially harmful! What had happened? Only 26% of the patient data had been published. The remaining 74% yielded non-significant results and was therefore never published, resulting in a terrible mistake: psychiatrists had been prescribing a potentially harmful pill to patients battling major depression. This example clearly shows that publication bias should not be taken lightly. Yet for years journals have failed to combat the problem.

But now, finally, things seem to be changing: journals such as Cortex have started working with preregistration, a system in which articles are selected for publication based on the quality of their methods rather than on their outcome or “interestingness”. While this is a wonderful development and will certainly help combat publication bias, it is not enough. In some fields publication bias may have been accumulating for years, and preventing it in future articles does nothing about the existing literature. It is therefore very important that researchers check for the possible presence of publication bias when conducting a meta-analysis. My question for the final assignment was: how often do researchers actually do this?

I checked this for 140 randomly drawn meta-analyses (twenty for every two years, from 2000 to 2013). What I found was that researchers checked for the presence of publication bias in only 37.14% of the articles. Perhaps even more shocking: of the 88 articles in which no check was conducted, only 6 (6.82%) mentioned why the authors did not do so (e.g. “because we added unpublished studies to our analyses, publication bias cannot be present” or “we wanted to check for publication bias with a funnel plot, but this was not possible due to the small sample of studies”).
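For the curious, the percentages above follow directly from the counts. A quick sanity check (the counts are reconstructed from the percentages reported in the text):

```python
# Counts reconstructed from the percentages reported above:
# 140 meta-analyses sampled, 52 of which checked for publication bias;
# of the 88 that did not, 6 explained why.
total = 140
checked = 52
no_check = total - checked   # 88
gave_reason = 6

print(round(100 * checked / total, 2))         # percentage that checked
print(round(100 * gave_reason / no_check, 2))  # percentage that explained why not
```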

Whether or not these reasons are correct, the main issue here is that apparently a lot of researchers either do not know that publication bias is a serious problem or simply fail to see it as one. Either way, researchers and upcoming researchers need to be taught, or reminded of, the problems publication bias causes and how to check for it. It would also help, I think, if journals demanded these checks for any meta-analysis considered for publication.

My question to you is: how do you think we, researchers and upcoming researchers, should combat publication bias? Could there be a science in which publication bias is no longer an issue?

Innocent… until someone gets suspicious.

Scientists ought to be critical and sceptical of each other’s work. But how far can we take this norm? I analysed four publications of a UvA professor to determine whether there were signs of data fabrication or falsification.

The professor who was the topic of my final assignment goes by the name Mr. 132, a pseudonym inspired by his row number in the data file I used when I performed a replication of Bakker and Wicherts’ (2011) study in 2012. In that replication study, Mr. 132 stood out (not in a good way) because he misreported a large number of p-values, and in particular made many gross errors (errors that change the significance of the result). The large number of misreported p-values made me suspicious, and when this assignment was announced, I knew what my topic was going to be.

It turned out that Mr. 132 still misreports a large number of p-values, as this barplot demonstrates. In the figure, his percentages of errors and gross errors are plotted against the percentages found by Bakker and Wicherts (2011). If I had not been suspicious already, I would be after these results.

Luckily, I did not need to make a trip to the head of my faculty to accuse a professor of data fabrication or falsification. Using the simulation method proposed by Simonsohn (2013), I simulated 69 pairs of standard deviations, 100,000 times each (a more detailed description of this method can be found in Simonsohn’s paper or in the paper on my profile on the Open Science Framework). Of these simulations, four pairs turned out to be significant (α = .05). Three of these significant probabilities were due to standard deviations that were exactly equal; the simulation method does not work in that case. The fourth turned out to be .042, which to me is a borderline case. Only 4 out of 69 simulations were significant, which does not convince me that Mr. 132 fabricated or falsified data.
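To give a feel for what such a simulation looks like, here is a minimal sketch of the idea behind Simonsohn-style checks (the function name, the common population SD, and the example numbers are my own illustrative assumptions, not the exact procedure from the paper): given a reported pair of sample standard deviations, estimate how often two independent normal samples would produce SDs at least as similar as the reported pair.

```python
import numpy as np

rng = np.random.default_rng(42)

def sd_similarity_p(sd1, sd2, n1, n2, pooled_sd, n_sim=10_000):
    """Estimate how often two independent normal samples (sharing one
    population SD) yield sample SDs at least as close together as the
    reported pair. A very small probability flags suspiciously similar SDs."""
    observed_gap = abs(sd1 - sd2)
    sims1 = rng.normal(0.0, pooled_sd, size=(n_sim, n1)).std(axis=1, ddof=1)
    sims2 = rng.normal(0.0, pooled_sd, size=(n_sim, n2)).std(axis=1, ddof=1)
    return float(np.mean(np.abs(sims1 - sims2) <= observed_gap))

# Moderately different SDs are unremarkable; near-identical ones are rare:
print(sd_similarity_p(9.8, 10.2, 25, 25, 10.0))
print(sd_similarity_p(10.0, 10.01, 25, 25, 10.0))
```

Note that when the two reported SDs are exactly equal, the observed gap is zero and the estimated probability collapses toward zero, which is why the method breaks down in that case, as mentioned above.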

In sum, the odds are in Mr. 132’s favour. Based on these results, I think it is highly unlikely that he fabricated or falsified data, even though he misreported a large number of his p-values. Nor do I think further investigation is needed to prove him innocent (or guilty, since I cannot rule that out with absolute certainty). So in this case, Mr. 132 is innocent… until someone gets suspicious again.



Bakker, M., & Wicherts, J. M. (2011). The (mis) reporting of statistical results in psychology journals. Behavior Research Methods, 43, 666–678.

Simonsohn, U. (2013). Just post it: The lesson from two cases of fabricated data detected by statistics alone. Psychological Science, 24, 1875-1888.

How to implement open science?

I think that many researchers and students would agree that we need to change science. We simply do not know what researchers do with their data, because we have no access to them. This freedom gives researchers the chance, consciously or unconsciously, to engage in questionable research practices. Science is in need of improvement, and openness is a central theme within this revolution. But merely agreeing on this will not get us any further; we need a practical change. So the key question is how we are going to turn the scientific field into a more open one.

One way is to implement the Open Science Framework, an initiative by Brian Nosek that encourages scientists to share their research projects online. The Open Science Framework enables researchers to create a project, upload files, and share them whenever they think they are ready to do so. This way, journal reviewers (and others) can check the researcher’s work at any time. Moreover, it shows that the researcher in question has nothing to hide.

In my final assignment for the Good Science, Bad Science course, I propose an implementation of this framework in undergraduate education. I created a lesson plan that explores different features of the framework and also exercises several analytical and writing skills. The students create an account on the Open Science Framework, review a project of a researcher, make a project themselves, check the projects of other students, and write an essay on the positives and negatives of the Open Science Framework and its potential within the scientific field. By working on their analytical and writing skills within the Open Science Framework, they will hopefully internalize the framework and use it in their future (scientific) careers. Teaching future researchers to use the OSF may change the scientific world from the inside out.
We do not have to force researchers to practice open science; we can simply introduce the young ones to proper working methods that they will hopefully keep using later in their lives. Changing science will be a matter of time: teaching students how to use the Open Science Framework will not cause an immediate change, but it might cause a more thorough change in the long run than simply telling researchers to change their working habits right now.

We are students in an exciting time. Science is on the move. And perhaps in ten years time, science will have become its true self again.

Should we trust science?

The course ‘Good Science, Bad Science’ has almost come to an end. The first half of the group has presented its final assignments. Although no two topics were the same, all of them related to one issue: the way researchers practice science is faulty. Researchers make one mistake after another, sometimes unconsciously, other times consciously. Researchers do everything to get their papers published. Science is no longer about the pursuit of truth, but about personal image, about publishing, about finding the easy way out. During the presentations I found myself wondering how I had wound up in this almost evil place.

But this pessimistic attitude of mine is perhaps not entirely justified. Although each student presented a problem within science, something we have to pay attention to, they portrayed no doom scenarios. Each of them offered solutions or alternatives for the problems they mentioned, and many expressed at least some trust in the future of science. And perhaps they have good reason to trust science more than I am expressing in this blog. Science is on the move. It is the period of ‘academic spring’, as one of the students elegantly called it. We are finally becoming aware of the mistakes we make, and not just the over-intellectual methodologists, who have been telling everyone what they have been doing wrong for quite some time now, but also the students and the more applied researchers who deal with the real world and real data. Knowledge of what is good and bad science is becoming more and more widespread.

So yes, although the opening of this blog was rather pessimistic, I would rather acknowledge that science is probably in a better place than it was a couple of years ago, and that it is improving every minute. Nevertheless, the question remains whether we should trust science the way it is right now. And although I would love to be optimistic about this, I notice that I fall back into my pessimism.
I tend to answer this question with: ‘no, not until we have good reasons for it.’ We are not there yet. If I read a research article, I am more likely to be sceptical than to be enlightened. That might not be fair to all the trustworthy researchers out there, but I also believe that some researchers are, in one way or another, not so trustworthy. Hopefully the academic spring will continue to make us aware of the pitfalls in science, and I truly hope that one day my answer to this question will be a straightforward ‘yes’.

Crisis, what crisis? Revolution!

Barbara Spellman (introduced as Bobbie) teaches evidence and various courses on the intersection of psychology and law at the UVA. She is also the new editor of Perspectives on Psychological Science.
She discussed the crisis of replicability. For her, however, this is not a crisis but rather a revolution, and she drew several analogies between the failure to replicate, and its impact, and a revolution.
Following Bobbie, I will talk about the past, the present, and the future. What made the failure to replicate into a revolution? How should we behave during the revolution? And what can we predict for the future?

First, I would like to describe what kind of revolution Bobbie means when she compares the situation in our field to one. The change in our field is similar to the French Revolution in that our own people (researchers) are pushing the revolution: they do not want to fight “the others”, but rather to change something within our own field (in the French Revolution, this was the monarchy).

The past:
Replicating other researchers’ studies is nothing new; we have been doing it for a long time, and certain studies failed to replicate back then too. Why has this problem only now become a revolution? What events pushed it? One of the major events of recent years has been the fraud cases (not just in the Netherlands, hooray). These cases opened our eyes to the question of how valid our findings are. But perhaps more important is the technological change of the last years. The use of the internet in science has accelerated research: one can easily find lots of participants, all articles are within reach (one click of a button and the paper is on your computer), and running the analyses is also easier and faster. However, this fast science can also lead to sloppy science. What determines which of these findings are real effects? We need to replicate. Moreover, the internet has given replication a real voice. Before so much information and research was shared online, it was hard to get attention for your replication. Now you can spread the news (did it replicate or not) very fast, and all the replications add up on the internet, making their influence stronger.

The present:
How should we behave in this revolution? There are so-called replicators and anti-replicators, and both have a lesson to learn. First of all, we should never take science personally: a failure to replicate is not a personal rejection. The anti-replicators should not think that replicators are simply evil, and the replicators, in turn, should not act as if replication is the only thing that matters. Let’s meet in the middle.

The future:
More young people are getting PhDs and more older people are retiring. The younger people are the ones used to sharing on the internet. Therefore Bobbie predicts that this revolution will take hold and that the internet will have a prominent role in science. She predicts that ultimately there will be no journals anymore. You take your paper and send it into the “sky”. You categorize it (e.g. “this paper focuses on cognitive neuroscience”) and online ‘journals’ can then spot it and ‘publish’ it online, awarding it a star or something else that distinguishes the paper from others on the internet.

Let’s see how it works out!

Riet van Bork

Bringing back science!

For science to be of any value, several codes of conduct have been written which scientists must (well, should) obey. Though there are slight differences, these codes of conduct all embrace the same values: a scientist should be scrupulous, reliable, impartial, and independent, and all of his or her work should be verifiable. To violate these codes implies disrespect for the practice of science. Among the worst forms of scientific misconduct are fabrication (making up data), falsification (changing collected data), and plagiarism.

Estimating how often the codes of conduct are violated is no easy task. For one, people who have deviated from the codes might not be willing to be honest about their deeds. For another, while most scientists agree that practices such as plagiarism and fabrication are forms of scientific misconduct, there is a large grey area in which a practice that is completely in line with the codes of conduct in one context could be seen as a questionable research practice in another. Attempts to estimate scientific misconduct have nevertheless been made: Fanelli (2009) found that 1.97% of the scientists surveyed admitted to having fabricated or falsified data, and that up to 33.7% admitted to other questionable research practices.


Even though scientific misconduct does happen, the number of scientists who get reported is small. The most common ways scientists are caught are through a whistleblower or through statistical detection. There are, however, many reasons why few people decide to blow the whistle, including the fact that handling scientific misconduct is a problem of a non-academic nature, that the whistleblower must invest time and run professional risks, and a lack of knowledge concerning detection. As for statistical detection, several scientists risk their time and careers by occupying themselves with the search for scientific misconduct (for example Simonsohn, 2013). Yet while there are many parties[1] involved in science and in publishing findings, no party takes the responsibility to actively check for and deal with questionable research practices.

Despite all of this, there are a few solutions that can help us increase good scientific practice. Clear guidelines and codes of conduct should be written and actively distributed. (Raw) data and instructions should be shared as well, along with lab books and the documentation of experiments. Additionally, it would help immensely if journals were more open to post-publication reviews and comments. Too often we seem to forget what values are needed to practice science; it is time to take responsibility and bring them back.

[1] These parties include publishers and editors of journals, co-authors and colleagues, peers, peer reviewers, the Royal Academy, research institutions, research funds, and professional organizations.

Articles are no IKEA-manuals for replication studies

During our careers as psychology research students we learned to read articles and to evaluate them critically. We learned that articles are divided into sections that highlight the different aspects of the research process, and moreover, we learned not to skip the methodology section. We learned that the methodology section contains the core of the actual experiment and therefore reveals not only the strengths of a study but also its weaknesses. In learning to evaluate the methodology section, we also learned that some things are left unclear and have to be assumed. Still, in the end we think we comprehend the most important aspects of the article, and so the study in its entirety.

But what does “comprehending the study” mean? Does it mean that we now know every aspect of the study, that everything is clear? Does it mean that we should be able to conduct the same study ourselves, or in scientific terms, that we are able to replicate it?

Unfortunately, the answer is disappointing. If we are lucky, the original researchers provided some additional material, with, for example, more detailed information about the stimuli, the script, and so on. But even with that additional information, many important questions remain unanswered. If we are lucky twice over, the original researchers are prepared to give the remaining answers or even willing to cooperate in the replication. Even then, however, the replication is not guaranteed. Additional problems and dilemmas impose themselves: participants with other (cultural) backgrounds, weak aspects of the procedure, dubious interpretations, weird analyses, missing elements…

So what did I learn this time? Replication studies are not boring at all, whatever half the world may think. Replication is challenging and difficult, and it demands considerable research skill. Furthermore, methodology sections are indeed the core of the experiment, but they should be seen as a skeleton for which we have to find the flesh ourselves. And finally, research articles give an image of the conducted research, but they are no IKEA manuals telling us step by step how to replicate the study.

Evalyne Thauvoye

Stepwise regression: when to use it?

Say you, as a scientist, want to predict something in your research, such as the amount of oxygen someone can take up. You would want certain measures that could say something about that, such as a person’s age, height, and weight. With (some of) these predictive measures, or predictors, you would then try to find out whether you can actually predict how much oxygen someone can take up. To this end, the method of stepwise regression can be considered.

There are two methods of stepwise regression: the forward method and the backward method.
In the forward method, the software looks at all the predictor variables you selected and picks the one that explains the most variance in the dependent measure. That variable is added to the model, and the procedure is repeated with the predictor that explains the most of the remaining variance. This continues until adding predictors no longer improves the prediction model.
In the backward method, all the predictor variables you chose are entered into the model, and the variables that do not (significantly) predict the dependent measure are then removed from the model one by one.
The backward method is generally preferred, because the forward method can miss so-called suppressor effects, which occur when a predictor is only significant when another predictor is held constant.
There are two key flaws in stepwise regression. First, it overlooks certain combinations of variables: because the method adds or removes variables in a particular order, you end up with a combination of predictors that is partly determined by that order, and that combination may not be the one closest to reality. Second, the model that is found is selected out of the many possible models the software considered, so because of sampling variance it will often fit the data set that was used much better than a new data set.
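To make the two procedures concrete, here is a short sketch using scikit-learn’s `SequentialFeatureSelector` (a caveat: this implementation selects predictors by cross-validated fit rather than by the p-value criteria of classic stepwise regression, so it illustrates the forward/backward idea rather than reproducing it exactly; the data set is synthetic):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 6 candidate predictors, only 3 of which truly matter.
X, y = make_regression(n_samples=100, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)

# Forward selection: start empty, add the best remaining predictor each round.
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                    direction="forward").fit(X, y)

# Backward elimination: start with all predictors, drop the weakest each round.
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="backward").fit(X, y)

print(forward.get_support())   # boolean mask of the predictors kept
print(backward.get_support())  # may differ from the forward solution
```

That the two directions can keep different predictor sets is exactly the order-dependence problem described above.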

There are no real solutions to these problems, which is why it is suggested to use stepwise regression only in exploratory research. Stepwise methods can help a researcher get a ‘hunch’ of what the possible predictors are, and that, after all, is what exploratory research is for.

But of course confirmatory studies need regression methods as well. Luckily there are alternatives to the stepwise methods. One of these is the forced entry method, in which the predictors are put into the model all at once, without any hierarchical ordering. For example, a scientist who wants to test a theory in which math ability in children is predicted by IQ and age, but who has no assumptions about which is the better predictor, should use the forced entry method.
Although some statisticians consider forced entry the preferred method for confirmatory research, there is another alternative to the stepwise methods: the hierarchical (blockwise entry) method. Basically, this method consists of two steps. In the first step, predictors are entered into the model in a hierarchical order; for example, a scientist specifies a model in which math ability is predicted first by IQ and then by age. The second step is optional: the scientist can add more predictors, entered hierarchically, by forced entry, or stepwise. Of course, the problems mentioned earlier still apply when stepwise methods are used in this second step.

In the end, all of these methods can serve a purpose, but it is important for a scientist to know which method to use for which purpose.

Lucas & Jochem

Publication Bias

Meta-analysis in statistics refers to a method of combining independent studies to see whether there is any disagreement among their results and to look for interesting patterns. In an ideal world, all valid results (i.e. results obtained with sound methods and statistics) on the topic being analyzed would be at the analyst’s disposal. By combining these results, the nature of a statistically significant result can be investigated from a broader perspective. Unfortunately, it is rarely the case that all results are published. This is a serious problem.

In reality, a positive outcome makes it more likely that you can publish your results (Mahoney, 1977; Bakker, Van Dijk & Wicherts, 2012). When the scientific community pushes researchers to get significant results, motives other than the urge to find the truth come into play. In the extreme, researchers may react by engaging in behavior where anything goes (e.g. fraud) to get significant results, which would leave us with a very biased sample of published research consisting of significant results that do not correspond with the real world. One can correctly argue that the majority of researchers do not go to such extremes; however, reactions much milder than outright fraud can also have a severe effect on the sample of published research (Simmons, Nelson & Simonsohn, 2011; John, Loewenstein & Prelec, 2012). When papers showing true null results are rejected and researchers are (unconsciously) encouraged to force results to a pre-specified significance level, we are left with unreliable publications. This brings us back to meta-analysis: meta-analyzing a biased sample of research is problematic. So how are we to solve this problem? Here I will mention two solutions: (1) one from the perspective of conducting meta-analyses and (2) one from the perspective of the people involved in the publication process.

First, this problem is not new in psychology (Rosenthal, 1979). Researchers have already developed several ways to improve meta-analysis so that publication bias can be detected, such as making funnel plots, using fail-safe N analyses, and much more. However, all these solutions merely estimate the likelihood of publication bias: through indirect measures we gauge whether something like a publication bias is present, but we can never get our hands on the actual size of the bias.
Second, several initiatives have been started to make psychological science more transparent by making all conducted research available to everyone. One initiative that has been around for a while is a website where people can upload their non-significant replications, which would otherwise not have been published. A more recent initiative is a website where researchers can publish almost everything they do in their research and make it available for everyone to use and check.
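As a concrete example of the funnel-plot approach mentioned above, here is a sketch of Egger's regression test for funnel-plot asymmetry, one of the standard indirect checks (the function and the synthetic study data are my own illustration, not part of any of the cited papers):

```python
import numpy as np
from scipy import stats

def egger_test(effects, ses):
    """Egger's test: regress the standardized effect (effect / SE) on
    precision (1 / SE). An intercept far from zero suggests funnel-plot
    asymmetry, i.e. possible publication bias."""
    effects, ses = np.asarray(effects, float), np.asarray(ses, float)
    res = stats.linregress(1.0 / ses, effects / ses)
    t = res.intercept / res.intercept_stderr
    p = 2 * stats.t.sf(abs(t), df=len(effects) - 2)
    return res.intercept, p

# Synthetic, unbiased set of 30 studies around a true effect of 0.4;
# here the test should usually find no asymmetry.
rng = np.random.default_rng(7)
ses = rng.uniform(0.05, 0.5, size=30)
effects = rng.normal(0.4, ses)
intercept, p = egger_test(effects, ses)
print(round(intercept, 3), round(p, 3))
```

If the small, imprecise studies with null results had been filtered out before the test (as publication bias would do), the intercept would tend to drift away from zero.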

Making analyses more sophisticated and psychological science more transparent will hopefully reduce the amount of bias in a way that we can (almost) fully rely on published research again.

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543-554.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524-532.

Mahoney, M. J. (1977). Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive Therapy and Research, 1, 161-175.

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638-641.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

I have the power! Or do I really?



“Knowledge is power”
Francis Bacon

Have a look at the following results and try to explain for yourself why the results of the three replications appear so different.

Maxwell (2004)

Given the headline, you might already suspect that power is the issue here. Here are three additional facts: (1) the sample size in all three studies is n = 100, (2) all predictors share a medium correlation (r = .30) with the dependent variable and with each other, and (3) G*Power indicates a post hoc power of .87. Does this mean you were totally wrong? The answer is no.

If you find it difficult to explain the discrepant results, you are no exception. The table is taken from a paper by Maxwell (2004), in which he demonstrates how many psychology experiments lack power. But how can there be a lack of power if G*Power indicates a statistical power of .87? Well, G*Power, like most of us, does not distinguish between the power to find at least one significant predictor and the power to find any specific predictor. Maxwell conducted several simulations and found that the power to find a single specific effect in a multiple regression (n = 100) with five predictors is .26, and the chance that all five predictors turn out significant is less than .01. Considering this, the unstable pattern of results is much easier to explain. One might object that a multiple regression with five predictors is an extreme example, but even a 2 x 2 ANOVA with medium effect sizes and n = 40 per cell finds all true effects (two main effects and one interaction) with a chance of only 69%.
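Maxwell's point is easy to check by simulation. The sketch below is my own reconstruction of the scenario, not Maxwell's code: five predictors and a dependent variable, all pairwise correlated at r = .30, n = 100 per "study", tallying how often at least one predictor, versus all five, comes out significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, n_sim = 100, 5, 2000

# Five predictors and one dependent variable, all pairwise correlated
# at r = .30, as in the scenario described by Maxwell (2004).
R = np.full((k + 1, k + 1), 0.30)
np.fill_diagonal(R, 1.0)
L = np.linalg.cholesky(R)

any_sig = all_sig = 0
for _ in range(n_sim):
    data = rng.standard_normal((n, k + 1)) @ L.T
    X = np.column_stack([np.ones(n), data[:, :k]])   # intercept + predictors
    y = data[:, k]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - k - 1)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    p = (2 * stats.t.sf(np.abs(beta / se), df=n - k - 1))[1:]  # drop intercept
    any_sig += (p < .05).any()
    all_sig += (p < .05).all()

print("at least one predictor significant:", any_sig / n_sim)
print("all five predictors significant:  ", all_sig / n_sim)
```

The gap between the two printed proportions is exactly the distinction G*Power glosses over: finding *something* significant is likely, finding *everything* significant is not.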

This is yet another powerful example of how significance tests can be misleading. We have to be aware that this method, the common test paradigm in psychological research, can be flawed and has to be evaluated with caution (Wagenmakers, 2007). Examining confidence intervals, for example, is one way to appreciate the uncertainty that underlies frequentist hypothesis tests. The following table shows the confidence intervals of the five predictors in the three replications and clarifies that the results do not differ as much as the p-values suggest.

Figure 1

The most important lessons to take from this demonstration are: (1) don’t be fooled by p-values, (2) consider the confidence intervals, (3) be aware of the uncertainty of results, (4) do not let your theory be dismissed by a Type II error, and (5) publication bias as well as underpowered studies may distort the body of literature, so to be safe one should assume that effect sizes are overestimated. Consider these five lessons when you plan your experiment, because a lack of power can turn the results of an otherwise excellent experiment into useless data. And while it might be disturbing that a 2 x 2 ANOVA needs more than 160 participants to exceed a power of 0.69 for finding all medium-sized effects, be aware that sample size is not the only way to increase the power of an experiment: reducing variance with covariates and increasing the effect size by strengthening the manipulation are very effective, and often more feasible than recruiting ever more participants.

Boris & Alex

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychological Methods, 9, 147.

Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779-804.