Whose Fall is the Academic Spring?


The past has seen several protests against the controversial business practices of commercial publishing houses, but none has been as influential as the recent Elsevier boycott, a movement later dubbed the Academic Spring. It is now almost two years old, as is its eponym, the Arab Spring. Recognizing that in many countries of the Arab world the protests have not resulted in the changes many had hoped for, it seems reasonable to take a critical look at the outcomes of the academic uprising as well.

The complaints against Elsevier ranged from unreasonable journal prices to the publication of fake journals promoting the products of pharmaceutical companies [], but for many its support of the Research Works Act (RWA) was the final nail in the coffin of Elsevier's integrity []. Elsevier's opponents could quickly celebrate a first victory when the RWA was declared dead in late February. The RWA, however, was only one aspect of the critique, and its failure merely prevented the situation from getting worse; it did not really affect the status quo. The Cost of Knowledge petition continues to gather members, but it does not seem to pose an effective threat to Elsevier's market dominance. That the problem remains was demonstrated in April 2012, when the library of Harvard University released a memorandum describing increasing difficulties in paying the annual costs of journal subscriptions, concluding that "many large journal publishers have made the scholarly communication environment fiscally unsustainable and academically restrictive" []. One year later, Greg Martin resigned from his position on the editorial board of Elsevier's Journal of Number Theory; in his resignation letter he concluded that there had been no observable changes in Elsevier's business []. In their one-year résumé, the initiators of the boycott were somewhat more optimistic []. While they admit that not much has changed in Elsevier's general business strategy (the 'Big Deal' price negotiations have not become more transparent and bundling is still common practice), they report some minor price drops. More importantly, they argue, the boycott has raised awareness and increased support for newer, more open business models. What unites almost all of the critics is a shared hope in open access.
The recent Access2Research petition can be seen as a further success of the open access movement, as it convinced the White House to release a memorandum directing all federal agencies to make federally funded research freely available within 12 months of initial publication [].

PLoS One, an open access journal, is now by far the largest academic journal in the world [], and open access journals are being founded almost daily. While some of these new journals are what one would call 'scams', they, like all journals with poor quality standards, are unlikely to have a long life expectancy. The genie, however, has left the bottle, and it is unlikely that the fresh spirit of open access will disappear any time soon.


The independent t-test—violations of assumptions

Some methodologists complain that researchers do not check assumptions before doing a test. But in some cases checking does not help much. For example, when there are no clear standards on how to test assumptions, or on the robustness of tests against violations of these assumptions, results can still be interpreted badly.

In the case of the independent t-test, it is often assumed that the test is robust against violation of the normality assumption, at least when each group has a sample size of at least 30. But it is becoming clear that even small deviations from normality may lead to very low power (Wilcox, 2011). So when groups do differ, t-tests often fail to detect it. This is especially true for mixed normal distributions, which have a relatively large variance. At the left side of Figure 1, two normal distributions are compared and the independent t-test has a power of .9; at the right side, two mixed normal distributions with variance 10.9 are compared, and here the power is only .28. In such a case a researcher might conclude from a plot that the assumption of normality is not violated. The t-test might then fail to reject the null hypothesis of equal means, leading to the conclusion that the groups do not differ. What is missed is that the distribution is actually a mixed normal distribution, and the difference was not detected due to low power.


Figure 1. Left: two normal distributions whose means differ by 1; the power of the independent t-test is .9. Right: two mixed normal distributions with variance 10.9 whose means differ by 1; the power is now only .28.
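The power collapse illustrated in Figure 1 is easy to reproduce in a small simulation. The sketch below uses Wilcox's mixed normal (with probability .9 a value comes from N(0, 1), with probability .1 from N(0, 10²), giving the variance of 10.9 mentioned above); the per-group sample size and number of replications are my own illustrative choices, so the exact numbers will differ from the figure's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def normal(n, loc):
    return rng.normal(loc, 1.0, n)

def mixed_normal(n, loc):
    # Wilcox's contaminated normal: 90% N(loc, 1), 10% N(loc, 10); variance 10.9.
    sd = np.where(rng.random(n) < 0.1, 10.0, 1.0)
    return loc + rng.normal(size=n) * sd

def power(sampler, n=25, diff=1.0, reps=5000, alpha=0.05):
    """Proportion of independent t-tests that detect a true difference of `diff`."""
    hits = 0
    for _ in range(reps):
        hits += stats.ttest_ind(sampler(n, 0.0), sampler(n, diff)).pvalue < alpha
    return hits / reps

print(power(normal))        # high power for plain normal groups
print(power(mixed_normal))  # far lower power under 10% contamination
```

Whatever sample size is chosen, the pattern matches the figure: a mild-looking contamination drains most of the test's power.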

The main problem with the mixed normal distribution is its heavy tails: there are relatively many outliers. One solution is to use a 20% trimmed mean, which means that the highest twenty percent of the values is discarded, as is the lowest twenty percent. Yuen's method uses this trimmed mean in combination with the Winsorized variance (Wilcox, 2011): the variance obtained when the highest twenty percent of the values is replaced by the highest remaining value and the lowest twenty percent by the lowest remaining value. Yuen's method performs better in terms of power, control of Type I errors, and the accuracy of confidence intervals.
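To make the trimming and Winsorizing concrete, here is a minimal Python sketch (the data are made up for illustration):

```python
import numpy as np

def trimmed_mean(x, prop=0.2):
    """Mean after discarding the lowest and highest `prop` of the sorted values."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(prop * len(x))
    return x[g:len(x) - g].mean()

def winsorized_variance(x, prop=0.2):
    """Variance after replacing the extreme values by the nearest retained values."""
    x = np.sort(np.asarray(x, dtype=float))
    g = int(prop * len(x))
    w = x.copy()
    w[:g] = x[g]                        # lowest values pulled up
    w[len(x) - g:] = x[len(x) - g - 1]  # highest values pulled down
    return w.var(ddof=1)

scores = [2, 3, 3, 4, 4, 5, 5, 6, 30]  # one extreme outlier
print(np.mean(scores))       # ordinary mean, pulled up by the outlier
print(trimmed_mean(scores))  # 20% trimmed mean, barely affected
print(winsorized_variance(scores))
```

Recent versions of SciPy also expose Yuen's trimmed t-test directly, as `scipy.stats.ttest_ind(a, b, trim=0.2)`.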

Another assumption that is often violated, with large consequences, is homoscedasticity: both groups having equal variances. Often Levene's test or an F-test is used to assess whether the assumption is violated, but both tests are themselves susceptible to violations of the normality assumption. It is therefore recommended to use a variant of the independent t-test that does not assume equal variances, such as Welch's test. Zimmerman (2004) even suggests that Welch's test should be used in all cases, including when distributions are homoscedastic. With Welch's test the probability of a Type I error is controlled better, and the power is therefore higher in most situations.
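In practice Welch's test is one argument away. A sketch with simulated heteroscedastic groups (the group sizes, variances, and effect size are my own illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 20)  # small group, small variance
b = rng.normal(0.8, 3.0, 60)  # large group, much larger variance

# Student's t assumes equal variances; Welch's variant (equal_var=False) does not.
student = stats.ttest_ind(a, b, equal_var=True)
welch = stats.ttest_ind(a, b, equal_var=False)
print(student.pvalue, welch.pvalue)
```

With unequal group sizes and variances the two tests give different p-values, and it is Welch's that keeps the Type I error rate near the nominal level.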



Wilcox, R. (2011). Modern Statistics for the Social and Behavioral Sciences. New York: CRC Press.

The Cost of Knowledge, a matter of life and death

Did you know that psychological research can be a matter of life and death? In the late 1970s, two researchers, Rodin and Langer, showed that elderly people in nursing homes lived longer and had a better quality of life when they were more involved in daily events, such as the choice of dinner or the type of leisure activity. This is a result of a well-known fundamental psychological process: the relationship between perceived control and reward. It is a process that is often studied with visual illusions like the dot illusion. However, this is not about illusions, but about the fact that even the most fundamental theoretical research can have a profound impact on real-world problems. There is only one crucial problem… access to this knowledge.

In our traditional publication system, people have to pay for access to published research. This means that professionals outside the scientific world often have no access to it, because they are not willing to pay the exorbitant prices. But scientists themselves often have no access to their peers' research, or even to their own, because their academic institution cannot afford to subscribe to all the scientific journals. This problem is known as the paywall or access barrier, and it seriously limits the value of our scientific research.

Although the problem had been known for several years and many researchers had drawn attention to it, it was not until 2012 that it received wide attention both within and outside the academic world. On January 21, 2012, the mathematician Timothy Gowers of Cambridge University published his blog post "Elsevier – my part in its downfall", in which he stated that he refused to have anything to do in the future with one of the biggest commercial publishers, Elsevier. According to him, and to the many researchers who supported his statement, Elsevier is an exemplar of everything that is wrong with the current publication system. The paywall and access barrier were the central points in his objections.

In response to his call, the petition "The Cost of Knowledge" was created, in which researchers declare that they will not publish in, referee for, or do editorial work for journals published by Elsevier. By now, almost 14,000 researchers have signed the petition, supporting the boycott of Elsevier. At the same time, the petition strengthened the Academic Spring, a movement of researchers, academics and scholars who oppose the traditional publication system and promote open access as an alternative model. Their ultimate goal is fairer access to published research. Only then can science fully contribute to our society and its problems, and… can it make the difference between life and death.

Evalyne Thauvoye


What Does the Psychologist Do?

Like most people, we don't just say what we did; we also add why we did it. For example: "I scream at you because I am very angry at you" (what is anger?) or "Because I'm very intelligent I was able to get a high score on that test" (what is intelligence?). We use unobserved entities to explain our behaviour. In psychology, too, we are concerned with the causes of human behaviour.

Psychologists measure all kinds of things that have to do with behaviour. But what exactly do they measure? To answer this question I will make a distinction between manifest variables and latent variables. A manifest variable is a variable that a psychologist can observe directly, like the response someone gives on a questionnaire. A latent variable is unobserved, which makes it very hard to measure, or, even worse, to prove that it exists. Examples of latent variables are depression, intelligence, motivation, power, and sadness.

Depression, intelligence, motivation, power, and sadness are all examples of what a psychologist tries to measure. You might think that measuring these things is not that hard. If you think about yourself you might say that you know very well if you are depressed, intelligent or motivated. You might even say, “Come here psychologist, I will tell you how depressed, intelligent and motivated I am”. But then the psychologist will answer that such information is not of any use because it is subjective. And if a psychologist is subjective he cannot work as a researcher at the university.

What the psychologist very often does is measure latent variables indirectly. To do so, (s)he needs manifest variables. Why? Because the psychologist thinks that people's responses on manifest variables are caused by a latent variable. For example: if someone responds to the statement "I don't sleep very well" on a seven-point Likert scale with a seven, meaning that the person barely sleeps, we believe that this response is caused by depression. If we have a collection of such statements, we believe we can say something "objective" about depression (a subjectively constructed latent variable, by the way). We do this by putting all the data into a computer so that the latent variable can be calculated. By calculating a number for a latent variable, it becomes kind of real.
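As a toy illustration of this indirect route (item wordings and numbers invented for the example): several Likert responses, each assumed to be caused by the same latent variable, are aggregated into a single score.

```python
# Hypothetical 7-point Likert responses to three depression-related statements.
responses = {
    "I don't sleep very well": 7,
    "I often feel sad": 6,
    "I have little appetite": 5,
}

# The simplest possible "formal model": a sum score that treats every item
# as an indicator of one underlying latent variable.
sum_score = sum(responses.values())
print(sum_score)  # 18 on a 3-21 scale
```

Real measurement models (factor models, item response models) refine this idea, but the logic is the same: a number for the latent variable is computed from the manifest responses.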

And how do we do that?

Psychologists have a collection of formal models at their disposal to measure latent variables. A formal model is just a lot of mathematics that, in itself, has nothing to do with psychology or behaviour.

But how can you measure motivation with something that has nothing to do with motivation? I don't measure length with a weighing scale, do I?

What a psychologist does is dress the formal model up with theories: theories about depression, motivation or intelligence. Then he checks whether the formal model, with all its equations, fits the clothes in which he tried to dress it. If it does, the psychologist can say that his theory is approximately true. We can use the same clothing analogy to show one of the shortcomings.

Clothing stores sell their clothes in a limited number of sizes. Often, a lot of different people fit the same t-shirt.


Boris Stapel

Recommended reading:

Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110, 203-219.

Borsboom, D. (2008). A tour guide to the latent realm. Measurement, 6, 134-146.

To reject the null hypothesis correctly or incorrectly: that is the question.

Comparing multiple variables or conditions increases the chance of finding false positives: rejecting the null hypothesis even though the null hypothesis is true (see Table 1). Here I briefly describe four approaches to the problem of multiple comparisons.


                               H0 is true           H0 is false
Reject H0                      Type I error         Correct outcome
                               (false positive)     (true positive)
Fail to reject H0              Correct outcome      Type II error
                               (true negative)      (false negative)

Table 1. The four possible outcomes of null hypothesis significance testing. The false positive (Type I error) is the main concern in multiple comparison testing.

  1. Family Wise Error Rate (FWER). The FWER (Hochberg & Tamhane, 1987) is the probability of making at least one Type I error among all the tested hypotheses. It can be used to control the number of false positives by adjusting α (the probability of a false positive). Normally, α is set at 0.05 or 0.01. FWER-based corrections derive a per-comparison α from the overall α and the number of comparisons made. The best-known FWER-based correction is the Bonferroni correction, where αcomparison = α / the number of comparisons.
  2. False Discovery Rate (FDR). The FDR (Benjamini & Hochberg, 1995) is the expected proportion of false positives among all rejected hypotheses. The general procedure is to order all p-values from small to large and compare each p-value to an FDR threshold; if a p-value is smaller than or equal to its threshold, it can be interpreted as significant. How the thresholds are calculated depends on the correction method used.
  3. The 'doing nothing' approach. Since all correction methods have their flaws, advocates of this approach hold that no corrections should be made (Perneger, 1998). Scientists should instead state clearly in their articles which comparisons they made, how they made them, and what the outcomes were.
  4. Bayesian approach. The Bayesian approach discards the frequentist framework, including null hypothesis significance testing. With it, the whole problem of multiple comparisons, and of Type I and II errors, does not arise. Rather than correcting for a perceived problem, Bayesian methods build 'the multiplicity into the model from the start' (Gelman, Hill, & Yajima, 2012, p. 190).
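The first two correction procedures can be sketched in a few lines of Python (the p-values below are invented for illustration):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """FWER control: test every p-value against alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control (step-up): find the largest i with p_(i) <= (i/m) * alpha
    and reject the i smallest p-values."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = np.asarray(pvals)[order]
    below = sorted_p <= (np.arange(1, m + 1) / m) * alpha
    k = int(np.max(np.nonzero(below)[0])) + 1 if below.any() else 0
    significant = np.zeros(m, dtype=bool)
    significant[order[:k]] = True
    return significant.tolist()

pvals = [0.001, 0.008, 0.015, 0.041, 0.20]
print(bonferroni(pvals))          # only p <= 0.05/5 = 0.01 survives
print(benjamini_hochberg(pvals))  # the less conservative FDR keeps one more
```

On these five p-values Bonferroni rejects two hypotheses while Benjamini-Hochberg rejects three, which illustrates why FDR control is popular when many comparisons are made.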

All four approaches to dealing with multiple comparisons have their advantages and disadvantages. Overall, I believe that the Bayesian approach is by far the most favourable option. Apart from dismissing the problem of multiple comparisons (among others), it gives researchers the opportunity to collect evidence in favour of either hypothesis, instead of making probabilistic statements about rejecting or not rejecting the null hypothesis. What stands in the way of applying the Bayesian approach is its theoretical difficulty (compared to the frequentist approach). With the increase in approachable books, articles, and workshops on the Bayesian approach, and the development of Bayesian scripts for statistical software, a revolution in how scientists practice science seems to be getting closer.



Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57, 289-300.

Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don't have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189-211.

Hochberg, Y., & Tamhane, A. C. (1987). Multiple Comparison Procedures. New York: Wiley.

Perneger, T. V. (1998). What's wrong with Bonferroni adjustments. British Medical Journal, 316, 1236-1238.


The problematic p-value: How large sample sizes lead to the significance of noise

The most reliable sample, the one with the highest power, is of course the complete population itself. How can it be that this sample rejects null hypotheses that we do not want to reject?

When the sample size increases, very small effect sizes can become significant. I ran an independent samples t-test on two simulated groups drawn from normal distributions that differ only minimally. This was repeated 1,000 times with a sample size of 200 persons per group and 1,000 times with a sample size of 20,000 persons per group (see Figure 1), and I analyzed what proportion of these t-tests was significant. Note that even if there is no effect, or the effect is too small to detect, you expect at least 5 percent of the t-tests to be significant, due to the significance level of .05 (Type I error). With a total sample size of 400 persons (200 per group) you find a significant effect in approximately 5.6% of the runs (repeating this several times gave values between .05 and .06). The proportion of times the null hypothesis is rejected increases with the sample size: with groups of 20,000, it rises to approximately 73.3%.
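The post does not give the simulation code or the true effect size used; the sketch below is my reconstruction, assuming an arbitrarily chosen tiny true mean difference of 0.025 standard deviations. It shows the same pattern: the rejection rate stays near α for small groups and climbs steeply for very large ones.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def rejection_rate(n_per_group, diff=0.025, reps=500, alpha=0.05):
    """Proportion of t-tests rejecting H0 for a tiny but real mean difference."""
    hits = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(diff, 1.0, n_per_group)
        hits += stats.ttest_ind(a, b).pvalue < alpha
    return hits / reps

print(rejection_rate(200))    # close to the nominal 5%
print(rejection_rate(20000))  # far above 5%: the same tiny effect is now 'significant'
```

Nothing about the effect changed between the two calls; only the sample size did.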


The Crud Factor is described by Meehl (1990) as the phenomenon that ultimately everything correlates to some extent with everything else. The phenomenon was supported by a large exploratory study he conducted with Lykken in 1966, in which 15 variables were cross-tabulated for a sample of 57,000 schoolchildren. All 105 cross-tabulations were significant, 101 of them with a probability of less than 10^-6.

Similar findings were described by Starbuck (as cited in Andriani, 2011, p. 457), who found that: “choosing two variables at random, a researcher has a 2-to-1 odds of finding a significant correlation on the first try, and 24-to-1 odds of finding a significant correlation within three tries.” Starbuck concludes from these findings (as cited in Andriani, 2011, p. 457): “The main inference I drew from these statistics was that the social sciences are drowning in statistically significant but meaningless noise.”

Taken literally, the null hypothesis is always false (as cited in Meehl, 1990, p. 205). When this phenomenon is combined with the fact that very large samples can make every small effect significant, one has to conclude that with the ideal sample (as large as possible) one would have to reject every null hypothesis.
This is problematic because it turns research into a tautology. Every experiment with a p-value > .05 becomes a Type II error: the sample was simply not big enough to detect the effect. A solution could be to attach more importance to effect sizes and make them decisive in whether a null hypothesis should be rejected. However, it is hard to change the interpretation of the p-value, since its use is deeply ingrained in our field. Altogether, I would suggest leaving the p-value behind us and switching over to a less problematic method.


Andriani, P. (2011). Complexity and innovation. The SAGE Handbook of Complexity and Management, 454-470.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

The over- and underutilization of ANCOVA

After completing several statistics courses I lived under the illusion that I knew the ins and outs of what I thought to be basic statistical analyses. During this course, however, I saw pitfalls in almost all of them and came to the realization that the application of statistical procedures is not as straightforward as I once thought. One of the most striking examples is the analysis of covariance (ANCOVA), a widely used procedure, seemingly a way to "control" for confounds. I was always impressed by this procedure, until I found out there is a lot more to it than just "controlling" for confounds.

The analysis of covariance (ANCOVA) was developed as an extension of the analysis of variance (ANOVA) to increase statistical power (Porter & Raudenbush, 1987). By including covariates, the variance associated with these covariates is "removed" from the dependent variable (Field, 2009). This way, from the manipulation point of view, the error variance in the dependent variable is reduced and hence the statistical power increases (see Figure 1). Given that psychological research is often underpowered (Cohen, 1990), ANCOVA is an important statistical procedure for revealing psychological phenomena and effects.

[Figure 1]

This promising application of ANCOVA, however, only holds when there is no systematic relationship between the grouping variable and the covariate; that is, the groups may not differ on the covariate. This is an assumption that many researchers today fail to check. As a result, ANCOVA is widely misunderstood and misused (Miller & Chapman, 2001).

The importance of this assumption is illustrated in Figure 2. When group and covariate are related, removing the variance associated with the covariate also alters the grouping variable. In other words, what remains of the grouping variable after removing the variance associated with the covariate has poor construct validity, and the results are therefore uninterpretable.

[Figure 2]

The general point is that the legitimacy of ANCOVA depends on the relationship between the grouping variable and the covariate. ANCOVA is justified only when there is no systematic relationship between these variables.

On the one hand, it is quite straightforward to defend this judgement in a randomized experiment; given random assignment, individual characteristics are equally distributed across the groups and thus, group means should not differ except by chance, see left panel of Figure 3. As a result, including a covariate in a randomized experiment increases the statistical power. In this sense, ANCOVA is underutilized. On the other hand, when studying pre-existing groups (i.e., non-random assignment), individual characteristics are not evenly distributed across groups and hence a relationship between group and covariate can exist. Thus, including a covariate in a non-randomized experiment might alter the grouping variable and result in unfounded interpretations and conclusions. In this sense, ANCOVA is overutilized.
It is worrisome that ANCOVA is applied more often in non-randomized experiments than in randomized experiments (Keselman et al., 1998). The idea is that researchers want to "control" for pre-existing differences. This idea is incorrect: there simply is no statistical way to "control" for these pre-existing differences. ANCOVA "removes" the variance that is associated with the covariate, but it does not "control" for the covariate.

[Figure 3]
We, as researchers, should acknowledge both the inabilities (ANCOVA cannot "control" for pre-existing differences) and the abilities (ANCOVA can increase statistical power) of ANCOVA. This way we should be able to eliminate unfounded conclusions that result from its misapplication. And, most importantly, we can exploit the strength of its application, increased statistical power, so that ANCOVA can help to reveal real psychological phenomena and effects.
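A minimal simulation (all effect sizes and variable names hypothetical) shows the legitimate use: in a randomized design, adding a covariate to a plain least-squares model shrinks the standard error of the group effect, which is exactly the power gain described above.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
group = np.repeat([0.0, 1.0], n)         # random assignment to two groups
covariate = rng.normal(0.0, 1.0, 2 * n)  # e.g. a baseline measure
# Outcome: small treatment effect + strong covariate effect + noise.
y = 0.4 * group + 1.5 * covariate + rng.normal(0.0, 1.0, 2 * n)

def group_se(X, y):
    """OLS fit; returns the standard error of the group coefficient (column 1)."""
    beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = rss[0] / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(np.sqrt(cov[1, 1]))

X_anova = np.column_stack([np.ones(2 * n), group])              # group only
X_ancova = np.column_stack([np.ones(2 * n), group, covariate])  # group + covariate
print(group_se(X_anova, y))   # large: covariate variance counts as error
print(group_se(X_ancova, y))  # smaller: same effect, more power
```

Because assignment is random here, the covariate is unrelated to group and the adjustment is legitimate; with pre-existing groups the same arithmetic would quietly redefine the grouping variable.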

For a good overview of this problem, consult Miller and Chapman (2001). The original paper introducing the problem (Lord, 1967) gives a good example of how the inclusion of a covariate can lead to incorrect conclusions.

Publication bias: how often do researchers check for this?

Researchers (and we, the "upcoming researchers") have been complaining for years about all the things that seem to be wrong with psychology, or science in general: researchers don't share their data, they use the wrong statistical analyses, they come up with their hypotheses after they have seen their data, and they massage their data until something "interesting" pops up. Whenever we, the upcoming researchers, discuss these problems, we always end up talking about publication bias: the tendency of journals and researchers to publish studies with significant results, resulting in file drawers full of non-significant (but at least as interesting) studies.

Because of publication bias, people read only a small portion of the articles on a specific effect and start to believe the effect is real, even though it might not actually exist. In the late 1990s, for instance, articles were published supporting the hypothesis that reboxetine was an effective medicine in the treatment of major depressive disorder. It was not until 2010, when a meta-analysis looked into the possible presence of publication bias, that researchers discovered that not only was the drug ineffective, it was potentially harmful! What had happened? Only 26% of the patient data had been published. The remaining 74% was not significant and was therefore not published, resulting in a terrible mistake: psychiatrists had been prescribing a potentially harmful pill to patients battling major depression. This example clearly shows that publication bias should not be taken lightly. Yet for years journals have failed to combat the problem.

But now, finally, things seem to be changing: journals such as Cortex have started working with preregistration, a system in which articles are chosen for publication based on the quality of their methods instead of their outcome and “interestingness”. While this is a wonderful development and will definitely help combat publication bias, it is not enough. In some fields publication bias may have been present for years and preventing it from occurring in future articles is not enough. Therefore it’s very important that researchers check for the possible presence of publication bias when conducting a meta-analysis. My question for the final assignment was: how often do researchers actually do this?

I checked this for 140 randomly drawn meta-analyses (twenty for every two years, from 2000-2013). What I found was that in only 37.14% of the articles researchers checked for the presence of publication bias. Perhaps even more shocking was that in the 88 articles in which no check was conducted only 6 (6.82%) articles mentioned why the authors did not do this (i.e. “because we added unpublished studies in our analyses, publication bias cannot be present” or “we wanted to check for publication bias with a funnel plot, but this was not possible due to a small sample of studies”).

Whether or not these reasons are correct, the main issue is that apparently a lot of researchers either do not know that publication bias is a serious problem or simply fail to see it as one. Either way, researchers and upcoming researchers need to be taught, or reminded, about the problems with publication bias and how they can check for it in the future. What I also think would help is if journals demanded these checks for any meta-analysis considered for publication.

My question to you is: How do you think we, researchers and upcoming researchers, should combat publication bias? Is there a possibility for a science in which publication bias is not an issue anymore?

Innocent… until someone gets suspicious.

Scientists ought to be critical and sceptical of each other's work. But how far can we take this norm? I analysed four publications of a UvA professor to determine whether there were signs of data fabrication or falsification.

The professor who was the topic of my final assignment is called Mr. 132, a pseudonym inspired by his row number in the data file when I performed a replication of Bakker and Wicherts' (2011) study in 2012. In this replication study, Mr. 132 stood out (not in a good way) because he misreported a large number of p-values, and in particular made a lot of gross errors (errors that change the significance of the result). The large number of misreported p-values made me suspicious, and when this assignment was announced, I knew what my topic was going to be.

It turned out that Mr. 132 still misreports a large number of p-values, as this barplot demonstrates. In the figure, his percentages of errors and gross errors are plotted against the percentages found by Bakker and Wicherts (2011). If I had not been suspicious already, I would be after these results.

Luckily, I did not need to make a trip to the head of my faculty to accuse a professor of data fabrication or falsification. Using the simulation method proposed by Simonsohn (2013), I simulated 69 pairs of standard deviations, each 100,000 times (a more detailed description of this method can be found in Simonsohn's paper or in the paper on my profile on the Open Science Framework: https://openscienceframework.org/profile/fAHMC ). Of these simulations, four pairs turned out to be significant (α = .05). Three of these significant probabilities were due to standard deviations that were exactly equal; the simulation method does not work in that case. The fourth turned out to be .042, which to me is a borderline case. Only 4 out of 69 simulations were significant, which does not convince me that Mr. 132 fabricated or falsified data.

In sum, the odds are in Mr. 132's favour. Based on these results, I think it is highly unlikely that he fabricated or falsified data, even though he misreported a large number of his p-values. I also do not think further investigation is needed to prove that he is in fact innocent (or guilty, since I cannot rule that out with absolute certainty). So in this case, Mr. 132 is innocent… until someone gets suspicious again.



Bakker, M., & Wicherts, J. M. (2011). The (mis) reporting of statistical results in psychology journals. Behavior Research Methods, 43, 666–678.

Simonsohn, U. (2013). Just post it: The lesson from two cases of fabricated data detected by statistics alone. Psychological Science, 20, 1–14.

How to implement open science?

I think that many researchers and students would agree that we need to change science. We simply do not know what researchers do with their data, as we have no access to them. This freedom gives researchers the chance to, either consciously or unconsciously, engage in questionable research practices. Science is in need of improvement, and openness is a central theme within this revolution. But just agreeing on this will not help us any further; we need a practical change. So the key question is how we are going to turn the scientific field into a more open field.

One way is to implement the Open Science Framework, an initiative by Brian Nosek that encourages scientists to share their research projects online. The Open Science Framework enables researchers to create a project, upload files, and share them whenever they think they are ready to do so. This way, journal reviewers (or others) can check the researcher's work at any time. Moreover, it shows that the researcher in question has nothing to hide.

In my final assignment for the Good Science, Bad Science course, I propose an implementation of this framework in undergraduate education. I created a lesson plan that explores different features of the framework and that also involves several analytical and writing skills. The students will create an account on the Open Science Framework, review a project of a researcher, make a project themselves, check projects of other students, and write an essay on the positives and negatives of the Open Science Framework and its potential within the scientific field. By working on their analytical and writing skills within the Open Science Framework, they will hopefully internalize the framework and use it in their future (scientific) careers. Teaching future researchers to use the OSF may change the scientific world from the inside out.
We do not have to oblige researchers to practice open science; we can simply introduce the young ones to proper working methods that they will hopefully keep using later in their lives. Changing science will be a matter of time: teaching students how to use the Open Science Framework will not cause an immediate change, but it might cause a more thorough change in the long term than just telling researchers to change their working habits right now.

We are students in an exciting time. Science is on the move. And perhaps in ten years time, science will have become its true self again.