The problematic p-value: How large sample sizes lead to the significance of noise

The most reliable sample with the highest power is of course the complete population itself. How is it possible that this sample can reject null hypotheses that we do not want to reject?

As the sample size increases, very small effects can become statistically significant. I ran an independent-samples t-test on two simulated groups, each normally distributed on the variable of interest, with only a very small true difference between the groups. This was repeated 1000 times with 200 persons per group and 1000 times with 20,000 persons per group (see figure 1). I then looked at the proportion of these t-tests that were significant. Note that even if there is no effect, or the effect is too small to detect, you expect about 5 percent of the t-tests to be significant because of the significance level of .05 (type I error). With a total sample of 400 persons (200 per group), a significant effect is found approximately 5.6% of the time (running the simulation several times gave values between 5% and 6%). The proportion of rejected null hypotheses increases with a bigger sample size: with groups of 20,000 it rises to approximately 73.3%.
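A minimal sketch of this kind of simulation, in Python with only the standard library. The post does not state the size of the simulated difference, so the `effect` argument (a standardized mean difference) is an assumption, as are the function names, the seed, and the use of a normal approximation to the t distribution for the p-value (reasonable here because each group has hundreds of cases).

```python
import math
import random
from statistics import NormalDist


def two_sample_p_value(a, b):
    """Two-sided p-value for an equal-variance two-sample t statistic.

    Assumption: the t distribution is approximated by the standard normal,
    which is close when each group has a few hundred cases or more.
    """
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean_a - mean_b) / se
    return 2 * (1 - NormalDist().cdf(abs(t)))


def rejection_rate(n_per_group, effect, reps=1000, alpha=0.05, seed=1):
    """Share of simulated experiments in which p < alpha.

    `effect` is a hypothetical standardized mean difference between the
    two groups; it is not a value taken from the original simulation.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        if two_sample_p_value(control, treated) < alpha:
            rejections += 1
    return rejections / reps
```

Calling `rejection_rate(200, 0.025)` and `rejection_rate(20000, 0.025)`, with a hypothetical effect of 0.025 standard deviations, should give rejection rates of roughly the same order as those reported above, although the second call is slow in pure Python. Setting `effect=0.0` should bring both back to about 5%.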


The Crud Factor is described by Meehl (1990) as the phenomenon that ultimately everything correlates to some extent with everything else. This phenomenon was supported by a large exploratory study he conducted with Lykken in 1966, in which 15 variables were cross-tabulated in a sample of 57,000 school children. All 105 cross-tabulations were significant, and 101 of them were significant at p < 10^-6.

Similar findings were described by Starbuck (as cited in Andriani, 2011, p. 457), who found that: “choosing two variables at random, a researcher has a 2-to-1 odds of finding a significant correlation on the first try, and 24-to-1 odds of finding a significant correlation within three tries.” Starbuck concludes from these findings (as cited in Andriani, 2011, p. 457): “The main inference I drew from these statistics was that the social sciences are drowning in statistically significant but meaningless noise.”

Taken literally, the null hypothesis is always false (as cited in Meehl, 1990, p. 205). When this phenomenon is combined with the fact that very large samples can make every small effect significant, one has to conclude that with the ideal sample (as large as possible) one has to reject every null hypothesis.
This is problematic because it turns research into a tautology. Every experiment with a p-value > .05 becomes a type II error, since the sample was simply not big enough to detect the effect. A solution could be to attach more importance to effect sizes and make them decisive in whether a null hypothesis should be rejected. However, it is hard to change the interpretation of the p-value, since its use is deeply ingrained in our field. Altogether, I would suggest leaving the p-value behind us and switching to a less problematic method.
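The claim that a large enough sample makes any nonzero effect significant can be made concrete with a standard power approximation. The sketch below assumes two equally sized groups, a two-sided test, and a normal approximation to the test statistic. The function names are hypothetical, and `effect` is a standardized mean difference (Cohen's d) that must be nonzero.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(n_per_group, effect, alpha=0.05):
    """Approximate power of a two-sided, two-sample test.

    Under the normal approximation, the test statistic is centered at
    effect * sqrt(n_per_group / 2) with unit variance.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(effect) * math.sqrt(n_per_group / 2)
    # Probability of landing in either rejection tail.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_for_power(effect, power=0.80, alpha=0.05):
    """Approximate n per group needed to reach the target power.

    Ignores the negligible rejection probability in the opposite tail.
    Raises ZeroDivisionError for effect == 0: no sample is large enough.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_crit + z_power) / abs(effect)) ** 2)
```

Under these assumptions, an effect of d = 0.5 needs about 63 persons per group for 80% power, whereas d = 0.01 needs roughly 157,000 per group: large, but finite, which is the point of the argument above.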


Andriani, P. (2011). Complexity and innovation. The SAGE Handbook of Complexity and Management, 454-470.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

2 thoughts on “The problematic p-value: How large sample sizes lead to the significance of noise”

  1. What you have shown is simply that your random number generator is extremely bad. Your conclusion that t-tests would reject a true H0 with sample sizes of 20,000 does not make any sense. Try a better random normal generator and you will certainly find that the type I error rate is only approximately 5% (i.e., the actual level of significance will be very close to the nominal one). Regards.

  2. Hi Ronald, this is not the point I am trying to make. H0 is not “true” in my simulation; there is an effect, just a very small one. So I do not argue that a significant p-value is wrong; rather, I hope to question whether such small effects are the effects we are looking for. After all, if everything correlates with everything to some extent (although maybe only very weakly), then one will always find these very small effects (p-values will become significant) as long as one takes a very large sample. I therefore ask whether this ‘noise’ is meaningful. So when the effect is very small, a significant p-value is not “wrong”, but one should be careful about interpreting such p-values without looking at the effect size.

