To reject the null hypothesis correctly or incorrectly: that is the question.

Comparing multiple variables or conditions increases the chance of finding false positives: rejecting the null hypothesis even though the null hypothesis is true (see Table 1). Here I briefly describe four approaches to dealing with the problem of multiple comparisons.

 

|  | Null hypothesis (H0) is true | Null hypothesis (H0) is false |
| --- | --- | --- |
| Reject null hypothesis | Type I error, false positive (α) | Correct outcome, true positive (1-β) |
| Fail to reject null hypothesis | Correct outcome, true negative (1-α) | Type II error, false negative (β) |

Table 1. The four possible outcomes of null hypothesis significance testing. The Type I error (false positive) is the main concern in multiple comparison testing.
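
To make the inflation of false positives concrete, here is a minimal Python sketch. It assumes that the tests are independent, that every null hypothesis is true, and that each test uses the same α; the function name `familywise_error_rate` is my own and does not come from any of the cited works. Under those assumptions the chance of at least one false positive across m tests is 1 - (1 - α)^m.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests,
    assuming every null hypothesis is true and each test uses level alpha."""
    return 1.0 - (1.0 - alpha) ** m


# Illustrative values only: how quickly the chance grows with more comparisons.
for m in (1, 5, 10, 20):
    print(m, round(familywise_error_rate(0.05, m), 3))
```

With α = 0.05, ten independent comparisons already give roughly a 40% chance of at least one false positive. Real comparisons are often correlated, in which case this formula is only a rough guide.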

  1. Family Wise Error Rate (FWER). The FWER (Hochberg & Tamhane, 1987) is the probability of making at least one Type I error among all the tested hypotheses. It can be used to control the number of false positives by adjusting α (the chance of a false positive). Normally, α is set to 0.05 or 0.01. FWER-based corrections take this overall α and derive an α for each comparison from the number of comparisons made. The best-known FWER-based correction is the Bonferroni correction, where αcomparison = α / (number of comparisons).
  2. False Discovery Rate (FDR). The FDR (Benjamini & Hochberg, 1995) is the expected proportion of false positives among all positives: the number of false positives (category α in Table 1) divided by the total number of rejected hypotheses (the sum of all hypotheses falling in the categories α and 1-β in Table 1). The general procedure is to order all p-values from small to large and compare each p-value to an FDR threshold. If the p-value is smaller than or equal to this threshold, it can be interpreted as significant. How the threshold is calculated depends on the correction method used.
  3. ‘Doing nothing’ approach. Since all correction methods have their flaws, advocates of this approach hold that no corrections should be made (Perneger, 1998). Scientists should instead state clearly in their article which comparisons they made, how they made them, and what the outcomes were.
  4. Bayesian approach. The Bayesian approach discards the frequentist approach, including null hypothesis significance testing. As a result, the whole problem of multiple comparisons, along with Type I and Type II errors, no longer exists. Rather than correcting for a perceived problem, Bayesian methods build ‘the multiplicity into the model from the start’ (p.190, Gelman, Hill, & Yajima, 2012).
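
The first two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function names and the example p-values are made up, and for the FDR I use the step-up procedure described by Benjamini & Hochberg (1995), which is only one of several ways to calculate the threshold.

```python
def bonferroni(p_values, alpha=0.05):
    """FWER control: compare every p-value to alpha / number of comparisons."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """FDR control with the Benjamini-Hochberg step-up procedure.

    Order the p-values from small to large, find the largest rank k for which
    p_(k) <= (k / m) * q, and call the k smallest p-values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            significant[idx] = True
    return significant


# Hypothetical p-values from eight comparisons (not real data).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print("Bonferroni:", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

For these made-up values, Bonferroni keeps only the smallest p-value (0.001), while the Benjamini-Hochberg procedure also keeps 0.008. This reflects the usual trade-off: controlling the FDR is less strict than controlling the FWER, so it tends to retain more findings at the cost of tolerating some false positives among them.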

All four approaches to dealing with multiple comparisons have their advantages and disadvantages. Overall, I believe that the Bayesian approach is by far the most favourable option. Apart from dismissing the problem of multiple comparisons (and others), it gives researchers the opportunity to collect evidence in favour of either hypothesis, instead of making probabilistic statements about rejecting or not rejecting the null hypothesis. What stands in the way of applying the Bayesian approach is its theoretical difficulty (compared with the frequentist approach). With the increase in approachable books, articles, and workshops about the Bayesian approach, and the development of Bayesian scripts for statistical software, a revolution in how scientists practice science seems to be getting closer.

 

References

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289-300.

Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189-211.

Hochberg, Y., & Tamhane, A. C. (1987). Multiple Comparison Procedures. New York: Wiley.

Perneger, T. V. (1998). What’s wrong with Bonferroni adjustments. British Medical Journal, 316, 1236-1238.

 

Bringing back science!

For science to be of any value, several codes of conduct have been written which scientists must (well, should) obey. Though there are slight differences, these codes of conduct all embrace the same values: a scientist should be scrupulous, reliable, impartial, and independent, and all of his or her work should be verifiable. To violate these codes implies disrespect for the practice of science. Among the worst forms of scientific misconduct are fabrication (making up data), falsification (changing collected data), and plagiarism.

Estimating how often the codes of conduct are violated is not an easy task. For one, people who have deviated from these codes might not be willing to be honest about their deeds. For another, while most scientists agree that practices such as plagiarism and fabrication are forms of scientific misconduct, there is a large grey area: a practice that is completely in line with the codes of conduct in one context can be seen as a questionable research practice in another. Nevertheless, attempts to estimate the prevalence of scientific misconduct have been made. In a meta-analysis of survey data, Fanelli (2009) found that 1.97% of the participating scientists admitted to having fabricated or falsified data, and that up to 33.7% admitted to other questionable research practices.

 

Even though scientific misconduct does happen, the number of scientists who are reported is small. Scientists are most commonly caught either by a whistleblower or through statistical detection. There are, however, many reasons why few people decide to blow the whistle, including that scientific misconduct is a problem of a non-academic nature, that the whistleblower must invest time and run a professional risk, and that knowledge about detection is lacking. As for statistical detection, several scientists risk their time and career by looking for scientific misconduct (for example Simonsohn, 2013). However, while many parties[1] are involved in science and in publishing findings, no party takes the responsibility to actively check for and deal with questionable research practices.

Despite all of this, there are a few solutions that can help us increase good scientific practice. Clear guidelines and codes of conduct should be written and actively distributed. (Raw) data and instructions should also be shared, as well as lab books and the documentation of experiments. Additionally, it would help immensely if journals were more open to post-publication reviews and comments. Too often we seem to forget what values are needed to practice science; it is time to take responsibility and bring them back.


[1] These parties include publishers and editors of a journal, co-authors and colleagues, peers, peer reviewers, the Royal Academy, research institutions, research funds, and professional organizations.