Correction for the multiple testing problem

“One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon measured approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.” Although the salmon was dead, several brain areas appeared to be processing the emotion displayed by a person in a photograph (Bennett, Baird, Miller & Wolford, 2011).
This false result was obtained by testing so many voxels that false positives emerged. With each added test the probability of a type 1 error increases (Bender & Lange, 2001). A simulation study showed that when an image with two truly active areas is simulated 1,000 times, every voxel in the voxel space is deemed active at least once (Logan & Rowe, 2004).
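The scale of the problem is easy to reproduce. Below is a minimal sketch using plain Gaussian noise and a per-voxel t-test; this is illustrative only, not real fMRI data and not the simulation design of Logan and Rowe:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 100,000 "voxels" of pure noise: there is no real signal anywhere,
# so every voxel that comes out significant is a false positive.
n_voxels, n_scans = 100_000, 20
noise = rng.normal(size=(n_voxels, n_scans))

# One-sample t-test per voxel against a mean of zero, uncorrected alpha = 0.05.
t_vals, p_vals = stats.ttest_1samp(noise, popmean=0.0, axis=1)
false_positives = int((p_vals < 0.05).sum())

# At alpha = 0.05 roughly 5% of the null voxels are expected to come out
# "active" purely by chance: about 5,000 false positives.
print(false_positives)
```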
There are methods for correcting for false positives in multiple testing, but this is not always done in fMRI research. Between 24% and 40% of the articles published in 2008 did not correct for multiple testing (Bennett, Baird, Miller & Wolford, 2011, supplementary material).
Two frequently used procedures for correcting for the multiple testing problem are family-wise error (FWE) correction and false discovery rate (FDR) correction. All procedures have to balance protection against false positives (type 1 errors) and false negatives (type 2 errors). Any method that protects more against one type of error is guaranteed to increase the rate of the other (Lieberman & Cunningham, 2009).
The family-wise error rate (FWER) is the probability of making one or more type 1 errors in a family of comparisons. For a family-wise error rate of 5% there is 95% confidence that the data contain no type 1 errors. The simplest FWE correction is the original Bonferroni correction, which divides the alpha level, the probability of a type 1 error (conventionally 5%), by the number of tests (Dunn, 1961). For example, when 100,000 voxels are tested at an FWE rate of 0.05, the per-voxel threshold becomes 0.05/100,000 = 0.0000005. Since its introduction the Bonferroni procedure has been improved upon (Nichols & Hayasaka, 2003).
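The Bonferroni arithmetic from the example above, as a one-function sketch:

```python
# Minimal sketch of the Bonferroni correction described above: the per-test
# significance threshold is the family-wise alpha divided by the number of tests.
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    return alpha / n_tests

# 100,000 voxels at a family-wise error rate of 0.05 gives a per-voxel
# threshold of about 0.0000005.
print(bonferroni_threshold(0.05, 100_000))
```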
Another FWE-correcting procedure is based on random field theory. The reasoning behind random field theory is that since the p-values of neighbouring voxels are (locally) dependent, that dependency can be exploited to correct for multiple testing. Random field theory does not test individual voxels but individual observations, ‘active brain clusters’, which can reduce the number of tests considerably (Brett, Penny & Kiebel, 2003).
Another correcting procedure is FDR correction. Instead of correcting over the whole family, i.e. all tested voxels, the method only corrects among the voxels declared active. The false discovery rate (FDR) is the expected ratio of the number of erroneously rejected null hypotheses to the total number of rejected null hypotheses. The method guarantees that, among the voxels declared active, the expected proportion of false positives is at a specified level (e.g. 5%). The FDR method is therefore flexible, meaning it can change with the number of tests.
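The text above does not name a specific FDR procedure; a common choice is the Benjamini-Hochberg step-up method, sketched minimally here: sort the p-values, find the largest rank k with p(k) ≤ (k/m)·q, and reject the k smallest.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Which sorted p-values fall under the BH line (k/m) * q?
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest qualifying rank (0-based)
        reject[order[: k + 1]] = True
    return reject

# The cut-off adapts to the data, which is the "flexibility" mentioned above:
# with many small p-values, more tests survive than under a fixed Bonferroni cut.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
print(benjamini_hochberg(p_vals))
```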
A comparison between the FWE and FDR correcting procedures showed that FDR maintained higher power in the active brain regions, meaning fewer type 2 errors, but at the cost of more falsely detected voxels (Logan & Rowe, 2004). Verhoeven, Simonsen and McIntyre (2005) found that FWE control is preferred only when the penalty for making a type 1 error is severe; FDR control is more powerful and often more relevant than controlling the FWER.
These two procedures are not the only ones for correcting for multiple testing; a promising new approach combines spatial information with Bayesian testing methods (Bowman, Caffo, Bassett & Kilts, 2008).

Bender, R., & Lange, S. (2001). Adjusting for multiple testing: when and how? Journal of Clinical Epidemiology, 54, 343-349.
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2011). Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for proper multiple comparisons correction. Journal of Serendipitous and Unexpected Results, 1(1), 1-5.
Bowman, D., Caffo, B., Bassett, S. S., & Kilts, C. (2008). A Bayesian hierarchical framework for spatial modeling of fMRI data. NeuroImage, 39, 146-156.
Brett, M., Penny, W., & Kiebel, S. (2003). An introduction to random field theory. In Frackowiak, R. S. J., Friston, K. J., Frith, C., Dolan, R., Price, C. J., Zeki, S., Ashburner, J., & Penny, W. D. (Eds.), Human Brain Function, 2nd edition. Academic Press.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56, 52-64.
Lieberman, M. D., & Cunningham, W. A. (2009). Type I and Type II error concerns in fMRI research: Rebalancing the scale. Social Cognitive and Affective Neuroscience, 4, 423-428.
Logan, B. R., & Rowe, D. B. (2004). An evaluation of thresholding techniques in fMRI analysis. NeuroImage, 22, 95-108.
Nichols, T., & Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: a comparative review. Statistical Methods in Medical Research, 12(5), 419-446.
Verhoeven, K. J. F., Simonsen, K., & McIntyre, L. M. (2005). Implementing false discovery rate control: increasing your power. Oikos, 108, 643-647.


Measurement and permissible statistics in psychological science

In his influential paper on measurement theory, Stevens (1946) argued that different statistical operators (e.g. the mean), and therefore also the statistical tests that make use of these operators, are only permissible on certain measurement scales. The appropriateness of a statistical operator on a scale depends on whether its results are invariant under the scale's permissible transformations. A transformation applies a formula to the data, for example the simple multiplication x' = x * 4. If a statistical operator is not invariant, then conclusions drawn from statistical tests that use it will differ depending on how the results were measured. As Scholten and Borsboom (2009) explain: “For instance, it is possible that when scores on the aforementioned mathematical proficiency test are analyzed for sex differences with a t test, different results are obtained for the original and transformed scores. Boys may significantly outperform girls when analyzing the original scores, while boys and girls may not differ significantly in their performance when analyzing the transformed scores (or vice versa; see Hand (2004), for some interesting examples). Since there is no sense in which the original scores are preferable or superior to the transformed, squared scores, this means that research findings and conclusions depend on arbitrary, and usually implicit, scaling decisions on part of the researcher.”
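Scholten and Borsboom's point can be reproduced with made-up data: squaring is a monotone transformation on positive scores, so it preserves every rank order, yet the t-test's p-value changes. The groups and numbers below are entirely hypothetical, chosen only to illustrate the mechanism:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical "test scores" for two groups (all values safely positive,
# so squaring is order-preserving on these data).
boys = rng.normal(loc=30, scale=5, size=25)
girls = rng.normal(loc=27, scale=5, size=25)

# t-test on the original scores ...
p_original = stats.ttest_ind(boys, girls).pvalue

# ... and on the squared scores: same rank order, different p-value.
p_squared = stats.ttest_ind(boys**2, girls**2).pvalue

print(p_original, p_squared)
```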

Stevens summarized four measurement scales with their permissible transformations and statistical operators. All transformations and statistics permissible at a given scale also apply at every scale higher than the one at which they are introduced.

The first scale is the nominal scale. At this scale numbers are used to classify data into mutually exclusive, exhaustive categories on which no order can be imposed. Nominal assignments can be made at the individual level (football players' jersey numbers) or at the group level (a person's religion). Permissible transformations are any one-to-one or many-to-one transformations, although a many-to-one transformation loses information. Permissible statistics are the number of cases, the mode and association statistics.

The second scale is the ordinal scale. Ordinal measurements describe order, but not relative size or degree of difference between the items measured. For example, tennis scores (15, 30, 40) are rank ordered but cannot meaningfully be subtracted: the apparent differences 30 − 15 = 15 and 40 − 30 = 10 are meaningless. According to Stevens, most psychological measurements are made on an ordinal scale; examples are measurements of intelligence and of personality traits. Permissible transformations are any monotone increasing transformations, although a transformation that is not strictly increasing loses information. Permissible statistics are the median and percentiles. Note that the mean is not a permissible statistic at this scale.
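A toy illustration (with hypothetical scores) of why the mean is not permissible on an ordinal scale while the median is: the square root is a monotone increasing transformation, so it preserves order and medians, but it can flip which group has the higher mean.

```python
import numpy as np

# Hypothetical ordinal scores for two groups.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([3.0, 3.0, 4.0, 4.0, 4.0])

# On the raw scale group x has the higher mean (4.0 vs 3.6) ...
print(x.mean() > y.mean())

# ... but after a monotone square-root transformation the verdict flips
# (about 1.86 vs 1.89), so a mean-based conclusion is scale-dependent.
print(np.sqrt(x).mean() > np.sqrt(y).mean())

# The median verdict is the same on both scales: y is higher on both.
print(np.median(x) < np.median(y))
print(np.median(np.sqrt(x)) < np.median(np.sqrt(y)))
```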

The third scale is the interval scale. Data points on the interval scale are ordered, and equal intervals represent equal differences everywhere on the scale: for example, 10˚ − 5˚ = 5˚ just as 35˚ − 30˚ = 5˚. However, the zero point of the scale is arbitrary, so ratios cannot be calculated. Permissible transformations are any affine transformations t(m) = c * m + d, where c and d are constants (the general linear group, x' = ax + b in Stevens). Permissible statistics are the mean, standard deviation, rank-order correlation and product-moment correlation.
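A small sketch of interval-scale invariance, using the standard Celsius-to-Fahrenheit conversion as the affine transformation (the conversion constants are standard facts, not taken from the text above):

```python
import numpy as np

# Celsius to Fahrenheit is an affine transformation t(m) = c*m + d
# with c = 9/5 and d = 32.
celsius = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
fahrenheit = 9 / 5 * celsius + 32

# The mean transforms exactly the same way as the individual values ...
print(fahrenheit.mean(), 9 / 5 * celsius.mean() + 32)

# ... and the product-moment correlation with any other variable is
# unchanged, which is why both are permissible on the interval scale.
other = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
r_c = np.corrcoef(celsius, other)[0, 1]
r_f = np.corrcoef(fahrenheit, other)[0, 1]
print(r_c, r_f)

# But ratios are NOT invariant: 20C / 10C = 2.0, while 68F / 50F = 1.36.
print(celsius[2] / celsius[0], fahrenheit[2] / fahrenheit[0])
```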

The fourth scale is the ratio scale. This scale is similar to the interval scale except that it has an absolute zero point. Examples of measurements on the ratio scale are temperature in kelvin, monthly salary and weight (temperature in degrees Celsius, with its arbitrary zero, is only on an interval scale). Permissible transformations are any similarity transformations t(m) = c * m, where c is a constant, for example converting kilograms to pounds. The (new) permissible statistic is the coefficient of variation.


Lord (1953) wrote a satirical comment on Stevens's conclusions. Lord tells the story of a professor who distributed jersey numbers to his students. He often administered tests to his students, and in secret he compared the means and standard deviations of the test results of students with different jersey numbers. He taught his students very carefully: “Test scores are ordinal numbers, not cardinal numbers. Ordinal numbers cannot be added.” He knew very well that his comparisons were incorrect according to the latest theories of measurement.
After a while the freshmen accused the professor of having distributed low numbers to them. The professor consulted a statistician, who simply calculated that, had the numbers been distributed randomly, the probability of the freshmen receiving such a low average number was very small.
When the professor argued that the statistician couldn't use multiplication on measurements taken on a nominal scale, the statistician reacted: “If you doubt my conclusions… I suggest you try and see how often you can get a sample of 1,600 numbers from your machine with a mean below 50.3 or above 58.3.” So the professor started drawing samples and found that it is indeed very unlikely to obtain a mean below 50.3 or above 58.3.
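The statistician's challenge can be run as a simulation. The population below is an assumption for illustration only, since the story does not fully specify the machine: jersey numbers drawn uniformly from 29 to 79, so that the population mean (54) sits near the midpoint of the band 50.3–58.3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw 10,000 samples of 1,600 jersey numbers each from the assumed
# uniform population and record the sample means.
n_samples = 10_000
means = rng.integers(29, 80, size=(n_samples, 1600)).mean(axis=1)

# Fraction of samples whose mean falls below 50.3 or above 58.3.
# With n = 1,600 the sample mean barely moves (standard error ~0.37),
# so this fraction is essentially zero.
extreme = float(np.mean((means < 50.3) | (means > 58.3)))
print(extreme)
```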

So Lord argues that statistical methods can be used regardless of the scale of measurement: “The numbers do not know where they came from” (p. 751). However, in his paper Lord makes inferences about the measurements instead of inferences about the attributes. Or, as Scholten and Borsboom (2009) argue: “It is argued that the football numbers do not represent just the nominal property of non-identity of the players; they also represent the amount of bias in the machine. It is a question about this property – not a property that relates to the identity of the football players – that the statistical test is concerned with.” Scholten and Borsboom show that once the bias of the machine is the property being assessed, the data are actually on an interval scale, and that Lord's article therefore actually supports Stevens's view.

To give some idea of the problem in psychological science that most measurements are made on an ordinal instead of an interval scale, I will quote a text found on the web:
“Suppose we are doing a two-sample t-test; we are sure that the assumptions of ordinal measurement are satisfied, but we are not sure whether an equal-interval assumption is justified. A smooth monotone transformation of the entire data set will generally have little effect on the p value of the t-test. A robust variant of a t-test will likely be affected even less (and, of course, a rank version of a t-test will be affected not at all). It should come as no surprise then that a decision between an ordinal or an interval level of measurement is of no great importance in such a situation, but anyone with lingering doubts on the matter may consult the simulations in Baker, Hardyck, and Petrinovich (1966) for a demonstration of the obvious.
On the other hand, suppose we were comparing the variability instead of the location of the two samples. The F test for equality of variances is not robust, and smooth monotone transformations of the data could have a large effect on the p value. Even a more robust test could be highly sensitive to smooth monotone transformations if the samples differed in location.
Measurement level is of greatest importance in situations where the meaning of the null hypothesis depends on measurement assumptions. Suppose the data are 1-to-5 ratings obtained from two groups of people, say males and females, regarding how often the subjects have sex: frequently, sometimes, rarely, etc. Suppose that these two groups interpret the term ‘frequently’ differently as applied to sex; perhaps males consider ‘frequently’ to mean twice a day, while females consider it to mean once a week. Females may report having sex more ‘frequently’ than men on the 1-to-5 scale, even if men in fact have sex more frequently as measured by sexual acts per unit of time. Hence measurement considerations are crucial to the interpretation of the results.”
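The contrast the quoted text draws, rank tests completely unaffected by monotone transformations while variance comparisons are highly sensitive to them when the samples differ in location, can be sketched with hypothetical data. Here `scipy.stats.mannwhitneyu` stands in for "a rank version of a t-test":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two hypothetical samples that differ in location but not in spread.
a = rng.normal(loc=10, scale=1, size=50)
b = rng.normal(loc=20, scale=1, size=50)

# A rank test only uses the order of the values, and a strictly increasing
# transformation (here: log) leaves every rank unchanged, so the p-value
# is affected "not at all".
p_rank_raw = stats.mannwhitneyu(a, b).pvalue
p_rank_log = stats.mannwhitneyu(np.log(a), np.log(b)).pvalue
print(p_rank_raw == p_rank_log)

# The variances, by contrast, come apart under the log transform, because
# it compresses the larger values more when the samples differ in location.
print(a.var(ddof=1), b.var(ddof=1))                    # similar
print(np.log(a).var(ddof=1), np.log(b).var(ddof=1))    # no longer similar
```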

To conclude: always be aware of which attribute you measure, on what scale you likely measure that attribute, which statistics you can use on that scale, and in what way you can relate the results of the analysis back to the attribute.

Lecture by Angélique Cramer

Baker, B. O., Hardyck, C., & Petrinovich, L. F. (1966). Weak measurement vs. strong statistics: An empirical critique of S. S. Stevens' proscriptions on statistics. Educational and Psychological Measurement, 26, 291-309.
Lord, F. M. (1953). On the statistical treatment of football numbers. American Psychologist, 8, 750-751.
Scholten, A. Z., & Borsboom, D. (2009). A reanalysis of Lord's statistical treatment of football numbers. Journal of Mathematical Psychology, 53, 69-75.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

Checking Assumptions

We presented the article ‘Are assumptions of well-known statistical techniques checked, and why (not)?’ (Hoekstra, Kiers & Johnson, 2012). The two important assumptions we discussed in our presentation are the assumption of normality and the assumption of homogeneity of variances. The top figure shows an example of two groups with homogeneous variances and, just beneath it, two distributions with different variances. The bottom figure (red) shows two distributions that are not normally distributed. The t-test, ANOVA and regression analysis are all fairly robust to violations of these assumptions when sample sizes are equal across groups. However, if the sample sizes are unequal or the violations of the assumptions are extreme, then the results of these tests can be nonsensical.

Since the conclusions drawn from the test results can be nonsensical, the authors of the presented article wanted to investigate whether researchers check the validity of assumptions and, in cases where they don't, whether they at least have good reasons not to (e.g. equal sample sizes). Hoekstra et al. (2012) showed that researchers probably do not check their assumptions as often as they should. Only around six percent of the articles published in Psychological Science and in three educational psychology journals mention the validity of the assumptions. Hoekstra et al. (2012) wanted to know whether this lack of mention occurs because researchers always investigate the validity of the assumptions but never report it, or because researchers forget to investigate it. The authors asked PhD students to analyze six simple datasets. Almost none of the PhD students investigated the validity of the assumptions. Perhaps you're thinking: well, these freshly graduated PhD students know everything about the robustness of tests, so that is probably why they don't check assumptions. But not only did they not test the assumptions, most of them (over 80%, depending on the test) were unfamiliar with the assumptions. Although the t-test, ANOVA and regression analysis are fairly robust to violations of both the normality assumption and the equality of variances assumption, you would expect PhD students to at least be familiar with them. The results of the presented article show that we can't blindly trust that researchers check their assumptions, and that some test results in published articles might be nonsensical.

What should you do to check these assumptions? Always plot your data! To check for normality use Q-Q plots; for homogeneity of variances use box plots of the groups side by side. Another way to check your assumptions is with preliminary tests, although their use is debated (e.g. they increase the probability of a type 1 error; regression: Caudill, 1988; t-test: Rochon, Gondan & Kieser, 2012). For normality you can use the Kolmogorov-Smirnov test or the Shapiro-Wilk test; for homogeneity, Levene's test.
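A minimal sketch of those preliminary tests with scipy, on hypothetical data; in practice you would pair them with the Q-Q plots and box plots mentioned above (e.g. `scipy.stats.probplot` for a Q-Q plot):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical data for two groups, for illustration only.
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.5, scale=1.0, size=40)

# Shapiro-Wilk: null hypothesis is that the sample is normally distributed.
p_norm_a = stats.shapiro(group_a).pvalue
p_norm_b = stats.shapiro(group_b).pvalue

# Levene's test: null hypothesis is that the group variances are equal.
p_levene = stats.levene(group_a, group_b).pvalue

# Only after these checks run the t-test itself.
p_ttest = stats.ttest_ind(group_a, group_b).pvalue
print(p_norm_a, p_norm_b, p_levene, p_ttest)
```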

Monique Duizer, Sam Prinssen

Caudill, S. B. (1988). Type 1 errors after preliminary tests for heteroscedasticity. The Statistician, 37, 65-68.

Hoekstra, R., Kiers, H.A.L. & Johnson, A. (2012). Are assumptions of well-known statistical techniques checked, and why (not)? Frontiers in Psychology, 3:137. doi: 10.3389/fpsyg.2012.00137

Rochon, J., Gondan, M. & Kieser, M. (2012). To test or not to test: Preliminary assessment of normality when comparing two independent samples. BMC Medical Research Methodology, 12:81. doi:10.1186/1471-2288-12-81