Correction for the multiple testing problem

“One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon measured approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.” Although the salmon was dead, several brain areas appeared to be processing which emotion the person in a presented picture was displaying (Bennett, Baird, Miller, & Wolford, 2011).
This false result was obtained by testing so many voxels that false positives were bound to emerge. With each added test, the risk of a type 1 error increases (Bender & Lange, 2001). A simulation study showed that when an image with two truly active areas is simulated 1000 times, every voxel in the voxel space is deemed active at least once (Logan & Rowe, 2004).
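The inflation is easy to demonstrate with a generic simulation (my own sketch, not the Logan & Rowe setup): when thousands of truly inactive "voxels" are each tested at alpha = .05, hundreds of false positives appear and at least one is virtually guaranteed.

```python
import numpy as np

rng = np.random.default_rng(42)

n_voxels = 10_000   # truly inactive tests
alpha = 0.05

# Under the null hypothesis a p-value is uniform on [0, 1], so each
# test has a 5% chance of producing a (false) rejection.
p_values = rng.uniform(size=n_voxels)
false_positives = int((p_values < alpha).sum())

print(false_positives)              # around alpha * n_voxels = 500
print(1 - (1 - alpha) ** n_voxels)  # probability of at least one type 1 error
```

With 10,000 uncorrected tests the chance of at least one false positive is, for all practical purposes, 1.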
There are methods for correcting for false positives in multiple testing, but they are not always applied in fMRI research: between 24% and 40% of the articles published in 2008 did not correct for multiple testing (Bennett, Baird, Miller, & Wolford, 2011, supplementary material).
Two commonly used procedures for addressing the multiple testing problem are family-wise error (FWE) correction and false discovery rate (FDR) correction. Every procedure has to strike a balance between false positives (type 1 errors) and false negatives (type 2 errors): any method that protects more against one type of error necessarily increases the rate of the other (Lieberman & Cunningham, 2009).
The family wise error rate (FWER) is the probability of making one or more type 1 errors in a family of comparisons. Controlling the FWER at 5% means that, with 95% confidence, the data contain no type 1 errors at all. The simplest FWE correction is the original Bonferroni correction, which divides the alpha level, the accepted chance of a type 1 error (conventionally 5%), by the number of tests (Dunn, 1961). For example, when 100,000 voxels are tested at an FWE rate of 0.05, the threshold for an individual voxel becomes 0.05/100,000 = 0.0000005. Since its introduction the Bonferroni procedure has been refined considerably (Nichols & Hayasaka, 2003).
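In code the correction is a one-liner; the figures below simply restate the example from the text.

```python
n_voxels = 100_000   # number of tests, as in the example above
alpha_fwe = 0.05     # desired family-wise error rate

# Bonferroni: each individual voxel is tested at alpha / number of tests.
threshold = alpha_fwe / n_voxels
print(threshold)     # 5e-07, i.e. 0.0000005
```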
Another FWE-correcting approach is based on random field theory. The reasoning is that because the p-values of neighboring voxels are (locally) dependent, that dependency should be exploited when correcting for multiple testing. Random field theory does not test individual voxels but larger units of observation, 'active brain clusters', which can reduce the number of tests considerably (Brett, Penny & Kiebel, 2003).
A different approach is the FDR-correcting procedure. Instead of correcting over the whole family of tested voxels, this method controls error only among the voxels declared active. The false discovery rate (FDR) is the expected ratio of the number of erroneously rejected null hypotheses to the total number of rejected null hypotheses. The method guarantees that, among the voxels declared active, the expected proportion of false positives stays at a specified level (e.g., 5%). The FDR method is therefore adaptive: its threshold changes with the number and distribution of the tests.
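The standard implementation of FDR control (not named in the text above, but the usual choice) is the Benjamini-Hochberg step-up procedure: sort the p-values and reject up to the largest i for which the i-th smallest p-value is at most (i/m)·q. A minimal sketch:

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level q,
    using the classic Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # Compare the i-th smallest p-value against (i/m) * q.
    below = sorted_p <= (np.arange(1, m + 1) / m) * q
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])  # largest i meeting the bound
        rejected[order[: cutoff + 1]] = True
    return rejected

# The threshold adapts: more small p-values allow more rejections
# at the same FDR level. Here the first two survive:
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.30, 0.60]))
```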
A comparison between the FWE and FDR procedures showed that FDR maintained higher power in the active brain regions, meaning fewer type 2 errors, but at the cost of more falsely detected voxels (Logan & Rowe, 2004). Verhoeven, Simonsen and McIntyre (2005) found that FWE control is preferred only when the penalty for a type 1 error is severe; FDR control is more powerful and often more relevant than controlling the FWER.
These two procedures are not the only options for correcting for multiple testing; a promising newer approach combines spatial information with Bayesian testing methods (Bowman, Caffo, Bassett & Kilts, 2008).

Bender, R., & Lange, S. (2001). Adjusting for multiple testing: when and how? Journal of Clinical Epidemiology, 54, 343-349.
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2011). Neural correlates of interspecies perspective taking in the post mortem Atlantic Salmon: An argument for proper multiple comparisons correction. Journal of Serendipitous and Unexpected Results, 1(1), 1-5.
Bowman, D., Caffo, B., Bassett, S. S., & Kilts, C. (2008). A Bayesian hierarchical framework for spatial modeling of fMRI data. NeuroImage, 39, 146-156.
Brett, M., Penny, W., & Kiebel, S. (2003). An introduction to random field theory. In Frackowiak, R. S. J., Friston, K. J., Frith, C., Dolan, R., Price, C. J., Zeki, S., Ashburner, J., & Penny, W. D. (Eds.), Human Brain Function (2nd ed.). Academic Press.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56, 52-64.
Lieberman, M. D., & Cunningham, W. A. (2009). Type I and Type II error concerns in fMRI research: Rebalancing the scale. Social Cognitive and Affective Neuroscience, 4, 423-428.
Logan, B. R., & Rowe, D. B. (2004). An evaluation of thresholding techniques in fMRI analysis. NeuroImage, 22, 95-108.
Nichols, T., & Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: a comparative review. Statistical Methods in Medical Research, 12(5), 419-446.
Verhoeven, K. J. F., Simonsen, K., & McIntyre, L. M. (2005). Implementing false discovery rate control: increasing your power. Oikos, 108, 643-647.


No sign of right-oriented bias in goalkeepers: A ‘failed’ replication of Roskes et al. (2011): The right side? Under time pressure, approach motivation leads to right-oriented bias

Roskes, Sligte, Shalvi and De Dreu (2011) found that goalkeepers dive more to the right than to the left when they are behind in a penalty shootout than when they are tied or ahead. Roskes et al. (2011) argue that this is because goalkeepers who are behind are more approach motivated than goalkeepers who are tied or ahead, and the rightward bias occurs when people are approach motivated and under time pressure, like goalkeepers who are behind in a penalty shootout. Unfortunately, Roskes et al. (2011) do not explain why only goalkeepers who are behind are approach motivated. There is also a methodological concern: the original data do not support the claim they make (see my earlier blog post). And the finding could have massive impact, because goalkeepers could train to overcome this bias and stop more penalties. For these reasons it is important to replicate this finding and see whether the rightward bias is real.

To do this in a confirmative way, we registered exactly what we were going to measure and which analysis we would perform (see an earlier blog post). Unfortunately we could not stick to our initial plan entirely, because the quality of some videos was extremely poor. The most important part of the registration, however, is that we stick with our original analysis, and that is still the case.

The analysis showed that goalkeepers dived equally often to the right and the left when their team was ahead, χ²(1, N = 124) = 2.613, p = 0.106; when their team was tied, χ²(1, N = 163) = 3.245, p = 0.072; and when their team was behind, χ²(1, N = 41) = 0.610, p = 0.435. These results show no rightward bias in goalkeepers.
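Each of these is a chi-square goodness-of-fit test of the left/right dive counts against a 50/50 split. A small sketch follows; the counts are reverse-engineered from N and the test statistic purely for illustration, so the left/right direction is my assumption, not the raw data:

```python
from math import erfc, sqrt

def chi2_5050(left, right):
    """Chi-square goodness-of-fit test (df = 1) of dive counts
    against an even 50/50 split."""
    n = left + right
    # With two cells and expected counts n/2 each, the statistic
    # simplifies to (left - right)^2 / n.
    chi2 = (left - right) ** 2 / n
    # For df = 1, chi2 = z^2, so the two-sided p-value follows
    # from the standard normal distribution.
    p = erfc(sqrt(chi2 / 2))
    return chi2, p

# Counts chosen to reproduce chi2(1, N = 124) = 2.613 (direction assumed):
chi2, p = chi2_5050(left=53, right=71)
print(round(chi2, 3), round(p, 3))   # 2.613 0.106, matching the reported test
```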

Our replication study showed no rightward bias in the diving direction of goalkeepers whose team is behind. It also showed how important it is to pre-register the analysis you intend to do. Had we analyzed all penalties together instead of separately for behind, tied and ahead, we would have found that goalkeepers do dive more to the right, χ²(1, N = 329) = 6.71, p = 0.01, and we might have concluded that we extended the original findings of Roskes, Sligte, Shalvi and De Dreu to all goalkeepers instead of only those who are behind. Because conclusions depend so heavily on the analysis you run, more researchers should pre-register their analyses. That way it is certain which conclusions are confirmatory; nowadays it is often unclear whether a finding is confirmatory or exploratory. This study showed, in a confirmatory way, that the rightward bias does not exist in goalkeepers.

Roskes, M., Sligte, D., Shalvi, S., & De Dreu, C. K. W. (2011). The right side? Under time pressure, approach motivation leads to right-oriented bias. Psychological Science, 22, 1403-1407.


In one of the first lectures, we discussed the priming study by Bargh, Chen and Burrows (1996). In this study, they primed participants with the elderly stereotype. While the participants walked back to the exit through a long corridor, a confederate measured their walking speed. Bargh, Chen and Burrows found that participants primed with the elderly stereotype walked more slowly than participants who were not. Doyen et al. (2012) replicated this research with more advanced measurements and were not able to reproduce the effects of the Bargh et al. study. Therefore, they conducted a second experiment in which they told the experimenters specific hypotheses: half of the experimenters were told the participants would walk slower, the other half that the participants would walk faster. The experimenters were also instructed to use a stopwatch for measuring walking speed, supposedly because the infrared sensors were not yet calibrated. The results were surprising: the findings of Bargh et al. (1996) were replicated, but only for the experimenters in the slow condition. This effect was even more prominent for the manually measured walking times: participants in the prime condition walked slower when the experimenter was in the slow condition and faster when the experimenter was in the fast condition. In this replication, the priming effect is thus explained as an experimenter effect.

This and other replications of priming effects, together with the recent exposure of fraudulent social psychologists in the priming field, prompted Daniel Kahneman to write an e-mail to his colleagues with a proposal for dealing with questions about priming effects (Kahneman, 2012). The most important message of his e-mail is that researchers in the priming field should solve their integrity problem, and the best way to do so, according to Kahneman, is to examine the replicability of priming results.

The big problem with priming research is that researchers supposedly need specific training to do it. So to solve the integrity problem in priming research, researchers from the field itself have to do the replications, because if you are not trained, you don't find the effect. Do you see the problem here?

However, the solution lies not only in running and publishing replication studies of priming. We also need to make the data of all our research available. The availability of data sets should be the norm, not the exception, and researchers who are not willing to share their data should be approached with a certain amount of suspicion.

Public commitment, and pre-committing to publish the results, can help solve the integrity problem that, as Kahneman points out to the members of his field, the priming field has to deal with.

An excellent example of data sharing comes from the lead author of a recently published article about inhibition of neuroblastoma tumor growth, who explained on national television that all data are available to the public and to other researchers in his field (Molenaar et al., 2012). The breakthrough in this area of research is, of course, worth a publication: not only because the researchers did their jobs well, but more importantly because their research will help children with a specific disease. It is a clear example of how research is a public matter. Science is not about reaching a high position as quickly as possible; it is about exploring the world, wanting to know how it works, and sharing that with every human being who is part of that world.

Finally, let us not forget that it is science that brought us all the progress in this world: science brought us knowledge, welfare and penicillin, and it is science that extends our life expectancy.

 While you live, tell truth and shame the Devil!

Shakespeare, Henry IV. Part I, 1597


Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71, 230-244.

Doyen, S., Klein, O., Pichon, C.L., & Cleeremans, A. (2012). Behavioral priming: It’s all in the mind, but whose mind? PLoS ONE, 7(1): e29081.10.1371/journal.pone.0029081

Kahneman, D. (2012). A proposal to deal with questions about priming effects.

Molenaar et al. (2012). LIN28B induces neuroblastoma and enhances MYCN levels via let-7 suppression. Nature Genetics, in press. doi:10.1038/ng.2436


The unlucky number seven: A rather painful critique of my internship project


For my internship project I investigated the effects of deep brain stimulation (DBS) as a treatment for Parkinson's disease, including improvements in motor function and quality of life, but also cognitive decline. I ended up dealing with a rather large and complex data set which, after all the computed variables had been created, contained 233 variables for 281 participants. To make things more complex, the data had been combined from three different sources, which meant there were inconsistencies in coding and in the tests administered across the studies. I tried to be as stringent as possible with the initial data checking, handling and subsequent analyses. However, I was still left feeling unconfident about my findings. Unfortunately, I had good reason for this uneasy feeling: while checking I found two rather huge mistakes. Fortunately I still had time to improve my project before the final version was handed in.

The Suspect P-Value

I cleaned up the final version of my data set and syntax and decided to rerun all the analyses to make sure I had consistent results. It was all going rather well until I moved on to the cognitive variables. In my write-up I had already found a wrongly reported p-value for the Mattis Dementia Rating Scale (MDRS): I had reported it as significant, when in fact the p-value was 0.017, non-significant at my alpha of 0.01. When I reran the analysis it turned out to be even worse; the p-value was actually 0.022. I knew I had double-checked the analysis, so I was rather baffled! I found the earlier version of the data set which gave me the original 0.017 p-value and began my search. The data seemed identical, until I checked my IQ covariate: I had missed a missing-value code of 777 (inability to complete). The IQ covariate which I had used in all my cognitive analyses! I reran all my analyses and changed my report. Most of my analyses were not greatly affected by this mistake; however, I did lose significance on one comparison, which went from p = .009 to p = .025!
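The 777 problem is a classic argument for declaring missing-value codes explicitly before any analysis runs. A minimal pandas sketch (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data frame; the 'iq' column uses 777 as a missing-value code.
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "iq": [102, 777, 95, 110]})

print(df["iq"].mean())                 # 271.0 -- wildly inflated by the 777

# Declare the code as missing before computing anything.
df["iq"] = df["iq"].replace(777, np.nan)
print(df["iq"].mean())                 # ~102.33, computed over valid values only
```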

The Questionable Predictor

I also had to question my inclusion of the MDRS as a significant baseline predictor of cognitive decline following deep brain stimulation. The MDRS was a very desirable predictor: it was a measure of global cognition already used by the neuropsychologists and doctors to set a cut-off point for undergoing DBS. Originally significant, once converted into a T-score to correct for age and education it was pushed out of the model by years of education and IQ. As IQ and education conceptually overlap, I had built one model excluding education; once again the MDRS was significant and IQ was no longer in the model. I had kept this model because it was more parsimonious and practical, as the IQ measure would not be as widely available as the MDRS. However, I felt a little uncertain about that decision. Once I reran the analysis with the corrected IQ, the final model contained medication dose and education. I decided to retain this model, which reflects more consistent and improved research practice.

I feel my mistakes were a combination of long research hours with a large data set and previous expectations clouding my judgement. Overall I learnt a lot from doing this critical review of my own work. It has made me very aware of my own fallibility, despite having good intentions. I am glad that I managed to locate and fix these problems, both for this project and to improve my research practice for future projects.


Multilevel modeling is needed when we want to model a hierarchical structure. Because the world is full of hierarchies (students within classes within schools, time points within people, penalties within goalkeepers, clients within therapists and, what we often forget in science, research subjects within experimenters), we quite often, or almost always, need multilevel models. Through several examples (Clark, 1973; Van Baaren, Holland, Steenaert, & van Knippenberg; Hoeken & Hustinx, 2009) I hope to have shown the necessity of this approach. However, I have been asked what the assumptions of multilevel models are. Because I did not have a straight answer, I would like to take this opportunity to address that question here.

To understand this, I first have to explain the nature of multilevel modeling briefly. A complete (random intercept, random slope) multilevel model consists of an ordinary regression equation at level 1, yij = b0j + b1j·xij + eij, where subscript i indexes individuals and subscript j indexes groups. The differences between groups are specified by regressing the intercept and slope parameters in new level-2 equations, b0j = γ00 + u0j and b1j = γ10 + u1j (either of these level-2 equations can be dropped to obtain a fixed intercept, random slope model or a random intercept, fixed slope model; see all three models in the figures below).
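A small simulation can make the two-level structure concrete. This is my own sketch with made-up parameters: it recovers γ00 and γ10 crudely by averaging per-group least-squares fits, where a real analysis would use dedicated multilevel software (e.g. lme4 or statsmodels' MixedLM).

```python
import numpy as np

rng = np.random.default_rng(1)

gamma00, gamma10 = 2.0, 0.5        # fixed intercept and slope (made up)
n_groups, n_per_group = 200, 30

intercepts, slopes = [], []
for j in range(n_groups):
    # Level 2: each group draws its own intercept and slope.
    b0j = gamma00 + rng.normal(0, 0.4)   # b0j = gamma00 + u0j
    b1j = gamma10 + rng.normal(0, 0.2)   # b1j = gamma10 + u1j
    # Level 1: individuals within group j.
    x = rng.normal(size=n_per_group)
    y = b0j + b1j * x + rng.normal(0, 1.0, size=n_per_group)
    # Ordinary least squares within the group.
    b1_hat, b0_hat = np.polyfit(x, y, 1)
    intercepts.append(b0_hat)
    slopes.append(b1_hat)

# Averaging the per-group estimates recovers the fixed effects,
# close to the true values 2.0 and 0.5.
print(round(np.mean(intercepts), 2), round(np.mean(slopes), 2))
```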


Now to the assumptions. These are the same as in ordinary multiple regression analysis:

  • linear relationships,
  • normal distribution of the residuals,
  • and homoscedasticity.

When the assumption of linearity is violated we can model other relations (for instance, by adding the square of a time variable in a longitudinal study). Note that due to the introduction of multiple levels there is now more than one residual term. Of course this complicates matters a bit, but multilevel estimation methods have been shown to be quite robust to violations of this assumption at the second level (Maas & Hox, 2004). Another great advantage of multilevel modeling is that heteroscedasticity can be modeled directly, to account for violations of the final assumption (cf. Goldstein, 1995, pp. 48-57).

So to sum up, USE MULTILEVEL!



Goldstein, H., 1995. Multilevel Statistical Models. Edward Arnold, London; Halsted, New York.


Maas, C. J. M., & Hox, J. J. (2004). The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Computational Statistics & Data Analysis, 46, 427-440.

The Good, The Bad, And The Science

It takes a lot of knowledge, effort and diligence to be a good researcher. Every day we make decisions that can affect our personal life (we may work overtime for extended periods to get an article published and in the process neglect our friends), our career (we might cave to a supervisor's not-so-subtle suggestions to massage the data of our last experiment to get favorable results), and most of the time both.

When we try to do everything right, we suddenly realize that not every issue is black and white, that there are various valid ways to design a study, various unexpectedly invalid ways to operationalize our dependent variable and many different ways to analyze our data. We find that what one considers a clear instruction puzzles another, that some necessary steps are overlooked by the majority of researchers who publish articles, that some design entire studies without noticing how futile it is to conduct said study when it does not directly test the hypothesis. We discover that many researchers cut corners, often without ill intent, but nonetheless to great effect.

Throughout this course we have heard many stories: some we had heard before, some we heard for the first time and, perhaps overlapping, some we will hear time and time again. They ranged from researcher degrees of freedom and the sad truth about p-values to philosophical questions such as "What exactly is the probability of a hypothesis?"; from strictly mathematical truths about which analyses are appropriate for which kind of data down to outright fraud.

The last lecture was a colorful composition of numerous short talks, which compared psychiatric (mal-)practice to a displaced exercise in legislation and religion, reminded us of how important it is to stay organized and to keep data sets neat and tidy, strongly suggested we use multilevel analyses in our future analytic endeavors, informed us that we should simulate dependent data to test p-level adjustment methods developed for dependent data, presented us with a number of options for removing outliers, reminded us again of how important it is to correct for multiple comparisons, criticized the way psychology students are taught statistics and research methods, suggested that we might have to doubly correct for multiple comparisons when investigating brain networks, inquired whether therapist allegiance effects are real, offered a puzzling account of how Simonsohn's (2012) fraud-detection method failed to detect in-vitro fraud, and provided us with a brief overview of some good research practices.

At this point I have little to add, but I will leave you with a subtle quote:

The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’ – Isaac Asimov


To my fellow “Good Science, Bad Science” students:

Mathias, Frank, Sanne, Sara, Marie, Monique, Rachel, Sam, Anja, Mattis, Vera, Barbara, Daan



Simonsohn, U. (2012). Just post it: The lesson from two cases of fabricated data detected by statistics alone. Available at SSRN.

Something Fishy This Way Comes

Once every now and then, we come across an article that strikes us as particularly strange. For me, Harris et al. (1999), investigating the effect of intercessory prayer on patients in a hospital’s cardiovascular and coronary unit, is such an article. Not just because they devised an arbitrary outcome measure, but because of patterns in their reported data.

Data reported for two different groups should typically differ at least a bit; perhaps not much if the measurement is very precise and the sampling and randomization procedures worked exceptionally well, but at least a bit. Harris et al. report six rather suspect standard errors (SE) of the means for three measurements in two groups: the SE pairs are [.27, .26], [.1, .1] and [.009, .008]. I was puzzled, mainly because the two groups were of different size, and generally, the larger the sample size, the smaller the standard error.

I set out to test whether this conspicuous coincidence could be expected on the basis of chance. Simonsohn (2012) provides a bootstrap method for this, taking into account means and standard deviations. The values presented in this particular paper did not produce a significant result, at least not with this method.
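The gist of such a method can be sketched as follows. This is a loose simplification of the idea, not Simonsohn's actual procedure, and the input values are placeholders rather than the paper's data:

```python
import numpy as np

rng = np.random.default_rng(7)

def similar_se_pvalue(means, sds, ns, n_sims=2000):
    """How often do freshly simulated samples yield standard errors at
    least as close together as the reported ones? A small proportion
    suggests the reported SEs are suspiciously similar.
    (A loose sketch of the idea, not Simonsohn's published procedure.)"""
    reported_se = np.array(sds) / np.sqrt(ns)
    reported_spread = reported_se.max() - reported_se.min()
    hits = 0
    for _ in range(n_sims):
        sim_se = [rng.normal(m, sd, size=n).std(ddof=1) / np.sqrt(n)
                  for m, sd, n in zip(means, sds, ns)]
        if max(sim_se) - min(sim_se) <= reported_spread:
            hits += 1
    return hits / n_sims

# Placeholder inputs for two groups of unequal size (not the paper's data):
p = similar_se_pvalue(means=[5.0, 5.2], sds=[1.2, 1.2], ns=[466, 524])
print(p)   # large here: SEs this similar are unsurprising for these inputs
```

A very small proportion would indicate that standard errors as similar as the reported ones almost never arise from honest sampling variation.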

I looked at two more articles that list Mr. Harris as co-author (Duda et al., 2009; Skulas-Ray et al., 2011); incidentally, both investigate the effect of omega-3 fatty acids on animals (one on humans, the other on rats). Mr. Harris would probably take issue with the previous sentence, as he is a vocal proponent of Intelligent Design. Of course, as long as what is being referred to as 'intelligently designed' is a scientific study, I have no trouble promoting the same. But I tend to doubt an omega-3 researcher's integrity if said researcher seems heavily invested in the testing of blood omega-3 levels, as is Mr. Harris, who is president and CEO of a company called Omegaquant.

I must admit that I am not very knowledgeable in the field of medicine, so perhaps I missed something important – if so, please let me know, but the Simonsohn analysis of these two papers gave interesting results.

Duda et al. produced a table of results for 13 variables, 4 of which seemed suspect to me because they had virtually (and in one case de facto) identical standard errors across 8 conditions. However, only the de facto identical values produced a significant p-value under Simonsohn's method: out of 100,000 simulations, not a single one generated data as similar as or more similar than Duda et al.'s. As it happens, this is the variable the authors used to test their main hypothesis.

The last article, Skulas-Ray et al. (2011), was truly something new for me. In a first table, they present 25 variables and 3 measurement points: all standard errors are identical across conditions, and in 3 cases even across variables. Perhaps unsurprisingly, the Simonsohn method reports all of these data to be unlikely, with p = 0 over 10,000 simulations. In a second table, out of 18 new variables, only 4 show any deviation from identity across conditions, and all of these deviations are minimal. I doubted my own results so much that I started looking at papers with similar methodology, but the patterns in the data of those articles are not similar to that degree (Stirban et al., 2009; Goodfellow, Bellamy, Ramsey, Jones, & Lewis, 2000), so I am left wondering.

Either there is something seriously wrong with the authors' data, with my implementation of Simonsohn's method, or with the way I applied it to the data. There are other alternatives, but these are the most obvious conclusions given these results. Certainly, the field of fraud detection requires a great deal more attention and research to avoid false accusations or even witch-hunts. To avoid confusion, researchers should take the initiative and simply post their raw data online.



Duda, M. K., Shea, K. M., Tintinu, A., et al. (2009). Fish oil, but not flaxseed oil, decreases inflammation and prevents pressure overload-induced cardiac dysfunction. Cardiovascular Research, 81, 319-327.

Harris, W. S., & Calvert, J. H. (2003). Intelligent Design: The Scientific Alternative to Evolution. National Catholic Bioethics Quarterly, 531-561.

Harris, W. S., Gowda, M., & Kolb, J. W. (1999). A randomized, controlled trial of the effects of remote, intercessory prayer on outcomes in patients admitted to the coronary care unit. Archives of Internal Medicine, 159, 2273-2278.

Goodfellow, J., Bellamy, M. F., Ramsey, M. W., Jones, C. J., & Lewis, M. J. (2000). Dietary supplementation with marine omega-3 fatty acids improve systemic large artery endothelial function in subjects with hypercholesterolemia. Journal of the American College of Cardiology, 35, 265-270.

Simonsohn, U. (2012). Just post it: The lesson from two cases of fabricated data detected by statistics alone. Available at SSRN.

Skulas-Ray, A. C., Kris-Etherton, P. M., Harris, W. S., Vanden Heuvel, J. P., Wagner, P. R., & West, S. G. (2011). Dose-response effects of omega-3 fatty acids on triglycerides, inflammation, and endothelial function in healthy persons with moderate hypertriglyceridemia. The American Journal of Clinical Nutrition, 93, 243-252.

Stirban, A., Nandrean, S., Götting, C., et al. (2009). Effects of n-3 fatty acids on macro- and microvascular function in subjects with type 2 diabetes mellitus. American Journal of Clinical Nutrition, 91, 808-813.

Questionable research practices

I work in the Methodologiewinkel, where psychology students can seek advice from other students on their research methods and data analysis. My colleagues and I often encounter a specific pattern in the analysis strategies of undergraduate (but also graduate!) psychology students. There is a tendency to gather as many variables as possible, without a clear rationale for how they link to the purpose of the study. Consequently, many models are tested during the analysis (after all, one has not taken the trouble to gather so many variables for nothing). Moreover, and importantly, many students seem puzzled when reminded that they should report every exploratory finding as such. While they show a genuine wish to find something important in the data, students do not seem to understand what it really means to separate exploratory from confirmatory findings. Their behaviour cannot be considered cheating, but it is surely questionable.

For my final paper, I evaluated 14 statistics and methodology books to see whether they address four such questionable research practices (those used in Simmons, Nelson, & Simonsohn, 2011: testing two or more dependent variables, testing additional subjects or optional stopping, including covariates ad hoc, and dropping conditions or not reporting them) and how elaborate their chapters are on the ethical implications and the do's and don'ts of research. Surprisingly, I found some clearly wrong and some at least misleading accounts, and only a few good ones that address the issues in depth, giving illustrative examples and useful practical solutions. Although all the books do, in fact, discuss exploratory versus confirmatory research, the explanations remain abstract and without concrete practical implications.

In one of the books I stumbled over a discussion of the possible reasons for fraud, where the 'publish or perish' culture is seen as one of the contributing factors. This is the first time, during either my bachelor's in Vienna or my master's in Amsterdam, that I have come across a note about the kinds of pressure you might face in your later research career. We are never really told about the daily issues that De Vries, Anderson and Martinson (2006) have called "normal misbehaviours": conflicts of interest among colleagues, rules of conduct in a lab, the normal practice of deciding upon authorship, how to behave if someone is being cut out, how to keep proper research records, and so on. While there may be many abstract discussions about the ethics of research, there are no practical guidelines that would prepare young researchers to deal with these kinds of social conflicts (and with how to keep them from influencing the quality of research). Maybe we cannot avoid such 'misbehaviours' entirely, as De Vries et al. (2006) argue. But we can at least raise awareness of them among new generations of students.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

De Vries, R., Anderson, M. S., & Martinson, B. C. (2006). Normal misbehavior: Scientists talk about the ethics of research. Journal of Empirical Research on Human Research Ethics: JERHRE, 1, 43-50.



Is Psychometrics Pathological?

Every few months, the highly controversial question of whether psychometrics can be classified as a real empirical science arises again in the specialist literature of the field. In this context, one often reads eloquent philosophical articles on the status of the social sciences compared to the physical sciences. Frequently, articles point out inherent differences in the study objects of the disciplines: the complex dynamics within, around and between human beings versus the exact rules that govern the natural world. In the natural sciences, scientists seek to understand how the world, nature and the universe work by using scientific methods, relying on experimental data that can be represented quantitatively. In psychometrics there is a similar conviction that psychological attributes are quantitative: psychometricians assume that they are able to measure concepts like personality or ability.

In my paper, I focused on a recent discussion of this questionable assumption. In 2000, the journal Theory & Psychology published an article titled “Normal Science, Pathological Science and Psychometrics”, written by Joel Michell of the University of Sydney. In 2004, Denny Borsboom and Gideon Mellenbergh of the University of Amsterdam wrote a comment on Michell’s paper, which was also published in Theory & Psychology. Their discussion centers on the question of whether psychometrics is a pathological science.

In his article, Michell (2000) claims that psychometrics is a pathological form of science. He contrasts two terms: normal science and pathological science. Michell’s definition of optimal, or normal, science is based on the principles of critical inquiry (Michell, 2000, p. 640). Critical inquiry consists of two forms of testing, corresponding to the two types of error: logical and empirical. Through critical inquiry it is possible to identify such error and be aware of its occurrence. Normal science turns into pathological science when one does not work according to the principle of critical inquiry. According to Michell, psychometrics breaches this principle on two distinct levels: (1) the hypothesis that psychological attributes and concepts are quantitative is accepted as true without a serious attempt to test this assumption, and (2) this fact is not discussed and is even disguised.

In their comment on Michell’s article, Borsboom and Mellenbergh (2004) argue that if one declares psychometrics a pathological discipline, nearly all other scientific disciplines are pathological, too. They base their argument on the Quine-Duhem thesis from the philosophy of science, which holds that hypotheses are never tested in isolation, since they are always part of a larger network of hypotheses. Borsboom and Mellenbergh state that it is therefore impossible to test the hypothesis that psychological attributes are quantitative in isolation. In this context they claim that one should distinguish between classical test theory (CTT; Lord & Novick, 1968) and item response theory (IRT; Hambleton & Swaminathan, 1985). With their comment, Borsboom and Mellenbergh show the importance of Michell’s critique of psychometrics. They also underline that the failure to test the assumption of quantitative psychological attributes cannot be attributed to ignorance on the part of psychometricians, but to a restriction that applies to all hypothesis testing: one cannot isolate a hypothesis. Furthermore, they make clear that some of Michell’s claims point to an important concern for much psychological research: “…often, item scores are simply summed and declared to be measurement of an attribute, without any attempt being made to justify this conclusion”. As we conclude every week of this course: it is all about                  a w a r e n e s s.

Freedom versus Fraud and Force in Psychiatry

According to psychiatrist Thomas Szasz (1920-2012), the discipline of mental healing resembles religion rather than medicine or science. At first, psychiatrists were neuropathologists, experts in the histopathology of the central nervous system. Today, however, they are psychopharmacologists, experts in creating and uncreating psychiatric diagnoses, attaching these diagnoses to bad behaviour, and prescribing drugs to persons labeled with such diagnoses (Szasz, 2004).

Szasz states that there are two kinds of brain diseases: (1) proven or real brain diseases, like strokes, for which clear evidence of damaged brain tissue is present, and (2) putative or fake brain diseases, like schizophrenia, for which specific evidence of physiological brain damage is lacking. In this view, mental illnesses do not exist: real diseases are diseases of the brain, not of the mind. Diagnoses should be driven by identifying bodily lesions and their material causes, rather than by non-medical considerations and incentives (Szasz, 1994). According to Szasz, psychiatrists have the power to accredit their own claims as scientific facts and sound treatment, to discredit the claims of their mental patients, and to enlist coercive power to impose their own views (Szasz, 1994). His main point is that psychiatrists function as legislators rather than scientists, confining and controlling the ‘deviant’. Szasz concludes that psychiatry is a branch of the law and a secular religion, rather than a real science or therapy (Szasz, 1994).

E. Fuller Torrey is the most prominent advocate of forced psychiatric treatment in the United States today. He considers coerced therapy so medically and socially important that it justifies actively deceiving the patient. Torrey advocates actively deceiving the ‘severely mentally ill’ and compelling them to be drugged with chemicals that he deems good for them (Szasz, 2004). According to Szasz, Torrey’s active deception goes against the patient’s free will. Szasz draws a religion-like parallel between Torrey’s practice and that of Christian servants in Jewish households who used to baptize the children ‘to save them from going to hell’.

All diseases, whether putative or proven, with all their symptoms and societal consequences, fall on a continuum, making it difficult to draw the line between ‘serious enough to justify coercion’ and ‘free will may not be denied’. The question should therefore be: who is allowed to, and capable of, making this distinction, and hence of making policy and taking responsibility for all its consequences? The answer is probably that nobody is. That is why we need more communication and collective awareness regarding the way we have constructed our psychiatric system. Perhaps Szasz would have agreed that it is ironic that psychiatrists resemble scientists most in the sense that they mainly consider their own interests, rather than the interests of their very ‘mentally ill’ patients. It would do no harm if psychiatrists opened themselves to critical review, reflected on their roles in the psychiatric system, and checked whether their highest priority is still the intended one: the patient. That is, if the patient himself wants that priority.


Szasz, T. (1994). Mental illness is still a myth. Society, 34-39.

Szasz, T. (2004). Psychiatric fraud and force: A critique of E. Fuller Torrey. Journal of Humanistic Psychology, 44, 416-430.