The Good, The Bad, And The Science

It takes a lot of knowledge, effort and diligence to be a good researcher. Every day we make decisions that can affect our personal life (we may work overtime for extended periods to get an article published and neglect our friends in the process) or our career (we might cave to a supervisor’s not-so-subtle suggestions to massage the data of our last experiment to get favorable results). Most of the time, they affect both.

When we try to do everything right, we suddenly realize that not every issue is black and white, that there are various valid ways to design a study, various unexpectedly invalid ways to operationalize our dependent variable and many different ways to analyze our data. We find that what one considers a clear instruction puzzles another, that some necessary steps are overlooked by the majority of researchers who publish articles, that some design entire studies without noticing how futile it is to conduct said study when it does not directly test the hypothesis. We discover that many researchers cut corners, often without ill intent, but nonetheless to great effect.

Throughout this course we have heard many stories: some we had heard before, some we heard for the first time and, perhaps overlapping with both, some we will hear time and time again. They ranged from researcher degrees of freedom and the sad truth about p-values to philosophical questions such as “What exactly is the probability of a hypothesis?”, and from strictly mathematical truths about which analyses are appropriate for which kind of data down to outright fraud.

The last lecture was a colorful composition of numerous short talks, which compared psychiatric (mal-)practice to a displaced exercise in legislation and religion, reminded us of how important it is to stay organized and to keep data sets neat and tidy, strongly suggested we use multi-level analyses in our future analytic endeavors, informed us that we should simulate dependent data to test p-level adjustment methods developed for dependent data, presented us with a number of options to remove outliers, reminded us again of how important it is to correct for multiple comparisons, criticized the way psychology students are taught statistics and research methods, suggested that we might have to doubly correct for multiple comparisons when investigating brain networks, inquired whether therapist allegiance effects are real, offered a puzzling account of how Simonsohn’s (2012) fraud-detection method failed to detect in-vitro fraud, and provided us with a brief overview of some good research practices.

At this point I have little to add, but I will leave you with a subtle quote:

The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’ – Isaac Asimov


To my fellow “Good Science, Bad Science” students:

Mathias, Frank, Sanne, Sara, Marie, Monique, Rachel, Sam, Anja, Mattis, Vera, Barbara, Daan



Simonsohn, U. (2012). Just post it: The lesson from two cases of fabricated data detected by statistics alone. Available at SSRN.

Something Fishy This Way Comes

Once every now and then, we come across an article that strikes us as particularly strange. For me, Harris et al. (1999), investigating the effect of intercessory prayer on patients in a hospital’s cardiovascular and coronary unit, is such an article. Not just because they devised an arbitrary outcome measure, but because of patterns in their reported data.

Typically, summary statistics such as the standard errors of two groups should differ a bit. Perhaps not much if the measurement is very precise and the sampling and randomization procedures worked exceptionally well, but they should differ at least a bit. Harris et al. report six rather suspect standard errors (SE) of the means (M) for three measurements (there were two groups): the SE pairs are [.27, .26], [.1, .1] and [.009, .008]. I was puzzled, mainly because the two groups were of different size. Generally, the larger the sample size, the smaller the standard error.
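As a reminder of why group size matters here, the standard error of a mean is the sample standard deviation divided by the square root of the sample size. Assuming, purely for illustration, that both groups share the same standard deviation $s$, the ratio of the two standard errors depends only on the group sizes $n_1$ and $n_2$:

$$SE_j = \frac{s_j}{\sqrt{n_j}}, \qquad \frac{SE_1}{SE_2} = \sqrt{\frac{n_2}{n_1}} \quad \text{if } s_1 = s_2 .$$

The equal-variance assumption is mine and is not taken from Harris et al.; with unequal standard deviations the ratio changes accordingly.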

I set out to test whether this conspicuous coincidence could be expected on the basis of chance. Simonsohn (2012) provided a bootstrap-method to test for this, taking into account means and standard deviations. The values presented in this particular paper did not produce a significant result, at least not by using this method.
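The general idea behind such a simulation test can be sketched as follows. This is a minimal illustration in the spirit of Simonsohn (2012), not his published code and not the implementation used for the analyses reported here. It assumes normally distributed data with a common (pooled) standard deviation, and the function name, group sizes and standard errors are hypothetical:

```python
import random
import statistics
from math import sqrt

def similarity_p_value(sds, ns, n_sims=2000, seed=1):
    """Share of simulations whose group SDs are at least as similar as the reported ones."""
    rng = random.Random(seed)
    # Pool the reported SDs to get one plausible population SD.
    pooled_var = sum((n - 1) * sd ** 2 for sd, n in zip(sds, ns)) / sum(n - 1 for n in ns)
    pooled_sd = sqrt(pooled_var)
    observed_spread = statistics.stdev(sds)  # how much the reported SDs differ
    hits = 0
    for _ in range(n_sims):
        # Draw one fresh sample per group and compute its SD.
        sim_sds = [
            statistics.stdev([rng.gauss(0, pooled_sd) for _ in range(n)])
            for n in ns
        ]
        if statistics.stdev(sim_sds) <= observed_spread:
            hits += 1
    return hits / n_sims

# Hypothetical example: three groups of 15 with nearly identical standard errors.
ns = [15, 15, 15]
ses = [0.50, 0.51, 0.50]
sds = [se * sqrt(n) for se, n in zip(ses, ns)]  # SD = SE * sqrt(n)
print(similarity_p_value(sds, ns))
```

A small returned proportion means that simulated samples rarely produce standard deviations as similar as the reported ones. A proportion of 0 only means that the p-value is smaller than 1 divided by the number of simulations.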

I looked at two more articles that list Mr. Harris as co-author (Duda et al., 2009; Skulas-Ray et al., 2011). Incidentally, both investigate the effect of omega-3 fatty acids on animals (one on humans, the other on non-human animals, namely rats). By the way, Mr. Harris would probably take issue with the previous sentence, as he is a vocal proponent of Intelligent Design (Harris & Calvert, 2003). Of course, as long as what is being referred to as ‘intelligently designed’ is a scientific study, I have no trouble promoting the same. But I tend to doubt an omega-3 researcher’s integrity if said researcher seems heavily invested in the testing of blood omega-3 levels, such as Mr. Harris, who is president and CEO of a certain company called Omegaquant.

I must admit that I am not very knowledgeable in the field of medicine, so perhaps I missed something important; if so, please let me know. Still, the Simonsohn analysis of these two papers gave interesting results.

Duda et al. produced a table of results for 13 variables, 4 of which seemed suspect to me, because they had virtually (and in one case de facto) identical standard errors across 8 conditions. However, only the de facto identical values produced a significant p-value for Simonsohn’s method: out of 100000 simulations, not a single one generated data similar or more extreme than Duda et al.’s. As it happens, this is the variable the authors used to test their main hypothesis.

The last article, Skulas-Ray et al. (2011), was truly something new for me. In a first table, they present 25 variables and 3 measurement points: all standard errors are identical across conditions and in 3 cases even across variables. Perhaps unsurprisingly, the Simonsohn method reports all of these data to be unlikely, with p=0 for 10000 simulations (that is, p < .0001). In a second table, out of 18 new variables, only 4 have instances of deviation from identity across conditions, and all of these deviations are minimal. I doubted my own results so much that I started looking at similar papers with similar methodology, but patterns in the data of those articles are not similar to that degree (Stirban et al., 2010; Goodfellow, Bellamy, Ramsey, Jones and Lewis, 2000), so I am left wondering.

Either there is something seriously wrong with the authors’ data, my implementation of Simonsohn’s method or the way I applied it to the data. There are other alternatives, but these are the most obvious conclusions given these results. Certainly, the field of fraud detection requires a great deal more attention and research to avoid false accusations or even witch-hunts. To avoid confusion, researchers should take the initiative and simply post their raw data online.



Duda, M. K., Shea, K. M., Tintinu, A., et al. (2009). Fish oil, but not flaxseed oil, decreases inflammation and prevents pressure overload-induced cardiac dysfunction. Cardiovascular Research, 81, 319-327.

Harris, W. S., & Calvert, J. H. (2003). Intelligent Design: The Scientific Alternative to Evolution. National Catholic Bioethics Quarterly, 531-561.

Harris, W. S., Gowda, M., & Kolb, J. W. (1999). A randomized, controlled trial of the effects of remote, intercessory prayer on outcomes in patients admitted to the coronary care unit. Archives of Internal Medicine, 159, 2273–2278.

Goodfellow, J., Bellamy, M. F., Ramsey, M. W., Jones, C. J., & Lewis, M. J. (2000). Dietary supplementation with marine omega-3 fatty acids improve systemic large artery endothelial function in subjects with hypercholesterolemia. Journal of the American College of Cardiology, 35, 265-270.

Simonsohn, U. (2012). Just post it: The lesson from two cases of fabricated data detected by statistics alone. Available at SSRN.

Skulas-Ray, A. C., Kris-Etherton, P. M., Harris, W. S., Vanden Heuvel, J. P., Wagner, P. R., & West, S. G. (2011). Dose-response effects of omega-3 fatty acids on triglycerides, inflammation, and endothelial function in healthy persons with moderate hypertriglyceridemia. The American Journal of Clinical Nutrition, 93, 243–252.

Stirban, A., Nandrean, S., Götting, C., et al. (2010). Effects of n-3 fatty acids on macro- and microvascular function in subjects with type 2 diabetes mellitus. The American Journal of Clinical Nutrition, 91, 808–813.

The Difference Between Significant and Non-significant

How to make a difference

Suppose you are a psychotherapist looking for an effective treatment for disorder X, a disorder you had been unfamiliar with up until now. You have discovered that there are two relatively new treatments (T1 and T2) available and you decide to take a look at the available evidence. It turns out that only two studies have investigated the efficacy of T1 and T2 (one for each treatment); unfortunately, neither reports effect sizes or enough data for you to calculate them. All you can go by are the p-values: the first study reports p=.001 for T1 (compared with the control group), whereas the second study reports p=.16 for T2 (compared to control). Both studies use an alpha level of .05; what do you conclude?

You may be tempted to conclude that there is evidence that T1 is effective whereas T2 is not effective, but this need not be the case. Gelman and Stern (2006) argue that whenever a researcher wants to compare the effects of multiple factors, said researcher should test whether the difference between two effects is significant. The authors point out that it is wrong to simply dichotomize the statistical result into “significant” and “non-significant” and conclude that the effects differ.
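A small sketch with made-up numbers shows how this can happen. The estimates and standard errors below are purely illustrative and do not come from the T1 and T2 studies; the sketch assumes two independent, approximately normal effect estimates, and the function names are my own:

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def compare_effects(est1, se1, est2, se2):
    """z and two-sided p for the difference between two independent estimates."""
    z = (est1 - est2) / sqrt(se1 ** 2 + se2 ** 2)
    return z, two_sided_p(z)

# Hypothetical numbers: effect A = 25 (SE 10), effect B = 10 (SE 10).
p_a = two_sided_p(25 / 10)  # A versus zero
p_b = two_sided_p(10 / 10)  # B versus zero
z_diff, p_diff = compare_effects(25, 10, 10, 10)
print(f"A: p={p_a:.3f}  B: p={p_b:.3f}  A-B: z={z_diff:.2f}, p={p_diff:.3f}")
```

With these numbers, A is significant at the .05 level and B is not, yet the test of the difference between A and B is itself not significant.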

Of course this is helpful advice for anyone planning a study involving such comparisons, but what about researchers who want to estimate the difference between two effects when they only have limited data, such as the p-values or the treatment effect estimates and their standard errors?

Gelman and Stern do not offer any ways to calculate the significance level of these differences, but it turns out that several other researchers worked on closely related problems long before Gelman and Stern. For example, Stouffer et al. (1949; as cited in Cooper and Hedges, 1994) presented an early statistic for working with several p-values at once (Stouffer’s Z). Note that it combines the Z values of k independent p-values into one overall test; it does not by itself test whether the p-values differ from each other:

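$Z_{\text{Stouffer}} = \frac{\sum_{i=1}^{k} Z_i}{\sqrt{k}}$, where $Z_i$ is the standard normal deviate corresponding to the i-th p-value and k is the total number of p-values.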

Rosenthal and Rubin (1979) offered a similar solution: Zdiff. For two independent p-values, the idea is to convert both (one-tailed) p-values back to Z values, to divide the difference between the Z values by √2 (the standard deviation of the difference between two independent standard normal deviates) and to look up the p-value associated with this “Zdiff” value.
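For the two p-values from the T1 and T2 example, this calculation can be sketched as follows. The sketch assumes that the two studies are independent and that the reported p-values can be treated as one-tailed values in the same direction, which the example does not state. The difference between the two Z values is divided by √2, following Rosenthal and Rubin. The function names are my own:

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_to_z(p_one_tailed):
    """Standard normal deviate whose upper-tail probability equals p."""
    return STD_NORMAL.inv_cdf(1 - p_one_tailed)

def z_diff(p1, p2):
    """Z for the difference between two independent one-tailed p-values."""
    return (p_to_z(p1) - p_to_z(p2)) / sqrt(2)

# p = .001 for T1 and p = .16 for T2, treated as one-tailed (an assumption).
z = z_diff(0.001, 0.16)
p_two_sided = 2 * (1 - STD_NORMAL.cdf(abs(z)))
print(f"Zdiff = {z:.2f}, two-sided p = {p_two_sided:.3f}")
```

Under these assumptions the difference between p=.001 and p=.16 does not reach the .05 level.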

For multiple p-values, the authors suggest one use “[t]he sum of squares of the deviations about the mean Z” (p.1167) as χ² statistic with df=k-1, where k is the total number of Z values. Hence:

$\chi^2_{k-1} = \sum_{i=1}^{k} (Z_i - \bar{Z})^2$, where $\bar{Z}$ is the mean of the k Z values.
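The sum of squared deviations about the mean Z can be computed in a few lines. The sketch below again assumes independent, one-tailed p-values; the third p-value in the example is invented, and `chi2_sf` is a small hand-written helper (a series expansion of the incomplete gamma function) so that only the standard library is needed:

```python
import math
from statistics import NormalDist, fmean

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with df degrees of freedom."""
    a, y = df / 2, x / 2
    if y <= 0:
        return 1.0
    # Series for the regularized lower incomplete gamma function P(a, y).
    term = total = 1 / a
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= y / (a + n)
        total += term
    log_lower = a * math.log(y) - y - math.lgamma(a) + math.log(total)
    return max(0.0, 1 - math.exp(log_lower))

def heterogeneity_test(one_tailed_ps):
    """Chi-square statistic, df and p for differences among k independent p-values."""
    zs = [NormalDist().inv_cdf(1 - p) for p in one_tailed_ps]
    mean_z = fmean(zs)
    stat = sum((z - mean_z) ** 2 for z in zs)
    df = len(zs) - 1
    return stat, df, chi2_sf(stat, df)

# Hypothetical example: the two p-values from above plus an invented third one.
stat, df, p = heterogeneity_test([0.001, 0.16, 0.04])
print(f"chi-square({df}) = {stat:.2f}, p = {p:.3f}")
```

For k=2 the statistic equals the square of Zdiff, so the two approaches agree in that case.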


However, we must stress that it is much better for the original authors to plan the comparisons ahead of time and test for significance of the difference directly.

Real-life Example

Gelman and Stern mention a regression analysis in which significance levels were compared. The researchers wanted to know whether the high birth order found in homosexual men was due to having more older brothers or more siblings of both sexes than heterosexual men. The regression analysis had 6 coefficients, the first one for the number of older brothers and the second one for the number of older sisters. When only the coefficient for the number of older brothers was found to be a significant predictor of the sexual preference of the men, the researchers concluded the following: “homosexual men have a higher birth order primarily because they have more older brothers”. The problem here is that the significance level of the coefficient for the number of older brothers was compared to the significance level of the coefficient for the number of older sisters. One was significant, the other was not. However, this is not what they wanted to test. They wanted to know if just older brothers were a significant predictor, or siblings of both sexes. This is not what they tested with this regression.

They could have tested this by replacing the first two predictors with two new ones: 1. their sum (the number of older siblings), 2. their difference (the number of older brothers minus the number of older sisters). This way the coefficient of the first new predictor tells you the predictive value of having older siblings, and the coefficient of the second tells you whether older brothers have more predictive value than older sisters. This is exactly the question the researchers wanted to answer.
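The effect of this re-coding can be illustrated with simulated data. The sketch below uses an ordinary least-squares regression on invented data (not the original study's data or model) and a small hand-written solver so that it runs with the standard library alone. All variable names and the simulated effect sizes are hypothetical:

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Simulated data: the outcome depends more on older brothers than on older sisters.
rng = random.Random(0)
brothers = [rng.randint(0, 3) for _ in range(200)]
sisters = [rng.randint(0, 3) for _ in range(200)]
y = [0.5 * b + 0.1 * s + rng.gauss(0, 1) for b, s in zip(brothers, sisters)]

# Original coding: one coefficient per sex.
_, b_bro, b_sis = ols([brothers, sisters], y)

# Re-coded predictors: sum (older siblings) and difference (brothers minus sisters).
siblings = [b + s for b, s in zip(brothers, sisters)]
difference = [b - s for b, s in zip(brothers, sisters)]
_, c_sum, c_diff = ols([siblings, difference], y)

print(f"brothers={b_bro:.3f} sisters={b_sis:.3f} sum={c_sum:.3f} diff={c_diff:.3f}")
```

In a linear model the re-coded coefficients are simple functions of the original ones: the coefficient of the sum is the average of the two original coefficients, and the coefficient of the difference is half their difference. A significance test of the difference coefficient is therefore a direct test of whether older brothers and older sisters differ in predictive value.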

So just be careful with comparing the predictive value of coefficients in a regression analysis. If you want to differentiate between two coefficients, you should not do this based on the significance levels of those two coefficients, but directly compare them.

(by David & Barbara)


Cooper, H., & Hedges, L. V. (Eds.) (1994). The Handbook of Research Synthesis. New York: Russell Sage Foundation.

Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60, 328-331.

Rosenthal, R., & Rubin, D. B. (1979). Comparing significance levels of independent studies. Psychological Bulletin, 86, 1165-1168.