Therapeutic Allegiance Effects – Are they real?

Therapeutic Allegiance refers to the belief of a psychotherapy researcher that one therapy is superior to others. It has been suggested that this might pose a problem, as Therapeutic Allegiance could influence the outcomes of research. This suggestion has been underlined by the fact that Allegiance and effect sizes are often associated when looking at a set of psychotherapy studies.

Nevertheless, there has been no demonstration that Therapeutic Allegiance has a causal effect on outcome, apart from some anecdotal evidence. This is important to keep in mind, especially as some researchers have gone as far as proposing a statistical correction for Allegiance effects. However, we cannot draw causal inferences from correlational data.

There are two major points we should heed when dealing with this issue:

1) There is a good way to minimize the chance of Therapeutic Allegiance effects occurring: adversarial collaboration. This means including experts on all the therapies examined in the research team, so that all sides of the story are represented.

2) Before we can claim that Allegiance has a causal effect on outcome, there have to be more focused research efforts. One good idea is to find studies that are almost identical but conducted by researchers with different Allegiances. Diverging results in such studies would be a strong indication of Allegiance Bias.


Main reference:

Leykin, Y., & DeRubeis, R. J. (2009). Allegiance in psychotherapy outcome research: Separating association from bias. Clinical Psychology: Science and Practice, 16, 54-65.

Statistical detection of fraud

In this course we have often seen the picture of the evil scientist: a mad professor sitting in his office all by himself, throwing out conditions and outliers, adding dependent variables, or even making up data. The first three are called questionable research practices: dangerous for the outcome of a study (or not, if you are looking for a significant result), but quite common – 33% of the researchers in the study of Fanelli (2009) admitted to doing these kinds of things. Fraud, on the other hand, is less common: only 2% admitted having falsified or fabricated data.

Although fraud is less common than questionable research practices, it is still dangerous, because papers with unproven conclusions are read and cited every day. Therefore, it is important to find out whether someone is committing fraud. However, this is very hard: blowing the whistle on someone is difficult, and replication studies are so rare that the chance of discovering fraud because a finding cannot be replicated is very small.

Simonsohn (2012) invented a method to statistically detect fraud. When I first read this I thought: this is the solution to the whole problem! The method checks whether means or standard deviations within a paper are too similar to have originated from random sampling, by simulating the study 100,000 times. It works best if an article contains multiple studies, so that the overall likelihood of the article originating from random sampling can be determined. The method can also inspect raw data, but because raw data are almost never published while means and standard deviations are, I will not go into the inspection of raw data.
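Simonsohn's full procedure has more detail than I can reproduce here, but the core idea – ask how often random sampling produces summary statistics as similar as the reported ones – can be sketched in Python. Everything in this sketch is my own assumption for illustration: I use the spread of the reported per-condition standard deviations as the test statistic and a pooled SD as the population value under the null.

```python
import numpy as np

def similarity_p_value(sds, n, n_sim=10_000, seed=0):
    """How often does random sampling produce per-condition standard
    deviations at least as similar as the reported ones?  A very small
    value suggests the reported SDs are suspiciously alike."""
    rng = np.random.default_rng(seed)
    sds = np.asarray(sds, dtype=float)
    observed_spread = np.std(sds)            # spread of the reported SDs
    pooled_sd = np.sqrt(np.mean(sds ** 2))   # common population SD under the null
    hits = 0
    for _ in range(n_sim):
        sim = [rng.normal(0.0, pooled_sd, n).std(ddof=1) for _ in sds]
        if np.std(sim) <= observed_spread:   # simulated SDs at least as similar
            hits += 1
    return hits / n_sim
```

Identical reported SDs would yield a value near zero (too similar to be true), while clearly different SDs yield a value near one – the same scale as the 0.35–0.97 and 0.999–1.000 figures reported below.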

To see the effect of this method with my own eyes, I took two experiments from an article by Stapel and Koomen (2001). I chose a retracted article of Diederik Stapel to increase the chance of dealing with fraudulent data. However, when I applied Simonsohn's method, I did not detect fraud at all in the paper of Stapel and Koomen: the chances of these data originating from random sampling ranged from 0.35 to 0.97. And when I fabricated data myself to see whether the method could detect it, the chances of my data originating from random sampling ranged from 0.999 to 1.000.

I have to be honest: these findings still confuse me. How is it possible that a method that has been shown to work in a paper didn't work when I tried to apply it? Did I make a mistake? Anyone who thinks that this is the case can ask me for the R script of my simulations.

The method of Simonsohn (2012) has been called promising, but it is very new and probably needs research itself. Not just because I couldn't make it work, but also because the impact of such a method can be very high. If the method wrongly finds misconduct in, say, one out of 10,000 papers, that is not a lot, but the damage to that one researcher will be very large, says Wagenmakers (Enserink, 2012). That is why I do not think we should apply this method to every article. However, it can be very useful when there are already suspicions about the nature of a paper.

Enserink, M. (2012). Fraud detection tool could shake up psychology. Science, 337, 21-22.
Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS ONE, 4(5), 1-11.
Simonsohn, U. (2012). Just post it: The lesson from two cases of fabricated data detected by statistics alone (submitted).
Stapel, D. A., & Koomen, W. (2001). I, we, and the effects of others on me: How self-construal level moderates social comparison effects. Journal of Personality and Social Psychology, 80, 766-781.

Are the BH and BL procedures solutions to the multiple comparisons problem?

“The gene for depression is found” and “The brain region for fear is located” are typical headlines that appear in newspapers and on news sites every now and then. In these studies, many hypotheses are tested: many genes are tested for a correlation with some sort of behaviour, and many voxels (the smallest brain unit fMRI can measure) are tested for a correlation with emotions. As the number of tested hypotheses in a study increases, the chance of finding a spurious effect increases. Because scientists want no more than 5% of their findings to be false positives, they should control for multiple comparisons.

The traditional way to control for multiple comparisons is the Bonferroni correction, but the Bonferroni correction decreases power. When power is low, it is hard to find existing effects. Because of the low power, researchers often do not control for multiple comparisons at all. Benjamini et al. (2001) proposed the BH (for independent data) and BL (for dependent data) procedures to control the False Discovery Rate (FDR) while maintaining power. The FDR is the proportion of false positives among the significant findings in a study; for example, 2 false positives out of 50 discoveries is acceptable, but 20 out of 50 is far too many. The BH and BL procedures are statistical techniques that keep the FDR under control.
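For concreteness, here is a minimal Python sketch of the BH step-up procedure next to the Bonferroni correction (the BL procedure for dependent data is omitted; the function names are my own):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject a hypothesis only if its p-value is below alpha / m."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / len(p)

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: sort the p-values and reject every hypothesis up to the
    largest rank i whose p-value satisfies p_(i) <= (i / m) * q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passes = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()   # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject
```

On a set of 15 made-up p-values such as [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298, 0.0344, 0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0000], Bonferroni at alpha = .05 rejects 3 hypotheses while BH at q = .05 rejects 4 – the extra power described above.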

To see how much power is maintained when using the BH and BL procedures, I conducted a simulation study, the results of which can be found in the figure below.


I simulated data in which 20%, 50%, 80% or 100% of the hypotheses were true and 10, 50, 100 or 500 hypotheses were tested. The power of Alpha and Bonferroni is the same across the four panels of the figure, so it is easy to compare them with the power of the BH and BL procedures. As the figure shows, both the BH and the BL procedure gain power as the number of true effects increases. But the power of the BL procedure decreases as the number of tested hypotheses increases. This is unfortunate, because in genetic and fMRI research many hypotheses are tested and the data are dependent. The BL procedure could be a solution for these fields, but when the number of tested hypotheses is large, its power is only slightly better than that of the Bonferroni procedure. Although the BH and BL procedures maintain power quite well overall, in exactly the condition that could help genetic and fMRI researchers the power does not really differ from the Bonferroni procedure. The BH and BL procedures are a step in the right direction, but they are not (yet) the solution to the multiple comparisons problem.

A twofold multiple comparison problem in brain networks.

How does the brain work? Since the discovery that our brain underlies many of the characteristics that define us as human beings, many scientists have been trying to examine and understand this fascinating organ. The basic idea of neural networks providing spontaneous order through functional connections already stems from the late 19th century (e.g., James, 1890), but only quite recently has it become increasingly popular to study the brain by modelling the connections between brain units as a network. For my internship I read about networks from various scientific fields, and I noticed that network analysis seems particularly appealing for studying the brain, as we consider the brain to be a network both physiologically and functionally. Several studies have indeed illustrated that using network analysis to study the brain is a productive method. However, as intuitive as brain network analysis may be, its findings often remain difficult to interpret.

One of these difficulties I did not notice myself during my internship. During the course, however, my notion of statistical thresholds being as steady as rocks became progressively looser, and that was when I realised that brain network analysis in fact struggles with a twofold multiple comparison problem, that is, a potentially high number of false positive results in two phases of the analysis. The first problem arises when one has to construct the connections between the nodes, generating a massive number of comparisons. A second problem may manifest in a different stage of the analysis: once the brain networks are constructed, whole networks, or network parameter values from different experimental conditions or populations, may be compared. Although correcting for multiple comparisons may not necessarily be an issue in this stage, as the number of comparisons is nowhere near as massive as in the first phase, it will certainly not improve the already high error rate.
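To get a feeling for the size of the first problem, consider a hypothetical parcellation of the brain into 90 regions (the number is my assumption; any atlas of that order of magnitude gives the same picture). Testing every pairwise connection means thousands of comparisons:

```python
# hypothetical example: 90 brain regions, one correlation test per pair of regions
n_nodes = 90
n_edges = n_nodes * (n_nodes - 1) // 2   # number of undirected node pairs
alpha = 0.05

print(n_edges)           # 4005 tests for a single network
print(alpha * n_edges)   # ~200 edges expected to pass at alpha = .05 by chance alone
```

So even before any networks are compared, an uncorrected analysis can be expected to contain a couple of hundred spurious connections in a network with no real structure at all.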
Therefore I decided to reread my internship literature and explore how brain network studies have dealt with this problem so far. To this end, I listed for 30 brain network studies whether they corrected for multiple comparisons in the two different stages. I specified search criteria so I could check whether the authors addressed the problem, whether they performed some procedure to correct for it, and whether they estimated the extent to which this procedure would control the problem.
I found that nearly half of the studies did not use a correction for multiplicity at all, or used a rather arbitrary threshold, most of the time as a result of a lack of knowledge about how to approach the problem. I do want to emphasize that this does not seem an easy problem to overcome, and the reasons for not correcting seemed to make sense in some cases. Therefore, I argued that future research should study how the trade-off between high power and the error rate in brain network analysis influences reported network effects. Then we might be able to study how different correction procedures relate to specific parameters in brain network analysis, not least to develop explicit knowledge of an optimal approach that lowers the chance of false positive results and simultaneously accounts for the massive dependency in brain networks.

A few steps towards Good Science

The course ‘Good Science, Bad Science’ aimed to teach students to be critical. There are many questionable research practices in science that contribute to Bad Science. Because students are the scientists of the future, another aim was to start changing bad behaviour, to recognize pitfalls, and to learn good scientific practices.

This course has taught me to be aware of bad practices that occur in science. At the start of my career I want to be aware of these practices, and I want to be reminded of them in the future. For this reason I made a checklist with a few steps towards good science. When I conduct a study I want to use this poster as a checklist. Feel free to suggest any other steps, improvements or ideas that are important in science!

Outliers: What to do with them?

For the final assignment of the course I chose to learn more about outliers. When inspecting the data for my master's thesis I found an outlier. I looked up what to do with it, because I wanted to do the ‘right’ thing. That sent me down a path that seemed to have too many different answers to some questions, no answers to other questions, and mostly no end. My frustration led me to decide to make an overview.

I’ll explain a simple problem that I encountered. There are a few ways to identify an outlier: you can use z-scores, 1.5 × IQR, or a window of acceptable values. This is where I had to make the first choice. A window of acceptable values did not make sense for my data. I found 1 outlier when using a z-score cut-off of 3, but 3 outliers when using the 1.5 × IQR criterion. It wasn't clear to me which option would be best. After writing an overview that was as comprehensive as I could make it, I am a little bit closer to finding answers.
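The two rules can disagree on the very same data. Here is a small Python sketch (the cut-offs, |z| ≥ 3 and 1.5 × IQR, are the ones mentioned above; the sample values are invented):

```python
import numpy as np

def zscore_outliers(x, cut=3.0):
    """Flag points more than `cut` standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) >= cut

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

data = list(range(1, 11)) + [25]    # invented sample with one large value
print(zscore_outliers(data).sum())  # 0: the extreme point inflates the SD
print(iqr_outliers(data).sum())     # 1: the IQR fences still flag it
```

The disagreement is no accident: the extreme point inflates the very standard deviation used to compute its own z-score, while the quartiles are barely affected. This is exactly the kind of choice that leaves room for interpretation.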

I mainly found that there are simply many definitions of outliers, many tests to identify outliers, and many ways to ‘deal’ with outliers. It now makes sense to me that there are different options for how to identify and then deal with outliers: researchers investigate different questions with different types of data, and one rule does not fit all. But this also means that there is a lot of room for interpretation, and a lot of decisions have to be made. Sometimes researchers may decide without being aware of all the options. So to rule that out for myself and to learn more about outliers, this overview definitely helped.

Replication of Roskes et al. (2011)


In yesterday’s lecture, the different groups presented their work of the past weeks, some preliminary results, and an evaluation of the process of our replication study. The results will be more conclusive at the end of the week, but much can already be said about the process. I think we all agree that it was not easy: working within a group of 5 already takes a lot of effort, but coordinating 15 people seems to be an order of magnitude harder! Here it becomes more obvious how important it is, for example, to keep proper analysis and work protocols, or to write clear and informative emails, so that work can be smoothly taken over by a different group (things I did not find in any of the books on research methodology I scanned for my final paper). Nevertheless, we can be proud that we have managed to dive into the literature, set up the theoretical framework, gather 500 penalty shots, have 7 people rate them, and do the first sets of analyses in about 5 weeks' time. And the possibility of making a contribution to the PsychFileDrawer project is a rewarding prospect. Finally, we saw what it feels like to carry out a study sticking exactly to the preregistered report. During the research practicum in my bachelor's, I remember being astonished by the number of fellow students delivering sloppy work, often with the excuse that ‘it is just for the practicum, nobody will ever read it anyway.’ I think involving students in a study that will be posted online beforehand could be a way of making students feel more responsible for doing research properly. I think we did a good job overall, and I can imagine groups of students everywhere carrying out some of the many awaiting exact replications in psychology and posting them online.

Roskes, M., Sligte, D., Shalvi, S., & De Dreu, C. K. (2011). The right side? Under time pressure, approach motivation leads to right-oriented bias. Psychological science, 22, 1403-1407.

The Difference Between Significant and Non-significant

How to make a difference

Suppose you are a psychotherapist looking for an effective treatment for disorder X, a disorder you were unfamiliar with until now. You have discovered that there are two relatively new treatments (T1 and T2) available, and you decide to take a look at the available evidence. It turns out that only two studies have investigated the efficacy of T1 and T2 (one for each treatment); unfortunately, neither reports effect sizes or enough data for you to calculate them. All you can go by are the p-values: the first study reports p=.001 for T1 (compared with the control group), whereas the second study reports p=.16 for T2 (compared to control). Both studies use an alpha level of .05; what do you conclude?

You may be tempted to conclude that there is evidence that T1 is effective whereas T2 is not effective, but this need not be the case. Gelman and Stern (2006) argue that whenever a researcher wants to compare the effects of multiple factors, said researcher should test whether the difference between two effects is significant. The authors point out that it is wrong to simply dichotomize the statistical result into “significant” and “non-significant” and conclude that the effects differ.

Of course this is helpful advice for anyone planning a study involving such comparisons, but what about researchers who want to estimate the difference between two effects when they only have limited data such as the p-values or treatment effect estimates and their standard error to compare these effects?

Gelman and Stern do not offer any way to calculate the significance level of these differences, but it turns out that several other researchers tackled related problems long before Gelman and Stern. For example, Stouffer et al. (1949; as cited in Cooper & Hedges, 1994) presented a statistic (Stouffer’s z) for combining a set of p-values into one overall test:

ZStouffer = (Z1 + Z2 + … + Zk) / √k, where k is the total number of p-values and Zi is the standard normal value corresponding to the i-th p-value.

Rosenthal and Rubin (1979) offered a solution for comparing p-values directly: Zdiff. For two p-values, the idea is to convert both p-values back to Z values, to take the difference between the Z values (divided by √2, since the difference of two independent standard normal variables has a standard deviation of √2), and to look up the p-value associated with this “Zdiff” value.

For multiple p-values, the authors suggest one use “[t]he sum of squares of the deviations about the mean Z” (p. 1167) as a χ² statistic with df = k − 1, where k is the total number of Z values. Hence:

χ² = Σ (Zi − Z̄)², with df = k − 1, where Z̄ is the mean of the k Z values.
However, we must stress that it is much better for the original authors to plan the comparisons ahead of time and test for significance of the difference directly.
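Still, both statistics are easy to compute after the fact. Below is a sketch in Python using only the standard library (the function names are mine; the division by √2 in Zdiff reflects that the difference of two independent standard normal variables has standard deviation √2):

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()  # the standard normal distribution

def stouffer_z(pvals):
    """Combine k one-tailed p-values into one overall Z (Stouffer et al., 1949)."""
    z = [_N.inv_cdf(1 - p) for p in pvals]   # convert each p back to a Z value
    return sum(z) / sqrt(len(z))

def zdiff_p(p1, p2):
    """One-tailed p-value for the difference between two independent
    p-values (after Rosenthal & Rubin, 1979)."""
    z1, z2 = _N.inv_cdf(1 - p1), _N.inv_cdf(1 - p2)
    zdiff = (z1 - z2) / sqrt(2)              # SD of Z1 - Z2 is sqrt(2)
    return 1 - _N.cdf(zdiff)
```

Applied to the opening example, zdiff_p(0.001, 0.16) gives roughly .07: the two treatment effects do not differ significantly at the .05 level, even though one study was significant and the other was not.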

Real-life Example

Gelman and Stern mention a regression analysis in which significance levels were compared. The researchers wanted to know whether the high birth order found in homosexual men was due to having more older brothers or to having more older siblings of both sexes than heterosexual men. The regression analysis contained 6 coefficients, the first one for the number of older brothers and the second one for the number of older sisters. When only the coefficient for the number of older brothers was found to be a significant predictor of the men's sexual preference, the researchers concluded the following: “homosexual men have a higher birth order primarily because they have more older brothers”. The problem here is that the significance level of the coefficient for the number of older brothers was compared with the significance level of the coefficient for the number of older sisters: one was significant, the other was not. However, this is not what they wanted to test. They wanted to know whether just older brothers were a significant predictor, or siblings of both sexes. That is not what they tested with this regression.

They could have tested this by creating two new coefficients: 1. the first two predictors transformed into their sum (coefficient for the number of older siblings), 2. the first two predictors transformed into their difference (coefficient for the number of older brothers minus the number of older sisters). This way the first coefficient tells you the predictive value of having older siblings, the second coefficient tells you whether older brothers have more predictive value than older sisters. This is exactly the question the researchers wanted to answer.
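The recoding itself is mechanical. A sketch with simulated data (the Poisson counts, sample size and effect size are all invented; only the sum/difference trick comes from the discussion above) shows how the two new predictors answer the intended questions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
older_brothers = rng.poisson(1.0, n)
older_sisters = rng.poisson(1.0, n)
# invented truth: the outcome depends on older brothers only
y = 0.5 * older_brothers + rng.normal(0.0, 1.0, n)

# recoded predictors
sib_sum = older_brothers + older_sisters    # number of older siblings
sib_diff = older_brothers - older_sisters   # brothers minus sisters

X = np.column_stack([np.ones(n), sib_sum, sib_diff])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef[1]: predictive value of older siblings of either sex
# coef[2]: do older brothers predict over and above older sisters?
```

With the invented truth y = 0.5 × brothers, both recoded coefficients come out near 0.25, since 0.5 × brothers = 0.25 × (sum) + 0.25 × (difference); the difference coefficient is the direct test the original researchers needed.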

So just be careful with comparing the predictive value of coefficients in a regression analysis. If you want to differentiate between two coefficients, you should not do this based on the significance levels of those two coefficients, but directly compare them.

(by David & Barbara)


Cooper, H., & Hedges, L. V. (Eds.) (1994). The Handbook of Research Synthesis. New York: Russell Sage Foundation.

Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is itself not statistically significant. The American Statistician, 60, 328-331.

Rosenthal, R., & Rubin, D. B. (1979). Comparing significance levels of independent studies. Psychological Bulletin, 86, 1165-1168.

Measurement and permissible statistics in psychological science

In his influential paper on measurement theory, Stevens (1946) argued that different statistical operators (e.g., the mean), and therefore also the statistical tests that make use of these operators, are only permissible on certain measurement scales. Whether a statistical operator is appropriate for a scale depends on whether it is invariant under the scale's permissible transformations. Transformations can be applied to the data by various formulas, for example by the simple multiplication formula x’ = x * 4. If a statistical operator is not invariant, then conclusions drawn from statistical tests that use the operator will differ depending on how the results were measured. As Scholten and Borsboom (2009) explain: “For instance, it is possible that when scores on the aforementioned mathematical proficiency test are analyzed for sex differences with a t test, different results are obtained for the original and transformed scores. Boys may significantly outperform girls when analyzing the original scores, while boys and girls may not differ significantly in their performance when analyzing the transformed scores (or vice versa; see Hand (2004), for some interesting examples). Since there is no sense in which the original scores are preferable or superior to the transformed, squared scores, this means that research findings and conclusions depend on arbitrary, and usually implicit, scaling decisions on part of the researcher.”
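The point in this quotation is easy to demonstrate. Squaring is a monotone increasing transformation for positive scores, so it preserves every ordinal comparison, yet it can reverse which group has the higher mean. A toy example in Python with invented scores:

```python
from statistics import mean

a = [1, 1, 6]   # invented "original scores" for group A
b = [3, 3, 3]   # invented scores for group B

print(mean(a) < mean(b))        # True: B has the higher mean on the original scores

a_sq = [x * x for x in a]       # monotone transformation: squaring
b_sq = [x * x for x in b]
print(mean(a_sq) < mean(b_sq))  # False: A has the higher mean after squaring
```

Since the ordering of the group means depends on an arbitrary but permissible rescaling, a comparison of means on ordinal data answers a question about the scaling rather than about the attribute.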

Stevens summarized four measurement scales with their permissible transformations and statistical operators. All permissible transformations and statistics also apply to every scale higher than the one at which they are introduced.

The first scale is the nominal scale. At this scale, numbers are used to classify data into mutually exclusive, exhaustive categories on which no order can be imposed. Nominal assignments can be on an individual level, such as football players' jersey numbers, or on a group level, such as a person's religion. Permissible transformations are any one-to-one or many-to-one transformations, although a many-to-one transformation loses information. Permissible statistics are the number of cases, the mode, and association statistics (e.g., contingency correlation).

The second scale is the ordinal scale. Ordinal measurements describe order, but not relative size or the degree of difference between the items measured. For example, scores in tennis are rank ordered but cannot be subtracted; the difference between 15-30=15 and 30-40=10 is meaningless. According to Stevens, most psychological measurements are made on an ordinal scale; examples are measurements of intelligence and personality traits. Permissible transformations are any monotone increasing transformations, although a transformation that is not strictly increasing loses information. Permissible statistics are the median and percentiles. Note that the mean is not a permissible statistic at this scale.

The third scale is the interval scale. Data points on the interval scale are ordered, and equal intervals represent equal differences everywhere on the scale. For example, 10˚ − 5˚ = 5˚ just as 35˚ − 30˚ = 5˚. However, the zero point of the scale is arbitrary, and ratios therefore cannot be calculated. Permissible transformations are any affine transformations t(m) = c * m + d, where c and d are constants (the general linear group; x’ = ax + b in Stevens). Permissible statistics are the mean, standard deviation, rank-order correlation and product-moment correlation.

The fourth scale is the ratio scale. This scale is very similar to the interval scale, except that it has an absolute zero point. Examples of measurements on the ratio scale are temperature in kelvin, monthly salary and weight. Permissible transformations are any similarity transformations t(m) = c * m, where c is a constant (e.g., converting kilograms to pounds). The (new) permissible statistic is the coefficient of variation.


Lord (1953) wrote a satirical comment on Stevens' conclusions. Lord describes the story of a professor who distributed jersey numbers to his students. He often administered tests to his students, and in secret he compared the means and standard deviations of the test results of students with different jersey numbers. He taught his students very carefully: “Test scores are ordinal numbers, not cardinal numbers. Ordinal numbers cannot be added”. He knew very well that his comparisons of different jersey numbers were incorrect according to the latest theories of measurement.
After a while, the freshmen accused the professor of distributing low numbers to the freshmen. The professor consulted a statistician, who simply calculated that the chance of the freshmen having this average number, if the numbers had been randomly distributed, was very low.
When the professor argued that the statistician couldn't use multiplication on measurements taken on a nominal scale, the statistician replied: “If you doubt my conclusions… I suggest you try and see how often you can get a sample of 1,600 numbers from your machine with a mean below 50.3 or above 58.3.” So the professor started drawing samples and indeed found that it is very unlikely to obtain a mean below 50.3 or above 58.3.

So Lord argues that statistical methods can be used regardless of the scale of measurement: “The numbers do not know where they came from” (p. 751). However, in his paper Lord draws inferences regarding the measurements instead of inferences regarding the attributes. Or, as Scholten and Borsboom (2009) argue: “[It] is argued that the football numbers do not represent just the nominal property of non-identity of the players; they also represent the amount of bias in the machine. It is a question about this property – not a property that relates to the identity of the football players – that the statistical test is concerned with”. Scholten and Borsboom show that when the bias of the machine is assessed, the data are actually on an interval scale, and that Lord's article therefore actually supports Stevens' view.

To give some information about the problem in psychological science that most measurements are made on an ordinal instead of an interval scale, I will quote a text I found on the web:
“Suppose we are doing a two-sample t-test; we are sure that the assumptions of ordinal measurement are satisfied, but we are not sure whether an equal-interval assumption is justified. A smooth monotone transformation of the entire data set will generally have little effect on the p value of the t-test. A robust variant of a t-test will likely be affected even less (and, of course, a rank version of a t-test will be affected not at all). It should come as no surprise then that a decision between an ordinal or an interval level of measurement is of no great importance in such a situation, but anyone with lingering doubts on the matter may consult the simulations in Baker, Hardyck, and Petrinovich (1966) for a demonstration of the obvious.
On the other hand, suppose we were comparing the variability instead of the location of the two samples. The F test for equality of variances is not robust, and smooth monotone transformations of the data could have a large effect on the p value. Even a more robust test could be highly sensitive to smooth monotone transformations if the samples differed in location.
Measurement level is of greatest importance in situations where the meaning of the null hypothesis depends on measurement assumptions. Suppose the data are 1-to-5 ratings obtained from two groups of people, say males and females, regarding how often the subjects have sex: frequently, sometimes, rarely, etc. Suppose that these two groups interpret the term ‘frequently’ differently as applied to sex; perhaps males consider ‘frequently’ to mean twice a day, while females consider it to mean once a week. Females may report having sex more ‘frequently’ than men on the 1-to-5 scale, even if men in fact have sex more frequently as measured by sexual acts per unit of time. Hence measurement considerations are crucial to the interpretation of the results.”

To conclude, always be aware of which attribute you measure, on what scale you likely measure it, whether you can use certain statistics on that scale, and in what way you can relate the results of the analysis to the attribute.

Lecture by Angélique Cramer

Baker, B. O., Hardyck, C., & Petrinovich, L. F. (1966). Weak measurement vs. strong statistics: An empirical critique of S. S. Stevens' proscriptions on statistics. Educational and Psychological Measurement, 26, 291-309.
Lord, F. M. (1953). On the statistical treatment of football numbers. American Psychologist, 8, 750-751.
Scholten, A. Z., & Borsboom, D. (2009). A reanalysis of Lord's statistical treatment of football numbers. Journal of Mathematical Psychology, 53, 69-75.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

Replication – Data Collection

So the data collection for the replication study is in progress now.

It was a really hard piece of work to compile all the penalty videos and figure out a good way to rate the direction of the shot and the direction of the goalkeeper's dive. We started a few weeks ago by looking for tournaments we could use; we decided on the Champions League, the Europa League, the European Championship and the Copa America. Then it was time to dig through internet archives to find the videos of the shootouts we had identified in that way. We even went to the Dutch television and radio archive (“Beeld en Geluid”) to find high-quality footage of these games, but the results were disappointing, so we tried to find as many games as possible through YouTube. This was sometimes really gruelling, because some of the matches were simply not to be found, and for a lot of them the quality of the footage was on the lower end of mediocre. And then we had to check for every video whether the shootout was complete, which was particularly frustrating for videos that were missing only one penalty. I am very grateful to YouTube user “sp1873”, who had a lot of footage that we probably could not have found elsewhere. Although the whole process was sometimes tedious, for me as a football enthusiast it was also a lot of fun. We ended up with 50 different games, approximately 3 hours of footage in total. We are now in the process of recruiting our friends (who are blind to the hypotheses) to rate the penalties.

I hope that we will finish the data collection as quickly as possible, in order to see whether the original result – that goalkeepers whose team is behind are more likely to dive to the right – holds in our sample.