# Fooled by Variance?!

In the first assignment of this course we analyzed a dataset. This dataset was only half of the original dataset, and in the third assignment we redid the analysis on the other half. Some people found the same results in both halves, but most of us weren’t able to find all the effects found earlier. Some even found effects in the opposite direction. I, for example, found a difference between men and women on two tests in the first half of the dataset. I thought these effects were strong because the p-values of both effects were .003; in the second half of the sample these effects had p-values of .169 and .143. Because the sample was randomly split, I was very surprised to see what I thought were strong effects turn into no effects. Somebody else found ten significant effects in the first sample and was able to find only two of them in the second sample. This assignment shows how careful we should be when drawing conclusions.

I think the reason why I, and probably most of us, put too much trust in p-values is that we underestimate variance. That people underestimate variance is shown over and over again when they are asked to make up a sequence of coin flips that looks random. Most people make a sequence with at most five or six heads or tails in a row. In reality, in a long sequence it is not unlikely to find heads or tails ten or even more times in a row. I think we underestimate the variance of the effects we study in the same way we underestimate the variance of the coin flip. When we find a low p-value we think the effect is there, and we underestimate the possibility that the effect cannot be found in a similar sample.
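
The claim about streaks can be checked with a small simulation. The sketch below is only an illustration: the sequence lengths (100 and 1,000 flips), the number of simulated sequences and the helper names are my own assumptions, not something from the assignment.

```python
import random


def longest_run(flips):
    """Length of the longest streak of identical outcomes in a sequence."""
    best = current = 0
    previous = None
    for flip in flips:
        # Extend the current streak, or start a new one of length 1.
        current = current + 1 if flip == previous else 1
        best = max(best, current)
        previous = flip
    return best


def prob_run_at_least(k, n_flips, n_sims=2000, seed=1):
    """Estimate the chance that n_flips fair coin flips contain a streak of at least k."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        if longest_run(flips) >= k:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Assumed sequence lengths, chosen only to show how much length matters.
    for n_flips in (100, 1000):
        print(n_flips, "flips, streak of 10 or more:", prob_run_at_least(10, n_flips))
```

How often a streak of ten shows up depends strongly on how long the sequence is, which is exactly the kind of variability that is easy to underestimate.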

In my opinion, a perfect example of a significant effect that doesn’t exist is the study by Roskes et al. (2011). They found that soccer goalkeepers are more likely to dive to the right than to the left when their team is behind in a penalty shootout. To see if this is a true effect we are going to replicate their study. We will use the exact same methods as Roskes et al. and analyze the data in the same way they did, to make sure our studies can be compared. But because they analyzed the data in the wrong way (see blog post below), we will also do the proper analysis. To do this we are going to document beforehand exactly which and how many data we will collect, how we will score these data and which analyses we will use to establish whether there is an effect. Hopefully Roskes et al. can agree with us in advance that we do the replication just like the original study. In that way we can decide whether Roskes et al. got fooled by variance or found an existing effect.

Roskes, M., Sligte, D., Shalvi, S., & De Dreu, C. K. W. (2011). The right side? Under time pressure approach motivation leads to right-oriented bias. Psychological Science, 22, 1403-1407.

# What went wrong in the original analysis of Roskes et al. (2011)?

Roskes et al. (2011) measured whether goalkeepers dove to the right, middle or left when their team was behind, tied or ahead. To analyze these data you would expect a 3×3 table, which can be analyzed with the Chi square test. With their data that table looks like this:

| | Left | Middle | Right |
|---|---|---|---|
| Behind | 7 | 0 | 17 |
| Tied | 47 | 3 | 48 |
| Ahead | 42 | 2 | 38 |

To analyze these data with the Chi square test, its assumptions must be checked: in this case, enough observations in each cell and independence of the observations. Both are a problem here. There are very few observations in the middle category, and the data are not independent because many of the same goalkeepers defend their goal on several penalties. The assumptions are not met, so the Chi square test cannot be used.

Roskes et al. used the Chi square test anyway, but in the wrong way. When doing a Chi square test you first do the analysis on all the data. Only if there is an effect can you explore the data further to see what the effect is. Testing all the data gives X²(4) = 4.98, p = .289, so there is no significant effect: the data give no reason to believe that the diving direction of goalkeepers depends on whether their team is behind, tied or ahead. This is not what Roskes et al. concluded; they went on analyzing the data. Perhaps they dropped the middle category and combined the tied and ahead categories, since they were interested in whether goalkeepers dive more to the right when behind, but not when tied or ahead. The data then look like this:
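
The overall test can be reproduced from the 3×3 table with a few lines of Python. This is a minimal sketch, not the analysis script of Roskes et al.; the function names are made up for illustration, and the p-value helper only covers the degrees of freedom needed in this post.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (this sketch: df = 1 or even df)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df % 2 == 0:
        half = x / 2
        return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))
    raise ValueError("this sketch only handles df = 1 or even df")


def pearson_chi2(table):
    """Pearson chi-square test of independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count per cell under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df), expected


# Rows: Behind, Tied, Ahead. Columns: Left, Middle, Right.
observed = [[7, 0, 17], [47, 3, 48], [42, 2, 38]]

if __name__ == "__main__":
    stat, df, p, expected = pearson_chi2(observed)
    print(f"X2({df}) = {stat:.2f}, p = {p:.3f}")
    for row in expected:
        print([round(e, 2) for e in row])
```

The expected counts in the middle column all come out well below five, which is the small-cell problem mentioned above.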

| | Left | Right |
|---|---|---|
| Behind | 7 | 17 |
| Tied or Ahead | 89 | 86 |

Testing these data gives X²(1) = 3.16, p = .076, so again there is no significant effect. This is not what Roskes et al. conclude. They decide to drop the control condition (Tied or Ahead) and just test whether goalkeepers dive more to the right than to the left when behind. They find a significant effect, X²(1) = 4.17, p = .041, and draw the wrong conclusion that goalkeepers dive more to the right than to the left when behind, but not when their team is tied or ahead. This conclusion is wrong because they dropped the control condition: they can only conclude that goalkeepers dive more to the right than to the left when behind. Of course, given the violated assumptions, even that conclusion is problematic.
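
The two follow-up tests can be sketched in the same way. The function names are again illustrative assumptions. One thing to note: the value X²(1) = 3.16 for the 2×2 table is reproduced only when Yates’ continuity correction is applied, so the sketch assumes that correction was used; without it the statistic comes out at about 3.98 (p of about .046).

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


def chi2_2x2(a, b, c, d, yates=True):
    """Chi-square test for the 2x2 table [[a, b], [c, d]].

    yates=True applies Yates' continuity correction (an assumption about
    how the value reported in the text was obtained).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    stat = n * diff**2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi2_sf_1df(stat)


def chi2_equal_split(left, right):
    """Goodness-of-fit test of a 50/50 split between two categories."""
    expected = (left + right) / 2
    stat = ((left - expected) ** 2 + (right - expected) ** 2) / expected
    return stat, chi2_sf_1df(stat)


if __name__ == "__main__":
    # Behind: 7 left, 17 right. Tied or Ahead: 89 left, 86 right.
    print("2x2 with correction:    X2 = %.2f, p = %.3f" % chi2_2x2(7, 17, 89, 86))
    print("2x2 without correction: X2 = %.2f, p = %.3f" % chi2_2x2(7, 17, 89, 86, yates=False))
    print("Behind only, 7 vs 17:   X2 = %.2f, p = %.3f" % chi2_equal_split(7, 17))
```

The last test uses only the 24 dives made while behind, which is why it cannot say anything about the tied or ahead conditions.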

Roskes, M., Sligte, D., Shalvi, S., & De Dreu, C. K. W. (2011). The right side? Under time pressure approach motivation leads to right-oriented bias. Psychological Science, 22, 1403-1407.

# Controlling the False Discovery Rates in Behavior Genetic Research

Rachel King & Sara Boxhoorn

To understand the importance of controlling for false discoveries in genetic research, we must first appreciate that genetics is still a novel and interesting research area that gets much media attention. One must also be aware, however, that the media tend to distort or misrepresent the research. Although not always problematic, this misunderstanding can sometimes lead to negative social consequences. Therefore, it is of the utmost importance to make sure that the research underlying the media’s assertions is both clear and reliable. However, large false discovery rates in behavioural genetics make it very likely that the scientific research is already flawed, and its conclusions incorrect, in the first place.

Some reviews have investigated how big the problem of false discoveries is for genetic research and have concluded that a large percentage of initial findings could be incorrect or simply a capitalization on chance. A review of gene-by-environment interaction research from 2000 to 2009 compared the significance rates of novel studies and replication studies. Around 96% of the novel studies reported a significant result, compared with only 27% of the attempted replications (Duncan & Keller, 2011), which illustrates the large number of initial false positives. This is backed up by the inability of genome-wide research to replicate associations reported in candidate gene studies, highlighting the need for improved methodology.

In order to reduce the false discovery rate in genetic research, we must first identify its causes. Many factors contribute to the number of false positives. Firstly, genetics is complex and has inherent sensitivities. For example, single genes have only a small effect on behaviour and interact not only with other genes but also with the environment. On a larger scale there is a lack of replication, as well as publication bias. However, the more severe problems occur at the level of methodology. As the effects of genes are generally unknown, a large number of comparisons of different behavioural characteristics is normally performed per study. This inflates the error rate massively and, without correction, can lead to many type I errors.
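
How quickly the error rate inflates can be shown with a simple formula: if m independent tests are each run at level alpha and all null hypotheses are true, the chance of at least one false positive is 1 − (1 − alpha)^m. The sketch below assumes independent tests, which behavioural measures rarely are, so the numbers are only a rough guide; the function name is made up for illustration.

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive among m independent tests of true nulls."""
    return 1 - (1 - alpha) ** m


if __name__ == "__main__":
    # Numbers of comparisons chosen arbitrarily for illustration.
    for m in (1, 5, 20, 100):
        print(f"{m:>3} comparisons: {familywise_error_rate(m):.2f}")
```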

Researchers have tried to correct for the problem of multiple comparisons, but they have run into problems. Firstly, the Bonferroni correction, the best-known approach, may be too conservative. The large number of comparisons makes the alpha per comparison very small, so the study loses power and it becomes difficult to find effects. To regain power, genetic researchers made the following compromise: first they ran a liberal study, which allowed the error rate to be as large as one half, and then they ran a smaller, more conservative follow-up study to confirm the findings. However, this was not an optimal solution, as the confirmatory studies mostly failed to replicate the initial (too) liberal studies. In addition, because research is expensive and time consuming, the confirmatory study is sometimes not performed at all, which aggravates the problem of false discoveries. Another method researchers tried was to move away from hypothesis testing and to check confidence intervals instead. However, checking a separate confidence interval for each comparison raises the same multiple comparison problem. As a consequence of the loss of power and the obvious inadequacies of the other trialled solutions, most researchers have, unfortunately, argued against correcting for multiple comparisons. Although psychological journals require that multiplicity is addressed, many medical journals do not, and therefore many false discoveries are published in top medical journals. Once again this emphasises the crucial need for alternative solutions to tackle the false discovery rate within genetic research.

Benjamini and his colleagues (2001) proposed several procedures, based on the false discovery rate, that correct for the problem of multiple comparisons whilst still retaining power. The first procedure was proposed by Benjamini and Hochberg (BH). First you rank the m observed p-values from the smallest (rank i = 1) to the largest (rank i = m). Then you start from the largest p-value and compare each p-value to a constant based on its rank and the overall number of comparisons (0.05 × i/m). Once you reach an observed p-value that is smaller than or equal to its constant, you stop the stepwise procedure and declare that p-value and all smaller p-values significant. This means the alpha remains larger than with the Bonferroni correction, so the procedure is less conservative. However, it is based on the assumption that the comparisons are either independent or positively dependent. For instances when this assumption is not met, Benjamini and Liu (BL) proposed a variation with the constant min(0.05, 0.05 × m/(m + 1 − i)²). This procedure is again performed in a stepwise fashion, but starting from the smallest p-value. As long as the observed p-value is smaller than or equal to its constant, it is declared significant; at the first observed p-value that is larger than its constant you stop, and that p-value and all larger ones are not significant. This procedure is slightly more conservative than the BH procedure. It is worth noting that the constant grows with the rank and is therefore capped: once it would rise above 0.05, the remaining values are compared to 0.05.
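
The two stepwise procedures can be sketched as follows. This is an illustration that follows the constants given above, not code from Benjamini et al.; the function names and the ten example p-values are made up.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """BH step-up procedure: returns one True/False (significant or not) per p-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    n_significant = 0
    for rank, idx in enumerate(order, start=1):
        # Remember the largest rank whose p-value is at or below q * rank / m.
        if pvalues[idx] <= q * rank / m:
            n_significant = rank
    significant = [False] * m
    for idx in order[:n_significant]:
        significant[idx] = True
    return significant


def benjamini_liu(pvalues, q=0.05):
    """BL step-down procedure with constants min(q, q * m / (m + 1 - rank)**2)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        constant = min(q, q * m / (m + 1 - rank) ** 2)
        if pvalues[idx] > constant:
            break  # this p-value and all larger ones are not significant
        significant[idx] = True
    return significant


# Made-up p-values from ten hypothetical comparisons.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

if __name__ == "__main__":
    bonferroni = [p <= 0.05 / len(example) for p in example]
    print("Bonferroni:", sum(bonferroni), "significant")
    print("BH:        ", sum(benjamini_hochberg(example)), "significant")
    print("BL:        ", sum(benjamini_liu(example)), "significant")
```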

Considering the possible downsides of these corrections, it appears that the BL procedure is not the most powerful solution when your observations are correlated; both resampling methods and Bayesian statistics may offer better solutions. Furthermore, it is possible to manipulate these procedures by including behavioural comparisons which you already know to be highly significant. Because the procedure is based on ranking, the constant gets larger as the rank gets higher, so adding highly significant comparisons results in a larger constant against which the other results are judged. If such manipulation is suspected, a sensitivity test can be performed by removing the smallest p-values and recalculating the constants.

In conclusion, the procedures discussed above may not be optimal or unambiguous. Still, because multiplicity is key to explaining the false discovery rates in behavioural genetics, studies that do not apply any correction for multiplicity should be treated with the greatest possible caution. As Benjamini et al. illustrated, not correcting might lead to falsely accepting around half of the hypotheses. Most importantly, procedures such as the BH procedure strike a reasonable balance between controlling false discoveries and retaining power, leaving researchers no grounds to argue against correcting for multiplicity.

Benjamini, Y., Drai, D., Elmer, G., Kafkafi, N., & Golani, I. (2001). Controlling the false discovery rate in behavior genetics research. Behavioural Brain Research, 125, 279-284.

Duncan, L. E., & Keller, M. C. (2011). A critical review of the first 10 years of candidate gene-by-environment interaction research in psychiatry. American Journal of Psychiatry, 168, 1041-1049.

# Checking Assumptions

We presented the article ‘Are assumptions of well-known statistical techniques checked, and why (not)?’ (Hoekstra, Kiers and Johnson, 2012). The two important assumptions we talked about in our presentation are the assumption of normality and the assumption of homogeneity of variances. The top figure shows an example of two groups with homogeneous variances and, just beneath it, two distributions with different variances. The bottom figure (red) shows two distributions that are not normally distributed. The t-test, ANOVA and regression analysis are all fairly robust to violations of these assumptions when sample sizes are equal across groups. However, if the sample sizes are not equal or if the violations of the assumptions are extreme, then the results of these tests can be nonsensical.

Since the conclusions drawn from the test results can be nonsensical, the authors of the presented article wanted to investigate whether researchers check the validity of assumptions and, in cases where they don’t, whether they at least have good reasons not to (e.g. equal sample sizes). Hoekstra et al. (2012) showed that researchers probably do not check their assumptions as often as they should. Only around six percent of the articles published in Psychological Science and in three educational psychology journals mention the validity of the assumptions. Hoekstra et al. (2012) wanted to know whether assumptions are so rarely mentioned because researchers always investigate their validity but never report it, or because researchers forget to investigate it. The authors asked PhD students to analyze six simple datasets. Almost none of the PhD students investigated the validity of the assumptions. Perhaps you’re thinking: well, these freshly graduated PhD students know everything about the robustness of tests, so that is probably why they don’t check assumptions. But not only did they not test the assumptions, most of them (over 80%, depending on the test) were unfamiliar with the assumptions. Although the t-test, ANOVA and regression analysis are fairly robust to violations of both the normality assumption and the equality of variances assumption, you would expect PhD students to be at least familiar with the assumptions. The results of the presented article show that we can’t blindly trust that researchers check their assumptions, and that some test results in published articles might be nonsensical.

What should you do to check these assumptions? Always plot your data! To check for normality use Q-Q plots, and for homogeneity of variances use box plots of the groups side by side. Another way to check your assumptions is to use preliminary tests, although their use is debated (e.g. they can increase the probability of a type 1 error; for regression see Caudill, 1988, and for the t-test see Rochon, Gondan & Kieser, 2012). For normality you can use the Kolmogorov-Smirnov test or the Shapiro-Wilk test, and for homogeneity of variances Levene’s test.
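
To make the two checks concrete, the sketch below computes the points of a normal Q-Q plot and Levene’s W statistic using only the Python standard library. The helper names `qq_points` and `levene_w` and the example scores are made up for illustration. The W statistic here uses absolute deviations from the group mean (a variant based on the group median also exists), and the sketch stops at the statistic: a p-value would require the F distribution, for which a statistics package is the practical choice.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pairs (theoretical normal quantile, observed value) for a normal Q-Q plot."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    # Plotting positions (i - 0.5) / n avoid the undefined quantiles at 0 and 1.
    return [
        (fitted.inv_cdf((i - 0.5) / n), value)
        for i, value in enumerate(sorted(sample), start=1)
    ]


def levene_w(*groups):
    """Levene's W: a one-way ANOVA F statistic on absolute deviations from group means."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    deviations = [[abs(x - mean(g)) for x in g] for g in groups]
    group_means = [mean(z) for z in deviations]
    grand_mean = sum(sum(z) for z in deviations) / n_total
    between = sum(len(z) * (zm - grand_mean) ** 2 for z, zm in zip(deviations, group_means))
    within = sum((v - zm) ** 2 for z, zm in zip(deviations, group_means) for v in z)
    return (n_total - k) / (k - 1) * between / within


if __name__ == "__main__":
    # Made-up scores for two groups with clearly different spread.
    group_a = [4.1, 5.0, 5.2, 4.8, 5.5, 4.9, 5.1]
    group_b = [1.0, 9.5, 3.2, 7.8, 5.0, 0.4, 8.9]
    for theoretical, observed in qq_points(group_a):
        print(f"{theoretical:6.2f}  {observed:6.2f}")
    print("Levene's W:", round(levene_w(group_a, group_b), 2))
```

If the data are roughly normal, the Q-Q points lie close to a straight line; a W near zero means the groups have similar spread.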

Monique Duizer, Sam Prinssen

References:
Caudill, S. B. (1988). Type 1 errors after preliminary tests for heteroscedasticity. The Statistician, 37, 65-68.

Hoekstra, R., Kiers, H.A.L. & Johnson, A. (2012). Are assumptions of well-known statistical techniques checked, and why (not)? Frontiers in Psychology, 3:137. doi: 10.3389/fpsyg.2012.00137

Rochon, J., Gondan, M. & Kieser, M. (2012). To test or not to test: Preliminary assessment of normality when comparing two independent samples. BMC Medical Research Methodology, 12:81. doi:10.1186/1471-2288-12-81

# Do you want to dichotomize your continuous variable? Think twice!

In 2000 George W. Bush won the presidential election against Al Gore. What was remarkable about this election was that Gore got more votes than Bush. Bush won because the votes are counted by state: almost every state goes, as a whole, to either the Democratic or the Republican candidate. Because of this voting system it is possible to get more votes overall, but still lose the election. This is an example of what can happen to the conclusions you draw when data are forced into groups.

Dichotomization is the division of a continuous variable into two groups. An example in science is the personality trait Extraversion. Figure 1 shows a scatterplot of the relationship between extraversion and performance. To keep things simple, you could divide the extraversion scale into two groups: low extraverts and high extraverts (Figure 2).

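[Figure 1: scatterplot of extraversion and performance. Figure 2: the same data with extraversion split into low and high extraverts.]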

Dichotomization makes the analysis simple and the results easy for readers to understand. But dichotomization is dangerous for your data!

When dichotomization is used, it harms your measurement. A few problems are listed here:

1. Loss of information
2. Loss of effect size and power
3. Occurrence of spurious effects
4. Risk of overlooking non-linear effects
5. Problems in comparing findings
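
The loss of effect size (problem 2 in the list above) can be illustrated with a small simulation. The sketch assumes two normally distributed variables with a true correlation of .5 and a split at the median; the sample size, seed and function names are arbitrary choices for illustration.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_median_split(n=10_000, rho=0.5, seed=42):
    """Correlation of y with continuous x, and with x split at its median."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # Construct y so that its population correlation with x equals rho.
        y = rho * x + (1 - rho**2) ** 0.5 * rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    cutoff = sorted(xs)[n // 2]
    high_group = [1 if x > cutoff else 0 for x in xs]  # 1 = high, 0 = low
    return pearson_r(xs, ys), pearson_r(high_group, ys)


if __name__ == "__main__":
    r_continuous, r_split = simulate_median_split()
    print("continuous x:  r =", round(r_continuous, 2))
    print("median split:  r =", round(r_split, 2))
```

Under these assumptions the median split is expected to shrink the correlation to roughly 80% of its original size, and a smaller observed effect also means less power.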

The reasons why people use dichotomization differ; we mention a few. Some researchers argue that their data are more reliable after dichotomization. They claim that the continuous measure X (Extraversion, for example) is not reliable enough to provide precise information about individual differences, but that a dichotomized variable can be trusted to indicate whether individuals belong to the high or the low group on that variable. MacCallum et al. (2002) show this is not the case. Other researchers use dichotomization because it happens to give them a higher correlation; reporting it is tempting, but they forget to look at the negative influence on the measurement. The most frequently heard argument is that distinct groups really underlie the data. In that case dichotomization is acceptable, but researchers often merely assume there are different groups based on theory and don’t use taxometric methods to see whether the groups are really there. Dichotomization is allowed only when both theory and statistical analysis show that there are different groups.

This is the case in depression research. In clinical research a common measure of depression is the Beck Depression Inventory. A small percentage of people will score high on this questionnaire; the rest of the people will score 0. It is then possible to divide this continuous variable into two groups: depressive and non-depressive participants. Although this is an example where dichotomization is acceptable, information is still lost, because all the participants in the depressive group are treated as equal even though they differ in the severity of their depression. Lightly depressed people can behave differently from severely depressed people, so they are not similar and cannot be seen as one group. A solution would be to conduct separate regression analyses within the depressive group and to treat severity within this group as a continuous variable.

The solution for all these problems is simple. Use regression analysis with continuous variables!

If you want to know more about dichotomization we recommend reading the following article: MacCallum, R.C., Zhang, S., Preacher, K., & Rucker, D.D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40.

Daan & Sanne

# From American psychology students to all people: can we generalize using only WEIRD subjects?

WEIRD subjects, that is, Western, Educated, Industrialized, Rich and Democratic subjects, are overrepresented in the total subject pool of psychological research. In a perfect world, we would take a representative sample of all people in the group to which we want to generalize. We would test them and generalize to the group we took the sample from, for example all people in the world or all people in North America. However, in reality we often end up testing only WEIRD subjects and generalizing to all people in the entire world. Arnett (2008) analyzed six top journals and found that 68% of the subjects were from the USA (4.5% of the world population) and 96% were from Western industrialized countries (15% of the world population). Also, 67% of the American samples and 80% of the other samples consisted of psychology students. You are therefore not only taking a sample from a small Western subgroup of the world, but also from a small subgroup that is a lot more educated, industrialized, rich and democratic than the rest of the world. This is a problem, because studies that compare WEIRD and non-WEIRD subjects show differences between these two groups. For example, the Müller-Lyer illusion (see picture), in which line (a) looks longer than line (b) while they are actually equal in length, is evidently present in subjects from the USA but was not found in foragers in southern Africa.

To solve this, Henrich, Heine, and Norenzayan (2010) recommend a structural change in the incentive system. They recommend explicitly discussing and defending the generalizability of results and the representativeness of conclusions. Generalizations should be empirically grounded, and for each generalization we need to ask ourselves ‘to exactly which people do we generalize?’ Furthermore, we should give explicit information about our subject pools and make our data files available online; granting agencies should credit researchers for tapping and comparing diverse subject pools; departments and universities should build research links to diversify subject pools and networks; and partnerships with non-WEIRD institutions should be stimulated (Henrich et al., 2010).

Astuti and Bloch state that information about each participant, such as personal history, education and values, should be taken into consideration. Only with specific information about each participant can conclusions be drawn about all participants together as a group. Fessler states that cultural congruence between investigators and participants should be avoided. Key features of the psychological phenomenon at issue may be overlooked when researchers and their participants share fundamental cultural commonalities and when researchers take their own folk models as a starting point. Rai and Fiske state that psychological research should be based on observation and description. Researchers should get away from their desks, go into the naturalistic field setting, and extensively observe and explicitly describe in detail how their participants perform a task. In sum, these researchers recommend research programs that are large-scale and highly interdisciplinary, and that involve fully international networks, diverse populations of both investigators and participants, and an integrated set of methodological tools (Henrich et al., 2010).

However, you could argue that there is no harm in just using WEIRD subjects. Who is interested in the math skills of a non-educated child in Africa? On this view, that is simply not relevant. So when we say that non-WEIRD subjects should also be tested, keep in mind that the choice of subjects should make sense for the research question. Still, we think the main point is that if you use WEIRD subjects, you have to be more careful about the group to which you generalize your conclusions. You can no longer say ‘all people’ when testing a sample that is not representative of all people.

Frank & Vera

References
Arnett, J.J. (2008). The neglected 95%. Why American psychology needs to become less American. American Psychologist, 63, 602-614.

Astuti, R., & Bloch, M. (2010). Why a theory of human nature cannot be based on the distinction between universality and variability: Lessons from anthropology. [Peer commentary on the journal article ‘The weirdest people in the world?’, by J. Henrich, S. J. Heine, & H. Norenzayan]. Behavioral and Brain Sciences, 33, 83-84. doi:10.1017/S0140525X10000026

Fessler, D. M. T. (2010). Cultural congruence between investigators and participants masks the unknown unknowns: Shame research as an example. [Peer commentary on the journal article ‘The weirdest people in the world?’, by J. Henrich, S. J. Heine, & H. Norenzayan]. Behavioral and Brain Sciences, 33, 92. doi:10.1017/S0140525X10000087

Henrich, J., Heine, S.J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33, 61-135.

Rai, T. S., & Fiske, A. (2010). ODD (observation- and description-deprived) psychological research. [Peer commentary on the journal article ‘The weirdest people in the world?’, by J. Henrich, S. J. Heine, & H. Norenzayan]. Behavioral and Brain Sciences, 33, 106-107. doi:10.1017/S0140525X10000221

# Methodological Mistakes part 2

Today was part two of the presentations on Methodological Mistakes. Mathias and Anja started with a presentation on Simpson’s Paradox, about which you can read more in a previous blog post. Using some fun and understandable examples, they explained not only what Simpson’s Paradox is, but also how we can avoid making this mistake, and they offered some solutions. As they started out, it seemed vaguely familiar and I remembered having seen some of the examples before. However, since this was a long time ago, it was a good thing that they refreshed my memory. Anja and Mathias explained during the presentation and in their blog post that one of the reasons people do not recognize this paradox is “the tendency to interpret correlational data causally”. It was suggested that Simpson’s paradox has a frequency of 1.67%; mistaking correlation for causation is probably an even bigger problem. One harmful example is the mistaken causal link between vaccinations and autism, which caused (yes, caused!) a sharp drop in the number of children that were vaccinated (for more information on this specific example see: http://en.wikipedia.org/wiki/MMR_vaccine_controversy). This shows how dangerous our common sense conclusions can be and how important it is that we keep watching out for these “causations”.

Next, Sara and Rachel presented the false discovery rate in behavior genetics research. They explained how multiple comparisons inflate the error rate in scientific research. To explain this problem I need more than a blog post, so for more information on this topic see: http://en.wikipedia.org/wiki/Multiple_comparisons.

A solution for the inflation of the error rate in multiple comparisons is not that simple; nonetheless it is important that researchers think of solutions before they start to analyze their data. We concluded the day with some discussion of our replication study, and I am starting to realize that a replication is not that simple at all. Will we be able to finish this in such a short amount of time?

# “The Earth Is Round (p<.05)”

In the context of our presentation about power issues in psychological research, we were very pleased to introduce you to, or remind you of, one of the most brilliant, and therefore classic, articles in the field of psychological methods: “The Earth Is Round (p<.05)” by Jacob Cohen (1994). In his article, Cohen questions the procedure of null hypothesis significance testing (NHST), which is still dominant today. The fact that this procedure is still broadly used is quite remarkable since we know that:

• H0 is seldom true
• we are simplifying reality when we think in those binary terms, like: effect vs no effect
• we don’t even “get what we want” by applying this procedure

Cohen questions the idea of living by p-values alone and suggests that scientists avail themselves of the many tools in the statistical toolbox. He made clear that what many researchers want from statistical testing, and therefore conclude from it, is the probability that a hypothesis is true given the evidence (the data), whereas what one gets from a standard test is the probability of the evidence assuming that H0 is true.

Many people do not know exactly how NHST works, and this is the source of many errors in applying the procedure. One of the most ignored and neglected components of this procedure is power! Whether an effect is called significant depends on the chosen alpha, the sample size and the effect size, and together these determine the power. To illustrate the interrelations between the different components, check the graphical representation on the following website: http://www.stat.wvu.edu/SRS/Modules/HypTest/exam1.html

The power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is false. Although Cohen questioned NHST in 1994, he spent a huge amount of effort on describing and clarifying this procedure. For instance, his 1988 book describes different power calculations and contains extensive tables of power values for various effect sizes, alphas and sample sizes. A more recent development in the area of power calculation is G*Power 3 (Faul et al., 2007), easy-to-use software that computes the power for many different statistical tests.
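
To show how alpha, sample size and effect size together determine power, here is a minimal sketch for a two-sided comparison of two equally large groups. It uses a normal approximation rather than the exact t distribution, so its values will differ slightly from those of dedicated power software; the function name and the example values are assumptions for illustration.

```python
from statistics import NormalDist


def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-group test for standardized effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect size is d.
    shift = d * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # A medium effect (d = 0.5) with different group sizes, chosen for illustration.
    for n in (20, 64, 100):
        print(f"n = {n:>3} per group: power = {power_two_groups(0.5, n):.2f}")
```

With no true effect (d = 0) the function returns alpha, as it should: the only way to reject is a false positive.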

We hope that this exploration of probability space has shown you that the neglect of power is bad science.

M&M

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1994). The earth is round (p<.05). American Psychologist, 49, 997-1003.
Faul, F., Erdfelder, E., Lang, A.-G. & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175-191.

Today, we presented the article about Simpson’s Paradox by Rogier Kievit et al. (in press). Simpson’s Paradox refers to the counterintuitive phenomenon that an association appearing at the population level is reversed within subgroups. One popular example of Simpson’s paradox is the Berkeley admissions problem. Berkeley University was actually sued because, when considered in aggregate, the proportion of women admitted was considerably smaller than the proportion of men admitted. As it turned out, this was not institutionalized gender discrimination: across the individual departments, women were actually more likely to be admitted in four departments, while men were admitted more frequently in only three departments. This somewhat confusing picture can be resolved by looking at base rates: more women applied to the more selective departments. In this example, the originally hidden confounder “department” was the cause of the confusion.
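
The mechanism is easy to reproduce with made-up numbers. The counts below are hypothetical, not the Berkeley data, and the two department labels and function names are assumptions for illustration: women have the higher admission rate in each department, but mostly apply to the more selective one.

```python
# Hypothetical admission counts, invented for illustration only:
# department -> group -> (admitted, applied)
applications = {
    "less selective": {"men": (80, 100), "women": (18, 20)},
    "more selective": {"men": (4, 20), "women": (25, 100)},
}


def aggregate_rate(data, group):
    """Admission rate for a group when all departments are pooled."""
    admitted = sum(dept[group][0] for dept in data.values())
    applied = sum(dept[group][1] for dept in data.values())
    return admitted / applied


def rates_by_department(data, group):
    """Admission rate for a group within each department."""
    return {name: dept[group][0] / dept[group][1] for name, dept in data.items()}


if __name__ == "__main__":
    for group in ("men", "women"):
        print(
            group,
            "overall:", round(aggregate_rate(applications, group), 2),
            "per department:", rates_by_department(applications, group),
        )
```

Pooled over departments the men appear to be favoured, while within each department the women do better: the direction of the association flips once the confounder is taken into account.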

Generally, Simpson’s Paradox seems to be more frequent than one would assume, at least in categorical data. A simulation study by Pavlides and Perlman (2009) suggests a frequency of 1.67%. A number of shortcomings in human reasoning have been held responsible for the inability to detect Simpson’s paradox, the tendency to interpret correlational data causally being the most persistent and harmful. Another related problem is inference across the wrong levels: from the aggregate group to subgroups, from interindividual to intraindividual trends, or from cross-sectional to longitudinal trends.

Some of our examples showed that Simpson’s paradox can also occur in continuous data. While with categorical data careful consideration of conditional independence is the only way to find Simpson’s paradox, there are two principal techniques to detect it in continuous data: analysis of homoscedasticity and cluster analysis. Cluster analysis especially can reveal different subgroups in the data, but only if the clusters are clear-cut or if an additional grouping variable is provided.

All in all, it does seem important to detect Simpson’s paradox. However, it is even more important to keep our basic reasoning errors in mind and try to avoid them. Humans do have a tendency to ascribe causality easily, and researchers are no different, despite their methodological training. So always inspect your data carefully and, as a general rule, be careful when generalizing!

Anja & Mathias

References:

Pavlides, M. G., & Perlman, M. D. (2009). How likely is Simpson’s Paradox? The American Statistician, 63, 226-233.

Kievit, R. A., Frankenhuis, W. E., Waldorp, L. J., & Borsboom, D. (in press). Simpson’s Paradox in Psychological Science: A Practical Guide. Perspectives on Psychological Science.

# P-value Confusion

In psychology we propose hypotheses to explain or to describe behaviour. Experiments are designed or questionnaires are used to test these hypotheses. We analyse the data of those experiments with statistics. These statistics decide whether the data support our hypotheses.

During the bachelor of psychology we learned one statistical approach to analysing data: the use of a p-value. In practice, when the p-value is below .05, the data are taken to support your hypothesis and you can send your article to a journal for publication. During the bachelor we didn’t learn that there are pervasive problems with p-values and that there is an alternative approach to analysing data: a Bayesian approach. If you are interested in why p-values are a problem and want to know more about the Bayesian approach, I recommend reading Bayesian versus frequentist inference (see the reference).

I was confused when the lecture was finished. If the use of p-values is problematic and an alternative approach is available, why aren’t we taught this Bayesian approach? People make big decisions based on conclusions from science. If these conclusions are based on p-values, is it safe to make these decisions? A final disturbing thought was the realization that I will continue to work with problematic p-values. A lack of knowledge of how to conduct a Bayesian analysis is one reason. Another reason is that my future supervisor probably uses the p-value.

Universities should teach the shortcomings of p-values and the existence of an alternative approach. The course Good Science Bad Science is a good start. It has taught me to be more sceptical towards p-values.

Reference

Wagenmakers, E.-J., Lee, M. D., Lodewyckx, T., & Iverson, G. (2008). Bayesian versus frequentist inference. In H. Hoijtink, I. Klugkist, and P. A. Boelen (Eds.), Bayesian Evaluation of Informative Hypotheses, pp. 181-207. Springer: New York.