Missing Data: What to Do?

Researchers in psychology and the social sciences often have to deal with missing data. Missing values have to be handled somehow, because most statistical analyses are not designed for incomplete data. Unfortunately, the methods most commonly used to handle missing data have serious problems, including biased results, and are therefore not recommended. Examples of such methods are listwise deletion, pairwise deletion, and mean imputation (replacement).
Luckily, there are methods that suffer far less from these problems. In this blog two of them will be discussed: multiple imputation and maximum likelihood.
With multiple imputation, the distribution of the variable with missing data is estimated from the observed data. A new dataset is then created in which the missing values are replaced by values drawn at random from this estimated distribution. If only one such dataset is created, however, one implicitly assumes that the estimated distribution equals the population distribution. This is usually not the case and leads to an underestimation of the standard error. To tackle this problem, several imputed datasets are created. From these datasets a pooled mean and a pooled standard error can be calculated, and the analyses are then performed with these pooled estimates.
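To make the pooling step concrete, here is a minimal sketch in Python of multiple imputation with pooling by Rubin's rules. It uses scikit-learn's IterativeImputer with sample_posterior=True to draw the imputations; the simulated variables, the number of imputations (m = 20) and the focus on a pooled mean are illustrative choices, not part of the original post.

```python
# A minimal sketch of multiple imputation with pooling by Rubin's rules,
# assuming a simple setting with one incomplete variable (y).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 300
x = rng.normal(0, 1, n)
y = 0.6 * x + rng.normal(0, 1, n)
y[rng.random(n) < 0.25] = np.nan            # about 25% of y is missing
data = np.column_stack([x, y])

m = 20                                      # number of imputed datasets
means, variances = [], []
for i in range(m):
    # sample_posterior=True draws imputations from an estimated predictive
    # distribution, so the imputed datasets differ from each other
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = imputer.fit_transform(data)
    y_completed = completed[:, 1]
    means.append(y_completed.mean())
    variances.append(y_completed.var(ddof=1) / n)   # squared SE of the mean

means, variances = np.array(means), np.array(variances)
pooled_mean = means.mean()
within = variances.mean()                   # average within-imputation variance
between = means.var(ddof=1)                 # between-imputation variance
pooled_se = np.sqrt(within + (1 + 1 / m) * between)   # Rubin's rules
print(f"pooled mean: {pooled_mean:.3f}, pooled SE: {pooled_se:.3f}")
```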
Maximum likelihood is a more complicated method for handling missing data. Rather than imputing the missing values, it uses whatever a participant with missing values did provide to inform the parameter estimates of the model, through a function that is maximized. So although missing values are not replaced by estimates of what they should have been, the observed data of such participants still contribute to the estimation of the model parameters. This may look similar to multiple imputation, but the difference is that no new datasets are created before the analysis: the maximum likelihood estimation is carried out together with the analysis itself. An advantage is that it produces accurate standard errors because the sample size stays the same, which is not the case with the pooled means and standard errors of multiple imputation. The problems with this method are mainly practical: it is not included in many statistical software packages, and it requires a rather large sample size, which is often a problem in psychological research.
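For comparison, a rough sketch of the maximum likelihood idea: instead of filling in values, the observed-data log-likelihood is maximized directly, so participants with missing values still contribute whatever they did provide. The bivariate normal model, the missingness rate and all variable names below are assumptions made for illustration, not the procedure of any particular software package.

```python
# A rough sketch of direct maximum likelihood with missing data, assuming a
# bivariate normal model for (x, y) with y missing at random. Participants
# with a missing y still contribute the marginal likelihood of their x.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
n = 500
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)
y[rng.random(n) < 0.3] = np.nan             # about 30% of y is missing
observed = ~np.isnan(y)

def neg_loglik(theta):
    mx, my, log_sx, log_sy, z = theta
    sx, sy, rho = np.exp(log_sx), np.exp(log_sy), np.tanh(z)
    cov = [[sx**2, rho * sx * sy], [rho * sx * sy, sy**2]]
    # complete cases contribute the bivariate density ...
    ll = stats.multivariate_normal.logpdf(
        np.column_stack([x[observed], y[observed]]), mean=[mx, my], cov=cov).sum()
    # ... incomplete cases contribute the marginal density of x only
    ll += stats.norm.logpdf(x[~observed], loc=mx, scale=sx).sum()
    return -ll

fit = optimize.minimize(neg_loglik, x0=[0, 0, 0, 0, 0], method="Nelder-Mead")
mx, my, log_sx, log_sy, z = fit.x
print(f"estimated mean of y: {my:.3f}, estimated correlation: {np.tanh(z):.3f}")
```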
Because sample sizes in psychological research are often small, multiple imputation will usually be the better choice. It is important to educate researchers about these methods and about how to report missing data. But statistical software developers also have a responsibility to make methods like multiple imputation and maximum likelihood more accessible. Furthermore, listwise and pairwise deletion should not be the default way of handling missing data in statistical software.

Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006). Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59, 1087-1091.
Enders, C. K., & Bandalos, D. L. (2001). The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling, 8, 430-457.

The Future of Fraud Detection: Do we Want a Science Police?

Academic fraud is a rare phenomenon. Still, its effects can be very serious for the scientific community. Fraudulent research can serve as the basis for further research in the same direction, leading to a waste of time and resources. Wrong findings can also end up as arguments for policy makers, whose decisions affect a lot of people. Colleagues of fraudsters see their careers damaged and have to regain trust after the fraud is discovered. Finally, honest researchers are put at a disadvantage. Clearly, something has to be done about it. But what?

One way to make fraud less appealing is punishment, combined with a greater fear of being discovered. The consequences of being exposed as a fraudster are already very severe; what undermines this system is that there is little reason to fear discovery. Whistleblowing often has very negative consequences for the whistleblower: social pressure and fear for one's own career make pointing out a fraudster a difficult act. Another point is that there are no routine checks in place to discover fraud. Research is mostly done in private, without any obligation to disclose the data, and every researcher is given the benefit of the doubt. Recent fraud cases suggest that fraud still seems a gamble worth taking: years of salary, grant money and prestige appear to outweigh the risk of the negative consequences.

Another way to help prevent fraud would be to think differently about data. Right now, many researchers consider data to be something they own: collecting it gives them the right to do whatever they want with it. But a counter-movement is gaining momentum in science and other fields, advocating the idea of open data. Open data means that the raw data are accessible for everyone to see after the research is done. One project at the cutting edge of this idea (along with other great ideas to make science better) is the Open Science Framework.

If scientific data had to be disclosed, it would become possible to run advanced analytics on it. There is already a large body of knowledge about how to discover fraud in raw financial data, and it could be used and extended. Humans tend to make characteristic mistakes when making up data, and those can be found. Not only are we bad at making up random numbers, we also know little about the distributions that many statistics typically follow, or about the structure of their digits.
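As a hedged illustration of what such a "digit structure" check could look like, the sketch below compares the leading digits of a batch of simulated reported numbers with the Benford distribution using a chi-square test. Real screening procedures are considerably more sophisticated; the data and the check itself are invented for this example.

```python
# A toy illustration of one possible "digit structure" check: comparing the
# leading digits of a batch of reported numbers with the Benford distribution.
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(42)
values = rng.lognormal(mean=3, sigma=1.5, size=2000)   # stand-in for reported figures

# first significant digit of each value (scientific notation starts with it)
first_digits = np.array([int(f"{v:e}"[0]) for v in values if v > 0])
observed = np.bincount(first_digits, minlength=10)[1:10]
expected = np.log10(1 + 1 / np.arange(1, 10)) * len(first_digits)

stat, p = chisquare(observed, f_exp=expected)
print(f"chi-square: {stat:.1f}, p-value: {p:.3f}")
```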

The next step could then be to develop tools that help to flag data as suspicious or trustworthy, tools that could easily be trained on the disclosed data. This way, fraud would become much more difficult and risky.

Of course, this is only a rough sketch of a possible future for science, and it raises some questions: How much do we want to trust each other as scientists? Do we need a “science police”? What do you think?

The (un)glory of human existence

Just a few months ago, I watched Stanley Kubrick’s 2001: A Space Odyssey for the first time. It’s a brilliant science fiction movie from the late 1960s about human evolution and its associated technological development. In the movie we see our species starting out as cavemen but rapidly evolving into sophisticated creatures who explore the universe in spaceships, hoping to find out where everything came from.
Given this incredible evolution of our species into what we are today, it can’t be mere coincidence that we live on Earth, can it? Proteins could have been randomly shuffled for billions of years before humans emerged on Earth in all our glory. Since the probability of human existence arising by chance is so small, surely intelligent design must be the best explanation (Vul & Kanwisher, 2010)!
Although this rationale is very flattering for us as humans, it does not follow the rules of logic. It is an example of the logical fallacy known as the “non-independence” error. Unfortunately, this error is not restricted to the unscientific domain; it is common in science as well.
So what is the non-independence error that leads to this flattering but logically erroneous conclusion? Essentially, the error of non-independence is a problem of selection bias. When we use statistical hypothesis testing on a data set, we assume that the way the data were selected does not influence the analysis. When the selection does influence the analysis, this assumption is violated.
To relate this to the human-evolution rationale: only if the protein combination that led to the emergence of our species had been a random sample from the population of all possible protein combinations in the universe, and only if the emergence of humans had been specified in advance by some higher power (sorry, evidence from the Bible doesn’t count, since the book was written about 196,500 years after the emergence of the first modern humans), would human existence have been a miracle indeed and our path predestined by intelligent design.
However, our data selection process was different. Our protein combination was not sampled from the population of all possible protein combinations; it was the only protein combination we observed. We did not look at any reference protein combinations that could have confirmed or rejected our rationale (maybe there is intelligent life on a planet we don’t know about yet!). Our selection is therefore biased and the result is guaranteed: it leads to the erroneous conclusion that chance cannot be the reason humans live on Earth.
Until now, we have no evidence for intelligent life elsewhere in the universe. But as Kubrick outlines in his Space Odyssey, humans are explorative by nature and will continue their search for life elsewhere. Maybe one day we will find out we’re not so special after all.

Joanne

Whose Fall is the Academic Spring?


The past has seen several protests against the controversial business practices of commercial publishing houses, but none of them has been as influential as the recent Elsevier boycott, a movement that has since been dubbed the Academic Spring. It is now almost two years later, as it is for its eponym, the Arab Spring. Recognizing that in many countries of the Arab world the protests have not brought the changes many had hoped for, it seems reasonable to take a critical look at the outcomes of the academic uprising.

The complaints ranged from unreasonable journal prices to the publication of fake journals promoting the products of pharmaceutical companies []; for many, however, Elsevier’s support of the Research Works Act (RWA) was the final nail in the coffin of the company’s integrity []. Elsevier’s opponents could quickly celebrate a first victory when the RWA was declared dead in late February. The RWA, however, was only one aspect of the critique, and its failure merely prevented the situation from getting worse; it did not really change the status quo. The Cost of Knowledge petition continues to gather signatures, but does not seem to seriously threaten Elsevier’s market dominance.

That the problem remains was demonstrated in April 2012, when the library of Harvard University released a memorandum describing its increasing difficulties in paying the annual costs of journal subscriptions and concluding that “many large journal publishers have made the scholarly communication environment fiscally unsustainable and academically restrictive” []. One year later, Greg Martin resigned from his position on the editorial board of Elsevier’s Journal of Number Theory; in his resignation letter he concluded that there had been no observable changes in Elsevier’s business []. In their one-year résumé, the initiators of the boycott were a little more optimistic []. While they admit that not much has changed in Elsevier’s general business strategy, that the ‘Big Deal’ price negotiations have not become more transparent, and that bundling is still common practice, they report some minor price drops. More importantly, they state, the boycott has raised awareness and increased support for newer, more open business models. What unites almost all of the critics is a shared hope in open access. The recent Access2Research petition can be seen as a further success of the open access movement, as it convinced the White House to release a memorandum directing all federal agencies to make federally funded research freely available within 12 months of initial publication [].

PLoS ONE, an open access journal, is now by far the largest academic journal in the world [], and new open access journals are being founded almost daily. While some of these new journals are what one would call ‘scams’, they, like all journals with poor quality standards, are unlikely to have a long life expectancy. The genie, however, has left the bottle, and it is unlikely that the fresh spirit of open access will disappear any time soon.

 

The independent t-test—violations of assumptions

Some methodologists complain that researchers do not check assumptions before doing a test. But in some cases, checking does not help much. For example, when there are no clear standards on how to test assumptions, or on the robustness of tests against violations of these assumptions, results can still be interpreted badly.

In the case of the independent t-test, it is often assumed that the test is robust against violations of the normality assumption, at least when both groups have sample sizes of at least 30. But it is becoming clear that even small deviations from normality may lead to very low power (Wilcox, 2011). So when groups do differ, t-tests often fail to detect it. This is especially true for mixed normal distributions, which have a relatively large variance. At the left side of Figure 1, two normal distributions are compared and the independent t-test has a power of .9; at the right side of the figure, two mixed normal distributions with variance 10.9 are compared, and here the power is only .28. In such a case a researcher might conclude from a plot that the assumption of normality is not violated. The result of the t-test might be that the null hypothesis of equal means cannot be rejected, and therefore the conclusion is that the groups do not differ. What is missed is that the distributions are actually mixed normal distributions, and that the difference was not detected due to low power.
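A quick simulation sketch of this power loss is given below: two groups whose means differ by 1 are compared with an ordinary t-test, once under standard normal errors and once under a mixed (contaminated) normal with 10% of observations having standard deviation 10 (total variance 10.9). The group size of 25 per group is an assumption, so the resulting power values will only be roughly in line with the figures quoted above.

```python
# A simulation sketch of the t-test's power under normal vs. mixed normal data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def draw(n, mean, contaminated):
    # mixed normal: 90% N(0, 1) and 10% N(0, 10^2), total variance 10.9
    sd = np.where(rng.random(n) < 0.1, 10.0, 1.0) if contaminated else 1.0
    return mean + rng.normal(0, 1, n) * sd

def power(contaminated, n=25, reps=5000, alpha=0.05):
    hits = 0
    for _ in range(reps):
        a = draw(n, 0.0, contaminated)
        b = draw(n, 1.0, contaminated)        # groups differ by 1
        hits += ttest_ind(a, b).pvalue < alpha
    return hits / reps

print(f"power, normal: {power(False):.2f}  power, mixed normal: {power(True):.2f}")
```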


Figure 1. At the left, two normal distributions whose means differ by 1; the power of the independent t-test is .9. At the right, two mixed normal distributions with variance 10.9 whose means differ by 1; the power of the independent t-test is only .28.

The main problem of the mixed normal distribution is its heavy tails: there are relatively many outliers. One solution is to use a 20% trimmed mean, which means that the highest twenty percent of the values is discarded, as is the lowest twenty percent. Yuen's method uses this trimmed mean in combination with the Winsorized variance (Wilcox, 2011). The Winsorized variance is the variance computed after the highest twenty percent of values has been replaced by the largest remaining value and the lowest twenty percent by the smallest remaining value. Yuen's method performs better in terms of power, control of Type I errors, and accuracy of confidence intervals.
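A minimal sketch of this robust alternative: SciPy (version 1.7 or later) exposes Yuen's trimmed-mean test through the trim argument of ttest_ind, and the 20% trimmed mean and Winsorized variance can be computed directly. The heavy-tailed example data are simulated.

```python
# Yuen's 20% trimmed-mean test next to the classic t-test, plus the robust
# summary statistics it is built on.
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(1)
a = rng.standard_t(df=3, size=40)            # heavy-tailed samples
b = rng.standard_t(df=3, size=40) + 0.8

ordinary = stats.ttest_ind(a, b)             # classic Student t-test
yuen = stats.ttest_ind(a, b, trim=0.2)       # Yuen's 20% trimmed-mean test

trimmed_mean_a = stats.trim_mean(a, 0.2)                        # 20% trimmed mean
winsorized_var_a = winsorize(a, limits=(0.2, 0.2)).var(ddof=1)  # 20% Winsorized variance

print(f"ordinary p: {ordinary.pvalue:.3f}  Yuen p: {yuen.pvalue:.3f}")
print(f"trimmed mean: {trimmed_mean_a:.2f}  Winsorized variance: {winsorized_var_a:.2f}")
```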

Another assumption whose violation often has large consequences is homoscedasticity: both groups should have equal variances. Often Levene's test or an F-test is used to assess whether this assumption is violated, but both tests are themselves sensitive to violations of the normality assumption. It is therefore recommended to use a variant of the independent t-test that does not assume equal variances, such as Welch's test. Zimmerman (2004) even suggests that Welch's test should be used in all cases, even when the distributions are homoscedastic. With Welch's test the probability of a Type I error is controlled better, and the power is therefore higher in most situations.
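For completeness, a short sketch of Welch's test next to Student's t-test on simulated heteroscedastic groups; in SciPy this is simply the equal_var=False option of ttest_ind. The group sizes and variances are made up for illustration.

```python
# Student's t-test vs. Welch's test on groups with unequal variances.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, 20)     # small group, small variance
b = rng.normal(0.5, 4.0, 80)     # large group, large variance

student = ttest_ind(a, b, equal_var=True)    # assumes homoscedasticity
welch = ttest_ind(a, b, equal_var=False)     # Welch's correction

print(f"Student p: {student.pvalue:.3f}  Welch p: {welch.pvalue:.3f}")
```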

 

References

Wilcox, R. (2011). Modern Statistics for the Social and Behavioral Sciences. New York: CRC Press.

The Cost of Knowledge, a matter of life and death

Did you know that psychological research can be a matter of life and death? Two researchers, Rodin and Langer, showed in the late 1970s that elderly people in nursing homes lived longer and had a better quality of life when they were more involved in daily decisions, such as the choice of dinner or the type of leisure activity. This reflects a well-known fundamental psychological process, namely the relationship between perceived control and reward. It is a process that is often studied with visual illusions like the dot illusion. But this post is not about illusions; it is about the fact that even the most fundamental theoretical research can have a profound impact on real-world problems. There is only one crucial problem… access to this knowledge.

In our traditional publication system, people have to pay for access to published research. This means that professionals outside the scientific world often have no access to it, because they are usually not willing to pay the exorbitant prices. But scientists themselves also often lack access to their peers' research, or even to their own, because their academic institution cannot afford subscriptions to all the scientific journals. This problem is known as the paywall, or access barrier, and it seriously limits the value of our scientific research.

Although the problem had been known for years and many researchers had drawn attention to it, it was not until 2012 that it received widespread attention both within and outside the academic world. On January 21, 2012, the Cambridge mathematician Timothy Gowers published the blog post “Elsevier – my part in its downfall”, in which he stated that he would refuse to have anything to do with one of the biggest commercial publishers, Elsevier. According to him, and to the many researchers who supported his statement, Elsevier is an exemplar of everything that is wrong with the current publication system. The paywall and the access barrier were the central points in his objections.

In response to his call to attention, the petition “The Cost of Knowledge” was created, in which researchers declare that they will not publish in, referee for, or do editorial work for journals published by Elsevier. By now, almost 14,000 researchers have signed the petition and thus support the boycott of Elsevier. At the same time, it boosted the Academic Spring, a movement of researchers, academics and scholars who oppose the traditional publication system and promote open access as an alternative model. Their ultimate goal is fairer access to published research. Only then can science fully contribute to our society and its problems, and… make the difference between life and death.

Evalyne Thauvoye


What Does the Psychologist Do?

Like most people, we don't just say what we did; we also add why we did it. For example: “I scream at you because I am very angry at you” (what is anger?) or “Because I'm very intelligent I was able to get a high score on that test” (what is intelligence?). We use unobserved entities to explain our behaviour. In psychology, too, we are concerned with the causes of human behaviour.

Psychologists measure all kinds of things that have to do with behaviour. But what exactly do they measure? To answer this question I will make a distinction between manifest variables and latent variables. A manifest variable is a variable that a psychologist can observe directly, like the response someone gives on a questionnaire. A latent variable is unobserved, which makes it very hard to measure, or even worse, to prove that it exists. Examples of latent variables are depression, intelligence, motivation, power, and sadness.

Depression, intelligence, motivation, power, and sadness are all examples of what a psychologist tries to measure. You might think that measuring these things is not that hard. If you think about yourself you might say that you know very well if you are depressed, intelligent or motivated. You might even say, “Come here psychologist, I will tell you how depressed, intelligent and motivated I am”. But then the psychologist will answer that such information is not of any use because it is subjective. And if a psychologist is subjective he cannot work as a researcher at the university.

What the psychologist very often does is measure latent variables indirectly. To measure a latent variable indirectly, (s)he needs manifest variables. Why? Because the psychologist assumes that people's responses on the manifest variables are caused by the latent variable. For example: if someone responds to the statement “I don't sleep very well” on a seven-point Likert scale with a seven, meaning that this person barely sleeps, we believe that this response is caused by depression. If we have a collection of such statements, we believe we can say something “objective” about depression (a subjectively constructed latent variable, by the way). We do this by putting all the data into a computer so that the latent variable can be calculated. By calculating a number for a latent variable it becomes kind of real.

And how do we do that?

Psychologists have a collection of formal models at their disposal to measure latent variables. A formal model is just a lot of mathematics that, in itself, has nothing to do with psychology or behaviour.

But how can you measure motivation with something that has nothing to do with motivation? I don’t measure length with a weighing scale do I?

What a psychologist does is dress the formal model up with theories: theories about depression, motivation or intelligence. He then checks whether the formal model, with all its equations, fits the clothes the psychologist tried to dress it in. If it does, the psychologist can say that his theory is approximately true. We can use the same clothing analogy to show one of the shortcomings.

Shops sell their clothes in a limited number of sizes, and often many different people fit the same t-shirt. In the same way, a formal model that fits does not single out one theory: very different theories may fit the same model equally well.
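To make the “putting the data in a computer” step a bit more concrete, here is a toy sketch: item responses are generated from a single latent variable and a one-factor model is fitted to them with scikit-learn. The number of items, the loadings and the interpretation as “depression” are all invented for illustration.

```python
# A toy formal measurement model: simulate items driven by one latent variable
# and recover it with a one-factor model.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
n = 500
latent = rng.normal(0, 1, n)                          # the latent variable
loadings = np.array([0.8, 0.7, 0.6, 0.9])             # how strongly each item reflects it
items = latent[:, None] * loadings + rng.normal(0, 1, (n, 4))

model = FactorAnalysis(n_components=1).fit(items)
scores = model.transform(items)[:, 0]                 # estimated latent positions

print("estimated loadings:", model.components_.round(2))
# the sign of a factor is arbitrary, so look at the absolute correlation
print("|correlation| with true latent variable:",
      round(abs(np.corrcoef(scores, latent)[0, 1]), 2))
```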

 

Boris Stapel

Recommended reading:

Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110, 203-219.

Borsboom, D. (2008). A tour guide to the latent realm. Measurement, 6, 134-146.

To reject the null hypothesis correctly or incorrectly: that is the question.

Comparing multiple variables or conditions increases the chance of finding false positives: rejecting the null hypothesis even though the null hypothesis is true (see Table 1). Here I briefly describe four approaches to dealing with the problem of multiple comparisons.

 

                             Null hypothesis (H0) is true           Null hypothesis (H0) is false
Reject null hypothesis       Type I error: false positive (α)       Correct outcome: true positive (1-β)
Fail to reject               Correct outcome: true negative (1-α)   Type II error: false negative (β)
null hypothesis

Table 1. The four possible outcomes of null hypothesis significance testing. The Type I error (false positive) is the main concern in multiple comparison testing.

  1. Family Wise Error Rate (FWER). The FWER (Hochberg & Tamhane, 1987) is the probability of making at least one Type I error among all the tested hypotheses. It can be used to control the number of false positives by adjusting α (the probability of a false positive). Normally, α is set at 0.05 or 0.01. With a FWER-based correction, a per-comparison α is calculated from the overall α and the number of comparisons made. The best-known FWER-based correction is the Bonferroni correction, where α_comparison = α / (number of comparisons).
  2. False Discovery Rate (FDR). The FDR (Benjamini & Hochberg, 1995) is the expected proportion of false positives among all rejected hypotheses (the cells α and 1-β in Table 1). The general procedure is to order all p-values from small to large and compare each p-value to an FDR threshold; if the p-value is smaller than or equal to this threshold, it can be interpreted as significant. How the threshold is calculated depends on the correction method used. (Both the FWER and FDR corrections are sketched in the code after this list.)
  3. ‘Doing nothing’ approach. Since all correction methods have their flaws, advocates of this approach argue that no corrections should be made (Perneger, 1998). Scientists should simply state clearly in their articles which comparisons they made, how they made them, and what the outcomes were.
  4. Bayesian approach. The Bayesian approach discards the frequentist framework, including null hypothesis significance testing. By doing this, the whole problem of multiple comparisons, and of Type I and II errors, disappears. Rather than correcting for a perceived problem, Bayesian methods build ‘the multiplicity into the model from the start’ (Gelman, Hill, & Yajima, 2012, p. 190).
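A brief sketch of approaches 1 and 2 in practice, using statsmodels' multipletests on a set of simulated p-values; the mix of 90 null tests and 10 real effects is an arbitrary illustration.

```python
# Bonferroni (FWER) and Benjamini-Hochberg (FDR) corrections on simulated p-values.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(4)
pvals = np.concatenate([rng.uniform(0, 1, 90),        # 90 tests with no real effect
                        rng.uniform(0, 0.002, 10)])   # 10 tests with a real effect

bonf_reject, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
fdr_reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("uncorrected rejections:       ", (pvals < 0.05).sum())
print("Bonferroni rejections:        ", bonf_reject.sum())
print("Benjamini-Hochberg rejections:", fdr_reject.sum())
```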

All four approaches to dealing with multiple comparisons have their advantages and disadvantages. Overall, I believe the Bayesian approach is by far the most attractive option. Apart from dismissing the problem of multiple comparisons (among others), it gives researchers the opportunity to collect evidence in favour of either hypothesis, instead of making probabilistic statements about rejecting or not rejecting the null hypothesis. What stands in the way of applying the Bayesian approach is its theoretical difficulty (compared to the frequentist approach). With the increase in approachable books, articles, and workshops about the Bayesian approach, and the development of Bayesian scripts for statistical software, a revolution in how scientists practice science seems to be getting closer.

 

References

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289-300.

Gelman, A., Hill, J., & Yajima, M. (2012). Why we (usually) don’t have to worry about multiple comparisons. Journal of Research on Educational Effectiveness, 5, 189-211.

Hochberg, Y., & Tamhane, A. C. (1987). Multiple Comparison Procedures. New York: Wiley.

Perneger, T. V. (1998). What’s wrong with Bonferroni adjustments. British Medical Journal, 316, 1236-1238.

 

The problematic p-value: How large sample sizes lead to the significance of noise

The most reliable sample with the highest power is of course the complete population itself. How is it possible that this sample can reject null hypotheses that we do not want to reject?

When the sample size increases, very small effects can become significant. I ran an independent samples t-test on two simulated groups drawn from normal distributions. This was repeated 1000 times with a sample size of 200 persons per group and 1000 times with a sample size of 20,000 persons per group (see Figure 1), and I computed the proportion of significant t-tests. Note that even if there is no effect, or the effect is too small to detect, you expect about 5 percent of the t-tests to be significant because of the significance level of .05 (Type I errors). With a total sample size of 400 persons (200 per group) a significant effect is found in approximately 5.6% of the replications (running this several times gave values between 5% and 6%). The proportion of rejected null hypotheses increases with the sample size: with groups of 20,000 it rises to approximately 73.3%.
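A sketch of this kind of simulation is given below. The post does not state the size of the true mean difference, so the value used here (about 0.027 standard deviations, chosen to be "too small to care about") is an assumption, and the exact proportions will vary from run to run.

```python
# Proportion of significant t-tests for a tiny true effect at two sample sizes.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(5)

def prop_significant(n_per_group, true_diff=0.027, reps=1000, alpha=0.05):
    hits = 0
    for _ in range(reps):
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(true_diff, 1, n_per_group)
        hits += ttest_ind(a, b).pvalue < alpha
    return hits / reps

print("n = 200 per group:  ", prop_significant(200))
print("n = 20000 per group:", prop_significant(20000))
```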


[Figure 1: proportion of significant t-tests for group sizes of 200 and 20,000]

The crud factor is described by Meehl (1990) as the phenomenon that, ultimately, everything correlates to some extent with everything else. This phenomenon was supported by a large exploratory study he conducted with Lykken in 1966, in which 15 variables were cross-tabulated for a sample of 57,000 schoolchildren. All 105 cross-tabulations were significant, and 101 of them were significant with a probability of less than 10^-6.

Similar findings were described by Starbuck (as cited in Andriani, 2011, p. 457), who found that: “choosing two variables at random, a researcher has a 2-to-1 odds of finding a significant correlation on the first try, and 24-to-1 odds of finding a significant correlation within three tries.” Starbuck concludes from these findings (as cited in Andriani, 2011, p. 457): “The main inference I drew from these statistics was that the social sciences are drowning in statistically significant but meaningless noise.”

Taken literally, the null hypothesis is always false (as cited in Meehl, 1990, p. 205). When this is combined with the fact that very large samples can make every tiny effect significant, one has to conclude that with the ideal sample (as large as possible) one has to reject every null hypothesis.
This is problematic because it turns research into a tautology: every experiment with a p-value > .05 becomes a Type II error, since the sample was simply not big enough to detect the effect. A solution could be to attach more importance to effect sizes and make them decisive in whether a null hypothesis should be rejected. However, it is hard to change the interpretation of the p-value, since its use is deeply ingrained in our field. Altogether I would suggest leaving the p-value behind us and switching to a less problematic method.

 

Andriani, P. (2011). Complexity and innovation. The SAGE Handbook of Complexity and Management, 454-470.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

The over- and underutilization of ANCOVA

After completing several statistics courses I lived under the illusion that I knew the ins and outs of what I thought were basic statistical analyses. During this course, however, I saw pitfalls in almost all of them and came to realize that applying statistical procedures is not as straightforward as I once thought. One of the most striking examples is the analysis of covariance (ANCOVA), a procedure that is used a lot, seemingly as a way to “control” for confounds. I was always impressed by this procedure, until I found out there is a lot more to it than just “controlling” for confounds.

The analysis of covariance (ANCOVA) was developed as an extension of the analysis of variance (ANOVA) to increase statistical power (Porter & Raudenbush, 1987). By including covariates, the variance associated with these covariates is “removed” from the dependent variable (Field, 2009). From the point of view of the manipulation, the error variance in the dependent variable is thereby reduced and the statistical power increases (see Figure 1). Given that psychological research is often underpowered (Cohen, 1990), ANCOVA is an important statistical procedure for revealing psychological phenomena and effects.

[Figure 1: how including a covariate reduces error variance in the dependent variable]
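As a small illustration of this power argument (not the analysis from the papers cited above), the sketch below simulates a randomized experiment and compares the standard error of the group effect with and without the covariate, using statsmodels' formula API. All variable names and effect sizes are made up.

```python
# ANOVA vs. ANCOVA in a randomized design: the covariate soaks up error
# variance, shrinking the standard error of the treatment effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 100
group = rng.integers(0, 2, n)                     # random assignment
covariate = rng.normal(0, 1, n)                   # e.g. a pretest score
outcome = 0.3 * group + 0.8 * covariate + rng.normal(0, 1, n)
df = pd.DataFrame({"y": outcome, "group": group, "x": covariate})

anova = smf.ols("y ~ group", data=df).fit()       # ANOVA: group only
ancova = smf.ols("y ~ group + x", data=df).fit()  # ANCOVA: group plus covariate

print(f"SE of group effect, ANOVA:  {anova.bse['group']:.3f}")
print(f"SE of group effect, ANCOVA: {ancova.bse['group']:.3f}")
```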

This promising application of ANCOVA, however, only holds when there is no systematic relationship between the grouping variable and the covariate; that is, the groups may not differ on the covariate. This is an assumption that many researchers fail to check. As a result, ANCOVA is widely misunderstood and misused (Miller & Chapman, 2001).

The importance of this assumption is illustrated in Figure 2. When group and covariate are related, removing the variance associated with the covariate also alters the grouping variable. In other words, what remains of the group variable after removing the variance associated with the covariate has poor construct validity, and the results are therefore uninterpretable.

[Figure 2: when group and covariate are related, removing the covariate's variance alters the group variable]

The general point is that the legitimacy of ANCOVA depends on the relationship between the grouping variable and the covariate. ANCOVA is justified only when there is no systematic relationship between these variables.

On the one hand, it is quite straightforward to defend this judgement in a randomized experiment: given random assignment, individual characteristics are equally distributed across the groups, so group means should not differ except by chance (see the left panel of Figure 3). As a result, including a covariate in a randomized experiment increases statistical power. In this sense, ANCOVA is underutilized. On the other hand, when studying pre-existing groups (i.e., non-random assignment), individual characteristics are not evenly distributed across groups, and a relationship between group and covariate can therefore exist. Including a covariate in a non-randomized experiment might thus alter the grouping variable and lead to unfounded interpretations and conclusions. In this sense, ANCOVA is overutilized.
It is worrisome that ANCOVA is applied more often in non-randomized experiments than in randomized experiments (Keselman et al., 1998). The idea is that researchers want to "control" for pre-existing differences. This idea is incorrect: there simply is no statistical way to "control" for these pre-existing differences. ANCOVA "removes" the variance that is associated with the covariate, but it does not "control" for the covariate.

[Figure 3: distribution of individual characteristics across groups under random versus non-random assignment]
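And a companion sketch of the non-randomized case: when the pre-existing groups already differ on the covariate, the covariate-adjusted "group effect" is a very different quantity from the raw group difference, which is exactly the interpretational problem described above. The data-generating choices here are arbitrary.

```python
# Pre-existing groups that differ on the covariate: adjusting for the
# covariate changes the "group effect" rather than "controlling" for anything.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 200
group = rng.integers(0, 2, n)                     # pre-existing groups
covariate = rng.normal(0, 1, n) + 1.0 * group     # groups differ on the covariate
outcome = 0.8 * covariate + rng.normal(0, 1, n)   # outcome depends on covariate only
df = pd.DataFrame({"y": outcome, "group": group, "x": covariate})

raw = smf.ols("y ~ group", data=df).fit()
adjusted = smf.ols("y ~ group + x", data=df).fit()

print(f"raw group difference:      {raw.params['group']:.2f}")       # clearly nonzero
print(f"covariate-adjusted effect: {adjusted.params['group']:.2f}")  # close to zero
```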
We, as researchers, should acknowledge both what ANCOVA cannot do (it cannot "control" for pre-existing differences) and what it can do (it can increase statistical power). That way we can eliminate unfounded conclusions that result from the misapplication of ANCOVA and, most importantly, exploit the strength of its proper application: increased statistical power. Used this way, ANCOVA can help to reveal real psychological phenomena and effects.

For a good overview of this problem, consult Miller and Chapman (2001). The original paper introducing the problem gives a good example of how the inclusion of a covariate can lead to incorrect conclusions (Lord, 1967).