Missing Data: What to Do?

Researchers in psychology and the social sciences often have to deal with missing data. Missing values have to be handled because most statistical analyses are not designed for incomplete data. At the moment, most of the methods commonly used to handle missing data have serious problems, including biased results, and are therefore not recommended. Examples of these methods are listwise deletion, pairwise deletion and mean imputation (mean replacement).
Luckily, there are methods that suffer less from these problems. This blog discusses two of them: multiple imputation and maximum likelihood.
With multiple imputation, the distribution of the variable with missing data is estimated from the observed data. Once this distribution is estimated, a new dataset is created in which the missing values are replaced by values drawn at random from the estimated distribution. When only one dataset is made, however, one assumes that the estimated distribution is the same as the population distribution. This is often not the case, and it leads to an underestimation of the standard error. To tackle this problem, several datasets are made. The analysis is then run on each of these datasets, and the results are combined into a pooled estimate (for example a pooled mean) and a pooled standard error, on which the conclusions are based.
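The procedure can be sketched in a few lines of Python. This is only an illustration under strong assumptions that are not part of the text above: a single variable, a normal distribution fitted to the observed values, and pooling with Rubin's rules. The function name `impute_and_pool` and the number of datasets `m` are made up for this example, and a full implementation would also draw the distribution's parameters anew for every dataset.

```python
import numpy as np

def impute_and_pool(y, m=20, seed=0):
    """Sketch of multiple imputation for the mean of one variable.

    Assumes (for illustration only) that the variable is roughly normal.
    Returns the pooled mean, the pooled standard error, and the average
    standard error that a single imputed dataset would report.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = y[~missing]

    # Step 1: estimate the distribution from the observed data.
    mu_hat = observed.mean()
    sd_hat = observed.std(ddof=1)

    means, variances = [], []
    for _ in range(m):
        # Step 2: fill in the missing values with random draws.
        completed = y.copy()
        completed[missing] = rng.normal(mu_hat, sd_hat, missing.sum())
        # Step 3: run the analysis (here: the mean) on each dataset.
        means.append(completed.mean())
        variances.append(completed.var(ddof=1) / completed.size)

    means = np.array(means)
    within = np.mean(variances)      # average sampling variance
    between = means.var(ddof=1)      # variance between the datasets
    # Step 4: pool. The between-dataset variance is what a single
    # imputed dataset would ignore.
    total = within + (1 + 1 / m) * between
    return means.mean(), np.sqrt(total), np.sqrt(within)

scores = [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0]
pooled_mean, pooled_se, single_se = impute_and_pool(scores)
print(pooled_mean, pooled_se, single_se)
```

The pooled standard error is larger than the one a single imputed dataset would give, which is exactly the underestimation described above.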
Maximum likelihood is a more complicated method for handling missing data. With this method the missing data are not imputed. Instead, the observed data of a participant with missing values are used to correct the parameters of a model, by maximizing a likelihood function. So although missing values are not replaced with an estimate of what they should have been, the observed data of that participant still contribute to the estimation of the model parameters. This looks similar to multiple imputation, but the difference is that no new dataset is created before the analysis is done: the maximum likelihood method is part of the analysis itself. The advantage is that it produces accurate standard errors, because the sample size stays the same, which is not the case with the pooled means and standard errors in multiple imputation. The problems of this method are mainly practical. It is not included in many statistical software packages, and the sample size has to be rather large, which is often a problem in psychological research.
Because sample sizes in psychological research are often small, it is probably better to use multiple imputation. It is important to educate researchers about these methods and about how to report missing data. But statistical software developers also have a responsibility to make methods like multiple imputation and maximum likelihood more accessible. Furthermore, it is suggested that listwise or pairwise deletion should not be the default way of handling missing data in statistical software.
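A small Python sketch can show how the observed data of incomplete cases "correct" a parameter. It covers only a special case that is an assumption of this example, not something from the text above: two roughly normal variables, where x is observed for everyone and y is missing for some participants. In that case the maximum of the likelihood has a closed form, so no iterative maximizing is needed; general software does this numerically. The function name `ml_mean_y` is made up.

```python
import numpy as np

def ml_mean_y(x, y):
    """Maximum likelihood estimate of the mean of y when x is complete
    and y has missing values (np.nan), assuming bivariate normality.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    complete = ~np.isnan(y)

    # Regression of y on x, estimated from the complete cases only.
    slope = np.cov(x[complete], y[complete], ddof=0)[0, 1] / x[complete].var()
    intercept = y[complete].mean() - slope * x[complete].mean()

    # The mean of x uses ALL participants, including those without y.
    # This is where the incomplete cases correct the estimate.
    return intercept + slope * x.mean()

# Noise-free toy data: y = 2x + 1, but y is missing when x is high.
x = np.arange(10.0)
y = 2 * x + 1
y[x >= 5] = np.nan

print(np.nanmean(y))    # mean of the complete cases only: 5.0
print(ml_mean_y(x, y))  # approximately 10, the mean of the full y
```

No values are filled in anywhere, yet the participants with missing y still move the estimate, because their x values tell us that the complete cases are not representative.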

Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006). Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59, 1087-1091.
Enders, C. K., & Bandalos, D. L. (2001). The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling, 8, 430-457.

The Future of Fraud Detection: Do We Want a Science Police?

Academic fraud is a rare phenomenon. Still, its effects can be very serious for the scientific community. Fraudulent research can serve as the basis for further research in the same direction, leading to a waste of time and resources. Additionally, wrong findings can serve as arguments for policy makers, whose decisions have an impact on a lot of people. Colleagues involved with fraudsters suffer damage to their careers and have to regain trust after the fraud is discovered. Finally, honest researchers are put at a disadvantage. Clearly, something has to be done about it. But what?

One way to make fraud less appealing is punishment combined with a greater fear of being discovered. Currently, the consequences of being exposed as a fraudster are already very severe. What undermines this system, though, is that there is little reason to fear discovery. Whistleblowing often has very negative consequences for the whistleblower. Social pressure and fear for one's own career make pointing out a fraudster a difficult act. Another point is that there are no routine checks in place to discover fraud. Research is mostly done in private, without the necessity to disclose the data, and every researcher is given the benefit of the doubt. Recent fraud cases show, however, that fraud still seems to be a gamble worth making. Years of salary, grant money and prestige seem to outweigh the risk of negative consequences.

One way to help prevent fraud would be a different way of thinking about data. Right now, many researchers consider data to be something they own. Collecting it gives them the right to do whatever they want with it. But there is a counter-movement gaining momentum in science and other fields, advocating the idea of open data. Open data means that the raw data are accessible for everyone to see after the research is done. One project at the cutting edge of this idea (along with other great ideas to make science better) is the Open Science Framework.

If scientific data had to be disclosed, it would become possible to run advanced analytics on them. There is already a large body of knowledge about how to discover fraud in raw financial data, and it could be used and extended. We humans tend to make characteristic mistakes when making up data, and those mistakes can be found. Not only are we bad at making up random numbers, we also know little about the distributions that are common to many statistics, or about the digit structure of real data.
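One simple check of this kind looks at the last digits of reported numbers. The sketch below is an illustration, not an established fraud test for any particular field: it assumes integer-valued measurements in which the last digit should be roughly uniform over 0 to 9, and it computes a chi-square statistic against that uniform expectation. The function name `last_digit_chi_square` is made up for this example.

```python
from collections import Counter

def last_digit_chi_square(values):
    """Chi-square statistic for the last digits of integer-valued data
    against a uniform distribution over the digits 0-9 (9 degrees of
    freedom). Large values mean the digits are suspiciously uneven.
    """
    digits = [abs(int(v)) % 10 for v in values]
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected
               for d in range(10))

# Perfectly even last digits give a statistic of 0.
print(last_digit_chi_square(range(100)))           # 0.0

# Someone who keeps typing numbers ending in 3 or 7 stands out.
print(last_digit_chi_square([17] * 50 + [23] * 50))  # 400.0
```

With 9 degrees of freedom, a statistic above roughly 16.9 would be significant at the 5% level. A high value is a reason to look closer at a dataset, not proof of fraud, since rounding habits or measurement instruments can also produce uneven digits.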

The next step could then be to develop tools that flag data as suspicious or trustworthy, and that can easily be trained on the disclosed data. This way, fraud would become much more difficult and risky.

Of course, this is only a rough sketch of the future of science, but it also raises some questions: How much do we want to trust each other as scientists? Do we need a “science police”? What do you think?

The (un)glory of human existence

Just a few months ago, I watched Stanley Kubrick’s 2001: A Space Odyssey for the first time. It’s a brilliant science fiction movie from the late 60s about human evolution and the technological development that came with it. In the movie we see our species starting out as cavemen, but rapidly evolving into sophisticated creatures who explore the universe in spaceships, hoping to find out where everything comes from.
Given this incredible evolution of the human species into what we are today, it can’t be a coincidence that we are living on earth, can it? Proteins could be randomly shuffled for billions of years before humans would emerge on earth in all our glory. Since the probability of human existence arising by chance is so small, surely intelligent design must be the best explanation (Vul & Kanwisher, 2010)!
Although this rationale is very flattering for us as humans, it doesn’t follow the rules of logic. The rationale is an example of the logical fallacy known as the “non-independence” error. Unfortunately, this error is not restricted to the unscientific domain, but is common in science as well.
So, what is the non-independence error that leads to this flattering, but logically erroneous conclusion? Essentially, the error of non-independence is a problem of selection bias. When we use statistical hypothesis testing on a data set, we assume that the selection of the data does not influence the data analysis. When the selection does influence the analysis, this assumption is violated.
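A small simulation makes the problem concrete. Everything in it is an assumption made for illustration (the number of participants, the number of measures, the function name `selection_demo`): it generates pure noise, picks the measure that correlates best with an outcome, and then reports the correlation for that same selected measure.

```python
import numpy as np

def selection_demo(n_subjects=20, n_measures=1000, seed=1):
    """Select the best of many pure-noise measures, then 'test' it.

    Returns the correlation of the selected measure on the same data
    that was used to select it, and a correlation from a fresh,
    independent noise sample of the same size.
    """
    rng = np.random.default_rng(seed)
    outcome = rng.normal(size=n_subjects)
    measures = rng.normal(size=(n_subjects, n_measures))

    # Correlate every measure with the outcome. No measure is truly
    # related to it, because everything is random noise.
    r = np.array([np.corrcoef(measures[:, j], outcome)[0, 1]
                  for j in range(n_measures)])

    # Non-independent analysis: select on the data, test on the same data.
    selected_r = r[np.argmax(r)]

    # Independent analysis: new noise data that played no role in the
    # selection.
    fresh_r = np.corrcoef(rng.normal(size=n_subjects),
                          rng.normal(size=n_subjects))[0, 1]
    return selected_r, fresh_r

selected_r, fresh_r = selection_demo()
print(selected_r, fresh_r)
```

The selected correlation is large even though there is nothing to find, because the selection step guarantees an impressive result. Only data that were not used for the selection give an honest answer.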
To relate this to the human evolution rationale: if we assume that the protein combination that led to the emergence of the human species was a sample from the population of all possible protein combinations in the universe, and if the emergence of humans had been specified in advance by some higher power (sorry, evidence from the Bible doesn’t count, since the book was written some 196,500 years after the emergence of the first modern humans), only then would human existence have been a miracle indeed; our path must have been predestined by intelligent design!
However, our data selection process was different. Our protein combination was not drawn from the population of all possible protein combinations, because it was the only protein combination we observed. We did not look at any reference protein combinations that could have confirmed or rejected our rationale (maybe there is intelligent life on a planet that we don’t know of yet!). Therefore, our selection is biased and the result is guaranteed: it leads to the erroneous conclusion that chance cannot be the reason humans live on earth.
So far, we don’t have evidence for intelligent life elsewhere in the universe. But as Kubrick outlines in his Space Odyssey, humans are explorers by nature and will continue their search for life elsewhere. Maybe one day we will find out we’re not so special after all.