Missing Data: What to Do?

Researchers in psychology and the social sciences often have to deal with missing data. Missing values cannot simply be ignored, because most statistical analyses are not designed to handle them. Many of the methods most commonly used to handle missing data have serious problems, including biased results, and are therefore not recommended. Examples of such methods are listwise deletion, pairwise deletion and mean imputation (mean replacement).
Luckily, there are methods that suffer less from these problems. This blog discusses two of them: multiple imputation and maximum likelihood.
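To make one of these problems concrete, here is a minimal sketch of what mean imputation does to the spread of a variable. The scores are made up for illustration, and None marks a missing value. Because every missing value is replaced by the same number, the imputed variable looks less variable than the observed data suggest.

```python
from statistics import mean, variance

# Hypothetical scores (illustration only); None marks a missing value.
scores = [4.0, 7.0, None, 5.0, 9.0, None, 6.0, 8.0]

observed = [x for x in scores if x is not None]
observed_mean = mean(observed)

# Mean imputation: every missing value is replaced by the observed mean.
imputed = [observed_mean if x is None else x for x in scores]

var_observed = variance(observed)  # sample variance of the 6 observed scores
var_imputed = variance(imputed)    # sample variance after mean imputation

print(f"variance of observed scores:   {var_observed:.2f}")  # 3.50
print(f"variance after mean imputation: {var_imputed:.2f}")  # 2.50
```

The mean stays the same, but the variance drops from 3.50 to 2.50, so standard errors based on the imputed data would be too small.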
With multiple imputation, the distribution of the variable with missing data is estimated from the observed data. Once this distribution has been estimated, a new dataset is created in which the missing values are replaced by values drawn at random from the estimated distribution. If only one such dataset is made, however, one assumes that the estimated distribution is the same as the population distribution. This is often not the case, and it leads to an underestimation of the standard error. To tackle this problem, several datasets are created. The analysis is then run on each of these datasets, and the results are combined into a pooled estimate and a pooled standard error.
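The procedure can be sketched in a few lines of Python. This is a simplified illustration, not a full implementation. It assumes the variable is roughly normally distributed, it draws imputations from a normal distribution fitted to the observed values, and it pools the results with Rubin's rules, a standard pooling method that this blog does not name. A proper implementation would also draw the parameters of the distribution anew for each dataset. The function names and the scores are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev, variance


def impute_once(values, rng):
    """Replace each missing value (None) by a random draw from a normal
    distribution estimated from the observed values."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), stdev(observed)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]


def pool(estimates, std_errors):
    """Combine per-dataset estimates and standard errors (Rubin's rules)."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(se ** 2 for se in std_errors)  # average sampling variance
    between = variance(estimates)                # variance across datasets
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


def multiple_imputation_mean(values, m=20, seed=0):
    """Estimate the mean of a variable with missing values using m imputations."""
    rng = random.Random(seed)
    estimates, std_errors = [], []
    for _ in range(m):
        completed = impute_once(values, rng)
        # The "analysis" here is simply the mean and its standard error.
        estimates.append(mean(completed))
        std_errors.append(stdev(completed) / sqrt(len(completed)))
    return pool(estimates, std_errors)


# Hypothetical scores (illustration only); None marks a missing value.
scores = [4.0, 7.0, None, 5.0, 9.0, None, 6.0, 8.0]
pooled_mean, pooled_se = multiple_imputation_mean(scores)
print(f"pooled mean: {pooled_mean:.2f}, pooled standard error: {pooled_se:.2f}")
```

The between-dataset variance is what a single imputed dataset cannot provide. It adds the uncertainty about the missing values to the pooled standard error.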
Maximum likelihood is a more complicated method for handling missing data. With this method missing data are not imputed. Instead, the observed data of a participant with missing values are used to correct the parameters of the model, which is done by maximizing a likelihood function. So although missing values are not replaced with an estimate of what they should have been, the observed data of such a participant are still used in estimating the model parameters. This looks similar to multiple imputation. The difference is that no new dataset is created before the analysis is run; the maximum likelihood method is applied as part of the analysis itself. The advantage is that it produces accurate standard errors because the sample size stays the same, which is not the case with the pooled means and standard errors in multiple imputation. The drawbacks of this method are mainly practical: it is not included in many statistical software packages, and it requires a rather large sample. The latter is often a problem in psychological research.
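A small sketch may help to show how participants with missing values still contribute to the estimates. It rests on strong assumptions that go beyond this blog: two variables that are bivariate normal, with x observed for everyone and y missing for some participants. In this special case the likelihood can be maximized in closed form. Statistical software generally has to maximize it iteratively. The data and function names are hypothetical.

```python
from math import log, pi
from statistics import mean

# Hypothetical data (illustration only): x is always observed,
# y is missing (None) for some participants.
data = [(1.0, 2.1), (2.0, 2.9), (3.0, 4.2), (4.0, None),
        (5.0, 6.1), (6.0, None), (7.0, 7.8), (8.0, None)]


def normal_logpdf(value, mu, var):
    return -0.5 * (log(2 * pi * var) + (value - mu) ** 2 / var)


def observed_loglik(data, mu_x, var_x, b0, b1, res_var):
    """Log-likelihood of the observed data under a bivariate normal model.

    Every participant contributes through x. Only participants with an
    observed y also contribute through the regression of y on x.
    """
    total = 0.0
    for x, y in data:
        total += normal_logpdf(x, mu_x, var_x)
        if y is not None:
            total += normal_logpdf(y, b0 + b1 * x, res_var)
    return total


complete = [(x, y) for x, y in data if y is not None]
xs_cc = [x for x, _ in complete]
ys_cc = [y for _, y in complete]
xs_all = [x for x, _ in data]

# Regression of y on x, estimated from the complete cases.
mx_cc, my_cc = mean(xs_cc), mean(ys_cc)
sxx_cc = mean((x - mx_cc) ** 2 for x in xs_cc)
sxy_cc = mean((x - mx_cc) * (y - my_cc) for x, y in complete)
b1 = sxy_cc / sxx_cc
b0 = my_cc - b1 * mx_cc
res_var = mean((y - b0 - b1 * x) ** 2 for x, y in complete)

# Mean and variance of x, estimated from ALL participants.
mu_x = mean(xs_all)
var_x = mean((x - mu_x) ** 2 for x in xs_all)

# Maximum likelihood estimate of the mean of y: the complete-case mean,
# corrected with the x values of participants whose y is missing.
mu_y_ml = b0 + b1 * mu_x

loglik_ml = observed_loglik(data, mu_x, var_x, b0, b1, res_var)
loglik_cc = observed_loglik(data, mx_cc, sxx_cc, b0, b1, res_var)

print(f"mean of y, complete cases only: {my_cc:.2f}")
print(f"mean of y, maximum likelihood:  {mu_y_ml:.2f}")
print(f"log-likelihood, ML estimates:            {loglik_ml:.2f}")
print(f"log-likelihood, complete-case estimates: {loglik_cc:.2f}")
```

In this made-up example y is missing mostly for participants with a high x, so the complete-case mean of y is too low. The maximum likelihood estimate moves it upward using the x values of the incomplete participants, without ever filling in their y values.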
Because sample sizes in psychological research are often small, multiple imputation is probably the better choice. It is important to educate researchers about these methods and about how to report missing data. Statistical software developers also have a responsibility to make methods like multiple imputation and maximum likelihood more accessible. Furthermore, it is suggested that listwise or pairwise deletion should no longer be the default method for handling missing data in statistical software.

Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006). Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59, 1087-1091.
Enders, C. K., & Bandalos, D. L. (2001). The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling, 8, 430-457.
