Complete data are rarely obtained in practice, especially in studies where repeated measures are taken from subjects. A major shortcoming of longitudinal or follow-up studies, primarily due to their design, is loss to follow-up, which may lead to attrition bias if the subjects who withdraw from the study are systematically different from those who complete it. Reasons for attrition include migration from the study area, death, subject fatigue and treatment side effects. In other cases, subjects might miss scheduled observation times but attend subsequent ones, resulting in intermittent missing data.
There are a number of ways in which data analysts deal with missing data in longitudinal studies, and indeed in other types of study design:
Complete case analysis: In this approach, subjects/cases without complete information are dropped from the analysis sample. This results in a loss of information, because the partially complete information of some subjects is discarded, and it may introduce bias in the estimates of the model coefficients if the data are not missing completely at random.
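To make the information loss concrete, here is a minimal sketch in plain Python. The records, the field names (subject, visit, score) and the helper complete_cases are all hypothetical and only illustrate the idea; a real analysis would normally rely on a statistics package.

```python
# Hypothetical long-format records: one row per subject per visit.
# None marks a missing measurement.
records = [
    {"subject": 1, "visit": 1, "score": 5.0},
    {"subject": 1, "visit": 2, "score": 6.0},
    {"subject": 2, "visit": 1, "score": 4.0},
    {"subject": 2, "visit": 2, "score": None},
    {"subject": 3, "visit": 1, "score": 7.0},
    {"subject": 3, "visit": 2, "score": 8.0},
]


def complete_cases(rows, key="subject", value="score"):
    """Keep only the rows of subjects with no missing value at any visit."""
    incomplete = {r[key] for r in rows if r[value] is None}
    return [r for r in rows if r[key] not in incomplete]


kept = complete_cases(records)

# Observed values thrown away only because their subject was incomplete.
dropped_observed = sum(
    1 for r in records if r not in kept and r["score"] is not None
)

print(len(kept), "rows kept;", dropped_observed, "observed value(s) discarded")
```

In this toy data, subject 2 has one missing visit, so the subject's observed first visit is discarded as well. That discarded observation is the partially complete information the paragraph above refers to.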
Last Observation Carried Forward (LOCF): This method can only be applied in a longitudinal study. Each missing value is replaced by the last observed value of that variable for the same individual/case. This way of dealing with missing values has been discouraged in the recent literature: means and precision measures such as the variance can be biased, leading to wrong inferences. We advise against using this approach to deal with missing values.
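For readers who meet LOCF in existing analyses, the mechanics can be sketched as follows. This is an illustration under simple assumptions (one subject's visits held in time order in a list, None for missing); the function name locf and the data are hypothetical, and the sketch is not a recommendation to use the method.

```python
def locf(values):
    """Replace each None with the most recent observed value before it.

    Leading missing values stay None because there is nothing to carry forward.
    """
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical measurements for one subject across five scheduled visits.
subject_visits = [10.0, None, 12.0, None, None]
filled_visits = locf(subject_visits)
print(filled_visits)
```

Note how the last two visits simply repeat the third one: the filled-in series looks more stable than the subject's real trajectory may have been, which is one way the variance can end up biased.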
Mean imputation: Under mean imputation, the missing values in a variable are replaced by the mean of the non-missing observations of that variable. It preserves the mean (the mean of the imputed variable equals the mean of the observed values) but does not preserve the relationships between variables; it may reduce or increase the correlation between the variables being studied. This approach also does not account for the uncertainty in the imputed values, since it adds no extra variance to reflect the imputation, and it is therefore less preferred than imputation techniques such as multiple imputation that do account for this uncertainty.
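A small sketch shows both properties at once: the mean is unchanged, while the spread of the variable shrinks because every imputed value sits exactly at the mean. The data and the helper mean_impute are hypothetical.

```python
import statistics


def mean_impute(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed_values = [v for v in values if v is not None]
    m = statistics.mean(observed_values)
    return [m if v is None else v for v in values]


# Hypothetical variable with two missing entries.
x = [2.0, 4.0, None, 8.0, None, 6.0]
observed = [v for v in x if v is not None]
imputed = mean_impute(x)

print("mean:", statistics.mean(observed), "->", statistics.mean(imputed))
print("variance:", statistics.variance(observed), "->", statistics.variance(imputed))
```

The sample variance after imputation is smaller than the variance of the observed values, so standard errors computed from the filled-in data would be too optimistic.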
Hot-deck imputation: In this method, each missing value is replaced with an observed response from a similar unit in the same dataset. There are several ways of implementing hot-deck imputation, for example randomly picking an observed response from the set of cases that are similar to the case needing imputation, or taking the mean of the variable among that set of similar cases. This article provides a detailed review of the various hot-deck imputation techniques. The performance of this imputation technique, in terms of the preservation of relationships between variables, differs according to the specific technique chosen.
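The random-donor variant described above can be sketched as follows. Here "similar" is assumed to mean "shares the same value of one matching variable", which is only one of many possible similarity definitions; the function hot_deck, its parameters and the survey data are hypothetical.

```python
import random


def hot_deck(rows, value, match_on, seed=0):
    """Fill missing `value` fields with a randomly drawn observed value
    from a donor that shares the same `match_on` category.

    Rows whose category has no observed donor are left missing.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Build the donor pool: observed values grouped by matching category.
    donors = {}
    for r in rows:
        if r[value] is not None:
            donors.setdefault(r[match_on], []).append(r[value])

    completed_rows = []
    for r in rows:
        r = dict(r)  # copy so the original rows are not modified
        pool = donors.get(r[match_on])
        if r[value] is None and pool:
            r[value] = rng.choice(pool)
        completed_rows.append(r)
    return completed_rows


# Hypothetical survey records; None marks a missing income.
survey = [
    {"id": 1, "region": "north", "income": 30.0},
    {"id": 2, "region": "north", "income": None},
    {"id": 3, "region": "north", "income": 34.0},
    {"id": 4, "region": "south", "income": 50.0},
    {"id": 5, "region": "south", "income": None},
    {"id": 6, "region": "east", "income": None},  # no donor available
]

completed = hot_deck(survey, value="income", match_on="region")
for row in completed:
    print(row)
```

Because every imputed value is a value that was actually observed, the filled-in data stay within a plausible range. The case from the "east" region remains missing, since this sketch has no similar donor to draw from.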
Expectation-maximisation (EM): This iterative procedure uses other variables to impute an expected value (expectation step), then checks whether that is the most likely value (maximisation step). The EM algorithm preserves the relationships with other variables, a feature that is important in regression analysis. However, it understates standard errors and should be used only when the extent of missing data is small, for instance when the proportion of missing values is not more than 5%.
Multiple imputation: This approach has three stages. First, multiple copies of the dataset are generated, with the missing values replaced by imputed values. The imputed values are sampled from their predictive distribution based on the observed data. Next, standard statistical methods are used to fit the model of interest to each of the imputed datasets. Lastly, the parameter estimates from each imputed dataset are pooled to provide a single estimate for each parameter of interest. The standard errors of these pooled estimates are calculated using rules (commonly known as Rubin's rules) that take account of the variability between the imputed datasets. Valid inferences are obtained because results are averaged over the distribution of the missing data given the observed data. Nonetheless, there are pitfalls in multiple imputation that analysts should be aware of when they contemplate using this approach.
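The pooling stage can be sketched with Rubin's rules. The sketch assumes the first two stages have already been carried out elsewhere, and the five estimates and squared standard errors below are made-up numbers for illustration only. The pooled estimate is the average of the per-dataset estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by a factor of (1 + 1/m), where m is the number of imputed datasets.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the parameter estimate from each imputed dataset
    variances: the squared standard error of that estimate in each dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between    # total variance
    return q_bar, math.sqrt(total)


# Hypothetical results for one regression coefficient from m = 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print("pooled estimate:", round(pooled_estimate, 3))
print("pooled standard error:", round(pooled_se, 3))
```

The pooled standard error is larger than the one obtained by averaging the within-imputation variances alone. That extra term is how multiple imputation carries the uncertainty about the missing values into the final inference.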