Thursday, October 1, 2015

What NOT To Do When Data Are Missing

Here's something that's very tempting, but it's not a good idea.

Suppose that we want to estimate a regression model by OLS. We have a full sample of size n for the regressors, but one of the values for our dependent variable, y, isn't available. Rather than estimate the model using just the (n - 1) complete data-points, we might think that it would be preferable to use all n observations on the regressors, and impute the missing value for y.

Fair enough, but what imputation method are you going to use?

For simplicity, and without any loss of generality, suppose that the model has a single regressor,

                y_i = β x_i + ε_i ,                                                                       (1)

and it's the nth value of y that's missing. We have values for x_1, x_2, ..., x_n; and for y_1, y_2, ..., y_{n-1}.

Here's a great idea! OLS will give us the Best Linear Predictor of y, so why don't we just estimate (1) by OLS, using the available (n - 1) sample values for x and y; use this model (and x_n) to get a predicted value (y*_n) for y_n; and then re-estimate the model with all n data-points: x_1, x_2, ..., x_n; y_1, y_2, ..., y_{n-1}, y*_n.

Unfortunately, this is actually a waste of time. Let's see why.
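
A quick numerical check makes the point concrete. What follows is only a minimal sketch in Python, using simulated data: the sample size, the "true" coefficient, the error distribution, and the random seed are all assumptions made for illustration, and ols_slope is a hypothetical helper for the no-intercept model in (1).

```python
import random


def ols_slope(x, y):
    """OLS slope for a regression through the origin: sum(x*y) / sum(x^2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


random.seed(0)   # arbitrary seed, for reproducibility only
n = 20           # assumed sample size
beta = 2.0       # assumed "true" coefficient used to simulate the data

x = [random.uniform(1.0, 10.0) for _ in range(n)]
# Only the first n - 1 values of y are observed; the nth is missing.
y_obs = [beta * xi + random.gauss(0.0, 1.0) for xi in x[:-1]]

# Step 1: estimate (1) by OLS from the n - 1 complete observations.
b_partial = ols_slope(x[:-1], y_obs)

# Step 2: impute the missing nth value of y from the fitted model.
y_star_n = b_partial * x[-1]

# Step 3: re-estimate the model using all n data-points.
b_full = ols_slope(x, y_obs + [y_star_n])

print("slope from n - 1 points:      ", b_partial)
print("slope after imputing nth point:", b_full)
```

The two printed slopes agree up to floating-point rounding. The imputed point lies exactly on the line fitted in Step 1, so its residual is zero and it adds nothing to the estimate.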