Missing values are common in real datasets. When only a few values are missing, the affected records can be deleted listwise; in other words, the entire record is removed from the analysis. The problem is that even a small amount of missing data can translate into a large number of omitted records. In the example below, about two-thirds of the records would be omitted because of missing values.
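To make the cost of listwise deletion concrete, here is a minimal sketch that compares the share of missing cells with the share of rows that would be dropped. The seed, the object name `dat`, and the use of `sample()` to pick the missing cells are assumptions made for illustration; they are not part of the script later in this post.

```r
# Simulate data of the same shape used later in the post: 50 rows, 10 columns.
set.seed(1)  # arbitrary seed, only for reproducibility
dat <- replicate(10, rpois(50, 100))

# Blank out 50 distinct cells chosen uniformly at random (10% of all cells).
dat[sample(length(dat), 50)] <- NA

cell.frac.missing <- mean(is.na(dat))            # share of individual values missing
row.frac.dropped  <- mean(!complete.cases(dat))  # share of rows listwise deletion removes

cell.frac.missing
row.frac.dropped
```

Because a single missing cell is enough to drop a whole row, the second number is typically much larger than the first.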

The distribution of the missing values in the data is very important. Data that are missing at random are less of a problem than data whose missingness follows a pattern that depends, at least to some extent, on the values of the variables that are missing.
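A small simulation can illustrate the difference. In this sketch the variable names, the sample size, and the missingness rates are all assumptions chosen for illustration: one copy of the data loses values purely at random, while the other loses only values above the true mean.

```r
set.seed(2)  # arbitrary seed, only for reproducibility
x <- rnorm(10000, mean = 100, sd = 10)

# Case 1: every value has the same 30% chance of being missing.
x.random <- x
x.random[runif(length(x)) < 0.3] <- NA

# Case 2: only values above the true mean can go missing (60% chance each),
# so missingness depends on the very value that is lost.
x.pattern <- x
x.pattern[x > 100 & runif(length(x)) < 0.6] <- NA

mean(x)
mean(x.random, na.rm = TRUE)   # stays close to the full-data mean
mean(x.pattern, na.rm = TRUE)  # pulled downward
```

In the first case the observed values remain a fair sample of the whole; in the second, simply ignoring the missing values biases the estimate.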

There are many approaches to imputing missing data. The easiest is to calculate the mean of each variable and substitute it for each of that variable's missing values. The problem is that this reduces the variance and the absolute value of the covariance. Another common approach is *Expectation – Maximization* (EM). This technique iterates over the data, alternately filling in the missing values and re-estimating the means and covariances, so that the covariance structure of the data is preserved.
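The shrinkage caused by mean imputation is easy to demonstrate. The following is a minimal sketch on simulated data; the variable names, the coefficients, and the 30% missingness rate are assumptions chosen for illustration.

```r
set.seed(3)  # arbitrary seed, only for reproducibility
n <- 1000
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.6)  # y is correlated with x

# Remove 30% of y completely at random.
y.miss <- y
y.miss[sample(n, 300)] <- NA

# Mean imputation: replace every NA with the mean of the observed values.
y.mean.imp <- ifelse(is.na(y.miss), mean(y.miss, na.rm = TRUE), y.miss)

var(y.miss, na.rm = TRUE)             # variance of the observed values
var(y.mean.imp)                       # smaller after mean imputation
cov(x, y.miss, use = "complete.obs")  # covariance from complete pairs
cov(x, y.mean.imp)                    # closer to zero after mean imputation
```

Every imputed value sits exactly at the mean, so it contributes nothing to the sums of squares and cross-products while still increasing the sample size. That is why both statistics shrink.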

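[sourcecode language="r"]
library(e1071)  # provides rdiscrete()

# Simulate 50 records of 10 Poisson(100) variables
raw <- replicate(10, rpois(50, 100))
raw.orig <- raw  # keep the complete data for reference

# Draw 50 cell indices (with replacement, equal probability) and set them to NA
rand.miss <- rdiscrete(50, probs = rep(1, length(raw)), values = seq(1, length(raw)))
raw[rand.miss] <- NA
raw <- data.frame(raw)

# Covariance matrix after listwise deletion
var(na.omit(raw))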
# max.iter is a safety cap added so that the loop always terminates
EMalg <- function(x, tol = .001, max.iter = 500) {
  missvals <- is.na(x)
  new.impute <- x
  old.impute <- x
  count.iter <- 1
  reach.tol <- 0
  # Starting estimates come from the complete records only
  sig <- as.matrix(var(na.exclude(x)))
  mean.vec <- as.matrix(apply(na.exclude(x), 2, mean))
  while (reach.tol != 1 && count.iter <= max.iter) {
    for (i in 1:nrow(x)) {
      pick.miss <- c(missvals[i, ])
      if (sum(pick.miss) != 0) {
        # we need the inverse of the covariance of the observed variables
        inv.S <- solve(sig[!pick.miss, !pick.miss])
        # Run the EM: replace the missing values with their conditional mean
        # given the observed ones, mu_m + S_mo %*% solve(S_oo) %*% (x_o - mu_o)
        new.impute[i, pick.miss] <- mean.vec[pick.miss] +
          sig[pick.miss, !pick.miss] %*%
          inv.S %*%
          (t(new.impute[i, !pick.miss]) - t(t(mean.vec[!pick.miss])))
      }
    }
    # Re-estimate the covariance matrix and the means from the completed data
    sig <- var(new.impute)
    mean.vec <- as.matrix(apply(new.impute, 2, mean))
    # we don't want this to run on the first iteration or else it fails
    # (old.impute still contains NA at that point)
    if (count.iter > 1) {
      # converged only when every cell changed by no more than tol
      if (all(abs(as.matrix(old.impute) - as.matrix(new.impute)) <= tol)) {
        reach.tol <- 1
      }
    }
    count.iter <- count.iter + 1 # used for debugging purposes to ensure the process is iterating properly
    old.impute <- new.impute
  }
  return(new.impute)
}
raw.imputed <- EMalg(raw, tol = .0001)

# Covariance matrix after imputation, for comparison with var(na.omit(raw))
var(raw.imputed)

plot(raw.imputed[, 1], raw.imputed[, 2], pch = 16, main = "Scatterplot of Missing Data",
     sub = "Missing Values in Red", xlab = "X", ylab = "Y")
# overlay the imputed values on the plot (same two columns as the base plot)
plot.imputed <- raw.imputed[
  row.names(
    subset(raw, is.na(raw[, 1]) | is.na(raw[, 2]))
  ), ]
points(plot.imputed[, 1], plot.imputed[, 2], pch = 16, col = 'red')
[/sourcecode]

Once the missing values have been imputed, it is important to review the data and run the standard assumption tests before proceeding with further analysis. This is only one of many approaches to imputing missing data. Others include random forests and machine learning approaches that train the classifier directly on data containing missing values.
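When the data are simulated, as in this post, one simple review step is to compare the imputed cells with the true values that were blanked out. The sketch below is only an illustration: `imputation.rmse` is a hypothetical helper, not a function from any package, and column-mean imputation is used as a stand-in so the example runs on its own.

```r
# Hypothetical helper: RMSE between true and imputed values,
# computed only on the cells that were originally missing.
imputation.rmse <- function(truth, with.na, imputed) {
  miss <- is.na(as.matrix(with.na))
  sqrt(mean((as.matrix(truth)[miss] - as.matrix(imputed)[miss])^2))
}

set.seed(4)  # arbitrary seed, only for reproducibility
truth <- replicate(10, rpois(50, 100))
with.na <- truth
with.na[sample(length(truth), 50)] <- NA

# Stand-in imputation (column means), used only to keep the example self-contained.
col.means <- colMeans(with.na, na.rm = TRUE)
mean.imputed <- with.na
for (j in seq_len(ncol(mean.imputed))) {
  mean.imputed[is.na(mean.imputed[, j]), j] <- col.means[j]
}

rmse <- imputation.rmse(truth, with.na, mean.imputed)
rmse
```

The same helper could be called with the complete data, the data containing `NA`, and the output of any other imputation method to compare them on an equal footing.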

Imputing Missing Data With Expectation – Maximization

http://t.co/HEfCThTBCN

Can you elaborate on lines 30-33 in the script (the new.impute[i,pick.miss] assignment)?

I believe it takes the mean of the column that is missing a value and adds to it the row of the covariance matrix (of the complete data) for the missing column, multiplied by the inverse covariance matrix of the complete data with the missing column removed, multiplied by the difference between the observed values for that row and their means.

But why does that work? Is there an intuition there, perhaps something in linear algebra that I am missing?