Imputing Missing Data With Expectation – Maximization

It is fairly common to find missing values in a dataset. A few missing values are generally not a problem, and the affected records can be deleted listwise; in other words, the entire record is removed from the analysis. The problem is that even a limited amount of missing data can translate into a significant number of omitted records. In the example below, about two-thirds of the records would be omitted due to missing values.
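To see how quickly listwise deletion eats into a dataset, here is a rough sketch in base R. It assumes a 50 x 10 table with 50 cell positions blanked out at random, which mirrors the simulated data used in this post; the names `demo`, `miss.idx`, `cell.frac` and `row.frac` are made up for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# 50 records, 10 variables, no missing values yet
demo <- matrix(rpois(500, 100), nrow = 50, ncol = 10)

# blank out 50 cell positions chosen at random (with replacement,
# so slightly fewer than 50 distinct cells may end up missing)
miss.idx <- sample(length(demo), 50, replace = TRUE)
demo[miss.idx] <- NA

cell.frac <- mean(is.na(demo))            # share of cells that are missing
row.frac  <- mean(!complete.cases(demo))  # share of records lost to listwise deletion

cell.frac
row.frac
```

No more than 10% of the cells are missing, yet the share of records lost is much larger, because a single missing cell is enough to drop the whole row. Under this setup the chance that a given row stays complete is 0.98^50, or roughly 36%, so close to two-thirds of the records are expected to be dropped.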

The distribution of the missing values in the data is very important. Data that are missing at random are less serious than data with a pattern of missing values that depends, at least to some extent, on the values of the missing variables themselves.
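A small simulation makes the difference concrete. This sketch assumes a single Poisson variable and two hypothetical missingness mechanisms: one that drops values purely at random, and one that drops every value above 110. The variable names, the 20% rate and the cutoff are all illustrative choices, not part of the example in this post.

```r
set.seed(2)  # arbitrary seed so the sketch is repeatable
x <- rpois(1000, 100)

# mechanism 1: each value has a 20% chance of going missing, regardless of its size
x.random <- x
x.random[runif(length(x)) < 0.2] <- NA

# mechanism 2: missingness depends on the value itself (everything above 110 is lost)
x.pattern <- x
x.pattern[x > 110] <- NA

mean(x)                         # full-data mean
mean(x.random, na.rm = TRUE)    # mean of what is left under mechanism 1
mean(x.pattern, na.rm = TRUE)   # mean of what is left under mechanism 2
```

The mean under the random mechanism should land close to the full-data mean, while the second mechanism pulls the mean down because only large values are removed.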

There are many approaches to imputing missing data. The easiest is to calculate the mean of each variable and substitute it for each of that variable's missing values. The problem is that this reduces the variance and the absolute value of the covariance. Another common approach is Expectation – Maximization. This technique iterates over the data, re-estimating the means and covariances and updating the imputed values until they stop changing, while aiming to preserve the covariance structure of the data.
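The shrinkage caused by mean substitution is easy to check. The sketch below assumes one variable with 30 of its 100 values removed at random; `y`, `y.miss` and `y.mean.imp` are illustrative names.

```r
set.seed(3)  # arbitrary seed so the sketch is repeatable
y <- rpois(100, 100)

# remove 30 distinct values at random
y.miss <- y
y.miss[sample(length(y), 30)] <- NA

# mean substitution: every missing value gets the observed mean
y.mean.imp <- y.miss
y.mean.imp[is.na(y.mean.imp)] <- mean(y.miss, na.rm = TRUE)

var(y.miss, na.rm = TRUE)   # variance of the 70 observed values
var(y.mean.imp)             # variance after mean substitution
```

Because the imputed cells sit exactly at the mean, they add nothing to the sum of squares while the sample size grows from 70 to 100, so the variance in this setup shrinks by a factor of 69/99.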


library(e1071)

raw <- replicate(10, rpois(50,100))
raw.orig <- raw

# pick 50 cell positions, each equally likely (duplicates are possible)
rand.miss <- rdiscrete(50, probs=rep(1,length(raw)), values=seq(1,length(raw)) )
raw[rand.miss] <- NA

raw <- data.frame(raw)

EMalg <- function(x, tol=.001){
  x <- as.matrix(x)            # work on a numeric matrix internally
  missvals <- is.na(x)
  new.impute <- x
  old.impute <- x
  count.iter <- 1
  reach.tol <- 0
  # starting values come from the complete records only
  sig <- var(na.omit(x))
  mean.vec <- colMeans(na.omit(x))

  while(reach.tol != 1) {
    for(i in 1:nrow(x)) {
      pick.miss <- missvals[i,]
      if ( sum(pick.miss) != 0 ) {
        # we need the inverse of the covariance of the observed variables
        inv.S <- solve(sig[!pick.miss,!pick.miss,drop=FALSE])

        # Run the EM: conditional mean of the missing values given the
        # observed ones, mu_m + S_mo %*% S_oo^-1 %*% (x_o - mu_o)
        new.impute[i,pick.miss] <- mean.vec[pick.miss] +
          as.vector( sig[pick.miss,!pick.miss,drop=FALSE] %*%
                     inv.S %*%
                     (new.impute[i,!pick.miss] - mean.vec[!pick.miss]) )
      }
    }

    # re-estimate the covariance and means from the filled-in data
    sig <- var(new.impute)
    mean.vec <- colMeans(new.impute)

    if(count.iter > 1){ # skip the first iteration, when old.impute still contains NAs
      # converged only when every cell changed by no more than tol
      if( max(abs(old.impute - new.impute)) <= tol ) {
        reach.tol <- 1
      }
    }

    count.iter <- count.iter+1 # used for debugging purposes to ensure the process is iterating properly
    old.impute <- new.impute
  }

  return(as.data.frame(new.impute))
}
raw.imputed <- EMalg(raw, tol=.0001)

# compare the covariance from complete records only with the covariance after imputation
var(na.omit(raw))
var(raw.imputed)
plot(raw.imputed[,1], raw.imputed[,2], pch=16, main="Scatterplot of Missing Data",
sub="Missing Values in Red", xlab="X",ylab="Y")

# overlay the records that had a missing value in either of the two plotted columns

miss.rows <- is.na( raw[,1] ) | is.na( raw[,2] )
points(raw.imputed[miss.rows,1], raw.imputed[miss.rows,2], pch=16, col='red')


Example Graph of Missing Data

Once the missing values have been imputed, it is important to review the data and run the standard assumption tests before proceeding with further analysis. This is one of many approaches for imputing missing data. Other approaches include random forests, or machine learning methods that train a classifier directly on data with missing values.
