Imputing Missing Data With Expectation – Maximization

It can be fairly common to find missing values in a dataset. Having only a few missing values isn’t generally a problem and those records can be deleted listwise. In other words the entire record is simply removed from the analysis. The problem is even with a limited amount missing data, that can translate into a significant number of records that are omitted. In the example below, about two-thirds of the records would end up being omitted due to missing values.

The distribution of the missing values in the data is very important. If the data are missing at random then that is less serious than when there is a pattern of missing value that are, at least to some extent, dependent on the missing variables.

There are many approaches that can be used to impute missing data. The easiest way is to simply calculate the mean of each variable and substitute that for each of the missing values. The problem with this is that it reduces the variance and the absolute value of the covariance. Another common approach is called Expectation – Maximization. This technique iteratively goes through the data while still preserving the covariance structure of the data.


raw < - replicate(10, rpois(50,100)) raw.orig <- raw rand.miss <- rdiscrete(50,probs=rep(1:length(raw)), values=seq(1,length(raw)) ) raw[rand.miss] <- NA raw <- data.frame(raw) var(na.omit(raw) ) var(raw.imputed) EMalg <- function(x, tol=.001){ missvals <- new.impute<-x old.impute <- x count.iter <- 1 reach.tol <- 0 sig <- as.matrix(var(na.exclude(x))) mean.vec <- as.matrix(apply(na.exclude(x),2,mean)) while(reach.tol != 1) { for(i in 1:nrow(x)) { pick.miss <-( c( missvals[i,]) ) if ( sum(pick.miss) != 0 ) { inv.S <- solve(sig[!pick.miss,!pick.miss]) # we need the inverse of the covariance # Run the EM new.impute[i,pick.miss] <- mean.vec[pick.miss] + sig[pick.miss,!pick.miss] %*% inv.S %*% (t(new.impute[i,!pick.miss])- t(t(mean.vec[!pick.miss]))) } } sig <- var((new.impute)) mean.vec <- as.matrix(apply(new.impute,2,mean)) if(count.iter > 1){ # we don’t want this to run on the first iteration or else if fails
for(l in 1:nrow(new.impute)){
for(m in 1:ncol(new.impute)){
if( abs((old.impute[l,m]-new.impute[l,m])) > tol ) {
reach.tol < - 0 } else { reach.tol <- 1 } } } } count.iter <- count.iter+1 # used for debugging purposes to ensure process it iterating properly old.impute <- new.impute } return(new.impute) } raw.imputed <- EMalg(raw, tol=.0001) plot(raw.imputed[,1], raw.imputed[,2], pch=16, main="Scatterplot of Missing Data", sub="Missing Values in Red", xlab="X",ylab="Y") # overlay the imputed values on the plot plot.imputed <- raw.imputed[ row.names( subset(raw, raw[,2] ) | raw[,3]) ) ),] points(plot.imputed[,2],plot.imputed[,3], pch=16, col='red') &nbsp; [/sourcecode]

Example Graph of Missing Data

Once the missing values are established it is important to review the data and do the standard assumption tests before proceeding with further analysis.  This is one of many approaches for imputing missing data.  Other approaches include random forests or some machine learning approaches to train the classifier directly over the missing data.

Some Common Approaches for Analyzing Likert Scales and Other Categorical Data

Analyzing Likert scale responses really comes down to what you want to accomplish (e.g. Are you trying to provide a formal report with probabilities or are you trying to simply understand the data better). Sometimes a couple of graphs are sufficient and a formalize statistical test isn’t even necessary. However, with how easy it is to conduct some of these statistical tests it is best to just formalize the analysis. There are several approaches that can be used. Here are just a few of them.

The code to set up the data for some testing is as follows.  Note that this is the same code used in Plotting Likert Scales:

probs < - cbind(c(.4,.2/3,.2/3,.2/3,.4),c(.1/4,.1/4,.9,.1/4,.1/4),c(.2,.2,.2,.2,.2))
my.n <- 100
my.len <- ncol(probs)*my.n
raw <- matrix(NA,nrow=my.len,ncol=2)
raw <- NULL
for(i in 1:ncol(probs)){
raw <- rbind(raw, cbind(i,rdiscrete(my.n,probs=probs&#91;,i&#93;,values=1:5)))
raw <- data.frame(raw)
names(raw) <- c("group","value")
raw$group <- as.factor(raw$group)
raw.1.2 <- subset(raw, raw$group %in% c(1,2))



I might as well get this one out of the way. It sure is easy to take this approach which helps explains why this is probably one of the more controversial approaches. Even something like Excel will spit this out without much thought. You have to stretch the assumptions of the t-test to its outer limits. So if taking this approach one must very carefully verify the t-test assumptions. Most notably:

-Z Follows a Standard Normal
-Variance S^2 Follows a Chi Square Distribution
-Variance From two Populations should Be Equal (unless using Welch's test).
-Two Populations Should Be Independent

Two independent populations assumption gets a little sticky unless you truly are looking at two different populations from your data (e.g. Male/Female or Hispanic/Non-Hispanic). In other words just because you're comparing two different questions from your questionnaire doesn't necessarily mean you have two independent populations.

&#91;sourcecode language="css"&#93;

> t.test(raw.1.2$value ~ raw.1.2$group, var.equal=TRUE)

Two Sample t-test

data: raw.1.2$value by raw.1.2$group
t = 2.5622, df = 198, p-value = 0.01114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1151745 0.8848255
sample estimates:
mean in group 1 mean in group 2
3.4 2.9


This ends up being the better approach and it’s relatively easy to understand. The chi square test is designed to handle categorical frequency data and test the association between two variables.

When the sample size is too small and the assumptions of the chi square test no longer are satisfied then an alternative option is to use Fisher’s Exact Test. The classical example of this is Fisher’s Lady Tasting Tea problem. Though this is designed for a 2×2 table there are ways to generalize it to larger tables and R makes it quite simple.

( c.test < - chisq.test(raw$group, raw$value) )

Pearson's Chi-squared test

data: raw$group and raw$value
X-squared = 195.1726, df = 8, p-value < 2.2e-16



In the above example some of the cells are quite small which could mean that the chi square approach may not work.  So Fisher's 2 x 2 test can be expanded and we can test this data.  However, keep in mind the assumptions on the martingale being fixed. Due to limitations in workspace size in R I have just found it easiest to simulate the p-value and achieve the desired outcome that way.

&#91;sourcecode language="css"&#93;

sim.table <- table(raw$group, raw$value)

fisher.test(sim.table, simulate.p.value=TRUE, B=1e6) # Simulate due to workspace limitations

Fisher's Exact Test for Count Data with simulated p-value (based on 1e+06

data: sim.table
p-value = 1e-06
alternative hypothesis: two.sided



This is used when the data come from a related sample and are from the same population. In other words this works well on a matched pairs sample.  So assuming the group 1 and group 2 come from the same population and are just a different measurement we can take this approach.

&#91;sourcecode language="css"&#93;

wilcox.test(raw.1.2$value&#91;raw.1.2$group==1&#93;, raw.1.2$value&#91;raw.1.2$group==2&#93;, paired=TRUE)



This tests whether two independent samples are the same.  In this case the only difference between the Mann-Whitney test and the Wilcoxon Signed Test is that the paired sample is specified in the Wilcoxon Signed Test

&#91;sourcecode language="css"&#93;

wilcox.test(raw.1.2$value&#91;raw.1.2$group==1&#93;, raw.1.2$value&#91;raw.1.2$group==2&#93;)


Analysis of Variance equivalent for categorical data.   I feel that this is probably very underused.  This is probably do to ANOVA being beyond the scope of most casual analysts and then throwing in categorical data makes it that much more obscure.  Like the ANOVA is also assumes independent populations.  But once you understand exactly what you're testing and what type of data you're dealing with the implementation of the test is quite simple:

&#91;sourcecode language="css"&#93;

kruskal.test(raw$value ~ raw$group)

Kruskal-Wallis rank sum test

data: raw$value by raw$group
Kruskal-Wallis chi-squared = 13.9105, df = 2, p-value = 0.0009536


This can be compared to the parametric ANOVA

&#91;sourcecode language="css"&#93;

fit < - lm(raw$value ~ raw$group)

Analysis of Variance Table

Response: raw$value
Df Sum Sq Mean Sq F value Pr(>F)
raw$group 2 14.91 7.4533 4.0289 0.01878 *
Residuals 297 549.44 1.8500
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Using R to Produce Scalable Vector Graphics for the Web

Statistical software is normally used during the analysis stage of a project and a cleaned up static graphic is created for the presentation.  If the presentation is in web format then there are some considerations that are needed. The trick is to find ways to implement those graphs in that web format so the graph is of the highest possible quality.  If all that is needed is an image then simply saving the graph as a JPG or PNG and posting it to a website is quite simple and usually sufficient.  However, if more flexibility and higher quality is needed then some additional work will be needed.

One of the best way to present a graph is using vectors (as opposed to raster graphics).  What this means is that if one uses vectors graphics then a user can zoom in and there won’t be any degradation in image quality.  Several formats support vector graphics including PDF and SVG.

Scalable Vector Graphics are a great way to put together graphs using an XML-based format. This means it can be easily implemented directly into a website and, as an added bonus, it can become a dynamic image changing with user input.   R will generate the base structure of the graphic but dynamic SVG requires a bit more work outside of R.   Most modern browsers (IE 8 is not considered modern anymore so it is not supported) support this type of graphic format . Supported browsers include  IE 9, Firefox, and Chrome.

Using the example from a previous post I can convert the image into Scalable Vector Graphic.  The code that R produces into the SVG file can be copied and pasted directly into a web page. I have a side-by-side comparison of the graphs using earthquake data from the week prior to June 28, 2013.

Earthquakes Week Prior to 2013-06-08

svg(filename = “c:\\eq.svg” ,
width = 7, height = 7

eq = read.table(file=””, fill=TRUE, sep=”,”, header=T) < - map("state", interior = FALSE, plot=F) x.lim <-$range[1:2]; x.lim[1] <- x.lim[1]-1; x.lim[2] <- x.lim[2]+1; y.lim <-$range[3:4]; y.lim[1] <- y.lim[1]-1; y.lim[2] <- y.lim[2]+1; map("state", interior = FALSE, xlim=x.lim, ylim=y.lim) map("state", boundary = FALSE, col="gray", add = TRUE) title("Magnitude 1+ Earthquakes Over the Past 7 Days") eq$mag.size <- NULL eq$mag.size[eq$Magnitude>=1 & eq$Magnitude < 2] < - .75 eq$mag.size[eq$Magnitude>=2 & eq$Magnitude < 3] < - 1.0 eq$mag.size[eq$Magnitude>=3 & eq$Magnitude < 4] < - 1.5 eq$mag.size[eq$Magnitude>=4] < - 2.0 eq$mag.col <- NULL eq$mag.col[eq$Magnitude>=1 & eq$Magnitude < 2] < - 'blue' eq$mag.col[eq$Magnitude>=2 & eq$Magnitude < 3] < - 'green' eq$mag.col[eq$Magnitude>=3 & eq$Magnitude < 4] < - 'orange' eq$mag.col[eq$Magnitude>=4] < - 'red' points(x=eq$Lon,y=eq$Lat,pch=16,cex=eq$mag.size, col=eq$mag.col) eq$magnitude.text <- eq$Magnitude eq$magnitude.text[eq$Magnitude < 4] <- NA text(x=eq$Lon,y=eq$Lat,col='black',labels=eq$magnitude.text,adj=c(2.5),cex=0.5) legend('bottomright',c('M 1-2','M 2-3','M 3-4','M4+'), ncol=2, pch=16, col=c('blue','green','orange','red')) box() [/sourcecode]   The package ggplot2 has a function that will identify that one wants an SVG file based on the filename provided.  This graph shows the depth of the earthquake compared to the magnitude.


( g <- ggplot(eq, aes(x = Magnitude, y = Depth)) + geom_point() + geom_point(data = eq, color = eq$mag.col, size = 3) ) ggsave(g, file="c:\\eqDepthMagnitude.svg") [/sourcecode]