Latent Class Modeling Election Data

Latent class analysis is a useful tool that is used to identify groups within multivariate categorical data.  An example of this is the likert scale. In categorical language these groups are known as latent classes. As a simple comparison this can be compared to the k-means multivariate cluster analysis. There are several key differences between the two methods. First, latent class analysis assigns observations to groups based on probability while k-means cluster analysis absolutely assigns observations to groups. While k-means is readily available in many software packages it is only appropriate for continuous data. Latent class analysis is not as widely available in many software packages but it is designed to handle categorical data.

There are a handful of latent class analysis software packages. Probably the best and most common is Latent Gold. However, the license can be somewhat cost prohibitive. This is particularly true if your daily routine does not include latent class modeling. Currently, SPSS does not include latent class analysis. IBM, the company that owns SPSS, has indicated that the enhancement request for latent class analysis has been added to SPSS Development. For SAS users there is proc lca, but once again that is somewhat cost prohibitive. On the open source side of things there are the R packages poLCA and MCLUST. Unless one needs the many features available in Latent Gold these R packages will generally be sufficient for data analysis.

In general latent class modeling has the following R code structure:



## An example of simulating likert scale data
probs = cbind(c(.4,.2/3,.2/3,.2/3,.4),c(.1/4,.1/4,.9,.1/4,.1/4),c(.2,.2,.2,.2,.2))
my.n = 1000
my.len = ncol(probs)*my.n
raw = matrix(NA,nrow=my.len,ncol=3)
raw = NULL
for(i in 1:ncol(probs)){
raw = rbind(raw, cbind(i,rdiscrete(my.n,probs=probs[,i],values=1:5)))
raw = data.frame(id = seq(1,my.n),raw)

# An example of how to transform data back from normalized data to a flat file
raw.flat = dcast(raw, id ~ i, value.var=”V2″)
names(raw.flat) = c(“id”,”A”,”B”,”C”)

# Simulation example of latent class models
f = cbind(B, C) ~ A;
lca.fit1 < - poLCA(f,raw.flat,nclass=1, nrep=5); lca.fit2 <- poLCA(f,raw.flat,nclass=2, nrep=5); f = cbind(A, B, C)~1; lca.fit1 <- poLCA(f,raw.flat,nclass=1, nrep=5); lca.fit2 <- poLCA(f,raw.flat,nclass=2, nrep=5); [/sourcecode]   ANES 2000 The following is an example of how one can analyze data from the American National Election Study (ANES).  This is an election study conducted for each election year.  This is a built-in data frame for the R package, and it is from 2000. However, I would recommend going to ElectionStudies  and then go to their Data Center to get the most recent dataset from 2012. Additionally, for great data on election analysis I would strongly encourage the National Election Pool Exit Poll data.  There are some great analyses that can be obtained through those data.  However, the raw data is a bit more difficult to obtain (as of today the Roper Center has disabled all access to the raw data).  Consequently, the analysis is fairly limited.   LCA American National Election Studies


# Example dataset from the poLCA package
# build the model with PARTY as the covariate
f < - cbind(MORALG,CARESG,KNOWG,LEADG,DISHONG,INTELG, MORALB,CARESB,KNOWB,LEADB,DISHONB,INTELB)~PARTY # Run LCA on the ANES 2000 dataset 3 classes anes2000 <- poLCA(f,election,nclass=3,nrep=5) # Build a matrix to prepare for graphing my.mat.max = 15 my.mat <- cbind(1,c(1:my.mat.max)) exb <- exp(pidmat %*% anes2000$coeff) # Run the matrix plot matplot(c(1:my.mat.max),(cbind(1,exb)/(1+rowSums(exb))),ylim=c(0,1),type="l", main="Party ID as a predictor of candidate affinity class", xlab="Party ID: strong Democratic (1) to strong Republican (7)", ylab="Probability of latent class membership",lwd=2,col=c('blue','green','red')) text(5.9,0.35,"Other") text(5.4,0.7,"Bush affinity") text(2.5,0.6,"Gore affinity") [/sourcecode]   National Election Pool Exit Poll 2012 Here is another example using the 2012 National Election Pool Exit Poll.  In this example I simply pull the data directly from the tables. This is to be used as a basic example and there are quite a few caveats (e.g. rounding, weighting, item nonresponse, using candidate vote, etc.) on creating a raw dataset this way  but the latent class model concept remains the same.  Also, the All Other category is not broken out by age so I simply divide out (through probably not a completely accurate approach) the count evenly. The n is 26565 so that will be the baseline.  Any member of the National Election Pool’s websites (ABC, CBS, CNN, Fox, NBC) can be used for this data.  Note that for some reason CBS has very wrong marginal data on their site for this table.

Probability of Latent Class Membership

# Cell counts pulled directly from the tables and based on n of 26565

table.raw = rbind(
cbind( rep(‘W’, 1286), rep(’18-29′, 1286), rep(‘O’, 1286) ),
cbind( rep(‘W’, 3395), rep(’30-44′, 3395), rep(‘O’, 3395) ),
cbind( rep(‘W’, 5239), rep(’45-64′, 5239), rep(‘O’, 5239) ),
cbind( rep(‘W’, 2417), rep(’65+’, 2417), rep(‘O’, 2417) ),
cbind( rep(‘B’, 534), rep(’18-29′, 534), rep(‘O’, 534) ),
cbind( rep(‘B’, 404), rep(’30-44′, 404), rep(‘O’, 404) ),
cbind( rep(‘B’, 404), rep(’45-64′, 404), rep(‘O’, 404) ),
cbind( rep(‘B’, 104), rep(’65+’, 104), rep(‘O’, 104) ),
cbind( rep(‘H’, 967), rep(’18-29′, 967), rep(‘O’, 967) ),
cbind( rep(‘H’, 749), rep(’30-44′, 749), rep(‘O’, 749) ),
cbind( rep(‘H’, 741), rep(’45-64′, 741), rep(‘O’, 741) ),
cbind( rep(‘H’, 247), rep(’65+’, 247), rep(‘O’, 247) ),
cbind( rep(‘O’, 197), rep(’18-29′, 197), rep(‘O’, 197) ),
cbind( rep(‘O’, 197), rep(’30-44′, 197), rep(‘O’, 197) ),
cbind( rep(‘O’, 197), rep(’45-64′, 197), rep(‘O’, 197) ),
cbind( rep(‘O’, 197), rep(’65+’, 197), rep(‘O’, 197) ),

cbind( rep(‘W’, 1490), rep(’18-29′, 1490), rep(‘R’, 1490) ),
cbind( rep(‘W’, 1339), rep(’30-44′, 1339), rep(‘R’, 1339) ),
cbind( rep(‘W’, 2388), rep(’45-64′, 2388), rep(‘R’, 2388) ),
cbind( rep(‘W’, 1302), rep(’65+’, 1302), rep(‘R’, 1302) ),
cbind( rep(‘B’, 247), rep(’18-29′, 247), rep(‘R’, 247) ),
cbind( rep(‘B’, 627), rep(’30-44′, 627), rep(‘R’, 627) ),
cbind( rep(‘B’, 648), rep(’45-64′, 648), rep(‘R’, 648) ),
cbind( rep(‘B’, 162), rep(’65+’, 162), rep(‘R’, 162) ),
cbind( rep(‘H’, 85), rep(’18-29′, 85), rep(‘R’, 85) ),
cbind( rep(‘H’, 40), rep(’30-44′, 40), rep(‘R’, 40) ),
cbind( rep(‘H’, 56), rep(’45-64′, 56), rep(‘R’, 56) ),
cbind( rep(‘H’, 16), rep(’65+’, 16), rep(‘R’, 16) ),
cbind( rep(‘O’, 61), rep(’18-29′, 61), rep(‘R’, 61) ),
cbind( rep(‘O’, 61), rep(’30-44′, 61), rep(‘R’, 61) ),
cbind( rep(‘O’, 61), rep(’45-64′, 61), rep(‘R’, 61) ),
cbind( rep(‘O’, 61), rep(’65+’, 61), rep(‘R’, 61) )

exitpoll2012 = data.frame(table.raw)
names(exitpoll2012) = c(“RACE”,”AGE”,”VOTE”)
table(table.raw[,1], table.raw[,2])
table(table.raw[,1], table.raw[,3])
f <- cbind(AGE, RACE)~VOTE xp.lca <- poLCA(f,exitpoll2012,nclass=2) table(exitpoll2012$AGE) # Build a matrix to prepare for graphing my.mat.max = 4 my.mat <- cbind(1,c(1:my.mat.max)) exb <- exp(my.mat %*% xp.lca$coeff) # Run the matrix plot matplot(c(1:my.mat.max),(cbind(1,exb)/(1+rowSums(exb))),ylim=c(0,1),type="l", main="Candidate Vote as a Predictor of Candidate Affinity Class using Voter Race and Age", xlab="Candidate Vote: Obama (1) Romney (2)", ylab="Probability of latent class membership",lwd=2,col=c('blue','red')) text(1.4,0.25,"Romney Leaning") text(1.4,0.8,"Obama Leaning") [/sourcecode]  

Hey, I Just Did a Significance Test!

I’ve seen it happens quite often. The sig test. Somebody simply needs to know the p-value and that one number will provide all of the information about the study that they need to know. The dataset is presented and the client/boss/colleague/etc invariably asks the question “is it significant?” and “what’s the correlation?”. To quote R.A. Fisher “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

So obviously in the previous example there are probably lots of problems ranging from the person making the request doesn’t know what they want to a bigger problem of the study itself may have core structural problems that undermines the overall integrity of any results. I want to focus on the former and share some thoughts on how to do a couple of hypothesis tests and selecting the appropriate test. This is mainly a topic of parametric vs. non-parametric tests. I’ll show the one most people are probably most familiar (parametric tests) and then show an alternative, more appropriate hypothesis test. I won’t get into the philosophical topics of hypothesis testing or whether .04999 is significant and .05001 is not significant. This simply provides some hypothesis testing options that too often get overlooked in market and/or social research. Yes,  parametric tests provide more power but not everything in this world fits nicely into one package with a bow on top.

Box Plot

What’s the Correlation

I’ve had to happen more often that I would care to count. I’m given a dataset and then asked what’s the “correlation”? After some back and forth I find that they want the “correlation” on all permutations (with p-value) of the questions on the questionnaire. So here goes:

raw = data.frame( id=seq(1,n), replicate(2, rdiscrete(n, probs=c(1,1,1,1,1), values=c(1,2,3,4,5)) ) )

Pearson’s Correlation or, more formally, Pearson product-moment correlation coefficient. This is the correlation with which most people are familiar. Be sure to check the assumption (i.e. homoscedasticity, linearity, normality). However, Spearman’s Rho or Kendall’s Tau (depending on how you want to interpret the results) may in fact be the better options.

# My example gave the following
&gt; cor(raw$X1,raw$X2, method="kendall")
[1] -0.191943
&gt; cor(raw$X1,raw$X2, method="spearman")
[1] -0.2320708
&gt; cor(raw$X1,raw$X2, method="pearson")
[1] -0.2573766

I Need a T-Test

This is a popular hypothesis test because people want to know if something is better (or worse) than something else.  So the initial choice is a t-test. For example did Group 1 make more money than Group 2 or did more people remember seeing an ad on theater screen 1 versus theater screen 2.  So rather than the parametric t-test we can use the Mann-Whitney U Test on our data that doesn’t meet the t-test assumptions.

# Example when data is normalized
raw2 = melt(raw, id=c("id"))

#Mann-Whitney U Test (independent 2-groups)
wilcox.test(raw$X1,raw$X2, paired=FALSE, alternative="two.sided")
wilcox.test(raw2$value~raw2$variable, paired=FALSE, alternative="two.sided")

Is There Any Association in My Table

When testing tables and counts the first go-to test is the chi-square test. But suppose you have a table like this:

2 1
3 4
raw.chisq = data.frame( id=seq(1,n),replicate(2, rdiscrete(n, probs=c(1,1), values=c(1,2)) ) )
table(raw.chisq$X1, raw.chisq$X2)

Sure you can just run you chi-square test and be done with it. But there is one small problem. The assumptions for a chi-square test are in not met. Namely, the cell sizes are way too small. So what’s a researcher to do? This is where Fisher’s Exact Test works well. If we can assume that the marginal totals are given then we can solve the problem this way:

fisher.test(raw.chisq$X1, raw.chisq$X2)

Is There a Difference in My Three Groups

Yes, it’s true, you can run three t-tests on your groups (1 vs 2, 1 vs 3, 2 vs 3). But that causes not only extra work but problems with your hypothesis test itself. Plus why do multiple tests when you can be more efficient in you testing and do just one ANOVA. Here you can do a non-parametric Kruskal-Wallis Rank Sum Test when you can’t make the assumptions for the parametric analysis of variance.

raw.anova = data.frame( id=seq(1,n),replicate(3, rdiscrete(n, probs=c(1,1,1,1,1), values=c(1,2,3,4,5)) ) )
raw.anova2 = melt(raw.anova, id=c("id"))
kruskal.test(raw.anova2$value ~ raw.anova2$variable)
fit.aov = lm(raw.anova2$value ~ raw.anova2$variable)

Ultimately, it is important to understand what you’re doing and why. Just because R, SPSS, SAS, etc. gives you a number it doesn’t mean it’s correct.  If you find that something is significant make sure you understand what it is saying and what you’re testing.  You don’t want to run an independent 2-sample test only to find out that it should have been a matched pairs test.

Will Mu Go Out With Median

True story (no really, this did actually happen).  While in grad school one of the other teaching assistants was approached by one of the students and was asked “will mu go out with median?”  The teaching assistant thought the play on words was pretty funny, laughed, and then cluelessly walked away.  All of us other grad students were surprised because we knew that really was mean.

There are a lot of ways to calculate a measure of center.  Here are several examples that include arithmetic mean, geometric mean, harmonic mean, and for good measure the median.

Histogram of Pythagorean Means

Arithmetic Mean

By far the most common is the mean (aka the average).  This is simply taking a list of numbers and dividing by the count of those numbers.  It is useful when there are many numbers that add up to a total. What does this tell us?  If you were looking at a teeter totter with a bunch of kids on it then it’s where the bar balances.  It doesn’t really matter how many kids you have on either side it’s simply where the weight of the kids is even on each side.

Geometric Mean

Lesser used is the geometric mean.  This is used when there are many quantities that multiply together to produce a product of those numbers.  This is a more appropriate mean when dealing with proportional growth. Take for example when you invest in something like a 401k.  If you get a 8% growth for the first year, 12% for the second, and 11% for the third you would want to take the geometric mean.  This can be re-written as 1.08 the first year, 1.12 for the second, and 1.11 for the third.  The geometric mean is then calculated as \prod_{n=1}^3\left(1.08 \cdot 1.12 \cdot 1.11\right)^{\frac{1}{3}} - 1 = 10.32\% .

This table shows how the results from the geometric mean match the results when applying the rate year by year.

Yearly Geo-Mean
Rate 1000 1000
0.08 1.08 1080 1103.201691
0.12 1.12 1209.6 1217.053972
0.11 1.11  1342.66  1342.66


Harmonic Mean

Harmonic mean, like the arithmetic mean, is additive in nature.  However, the larger quantities get dampened down.  Consequently, it can be used in some situations when there are outliers.  This mean can also be useful in a variety of areas including machine learning when averaging precision and recall of classifiers.


Medians are another example of measure of center.  However, unlike arithmetic mean this is less sensitive to outliers.  For example when determining a measure of center for national income the mean income would result in a different number than the median income and would lean more toward the very wealthy.  However, the median is a better measure of center as it identifies the middle point where half the observations are on either side.

The following code snippets show the three Pythagorean means (arithmetic, geometric, harmonic) as well as the median.

### Generate some fake data
x = cbind(sort(rnorm(25,10,1)),rpois(25,10))
### Write a function for a weighted median
X = x[,1]; w = x[,2]
weighted.median = function(X,w=1){
### If a single value of 1 was entered then set up array
w = rep(1,length(X))

X = cbind(X,w)
X = X[complete.cases(X),]
y = X[order(X[,1]),] # Sort the matrix
y = cbind(y,cumsum(y[,2])) # Attach the cumulative sum

### locate the positions the need to be averaged.
### If there is an exact middle point then it uses the middle point.
which.min.lim = min( which(y[,3]/sum(y[,2]) >= 0.5 ) )
which.max.lim = max( which(y[,3]/sum(y[,2]) <= 0.5 ) ) weighted.median = mean(y[max(which.min.lim, which.max.lim),1]) return(weighted.median) } harmonic.mean = function(x,w=1){ if(length(w)==1){ w = rep(1,length(x)) } dem = w/x # Set up denominator values harmonic.mean = sum(w)/sum(dem) # Calculate harmonic mean return(harmonic.mean) } geometric.mean = function(x,w=1){ if(length(w)==1){ w = rep(1,length(x)) } a = x^w b = 1/sum(w) geometric.mean = prod(a) ^ b ### Same calculation just a different way # exp( sum(w * log(x) ) / sum(w) ) return(geometric.mean) } mean(x[,1]) weighted.mean(x[,1],x[,2]) median(x[,1]) weighted.median(x[,1],x[,2]) harmonic.mean(x[,1], x[,2]) harmonic.mean(x[,1]) geometric.mean(x[,1],x[,2]) geometric.mean(x[,1]) hist(x, nclass=100, xlim=c(10,11)); abline(v=weighted.mean(x[,1],x[,2]), col='red', lwd=2) abline(v=weighted.median(x[,1],x[,2]), col='blue', lwd=2) abline(v=harmonic.mean(x[,1], x[,2]), col='green', lwd=2) abline(v=geometric.mean(x[,1],x[,2]), col='purple', lwd=2) [/sourcecode]

The Future of Non Probability Sampling

While attending the American Association for Public Opinion Research conference in Boston, MA the topic of non-probability samples was something of a reoccurring theme. I attended the task force panel review on the topic. However, there is currently no commonly accepted solution.

It was about one year ago that Pew reported (Pew report) that their phone completion rate was down to 9%. I can’t imagine that out will be going up ant time soon. That makes one wonder how much longer phone surveys can be considered a probability sample (and that doesn’t mention the whole issue with cell phone adoption). It is certainly not a sustainable method.

One thing is clear, the time has come and something will need to be done in order to solve that problem. Some have even suggested that landline surveys be eliminated and move strictly to cell phone surveys. However, that is probably a band-aid at best and is likely not sustainable either. Some are using sample matching with opt-in Web panels with varying degrees of success. Twitter, Facebook, and other social media are constantly thrown around too.

Reg Baker over at The Survey Geek is heading up the AAPOR task force for the past couple of years trying to solve this problem.


George Box stated that “all models are wrong, but some are useful”. I guess the same now applies to samples. It will be interesting to follow this topic. For the recent update AAPOR just released their report.

A Brief Tour of the Trees and Forests

Tree methods such as CART (classification and regression trees) can be used as alternatives to logistic regression. It is a way that can be used to show the probability of being in any hierarchical group. The following is a compilation of many of the key R packages that cover trees and forests.  The goal here is to simply give some brief examples on a few approaches on growing trees and, in particular, the visualization of the trees. These packages include classification and regression trees, graphing and visualization, ensemble learning using random forests, as well as evolutionary learning trees. There are a wide array of package in R that handle decision trees including trees for longitudinal studies.  I have found that when using several combinations of these packages simultaneously that some of the function begin to fail to work.

The concept of trees and forests can be applied in many different settings and is often seen in machine learning and data mining settings or other settings where there is a significant amount of data.  The examples below are by no means comprehensive and exhaustive. However, there are several examples given using different datasets and a variety of R packages. The first example uses some data obtain from the Harvard Dataverse Network. For reference these data can be obtained from The study was recently released on April 22nd, 2013 and the raw data as well as the documentation is available on the Dataverse web site and the study ID is hdl:1902.1/21235. The other examples use data that are shipped with the R packages.


This package includes several example sets of data that can be used for recursive partitioning and regression trees.  Categorical or continuous variables can be used depending on whether one wants classification trees or regression trees. This package as well at the tree package are probably the two go-to packages for trees.  However, care should be taken as the tree package and the rpart package can produce very different results.

raw.orig < - read.csv(file="c:\\rsei212_chemical.txt", header=T, sep="\t")

# Keep the dataset small and tidy
# The Dataverse: hdl:1902.1/21235
raw = subset(raw.orig, select=c("Metal","OTW","AirDecay","Koc"))

row.names(raw) = raw.orig$CASNumber
raw = na.omit(raw);

frmla = Metal ~ OTW + AirDecay + Koc

# Metal: Core Metal (CM); Metal (M); Non-Metal (NM); Core Non-Metal (CNM)

fit = rpart(frmla, method="class", data=raw)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# plot tree
plot(fit, uniform=TRUE, main="Classification Tree for Chemicals")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# tabulate some of the data
table(subset(raw, Koc>=190.5)$Metal)

Basic Decision Tree


This is the primary R package for classification and regression trees.  It has functions to prune the tree as well as general plotting functions and the mis-classifications (total loss). The output from tree can be easier to compare to the General Linear Model (GLM) and General Additive Model (GAM) alternatives.

# TREE package

tr = tree(frmla, data=raw)
plot(tr); text(tr)

Basic Tree (tree() function)


This is another package for recursive partitioning. One of the key functions in this package is ctree. As the package documention indicates it can be used for continuous, censored, ordered, nominal and multivariate response variable in a conditional inference framework. The party package also implements recursive partitioning for survival data.

# PARTY package

(ct = ctree(frmla, data = raw))
plot(ct, main="Conditional Inference Tree")

#Table of prediction errors
table(predict(ct), raw$Metal)

# Estimated class probabilities
tr.pred = predict(ct, newdata=raw, type="prob")

Conditional Tree


maptree is a very good at graphing, pruning data from hierarchical clustering, and CART models. The trees produced by this package tend to be better labeled and higher quality and the stock plots from rpart.

draw.tree( clip.rpart (rpart ( raw), best=7),
nodeinfo=TRUE, units="species",
cases="cells", digits=0)
a = agnes ( raw[2:4], method="ward" )
b = kgs (a, a$diss, maxclust=20)

plot(names(b), b, xlab="# clusters", ylab="penalty", type="n")
xloc = names(b)[b==min(b)]
yloc = min(b)
ngon(c(xloc,yloc+.75,10, "dark green"), angle=180, n=3)
apply(cbind(names(b), b, 3, 'blue'), 1, ngon, 4) # cbind(x,y,size,color)

Map Tree Example


This contains a re-implementation of the ctree function and it provides some very good graphing and visualization for tree models.  It is similar to the party package.  The example below uses data from airquality dataset and the famous species data available in R and can be found in the documentation.

<a href=""><img alt="Species Decision Tree" src="" width="437" height="472" /></a> <a href=""><img alt="Ozone Air Quality Decision Tree" src="" width="437" height="472" /></a>


This package uses evolutionary algorithms.  The idea behind this approach is that is will reduce the a priori bias.  I have seen trees of this sort in the area of environmental research, bioinformatics, systematics, and marine biology.  Though there are many other areas than that of phylogentics.

## EVTREE (Evoluationary Learning)

ev.raw = evtree(frmla, data=raw)
table(predict(ev.raw), raw$Metal)
1-mean(predict(ev.raw) == raw$Metal)

Evolutionary Tree Learning


Random forests are very good in that it is an ensemble learning method used for classification and regression.  It uses multiple models for better performance that just using a single tree model.  In addition because many sample are selected in the process a measure of variable importance can be obtain and this approach can be used for model selection and can be particularly useful when forward/backward stepwise selection is not appropriate and when working with an extremely high number of candidate variables that need to be reduced.

## randomForest
fit.rf = randomForest(frmla, data=raw)
plot( importance(fit.rf), lty=2, pch=16)
imp = importance(fit.rf)
impvar = rownames(imp)[order(imp[, 1], decreasing=TRUE)]
op = par(mfrow=c(1, 3))
for (i in seq_along(impvar)) {
partialPlot(fit.rf, raw, impvar[i], xlab=impvar[i],
main=paste("Partial Dependence on", impvar[i]),
ylim=c(0, 1))

%IncMSE IncNodePurity
x1 30.30146 8657.963
x2 7.739163 3675.853
x3 0.586905 240.275
x4 -0.82209 381.6304
x5 0.583622 253.3885

Importance Graph


This can be used for further variable selection procedure using random forests.  It implements both backward stepwise elimination as well as selection based on the importance spectrum.  This data uses randomly generated data so the correlation matrix can set so that the first variable is strongly correlated and the other variables are less so.

## varSelRF package
x = matrix(rnorm(25 * 30), ncol = 30)
x[1:10, 1:2] = x[1:10, 1:2] + 2
cl = factor(c(rep("A", 10), rep("B", 15)))
rf.vs1 = varSelRF(x, cl, ntree = 200, ntreeIterat = 100,
vars.drop.frac = 0.2)


## Example of importance function show that forcing x1 to be the most important
## while create secondary variables that is related to x1.
rf1 = randomForest(y~., data=my.df, mtry=2, ntree=50, importance=TRUE)

Importance and Out Of Bag (OOB) Error


This package grows an oblique decision tree (a general form of the axis-parallel tree).  This example uses the crab dataset (morphological measurements on Leptograpsus crabs) available in R as a stock dataset to grow the oblique tree.

library(oblique.tree) = data.frame( g=factor(rep(1:4,each=50)),
text([,-1], col=as.numeric([,1]), labels=as.numeric([,1]))
ob.tree = oblique.tree(formula = g~.,
data =,
oblique.splits = "only")

Oblique Tree


This is a great package that contain many different machine learning algorithms and functions.  It include trees, forests, naive Bayes, locally weighted regression, among others.

## CORElearn

## Random Forests
fit.rand.forest = CoreModel(frmla, data=raw, model="rf", selectionEstimator="MDL", minNodeWeightRF=5, rfNoTrees=100)

## decision tree with naive Bayes in the leaves
fit.dt = CoreModel(frmla, raw, model="tree", modelType=4)
plot(fit.dt, raw)

airquality.sub = subset(airquality, !$Ozone))
fit.rt = CoreModel(Ozone~., airquality.sub, model="regTree", modelTypeReg=1)
plot(fit.rt, airquality.sub, graphType="prototypes")

pred = predict(fit.rt, airquality.sub)



This provides an implementation for recursive partitioning for longitudinal data.  It uses the rules from rpart and the mixed effects models from nlme to grow regression trees. This can be a little resource intensive on some slower computers.


long.tree = longRPart(pbkph~Time,~age+gender,~1|Subject,pbkphData,R=corExp(form=~time))
lrpTreePlot(long.tree, use.n=TRE, place="bottomright")



This package is useful for longitudinal studies where random effects exist.  This example uses the pbkphData dataset available in the longRPart package.

## REEMtree Random Effects for Longitudinal Data
pbkphData.sub = subset(pbkphData, !$pbkph))
reem.tree = REEMtree(pbkph~Time, data=pbkphData.sub, random=~1|Subject)
ranef(reem.tree) #random effects

reem.tree = REEMtree(pbkph~Time, data=pbkphData.sub, random=~1|Subject,



Amazon AWS Summit 2013

I was fortunate enough to have been able to attend the Amazon AWS Summit in NYC and to listen to Werner Vogels give the keynote.  I will share a few of my thoughts on the AWS 2013 Summit and some of my take-aways.  I attended sessions that focused on two products in particular: Redshift and DynamoDB.  Amazon AWS seems to be a good option for projects that need a lot of disk space (e.g. up to 1.6 petabytes) or if you need to quickly increase system resources (e.g. RAM or CPU) to handle a lot of database writes or to handle a lot of data analysis on demand.


This is a new Amazon product was announced earlier this month and if it can do what Amazon says it can do then it seems that this is a great option data warehousing.  It will be interesting to see if some of the industries that have strict regulations (e.g. HIPAA, PCI compliant) move over to Amazon.  However, with some of the Virtual Private Cloud options and the encryption that is provided that may be able to solve the regulatory requirements.

I have done a fair amount of work on data warehousing but have generally only used Oracle for that work.  Some of the benchmarks for Redshift are quite impressive. As seen in the photo I took of one of the presentation slides they were able to read 2 billion rows in about 6 seconds (12 seconds for aggregate summaries).  Compare that to  SQL Server that was manually stopped after 6 hours and Hive took only about a half hour.  Not too long ago I personally ran ~3 billion rows on a local MySQL server.  I don’t have specific benchmarks.  However, to scrub and transform the data it took roughly 3 days.  Needless to say after that I moved over and used some of the Amazon products and was able to quickly scale up and use more Amazon instances.




Amazon DynamoDB

I haven’t had the opportunity to use this product but it looks very promising and appears to provide a great resource for a NoSQL alternative to relational databases and a strong competitor to some of the other NoSQL databases.  It is a proprietary software but I would be interested in comparing it to Cassandra or other Hadoop style systems.  Some of the libraries, mappers, and mock are available at



From personal experience I have been able to use R and Hadoop as well as some PHP scripts and Java programs on Amazon instances.  The price for using any of these products is very good and is generally a whole lot cheaper than purchasing in-house hardware.  Plus it provides flexibility to use a wide range of software.  It will be interesting to see how well Redshift performs as well as DynamoDB.  Redshift in particular looks very promising.


As a side,  I’m in no way associated with Amazon, I’m simply an occasional user of their products.

Simulating the Gambler’s Ruin

The gambler’s ruin problem is one where a player has a probability p of winning  and probability q of losing. For example let’s take a skill game where the player x can beat player y with probability 0.6 by getting closer to a target. The game play begins with player x being allotted 5 points and player y allotted 10 points. After each round a player’s points either decrease by one or increase by one we can determine the probability that player x will annihilate player y. The player that reaches 15 wins and the player that reach zero is annihilated. There is a wide range of application for this type of problem that goes being gambling.

Number of Time Until Annihilation

This is actually a fairly simple problem to solve on pencil and paper and to determine an exact probability. Without going into too much detail we can determine the probability of annihilation by \frac{1-\left(\frac{q}{p}\right)^i}{1-\left(\frac{q}{p}\right)^N}. In this example it works out to be \frac{1-\left(\frac{.4}{.6}\right)^5}{1-\left(\frac{.4}{.6}\right)^{10}} \approx 0.8703.

But this is a relatively boring approach and coding up an R script makes everything that much better. So here is a simulation of this same problem estimating that same probability plus it provides additional information on the distribution of how many times this game would have to be played.

gen.ruin = function(n, x.cnt, y.cnt, x.p){
x.cnt.c = x.cnt
y.cnt.c = y.cnt
x.rnd = rbinom(n, 1, p=x.p)
x.rnd[x.rnd==0] = -1
y.rnd = x.rnd*-1
x.cum.sum = cumsum(x.rnd)+x.cnt
y.cum.sum = cumsum(y.rnd)+y.cnt = cumsum(x.rnd)+x.cnt

if( any( which(>=x.cnt+y.cnt) ) | any( which(< =0) ) ){ = 1+min( which(>=x.cnt+y.cnt), which(< =0) )[] = 0 } return( } n.reps = 10000 ruin.sim = replicate(n.reps, gen.ruin(n=1000, x.cnt=5, y.cnt=10, x.p=.6)) ruin.sim[ruin.sim==0] = NA hist( apply(ruin.sim==15 |, 2, which.max) , nclass=100, col='8', main="Distribution of Number of Turns", xlab="Turn Number") abline(v=mean(apply(ruin.sim==15 |, 2, which.max)), lwd=3, col='red') abline(v=median(apply(ruin.sim==15 |, 2, which.max)), lwd=3, col='green') x.annihilation = apply(ruin.sim==15, 2, which.max) ( prob.x.annilate = length(x.annihilation[x.annihilation!=1]) / n.reps ) state.cnt = ruin.sim state.cnt[state.cnt!=15] = 0 state.cnt[state.cnt==15] = 1 mean.state = apply(ruin.sim, 1, mean, na.rm=T) plot(mean.state, xlim=c(0,which.max(mean.state)), ylim=c(0,20), ylab="Points", xlab="Number of Plays", pch=16, cex=.5, col='green') lines(mean.state, col='green') points(15-mean.state, pch=16, cex=.5, col='blue') lines(15-mean.state, col='blue') [/code]

Gambler's Ruin

Finding the Distribution Parameters

This is a brief description on one way to determine the distribution of given data. There are several ways to accomplish this in R especially if one is trying to determine if the data comes from a normal distribution. Rather than focusing on hypothesis testing and determining if a distribution is actually the said distribution this example shows one simple approach to determine the parameters of a distribution. I’ve found this useful when I’m given a dataset and I need to generate more of the same type of data for testing and simulation purposes. The example below uses data that originated from a gamma (1,2). Here we can see how well the data fits the distribution.

Simulated Gamma Distribution

raw = t( matrix(c(
1, 0.4789,
1, 0.1250,
2, 0.7048,
2, 0.2482,
2, 1.1744,
2, 0.2313,
2, 0.3978,
2, 0.1133,
2, 0.1008,
1, 0.7850,
2, 0.3099,
1, 2.1243,
2, 0.3615,
2, 0.2386,
1, 0.0883), nrow=2
) )
( fit.distr = fitdistr(raw[,2], "gamma") )
qqplot(rgamma(nrow(raw),fit.distr$estimate[1], fit.distr$estimate[2]), (raw[,2]),
xlab="Observed Data", ylab="Random Gamma")

simulated = rgamma(1000, fit.distr$estimate[1], fit.distr$estimate[2])
hist(simulated, main=paste("Histogram of Simulated Gamma using",round(fit.distr$estimate[1],3),"and",round(fit.distr$estimate[2],3)),
col=8, xlab="Random Gamma Distribution Value")

( fit.distr = fitdistr(raw[,2], "normal") )
qqplot(rnorm(nrow(raw),fit.distr$estimate[1], fit.distr$estimate[2]), (raw[,2]))

( fit.distr = fitdistr(raw[,2], "lognormal") )
qqplot(rlnorm(nrow(raw),fit.distr$estimate, fit.distr$sd), (raw[,2]))

( fit.distr = fitdistr(raw[,2], "exponential") )
qqplot(rexp(nrow(raw),fit.distr$estimate), (raw[,2]))

Distribution of QQPlot

Dirichlet Process, Infinite Mixture Models, and Clustering

The Dirichlet process provides a very interesting approach to understand group assignments and models for clustering effects.   Often time we encounter the k-means approach.  However, it is necessary to have a fixed number of clusters.  Often we encounter situations where we don’t know how many fixed clusters we need.  Suppose we’re trying to identify groups of voters.  We could use political partisanship (e.g. low/medium/high Democratic vote) but that may not necessary describe the data appropriately.  If this is the case then we can turn to Bayesian nonparametrics and the Dirichlet Process and use some approaches there to solve this problem.  Three in particular are commonly used as examples: the Chinese Restaurant ModelPólya’s Urn, and Stick Breaking.

Chinese Restaurant Model

Chinese Restaurant Model

 The Chinese Restaurant Model is based on idea that there is a restaurant with an infinite number of tables.  At each table there are an infinite number of seats.  The first customer arrives and sits down at a table.  The second customer then arrives and selects a table.  However, the customer selects the table that the first customer is currently sitting with probability \alpha/(1+\alpha) or selects a new table with 1/(1+\alpha).  This continues on to the (n+1)^{st} customer where they select a table that a current customer is sitting with probability n_{k}/(n+\alpha).

crp = function(num.customers, alpha) {
table < - c(1)
next.table <- 2
for (i in 1:num.customers) {
if (runif(1,0,1) < alpha / (alpha + i)) {
# Add a new ball color.
table <- c(table, next.table)
next.table <- next.table+1
} else {
# Pick out a ball from the urn, and add back a
# ball of the same color.
select.table <- table&#91;sample(1:length(table), 1)&#93;
table <- c(table, select.table)
crp(100, 4)
table( crp(10000, 2) )
,xlab="Table Index", ylab="Frequency"
<p style="text-align: center;"><a href=""> </a>

<h2>Pólya's Urn Model</h2>
In the Pólya's Urn model we take the approach where there exists a urn of colored balls.  We take a ball out of the urn and note its color.  We replace the ball back into the urn and then we add an additional ball of the same color to the urn. This process can continue on infinitely.

rgb2hex < - function(x){
hex.part = ""
hex <- ""
for(i in 1:3){
b16 <- x&#91;,i&#93;/16 <- trunc(b16)
if(>=10){ < - letters&#91;;
} else { <-
fract <- abs( b16 - )
int.two <- fract*16
val.two < - letters&#91;int.two-10+1&#93;
} else {
val.two <- int.two
hex.part&#91;i&#93; <- paste(,val.two, sep="")
hex <- paste(hex,hex.part&#91;i&#93;, sep="")
polyaUrnModel = function(baseColorDistribution, num.balls, alpha) {
balls < - c()
for (i in 1:num.balls) {
if (runif(1,0,1) < alpha / (alpha + length(balls))) {
# Add a new ball color.
color.comb < - expand.grid(x=seq(0,255),y=seq(0,255),z=seq(0,255))
location.picker <- rnorm(1,nrow(color.comb)/2, (nrow(color.comb)-1)/4 )
the.color <- c( color.comb&#91;location.picker,1&#93;, color.comb&#91;location.picker,2&#93;, color.comb&#91;location.picker,3&#93;)
the.hex <- paste("#",rgb2hex(the.color), sep="")
new.color <- the.hex
balls = c(balls, new.color)
} else {
# Pick out a ball from the urn, and add back a
# ball of the same color.
ball = balls&#91;sample(1:length(balls), 1)&#93;
balls = c(balls, ball)
pum < - polyaUrnModel(function() rnorm(1,0,1), 100, 1)
barplot( table(pum), col=names(table(pum)), pch=10 )

<p style="text-align: center;"><a href=""><img class="aligncenter  wp-image-1245" alt="Polya Urn Model" src="" width="419" height="321" /></a>

<h2>Stick Breaking Model</h2>
<p style="text-align: left;">With this third model we simply start breaking a stick and continue to break that stick into smaller pieces.  This process works by taking a stick of length 1.0.  We then generate one random number from the Beta distribution (\beta_{1} ~ Beta(1,\alpha). We then break the stick at \beta_1. The left side of the stick we'll call \nu_1.  We then take the remaining stick to the right and break it again at location (\beta_{2}  ~<code> </code>Beta(1, \alpha). Once again the piece to the left we'll call \nu_2 = \left(1-\beta_1\right) \cdot \beta_2. The sum of all of the probabilities generated will add up to 1.0.</p>

stickBreakingModel = function(num.vals, alpha) {
betas = rbeta(num.vals, 1, alpha) = c(1, cumprod(1 - betas))[1:num.vals]
weights = * betas

plot( stickBreakingModel(100,5), pch=16, cex=.5 )


Stick Breaking Probabilities

Multivariate Clustering

# Generate some fake data with some uniform random means
generateFakeData < - function( num.vars=3, n=100, num.clusters=5, seed=NULL ) { if(is.null(seed)){ set.seed(runif(1,0,100)) } else { set.seed(seed) } data <- data.frame(matrix(NA, nrow=n, ncol=num.vars+1)) mu <- NULL for(m in 1:num.vars){ mu <- cbind(mu,rnorm(num.clusters, runif(1,-10,15), 5)) } for (i in 1:n) { cluster <- sample(1:num.clusters, 1) data[i, 1] <- cluster for(j in 1:num.vars){ data[i, j+1] <- rnorm(1, mu[cluster,j], 1) } } data$X1 <- factor(data$X1) var.names <- paste("VAR",seq(1,ncol(data)-1), sep="") names(data) <- c("cluster",var.names) return(data) } ## # Set up a procedure to calculate the cluster means using squared distance ## dirichletClusters <- function(, disp.param = NULL, max.iter = 100, tolerance = .001) { n <- nrow( ) data <- as.matrix( ) pick.clusters <- rep(1, n) k <- 1 mu <- matrix( apply(data,2,mean), nrow=1, ncol=ncol(data) ) is.converged <- FALSE iteration <- 0 ss.old <- Inf ss.curr <- Inf while ( !is.converged & iteration < max.iter ) { # Iterate until convergence iteration <- iteration + 1 for( i in 1:n ) { # Iterate over each observation and measure the distance each observation' from its mean center's squared distance from its mean distances <- rep(NA, k) for( j in 1:k ){ distances[j] <- sum( (data[i, ] - mu[j, ])^2 ) # Distance formula. } if( min(distances) > disp.param ) { # If the dispersion parameter is still less than the minimum distance then create a new cluster
k < - k + 1 pick.clusters[i] <- k mu <- rbind(mu, data[i, ]) } else { pick.clusters[i] <- which(distances == min(distances)) } } ## # Calculate new cluster means ## for( j in 1:k ) { if( length(pick.clusters == j) > 0 ) {
mu[j, ] < - apply(subset(data,pick.clusters == j), 2, mean) } } ## # Test for convergence ## ss.curr <- 0 for( i in 1:n ) { ss.curr <- ss.curr + sum( (data[i, ] - mu[pick.clusters[i], ])^2 ) } ss.diff <- ss.old - ss.curr ss.old <- ss.curr if( !is.nan( ss.diff ) & ss.diff < tolerance ) { is.converged <- TRUE } } centers <- data.frame(mu) ret.val <- list("centers" = centers, "cluster" = factor(pick.clusters), "k" = k, "iterations" = iteration) return(ret.val) } create.num.vars <- 3 <- generateFakeData(create.num.vars, num.clusters=3, n=1000, seed=123) dp.update <- dirichletClusters([,2:create.num.vars+1], disp.param=25) ggplot(, aes(x = VAR1, y = VAR3, color = cluster)) + geom_point() [/code]

 Clustering 1
In this example I have provided some R code that clusters variables based an any given number of variables.  The measure of distance from the group centroid is the multivariate sum of squared distance, though there are many other distance metrics that could be implemented (e.g. manhattan, euclidean, etc.)
Clustering 2

Significant P-Values and Overlapping Confidence Intervals

There are all sorts of problems with p-values and confidence intervals and I have no intention (or the time) to cover all those problems right now.  However, a big problem is that most people have no idea what p-values really mean. Here is one example of a common problem with p-values and how it relates to confidence intervals.  The problem arises when there are two  random variables (from independent populations), each with a mean and variance.  A confidence interval can be constructed around each sample mean.  Using these confidence intervals might be a tempting way to explain whether two values are statistically different.  The issue is that a person may see that the confidence intervals overlap and therefore declare that there is no difference.  Simply put this is not one of those iff (if and only if) situations.  If the confidence intervals do not overlap then one can conclude that there is a statistical difference between means.  However, the opposite can not be concluded.  When the confidence intervals do overlap then there still may be a difference.

Here are simulated data from two independent normally distributed populations testing the confidence intervals and the p-values.  It can easily be seen that this is a fairly frequent event. So don’t make the mistake and make conclusions solely on confidence intervals.Histogram of Confidence Intervals

Beta Distribution for Overlapping Confidence Intervals

#Set some constants
alpha = .05
m = 15
n = 15
nsim = 20000

#Function to calculate the t statistic. Same as t.test(x,y, var.eqal=T)
# `spooled` can easiliy be modified.
m = length(x)
n = length(y)
spooled = sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2))
tstat = (mean(x)-mean(y))/(spooled*sqrt(1/m+1/n))


calcInterval = function(){
x = rnorm(m,0,1)
y = rnorm(n,1,.8)
se.x = sd(x)/sqrt(m)
se.y = sd(y)/sqrt(n)

t.stat = tstatfunc(x,y)

p.val = (1-pt(abs(t.stat),m+n-2))*2 #Pooled Variance, two sided hypothesis
ci.x.ll = mean(x)-abs(qt(alpha/2,m-1))*se.x
ci.x.ul = mean(x)+abs(qt(alpha/2,m-1))*se.x
ci.y.ll = mean(y)-abs(qt(alpha/2,n-1))*se.y
ci.y.ul = mean(y)+abs(qt(alpha/2,n-1))*se.y
TTest = t.test(x,y, var.equal=TRUE) #Run the t.test() function for comparison
ret.val = c(p=p.val, t.p=TTest$p.value,
ci.x.ll=ci.x.ll, ci.x.ul=ci.x.ul,
ci.y.ll=ci.y.ll, ci.y.ul=ci.y.ul)


#Replicate a few times
my.sims = t(replicate(nsim,calcInterval())) )

#Do the intervals overlap and it's significant?
ci.vals = cbind(my.sims$ci.x.ll - my.sims$ci.y.ul, my.sims$ci.x.ul-my.sims$ci.y.ul, my.sims$ci.x.ul-my.sims$ci.y.ll)
overlapTest = (ci.vals[,1] > 0 & ci.vals[,2] > 0 & ci.vals[,3] > 0) |
(ci.vals[,1] < 0 & ci.vals&#91;,2&#93; < 0 & ci.vals&#91;,3&#93; < 0)
my.sims = cbind(my.sims,ci.vals,NotOverlap=overlapTest)


rbeta(100000,sum(my.sims$CI.p)+1, nsim-sum(my.sims$CI.p)+1) ##CI.p is a binomial distribution.
, nclass=100, xlab="Proportion", freq=F, main=expression("Histogram of Overlapping Confidence Intervals and p<"*alpha*" from Beta Distribution"))
#What percent have overlapping CI and p < alpha? -- "Significant" but yet CI indicate otherwise.
#Multiple comparisons
my.sims$CI.p = as.numeric( !my.sims$NotOverlap & my.sims$p < alpha )
my.sims$CI.p2 = as.numeric( my.sims$NotOverlap & my.sims$p > alpha )
my.sims$my.diff = my.sims$p-my.sims$t.p # Check calculations for consistency


#How many have confidence intervals that do not overlap but yet are still "Significant"?
my.sims$CI.p2 = as.numeric( my.sims$NotOverlap & my.sims$p > alpha )

#Histograms of the intervals
hist(my.sims$t.p, nclass=100)
subset(my.sims,my.sims$p > .5)

#Histograms of the intervals
hist(my.sims$ci.y.ul, nclass=100, xlim=c(-2,3), ylim=c(0,2), col=4, freq=F,
main="Histogram of Confidence Intervals", xlab="Value")
hist(my.sims$ci.y.ll, nclass=100, add=T, col=4, freq=F)
hist(my.sims$ci.x.ul, nclass=100, add=T, col=2, freq=F)
hist(my.sims$ci.x.ll, nclass=100, add=T, col=2, freq=F)