A Brief Tour of the Trees and Forests

Tree methods such as CART (classification and regression trees) can be used as alternatives to logistic regression. It is a way that can be used to show the probability of being in any hierarchical group. The following is a compilation of many of the key R packages that cover trees and forests.  The goal here is to simply give some brief examples on a few approaches on growing trees and, in particular, the visualization of the trees. These packages include classification and regression trees, graphing and visualization, ensemble learning using random forests, as well as evolutionary learning trees. There are a wide array of package in R that handle decision trees including trees for longitudinal studies.  I have found that when using several combinations of these packages simultaneously that some of the function begin to fail to work.

The concept of trees and forests can be applied in many different settings and is often seen in machine learning and data mining settings or other settings where there is a significant amount of data.  The examples below are by no means comprehensive and exhaustive. However, there are several examples given using different datasets and a variety of R packages. The first example uses some data obtain from the Harvard Dataverse Network. For reference these data can be obtained from http://dvn.iq.harvard.edu/dvn/. The study was recently released on April 22nd, 2013 and the raw data as well as the documentation is available on the Dataverse web site and the study ID is hdl:1902.1/21235. The other examples use data that are shipped with the R packages.


This package includes several example sets of data that can be used for recursive partitioning and regression trees.  Categorical or continuous variables can be used depending on whether one wants classification trees or regression trees. This package as well at the tree package are probably the two go-to packages for trees.  However, care should be taken as the tree package and the rpart package can produce very different results.

raw.orig < - read.csv(file="c:\\rsei212_chemical.txt", header=T, sep="\t")

# Keep the dataset small and tidy
# The Dataverse: hdl:1902.1/21235
raw = subset(raw.orig, select=c("Metal","OTW","AirDecay","Koc"))

row.names(raw) = raw.orig$CASNumber
raw = na.omit(raw);

frmla = Metal ~ OTW + AirDecay + Koc

# Metal: Core Metal (CM); Metal (M); Non-Metal (NM); Core Non-Metal (CNM)

fit = rpart(frmla, method="class", data=raw)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# plot tree
plot(fit, uniform=TRUE, main="Classification Tree for Chemicals")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# tabulate some of the data
table(subset(raw, Koc>=190.5)$Metal)

Basic Decision Tree


This is the primary R package for classification and regression trees.  It has functions to prune the tree as well as general plotting functions and the mis-classifications (total loss). The output from tree can be easier to compare to the General Linear Model (GLM) and General Additive Model (GAM) alternatives.

# TREE package

tr = tree(frmla, data=raw)
plot(tr); text(tr)

Basic Tree (tree() function)


This is another package for recursive partitioning. One of the key functions in this package is ctree. As the package documention indicates it can be used for continuous, censored, ordered, nominal and multivariate response variable in a conditional inference framework. The party package also implements recursive partitioning for survival data.

# PARTY package

(ct = ctree(frmla, data = raw))
plot(ct, main="Conditional Inference Tree")

#Table of prediction errors
table(predict(ct), raw$Metal)

# Estimated class probabilities
tr.pred = predict(ct, newdata=raw, type="prob")

Conditional Tree


maptree is a very good at graphing, pruning data from hierarchical clustering, and CART models. The trees produced by this package tend to be better labeled and higher quality and the stock plots from rpart.

draw.tree( clip.rpart (rpart ( raw), best=7),
nodeinfo=TRUE, units="species",
cases="cells", digits=0)
a = agnes ( raw[2:4], method="ward" )
b = kgs (a, a$diss, maxclust=20)

plot(names(b), b, xlab="# clusters", ylab="penalty", type="n")
xloc = names(b)[b==min(b)]
yloc = min(b)
ngon(c(xloc,yloc+.75,10, "dark green"), angle=180, n=3)
apply(cbind(names(b), b, 3, 'blue'), 1, ngon, 4) # cbind(x,y,size,color)

Map Tree Example


This contains a re-implementation of the ctree function and it provides some very good graphing and visualization for tree models.  It is similar to the party package.  The example below uses data from airquality dataset and the famous species data available in R and can be found in the documentation.

<a href="http://statistical-research.com/wp-content/uploads/2012/12/species.png"><img alt="Species Decision Tree" src="http://statistical-research.com/wp-content/uploads/2012/12/species.png" width="437" height="472" /></a> <a href="http://statistical-research.com/wp-content/uploads/2012/12/airqualityOzone.png"><img alt="Ozone Air Quality Decision Tree" src="http://statistical-research.com/wp-content/uploads/2012/12/airqualityOzone.png" width="437" height="472" /></a>


This package uses evolutionary algorithms.  The idea behind this approach is that is will reduce the a priori bias.  I have seen trees of this sort in the area of environmental research, bioinformatics, systematics, and marine biology.  Though there are many other areas than that of phylogentics.

## EVTREE (Evoluationary Learning)

ev.raw = evtree(frmla, data=raw)
table(predict(ev.raw), raw$Metal)
1-mean(predict(ev.raw) == raw$Metal)

Evolutionary Tree Learning


Random forests are very good in that it is an ensemble learning method used for classification and regression.  It uses multiple models for better performance that just using a single tree model.  In addition because many sample are selected in the process a measure of variable importance can be obtain and this approach can be used for model selection and can be particularly useful when forward/backward stepwise selection is not appropriate and when working with an extremely high number of candidate variables that need to be reduced.

## randomForest
fit.rf = randomForest(frmla, data=raw)
plot( importance(fit.rf), lty=2, pch=16)
imp = importance(fit.rf)
impvar = rownames(imp)[order(imp[, 1], decreasing=TRUE)]
op = par(mfrow=c(1, 3))
for (i in seq_along(impvar)) {
partialPlot(fit.rf, raw, impvar[i], xlab=impvar[i],
main=paste("Partial Dependence on", impvar[i]),
ylim=c(0, 1))

%IncMSE IncNodePurity
x1 30.30146 8657.963
x2 7.739163 3675.853
x3 0.586905 240.275
x4 -0.82209 381.6304
x5 0.583622 253.3885

Importance Graph


This can be used for further variable selection procedure using random forests.  It implements both backward stepwise elimination as well as selection based on the importance spectrum.  This data uses randomly generated data so the correlation matrix can set so that the first variable is strongly correlated and the other variables are less so.

## varSelRF package
x = matrix(rnorm(25 * 30), ncol = 30)
x[1:10, 1:2] = x[1:10, 1:2] + 2
cl = factor(c(rep("A", 10), rep("B", 15)))
rf.vs1 = varSelRF(x, cl, ntree = 200, ntreeIterat = 100,
vars.drop.frac = 0.2)


## Example of importance function show that forcing x1 to be the most important
## while create secondary variables that is related to x1.
rf1 = randomForest(y~., data=my.df, mtry=2, ntree=50, importance=TRUE)

Importance and Out Of Bag (OOB) Error


This package grows an oblique decision tree (a general form of the axis-parallel tree).  This example uses the crab dataset (morphological measurements on Leptograpsus crabs) available in R as a stock dataset to grow the oblique tree.


aug.crabs.data = data.frame( g=factor(rep(1:4,each=50)),
text( aug.crabs.data[,-1], col=as.numeric(aug.crabs.data[,1]), labels=as.numeric(aug.crabs.data[,1]))
ob.tree = oblique.tree(formula = g~.,
data = aug.crabs.data,
oblique.splits = "only")

Oblique Tree


This is a great package that contain many different machine learning algorithms and functions.  It include trees, forests, naive Bayes, locally weighted regression, among others.

## CORElearn

## Random Forests
fit.rand.forest = CoreModel(frmla, data=raw, model="rf", selectionEstimator="MDL", minNodeWeightRF=5, rfNoTrees=100)

## decision tree with naive Bayes in the leaves
fit.dt = CoreModel(frmla, raw, model="tree", modelType=4)
plot(fit.dt, raw)

airquality.sub = subset(airquality, !is.na(airquality$Ozone))
fit.rt = CoreModel(Ozone~., airquality.sub, model="regTree", modelTypeReg=1)
plot(fit.rt, airquality.sub, graphType="prototypes")

pred = predict(fit.rt, airquality.sub)



This provides an implementation for recursive partitioning for longitudinal data.  It uses the rules from rpart and the mixed effects models from nlme to grow regression trees. This can be a little resource intensive on some slower computers.


long.tree = longRPart(pbkph~Time,~age+gender,~1|Subject,pbkphData,R=corExp(form=~time))
lrpTreePlot(long.tree, use.n=TRE, place="bottomright")



This package is useful for longitudinal studies where random effects exist.  This example uses the pbkphData dataset available in the longRPart package.

## REEMtree Random Effects for Longitudinal Data
pbkphData.sub = subset(pbkphData, !is.na(pbkphData$pbkph))
reem.tree = REEMtree(pbkph~Time, data=pbkphData.sub, random=~1|Subject)
ranef(reem.tree) #random effects

reem.tree = REEMtree(pbkph~Time, data=pbkphData.sub, random=~1|Subject,



Amazon AWS Summit 2013

I was fortunate enough to have been able to attend the Amazon AWS Summit in NYC and to listen to Werner Vogels give the keynote.  I will share a few of my thoughts on the AWS 2013 Summit and some of my take-aways.  I attended sessions that focused on two products in particular: Redshift and DynamoDB.  Amazon AWS seems to be a good option for projects that need a lot of disk space (e.g. up to 1.6 petabytes) or if you need to quickly increase system resources (e.g. RAM or CPU) to handle a lot of database writes or to handle a lot of data analysis on demand.


This is a new Amazon product was announced earlier this month and if it can do what Amazon says it can do then it seems that this is a great option data warehousing.  It will be interesting to see if some of the industries that have strict regulations (e.g. HIPAA, PCI compliant) move over to Amazon.  However, with some of the Virtual Private Cloud options and the encryption that is provided that may be able to solve the regulatory requirements.

I have done a fair amount of work on data warehousing but have generally only used Oracle for that work.  Some of the benchmarks for Redshift are quite impressive. As seen in the photo I took of one of the presentation slides they were able to read 2 billion rows in about 6 seconds (12 seconds for aggregate summaries).  Compare that to  SQL Server that was manually stopped after 6 hours and Hive took only about a half hour.  Not too long ago I personally ran ~3 billion rows on a local MySQL server.  I don’t have specific benchmarks.  However, to scrub and transform the data it took roughly 3 days.  Needless to say after that I moved over and used some of the Amazon products and was able to quickly scale up and use more Amazon instances.




Amazon DynamoDB

I haven’t had the opportunity to use this product but it looks very promising and appears to provide a great resource for a NoSQL alternative to relational databases and a strong competitor to some of the other NoSQL databases.  It is a proprietary software but I would be interested in comparing it to Cassandra or other Hadoop style systems.  Some of the libraries, mappers, and mock are available at http://j.mp/dynamodb-libs.



From personal experience I have been able to use R and Hadoop as well as some PHP scripts and Java programs on Amazon instances.  The price for using any of these products is very good and is generally a whole lot cheaper than purchasing in-house hardware.  Plus it provides flexibility to use a wide range of software.  It will be interesting to see how well Redshift performs as well as DynamoDB.  Redshift in particular looks very promising.


As a side,  I’m in no way associated with Amazon, I’m simply an occasional user of their products.

Simulating the Gambler’s Ruin

The gambler’s ruin problem is one where a player has a probability p of winning  and probability q of losing. For example let’s take a skill game where the player x can beat player y with probability 0.6 by getting closer to a target. The game play begins with player x being allotted 5 points and player y allotted 10 points. After each round a player’s points either decrease by one or increase by one we can determine the probability that player x will annihilate player y. The player that reaches 15 wins and the player that reach zero is annihilated. There is a wide range of application for this type of problem that goes being gambling.

Number of Time Until Annihilation

This is actually a fairly simple problem to solve on pencil and paper and to determine an exact probability. Without going into too much detail we can determine the probability of annihilation by \frac{1-\left(\frac{q}{p}\right)^i}{1-\left(\frac{q}{p}\right)^N}. In this example it works out to be \frac{1-\left(\frac{.4}{.6}\right)^5}{1-\left(\frac{.4}{.6}\right)^{10}} \approx 0.8703.

But this is a relatively boring approach and coding up an R script makes everything that much better. So here is a simulation of this same problem estimating that same probability plus it provides additional information on the distribution of how many times this game would have to be played.

gen.ruin = function(n, x.cnt, y.cnt, x.p){
x.cnt.c = x.cnt
y.cnt.c = y.cnt
x.rnd = rbinom(n, 1, p=x.p)
x.rnd[x.rnd==0] = -1
y.rnd = x.rnd*-1
x.cum.sum = cumsum(x.rnd)+x.cnt
y.cum.sum = cumsum(y.rnd)+y.cnt

ruin.data = cumsum(x.rnd)+x.cnt

if( any( which(ruin.data>=x.cnt+y.cnt) ) | any( which(ruin.data< =0) ) ){ cut.data = 1+min( which(ruin.data>=x.cnt+y.cnt), which(ruin.data< =0) ) ruin.data[cut.data:length(ruin.data)] = 0 } return(ruin.data) } n.reps = 10000 ruin.sim = replicate(n.reps, gen.ruin(n=1000, x.cnt=5, y.cnt=10, x.p=.6)) ruin.sim[ruin.sim==0] = NA hist( apply(ruin.sim==15 | is.na(ruin.sim), 2, which.max) , nclass=100, col='8', main="Distribution of Number of Turns", xlab="Turn Number") abline(v=mean(apply(ruin.sim==15 | is.na(ruin.sim), 2, which.max)), lwd=3, col='red') abline(v=median(apply(ruin.sim==15 | is.na(ruin.sim), 2, which.max)), lwd=3, col='green') x.annihilation = apply(ruin.sim==15, 2, which.max) ( prob.x.annilate = length(x.annihilation[x.annihilation!=1]) / n.reps ) state.cnt = ruin.sim state.cnt[state.cnt!=15] = 0 state.cnt[state.cnt==15] = 1 mean.state = apply(ruin.sim, 1, mean, na.rm=T) plot(mean.state, xlim=c(0,which.max(mean.state)), ylim=c(0,20), ylab="Points", xlab="Number of Plays", pch=16, cex=.5, col='green') lines(mean.state, col='green') points(15-mean.state, pch=16, cex=.5, col='blue') lines(15-mean.state, col='blue') [/code]

Gambler's Ruin

Finding the Distribution Parameters

This is a brief description on one way to determine the distribution of given data. There are several ways to accomplish this in R especially if one is trying to determine if the data comes from a normal distribution. Rather than focusing on hypothesis testing and determining if a distribution is actually the said distribution this example shows one simple approach to determine the parameters of a distribution. I’ve found this useful when I’m given a dataset and I need to generate more of the same type of data for testing and simulation purposes. The example below uses data that originated from a gamma (1,2). Here we can see how well the data fits the distribution.

Simulated Gamma Distribution

raw = t( matrix(c(
1, 0.4789,
1, 0.1250,
2, 0.7048,
2, 0.2482,
2, 1.1744,
2, 0.2313,
2, 0.3978,
2, 0.1133,
2, 0.1008,
1, 0.7850,
2, 0.3099,
1, 2.1243,
2, 0.3615,
2, 0.2386,
1, 0.0883), nrow=2
) )
( fit.distr = fitdistr(raw[,2], "gamma") )
qqplot(rgamma(nrow(raw),fit.distr$estimate[1], fit.distr$estimate[2]), (raw[,2]),
xlab="Observed Data", ylab="Random Gamma")

simulated = rgamma(1000, fit.distr$estimate[1], fit.distr$estimate[2])
hist(simulated, main=paste("Histogram of Simulated Gamma using",round(fit.distr$estimate[1],3),"and",round(fit.distr$estimate[2],3)),
col=8, xlab="Random Gamma Distribution Value")

( fit.distr = fitdistr(raw[,2], "normal") )
qqplot(rnorm(nrow(raw),fit.distr$estimate[1], fit.distr$estimate[2]), (raw[,2]))

( fit.distr = fitdistr(raw[,2], "lognormal") )
qqplot(rlnorm(nrow(raw),fit.distr$estimate, fit.distr$sd), (raw[,2]))

( fit.distr = fitdistr(raw[,2], "exponential") )
qqplot(rexp(nrow(raw),fit.distr$estimate), (raw[,2]))

Distribution of QQPlot

Dirichlet Process, Infinite Mixture Models, and Clustering

The Dirichlet process provides a very interesting approach to understand group assignments and models for clustering effects.   Often time we encounter the k-means approach.  However, it is necessary to have a fixed number of clusters.  Often we encounter situations where we don’t know how many fixed clusters we need.  Suppose we’re trying to identify groups of voters.  We could use political partisanship (e.g. low/medium/high Democratic vote) but that may not necessary describe the data appropriately.  If this is the case then we can turn to Bayesian nonparametrics and the Dirichlet Process and use some approaches there to solve this problem.  Three in particular are commonly used as examples: the Chinese Restaurant ModelPólya’s Urn, and Stick Breaking.

Chinese Restaurant Model

Chinese Restaurant Model

 The Chinese Restaurant Model is based on idea that there is a restaurant with an infinite number of tables.  At each table there are an infinite number of seats.  The first customer arrives and sits down at a table.  The second customer then arrives and selects a table.  However, the customer selects the table that the first customer is currently sitting with probability \alpha/(1+\alpha) or selects a new table with 1/(1+\alpha).  This continues on to the (n+1)^{st} customer where they select a table that a current customer is sitting with probability n_{k}/(n+\alpha).

crp = function(num.customers, alpha) {
table < - c(1)
next.table <- 2
for (i in 1:num.customers) {
if (runif(1,0,1) < alpha / (alpha + i)) {
# Add a new ball color.
table <- c(table, next.table)
next.table <- next.table+1
} else {
# Pick out a ball from the urn, and add back a
# ball of the same color.
select.table <- table&#91;sample(1:length(table), 1)&#93;
table <- c(table, select.table)
crp(100, 4)
table( crp(10000, 2) )
,xlab="Table Index", ylab="Frequency"
<p style="text-align: center;"><a href="http://statistical-research.com/wp-content/uploads/2013/03/StickBreakingProb.png"> </a>

<h2>Pólya's Urn Model</h2>
In the Pólya's Urn model we take the approach where there exists a urn of colored balls.  We take a ball out of the urn and note its color.  We replace the ball back into the urn and then we add an additional ball of the same color to the urn. This process can continue on infinitely.

rgb2hex < - function(x){
hex.part = ""
hex <- ""
for(i in 1:3){
b16 <- x&#91;,i&#93;/16
int.one <- trunc(b16)
val.one < - letters&#91;int.one-10+1&#93;
} else {
val.one <- int.one
fract <- abs( b16 - int.one )
int.two <- fract*16
val.two < - letters&#91;int.two-10+1&#93;
} else {
val.two <- int.two
hex.part&#91;i&#93; <- paste(val.one,val.two, sep="")
hex <- paste(hex,hex.part&#91;i&#93;, sep="")
polyaUrnModel = function(baseColorDistribution, num.balls, alpha) {
balls < - c()
for (i in 1:num.balls) {
if (runif(1,0,1) < alpha / (alpha + length(balls))) {
# Add a new ball color.
color.comb < - expand.grid(x=seq(0,255),y=seq(0,255),z=seq(0,255))
location.picker <- rnorm(1,nrow(color.comb)/2, (nrow(color.comb)-1)/4 )
the.color <- c( color.comb&#91;location.picker,1&#93;, color.comb&#91;location.picker,2&#93;, color.comb&#91;location.picker,3&#93;)
the.hex <- paste("#",rgb2hex(the.color), sep="")
new.color <- the.hex
balls = c(balls, new.color)
} else {
# Pick out a ball from the urn, and add back a
# ball of the same color.
ball = balls&#91;sample(1:length(balls), 1)&#93;
balls = c(balls, ball)
pum < - polyaUrnModel(function() rnorm(1,0,1), 100, 1)
barplot( table(pum), col=names(table(pum)), pch=10 )

<p style="text-align: center;"><a href="http://statistical-research.com/wp-content/uploads/2013/03/BarplotPolyaUrn.png"><img class="aligncenter  wp-image-1245" alt="Polya Urn Model" src="http://statistical-research.com/wp-content/uploads/2013/03/BarplotPolyaUrn.png" width="419" height="321" /></a>

<h2>Stick Breaking Model</h2>
<p style="text-align: left;">With this third model we simply start breaking a stick and continue to break that stick into smaller pieces.  This process works by taking a stick of length 1.0.  We then generate one random number from the Beta distribution (\beta_{1} ~ Beta(1,\alpha). We then break the stick at \beta_1. The left side of the stick we'll call \nu_1.  We then take the remaining stick to the right and break it again at location (\beta_{2}  ~<code> </code>Beta(1, \alpha). Once again the piece to the left we'll call \nu_2 = \left(1-\beta_1\right) \cdot \beta_2. The sum of all of the probabilities generated will add up to 1.0.</p>

stickBreakingModel = function(num.vals, alpha) {
betas = rbeta(num.vals, 1, alpha)
stick.to.right = c(1, cumprod(1 - betas))[1:num.vals]
weights = stick.to.right * betas

plot( stickBreakingModel(100,5), pch=16, cex=.5 )


Stick Breaking Probabilities

Multivariate Clustering

# Generate some fake data with some uniform random means
generateFakeData < - function( num.vars=3, n=100, num.clusters=5, seed=NULL ) { if(is.null(seed)){ set.seed(runif(1,0,100)) } else { set.seed(seed) } data <- data.frame(matrix(NA, nrow=n, ncol=num.vars+1)) mu <- NULL for(m in 1:num.vars){ mu <- cbind(mu,rnorm(num.clusters, runif(1,-10,15), 5)) } for (i in 1:n) { cluster <- sample(1:num.clusters, 1) data[i, 1] <- cluster for(j in 1:num.vars){ data[i, j+1] <- rnorm(1, mu[cluster,j], 1) } } data$X1 <- factor(data$X1) var.names <- paste("VAR",seq(1,ncol(data)-1), sep="") names(data) <- c("cluster",var.names) return(data) } ## # Set up a procedure to calculate the cluster means using squared distance ## dirichletClusters <- function(orig.data, disp.param = NULL, max.iter = 100, tolerance = .001) { n <- nrow( orig.data ) data <- as.matrix( orig.data ) pick.clusters <- rep(1, n) k <- 1 mu <- matrix( apply(data,2,mean), nrow=1, ncol=ncol(data) ) is.converged <- FALSE iteration <- 0 ss.old <- Inf ss.curr <- Inf while ( !is.converged & iteration < max.iter ) { # Iterate until convergence iteration <- iteration + 1 for( i in 1:n ) { # Iterate over each observation and measure the distance each observation' from its mean center's squared distance from its mean distances <- rep(NA, k) for( j in 1:k ){ distances[j] <- sum( (data[i, ] - mu[j, ])^2 ) # Distance formula. } if( min(distances) > disp.param ) { # If the dispersion parameter is still less than the minimum distance then create a new cluster
k < - k + 1 pick.clusters[i] <- k mu <- rbind(mu, data[i, ]) } else { pick.clusters[i] <- which(distances == min(distances)) } } ## # Calculate new cluster means ## for( j in 1:k ) { if( length(pick.clusters == j) > 0 ) {
mu[j, ] < - apply(subset(data,pick.clusters == j), 2, mean) } } ## # Test for convergence ## ss.curr <- 0 for( i in 1:n ) { ss.curr <- ss.curr + sum( (data[i, ] - mu[pick.clusters[i], ])^2 ) } ss.diff <- ss.old - ss.curr ss.old <- ss.curr if( !is.nan( ss.diff ) & ss.diff < tolerance ) { is.converged <- TRUE } } centers <- data.frame(mu) ret.val <- list("centers" = centers, "cluster" = factor(pick.clusters), "k" = k, "iterations" = iteration) return(ret.val) } create.num.vars <- 3 orig.data <- generateFakeData(create.num.vars, num.clusters=3, n=1000, seed=123) dp.update <- dirichletClusters(orig.data[,2:create.num.vars+1], disp.param=25) ggplot(orig.data, aes(x = VAR1, y = VAR3, color = cluster)) + geom_point() [/code]

 Clustering 1
In this example I have provided some R code that clusters variables based an any given number of variables.  The measure of distance from the group centroid is the multivariate sum of squared distance, though there are many other distance metrics that could be implemented (e.g. manhattan, euclidean, etc.)
Clustering 2