Tree methods such as CART (classification and regression trees) can be used as alternatives to logistic regression. They provide a way to show the probability of an observation belonging to any group in a hierarchy. The following is a compilation of many of the key R packages that cover trees and forests. The goal here is simply to give some brief examples of a few approaches to growing trees and, in particular, to visualizing them. These packages cover classification and regression trees, graphing and visualization, ensemble learning using random forests, and evolutionary learning trees. There is a wide array of packages in R that handle decision trees, including trees for longitudinal studies. I have found that when several of these packages are loaded at the same time, some of the functions begin to fail.

The concept of trees and forests can be applied in many different settings and is often seen in machine learning and data mining, or in other settings where there is a significant amount of data. The examples below are by no means comprehensive or exhaustive. However, several examples are given using different datasets and a variety of R packages. The first example uses some data obtained from the Harvard Dataverse Network. For reference, these data can be obtained from http://dvn.iq.harvard.edu/dvn/. The study was released on April 22nd, 2013; the raw data as well as the documentation are available on the Dataverse web site, and the study ID is hdl:1902.1/21235. The other examples use data that are shipped with the R packages.

**rpart**

This package includes several example datasets that can be used for recursive partitioning and regression trees. Categorical or continuous variables can be used depending on whether one wants classification trees or regression trees. This package and the *tree* package are probably the two go-to packages for trees. However, care should be taken as the *tree* package and the *rpart* package can produce very different results.

    library(rpart)

    # stringsAsFactors=TRUE keeps Metal as a factor (no longer the default since R 4.0)
    raw.orig <- read.csv(file="c:\\rsei212_chemical.txt", header=TRUE, sep="\t",
                         stringsAsFactors=TRUE)

    # Keep the dataset small and tidy
    # The Dataverse: hdl:1902.1/21235
    raw = subset(raw.orig, select=c("Metal","OTW","AirDecay","Koc"))
    row.names(raw) = raw.orig$CASNumber
    raw = na.omit(raw)

    frmla = Metal ~ OTW + AirDecay + Koc

    # Metal: Core Metal (CM); Metal (M); Non-Metal (NM); Core Non-Metal (CNM)
    fit = rpart(frmla, method="class", data=raw)

    printcp(fit)  # display the results
    plotcp(fit)   # visualize cross-validation results
    summary(fit)  # detailed summary of splits

    # plot tree
    plot(fit, uniform=TRUE, main="Classification Tree for Chemicals")
    text(fit, use.n=TRUE, all=TRUE, cex=.8)

    # tabulate some of the data
    table(subset(raw, Koc>=190.5)$Metal)
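The cross-validation table shown by printcp and plotcp is typically used to pick a complexity parameter for pruning. As a minimal sketch (not part of the original analysis), assuming the *kyphosis* data that ships with *rpart* in place of the chemical data, one common choice is the cp value with the lowest cross-validated error. The object names here are illustrative.

```r
library(rpart)

# Assumption: the kyphosis data shipped with rpart stands in for the chemical data
set.seed(1)  # the cross-validation folds are random
fit.k <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)

# cptable columns: CP, nsplit, rel error, xerror, xstd
cp.tab  <- fit.k$cptable
best.cp <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error
fit.pruned <- prune(fit.k, cp = best.cp)
printcp(fit.pruned)
```

Other rules, such as taking the simplest tree within one standard error of the minimum, are also used; this sketch only shows the mechanics.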

**tree**

This is the primary R package for classification and regression trees. It has functions to prune the tree, general plotting functions, and a summary of the misclassifications (total loss). The output from *tree* can be easier to compare to the Generalized Linear Model (GLM) and Generalized Additive Model (GAM) alternatives.

    ###############
    # TREE package
    library(tree)

    tr = tree(frmla, data=raw)
    summary(tr)
    plot(tr); text(tr)
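To see how far *rpart* and *tree* diverge on a given dataset, the fitted classes from each can be cross-tabulated. This is a rough sketch, assuming the iris data that ships with R rather than the chemical data; the object names are illustrative.

```r
library(rpart)
library(tree)

# Assumption: iris is used only as a convenient stand-in dataset
fit.rp <- rpart(Species ~ ., data = iris, method = "class")
fit.tr <- tree(Species ~ ., data = iris)

# Fitted classes on the training data from each package
pred.rp <- predict(fit.rp, type = "class")
pred.tr <- predict(fit.tr, type = "class")

# Off-diagonal counts are observations the two trees label differently
table(rpart = pred.rp, tree = pred.tr)

# Proportion of observations on which the two packages agree
mean(as.character(pred.rp) == as.character(pred.tr))
```

How much the two disagree depends on the data and on each package's default stopping rules, so it is worth running this kind of check on the dataset at hand.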

**party**

This is another package for recursive partitioning. One of the key functions in this package is *ctree*. As the package documentation indicates, it can be used for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework. The *party* package also implements recursive partitioning for survival data.

    ###############
    # PARTY package
    library(party)

    (ct = ctree(frmla, data = raw))
    plot(ct, main="Conditional Inference Tree")

    # Table of prediction errors
    table(predict(ct), raw$Metal)

    # Estimated class probabilities
    tr.pred = predict(ct, newdata=raw, type="prob")

**maptree**

*maptree* is very good at graphing and pruning hierarchical clustering results and CART models. The trees produced by this package tend to be better labeled and of higher quality than the stock plots from *rpart*.

    ###############
    # MAPTREE
    library(maptree)
    library(cluster)

    draw.tree(clip.rpart(rpart(raw), best=7),
              nodeinfo=TRUE, units="species", cases="cells", digits=0)

    # keep.diss=TRUE so that a$diss is stored even for 100 or more observations
    a = agnes(raw[2:4], method="ward", keep.diss=TRUE)
    names(a)
    a$diss

    b = kgs(a, a$diss, maxclust=20)
    plot(names(b), b, xlab="# clusters", ylab="penalty", type="n")
    xloc = names(b)[b==min(b)]
    yloc = min(b)
    ngon(c(xloc,yloc+.75,10, "dark green"), angle=180, n=3)
    apply(cbind(names(b), b, 3, 'blue'), 1, ngon, 4) # cbind(x,y,size,color)
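The kgs call needs the dissimilarities stored on the agnes object (a$diss). In the *cluster* package, agnes only keeps them by default when there are fewer than 100 observations, which is one likely source of the errors reported in the comments. A minimal sketch with simulated data (an assumption; the matrix x merely stands in for raw[2:4]) shows the difference:

```r
library(cluster)

# Assumption: simulated data with 150 rows stands in for raw[2:4]
set.seed(42)
x <- matrix(rnorm(150 * 3), ncol = 3)

# Default is keep.diss = n < 100, so the dissimilarities are dropped here
a.default <- agnes(x, method = "ward")
is.null(a.default$diss)

# Requesting the dissimilarities explicitly keeps them on the fitted object
a.keep <- agnes(x, method = "ward", keep.diss = TRUE)
class(a.keep$diss)
length(a.keep$diss)  # one entry per pair of rows: n * (n - 1) / 2
```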

**partykit**

This package contains a re-implementation of the *ctree* function and provides some very good graphing and visualization for tree models. It is similar to the *party* package. The examples below use the *airquality* dataset and the famous *species* data available in R; both can be found in the documentation.

<a href="http://statistical-research.com/wp-content/uploads/2012/12/species.png"><img alt="Species Decision Tree" src="http://statistical-research.com/wp-content/uploads/2012/12/species.png" width="437" height="472" /></a> <a href="http://statistical-research.com/wp-content/uploads/2012/12/airqualityOzone.png"><img alt="Ozone Air Quality Decision Tree" src="http://statistical-research.com/wp-content/uploads/2012/12/airqualityOzone.png" width="437" height="472" /></a>
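The code behind these two figures is not shown. A sketch along the lines of the *partykit* documentation examples might look like the following. It assumes that the species data refers to the built-in iris dataset, and the object names are illustrative.

```r
library(partykit)

# Classification tree for the iris species (assumed to be the "species" data)
ct.iris <- ctree(Species ~ ., data = iris)
plot(ct.iris, main = "Species Decision Tree")

# Regression tree for ozone; rows with a missing Ozone value are dropped first
airq <- subset(airquality, !is.na(Ozone))
ct.ozone <- ctree(Ozone ~ ., data = airq)
plot(ct.ozone, main = "Ozone Air Quality Decision Tree")
```

Because *party* and *partykit* both export a function named ctree, the one attached last masks the other; calling partykit::ctree explicitly avoids the ambiguity when both are loaded.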

**evtree**

This package uses evolutionary algorithms. The idea behind this approach is that it will reduce the *a priori* bias. I have seen trees of this sort in environmental research, bioinformatics, systematics, and marine biology, though they are used in many areas other than phylogenetics.

    ###############
    ## EVTREE (Evolutionary Learning)
    library(evtree)

    ev.raw = evtree(frmla, data=raw)
    plot(ev.raw)
    table(predict(ev.raw), raw$Metal)
    1-mean(predict(ev.raw) == raw$Metal)  # misclassification rate

**randomForest**

Random forests are an ensemble learning method used for classification and regression. They combine multiple models for better performance than a single tree model. In addition, because many samples are drawn in the process, a measure of variable importance can be obtained. This can be used for model selection and is particularly useful when forward/backward stepwise selection is not appropriate or when working with an extremely high number of candidate variables that need to be reduced.

    ##################
    ## randomForest
    library(randomForest)

    fit.rf = randomForest(frmla, data=raw)
    print(fit.rf)
    importance(fit.rf)
    plot(fit.rf)

    plot(importance(fit.rf), lty=2, pch=16)
    lines(importance(fit.rf))

    imp = importance(fit.rf)
    impvar = rownames(imp)[order(imp[, 1], decreasing=TRUE)]
    op = par(mfrow=c(1, 3))
    for (i in seq_along(impvar)) {
      partialPlot(fit.rf, raw, impvar[i], xlab=impvar[i],
                  main=paste("Partial Dependence on", impvar[i]),
                  ylim=c(0, 1))
    }
    par(op)  # restore the previous plotting layout

Output of importance(rf1) for the simulated-data example in the *varSelRF* section below:

| | %IncMSE | IncNodePurity |
|---|---|---|
| x1 | 30.30146 | 8657.963 |
| x2 | 7.739163 | 3675.853 |
| x3 | 0.586905 | 240.275 |
| x4 | -0.82209 | 381.6304 |
| x5 | 0.583622 | 253.3885 |
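The claim that an ensemble can outperform a single tree is easy to check on held-out data. The following is a minimal sketch, assuming the iris data and a random 70/30 split in place of the chemical data; the object names are illustrative.

```r
library(rpart)
library(randomForest)

# Assumption: iris with a random 70/30 split stands in for the chemical data
set.seed(123)
train.idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train.idx, ]
test  <- iris[-train.idx, ]

single.tree <- rpart(Species ~ ., data = train, method = "class")
forest      <- randomForest(Species ~ ., data = train, ntree = 500)

# Held-out accuracy for each model
acc.tree   <- mean(predict(single.tree, newdata = test, type = "class") == test$Species)
acc.forest <- mean(predict(forest, newdata = test) == test$Species)
c(tree = acc.tree, forest = acc.forest)

# Out-of-bag error estimate after the final tree
forest$err.rate[forest$ntree, "OOB"]
```

On a small, easy dataset such as iris the two models may perform about the same; the gap tends to matter more with noisier data and many predictors.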

**varSelRF**

This package can be used for further variable selection using random forests. It implements both backward stepwise elimination and selection based on the importance spectrum. The second example below uses randomly generated data so that the correlation structure can be set up with the first variable strongly correlated with the response and the other variables less so.

    ##################
    ## varSelRF package
    library(varSelRF)

    x = matrix(rnorm(25 * 30), ncol = 30)
    x[1:10, 1:2] = x[1:10, 1:2] + 2
    cl = factor(c(rep("A", 10), rep("B", 15)))
    rf.vs1 = varSelRF(x, cl, ntree = 200, ntreeIterat = 100, vars.drop.frac = 0.2)
    rf.vs1
    plot(rf.vs1)

    ## Example of the importance function: x1 is forced to be the most important
    ## variable, while x2 is a secondary variable that is related to x1.
    x1 = rnorm(500)
    x2 = rnorm(500, x1, 1)
    y = runif(1, 1, 10)*x1 + rnorm(500, 0, .5)
    my.df = data.frame(y, x1, x2, x3=rnorm(500), x4=rnorm(500), x5=rnorm(500))
    rf1 = randomForest(y~., data=my.df, mtry=2, ntree=50, importance=TRUE)
    importance(rf1)
    cor(my.df)

**oblique.tree**

This package grows an oblique decision tree (a general form of the axis-parallel tree). This example uses the *crabs* dataset (morphological measurements on Leptograpsus crabs), available as a stock dataset in the *MASS* package that ships with R, to grow the oblique tree.

    ###############
    ## OBLIQUE.TREE
    library(oblique.tree)
    library(MASS)  # provides the crabs dataset

    aug.crabs.data = data.frame(g=factor(rep(1:4,each=50)),
                                predict(princomp(crabs[,4:8]))[,2:3])
    plot(aug.crabs.data[,-1], type="n")
    text(aug.crabs.data[,-1],
         col=as.numeric(aug.crabs.data[,1]),
         labels=as.numeric(aug.crabs.data[,1]))
    ob.tree = oblique.tree(formula = g~., data = aug.crabs.data, oblique.splits = "only")
    plot(ob.tree); text(ob.tree)

**CORElearn**

This is a great package that contains many different machine learning algorithms and functions. It includes trees, forests, naive Bayes, and locally weighted regression, among others.

    ##################
    ## CORElearn
    library(CORElearn)

    ## Random Forests
    fit.rand.forest = CoreModel(frmla, data=raw, model="rf", selectionEstimator="MDL",
                                minNodeWeightRF=5, rfNoTrees=100)
    plot(fit.rand.forest)

    ## decision tree with naive Bayes in the leaves
    fit.dt = CoreModel(frmla, raw, model="tree", modelType=4)
    plot(fit.dt, raw)

    ## regression tree on the airquality data
    airquality.sub = subset(airquality, !is.na(airquality$Ozone))
    fit.rt = CoreModel(Ozone~., airquality.sub, model="regTree", modelTypeReg=1)
    summary(fit.rt)
    plot(fit.rt, airquality.sub, graphType="prototypes")

    pred = predict(fit.rt, airquality.sub)
    print(pred)
    plot(pred)

**longRPart**

This package provides an implementation of recursive partitioning for longitudinal data. It uses the rules from *rpart* and the mixed effects models from *nlme* to grow regression trees. This can be a little resource intensive on some slower computers.

    ##################
    ## longRPart
    library(longRPart)

    data(pbkphData)
    pbkphData$Time = as.factor(pbkphData$Time)
    long.tree = longRPart(pbkph~Time, ~age+gender, ~1|Subject, pbkphData,
                          R=corExp(form=~time))
    lrpTreePlot(long.tree, use.n=TRUE, place="bottomright")

**REEMtree**

This package is useful for longitudinal studies where random effects exist. This example uses the *pbkphData* dataset available in the *longRPart* package.

    ##################
    ## REEMtree Random Effects for Longitudinal Data
    library(REEMtree)

    pbkphData.sub = subset(pbkphData, !is.na(pbkphData$pbkph))
    reem.tree = REEMtree(pbkph~Time, data=pbkphData.sub, random=~1|Subject)
    plot(reem.tree)
    ranef(reem.tree) # random effects

    # same model with an AR(1) within-subject correlation structure
    reem.tree = REEMtree(pbkph~Time, data=pbkphData.sub, random=~1|Subject,
                         correlation=corAR1())
    plot(reem.tree)

## chris

/ April 30, 2013

Thanks for this informative article. I think you overlooked rpart.plot which offers a lot of flexibility to draw the trees.

## Gene

/ May 2, 2013

Really nice post to show the breadth of “tree” methods. I use these quite a bit, and there are several new packages here (new to me anyway). Looking forward to experimenting with the “learning” models mentioned.

I’m not sure what’s going on, but I’m getting an error when I run the agnes clustering.

    > a = agnes ( raw[2:4], method="ward", diss=TRUE )
    Error in agnes(raw[2:4], method = "ward", diss = TRUE) :
      NA/NaN/Inf in foreign function call (arg 4)
    In addition: Warning message:
    In as.dist.default(x) : non-square matrix

Also @Chris, I would be interested in further elaboration on the flexibility of rpart.plot.

I’ve found that it’s pretty hard to get a nice plot out of rpart and I usually stick with the partykit package, but I would like to know what sorts of tricks you use to get rpart to make nicer plots!

## Wesley

/ May 2, 2013

When you add diss=TRUE to the agnes function, the first parameter of the function is assumed to be a dissimilarity matrix. For example, you can calculate the dissimilarity matrix with the daisy() function. This will compute the pairwise dissimilarities between observations. You can replace the line of code with this:

    a = agnes ( daisy(raw[2:4]), method="ward", diss=TRUE )

Also, @Gene I hope you don’t mind but I merged your two posts into one.

## Gene

/ May 7, 2013

thanks for merging the posts! I’ll try that code out later, hopefully

## Wayne

/ May 3, 2013

Excellent resource!

## CHIT

/ August 28, 2013

Very good summary.

## Jose

/ October 16, 2013

longRPart seems to have been removed from CRAN. Any idea why? Any alternatives for this type of longitudinal data tree?

## Wesley

/ October 16, 2013

Even though longRPart has been removed from the CRAN repository, earlier versions of the longRPart package can be obtained from the CRAN archive at http://cran.r-project.org/src/contrib/Archive/longRPart/.

## Han

/ December 21, 2013

Thanks for your article: very helpful

I also had some trouble with the agnes function. I think the keep.diss argument should be specified:

    a = agnes ( raw[,2:4],diss=FALSE, method="ward",keep.diss=TRUE)

## Madhu

/ April 6, 2014

Data is not currently available at the link provided. Can you please share the data file?

## Wesley

/ April 6, 2014

Once you click on the link you will need to enter the “hdl” 1902.1/21235 into the search. That will take you to the data download page.

## Mary

/ July 10, 2014

Could you please tell me if there are any R packages for building CHAID analysis or any multi-branch trees?

I need something that allows all the variables to be either continuous or categorical.

Thanks.