Will Mu Go Out With Median

True story (no really, this did actually happen).  While in grad school one of the other teaching assistants was approached by one of the students and was asked “will mu go out with median?”  The teaching assistant thought the play on words was pretty funny, laughed, and then cluelessly walked away.  All of us other grad students were surprised because we knew that really was mean.

There are a lot of ways to calculate a measure of center.  Here are several examples that include arithmetic mean, geometric mean, harmonic mean, and for good measure the median.

Histogram of Pythagorean Means

Arithmetic Mean

By far the most common is the mean (aka the average).  This is simply taking a list of numbers and dividing by the count of those numbers.  It is useful when there are many numbers that add up to a total. What does this tell us?  If you were looking at a teeter totter with a bunch of kids on it then it’s where the bar balances.  It doesn’t really matter how many kids you have on either side it’s simply where the weight of the kids is even on each side.

Geometric Mean

Lesser used is the geometric mean.  This is used when there are many quantities that multiply together to produce a product of those numbers.  This is a more appropriate mean when dealing with proportional growth. Take for example when you invest in something like a 401k.  If you get a 8% growth for the first year, 12% for the second, and 11% for the third you would want to take the geometric mean.  This can be re-written as 1.08 the first year, 1.12 for the second, and 1.11 for the third.  The geometric mean is then calculated as \prod_{n=1}^3\left(1.08 \cdot 1.12 \cdot 1.11\right)^{\frac{1}{3}} - 1 = 10.32\% .

This table shows how the results from the geometric mean match the results when applying the rate year by year.

Yearly Geo-Mean
Rate 1000 1000
0.08 1.08 1080 1103.201691
0.12 1.12 1209.6 1217.053972
0.11 1.11  1342.66  1342.66
0.103201691

 

Harmonic Mean

Harmonic mean, like the arithmetic mean, is additive in nature.  However, the larger quantities get dampened down.  Consequently, it can be used in some situations when there are outliers.  This mean can also be useful in a variety of areas including machine learning when averaging precision and recall of classifiers.

Median

Medians are another example of measure of center.  However, unlike arithmetic mean this is less sensitive to outliers.  For example when determining a measure of center for national income the mean income would result in a different number than the median income and would lean more toward the very wealthy.  However, the median is a better measure of center as it identifies the middle point where half the observations are on either side.

The following code snippets show the three Pythagorean means (arithmetic, geometric, harmonic) as well as the median.

### Generate some fake data
x = cbind(sort(rnorm(25,10,1)),rpois(25,10))
### Write a function for a weighted median
X = x[,1]; w = x[,2]
weighted.median = function(X,w=1){
### If a single value of 1 was entered then set up array
if(length(w)==1){
w = rep(1,length(X))
}

X = cbind(X,w)
X = X[complete.cases(X),]
y = X[order(X[,1]),] # Sort the matrix
y = cbind(y,cumsum(y[,2])) # Attach the cumulative sum

### locate the positions the need to be averaged.
### If there is an exact middle point then it uses the middle point.
which.min.lim = min( which(y[,3]/sum(y[,2]) >= 0.5 ) )
which.max.lim = max( which(y[,3]/sum(y[,2]) <= 0.5 ) ) weighted.median = mean(y[max(which.min.lim, which.max.lim),1]) return(weighted.median) } harmonic.mean = function(x,w=1){ if(length(w)==1){ w = rep(1,length(x)) } dem = w/x # Set up denominator values harmonic.mean = sum(w)/sum(dem) # Calculate harmonic mean return(harmonic.mean) } geometric.mean = function(x,w=1){ if(length(w)==1){ w = rep(1,length(x)) } a = x^w b = 1/sum(w) geometric.mean = prod(a) ^ b ### Same calculation just a different way # exp( sum(w * log(x) ) / sum(w) ) return(geometric.mean) } mean(x[,1]) weighted.mean(x[,1],x[,2]) median(x[,1]) weighted.median(x[,1],x[,2]) harmonic.mean(x[,1], x[,2]) harmonic.mean(x[,1]) geometric.mean(x[,1],x[,2]) geometric.mean(x[,1]) hist(x, nclass=100, xlim=c(10,11)); abline(v=weighted.mean(x[,1],x[,2]), col='red', lwd=2) abline(v=weighted.median(x[,1],x[,2]), col='blue', lwd=2) abline(v=harmonic.mean(x[,1], x[,2]), col='green', lwd=2) abline(v=geometric.mean(x[,1],x[,2]), col='purple', lwd=2) [/sourcecode]

The Future of Non Probability Sampling

While attending the American Association for Public Opinion Research conference in Boston, MA the topic of non-probability samples was something of a reoccurring theme. I attended the task force panel review on the topic. However, there is currently no commonly accepted solution.

It was about one year ago that Pew reported (Pew report) that their phone completion rate was down to 9%. I can’t imagine that out will be going up ant time soon. That makes one wonder how much longer phone surveys can be considered a probability sample (and that doesn’t mention the whole issue with cell phone adoption). It is certainly not a sustainable method.

One thing is clear, the time has come and something will need to be done in order to solve that problem. Some have even suggested that landline surveys be eliminated and move strictly to cell phone surveys. However, that is probably a band-aid at best and is likely not sustainable either. Some are using sample matching with opt-in Web panels with varying degrees of success. Twitter, Facebook, and other social media are constantly thrown around too.

Reg Baker over at The Survey Geek is heading up the AAPOR task force for the past couple of years trying to solve this problem.

image

George Box stated that “all models are wrong, but some are useful”. I guess the same now applies to samples. It will be interesting to follow this topic. For the recent update AAPOR just released their report.

A Brief Tour of the Trees and Forests

Tree methods such as CART (classification and regression trees) can be used as alternatives to logistic regression. It is a way that can be used to show the probability of being in any hierarchical group. The following is a compilation of many of the key R packages that cover trees and forests.  The goal here is to simply give some brief examples on a few approaches on growing trees and, in particular, the visualization of the trees. These packages include classification and regression trees, graphing and visualization, ensemble learning using random forests, as well as evolutionary learning trees. There are a wide array of package in R that handle decision trees including trees for longitudinal studies.  I have found that when using several combinations of these packages simultaneously that some of the function begin to fail to work.

The concept of trees and forests can be applied in many different settings and is often seen in machine learning and data mining settings or other settings where there is a significant amount of data.  The examples below are by no means comprehensive and exhaustive. However, there are several examples given using different datasets and a variety of R packages. The first example uses some data obtain from the Harvard Dataverse Network. For reference these data can be obtained from http://dvn.iq.harvard.edu/dvn/. The study was recently released on April 22nd, 2013 and the raw data as well as the documentation is available on the Dataverse web site and the study ID is hdl:1902.1/21235. The other examples use data that are shipped with the R packages.

rpart

This package includes several example sets of data that can be used for recursive partitioning and regression trees.  Categorical or continuous variables can be used depending on whether one wants classification trees or regression trees. This package as well at the tree package are probably the two go-to packages for trees.  However, care should be taken as the tree package and the rpart package can produce very different results.

library(rpart)
raw.orig < - read.csv(file="c:\\rsei212_chemical.txt", header=T, sep="\t")

# Keep the dataset small and tidy
# The Dataverse: hdl:1902.1/21235
raw = subset(raw.orig, select=c("Metal","OTW","AirDecay","Koc"))

row.names(raw) = raw.orig$CASNumber
raw = na.omit(raw);

frmla = Metal ~ OTW + AirDecay + Koc

# Metal: Core Metal (CM); Metal (M); Non-Metal (NM); Core Non-Metal (CNM)

fit = rpart(frmla, method="class", data=raw)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# plot tree
plot(fit, uniform=TRUE, main="Classification Tree for Chemicals")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# tabulate some of the data
table(subset(raw, Koc>=190.5)$Metal)

Basic Decision Tree

tree

This is the primary R package for classification and regression trees.  It has functions to prune the tree as well as general plotting functions and the mis-classifications (total loss). The output from tree can be easier to compare to the General Linear Model (GLM) and General Additive Model (GAM) alternatives.


###############
# TREE package
library(tree)

tr = tree(frmla, data=raw)
summary(tr)
plot(tr); text(tr)

Basic Tree (tree() function)

party

This is another package for recursive partitioning. One of the key functions in this package is ctree. As the package documention indicates it can be used for continuous, censored, ordered, nominal and multivariate response variable in a conditional inference framework. The party package also implements recursive partitioning for survival data.


###############
# PARTY package
library(party)

(ct = ctree(frmla, data = raw))
plot(ct, main="Conditional Inference Tree")

#Table of prediction errors
table(predict(ct), raw$Metal)

# Estimated class probabilities
tr.pred = predict(ct, newdata=raw, type="prob")

Conditional Tree

maptree

maptree is a very good at graphing, pruning data from hierarchical clustering, and CART models. The trees produced by this package tend to be better labeled and higher quality and the stock plots from rpart.


###############
# MAPTREE
library(maptree)
library(cluster)
draw.tree( clip.rpart (rpart ( raw), best=7),
nodeinfo=TRUE, units="species",
cases="cells", digits=0)
a = agnes ( raw[2:4], method="ward" )
names(a)
a$diss
b = kgs (a, a$diss, maxclust=20)

plot(names(b), b, xlab="# clusters", ylab="penalty", type="n")
xloc = names(b)[b==min(b)]
yloc = min(b)
ngon(c(xloc,yloc+.75,10, "dark green"), angle=180, n=3)
apply(cbind(names(b), b, 3, 'blue'), 1, ngon, 4) # cbind(x,y,size,color)

Map Tree Example

partykit

This contains a re-implementation of the ctree function and it provides some very good graphing and visualization for tree models.  It is similar to the party package.  The example below uses data from airquality dataset and the famous species data available in R and can be found in the documentation.


<a href="http://statistical-research.com/wp-content/uploads/2012/12/species.png"><img alt="Species Decision Tree" src="http://statistical-research.com/wp-content/uploads/2012/12/species.png" width="437" height="472" /></a> <a href="http://statistical-research.com/wp-content/uploads/2012/12/airqualityOzone.png"><img alt="Ozone Air Quality Decision Tree" src="http://statistical-research.com/wp-content/uploads/2012/12/airqualityOzone.png" width="437" height="472" /></a>

evtree

This package uses evolutionary algorithms.  The idea behind this approach is that is will reduce the a priori bias.  I have seen trees of this sort in the area of environmental research, bioinformatics, systematics, and marine biology.  Though there are many other areas than that of phylogentics.


###############
## EVTREE (Evoluationary Learning)
library(evtree)

ev.raw = evtree(frmla, data=raw)
plot(ev.raw)
table(predict(ev.raw), raw$Metal)
1-mean(predict(ev.raw) == raw$Metal)

Evolutionary Tree Learning

randomForest

Random forests are very good in that it is an ensemble learning method used for classification and regression.  It uses multiple models for better performance that just using a single tree model.  In addition because many sample are selected in the process a measure of variable importance can be obtain and this approach can be used for model selection and can be particularly useful when forward/backward stepwise selection is not appropriate and when working with an extremely high number of candidate variables that need to be reduced.


##################
## randomForest
library(randomForest)
fit.rf = randomForest(frmla, data=raw)
print(fit.rf)
importance(fit.rf)
plot(fit.rf)
plot( importance(fit.rf), lty=2, pch=16)
lines(importance(fit.rf))
imp = importance(fit.rf)
impvar = rownames(imp)[order(imp[, 1], decreasing=TRUE)]
op = par(mfrow=c(1, 3))
for (i in seq_along(impvar)) {
partialPlot(fit.rf, raw, impvar[i], xlab=impvar[i],
main=paste("Partial Dependence on", impvar[i]),
ylim=c(0, 1))
}

>importance(rf1)
%IncMSE IncNodePurity
x1 30.30146 8657.963
x2 7.739163 3675.853
x3 0.586905 240.275
x4 -0.82209 381.6304
x5 0.583622 253.3885

Importance Graph

varSelRF

This can be used for further variable selection procedure using random forests.  It implements both backward stepwise elimination as well as selection based on the importance spectrum.  This data uses randomly generated data so the correlation matrix can set so that the first variable is strongly correlated and the other variables are less so.


##################
## varSelRF package
library(varSelRF)
x = matrix(rnorm(25 * 30), ncol = 30)
x[1:10, 1:2] = x[1:10, 1:2] + 2
cl = factor(c(rep("A", 10), rep("B", 15)))
rf.vs1 = varSelRF(x, cl, ntree = 200, ntreeIterat = 100,
vars.drop.frac = 0.2)

rf.vs1
plot(rf.vs1)

## Example of importance function show that forcing x1 to be the most important
## while create secondary variables that is related to x1.
x1=rnorm(500)
x2=rnorm(500,x1,1)
y=runif(1,1,10)*x1+rnorm(500,0,.5)
my.df=data.frame(y,x1,x2,x3=rnorm(500),x4=rnorm(500),x5=rnorm(500))
rf1 = randomForest(y~., data=my.df, mtry=2, ntree=50, importance=TRUE)
importance(rf1)
cor(my.df)

Importance and Out Of Bag (OOB) Error

oblique.tree

This package grows an oblique decision tree (a general form of the axis-parallel tree).  This example uses the crab dataset (morphological measurements on Leptograpsus crabs) available in R as a stock dataset to grow the oblique tree.


###############
## OBLIQUE.TREE
library(oblique.tree)

aug.crabs.data = data.frame( g=factor(rep(1:4,each=50)),
predict(princomp(crabs[,4:8]))[,2:3])
plot(aug.crabs.data[,-1],type="n")
text( aug.crabs.data[,-1], col=as.numeric(aug.crabs.data[,1]), labels=as.numeric(aug.crabs.data[,1]))
ob.tree = oblique.tree(formula = g~.,
data = aug.crabs.data,
oblique.splits = "only")
plot(ob.tree);text(ob.tree)

Oblique Tree

CORElearn

This is a great package that contain many different machine learning algorithms and functions.  It include trees, forests, naive Bayes, locally weighted regression, among others.


##################
## CORElearn

library(CORElearn)
## Random Forests
fit.rand.forest = CoreModel(frmla, data=raw, model="rf", selectionEstimator="MDL", minNodeWeightRF=5, rfNoTrees=100)
plot(fit.rand.forest)

## decision tree with naive Bayes in the leaves
fit.dt = CoreModel(frmla, raw, model="tree", modelType=4)
plot(fit.dt, raw)

airquality.sub = subset(airquality, !is.na(airquality$Ozone))
fit.rt = CoreModel(Ozone~., airquality.sub, model="regTree", modelTypeReg=1)
summary(fit.rt)
plot(fit.rt, airquality.sub, graphType="prototypes")

pred = predict(fit.rt, airquality.sub)
print(pred)
plot(pred)

CORElearn

longRPart

This provides an implementation for recursive partitioning for longitudinal data.  It uses the rules from rpart and the mixed effects models from nlme to grow regression trees. This can be a little resource intensive on some slower computers.


##################
##longRPart
library(longRPart)

data(pbkphData)
pbkphData$Time=as.factor(pbkphData$Time)
long.tree = longRPart(pbkph~Time,~age+gender,~1|Subject,pbkphData,R=corExp(form=~time))
lrpTreePlot(long.tree, use.n=TRE, place="bottomright")

longRTreeplot

REEMtree

This package is useful for longitudinal studies where random effects exist.  This example uses the pbkphData dataset available in the longRPart package.


##################
## REEMtree Random Effects for Longitudinal Data
library(REEMtree)
pbkphData.sub = subset(pbkphData, !is.na(pbkphData$pbkph))
reem.tree = REEMtree(pbkph~Time, data=pbkphData.sub, random=~1|Subject)
plot(reem.tree)
ranef(reem.tree) #random effects

reem.tree = REEMtree(pbkph~Time, data=pbkphData.sub, random=~1|Subject,
correlation=corAR1())
plot(reem.tree)

 

 

Amazon AWS Summit 2013

I was fortunate enough to have been able to attend the Amazon AWS Summit in NYC and to listen to Werner Vogels give the keynote.  I will share a few of my thoughts on the AWS 2013 Summit and some of my take-aways.  I attended sessions that focused on two products in particular: Redshift and DynamoDB.  Amazon AWS seems to be a good option for projects that need a lot of disk space (e.g. up to 1.6 petabytes) or if you need to quickly increase system resources (e.g. RAM or CPU) to handle a lot of database writes or to handle a lot of data analysis on demand.

Redshift

This is a new Amazon product was announced earlier this month and if it can do what Amazon says it can do then it seems that this is a great option data warehousing.  It will be interesting to see if some of the industries that have strict regulations (e.g. HIPAA, PCI compliant) move over to Amazon.  However, with some of the Virtual Private Cloud options and the encryption that is provided that may be able to solve the regulatory requirements.

I have done a fair amount of work on data warehousing but have generally only used Oracle for that work.  Some of the benchmarks for Redshift are quite impressive. As seen in the photo I took of one of the presentation slides they were able to read 2 billion rows in about 6 seconds (12 seconds for aggregate summaries).  Compare that to  SQL Server that was manually stopped after 6 hours and Hive took only about a half hour.  Not too long ago I personally ran ~3 billion rows on a local MySQL server.  I don’t have specific benchmarks.  However, to scrub and transform the data it took roughly 3 days.  Needless to say after that I moved over and used some of the Amazon products and was able to quickly scale up and use more Amazon instances.

 

20130418_144433

 

Amazon DynamoDB

I haven’t had the opportunity to use this product but it looks very promising and appears to provide a great resource for a NoSQL alternative to relational databases and a strong competitor to some of the other NoSQL databases.  It is a proprietary software but I would be interested in comparing it to Cassandra or other Hadoop style systems.  Some of the libraries, mappers, and mock are available at http://j.mp/dynamodb-libs.

 

Summary

From personal experience I have been able to use R and Hadoop as well as some PHP scripts and Java programs on Amazon instances.  The price for using any of these products is very good and is generally a whole lot cheaper than purchasing in-house hardware.  Plus it provides flexibility to use a wide range of software.  It will be interesting to see how well Redshift performs as well as DynamoDB.  Redshift in particular looks very promising.

 

As a side,  I’m in no way associated with Amazon, I’m simply an occasional user of their products.

Simulating the Gambler’s Ruin

The gambler’s ruin problem is one where a player has a probability p of winning  and probability q of losing. For example let’s take a skill game where the player x can beat player y with probability 0.6 by getting closer to a target. The game play begins with player x being allotted 5 points and player y allotted 10 points. After each round a player’s points either decrease by one or increase by one we can determine the probability that player x will annihilate player y. The player that reaches 15 wins and the player that reach zero is annihilated. There is a wide range of application for this type of problem that goes being gambling.

Number of Time Until Annihilation

This is actually a fairly simple problem to solve on pencil and paper and to determine an exact probability. Without going into too much detail we can determine the probability of annihilation by \frac{1-\left(\frac{q}{p}\right)^i}{1-\left(\frac{q}{p}\right)^N}. In this example it works out to be \frac{1-\left(\frac{.4}{.6}\right)^5}{1-\left(\frac{.4}{.6}\right)^{10}} \approx 0.8703.

But this is a relatively boring approach and coding up an R script makes everything that much better. So here is a simulation of this same problem estimating that same probability plus it provides additional information on the distribution of how many times this game would have to be played.

gen.ruin = function(n, x.cnt, y.cnt, x.p){
x.cnt.c = x.cnt
y.cnt.c = y.cnt
x.rnd = rbinom(n, 1, p=x.p)
x.rnd[x.rnd==0] = -1
y.rnd = x.rnd*-1
x.cum.sum = cumsum(x.rnd)+x.cnt
y.cum.sum = cumsum(y.rnd)+y.cnt

ruin.data = cumsum(x.rnd)+x.cnt

if( any( which(ruin.data>=x.cnt+y.cnt) ) | any( which(ruin.data< =0) ) ){ cut.data = 1+min( which(ruin.data>=x.cnt+y.cnt), which(ruin.data< =0) ) ruin.data[cut.data:length(ruin.data)] = 0 } return(ruin.data) } n.reps = 10000 ruin.sim = replicate(n.reps, gen.ruin(n=1000, x.cnt=5, y.cnt=10, x.p=.6)) ruin.sim[ruin.sim==0] = NA hist( apply(ruin.sim==15 | is.na(ruin.sim), 2, which.max) , nclass=100, col='8', main="Distribution of Number of Turns", xlab="Turn Number") abline(v=mean(apply(ruin.sim==15 | is.na(ruin.sim), 2, which.max)), lwd=3, col='red') abline(v=median(apply(ruin.sim==15 | is.na(ruin.sim), 2, which.max)), lwd=3, col='green') x.annihilation = apply(ruin.sim==15, 2, which.max) ( prob.x.annilate = length(x.annihilation[x.annihilation!=1]) / n.reps ) state.cnt = ruin.sim state.cnt[state.cnt!=15] = 0 state.cnt[state.cnt==15] = 1 mean.state = apply(ruin.sim, 1, mean, na.rm=T) plot(mean.state, xlim=c(0,which.max(mean.state)), ylim=c(0,20), ylab="Points", xlab="Number of Plays", pch=16, cex=.5, col='green') lines(mean.state, col='green') points(15-mean.state, pch=16, cex=.5, col='blue') lines(15-mean.state, col='blue') [/code]

Gambler's Ruin

Finding the Distribution Parameters

This is a brief description on one way to determine the distribution of given data. There are several ways to accomplish this in R especially if one is trying to determine if the data comes from a normal distribution. Rather than focusing on hypothesis testing and determining if a distribution is actually the said distribution this example shows one simple approach to determine the parameters of a distribution. I’ve found this useful when I’m given a dataset and I need to generate more of the same type of data for testing and simulation purposes. The example below uses data that originated from a gamma (1,2). Here we can see how well the data fits the distribution.

Simulated Gamma Distribution


raw = t( matrix(c(
1, 0.4789,
1, 0.1250,
2, 0.7048,
2, 0.2482,
2, 1.1744,
2, 0.2313,
2, 0.3978,
2, 0.1133,
2, 0.1008,
1, 0.7850,
2, 0.3099,
1, 2.1243,
2, 0.3615,
2, 0.2386,
1, 0.0883), nrow=2
) )
( fit.distr = fitdistr(raw[,2], "gamma") )
qqplot(rgamma(nrow(raw),fit.distr$estimate[1], fit.distr$estimate[2]), (raw[,2]),
xlab="Observed Data", ylab="Random Gamma")
abline(0,1,col='red')

simulated = rgamma(1000, fit.distr$estimate[1], fit.distr$estimate[2])
hist(simulated, main=paste("Histogram of Simulated Gamma using",round(fit.distr$estimate[1],3),"and",round(fit.distr$estimate[2],3)),
col=8, xlab="Random Gamma Distribution Value")

( fit.distr = fitdistr(raw[,2], "normal") )
qqplot(rnorm(nrow(raw),fit.distr$estimate[1], fit.distr$estimate[2]), (raw[,2]))
abline(0,1,col='red')

( fit.distr = fitdistr(raw[,2], "lognormal") )
qqplot(rlnorm(nrow(raw),fit.distr$estimate, fit.distr$sd), (raw[,2]))
abline(0,1,col='red')

( fit.distr = fitdistr(raw[,2], "exponential") )
qqplot(rexp(nrow(raw),fit.distr$estimate), (raw[,2]))
abline(0,1,col='red')

Distribution of QQPlot

Dirichlet Process, Infinite Mixture Models, and Clustering

The Dirichlet process provides a very interesting approach to understand group assignments and models for clustering effects.   Often time we encounter the k-means approach.  However, it is necessary to have a fixed number of clusters.  Often we encounter situations where we don’t know how many fixed clusters we need.  Suppose we’re trying to identify groups of voters.  We could use political partisanship (e.g. low/medium/high Democratic vote) but that may not necessary describe the data appropriately.  If this is the case then we can turn to Bayesian nonparametrics and the Dirichlet Process and use some approaches there to solve this problem.  Three in particular are commonly used as examples: the Chinese Restaurant ModelPólya’s Urn, and Stick Breaking.

Chinese Restaurant Model

Chinese Restaurant Model

 The Chinese Restaurant Model is based on idea that there is a restaurant with an infinite number of tables.  At each table there are an infinite number of seats.  The first customer arrives and sits down at a table.  The second customer then arrives and selects a table.  However, the customer selects the table that the first customer is currently sitting with probability \alpha/(1+\alpha) or selects a new table with 1/(1+\alpha).  This continues on to the (n+1)^{st} customer where they select a table that a current customer is sitting with probability n_{k}/(n+\alpha).

crp = function(num.customers, alpha) {
table < - c(1)
next.table <- 2
for (i in 1:num.customers) {
if (runif(1,0,1) < alpha / (alpha + i)) {
# Add a new ball color.
table <- c(table, next.table)
next.table <- next.table+1
} else {
# Pick out a ball from the urn, and add back a
# ball of the same color.
select.table <- table&#91;sample(1:length(table), 1)&#93;
table <- c(table, select.table)
}
}
table
}
crp(100, 4)
plot(
table( crp(10000, 2) )
,xlab="Table Index", ylab="Frequency"
)
&#91;/code&#93;
<p style="text-align: center;"><a href="http://statistical-research.com/wp-content/uploads/2013/03/StickBreakingProb.png"> </a>

<h2>Pólya's Urn Model</h2>
In the Pólya's Urn model we take the approach where there exists a urn of colored balls.  We take a ball out of the urn and note its color.  We replace the ball back into the urn and then we add an additional ball of the same color to the urn. This process can continue on infinitely.


rgb2hex < - function(x){
hex.part = ""
hex <- ""
for(i in 1:3){
b16 <- x&#91;,i&#93;/16
int.one <- trunc(b16)
if(int.one>=10){
val.one < - letters&#91;int.one-10+1&#93;
} else {
val.one <- int.one
}
fract <- abs( b16 - int.one )
int.two <- fract*16
if(int.two>=10){
val.two < - letters&#91;int.two-10+1&#93;
} else {
val.two <- int.two
}
hex.part&#91;i&#93; <- paste(val.one,val.two, sep="")
hex <- paste(hex,hex.part&#91;i&#93;, sep="")
}
hex
}
polyaUrnModel = function(baseColorDistribution, num.balls, alpha) {
balls < - c()
for (i in 1:num.balls) {
if (runif(1,0,1) < alpha / (alpha + length(balls))) {
# Add a new ball color.
library(colorspace)
color.comb < - expand.grid(x=seq(0,255),y=seq(0,255),z=seq(0,255))
location.picker <- rnorm(1,nrow(color.comb)/2, (nrow(color.comb)-1)/4 )
the.color <- c( color.comb&#91;location.picker,1&#93;, color.comb&#91;location.picker,2&#93;, color.comb&#91;location.picker,3&#93;)
the.hex <- paste("#",rgb2hex(the.color), sep="")
new.color <- the.hex
balls = c(balls, new.color)
} else {
# Pick out a ball from the urn, and add back a
# ball of the same color.
ball = balls&#91;sample(1:length(balls), 1)&#93;
balls = c(balls, ball)
}
}
balls
}
pum < - polyaUrnModel(function() rnorm(1,0,1), 100, 1)
barplot( table(pum), col=names(table(pum)), pch=10 )
&#91;/code&#93;

&nbsp;
<p style="text-align: center;"><a href="http://statistical-research.com/wp-content/uploads/2013/03/BarplotPolyaUrn.png"><img class="aligncenter  wp-image-1245" alt="Polya Urn Model" src="http://statistical-research.com/wp-content/uploads/2013/03/BarplotPolyaUrn.png" width="419" height="321" /></a>

<h2>Stick Breaking Model</h2>
<p style="text-align: left;">With this third model we simply start breaking a stick and continue to break that stick into smaller pieces.  This process works by taking a stick of length 1.0.  We then generate one random number from the Beta distribution (\beta_{1} ~ Beta(1,\alpha). We then break the stick at \beta_1. The left side of the stick we'll call \nu_1.  We then take the remaining stick to the right and break it again at location (\beta_{2}  ~<code> </code>Beta(1, \alpha). Once again the piece to the left we'll call \nu_2 = \left(1-\beta_1\right) \cdot \beta_2. The sum of all of the probabilities generated will add up to 1.0.</p>


stickBreakingModel = function(num.vals, alpha) {
betas = rbeta(num.vals, 1, alpha)
stick.to.right = c(1, cumprod(1 - betas))[1:num.vals]
weights = stick.to.right * betas
weights
}

plot( stickBreakingModel(100,5), pch=16, cex=.5 )

 

Stick Breaking Probabilities

Multivariate Clustering

##
# Generate some fake data with some uniform random means
##
generateFakeData < - function( num.vars=3, n=100, num.clusters=5, seed=NULL ) { if(is.null(seed)){ set.seed(runif(1,0,100)) } else { set.seed(seed) } data <- data.frame(matrix(NA, nrow=n, ncol=num.vars+1)) mu <- NULL for(m in 1:num.vars){ mu <- cbind(mu,rnorm(num.clusters, runif(1,-10,15), 5)) } for (i in 1:n) { cluster <- sample(1:num.clusters, 1) data[i, 1] <- cluster for(j in 1:num.vars){ data[i, j+1] <- rnorm(1, mu[cluster,j], 1) } } data$X1 <- factor(data$X1) var.names <- paste("VAR",seq(1,ncol(data)-1), sep="") names(data) <- c("cluster",var.names) return(data) } ## # Set up a procedure to calculate the cluster means using squared distance ## dirichletClusters <- function(orig.data, disp.param = NULL, max.iter = 100, tolerance = .001) { n <- nrow( orig.data ) data <- as.matrix( orig.data ) pick.clusters <- rep(1, n) k <- 1 mu <- matrix( apply(data,2,mean), nrow=1, ncol=ncol(data) ) is.converged <- FALSE iteration <- 0 ss.old <- Inf ss.curr <- Inf while ( !is.converged & iteration < max.iter ) { # Iterate until convergence iteration <- iteration + 1 for( i in 1:n ) { # Iterate over each observation and measure the distance each observation' from its mean center's squared distance from its mean distances <- rep(NA, k) for( j in 1:k ){ distances[j] <- sum( (data[i, ] - mu[j, ])^2 ) # Distance formula. } if( min(distances) > disp.param ) { # If the dispersion parameter is still less than the minimum distance then create a new cluster
k < - k + 1 pick.clusters[i] <- k mu <- rbind(mu, data[i, ]) } else { pick.clusters[i] <- which(distances == min(distances)) } } ## # Calculate new cluster means ## for( j in 1:k ) { if( length(pick.clusters == j) > 0 ) {
mu[j, ] < - apply(subset(data,pick.clusters == j), 2, mean) } } ## # Test for convergence ## ss.curr <- 0 for( i in 1:n ) { ss.curr <- ss.curr + sum( (data[i, ] - mu[pick.clusters[i], ])^2 ) } ss.diff <- ss.old - ss.curr ss.old <- ss.curr if( !is.nan( ss.diff ) & ss.diff < tolerance ) { is.converged <- TRUE } } centers <- data.frame(mu) ret.val <- list("centers" = centers, "cluster" = factor(pick.clusters), "k" = k, "iterations" = iteration) return(ret.val) } create.num.vars <- 3 orig.data <- generateFakeData(create.num.vars, num.clusters=3, n=1000, seed=123) dp.update <- dirichletClusters(orig.data[,2:create.num.vars+1], disp.param=25) ggplot(orig.data, aes(x = VAR1, y = VAR3, color = cluster)) + geom_point() [/code]

 Clustering 1
In this example I have provided some R code that clusters variables based an any given number of variables.  The measure of distance from the group centroid is the multivariate sum of squared distance, though there are many other distance metrics that could be implemented (e.g. manhattan, euclidean, etc.)
Clustering 2

Significant P-Values and Overlapping Confidence Intervals

There are all sorts of problems with p-values and confidence intervals and I have no intention (or the time) to cover all those problems right now.  However, a big problem is that most people have no idea what p-values really mean. Here is one example of a common problem with p-values and how it relates to confidence intervals.  The problem arises when there are two  random variables (from independent populations), each with a mean and variance.  A confidence interval can be constructed around each sample mean.  Using these confidence intervals might be a tempting way to explain whether two values are statistically different.  The issue is that a person may see that the confidence intervals overlap and therefore declare that there is no difference.  Simply put this is not one of those iff (if and only if) situations.  If the confidence intervals do not overlap then one can conclude that there is a statistical difference between means.  However, the opposite can not be concluded.  When the confidence intervals do overlap then there still may be a difference.

Here are simulated data from two independent normally distributed populations testing the confidence intervals and the p-values.  It can easily be seen that this is a fairly frequent event. So don’t make the mistake and make conclusions solely on confidence intervals.Histogram of Confidence Intervals

Beta Distribution for Overlapping Confidence Intervals


###
#Set some constants
###
alpha = .05
m = 15
n = 15
nsim = 20000

###
#Function to calculate the t statistic. Same as t.test(x,y, var.eqal=T)
# `spooled` can easiliy be modified.
###
tstatfunc=function(x,y){
m = length(x)
n = length(y)
spooled = sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2))
tstat = (mean(x)-mean(y))/(spooled*sqrt(1/m+1/n))

return(tstat)
}

calcInterval = function(){
x = rnorm(m,0,1)
y = rnorm(n,1,.8)
se.x = sd(x)/sqrt(m)
se.y = sd(y)/sqrt(n)

t.stat = tstatfunc(x,y)

p.val = (1-pt(abs(t.stat),m+n-2))*2 #Pooled Variance, two sided hypothesis
ci.x.ll = mean(x)-abs(qt(alpha/2,m-1))*se.x
ci.x.ul = mean(x)+abs(qt(alpha/2,m-1))*se.x
ci.y.ll = mean(y)-abs(qt(alpha/2,n-1))*se.y
ci.y.ul = mean(y)+abs(qt(alpha/2,n-1))*se.y
TTest = t.test(x,y, var.equal=TRUE) #Run the t.test() function for comparison
ret.val = c(p=p.val, t.p=TTest$p.value,
ci.x.ll=ci.x.ll, ci.x.ul=ci.x.ul,
ci.y.ll=ci.y.ll, ci.y.ul=ci.y.ul)

return(ret.val)
}

###
#Replicate a few times
###
my.sims = as.data.frame( t(replicate(nsim,calcInterval())) )

###
#Do the intervals overlap and it's significant?
###
ci.vals = cbind(my.sims$ci.x.ll - my.sims$ci.y.ul, my.sims$ci.x.ul-my.sims$ci.y.ul, my.sims$ci.x.ul-my.sims$ci.y.ll)
overlapTest = (ci.vals[,1] > 0 & ci.vals[,2] > 0 & ci.vals[,3] > 0) |
(ci.vals[,1] < 0 & ci.vals&#91;,2&#93; < 0 & ci.vals&#91;,3&#93; < 0)
my.sims = cbind(my.sims,ci.vals,NotOverlap=overlapTest)

sum(my.sims$CI.p)/nsim

hist(
rbeta(100000,sum(my.sims$CI.p)+1, nsim-sum(my.sims$CI.p)+1) ##CI.p is a binomial distribution.
, nclass=100, xlab="Proportion", freq=F, main=expression("Histogram of Overlapping Confidence Intervals and p<"*alpha*" from Beta Distribution"))
###
#What percent have overlapping CI and p < alpha? -- "Significant" but yet CI indicate otherwise.
#Multiple comparisons
###
my.sims$CI.p = as.numeric( !my.sims$NotOverlap & my.sims$p < alpha )
my.sims$CI.p2 = as.numeric( my.sims$NotOverlap & my.sims$p > alpha )
my.sims$my.diff = my.sims$p-my.sims$t.p # Check calculations for consistency

sum(my.sims$CI.p)/length(my.sims$CI.p)

###
#How many have confidence intervals that do not overlap but yet are still "Significant"?
###
my.sims$CI.p2 = as.numeric( my.sims$NotOverlap & my.sims$p > alpha )
sum(my.sims$CI.p2)

###
#Histograms of the intervals
###
hist(my.sims$t.p, nclass=100)
subset(my.sims,my.sims$p > .5)

###
#Histograms of the intervals
###
hist(my.sims$ci.y.ul, nclass=100, xlim=c(-2,3), ylim=c(0,2), col=4, freq=F,
main="Histogram of Confidence Intervals", xlab="Value")
hist(my.sims$ci.y.ll, nclass=100, add=T, col=4, freq=F)
hist(my.sims$ci.x.ul, nclass=100, add=T, col=2, freq=F)
hist(my.sims$ci.x.ll, nclass=100, add=T, col=2, freq=F)

Simulating Random Multivariate Correlated Data (Categorical Variables)

Graph of Random Categorical Data and Groups

This is a repost of the second part of an example that I posted last year but at the time I only had the PDF document (written in \LaTeXe).

This is the second example to generate multivariate random associated data. This example shows how to generate ordinal, categorical, data. It is a little more complex than generating continuous data in that the correlation matrix and the marginal distribution is required.  This uses the R library GenOrd.

The graph above plots out the randomly generated data with the given correlation matrix and groups it  by the second variable.  Though there are many other approaches on graphing categorical data available.  One source is available here.

This example creates a 2-variable dataset. However, this can easily be extended to many more variables. The correlation matrix R for this 2-dimensional example.

R = \left( \begin{smallmatrix} 1&-0.6\\ -0.6&1 \end{smallmatrix} \right)

The R code below will generate an ordinal dataset with a correlation matrix of:

R = \left( \begin{smallmatrix} 1&-0.5469243\\ -0.5469243&1 \end{smallmatrix} \right)

Increasing the sample size will let the correlation coefficients converge on the target correlations.

library(GenOrd)
set.seed(1)
# Sets the marginals.
# The values are cumulative so for the first variable the first marginal will be .1, the second is .2, the third is .3, and the fourth is .4
marginal < - list(c(0.1,0.3,0.6),c(0.4,0.7,0.9)) # Checks the lower and upper bounds of the correlation coefficients. corrcheck(marginal) # Sets the correlation coefficients R <- matrix(c(1,-0.6,-0.6,1),2,2) # Correlation matrix n <- 100 ##Selects and ordinal sample with given correlation R and given marginals. m <- ordsample(n, marginal, R) ##compare it with the pre-defined R cor(m) table(m[,1],m[,2]) chisq.test(m) gbar < - tapply(m[,1], list(m[,1], m[,2]), length) par(mfrow=c(1,1)) barplot(gbar, beside=T, col=cm.colors(4), main="Example Bar Chart of Counts by Group",xlab="Group",ylab="Frequency") [/sourcecode]

Simulating Random Multivariate Correlated Data (Continuous Variables)

Randomly Generated Data Before Cholesky Decomposition

This is a repost of an example that I posted last year but at the time I only had the PDF document (written in \LaTeXe).  I’m reposting it directly into WordPress and I’m including the graphs.

From time-to-time a researcher needs to develop a script or an application to collect and analyze data. They may also need to test their application under a variety of scenarios prior to data collection. However, because the data has not been collected yet it is necessary to create test data. Creating continuous data is relatively simple and is fairly straight forward using the Cholesky (pronounced kol-eh-ski) decomposition. This approach takes an original X variable (or matrix) and uses the Cholesky transformation to create a new, correlated, Y variable. To make things simple and straight forward this example will generate data from the a random normal distribution N(0,1).

 

The reason this approach is so useful is that that correlation structure can be specifically defined. The scripts can be used to create many different variables with different correlation structures. The method to transform the data into correlated variables is seen below using the correlation matrix R.

R = \left( \begin{smallmatrix} 1&0.8&0.2\\ 0.8&1&0.7\\0.2&0.7&1 \end{smallmatrix} \right)

 

Once the correlation matrix is set the researcher takes the Cholesky decomposition of the correlation matrix. Multiplying the Cholesky decomposition of the correlation matrix by the data matrix the resulting matrix is a transformed dataset with the specified correlation.

W = \left[Cholesky (R)\right]\left[X\right]
The R code from below will generate a correlation matrix of:

R = \left( \begin{smallmatrix} 1&0.7997999&0.1998661\\ 0.7997999&1&0.7000217\\0.1998661&0.7000217&1 \end{smallmatrix} \right)

Randomly Generated Data After Cholesky Decomposition
R = matrix(cbind(1,.80,.2,  .80,1,.7,  .2,.7,1),nrow=3)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 100000
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1","predictor2")
cor(raw)
plot(head(raw, 100))
plot(head(orig.raw,100))