Simulating Random Multivariate Correlated Data (Categorical Variables)

Graph of Random Categorical Data and Groups

This is a repost of the second part of an example that I posted last year but at the time I only had the PDF document (written in \LaTeXe).

This is the second example to generate multivariate random associated data. This example shows how to generate ordinal, categorical, data. It is a little more complex than generating continuous data in that the correlation matrix and the marginal distribution is required.  This uses the R library GenOrd.

The graph above plots out the randomly generated data with the given correlation matrix and groups it  by the second variable.  Though there are many other approaches on graphing categorical data available.  One source is available here.

This example creates a 2-variable dataset. However, this can easily be extended to many more variables. The correlation matrix R for this 2-dimensional example.

R = \left( \begin{smallmatrix} 1&-0.6\\ -0.6&1 \end{smallmatrix} \right)

The R code below will generate an ordinal dataset with a correlation matrix of:

R = \left( \begin{smallmatrix} 1&-0.5469243\\ -0.5469243&1 \end{smallmatrix} \right)

Increasing the sample size will let the correlation coefficients converge on the target correlations.

# Sets the marginals.
# The values are cumulative so for the first variable the first marginal will be .1, the second is .2, the third is .3, and the fourth is .4
marginal < - list(c(0.1,0.3,0.6),c(0.4,0.7,0.9)) # Checks the lower and upper bounds of the correlation coefficients. corrcheck(marginal) # Sets the correlation coefficients R <- matrix(c(1,-0.6,-0.6,1),2,2) # Correlation matrix n <- 100 ##Selects and ordinal sample with given correlation R and given marginals. m <- ordsample(n, marginal, R) ##compare it with the pre-defined R cor(m) table(m[,1],m[,2]) chisq.test(m) gbar < - tapply(m[,1], list(m[,1], m[,2]), length) par(mfrow=c(1,1)) barplot(gbar, beside=T, col=cm.colors(4), main="Example Bar Chart of Counts by Group",xlab="Group",ylab="Frequency") [/sourcecode]

Simulating Random Multivariate Correlated Data (Continuous Variables)

Randomly Generated Data Before Cholesky Decomposition

This is a repost of an example that I posted last year but at the time I only had the PDF document (written in \LaTeXe).  I’m reposting it directly into WordPress and I’m including the graphs.

From time-to-time a researcher needs to develop a script or an application to collect and analyze data. They may also need to test their application under a variety of scenarios prior to data collection. However, because the data has not been collected yet it is necessary to create test data. Creating continuous data is relatively simple and is fairly straight forward using the Cholesky (pronounced kol-eh-ski) decomposition. This approach takes an original X variable (or matrix) and uses the Cholesky transformation to create a new, correlated, Y variable. To make things simple and straight forward this example will generate data from the a random normal distribution N(0,1).


The reason this approach is so useful is that that correlation structure can be specifically defined. The scripts can be used to create many different variables with different correlation structures. The method to transform the data into correlated variables is seen below using the correlation matrix R.

R = \left( \begin{smallmatrix} 1&0.8&0.2\\ 0.8&1&0.7\\0.2&0.7&1 \end{smallmatrix} \right)


Once the correlation matrix is set the researcher takes the Cholesky decomposition of the correlation matrix. Multiplying the Cholesky decomposition of the correlation matrix by the data matrix the resulting matrix is a transformed dataset with the specified correlation.

W = \left[Cholesky (R)\right]\left[X\right]
The R code from below will generate a correlation matrix of:

R = \left( \begin{smallmatrix} 1&0.7997999&0.1998661\\ 0.7997999&1&0.7000217\\0.1998661&0.7000217&1 \end{smallmatrix} \right)

Randomly Generated Data After Cholesky Decomposition
R = matrix(cbind(1,.80,.2,  .80,1,.7,  .2,.7,1),nrow=3)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 100000
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw =
orig.raw =
names(raw) = c("response","predictor1","predictor2")
plot(head(raw, 100))

Distribution of T-Scores

Like most of my post these code snippets derive from various other projects.  In this example it shows a simulation of how one can determine if a set of t statistics are distributed properly.  This can be useful when sampling known populations (e.g. U.S. census or hospital populations) or populations that will soon be known (e.g. pre-election, exit polling).  This is a simple example but the concept can be expanded upon to include varying sample sizes and varying known mean values.  When collecting data in real life the nsim value will likely be only a handful of random samples rather than a million.  In this example a fixed constant sample size of 50 is used.

If you’re collecting data and you begin to see that your distribution of t scores begins to deviate from the known distribution then it might be time to tweak some of the algorithms.

nsims <- 1000000 n <- 50 x <- replicate(nsims, rexp(n, 5)) <- apply(x, 2, sd) x.mean <- apply(x, 2, mean) x.t <- (x.mean - 0)/( qqnorm(x.t) # follows a normal distribution (x.grand.mean <- mean(x.t)) # ~0 median(x.t) # ~0 var(x.t) # v/(v-2) skewness(x.t) # ~0 library(e1071) kurtosis(x.t, type=1) theta <- seq(-4,4, by=.01) p <- dt(theta, n) p <- p/max(p) d <- density(x.t) plot(d) plot(theta, p, type = "l", ylab = "Density", lty = 2, lwd = 3) abline(v=x.grand.mean, col="red") [/sourcecode]

Bootstrap Confidence Intervals

Here is an example of nonparametric bootstrapping.  It’s a powerful technique that is similar to the Jackknife. With the bootstrap, however, the approach uses re-sampling. It’s clearly not as good as parametric approaches but it gets the job done. This can be used in a variety of situations ranging from variance estimation to model selection. John Tukey, as the story goes, suggested the name “the shotgun” because you can blow the head off any statistical problem.

The code below is for illustrative purposes and compares a couple of different approaches for bootstrapping. The mean shows a very nice distribution but something like a median is not so symmetrical  The code below can easily be changed to allow for any single statistic (e.g. any percentile). A little bit of alteration and bivariate statistics (e.g. correlation) can be bootstrapped.  One can observe that it is quite simple to obtain the confidence interval directly.  By using nboot=10000 (or any other number that can easily be divided) it makes it quite simple to find the confidence interval by merely taking the alpha/2 and (1-alpha/2) percentiles; in this case below the 50 and 9950 positions.

nboot <- 10000 # Number of simulations alpha <- .01 # alpha level n <- 1000 # sample size bootThetaQuantile <- function(x,i) { quantile(x[i], probs=.5) } bootThetaMean <- function(x,i) { mean(x[i]) } raw <- rnorm(n,0, 1) # raw data ( theta.boot.median <- boot(raw, bootThetaQuantile, R=nboot) ), conf=(1-alpha)) ( theta.boot.mean <- boot(raw, bootThetaMean, R=nboot) ), conf=(1-alpha)) my.replicate <- replicate(nboot, raw[sample(1:length(raw), n, replace=TRUE)]) # Bootstrap theta.median <- apply(my.replicate, 2, bootThetaQuantile) theta.mean <- apply(my.replicate, 2, bootThetaMean) hist(theta.median, xlim=c(-.2,.2), nclass=50, col=3, main="Histogram of Bootstrap Confidence Intervals for Median") hist(theta.mean, xlim=c(-.2,.2), nclass=50, col=3, main="Histogram of Bootstrap Confidence Intervals for Mean") sort(theta.median)[nboot*alpha/2] sort(theta.median)[nboot*(1-alpha/2)] sort(theta.mean)[nboot*alpha/2] sort(theta.mean)[nboot*(1-alpha/2)] ### Randomly generated data my.replicate <- replicate(nboot, rnorm(n,0,1)) theta.rand.median <- apply(my.replicate, 2, bootThetaQuantile) theta.rand.mean <- apply(my.replicate, 2, bootThetaMean) ci.u <- mean(theta.rand.mean)+qnorm(1-alpha/2)*sd(raw)/sqrt(n) ci.l <- mean(theta.rand.mean)-qnorm(1-alpha/2)*sd(raw)/sqrt(n) hist(theta.rand.median, xlim=c(-.2,.2), nclass=100, col=3, main="Histogram of Randomly Generated Data for Medians") hist(theta.rand.mean, xlim=c(-.2,.2), nclass=50, col=3, main="Histogram of Randomly Generated Data for Means") abline(v=c(ci.u,ci.l)) [/sourcecode]

Histogram for Bootstrap Median






Binomial Confidence Intervals

This stems from a couple of binomial distribution projects I have been working on recently.  It’s widely known that there are many different flavors of confidence intervals for the binomial distribution.  The reason for this is that there is a coverage problem with these intervals (see Coverage Probability).  A 95% confidence interval isn’t always (actually rarely) 95%.  So I got curious what would happen if I generated random binomial data to find out what percent of the simulated data actually fell within the confidence interval.  For simplicity I used p=0.5.  I wrote up some quick code that generates binomial data for varying sample sizes and runs a comparison for each of the nine confidence interval methods using the binom.confint() function.

nsims <- 10000 maxn <- 500 n <- seq(2,maxn, by=2) my.method <- c("exact", "ac", "asymptotic", "wilson", "prop.test", "bayes", "logit", "cloglog", "probit") my.method <- my.method[sort.list(my.method)] coverage <- matrix(NA, nrow=length(n), ncol=length(my.method)) ci.lower <- ci.upper <- matrix(NA, ncol=length(my.method), nrow=length(n)) for(i in 1:length(n)){ m <- n[i]/2 y <- rbinom(nsims, n[i], m/n[i]) ll <- binom.confint(m,n[i], conf.level=.95, method=my.method)$lower ul <- binom.confint(m,n[i], conf.level=.95, method=my.method)$upper ci.lower[i,] <- ll ci.upper[i,] <- ul for(j in 1:length(my.method)){ sig <- length(y[y/n[i]<=ul[j] &amp;amp;amp;amp; y/n[i]>=ll[j]])
coverage[i,j] <- sig/nsims } } plot(n,NULL, xlim=c(1,nrow(coverage)+1), ylim=c(.83,1), col=1, pch=16, ylab="Percent", xlab="N", main="95% Confidence Intervals for p=.5") points(replicate(ncol(coverage),n),coverage, col=c(1:9), pch=16, cex=.5) abline(h=seq(.93,.97, by=.01), col="#EEEEEE") abline(h=.95, col="#000000", lwd=2) abline(v=seq(2,maxn, by=20), col="#EEEEEE") legend("bottomright", my.method, col=c(1:9), pch=16, title="Interval Types", bg="#FFFFFF") plot(n,NULL, xlim=c(1,100), ylim=c(0,1), col=1, pch=16, ylab="Percent", xlab="N", main="95% Confidence Interval for p=.5") for(k in 1:ncol(coverage)){ lines(n, ci.lower[,k], col=k, lwd=1) lines(n, ci.upper[,k], col=k, lwd=1) } legend("bottomright", my.method, col=c(1:9), ncol=2, lwd=1, title="Interval Types", bg="#FFFFFF") [/sourcecode]

Binomial Confidence Intervals

Binomial 95% Confidence Interval

Brief Thoughts on Data Entry: Excel Should Not to Be Used

Over the years I have worked with many people who do the field work to collect data. They have their process and it works for them. However, their work rarely extends beyond data collection. This often means data is collected in Excel. If an observation meets a specific criterion then they are colored yellow, another criterion and they are blue.  If there was a problem with the observation then it is highlighted in bold. This is a very easy approach. It takes very little preparation and one can sling data around like it’s nobody’s business.

When analyzing the data dealing with this on a small scale usually isn’t too difficult. However, when working on a data warehouse sized dataset this becomes an extremely tedious task (if not impossible).  The data scrubbing and preparation can be extremely difficult.  Take for example a simple yes or no question.  Suppose there are four people are entering data about whether soil samples contain a specific contaminate.  Each person will begin entering their data and one person enters ‘Y’/’N’, another enters ‘Yes’/’No’, yet another enters ‘y’/’n’.  Worse yet the fourth person enters 1 for yes and 2 for no.  All of the sudden there are a total of eight codes when there should really only be two.


So some extra work in advance.  It’s well worth the time spent to prepare for data collection and have the data entered cleanly rather than trying to make sense of messy data after the fact.

There are a number of solutions that will work depending on the needs of those entering the data.  I mention a few here

  • Custom developed web application.  It take some work and potentially additional development costs but it can be completely customized to the project.  I have often taken this approach because it allows me to have a completely customized application for the job.
  • Open source client software.  There are many open source software packages that can be used for data entry.  An example of this is EpiData or EpiInfo.
  • Commercial software like Qualtrics.  It’s a little pricier but these software packages offer a lot of features.
  • Worst case: use SurveyMonkey

Y2K38: Our Own Mayan Calendar…Again

It’s not quite the end of the world as we know it.  We made it through December 21, 2012 unscathed. It’s not going to be the last time we will make it through such a pseudo-calamity.  After all we have built our own end of the world before (e.g. Y2K).

Next up January 19, 2038.  We’ve unknowingly made this the date that time will stop on all (well just 32-bit systems) computers.  Similar to the Y2K problem this is a problem due to the lack of available integers.  On January 19, 2038 3:14:07 (GMT) it is going to be exactly 2,147,483,647 seconds since the beginning of time (Unix epoch is January 1, 1970).  32-bit systems will only support 2,147,483,647 (2^{32-1}-1). It is 32-1 minus because the indexing begins at 0) as its largest integer and since Unix time is measured in seconds that means 2147483647 + 1 will generate an error and fail to calculate the correct date and time.  The most obvious solution will be to upgrade to 64-bit hardware.  This upgrade should buy us some more time and we won’t have to worry about another upgrade for another 9223372036854775808 seconds (Sunday, December 4, 292277026596).  However, since years are often still computed in 32-bit int (even in 64-bit systems) we’re going to loose roughly 290 millions years.  Ultimately, we’ll need to start worrying about an upgrade somewhere in the year 2147483647 (or later if the years were originally configured to start at year 1900).  Let’s hope by then someone can figure out how to deal with really big numbers.

Fortunately for users of R (and some other programs)  this is not necessarily an issue and there are ways around the maximum integer constraint problem. We’re not going to first encounter this problem in 2038. However, it’s possible that it has already exhibited itself in some software that needs to calculate future dates beyond January 19, 2038. An example would be if one needs a predictive time series model extending beyond January 19, 2038.

If you are on a 32-bit system you can give the following lines of code a try as a proof of concept:

as.POSIXlt( as.Date("1970-01-01") )+2147483647

>[1] "2038-01-19 03:14:07 UTC"

as.POSIXlt( as.Date("1970-01-01") )+2147483648

[1] "2038-01-19 03:14:08 UTC"

This can be extended even farther and it can be seen that the year is only limited by four digits.  After the year 9999-12-31 it becomes 0000-01-01

as.POSIXlt( as.Date("9999-12-31") )+86399

>[1] "9999-12-31 23:59:59 UTC"

as.POSIXlt( as.Date("9999-12-31") )+86401

>[1] "0000-01-01 00:00:01 UTC"

In the end it sounds like the system administrator for the Mayan calendar was probably just laid off during the very first financial crises and it was just long overdue for all of the software and hardware updates.

True Significance of a T Statistic

The example is more of a statistical exercise that  shows the true significance and the density curve of simulated random normal data.  The code can be changed to generate data using either a different mean and standard deviation or a different distribution altogether.

This extends the idea of estimating pi by generating random normal data to determine the actual significance level. First, there is a simple function to calculate a pooled t statistic. Then repeat that function N times. The second part of the R code will produce a density graph of the t statistic (normalized to 1). By changing the distribution, mean, standard deviation one can see what happens to the true significance level and the density graph.  Just keep that in mind the next time someone assumes a standard normal distribution for hypothesis testing when it’s really a different distribution altogether and they are set on alpha = 0.05

#Function to calculate the t statistic
m <- length(x) n <- length(y) spooled <- sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2)) tstat <- (mean(x)-mean(y))/(spooled*sqrt(1/m+1/n)) return(tstat) } ### #Set some constants ### alpha <- .05 m <- 15 n <- 15 N <- 20000 n.reject <- 0 ### #Iterate N times ### for (i in 1:N){ x <- rexp(m,1) y <- rexp(n,1) t.stat <- tstatfunc(x,y) if (abs(t.stat)>qt(1-alpha/2,n+m-2)){
n.reject <- n.reject+1 } } true.sig.level <- n.reject/N true.sig.level ### #Function to simulate t-statistic vector ### tsim=function(){ tstatfunc(rnorm(m,mean=0,sd=1), rnorm(n,mean=0,sd=1)) } ### #Set up the values to graph ### tstat.vec <- replicate(10000, tsim()) dens.y <- density(tstat.vec)$y/max(density(tstat.vec)$y) dens.x <- density(tstat.vec)$x ### #Graph the density for each ### plot( NULL, NULL, xlim=c(-5,5),ylim=c(0,1), main="Simulated and True Density" ) lines(dens.x,dens.y, lty=1, col=3, lwd=3) curve( dt(x,df=n+m-2)/max(dt(x,df=n+m-2)),add=TRUE ) [/sourcecode]

Estimating Pi

Recently I’ve been working on some jackknife and bootstrapping problems.  While working on those projects I figured it would be a fun distraction to take the process and estimate pi.  I’m sure this problem has been tackled countless times but I have never bothered to try it using a Monte Carlo approach.  Here is the code that can be used to estimate pi.  The idea is to generate a 2 by 2 box and then draw a circle inside of it.  Then start randomly selecting points inside the box.  This example uses 10,000 iterations using a sample size of 500.  An interesting follow-up exercise would be to change the sample size, mean, and standard deviation to and see what it does to the jackknife variance.

nsims <- 10000; size <- 500; alpha <- .05; theta <- function(x){sd(x)} pi.estimate <- function(nsims,size,alpha){ out <-; call <-; pi.sim <- matrix(NA, nrow=nsims); for(i in 1:nsims){ x <- runif(size,-1,1); y <- runif(size,-1,1); radius <- -1; x.y <- cbind(x,y); d.origin <- sqrt(x^2+y^2); x.y.d <- cbind(x.y,d.origin); est.pi <- 4*length(x.y.d[,3][x.y.d[,3]<1])/ length(x.y.d[,3]) pi.sim[i] <- est.pi; } pi.mean <- mean(pi.sim); jk <- jackknife(c(pi.sim), theta); <- jk$*qnorm(1-alpha/2); pi.mean <- pi.mean; <- jk$; <-; <-; <-; return(list(pi.mean=pi.mean,,,, call=call)); } pi.estimate(nsims,size,alpha); [/sourcecode]

Mean Value from Grouped Data

Occasionally, I will get requests from clients to calculate the mean. Most of the time it’s a simple request but from time-to-time the data was originally from grouped data. A common approach is to take the midpoint of each of the groups and just assume that all respondents within that group average out to the midpoint.  The biggest problem I have run into has been trying to decide what to do with two extremes.  What does one do with the age ’65+’ category or the ‘over $100,000’ category.  Since there is really no mid-point one must make a (somewhat arbitrary) determination on what value to use for those categories.  Over the years I have toyed around with various approaches but have found using this Monte Carlo approach has worked the best and seems to be most defensible.  In this example I use a sample of voters who were asked their age and were given categorical response options. The final deliverable will be the average age and standard deviation. 

Group Frequency
18 to <30 25
30 to <45 43
45 to <60 42
60+ 31


I have written some code that I have traditionally used for this type of work but have since found that the LearnBayes package contains the necessary functions as well.  For the purpose of making this code usable in other settings I have combined some of the code I have written with some of the functions from the LearnBayes package.  This way I have been able to modify some of the algorithms and graphs to work under a variety of settings.

I welcome any comment on other approaches people have taken to solve similar problems using grouped, categorical data.

###Note that this function is also available through the VGAM package.
###However, it is displayed here as part of the example.
laplace.dist <- function (logpost, mode, ...) { options(warn = -1) fit <- optim(mode, logpost, gr = NULL, ..., hessian = TRUE, control = list(fnscale = -1)) options(warn = 0) mode <- fit$par hess <- -solve(fit$hessian) l.mode <- length(mode) int <- l.mode/2 * log(2 * pi) + 0.5 * log(det(hess)) + logpost(mode, ...) ret.val <- list(mode = mode, var = hess, int = int, converge = fit$convergence == 0) return(ret.val) } metropolis.randwalk <- function (logpost, proposal, start, m, ...) { pb <- length(start) Mpar <- array(0, c(m, pb)) b <- matrix(t(start)) lb <- logpost(start, ...) a <- chol(proposal$var) scale <- proposal$scale accept <- 0 for (i in 1:m) { bc <- b + scale * t(a) %*% array(rnorm(pb), c(pb, 1)) lbc <- logpost(t(bc), ...) prob <- exp(lbc - lb) if ( == FALSE) { if (runif(1) < prob) { lb <- lbc b <- bc accept <- accept + 1 } } Mpar[i, ] <- b } accept <- accept/m ret.val <- list(par = Mpar, accept = accept) return(ret.val) } DataGroup <- function(theta, data){ cpoints=data$b freq <- data$f nbins <- length(cpoints) m <- theta[1] logsigma <- theta[2] z <- 0*m s <- exp(logsigma) z <- freq[1]*log(pnorm(cpoints[1],m,s)) for(j in 1:(nbins-1)){ z <- z+freq[j+1]*log(pnorm(cpoints[j+1],m,s)-pnorm(cpoints[j],m,s)) } z <- z+freq[nbins+1]*log(1-pnorm(cpoints[nbins],m,s)) return(z) } contour.plot = function (logf, limits, data, ...) { log.f = function(theta, data) { if (is.matrix(theta) == TRUE) { val = matrix(0, c(dim(theta)[1], 1)) for (j in 1:dim(theta)[1]) { val[j] = logf(theta[j,], data) } } else { val = logf(theta, data) } return(val) } ng <- 50 x0 <- seq(limits[1], limits[2], len = ng) y0 <- seq(limits[3], limits[4], len = ng) X <- outer(x0, rep(1, ng)) Y <- outer(rep(1, ng), y0) n2 <- ng^2 Z <- log.f(cbind(X[1:n2], Y[1:n2]), data) Z <- Z - max(Z) Z <- matrix(Z, c(ng, ng)) contour(x0, y0, Z, levels = seq(-6.9, 0, by = 2.3), lwd = 2) } lo <- c(18,30,45,60) hi <- c(30,45,60, Inf) d <- list(int.lo=lo, int.hi=hi, b = c(30,45,60), f=c(25,43,42,31)) y <- c(rep((18+30)/2,25),rep((30+45)/2,43),rep((45+60)/2,42),rep(61,31)) mean.y <- mean(y) <- log(sd(y)) start <- c(mean.y, fit <- laplace.dist(DataGroup,start,d) fit modal.sds <- sqrt(diag(fit$var)) proposal <- list(var=fit$var,scale=2) fit2 <- metropolis.randwalk(DataGroup,proposal,start,10000,d) fit2$accept post.means <- apply(fit2$par,2,mean) post.sds <- apply(fit2$par,2,sd) cbind(c(fit$mode),modal.sds) ##These should be similar cbind(post.means,post.sds) contour.plot(DataGroup,c(min(post.means[1])-10,max(post.means[1]+10),min(post.means[2]-.5),max(post.means[2]+.5)),d, xlab="mu",ylab="log sigma") points(fit2$par[1:2000,1],fit2$par[1:2000,2]) [/sourcecode]