Simulating Random Multivariate Correlated Data (Continuous Variables)

Randomly Generated Data Before Cholesky Decomposition

This is a repost of an example that I posted last year but at the time I only had the PDF document (written in \LaTeXe).  I’m reposting it directly into WordPress and I’m including the graphs.

From time-to-time a researcher needs to develop a script or an application to collect and analyze data. They may also need to test their application under a variety of scenarios prior to data collection. However, because the data has not been collected yet it is necessary to create test data. Creating continuous data is relatively simple and is fairly straight forward using the Cholesky (pronounced kol-eh-ski) decomposition. This approach takes an original X variable (or matrix) and uses the Cholesky transformation to create a new, correlated, Y variable. To make things simple and straight forward this example will generate data from the a random normal distribution N(0,1).

 

The reason this approach is so useful is that that correlation structure can be specifically defined. The scripts can be used to create many different variables with different correlation structures. The method to transform the data into correlated variables is seen below using the correlation matrix R.

R = \left( \begin{smallmatrix} 1&0.8&0.2\\ 0.8&1&0.7\\0.2&0.7&1 \end{smallmatrix} \right)

 

Once the correlation matrix is set the researcher takes the Cholesky decomposition of the correlation matrix. Multiplying the Cholesky decomposition of the correlation matrix by the data matrix the resulting matrix is a transformed dataset with the specified correlation.

W = \left[Cholesky (R)\right]\left[X\right]
The R code from below will generate a correlation matrix of:

R = \left( \begin{smallmatrix} 1&0.7997999&0.1998661\\ 0.7997999&1&0.7000217\\0.1998661&0.7000217&1 \end{smallmatrix} \right)

Randomly Generated Data After Cholesky Decomposition
R = matrix(cbind(1,.80,.2,  .80,1,.7,  .2,.7,1),nrow=3)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 100000
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1","predictor2")
cor(raw)
plot(head(raw, 100))
plot(head(orig.raw,100))

Leave a comment

12 Comments

  1. Cool stuff. Here’s my take on it:

    http://stevencarlislewalker.wordpress.com/2012/06/05/simulating-random-variables-with-a-particular-correlation-structure/

    And here’s a little package for doing this stuff (on R-forge not CRAN):

    install.packages(“rmv”, repos=”http://R-Forge.R-project.org”)

    Here’s the syntax:

    rmv(n, covmat, rfunc = rnorm,
    method = c(“chol”, “eigen”), …)

    Reply
  2. fd

     /  March 11, 2013

    Thanks, this is very useful. I know I’ve learned this in the past but I’m fuzzy on the details. Do you have a citation or link that provides more details on why this works?

    Reply
  3. That’s some very useful code. Thanks!

    Reply
  4. ddaa

     /  March 14, 2013

    What if I need different distributions on different variables (e.g. first column normal distributed, second one uniform, third one gamma).

    Thanks in advance for your inputs.

    Reply
    • Wesley

      When you create the variable ”random.normal <-” you can construct the matrix for each dimension. So rather than having ”nrows=nvars” just set it equal to one (1). Then you can rbind the different distributions together. So that way you can have, for example, a Normal(0,1), a Uniform(1,2), and a Gamma(9,.5). Each will be on a different row of the matrix random.normal.

      Reply
      • ddaa

         /  March 18, 2013

        It works but this is what you get as correlation matrix:
        response predictor1 predictor2
        response 1.00 0.84 0.09
        predictor1 0.84 1.00 0.27
        predictor2 0.09 0.27 1.00

        In fact this was the solution I gave to myself but also the answer comply with my suspects.

        Reply
      • RogerS

         /  March 30, 2014

        Any chance you can post some code with different distributions for each row/variable?

        Thanks

        Reply
  5. ddaa

     /  March 18, 2013

    P.S.: a more appropriate answer can be Copula

    Reply
  6. Thanks for data generation. Statistical Simulation

    Reply
  1. Comment on Simulating Random Multivariate Corre...

Leave a Reply

Your email address will not be published. Required fields are marked *


− one = 6

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

%d bloggers like this: