Generating Random Correlated Data Part 1

Generating Randomly Correlated Data with Cholesky Decomposition

This example shows how to transform randomly generated continuous data into correlated data with a specified correlation.  It uses the Cholesky decomposition to make the transformation.  The second part of this series will discuss the process to generate associated random ordinal data.
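As a minimal sketch of the idea (the .7 target correlation below is illustrative, not from the original post), chol() factors an assumed correlation matrix R into an upper-triangular U with t(U) %*% U = R, and multiplying independent standard normal draws by U induces the desired correlation.

set.seed(1234)
R <- matrix(c(1, .7, .7, 1), nrow=2);  # target correlation matrix (assumed)
U <- chol(R);                          # upper-triangular factor: t(U) %*% U = R
Z <- matrix(rnorm(2000), ncol=2);      # independent standard normal columns
X <- Z %*% U;                          # transformed, correlated data
cor(X);                                # off-diagonal close to .7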

Fisher’s Exact Test

Fisher’s Exact Test is a useful tool when dealing with smaller sample sizes.  It allows a researcher to use the hypergeometric distribution to calculate a p-value when the assumptions of the chi-square test do not apply (e.g., small cell counts).  This example walks through the calculations to compute the probability density and cumulative density using Fisher’s classic Tea Drinker data.
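As a short sketch, the classic data form a 2 x 2 table (the lady correctly identified three of the four milk-first cups), and the one-sided p-value can be recovered both from fisher.test() and directly from the hypergeometric distribution.

tea <- matrix(c(3, 1, 1, 3), nrow=2,
  dimnames=list(Truth=c("Milk","Tea"), Guess=c("Milk","Tea")));
fisher.test(tea, alternative="greater");
dhyper(3, m=4, n=4, k=4);              # probability of exactly 3 correct guesses
sum(dhyper(3:4, m=4, n=4, k=4));       # cumulative upper tail; matches the p-value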

Outlier Detection using Local Outlier Factor

Outlier detection is an extremely useful tool, and there are many ways to identify an outlier. This example discusses one univariate approach and one multivariate approach. Outlier detection has many uses: a dataset can be inspected prior to analysis to ensure accurate results, and data can be validated during entry to help prevent data-entry errors. If a researcher has a simple univariate dataset, then something like the Grubbs test for outliers would work. The multivariate approach taken here is known as Local Outlier Factor (LOF), which compares the density around each point to the density around each of its neighbors. In R it is implemented as lofactor(), which replaces the implementation formerly in the dprep package. The dataset below plants an artificial outlier, which can be seen in the multivariate k-means clustering. This example uses two packages: DMwR for the LOF function and the outliers package for the Grubbs test.

library(DMwR);     # provides lofactor() for the Local Outlier Factor
library(outliers); # provides grubbs.test()
set.seed(1234)

## Generate three clusters of trivariate normal data
gen.xyz <- function(n, mean, sd) {
  cbind(rnorm(n, mean[1], sd[1]),
        rnorm(n, mean[2], sd[2]),
        rnorm(n, mean[3], sd[3]));
}
xyz <- rbind(gen.xyz(150, c(0,0,0), c(.2,.2,.2)),
             gen.xyz(150, c(2.5,0,1), c(.4,.2,.6)),
             gen.xyz(150, c(1.25,.5,.1), c(.3,.2,.5)));
xyz[1,] <- c(0,2,1.5);  # plant an artificial outlier
km.3 <- kmeans(xyz, 3);

## LOF: compare the density around each point to that of its k nearest neighbors
outlier.scores <- lofactor(xyz, k=5);
plot(density(outlier.scores));
outliers <- order(outlier.scores, decreasing=TRUE)[1:5];
print(outliers);

## Univariate Grubbs test on each variable
grubbs.test(xyz[,1], type = 10, opposite = FALSE, two.sided = FALSE);
grubbs.test(xyz[,2], type = 10, opposite = FALSE, two.sided = FALSE);
grubbs.test(xyz[,3], type = 10, opposite = FALSE, two.sided = FALSE);

## Flag the top LOF outliers in a scatterplot matrix
n <- nrow(xyz);
pch <- rep(".", n);
pch[outliers] <- "+";
col <- rep("black", n);
col[outliers] <- "red";
pairs(xyz, pch=pch, col=col);

## Color the pairwise plots by k-means cluster
my.cols <- km.3$cluster;
plot(xyz[,c(1,2)], col=my.cols);
plot(xyz[,c(1,3)], col=my.cols);
plot(xyz[,c(2,3)], col=my.cols);

The Jackknife

When it comes to estimating parameters, the standard error is often forgotten.   The jackknife is a slightly more involved way to estimate the standard error, but it is a good approach when dealing with complex samples.  Below is a simple example of how to calculate a jackknife standard error for the mean; the approach can be extended to cluster samples.

x = rbeta(100, runif(1,0,10), runif(1,0,10));
x = cbind(x);
msum = matrix(NA, nrow=length(x), ncol=1);

## Leave one observation out at a time and store the squared deviation
## of the leave-one-out mean from the full-sample mean
for(i in 1:length(x)){
  msum[i,1] = (mean(x[-i]) - mean(x))^2;
}

jk.var = (length(msum)-1)/length(msum)*sum(msum[,1]);
jk.se = sqrt(jk.var);
se = sd(x[,1])/sqrt(length(x));  # analytic standard error of the mean, for comparison

Iterative Proportional Fitting

Once a survey is conducted, it is common for the researcher to adjust the survey weights to match known population values.  This process is known as iterative proportional fitting (IPF), or raking, and was first introduced by W. Edwards Deming and Frederick Stephan.  It can be performed using something as simple as Microsoft Excel.  This example uses R to show how to adjust survey weights for one variable.
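As a minimal sketch with a single raking variable (the sample and the 48/52 population split below are assumed for illustration), one pass scales each weight by the ratio of the target proportion to the current weighted proportion.

set.seed(1234)
sex <- sample(c("M","F"), 200, replace=TRUE, prob=c(.6,.4));
w <- rep(1, 200);                        # initial design weights
target <- c(M=.48, F=.52);               # known population proportions (assumed)
current <- tapply(w, sex, sum)/sum(w);   # weighted sample proportions
w.adj <- w * target[sex]/current[sex];   # rake: scale by target/current
tapply(w.adj, sex, sum)/sum(w.adj);      # weighted proportions now match the target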