Distribution of T-Scores

Like most of my post these code snippets derive from various other projects.  In this example it shows a simulation of how one can determine if a set of t statistics are distributed properly.  This can be useful when sampling known populations (e.g. U.S. census or hospital populations) or populations that will soon be known (e.g. pre-election, exit polling).  This is a simple example but the concept can be expanded upon to include varying sample sizes and varying known mean values.  When collecting data in real life the nsim value will likely be only a handful of random samples rather than a million.  In this example a fixed constant sample size of 50 is used.

If you’re collecting data and you begin to see that your distribution of t scores begins to deviate from the known distribution then it might be time to tweak some of the algorithms.

nsims <- 1000000 n <- 50 x <- replicate(nsims, rexp(n, 5)) x.sd <- apply(x, 2, sd) x.mean <- apply(x, 2, mean) x.t <- (x.mean - 0)/(x.sd/sqrt(nrow(x))) qqnorm(x.t) # follows a normal distribution (x.grand.mean <- mean(x.t)) # ~0 median(x.t) # ~0 var(x.t) # v/(v-2) skewness(x.t) # ~0 library(e1071) kurtosis(x.t, type=1) theta <- seq(-4,4, by=.01) p <- dt(theta, n) p <- p/max(p) d <- density(x.t) plot(d) plot(theta, p, type = "l", ylab = "Density", lty = 2, lwd = 3) abline(v=x.grand.mean, col="red") [/sourcecode]

Bootstrap Confidence Intervals

Here is an example of nonparametric bootstrapping.  It’s a powerful technique that is similar to the Jackknife. With the bootstrap, however, the approach uses re-sampling. It’s clearly not as good as parametric approaches but it gets the job done. This can be used in a variety of situations ranging from variance estimation to model selection. John Tukey, as the story goes, suggested the name “the shotgun” because you can blow the head off any statistical problem.

The code below is for illustrative purposes and compares a couple of different approaches for bootstrapping. The mean shows a very nice distribution but something like a median is not so symmetrical  The code below can easily be changed to allow for any single statistic (e.g. any percentile). A little bit of alteration and bivariate statistics (e.g. correlation) can be bootstrapped.  One can observe that it is quite simple to obtain the confidence interval directly.  By using nboot=10000 (or any other number that can easily be divided) it makes it quite simple to find the confidence interval by merely taking the alpha/2 and (1-alpha/2) percentiles; in this case below the 50 and 9950 positions.

nboot <- 10000 # Number of simulations alpha <- .01 # alpha level n <- 1000 # sample size bootThetaQuantile <- function(x,i) { quantile(x[i], probs=.5) } bootThetaMean <- function(x,i) { mean(x[i]) } raw <- rnorm(n,0, 1) # raw data ( theta.boot.median <- boot(raw, bootThetaQuantile, R=nboot) ) boot.ci(theta.boot.median, conf=(1-alpha)) ( theta.boot.mean <- boot(raw, bootThetaMean, R=nboot) ) boot.ci(theta.boot.mean, conf=(1-alpha)) my.replicate <- replicate(nboot, raw[sample(1:length(raw), n, replace=TRUE)]) # Bootstrap theta.median <- apply(my.replicate, 2, bootThetaQuantile) theta.mean <- apply(my.replicate, 2, bootThetaMean) hist(theta.median, xlim=c(-.2,.2), nclass=50, col=3, main="Histogram of Bootstrap Confidence Intervals for Median") hist(theta.mean, xlim=c(-.2,.2), nclass=50, col=3, main="Histogram of Bootstrap Confidence Intervals for Mean") sort(theta.median)[nboot*alpha/2] sort(theta.median)[nboot*(1-alpha/2)] sort(theta.mean)[nboot*alpha/2] sort(theta.mean)[nboot*(1-alpha/2)] ### Randomly generated data my.replicate <- replicate(nboot, rnorm(n,0,1)) theta.rand.median <- apply(my.replicate, 2, bootThetaQuantile) theta.rand.mean <- apply(my.replicate, 2, bootThetaMean) ci.u <- mean(theta.rand.mean)+qnorm(1-alpha/2)*sd(raw)/sqrt(n) ci.l <- mean(theta.rand.mean)-qnorm(1-alpha/2)*sd(raw)/sqrt(n) hist(theta.rand.median, xlim=c(-.2,.2), nclass=100, col=3, main="Histogram of Randomly Generated Data for Medians") hist(theta.rand.mean, xlim=c(-.2,.2), nclass=50, col=3, main="Histogram of Randomly Generated Data for Means") abline(v=c(ci.u,ci.l)) [/sourcecode]

Histogram for Bootstrap Median






Binomial Confidence Intervals

This stems from a couple of binomial distribution projects I have been working on recently.  It’s widely known that there are many different flavors of confidence intervals for the binomial distribution.  The reason for this is that there is a coverage problem with these intervals (see Coverage Probability).  A 95% confidence interval isn’t always (actually rarely) 95%.  So I got curious what would happen if I generated random binomial data to find out what percent of the simulated data actually fell within the confidence interval.  For simplicity I used p=0.5.  I wrote up some quick code that generates binomial data for varying sample sizes and runs a comparison for each of the nine confidence interval methods using the binom.confint() function.

nsims <- 10000 maxn <- 500 n <- seq(2,maxn, by=2) my.method <- c("exact", "ac", "asymptotic", "wilson", "prop.test", "bayes", "logit", "cloglog", "probit") my.method <- my.method[sort.list(my.method)] coverage <- matrix(NA, nrow=length(n), ncol=length(my.method)) ci.lower <- ci.upper <- matrix(NA, ncol=length(my.method), nrow=length(n)) for(i in 1:length(n)){ m <- n[i]/2 y <- rbinom(nsims, n[i], m/n[i]) ll <- binom.confint(m,n[i], conf.level=.95, method=my.method)$lower ul <- binom.confint(m,n[i], conf.level=.95, method=my.method)$upper ci.lower[i,] <- ll ci.upper[i,] <- ul for(j in 1:length(my.method)){ sig <- length(y[y/n[i]<=ul[j] &amp;amp;amp;amp; y/n[i]>=ll[j]])
coverage[i,j] <- sig/nsims } } plot(n,NULL, xlim=c(1,nrow(coverage)+1), ylim=c(.83,1), col=1, pch=16, ylab="Percent", xlab="N", main="95% Confidence Intervals for p=.5") points(replicate(ncol(coverage),n),coverage, col=c(1:9), pch=16, cex=.5) abline(h=seq(.93,.97, by=.01), col="#EEEEEE") abline(h=.95, col="#000000", lwd=2) abline(v=seq(2,maxn, by=20), col="#EEEEEE") legend("bottomright", my.method, col=c(1:9), pch=16, title="Interval Types", bg="#FFFFFF") plot(n,NULL, xlim=c(1,100), ylim=c(0,1), col=1, pch=16, ylab="Percent", xlab="N", main="95% Confidence Interval for p=.5") for(k in 1:ncol(coverage)){ lines(n, ci.lower[,k], col=k, lwd=1) lines(n, ci.upper[,k], col=k, lwd=1) } legend("bottomright", my.method, col=c(1:9), ncol=2, lwd=1, title="Interval Types", bg="#FFFFFF") [/sourcecode]

Binomial Confidence Intervals

Binomial 95% Confidence Interval

Brief Thoughts on Data Entry: Excel Should Not to Be Used

Over the years I have worked with many people who do the field work to collect data. They have their process and it works for them. However, their work rarely extends beyond data collection. This often means data is collected in Excel. If an observation meets a specific criterion then they are colored yellow, another criterion and they are blue.  If there was a problem with the observation then it is highlighted in bold. This is a very easy approach. It takes very little preparation and one can sling data around like it’s nobody’s business.

When analyzing the data dealing with this on a small scale usually isn’t too difficult. However, when working on a data warehouse sized dataset this becomes an extremely tedious task (if not impossible).  The data scrubbing and preparation can be extremely difficult.  Take for example a simple yes or no question.  Suppose there are four people are entering data about whether soil samples contain a specific contaminate.  Each person will begin entering their data and one person enters ‘Y’/’N’, another enters ‘Yes’/’No’, yet another enters ‘y’/’n’.  Worse yet the fourth person enters 1 for yes and 2 for no.  All of the sudden there are a total of eight codes when there should really only be two.


So some extra work in advance.  It’s well worth the time spent to prepare for data collection and have the data entered cleanly rather than trying to make sense of messy data after the fact.

There are a number of solutions that will work depending on the needs of those entering the data.  I mention a few here

  • Custom developed web application.  It take some work and potentially additional development costs but it can be completely customized to the project.  I have often taken this approach because it allows me to have a completely customized application for the job.
  • Open source client software.  There are many open source software packages that can be used for data entry.  An example of this is EpiData or EpiInfo.
  • Commercial software like Qualtrics.  It’s a little pricier but these software packages offer a lot of features.
  • Worst case: use SurveyMonkey

Y2K38: Our Own Mayan Calendar…Again

It’s not quite the end of the world as we know it.  We made it through December 21, 2012 unscathed. It’s not going to be the last time we will make it through such a pseudo-calamity.  After all we have built our own end of the world before (e.g. Y2K).

Next up January 19, 2038.  We’ve unknowingly made this the date that time will stop on all (well just 32-bit systems) computers.  Similar to the Y2K problem this is a problem due to the lack of available integers.  On January 19, 2038 3:14:07 (GMT) it is going to be exactly 2,147,483,647 seconds since the beginning of time (Unix epoch is January 1, 1970).  32-bit systems will only support 2,147,483,647 (2^{32-1}-1). It is 32-1 minus because the indexing begins at 0) as its largest integer and since Unix time is measured in seconds that means 2147483647 + 1 will generate an error and fail to calculate the correct date and time.  The most obvious solution will be to upgrade to 64-bit hardware.  This upgrade should buy us some more time and we won’t have to worry about another upgrade for another 9223372036854775808 seconds (Sunday, December 4, 292277026596).  However, since years are often still computed in 32-bit int (even in 64-bit systems) we’re going to loose roughly 290 millions years.  Ultimately, we’ll need to start worrying about an upgrade somewhere in the year 2147483647 (or later if the years were originally configured to start at year 1900).  Let’s hope by then someone can figure out how to deal with really big numbers.

Fortunately for users of R (and some other programs)  this is not necessarily an issue and there are ways around the maximum integer constraint problem. We’re not going to first encounter this problem in 2038. However, it’s possible that it has already exhibited itself in some software that needs to calculate future dates beyond January 19, 2038. An example would be if one needs a predictive time series model extending beyond January 19, 2038.

If you are on a 32-bit system you can give the following lines of code a try as a proof of concept:

as.POSIXlt( as.Date("1970-01-01") )+2147483647

>[1] "2038-01-19 03:14:07 UTC"

as.POSIXlt( as.Date("1970-01-01") )+2147483648

[1] "2038-01-19 03:14:08 UTC"

This can be extended even farther and it can be seen that the year is only limited by four digits.  After the year 9999-12-31 it becomes 0000-01-01

as.POSIXlt( as.Date("9999-12-31") )+86399

>[1] "9999-12-31 23:59:59 UTC"

as.POSIXlt( as.Date("9999-12-31") )+86401

>[1] "0000-01-01 00:00:01 UTC"

In the end it sounds like the system administrator for the Mayan calendar was probably just laid off during the very first financial crises and it was just long overdue for all of the software and hardware updates.

True Significance of a T Statistic

The example is more of a statistical exercise that  shows the true significance and the density curve of simulated random normal data.  The code can be changed to generate data using either a different mean and standard deviation or a different distribution altogether.

This extends the idea of estimating pi by generating random normal data to determine the actual significance level. First, there is a simple function to calculate a pooled t statistic. Then repeat that function N times. The second part of the R code will produce a density graph of the t statistic (normalized to 1). By changing the distribution, mean, standard deviation one can see what happens to the true significance level and the density graph.  Just keep that in mind the next time someone assumes a standard normal distribution for hypothesis testing when it’s really a different distribution altogether and they are set on alpha = 0.05

#Function to calculate the t statistic
m <- length(x) n <- length(y) spooled <- sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2)) tstat <- (mean(x)-mean(y))/(spooled*sqrt(1/m+1/n)) return(tstat) } ### #Set some constants ### alpha <- .05 m <- 15 n <- 15 N <- 20000 n.reject <- 0 ### #Iterate N times ### for (i in 1:N){ x <- rexp(m,1) y <- rexp(n,1) t.stat <- tstatfunc(x,y) if (abs(t.stat)>qt(1-alpha/2,n+m-2)){
n.reject <- n.reject+1 } } true.sig.level <- n.reject/N true.sig.level ### #Function to simulate t-statistic vector ### tsim=function(){ tstatfunc(rnorm(m,mean=0,sd=1), rnorm(n,mean=0,sd=1)) } ### #Set up the values to graph ### tstat.vec <- replicate(10000, tsim()) dens.y <- density(tstat.vec)$y/max(density(tstat.vec)$y) dens.x <- density(tstat.vec)$x ### #Graph the density for each ### plot( NULL, NULL, xlim=c(-5,5),ylim=c(0,1), main="Simulated and True Density" ) lines(dens.x,dens.y, lty=1, col=3, lwd=3) curve( dt(x,df=n+m-2)/max(dt(x,df=n+m-2)),add=TRUE ) [/sourcecode]

Estimating Pi

Recently I’ve been working on some jackknife and bootstrapping problems.  While working on those projects I figured it would be a fun distraction to take the process and estimate pi.  I’m sure this problem has been tackled countless times but I have never bothered to try it using a Monte Carlo approach.  Here is the code that can be used to estimate pi.  The idea is to generate a 2 by 2 box and then draw a circle inside of it.  Then start randomly selecting points inside the box.  This example uses 10,000 iterations using a sample size of 500.  An interesting follow-up exercise would be to change the sample size, mean, and standard deviation to and see what it does to the jackknife variance.

nsims <- 10000; size <- 500; alpha <- .05; theta <- function(x){sd(x)} pi.estimate <- function(nsims,size,alpha){ out <- as.data.frame(NA); call <- match.call(); pi.sim <- matrix(NA, nrow=nsims); for(i in 1:nsims){ x <- runif(size,-1,1); y <- runif(size,-1,1); radius <- -1; x.y <- cbind(x,y); d.origin <- sqrt(x^2+y^2); x.y.d <- cbind(x.y,d.origin); est.pi <- 4*length(x.y.d[,3][x.y.d[,3]<1])/ length(x.y.d[,3]) pi.sim[i] <- est.pi; } pi.mean <- mean(pi.sim); jk <- jackknife(c(pi.sim), theta); jk.me <- jk$jack.se*qnorm(1-alpha/2); pi.mean <- pi.mean; pi.se <- jk$jack.se; jk.me <- jk.me; pi.me.u <- pi.mean+jk.me; pi.me.l <- pi.mean-jk.me; return(list(pi.mean=pi.mean, pi.se=pi.se,pi.me.u=pi.me.u,pi.me.l=pi.me.l, call=call)); } pi.estimate(nsims,size,alpha); [/sourcecode]

Mean Value from Grouped Data

Occasionally, I will get requests from clients to calculate the mean. Most of the time it’s a simple request but from time-to-time the data was originally from grouped data. A common approach is to take the midpoint of each of the groups and just assume that all respondents within that group average out to the midpoint.  The biggest problem I have run into has been trying to decide what to do with two extremes.  What does one do with the age ’65+’ category or the ‘over $100,000’ category.  Since there is really no mid-point one must make a (somewhat arbitrary) determination on what value to use for those categories.  Over the years I have toyed around with various approaches but have found using this Monte Carlo approach has worked the best and seems to be most defensible.  In this example I use a sample of voters who were asked their age and were given categorical response options. The final deliverable will be the average age and standard deviation. 

Group Frequency
18 to <30 25
30 to <45 43
45 to <60 42
60+ 31


I have written some code that I have traditionally used for this type of work but have since found that the LearnBayes package contains the necessary functions as well.  For the purpose of making this code usable in other settings I have combined some of the code I have written with some of the functions from the LearnBayes package.  This way I have been able to modify some of the algorithms and graphs to work under a variety of settings.

I welcome any comment on other approaches people have taken to solve similar problems using grouped, categorical data.

###Note that this function is also available through the VGAM package.
###However, it is displayed here as part of the example.
laplace.dist <- function (logpost, mode, ...) { options(warn = -1) fit <- optim(mode, logpost, gr = NULL, ..., hessian = TRUE, control = list(fnscale = -1)) options(warn = 0) mode <- fit$par hess <- -solve(fit$hessian) l.mode <- length(mode) int <- l.mode/2 * log(2 * pi) + 0.5 * log(det(hess)) + logpost(mode, ...) ret.val <- list(mode = mode, var = hess, int = int, converge = fit$convergence == 0) return(ret.val) } metropolis.randwalk <- function (logpost, proposal, start, m, ...) { pb <- length(start) Mpar <- array(0, c(m, pb)) b <- matrix(t(start)) lb <- logpost(start, ...) a <- chol(proposal$var) scale <- proposal$scale accept <- 0 for (i in 1:m) { bc <- b + scale * t(a) %*% array(rnorm(pb), c(pb, 1)) lbc <- logpost(t(bc), ...) prob <- exp(lbc - lb) if (is.na(prob) == FALSE) { if (runif(1) < prob) { lb <- lbc b <- bc accept <- accept + 1 } } Mpar[i, ] <- b } accept <- accept/m ret.val <- list(par = Mpar, accept = accept) return(ret.val) } DataGroup <- function(theta, data){ cpoints=data$b freq <- data$f nbins <- length(cpoints) m <- theta[1] logsigma <- theta[2] z <- 0*m s <- exp(logsigma) z <- freq[1]*log(pnorm(cpoints[1],m,s)) for(j in 1:(nbins-1)){ z <- z+freq[j+1]*log(pnorm(cpoints[j+1],m,s)-pnorm(cpoints[j],m,s)) } z <- z+freq[nbins+1]*log(1-pnorm(cpoints[nbins],m,s)) return(z) } contour.plot = function (logf, limits, data, ...) { log.f = function(theta, data) { if (is.matrix(theta) == TRUE) { val = matrix(0, c(dim(theta)[1], 1)) for (j in 1:dim(theta)[1]) { val[j] = logf(theta[j,], data) } } else { val = logf(theta, data) } return(val) } ng <- 50 x0 <- seq(limits[1], limits[2], len = ng) y0 <- seq(limits[3], limits[4], len = ng) X <- outer(x0, rep(1, ng)) Y <- outer(rep(1, ng), y0) n2 <- ng^2 Z <- log.f(cbind(X[1:n2], Y[1:n2]), data) Z <- Z - max(Z) Z <- matrix(Z, c(ng, ng)) contour(x0, y0, Z, levels = seq(-6.9, 0, by = 2.3), lwd = 2) } lo <- c(18,30,45,60) hi <- c(30,45,60, Inf) d <- list(int.lo=lo, int.hi=hi, b = c(30,45,60), f=c(25,43,42,31)) y <- c(rep((18+30)/2,25),rep((30+45)/2,43),rep((45+60)/2,42),rep(61,31)) mean.y <- mean(y) log.sd.y <- log(sd(y)) start <- c(mean.y,log.sd.y) fit <- laplace.dist(DataGroup,start,d) fit modal.sds <- sqrt(diag(fit$var)) proposal <- list(var=fit$var,scale=2) fit2 <- metropolis.randwalk(DataGroup,proposal,start,10000,d) fit2$accept post.means <- apply(fit2$par,2,mean) post.sds <- apply(fit2$par,2,sd) cbind(c(fit$mode),modal.sds) ##These should be similar cbind(post.means,post.sds) contour.plot(DataGroup,c(min(post.means[1])-10,max(post.means[1]+10),min(post.means[2]-.5),max(post.means[2]+.5)),d, xlab="mu",ylab="log sigma") points(fit2$par[1:2000,1],fit2$par[1:2000,2]) [/sourcecode]  

Importing Data Into R from Different Sources

I have found that I get data from many different sources.  These sources range from simple .csv files to more complex relational databases, to structure XML or JSON files.  I have compiled the different approaches that one can use to easily access these datasets.

Local Column Delimited Files

This is probably the most common and easiest approach to load data into R.  It simply requires one line to do everything that is needed to set up the data.  Then a couple additional lines to tidy up the dataset.

file <- "c:\\my_folder\\my_file.txt" raw_data <- read.csv(file, sep=",");  ##'sep' can be a number of options including \t for tab delimited names(raw_data) <- c("VAR1","VAR2","RESPONSE1") [/sourcecode] Text File From the Internet

I find this very useful when I need to get datasets from a Web site.  This is particularly useful if I need to rerun the script and the Web site continually updates their data.  This save me from having to download the dataset into a csv file each time I need to run an update.  In this example I use one of my favorite data sources which comes from the National Data Buoy Center.  This example pulls data from a buoy (buoy #44025) off the coast of New Jersey.  Conveniently you can use the same read.csv() function that you would use if read the file from you own computer.  You simply replace the file location with the URL of the data.

file <- "http://www.ndbc.noaa.gov/view_text_file.php?filename=44025h2011.txt.gz&dir=data/historical/stdmet/

raw_data <- read.csv(file, header=T, skip=1) [/sourcecode] Files From Other Software

Often I will have Excel files, SPSS files, or SAS dataset set to me.  Once again I can either export the data as a csv file and then import using the read.csv function.  However, taking that approach every time means that there is an additional step.  By adding unnecessary steps to a process increases the risk that the data might get corrupted due to human error.  Furthermore, if the data is updated from time to time then the data that you downloaded last week may not have the most current data.



file <- "C:\\my_folder\\my_file.sav" raw <- as.data.frame(read.spss(file)) [/sourcecode]   Microsoft Excel


file <- "C:\\my_folder\\my_file.xlsx" raw_wb <- loadWorkbook(file, create=F) raw <- as.data.frame( readWorksheet(raw_wb, sheet='Sheet1') ) [/sourcecode] Data From Relational Databases

There is the RMySQL library which is very useful.  However, I have generally been in the habit of using the RODBC library.  The reason for this is that I will often jump between databases (e.g. Oracle, MSSQL, MySQL).  By using the RODBC library I can keep all of my connections in one location and use the same functions regardless of the databases.  This example below will work on any standard SQL database.  You just need to make sure you set up an ODBC connection call (in this example) MY_DATABASE.

channel <- odbcConnect("MY_DATABASE", uid="username", pwd="password") raw <- sqlQuery(channel, "SELECT * FROM Table1"); [/sourcecode] Data from Non-Relational Databases

R has the capability to pull data from non-relational databases.  These include Hadoop (rhbase), Cassandra (RCassandra), MongoDB (rmongodb).  I personally have not used RCassandra but here is the documentation.  The example here uses MongoDB using an example provided by MongoDB.

MyMongodb <- "test" ns <- "articles" mongo <- mongo.create(db=MyMmongodb) list.d <- mongo.bson.from.list(list( "_id"="wes", name=list(first="Wesley", last=""), sex="M", age=40, value=c("7", "5","8","2") )) mongo.insert(mongo, "test.MyPeople", list.d) list.d2 <- mongo.bson.from.list(list( "_id"="Article1", when=mongo.timestamp.create(strptime("2012-10-01 01:30:00", "%Y-%m-%d %H:%M:%s"), increment=1), author="wes", title="Importing Data Into R from Different Sources", text="Provides R code on how to import data into R from different sources.", tags=c("R", "MongoDB", "Cassandra","MySQL","Excel","SPSS"), comments=list( list( who="wes", when=mongo.timestamp.create(strptime("2012-10-01 01:35:00", "%Y-%m-%d %H:%M:%s"), increment=1), comment="I'm open to comments or suggestions on other data sources to include." ) ) ) ) list.d2 mongo.insert(mongo, "test.MyArticles", list.d2) res <- mongo.find(mongo, "test.MyArticles", query=list(author="wes"), fields=list(title=1L)) out <- NULL while (mongo.cursor.next(res)){ out <- c(out, list(mongo.bson.to.list(mongo.cursor.value(res)))) } out [/sourcecode]   Copied and Pasted Text

raw_txt <- " STATE READY TOTAL AL 36 36 AK 5 8 AZ 15 16 AR 21 27 CA 43 43 CT 56 68 DE 22 22 DC 7 7 FL 130 132 GA 53 54 HI 11 16 ID 11 11 IL 24 24 IN 65 77 IA 125 130 KS 22 26 KY 34 34 LA 27 34 ME 94 96 MD 25 26 MA 82 92 Mi 119 126 MN 69 80 MS 43 43 MO 74 82 MT 34 40 NE 9 13 NV 64 64 NM 120 137 NY 60 62 NJ 29 33 NH 44 45 ND 116 135 NC 29 33 OH 114 130 OK 19 22 PA 101 131 RI 32 32 Sc 35 45 SD 25 25 TN 30 34 TX 14 25 UT 11 11 VT 33 49 VA 108 124 WV 27 36 WI 122 125 WY 12 14 " raw_data <- textConnection(raw_txt) raw <- read.table(raw_data, header=TRUE, comment.char="#", sep="") close.connection(raw_data) raw ###Or the following line can be used raw <- read.table(header=TRUE, text=raw_txt) [/sourcecode]  Structured Local or Remote Data

One feature that I find quite useful is when there is a Web site with a table that I want to analyze.  R has the capability to read through the HTML and import the table that you want.  This example uses the XML library and pulls down the population by country in the world.  Once the data is brought into R it may need to be cleaned up a bit removing unnecessary columns and other stray characters.  The examples here use remote data from other Web sites.  If the data is available as a local file then it can be imported in a similar fashion just using filename rather than the URL.


url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population" population = readHTMLTable(url, which=3) population [/sourcecode] Or you can use the feature to simple grab XML content.  I have found this particularly useful when I need geospatial data and need to get the latitude/longitude of a location (this example uses Open Street Maps API provided by MapQuest).  This example obtains the results for the coordinates of the United States White House. [sourcecode language="css"] url <- "http://open.mapquestapi.com/geocoding/v1/address?location=1600%20Pennsylvania%20Ave,%20Washington,%20DC&outFormat=xml" mygeo <- xmlToDataFrame(url) mygeo$result [/sourcecode] An alternate approach is to use a JSON format.  I generally find that JSON is a better format and it can be readily used in most programming languages. [sourcecode language="css"] library(rjson) url <- "http://open.mapquestapi.com/geocoding/v1/address?location=1600%20Pennsylvania%20Ave,%20Washington,%20DC&outFormat=json" raw_json <- scan(url, "", sep="\n") mygeo <- fromJSON(raw_json) [/sourcecode]

Plotting Likert Scales

Graphs can provide an excellent way to emphasize a point and to quickly and efficiently show important information. Sadly, poor graphs can be a good way to waste space in an article, take up time in a presentation, and waste a lot of ink all while providing little to no information.

Excel has made it possible to make all sort of graphs. However, just because the graph looks like a spider web or like something you can eat for dessert doesn’t mean you should use it.

This discussion here will show five options on how to graph Likert scale data, will show best/common practice for graphing, and will provide the R code for each graph. These graphing approaches are based on a list that I have compiled that the different people that I have worked with have used to graph and interpret Likert scales within their organization.

Likert scales usually have 5 or 7 response options. However, the exact number and whether there should be an odd or even number of responses is a topic for another psycometric discussion. A typical Likert scale is:

1 Strongly Agree
2 Agree
3 Neutral
4 Disagree
5 Strongly Disagree

For example purposes I generated some random discrete data that is formatted as a Likert scale.   I have created three examples to show the extremities of Likert scale responses.

probs <- cbind(c(.4,.2/3,.2/3,.2/3,.4),c(.1/4,.1/4,.9,.1/4,.1/4),c(.2,.2,.2,.2,.2))
my.n <- 100
my.len <- ncol(probs)*my.n
raw <- matrix(NA,nrow=my.len,ncol=2)
raw <- NULL
for(i in 1:ncol(probs)){
raw <- rbind(raw, cbind(i,rdiscrete(my.n,probs=probs&#91;,i&#93;,values=1:5)))

r <- data.frame( cbind(
as.numeric( row.names( tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, mean) ) ),
tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, mean),
tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, mean) + sqrt( tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, var)/tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, length) ) * qnorm(1-.05/2,0,1),
tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, mean) - sqrt( tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, var)/tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, length) ) * qnorm(1-.05/2,0,1)
names(r) <- c("group","mean","ll","ul")

gbar <- tapply(raw&#91;,2&#93;, list(raw&#91;,2&#93;, raw&#91;,1&#93;), length)

sgbar <- data.frame( cbind(c(1:max(unique(raw&#91;,1&#93;))),t(gbar)) )

sgbar.likert<- sgbar&#91;,2:6&#93;


<strong>Diverging Stacked Bar Chart</strong>

Diverging stacked bar charts are often the best choice when visualizing Likert scale data.  There are various ways to produce these graphs but I have found the easiest approach uses the <em>HH </em>package.  There are many graphs that can be produced using this package. I have provided three approaches here.

sgbar.likert<- sgbar&#91;,2:6&#93;
main='Example Diverging Stacked Bar Chart for Likert Scale',
sub="Likert Scale")

likert(sgbar.likert, horizontal=FALSE,
main="Example Diverging Stacked Bar Chart for Likert Scale",
auto.key=list(space="right", columns=1,
reverse=TRUE, padding.text=2),
sub="Likert Scale")

auto.key=list(between=1, between.columns=2),
main="Example Diverging Stacked Bar Chart for Likert Scale",
sub="Likert Scale")

<p style="text-align: center;"><a href="http://statistical-research.com/wp-content/uploads/2012/12/divergingstacked_h.likert.png"><img class="aligncenter  wp-image-859" title="Diverging Stacked Bar Chart for Likert Scale" src="http://statistical-research.com/wp-content/uploads/2012/12/divergingstacked_h.likert.png" alt="" width="452" height="256" /></a></p>
<p style="text-align: center;"><a href="http://statistical-research.com/wp-content/uploads/2012/12/divergingstacked_v.likert.png"><img class="aligncenter  wp-image-860" title="Diverging Stacked Bar Chart (Vertical) for Likert Scale" src="http://statistical-research.com/wp-content/uploads/2012/12/divergingstacked_v.likert.png" alt="" width="502" height="284" /></a></p>
<p style="text-align: center;"><a href="http://statistical-research.com/wp-content/uploads/2012/12/divergingstacked_h_blue.likert.png"><img class="aligncenter  wp-image-861" title="Diverging Stacked Bar Chart for Likert Scale Using Blue Tones" src="http://statistical-research.com/wp-content/uploads/2012/12/divergingstacked_h_blue.likert.png" alt="" width="452" height="256" /></a></p>
<strong>Mean Value</strong>

Often researchers will simply take each response option and interpret it as a real number.  Using this approach makes it very convenient to calculate the mean value and standard deviation (and confidence intervals). This is particularly useful when working with non-analytical clients.  However, this is a controversial issue. Taking this approach requires a lot of statistical assumptions that may not be correct. For starters the response options really need to be equidistant from each other. For example, is the distance from <em>Strongly Agree</em> to <em>Agree</em> the same distance from <em>Disagree</em> to <em>Strongly Disagree</em>? This may be true for one question but it might not be true for all questions on a questionnaire. Different questions and question wording are quite likely going to have different distributions. Furthermore, confidence intervals require normality assumptions which may also be incorrect.

What makes matters worse is that if there are 50 respondents and 25 of them mark <em>Strongly Disagree</em> and 25 of them mark <em>Strongly Agree</em> then the mean will be 3 implying that, on average, the results are neutral.  But that clearly does not adequately describe the data. Ultimately, it is up to the statistician to work with the client to use an appropriate method that appropriately conveys the message and that both parties can agree upon.

plot.new(); par(mfrow=c(1,1));

plot(r$group,r$mean, type="o", cex=1, col="blue", pch=16,
ylim=c(1,5), lwd=2,
, ylab="Mean Value", xlab="Group"
, main=paste("Likert Scale Mean Values Example")
, cex.sub=.60
, xaxt = "n", yaxt = "n");
axis(1, at=(1:3), tcl = -0.7, lty = 1, lwd = 0.8, labels=TRUE)
axis(2, at=(1:5), labels=TRUE, tcl = -0.7, lty = 1)
abline(h=c(1:5), col="grey")
lines(r$group,r$ll, col='red', lwd=2)
lines(r$group,r$ul, col='red', lwd=2)

legend("topright", c("Mean","Confidence Interval"),
col=c('blue','red'), title="Legend",
lty=1, lwd=2,
inset = .05)

Pies and Multiple Pies

A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between charts (…). Given their low density and failure to order numbers along a visual dimension, pie charts should never be used. — Edward Tufte

Pie charts are notoriously difficult to convey the information that was intended.  As far as pie charts go I don’t ever use them.  There are far better ways to visualize data.  However, I have heard some people give a reason for using them that are somewhat justified and generally are based on the ‘eye candy’ argument.  But as far as creating a graph that both provides information and looks good a 3-D pie chart is probably not the best choice.  I debated whether I should even include the R code for the example but to provide full disclosure here’s the code.

my.table <- table(raw&#91;,2&#93;&#91;raw&#91;,1&#93;==1&#93;)

names(my.table) <- c("Strongly Agree","Agree","Neutral","Disagree","Strongly Disagree")
labl <- paste(names(my.table), "\n", my.table, sep="")
pie(my.table, labels=labl, main="Example Pie Chart of Likert Scale")

<p style="text-align: center;"><a href="http://statistical-research.com/wp-content/uploads/2012/12/pie.likert.png"><img class="aligncenter  wp-image-852" title="Pie Chart for Likert Scale" src="http://statistical-research.com/wp-content/uploads/2012/12/pie.likert.png" alt="" width="420" height="262" /></a></p>

num.groups <- length(unique(raw&#91;,1&#93;))
for(j in 1:num.groups){
my.table <- table(raw&#91;,2&#93;&#91;raw&#91;,1&#93;==j&#93;)
pie(my.table, labels=labl, main=paste("Example Pie Chart of\nLikert Scale Group ", j))

<p style="text-align: center;"> <a href="http://statistical-research.com/wp-content/uploads/2012/12/piemulti.likert1.png"><img class="aligncenter  wp-image-858" title="Multiple Pie Charts for Likert Scale" src="http://statistical-research.com/wp-content/uploads/2012/12/piemulti.likert1.png" alt="" width="502" height="284" /></a></p>

slices <- my.table
names(my.table) <- c("Strongly Agree","Agree","Neutral","Disagree","Strongly Disagree")
labl <- paste(names(my.table), "\n", my.table, sep="")
main="3D Pie Chart Example")

<p style="text-align: center;"><a href="http://statistical-research.com/wp-content/uploads/2012/12/pie3d.likert.png"><img class="aligncenter  wp-image-855" title="3D Pie Chart for Likert Scale" src="http://statistical-research.com/wp-content/uploads/2012/12/pie3d.likert.png" alt="" width="420" height="300" /></a></p>
<strong>Grouped Bar Chart</strong>

This is a nice approach when wanting to look at each group and highlight any particular Likert response option.  Here it is easy to see that in Group 2 the <em>Neutral </em>option is by far the most common response.

barplot(gbar, beside=T, col=cm.colors(5), main="Example Bar Chart of Counts by Group",xlab="Group",ylab="Frequency")

legend("topright", names(my.table),
col=cm.colors(5), title="Legend",
lty=1, lwd=15,
inset = .1)

Divided Bar Chart

This isn’t a bad approach and quite similar to the diverging stacked bar chart.  This approach shows the stacked percent for each category.


names(sgbar) <- c("group","Strongly Agree","Agree","Neutral","Disagree","Strongly Disagree") mx <- melt(sgbar, id.vars=1) names(mx) <- c("Group","Category","Percent") ggplot(mx, aes(x=Group, y=Percent, fill=Category)) + geom_bar(stat="identity") [/sourcecode]