Brief Thoughts on Data Entry: Excel Should Not to Be Used

Over the years I have worked with many people who do the field work to collect data. They have their process and it works for them. However, their work rarely extends beyond data collection. This often means data is collected in Excel. If an observation meets a specific criterion then they are colored yellow, another criterion and they are blue.  If there was a problem with the observation then it is highlighted in bold. This is a very easy approach. It takes very little preparation and one can sling data around like it’s nobody’s business.

When analyzing the data dealing with this on a small scale usually isn’t too difficult. However, when working on a data warehouse sized dataset this becomes an extremely tedious task (if not impossible).  The data scrubbing and preparation can be extremely difficult.  Take for example a simple yes or no question.  Suppose there are four people are entering data about whether soil samples contain a specific contaminate.  Each person will begin entering their data and one person enters ‘Y’/’N’, another enters ‘Yes’/’No’, yet another enters ‘y’/’n’.  Worse yet the fourth person enters 1 for yes and 2 for no.  All of the sudden there are a total of eight codes when there should really only be two.


So some extra work in advance.  It’s well worth the time spent to prepare for data collection and have the data entered cleanly rather than trying to make sense of messy data after the fact.

There are a number of solutions that will work depending on the needs of those entering the data.  I mention a few here

  • Custom developed web application.  It take some work and potentially additional development costs but it can be completely customized to the project.  I have often taken this approach because it allows me to have a completely customized application for the job.
  • Open source client software.  There are many open source software packages that can be used for data entry.  An example of this is EpiData or EpiInfo.
  • Commercial software like Qualtrics.  It’s a little pricier but these software packages offer a lot of features.
  • Worst case: use SurveyMonkey

Y2K38: Our Own Mayan Calendar…Again

It’s not quite the end of the world as we know it.  We made it through December 21, 2012 unscathed. It’s not going to be the last time we will make it through such a pseudo-calamity.  After all we have built our own end of the world before (e.g. Y2K).

Next up January 19, 2038.  We’ve unknowingly made this the date that time will stop on all (well just 32-bit systems) computers.  Similar to the Y2K problem this is a problem due to the lack of available integers.  On January 19, 2038 3:14:07 (GMT) it is going to be exactly 2,147,483,647 seconds since the beginning of time (Unix epoch is January 1, 1970).  32-bit systems will only support 2,147,483,647 (2^{32-1}-1). It is 32-1 minus because the indexing begins at 0) as its largest integer and since Unix time is measured in seconds that means 2147483647 + 1 will generate an error and fail to calculate the correct date and time.  The most obvious solution will be to upgrade to 64-bit hardware.  This upgrade should buy us some more time and we won’t have to worry about another upgrade for another 9223372036854775808 seconds (Sunday, December 4, 292277026596).  However, since years are often still computed in 32-bit int (even in 64-bit systems) we’re going to loose roughly 290 millions years.  Ultimately, we’ll need to start worrying about an upgrade somewhere in the year 2147483647 (or later if the years were originally configured to start at year 1900).  Let’s hope by then someone can figure out how to deal with really big numbers.

Fortunately for users of R (and some other programs)  this is not necessarily an issue and there are ways around the maximum integer constraint problem. We’re not going to first encounter this problem in 2038. However, it’s possible that it has already exhibited itself in some software that needs to calculate future dates beyond January 19, 2038. An example would be if one needs a predictive time series model extending beyond January 19, 2038.

If you are on a 32-bit system you can give the following lines of code a try as a proof of concept:

as.POSIXlt( as.Date("1970-01-01") )+2147483647

>[1] "2038-01-19 03:14:07 UTC"

as.POSIXlt( as.Date("1970-01-01") )+2147483648

[1] "2038-01-19 03:14:08 UTC"

This can be extended even farther and it can be seen that the year is only limited by four digits.  After the year 9999-12-31 it becomes 0000-01-01

as.POSIXlt( as.Date("9999-12-31") )+86399

>[1] "9999-12-31 23:59:59 UTC"

as.POSIXlt( as.Date("9999-12-31") )+86401

>[1] "0000-01-01 00:00:01 UTC"

In the end it sounds like the system administrator for the Mayan calendar was probably just laid off during the very first financial crises and it was just long overdue for all of the software and hardware updates.

True Significance of a T Statistic

The example is more of a statistical exercise that  shows the true significance and the density curve of simulated random normal data.  The code can be changed to generate data using either a different mean and standard deviation or a different distribution altogether.

This extends the idea of estimating pi by generating random normal data to determine the actual significance level. First, there is a simple function to calculate a pooled t statistic. Then repeat that function N times. The second part of the R code will produce a density graph of the t statistic (normalized to 1). By changing the distribution, mean, standard deviation one can see what happens to the true significance level and the density graph.  Just keep that in mind the next time someone assumes a standard normal distribution for hypothesis testing when it’s really a different distribution altogether and they are set on alpha = 0.05

#Function to calculate the t statistic
m <- length(x) n <- length(y) spooled <- sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2)) tstat <- (mean(x)-mean(y))/(spooled*sqrt(1/m+1/n)) return(tstat) } ### #Set some constants ### alpha <- .05 m <- 15 n <- 15 N <- 20000 n.reject <- 0 ### #Iterate N times ### for (i in 1:N){ x <- rexp(m,1) y <- rexp(n,1) t.stat <- tstatfunc(x,y) if (abs(t.stat)>qt(1-alpha/2,n+m-2)){
n.reject <- n.reject+1 } } true.sig.level <- n.reject/N true.sig.level ### #Function to simulate t-statistic vector ### tsim=function(){ tstatfunc(rnorm(m,mean=0,sd=1), rnorm(n,mean=0,sd=1)) } ### #Set up the values to graph ### tstat.vec <- replicate(10000, tsim()) dens.y <- density(tstat.vec)$y/max(density(tstat.vec)$y) dens.x <- density(tstat.vec)$x ### #Graph the density for each ### plot( NULL, NULL, xlim=c(-5,5),ylim=c(0,1), main="Simulated and True Density" ) lines(dens.x,dens.y, lty=1, col=3, lwd=3) curve( dt(x,df=n+m-2)/max(dt(x,df=n+m-2)),add=TRUE ) [/sourcecode]

Estimating Pi

Recently I’ve been working on some jackknife and bootstrapping problems.  While working on those projects I figured it would be a fun distraction to take the process and estimate pi.  I’m sure this problem has been tackled countless times but I have never bothered to try it using a Monte Carlo approach.  Here is the code that can be used to estimate pi.  The idea is to generate a 2 by 2 box and then draw a circle inside of it.  Then start randomly selecting points inside the box.  This example uses 10,000 iterations using a sample size of 500.  An interesting follow-up exercise would be to change the sample size, mean, and standard deviation to and see what it does to the jackknife variance.

nsims <- 10000; size <- 500; alpha <- .05; theta <- function(x){sd(x)} pi.estimate <- function(nsims,size,alpha){ out <-; call <-; pi.sim <- matrix(NA, nrow=nsims); for(i in 1:nsims){ x <- runif(size,-1,1); y <- runif(size,-1,1); radius <- -1; x.y <- cbind(x,y); d.origin <- sqrt(x^2+y^2); x.y.d <- cbind(x.y,d.origin); est.pi <- 4*length(x.y.d[,3][x.y.d[,3]<1])/ length(x.y.d[,3]) pi.sim[i] <- est.pi; } pi.mean <- mean(pi.sim); jk <- jackknife(c(pi.sim), theta); <- jk$*qnorm(1-alpha/2); pi.mean <- pi.mean; <- jk$; <-; <-; <-; return(list(pi.mean=pi.mean,,,, call=call)); } pi.estimate(nsims,size,alpha); [/sourcecode]

Mean Value from Grouped Data

Occasionally, I will get requests from clients to calculate the mean. Most of the time it’s a simple request but from time-to-time the data was originally from grouped data. A common approach is to take the midpoint of each of the groups and just assume that all respondents within that group average out to the midpoint.  The biggest problem I have run into has been trying to decide what to do with two extremes.  What does one do with the age ’65+’ category or the ‘over $100,000’ category.  Since there is really no mid-point one must make a (somewhat arbitrary) determination on what value to use for those categories.  Over the years I have toyed around with various approaches but have found using this Monte Carlo approach has worked the best and seems to be most defensible.  In this example I use a sample of voters who were asked their age and were given categorical response options. The final deliverable will be the average age and standard deviation. 

Group Frequency
18 to <30 25
30 to <45 43
45 to <60 42
60+ 31


I have written some code that I have traditionally used for this type of work but have since found that the LearnBayes package contains the necessary functions as well.  For the purpose of making this code usable in other settings I have combined some of the code I have written with some of the functions from the LearnBayes package.  This way I have been able to modify some of the algorithms and graphs to work under a variety of settings.

I welcome any comment on other approaches people have taken to solve similar problems using grouped, categorical data.

###Note that this function is also available through the VGAM package.
###However, it is displayed here as part of the example.
laplace.dist <- function (logpost, mode, ...) { options(warn = -1) fit <- optim(mode, logpost, gr = NULL, ..., hessian = TRUE, control = list(fnscale = -1)) options(warn = 0) mode <- fit$par hess <- -solve(fit$hessian) l.mode <- length(mode) int <- l.mode/2 * log(2 * pi) + 0.5 * log(det(hess)) + logpost(mode, ...) ret.val <- list(mode = mode, var = hess, int = int, converge = fit$convergence == 0) return(ret.val) } metropolis.randwalk <- function (logpost, proposal, start, m, ...) { pb <- length(start) Mpar <- array(0, c(m, pb)) b <- matrix(t(start)) lb <- logpost(start, ...) a <- chol(proposal$var) scale <- proposal$scale accept <- 0 for (i in 1:m) { bc <- b + scale * t(a) %*% array(rnorm(pb), c(pb, 1)) lbc <- logpost(t(bc), ...) prob <- exp(lbc - lb) if ( == FALSE) { if (runif(1) < prob) { lb <- lbc b <- bc accept <- accept + 1 } } Mpar[i, ] <- b } accept <- accept/m ret.val <- list(par = Mpar, accept = accept) return(ret.val) } DataGroup <- function(theta, data){ cpoints=data$b freq <- data$f nbins <- length(cpoints) m <- theta[1] logsigma <- theta[2] z <- 0*m s <- exp(logsigma) z <- freq[1]*log(pnorm(cpoints[1],m,s)) for(j in 1:(nbins-1)){ z <- z+freq[j+1]*log(pnorm(cpoints[j+1],m,s)-pnorm(cpoints[j],m,s)) } z <- z+freq[nbins+1]*log(1-pnorm(cpoints[nbins],m,s)) return(z) } contour.plot = function (logf, limits, data, ...) { log.f = function(theta, data) { if (is.matrix(theta) == TRUE) { val = matrix(0, c(dim(theta)[1], 1)) for (j in 1:dim(theta)[1]) { val[j] = logf(theta[j,], data) } } else { val = logf(theta, data) } return(val) } ng <- 50 x0 <- seq(limits[1], limits[2], len = ng) y0 <- seq(limits[3], limits[4], len = ng) X <- outer(x0, rep(1, ng)) Y <- outer(rep(1, ng), y0) n2 <- ng^2 Z <- log.f(cbind(X[1:n2], Y[1:n2]), data) Z <- Z - max(Z) Z <- matrix(Z, c(ng, ng)) contour(x0, y0, Z, levels = seq(-6.9, 0, by = 2.3), lwd = 2) } lo <- c(18,30,45,60) hi <- c(30,45,60, Inf) d <- list(int.lo=lo, int.hi=hi, b = c(30,45,60), f=c(25,43,42,31)) y <- c(rep((18+30)/2,25),rep((30+45)/2,43),rep((45+60)/2,42),rep(61,31)) mean.y <- mean(y) <- log(sd(y)) start <- c(mean.y, fit <- laplace.dist(DataGroup,start,d) fit modal.sds <- sqrt(diag(fit$var)) proposal <- list(var=fit$var,scale=2) fit2 <- metropolis.randwalk(DataGroup,proposal,start,10000,d) fit2$accept post.means <- apply(fit2$par,2,mean) post.sds <- apply(fit2$par,2,sd) cbind(c(fit$mode),modal.sds) ##These should be similar cbind(post.means,post.sds) contour.plot(DataGroup,c(min(post.means[1])-10,max(post.means[1]+10),min(post.means[2]-.5),max(post.means[2]+.5)),d, xlab="mu",ylab="log sigma") points(fit2$par[1:2000,1],fit2$par[1:2000,2]) [/sourcecode]  

Importing Data Into R from Different Sources

I have found that I get data from many different sources.  These sources range from simple .csv files to more complex relational databases, to structure XML or JSON files.  I have compiled the different approaches that one can use to easily access these datasets.

Local Column Delimited Files

This is probably the most common and easiest approach to load data into R.  It simply requires one line to do everything that is needed to set up the data.  Then a couple additional lines to tidy up the dataset.

file <- "c:\\my_folder\\my_file.txt" raw_data <- read.csv(file, sep=",");  ##'sep' can be a number of options including \t for tab delimited names(raw_data) <- c("VAR1","VAR2","RESPONSE1") [/sourcecode] Text File From the Internet

I find this very useful when I need to get datasets from a Web site.  This is particularly useful if I need to rerun the script and the Web site continually updates their data.  This save me from having to download the dataset into a csv file each time I need to run an update.  In this example I use one of my favorite data sources which comes from the National Data Buoy Center.  This example pulls data from a buoy (buoy #44025) off the coast of New Jersey.  Conveniently you can use the same read.csv() function that you would use if read the file from you own computer.  You simply replace the file location with the URL of the data.

file <- "

raw_data <- read.csv(file, header=T, skip=1) [/sourcecode] Files From Other Software

Often I will have Excel files, SPSS files, or SAS dataset set to me.  Once again I can either export the data as a csv file and then import using the read.csv function.  However, taking that approach every time means that there is an additional step.  By adding unnecessary steps to a process increases the risk that the data might get corrupted due to human error.  Furthermore, if the data is updated from time to time then the data that you downloaded last week may not have the most current data.



file <- "C:\\my_folder\\my_file.sav" raw <- [/sourcecode]   Microsoft Excel


file <- "C:\\my_folder\\my_file.xlsx" raw_wb <- loadWorkbook(file, create=F) raw <- readWorksheet(raw_wb, sheet='Sheet1') ) [/sourcecode] Data From Relational Databases

There is the RMySQL library which is very useful.  However, I have generally been in the habit of using the RODBC library.  The reason for this is that I will often jump between databases (e.g. Oracle, MSSQL, MySQL).  By using the RODBC library I can keep all of my connections in one location and use the same functions regardless of the databases.  This example below will work on any standard SQL database.  You just need to make sure you set up an ODBC connection call (in this example) MY_DATABASE.

channel <- odbcConnect("MY_DATABASE", uid="username", pwd="password") raw <- sqlQuery(channel, "SELECT * FROM Table1"); [/sourcecode] Data from Non-Relational Databases

R has the capability to pull data from non-relational databases.  These include Hadoop (rhbase), Cassandra (RCassandra), MongoDB (rmongodb).  I personally have not used RCassandra but here is the documentation.  The example here uses MongoDB using an example provided by MongoDB.

MyMongodb <- "test" ns <- "articles" mongo <- mongo.create(db=MyMmongodb) list.d <- mongo.bson.from.list(list( "_id"="wes", name=list(first="Wesley", last=""), sex="M", age=40, value=c("7", "5","8","2") )) mongo.insert(mongo, "test.MyPeople", list.d) list.d2 <- mongo.bson.from.list(list( "_id"="Article1", when=mongo.timestamp.create(strptime("2012-10-01 01:30:00", "%Y-%m-%d %H:%M:%s"), increment=1), author="wes", title="Importing Data Into R from Different Sources", text="Provides R code on how to import data into R from different sources.", tags=c("R", "MongoDB", "Cassandra","MySQL","Excel","SPSS"), comments=list( list( who="wes", when=mongo.timestamp.create(strptime("2012-10-01 01:35:00", "%Y-%m-%d %H:%M:%s"), increment=1), comment="I'm open to comments or suggestions on other data sources to include." ) ) ) ) list.d2 mongo.insert(mongo, "test.MyArticles", list.d2) res <- mongo.find(mongo, "test.MyArticles", query=list(author="wes"), fields=list(title=1L)) out <- NULL while ({ out <- c(out, list( } out [/sourcecode]   Copied and Pasted Text

raw_txt <- " STATE READY TOTAL AL 36 36 AK 5 8 AZ 15 16 AR 21 27 CA 43 43 CT 56 68 DE 22 22 DC 7 7 FL 130 132 GA 53 54 HI 11 16 ID 11 11 IL 24 24 IN 65 77 IA 125 130 KS 22 26 KY 34 34 LA 27 34 ME 94 96 MD 25 26 MA 82 92 Mi 119 126 MN 69 80 MS 43 43 MO 74 82 MT 34 40 NE 9 13 NV 64 64 NM 120 137 NY 60 62 NJ 29 33 NH 44 45 ND 116 135 NC 29 33 OH 114 130 OK 19 22 PA 101 131 RI 32 32 Sc 35 45 SD 25 25 TN 30 34 TX 14 25 UT 11 11 VT 33 49 VA 108 124 WV 27 36 WI 122 125 WY 12 14 " raw_data <- textConnection(raw_txt) raw <- read.table(raw_data, header=TRUE, comment.char="#", sep="") close.connection(raw_data) raw ###Or the following line can be used raw <- read.table(header=TRUE, text=raw_txt) [/sourcecode]  Structured Local or Remote Data

One feature that I find quite useful is when there is a Web site with a table that I want to analyze.  R has the capability to read through the HTML and import the table that you want.  This example uses the XML library and pulls down the population by country in the world.  Once the data is brought into R it may need to be cleaned up a bit removing unnecessary columns and other stray characters.  The examples here use remote data from other Web sites.  If the data is available as a local file then it can be imported in a similar fashion just using filename rather than the URL.


url <- "" population = readHTMLTable(url, which=3) population [/sourcecode] Or you can use the feature to simple grab XML content.  I have found this particularly useful when I need geospatial data and need to get the latitude/longitude of a location (this example uses Open Street Maps API provided by MapQuest).  This example obtains the results for the coordinates of the United States White House. [sourcecode language="css"] url <- ",%20Washington,%20DC&outFormat=xml" mygeo <- xmlToDataFrame(url) mygeo$result [/sourcecode] An alternate approach is to use a JSON format.  I generally find that JSON is a better format and it can be readily used in most programming languages. [sourcecode language="css"] library(rjson) url <- ",%20Washington,%20DC&outFormat=json" raw_json <- scan(url, "", sep="\n") mygeo <- fromJSON(raw_json) [/sourcecode]

Plotting Likert Scales

Graphs can provide an excellent way to emphasize a point and to quickly and efficiently show important information. Sadly, poor graphs can be a good way to waste space in an article, take up time in a presentation, and waste a lot of ink all while providing little to no information.

Excel has made it possible to make all sort of graphs. However, just because the graph looks like a spider web or like something you can eat for dessert doesn’t mean you should use it.

This discussion here will show five options on how to graph Likert scale data, will show best/common practice for graphing, and will provide the R code for each graph. These graphing approaches are based on a list that I have compiled that the different people that I have worked with have used to graph and interpret Likert scales within their organization.

Likert scales usually have 5 or 7 response options. However, the exact number and whether there should be an odd or even number of responses is a topic for another psycometric discussion. A typical Likert scale is:

1 Strongly Agree
2 Agree
3 Neutral
4 Disagree
5 Strongly Disagree

For example purposes I generated some random discrete data that is formatted as a Likert scale.   I have created three examples to show the extremities of Likert scale responses.

probs <- cbind(c(.4,.2/3,.2/3,.2/3,.4),c(.1/4,.1/4,.9,.1/4,.1/4),c(.2,.2,.2,.2,.2))
my.n <- 100
my.len <- ncol(probs)*my.n
raw <- matrix(NA,nrow=my.len,ncol=2)
raw <- NULL
for(i in 1:ncol(probs)){
raw <- rbind(raw, cbind(i,rdiscrete(my.n,probs=probs&#91;,i&#93;,values=1:5)))

r <- data.frame( cbind(
as.numeric( row.names( tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, mean) ) ),
tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, mean),
tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, mean) + sqrt( tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, var)/tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, length) ) * qnorm(1-.05/2,0,1),
tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, mean) - sqrt( tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, var)/tapply(raw&#91;,2&#93;, raw&#91;,1&#93;, length) ) * qnorm(1-.05/2,0,1)
names(r) <- c("group","mean","ll","ul")

gbar <- tapply(raw&#91;,2&#93;, list(raw&#91;,2&#93;, raw&#91;,1&#93;), length)

sgbar <- data.frame( cbind(c(1:max(unique(raw&#91;,1&#93;))),t(gbar)) )

sgbar.likert<- sgbar&#91;,2:6&#93;


<strong>Diverging Stacked Bar Chart</strong>

Diverging stacked bar charts are often the best choice when visualizing Likert scale data.  There are various ways to produce these graphs but I have found the easiest approach uses the <em>HH </em>package.  There are many graphs that can be produced using this package. I have provided three approaches here.

sgbar.likert<- sgbar&#91;,2:6&#93;
main='Example Diverging Stacked Bar Chart for Likert Scale',
sub="Likert Scale")

likert(sgbar.likert, horizontal=FALSE,
main="Example Diverging Stacked Bar Chart for Likert Scale",
auto.key=list(space="right", columns=1,
reverse=TRUE, padding.text=2),
sub="Likert Scale")

auto.key=list(between=1, between.columns=2),
main="Example Diverging Stacked Bar Chart for Likert Scale",
sub="Likert Scale")

<p style="text-align: center;"><a href=""><img class="aligncenter  wp-image-859" title="Diverging Stacked Bar Chart for Likert Scale" src="" alt="" width="452" height="256" /></a></p>
<p style="text-align: center;"><a href=""><img class="aligncenter  wp-image-860" title="Diverging Stacked Bar Chart (Vertical) for Likert Scale" src="" alt="" width="502" height="284" /></a></p>
<p style="text-align: center;"><a href=""><img class="aligncenter  wp-image-861" title="Diverging Stacked Bar Chart for Likert Scale Using Blue Tones" src="" alt="" width="452" height="256" /></a></p>
<strong>Mean Value</strong>

Often researchers will simply take each response option and interpret it as a real number.  Using this approach makes it very convenient to calculate the mean value and standard deviation (and confidence intervals). This is particularly useful when working with non-analytical clients.  However, this is a controversial issue. Taking this approach requires a lot of statistical assumptions that may not be correct. For starters the response options really need to be equidistant from each other. For example, is the distance from <em>Strongly Agree</em> to <em>Agree</em> the same distance from <em>Disagree</em> to <em>Strongly Disagree</em>? This may be true for one question but it might not be true for all questions on a questionnaire. Different questions and question wording are quite likely going to have different distributions. Furthermore, confidence intervals require normality assumptions which may also be incorrect.

What makes matters worse is that if there are 50 respondents and 25 of them mark <em>Strongly Disagree</em> and 25 of them mark <em>Strongly Agree</em> then the mean will be 3 implying that, on average, the results are neutral.  But that clearly does not adequately describe the data. Ultimately, it is up to the statistician to work with the client to use an appropriate method that appropriately conveys the message and that both parties can agree upon.; par(mfrow=c(1,1));

plot(r$group,r$mean, type="o", cex=1, col="blue", pch=16,
ylim=c(1,5), lwd=2,
, ylab="Mean Value", xlab="Group"
, main=paste("Likert Scale Mean Values Example")
, cex.sub=.60
, xaxt = "n", yaxt = "n");
axis(1, at=(1:3), tcl = -0.7, lty = 1, lwd = 0.8, labels=TRUE)
axis(2, at=(1:5), labels=TRUE, tcl = -0.7, lty = 1)
abline(h=c(1:5), col="grey")
lines(r$group,r$ll, col='red', lwd=2)
lines(r$group,r$ul, col='red', lwd=2)

legend("topright", c("Mean","Confidence Interval"),
col=c('blue','red'), title="Legend",
lty=1, lwd=2,
inset = .05)

Pies and Multiple Pies

A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between charts (…). Given their low density and failure to order numbers along a visual dimension, pie charts should never be used. — Edward Tufte

Pie charts are notoriously difficult to convey the information that was intended.  As far as pie charts go I don’t ever use them.  There are far better ways to visualize data.  However, I have heard some people give a reason for using them that are somewhat justified and generally are based on the ‘eye candy’ argument.  But as far as creating a graph that both provides information and looks good a 3-D pie chart is probably not the best choice.  I debated whether I should even include the R code for the example but to provide full disclosure here’s the code.

my.table <- table(raw&#91;,2&#93;&#91;raw&#91;,1&#93;==1&#93;)

names(my.table) <- c("Strongly Agree","Agree","Neutral","Disagree","Strongly Disagree")
labl <- paste(names(my.table), "\n", my.table, sep="")
pie(my.table, labels=labl, main="Example Pie Chart of Likert Scale")

<p style="text-align: center;"><a href=""><img class="aligncenter  wp-image-852" title="Pie Chart for Likert Scale" src="" alt="" width="420" height="262" /></a></p>
num.groups <- length(unique(raw&#91;,1&#93;))
for(j in 1:num.groups){
my.table <- table(raw&#91;,2&#93;&#91;raw&#91;,1&#93;==j&#93;)
pie(my.table, labels=labl, main=paste("Example Pie Chart of\nLikert Scale Group ", j))

<p style="text-align: center;"> <a href=""><img class="aligncenter  wp-image-858" title="Multiple Pie Charts for Likert Scale" src="" alt="" width="502" height="284" /></a></p>

slices <- my.table
names(my.table) <- c("Strongly Agree","Agree","Neutral","Disagree","Strongly Disagree")
labl <- paste(names(my.table), "\n", my.table, sep="")
main="3D Pie Chart Example")

<p style="text-align: center;"><a href=""><img class="aligncenter  wp-image-855" title="3D Pie Chart for Likert Scale" src="" alt="" width="420" height="300" /></a></p>
<strong>Grouped Bar Chart</strong>

This is a nice approach when wanting to look at each group and highlight any particular Likert response option.  Here it is easy to see that in Group 2 the <em>Neutral </em>option is by far the most common response.

barplot(gbar, beside=T, col=cm.colors(5), main="Example Bar Chart of Counts by Group",xlab="Group",ylab="Frequency")

legend("topright", names(my.table),
col=cm.colors(5), title="Legend",
lty=1, lwd=15,
inset = .1)

Divided Bar Chart

This isn’t a bad approach and quite similar to the diverging stacked bar chart.  This approach shows the stacked percent for each category.


names(sgbar) <- c("group","Strongly Agree","Agree","Neutral","Disagree","Strongly Disagree") mx <- melt(sgbar, id.vars=1) names(mx) <- c("Group","Category","Percent") ggplot(mx, aes(x=Group, y=Percent, fill=Category)) + geom_bar(stat="identity") [/sourcecode]


Power Analysis and the Probability of Errors

Power analysis is a very useful tool to estimate the statistical power from a study. It effectively allows a researcher to determine the needed sample size in order to obtained the required statistical power. Clients often ask (and rightfully so) what the sample size should be for a proposed project. Sample sizes end up being a delicate balance between the amount of acceptable error, detectable effect size, power, and financial cost. A lot of factors go into this decision. This example will discuss one approach.

In more specific terms power is the probability that a statistical test will reject the null hypothesis when the null hypothesis is truly false. What this means is that when power increases the probability of making a Type II error decreases. The probability of a Type II error is denoted by \beta and power is calculated as power = 1 - \beta.

In order to calculate the probability of a Type II error a researcher needs to know a few pieces of information \mu, \sigma^2, n, and \alpha (probability of a Type I error). Normally, if a researcher already knows the population mean (\mu) and variance (\sigma^2) there is no need to take a sample to estimate them. However, we can set it up so we can look at a range of possible unknown population means and variances to see what the probability of a Type II error is for those values.

The following code shows a basic calculation and the density plot of a Type II error.

P\left(X>38| \mu,\sigma^2 \right)=P\left(Z>\frac{38-50}{10}\right)=P\left(Z>-1.2\right)=.8849303

my.means = c(38,50); = c(10,10);

X = seq(my.means[1][1]*5,my.means[1][1]*5, length=100);
dX = dnorm(X, mean=my.means[1],[1]); #True distribution
dX = dX/max(dX);

X.2 = seq(my.means[2][2]*5,my.means[2][2]*5, length=100);
dX.2 = dnorm(X.2, mean=my.means[2],[2]); #Sampled distribution
dX.2 = dX.2/max(dX.2);

plot(X, dX, type="l", lty=1, lwd=2, xlab="x value",
ylab="Density", main="Comparison of Type II error probability",
lines(X.2,dX.2, lty=2)
abline(v=my.means, lty=c(1,2));
legend("topleft", c("Sample","Type II Error"), lty=c(2,1), col=c("black","black"), lwd=2, title=expression(paste("Type II Error")))

x = (my.means[1]-my.means[2])/[2];

[1] 0.8849303



However, there is a more direct way to determine necessary power and the Type II error calculation using R by setting up a range of sample sizes. Often the mean and variance for power analysis are established through a pilot study or other preliminary research on the population of interest.

R has several convenient functions to calculate power for many different types of tests and designs.  For this example power.anova.test will be used for demonstration purposes.  However, power.t.test could have been used as well.

For each of these functions there are varying values that need to be supplied however they generally involve effect size (difference in group means), sample size, alpha level, measure of variability and power.  In this example the within variability is estimated by using the MSE from the ANOVA table and the between variability is estimated from the variance of group means.  Otherwise, three different alpha levels are measured over a range of 50 possible n’s.

x = matrix(NA,nrow=200,ncol=2);
x[,1]= rbind(rep(1,100),rep(2,100))
x[,2]= rbind(rnorm(100,23,20),rnorm(100,7,20));
x=data.frame(x); names(x) = c("group","value");
groupmeans = as.matrix(by(x$value,x$group,mean));
x.aov = summary(aov(x$value ~ as.factor(x$group)))

nlen = 50
withinvar = 470; ##From MSE in ANOVA
raw.pwr = matrix(NA, nrow=nlen, ncol=5)
for(i in 1:nlen){
pwr.i10 = power.anova.test(groups=length(groupmeans), n=1+i,
between.var=var(groupmeans), within.var=withinvar, sig.level=.10)
pwr.i05 = power.anova.test(groups=length(groupmeans), n=1+i,
between.var=var(groupmeans), within.var=withinvar, sig.level=.05)
pwr.i01 = power.anova.test(groups=length(groupmeans), n=1+i,
between.var=var(groupmeans), within.var=withinvar, sig.level=.01)
power10 = pwr.i10$power
power05 = pwr.i05$power
power01 = pwr.i01$power
raw.pwr[i,1] = power10
raw.pwr[i,2] = power05
raw.pwr[i,3] = power01
raw.pwr[i,5] = 1+i

plot(raw.pwr[,5], raw.pwr[,1], type="n", ylim=c(0,1),
ylab="Power",xlab="Replicates (sample size per group)",
main=expression(paste("Power Analysis: 2 Tx Groups for ", alpha, "=.10, .05, .01"))
, sub=paste("Within var=",withinvar,"; Between var=", round(var(groupmeans)),2), cex.sub=.75
lines(raw.pwr[,5], raw.pwr[,1], type="l", lty=1, lwd=2, col="blue")
lines(raw.pwr[,5], raw.pwr[,2], type="l", lty=2, lwd=2, col="red")
lines(raw.pwr[,5], raw.pwr[,3], type="l", lty=3, lwd=2, col="green")
abline(h=seq(.1:.9, by=.1), col="lightgrey", lty=2); abline(h=.9, col=1, lty=1);
abline(v=c(10,20,30,40), lty=2, col="lightgrey")
abline(v=raw.pwr[,5][round(raw.pwr[,1],2)==.90][1]); text(raw.pwr[,5][round(raw.pwr[,1],2)==.90][1]-2.5, 0, paste("n: ",raw.pwr[,5][round(raw.pwr[,1],2)==.90][1]));
legend("bottomright", c(".10",".05",".01"), lty=c(1,2,3), col=c("blue","red","green"), lwd=2, title=expression(paste(alpha, " levels")))

Power analysis over a range of n's

This graph shows what the power will be at a variety of sample sizes. In this example to obtain a power of 0.90 (\alpha=0.10) a sample of size 23 (per group) is needed.  So that will be a total of 46 observations.  It’s then up to the researcher to determine the appropriate sample size based on needed power, desired effect size, \alpha level, and cost.

There is no real fixed standard for power. However, 0.8 and 0.9 are often used. This means that the probability of a Type II error is 0.2 and 0.1, respectively. But it really comes down to whether the researcher is willing to accept a Type I error or a Type II error. For example, it’s probably better to erroneously have a healthy patient return for a follow-up test than it is to tell a sick patient they’re healthy.