Software Packages for Graphs and Charts

Graphs can be an important feature of analysis. A well-designed graph can make summary statistics much more readable and improve interpretability. It also makes reports and articles look more professional.

There are many software packages available for designing great graphs and charts. Which graphing software to use seems to be a constant discussion I have with others. There are some packages out there that simply make a graph. You can even use the defaults in Microsoft Excel or PowerPoint, but then you end up with an “Excel Graph”. Whenever I see one of those graphs, sadly, my knee-jerk reaction is to investigate the analysis further. So there is something to be said for having professional-looking graphs. With that said, I can agree that, given some time and work, Excel can create some attractive graphs.

There are a lot of packages out there, and trying to learn every one of them is almost impossible. My suggestion is to find one that produces high-quality graphs and stick with it. There are many commercial software packages, including the most common: SPSS, SAS, and Minitab. There’s even Photoshop/Illustrator for those who want absolute control over a completely custom graph. I’ll highlight a few open source packages that I have used for various purposes here.

1) R — This does, of course, do statistical analysis, but it can also produce some very high quality graphs. It will produce just about every graph imaginable, and if it doesn’t already do the graph then chances are you can write a function or package that does. R graphs are built from the ground up: if you don’t want something in the graph, it won’t be there unless you say otherwise (see the short sketch after this list).

2) Gretl — This is a full-featured statistical software package that handles a handful of graphs. It’s specifically designed for econometrics, quite easily performs general statistical tests, and even handles iterative estimation (e.g. Monte Carlo). A review of the package is available in the Journal of Statistical Software. The graphs are quick and easy to make, and there is a command line option for more control. The graphs are produced by gnuplot and are of high enough quality that they can be used in publications. The range of graphs is limited, but the ones it does produce are pretty good: there’s the standard boxplot, scatterplot, and qqplot.

3) Orange — I was working with several large datasets a few years ago and looked into using Orange. This is a very powerful data mining package. It also has a graphing module that produces some very good graphs. It handles the standard graphs, but it really focuses on data mining graphs and plots, including hierarchical clustering and scatterplots; it also provides a survey plot for identifying correlations between two variables.

4) GGobi — I was first introduced to GGobi back in about 2000. It is a great way to visualize multi-dimensional data. It can be a bit mind-boggling at times trying to wrap your head around a 5-dimensional space being projected onto a two-dimensional plane. I’ve also used it to help spot outliers in a 4+ dimensional space and to simply gain a better understanding of the data.

5) gnuplot — This is strictly a graphing program, and it does a good job of creating graphs. It will make a wide variety of graphs and has many built-in demos that can be easily modified and rerun on a new dataset. It also makes it really easy to rotate three-dimensional scatterplots by simply clicking and dragging the image. Here is an example from gnuplot.

[Figure: example gnuplot output]
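To illustrate the “built from the ground up” point from the R item above, here is a minimal base R sketch on made-up data, where each plot element appears only because it is explicitly requested:

# Made-up data; nothing appears on the canvas unless requested
x = rnorm(100)
y = x + rnorm(100)
plot(x, y, axes=FALSE, xlab="", ylab="", pch=19)
axis(1); axis(2)                         # add the axes explicitly
title(main="Built from the ground up", xlab="x", ylab="y")
abline(lm(y ~ x), lty=2)                 # add a fitted line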


These are just a few of the software packages that I have used and with which I have had success. There are many graphing packages out there, and I would be interested to hear about other programs that readers have used to create graphs and charts.


Latent Class Modeling Election Data

Latent class analysis is a useful tool for identifying groups within multivariate categorical data, such as Likert scale responses. These groups are known as latent classes. As a simple point of reference, the method can be compared to k-means cluster analysis, though there are several key differences between the two. First, latent class analysis assigns observations to groups probabilistically, while k-means assigns each observation to exactly one group. Second, k-means is readily available in many software packages but is only appropriate for continuous data, whereas latent class analysis is less widely available but is designed to handle categorical data.
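As a quick illustration of that hard-versus-probabilistic difference, here is a minimal sketch; the k-means side uses the built-in iris data, and the poLCA posterior matrix referenced in the comments comes from the fitted objects built later in this post:

# k-means: each observation receives exactly one cluster label
km = kmeans(iris[, 1:4], centers=3)
head(km$cluster)            # hard assignments: 1, 2, or 3

# A fitted poLCA object instead stores a posterior probability of
# membership in each class for every observation:
# head(lca.fit2$posterior)  # rows sum to 1 across the classes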

There are a handful of latent class analysis software packages. Probably the best and most common is Latent Gold; however, the license can be somewhat cost prohibitive, particularly if your daily routine does not include latent class modeling. Currently, SPSS does not include latent class analysis, though IBM, the company that owns SPSS, has indicated that an enhancement request for latent class analysis has been added to SPSS Development. For SAS users there is PROC LCA, but once again SAS itself is somewhat cost prohibitive. On the open source side there are the R packages poLCA and mclust. Unless one needs the many features available in Latent Gold, these R packages will generally be sufficient for data analysis.

In general, latent class modeling in R has the following code structure:

set.seed(1234)

library(e1071)
library(poLCA)
library(reshape2)

## An example of simulating Likert scale data
probs = cbind(c(.4,.2/3,.2/3,.2/3,.4), c(.1/4,.1/4,.9,.1/4,.1/4), c(.2,.2,.2,.2,.2))
my.n = 1000

# Draw my.n values from each column of probs and stack them long-wise
raw = NULL
for(i in 1:ncol(probs)){
  raw = rbind(raw, cbind(i, rdiscrete(my.n, probs=probs[,i], values=1:5)))
}
raw = data.frame(id = seq(1,my.n), raw)

# An example of how to transform the normalized (long) data back to a flat file
raw.flat = dcast(raw, id ~ i, value.var="V2")
names(raw.flat) = c("id","A","B","C")

# Simulation example of latent class models
f = cbind(B, C) ~ A
lca.fit1 <- poLCA(f, raw.flat, nclass=1, nrep=5)
lca.fit2 <- poLCA(f, raw.flat, nclass=2, nrep=5)

f = cbind(A, B, C) ~ 1
lca.fit1 <- poLCA(f, raw.flat, nclass=1, nrep=5)
lca.fit2 <- poLCA(f, raw.flat, nclass=2, nrep=5)
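To choose between the one- and two-class fits, one common approach is to compare information criteria; the fitted poLCA object stores these, so a quick check (a minimal sketch) is:

# Lower BIC/AIC suggests the better-supported number of classes
c(lca.fit1$bic, lca.fit2$bic)
c(lca.fit1$aic, lca.fit2$aic)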
ANES 2000

The following is an example of how one can analyze data from the American National Election Study (ANES), an election study conducted for each election year. The dataset used here is built into the poLCA package and is from 2000. However, I would recommend going to ElectionStudies and then to their Data Center to get the most recent dataset, from 2012. Additionally, for great data on election analysis I would strongly encourage the National Election Pool Exit Poll data. There are some great analyses that can be obtained through those data. However, the raw data is a bit more difficult to obtain (as of today the Roper Center has disabled all access to the raw data). Consequently, that analysis is fairly limited.

[Figure: LCA on the American National Election Studies data]


# Example dataset from the poLCA package
data(election)
# build the model with PARTY as the covariate
f <- cbind(MORALG, CARESG, KNOWG, LEADG, DISHONG, INTELG,
           MORALB, CARESB, KNOWB, LEADB, DISHONB, INTELB) ~ PARTY

# Run LCA on the ANES 2000 dataset with 3 classes
anes2000 <- poLCA(f, election, nclass=3, nrep=5)

# Build a matrix to prepare for graphing
my.mat.max = 15
my.mat <- cbind(1, c(1:my.mat.max))
exb <- exp(my.mat %*% anes2000$coeff)

# Run the matrix plot
matplot(c(1:my.mat.max), (cbind(1, exb)/(1 + rowSums(exb))), ylim=c(0,1), type="l",
  main="Party ID as a predictor of candidate affinity class",
  xlab="Party ID: strong Democratic (1) to strong Republican (7)",
  ylab="Probability of latent class membership", lwd=2, col=c('blue','green','red'))
text(5.9, 0.35, "Other")
text(5.4, 0.7, "Bush affinity")
text(2.5, 0.6, "Gore affinity")

National Election Pool Exit Poll 2012

Here is another example using the 2012 National Election Pool Exit Poll. In this example I simply pull the data directly from the published tables. This is meant as a basic example; there are quite a few caveats (e.g. rounding, weighting, item nonresponse, using candidate vote, etc.) to creating a raw dataset this way, but the latent class model concept remains the same. Also, the All Other category is not broken out by age, so I simply divide the count evenly across age groups (though probably not a completely accurate approach). The n is 26565, so that will be the baseline. Any of the National Election Pool member websites (ABC, CBS, CNN, Fox, NBC) can be used for this data. Note that, for some reason, CBS has incorrect marginal data on their site for this table.

[Figure: probability of latent class membership]

# Cell counts pulled directly from the tables and based on n of 26565

table.raw = rbind(
cbind( rep('W', 1286), rep('18-29', 1286), rep('O', 1286) ),
cbind( rep('W', 3395), rep('30-44', 3395), rep('O', 3395) ),
cbind( rep('W', 5239), rep('45-64', 5239), rep('O', 5239) ),
cbind( rep('W', 2417), rep('65+', 2417), rep('O', 2417) ),
cbind( rep('B', 534), rep('18-29', 534), rep('O', 534) ),
cbind( rep('B', 404), rep('30-44', 404), rep('O', 404) ),
cbind( rep('B', 404), rep('45-64', 404), rep('O', 404) ),
cbind( rep('B', 104), rep('65+', 104), rep('O', 104) ),
cbind( rep('H', 967), rep('18-29', 967), rep('O', 967) ),
cbind( rep('H', 749), rep('30-44', 749), rep('O', 749) ),
cbind( rep('H', 741), rep('45-64', 741), rep('O', 741) ),
cbind( rep('H', 247), rep('65+', 247), rep('O', 247) ),
cbind( rep('O', 197), rep('18-29', 197), rep('O', 197) ),
cbind( rep('O', 197), rep('30-44', 197), rep('O', 197) ),
cbind( rep('O', 197), rep('45-64', 197), rep('O', 197) ),
cbind( rep('O', 197), rep('65+', 197), rep('O', 197) ),

cbind( rep('W', 1490), rep('18-29', 1490), rep('R', 1490) ),
cbind( rep('W', 1339), rep('30-44', 1339), rep('R', 1339) ),
cbind( rep('W', 2388), rep('45-64', 2388), rep('R', 2388) ),
cbind( rep('W', 1302), rep('65+', 1302), rep('R', 1302) ),
cbind( rep('B', 247), rep('18-29', 247), rep('R', 247) ),
cbind( rep('B', 627), rep('30-44', 627), rep('R', 627) ),
cbind( rep('B', 648), rep('45-64', 648), rep('R', 648) ),
cbind( rep('B', 162), rep('65+', 162), rep('R', 162) ),
cbind( rep('H', 85), rep('18-29', 85), rep('R', 85) ),
cbind( rep('H', 40), rep('30-44', 40), rep('R', 40) ),
cbind( rep('H', 56), rep('45-64', 56), rep('R', 56) ),
cbind( rep('H', 16), rep('65+', 16), rep('R', 16) ),
cbind( rep('O', 61), rep('18-29', 61), rep('R', 61) ),
cbind( rep('O', 61), rep('30-44', 61), rep('R', 61) ),
cbind( rep('O', 61), rep('45-64', 61), rep('R', 61) ),
cbind( rep('O', 61), rep('65+', 61), rep('R', 61) )
)

exitpoll2012 = data.frame(table.raw)
names(exitpoll2012) = c("RACE","AGE","VOTE")
table(table.raw[,1], table.raw[,2])
table(table.raw[,1], table.raw[,3])
f <- cbind(AGE, RACE) ~ VOTE
xp.lca <- poLCA(f, exitpoll2012, nclass=2)
table(exitpoll2012$AGE)

# Build a matrix to prepare for graphing
my.mat.max = 4
my.mat <- cbind(1, c(1:my.mat.max))
exb <- exp(my.mat %*% xp.lca$coeff)

# Run the matrix plot
matplot(c(1:my.mat.max), (cbind(1, exb)/(1 + rowSums(exb))), ylim=c(0,1), type="l",
  main="Candidate Vote as a Predictor of Candidate Affinity Class using Voter Race and Age",
  xlab="Candidate Vote: Obama (1) Romney (2)",
  ylab="Probability of latent class membership", lwd=2, col=c('blue','red'))
text(1.4, 0.25, "Romney Leaning")
text(1.4, 0.8, "Obama Leaning")

Hey, I Just Did a Significance Test!

I’ve seen it happen quite often: the sig test. Somebody simply needs to know the p-value, as if that one number will provide all of the information about the study that they need. The dataset is presented and the client/boss/colleague invariably asks “is it significant?” and “what’s the correlation?”. To quote R.A. Fisher: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

Obviously, in the previous example there are lots of potential problems, ranging from the person making the request not knowing what they want, to the bigger problem that the study itself may have core structural flaws that undermine the overall integrity of any results. I want to focus on the former and share some thoughts on how to do a couple of hypothesis tests and select the appropriate test. This is mainly a matter of parametric vs. non-parametric tests. I’ll show the test most people are probably familiar with (the parametric one) and then show an alternative, often more appropriate, hypothesis test. I won’t get into the philosophical topics of hypothesis testing or whether .04999 is significant and .05001 is not. This simply provides some hypothesis testing options that too often get overlooked in market and/or social research. Yes, parametric tests provide more power, but not everything in this world fits nicely into one package with a bow on top.

[Figure: box plot]

What’s the Correlation

I’ve had this happen more often than I would care to count. I’m given a dataset and then asked: what’s the “correlation”? After some back and forth I find that they want the “correlation” (with p-value) for every pairwise combination of questions on the questionnaire. So here goes:

library(e1071)
library(Hmisc)
library(reshape)
n = 10
# Simulate two 5-point Likert items for n respondents (uniform probabilities)
raw = data.frame( id=seq(1,n), replicate(2, rdiscrete(n, probs=c(1,1,1,1,1), values=c(1,2,3,4,5)) ) )

Pearson’s correlation or, more formally, the Pearson product-moment correlation coefficient, is the correlation with which most people are familiar. Be sure to check its assumptions (i.e. homoscedasticity, linearity, normality). However, Spearman’s rho or Kendall’s tau (depending on how you want to interpret the results) may in fact be the better option.


# My example gave the following
> cor(raw$X1, raw$X2, method="kendall")
[1] -0.191943
> cor(raw$X1, raw$X2, method="spearman")
[1] -0.2320708
> cor(raw$X1, raw$X2, method="pearson")
[1] -0.2573766
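Since the request usually comes with “and the p-value”, cor.test() tests a single pair, and rcorr() from the already loaded Hmisc package returns all pairwise coefficients with p-values; a minimal sketch on the same simulated data:

# Single pair, with test statistic and p-value
cor.test(raw$X1, raw$X2, method="kendall")

# All pairwise correlations and p-values at once via Hmisc
rcorr(as.matrix(raw[, c("X1","X2")]), type="spearman")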

I Need a T-Test

This is a popular hypothesis test because people want to know if something is better (or worse) than something else, so the initial choice is a t-test. For example, did Group 1 make more money than Group 2, or did more people remember seeing an ad on theater screen 1 versus theater screen 2? When the data don’t meet the t-test assumptions, we can use the Mann-Whitney U test in place of the parametric t-test.

# Reshape the data into long (normalized) format
raw2 = melt(raw, id=c("id"))

#Mann-Whitney U Test (independent 2-groups)
wilcox.test(raw$X1,raw$X2, paired=FALSE, alternative="two.sided")
wilcox.test(raw2$value~raw2$variable, paired=FALSE, alternative="two.sided")
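For comparison, if the t-test assumptions did hold, the parametric version on the same simulated data would be:

# Parametric counterpart (only appropriate when its assumptions hold)
t.test(raw$X1, raw$X2, paired=FALSE, alternative="two.sided")
t.test(raw2$value ~ raw2$variable, alternative="two.sided")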

Is There Any Association in My Table

When testing tables and counts, the first go-to test is the chi-square test. But suppose you have a 2×2 table like this:

2 1
3 4

# Simulate two binary items to produce a small 2x2 table
raw.chisq = data.frame( id=seq(1,n), replicate(2, rdiscrete(n, probs=c(1,1), values=c(1,2)) ) )
table(raw.chisq$X1, raw.chisq$X2)

Sure, you can just run your chi-square test and be done with it. But there is one small problem: the assumptions for a chi-square test are not met. Namely, the cell counts are way too small. So what’s a researcher to do? This is where Fisher’s exact test works well. If we can assume that the marginal totals are fixed, then we can solve the problem this way:

fisher.test(raw.chisq$X1, raw.chisq$X2)
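For contrast, the chi-square test will still run on this table, but R warns that the approximation may be incorrect, and the expected cell counts make the problem explicit:

# Chi-square on the same small table; expect a warning about the approximation
tab = table(raw.chisq$X1, raw.chisq$X2)
chisq.test(tab)
chisq.test(tab)$expected   # expected counts, most likely below 5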

Is There a Difference in My Three Groups

Yes, it’s true, you can run three t-tests on your groups (1 vs 2, 1 vs 3, 2 vs 3). But that causes not only extra work but also problems with the hypothesis tests themselves, since running multiple tests inflates the Type I error rate. Besides, why do multiple tests when you can be more efficient in your testing and do just one ANOVA? And when you can’t meet the assumptions for the parametric analysis of variance, you can use the non-parametric Kruskal-Wallis rank sum test.

# Simulate three 5-point Likert groups and reshape to long format
raw.anova = data.frame( id=seq(1,n), replicate(3, rdiscrete(n, probs=c(1,1,1,1,1), values=c(1,2,3,4,5)) ) )
raw.anova2 = melt(raw.anova, id=c("id"))

# Non-parametric Kruskal-Wallis rank sum test
kruskal.test(raw.anova2$value ~ raw.anova2$variable)

# Parametric one-way ANOVA for comparison
fit.aov = lm(raw.anova2$value ~ raw.anova2$variable)
anova(fit.aov)
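If pairwise comparisons are still needed after a significant omnibus result, R provides built-in versions that adjust for the multiple-testing problem mentioned above:

# Pairwise Wilcoxon tests with a Holm correction for multiplicity
pairwise.wilcox.test(raw.anova2$value, raw.anova2$variable, p.adjust.method="holm")

# Parametric analogue: pairwise t-tests with the same correction
pairwise.t.test(raw.anova2$value, raw.anova2$variable, p.adjust.method="holm")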

Ultimately, it is important to understand what you’re doing and why. Just because R, SPSS, SAS, etc. gives you a number doesn’t mean it’s correct. If you find that something is significant, make sure you understand what it is saying and what you’re testing. You don’t want to run an independent two-sample test only to find out that it should have been a matched-pairs test.