Data Scientists and Statisticians: Can’t We All Just Get Along

It seems that the title “data scientist” has taken the world by storm.  It’s a title that conjures up almost mystical abilities: a person garnering information from oceans of data with ease, waving his or her hand like a Jedi Knight and simply telling the data what it should be.

What is interesting about the field of data science is its perceived (and possibly real) threat to other fields, namely statistics.  It seems to me that the two are distinct areas, and though each can exist separately on its own, each is weak without the other.  Hilary Mason (of Bitly) shares her definition of a data scientist, and I suppose my definition differs from hers: statisticians need to understand the science and structure of data, and data scientists need to understand statistics.  Larry Wasserman over at the Normal Deviate blog shares his thoughts on statistics and data science as well.  There are other blogs, but these two are probably sufficient.

Data science is emerging as a field of absolutes, and that is something that the general public can wrap their heads around.  It’s no wonder that statisticians are feeling threatened by data scientists.  Here are two (albeit extreme) examples:

If a statistician presents an estimate to a journalist and says, “here is the point estimate of the number of people listening to a given radio station, and the margin of error is +/- 3% at a 90% confidence level,” there is almost always a follow-up discussion about the margin of error: how the standard error was calculated (simple random, stratified, or cluster sampling), and why a 90% confidence level was used rather than 95%.  And then someone is bound to ask what a confidence interval is anyway.  Extend this even further and the statistician gives the journalist a p-value.  Now there is an argument between statisticians about hypothesis testing, and the terms “frequentist” and “Bayesian” start getting thrown around.
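
As a rough illustration of why the choice of confidence level matters, here is a small sketch for a proportion under simple random sampling (the sample size of 750 and the proportion of 0.5 are made-up values, not actual ratings data):

p <- 0.5   # most conservative assumed proportion
n <- 750   # hypothetical sample size
moe.90 <- qnorm(0.95)  * sqrt(p * (1 - p) / n)  # margin of error at a 90% confidence level
moe.95 <- qnorm(0.975) * sqrt(p * (1 - p) / n)  # margin of error at a 95% confidence level
round(c(moe.90, moe.95), 3)                     # roughly 0.030 vs 0.036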

It’s no wonder that people don’t want to work with statisticians.  Not only are they confusing to the general public, but statisticians can’t even agree (even if it’s a friendly disagreement) on what is correct.  Now take the following data scientist example:

A data scientist looks through a small file of 50 billion records where people have listened to songs through a registration-based online radio station (e.g. Spotify, Pandora, TuneIn).  This data scientist then merges and matches the records to a handful of public data sources to give the dataset a dimensionality of 10,000.  The data scientist then simply reports that there are X number of listeners in a given metro area listening for Y amount of time and produces a great SVG graph that can be dynamically updated each week with the click of a button on a website.  It is a fairly simple task and just about everyone can understand what it means.

I feel that there will always be a need for a solid foundation in statistics.  There will always exist natural variation that must be measured and accounted for.  There will always be data so expensive that only a limited number of observations can feasibly be collected, or data so difficult to obtain that only a handful of observations can be collected at all.  I would conjecture that a data scientist would not have a clue what to do with that data without help from someone with a background in statistics.  At the same time, if a statistician were told that there is a 50-billion-row by 10,000-column dataset sitting on a Hadoop cluster, I would guess that many statisticians would be hard pressed to set the data up for analysis without consulting a data scientist.  But a data scientist would probably struggle if asked to take those 10,000 dimensions and reduce them down to a digestible and understandable set.

 

Take another example: genetic sequencing.  A data scientist could work through the data and discover that in one sequence there is something different.  Then a domain expert can come in and find that the mutation is in the BRCA1 gene and that BRCA1 relates to breast cancer.  A statistician can then be consulted to estimate the risk that the particular mutation will result in increased mortality and the probability that the patient will ultimately get breast cancer.

Ultimately, the way I see it, the two disciplines need to come together and become one.  I see no reason why it can’t be part of the curriculum in statistics departments to teach students how to work with real-world data.  Those working in the data science and statistics fields need to have the statistical training while also having the ability to work with data regardless of its location, format, or size.

JSM 2013 – Wednesday

I was able to attend a continuing education short course at the JSM conference that proved to be quite insightful.  The discussion was on data mining and was titled “Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets”.  The presentation was given by Dan Steinberg, and the examples he gave were based on a proprietary software package called SPM (Salford Predictive Modeler).  I have not personally used the software so I’m in no position to endorse or discourage its use.  I generally prefer open source solutions unless there is resounding evidence to use commercial products, so I’m interested in seeing how this software operates.  The slides for this presentation (as well as other continuing education courses) are available at http://info.salford-systems.com/jsm-2013-ctw.  Much of the workshop dealt with a dataset relating to car insurance fraud and how to use CART models and Random Forests.  As an aside, I made a post a while back giving some examples in R of those models.  The workshop was educational and informative on how to approach these types of problems using a different software package.  I’m particularly interested in comparing SPM to R, or in seeing if others have already run some comparisons.

JSM 2013 – Tuesday

The Joint Statistical Meeting in Montreal has proven to be very good.  Here are a few highlights from Tuesday’s sessions.  There is one major problem: there are too many good sessions to attend.  During one time block I had six sessions that I wanted to go to, and it is simply not possible to make it to all of them.  However, the recurring theme is that if you don’t know at least R then you will quickly be left in the dust.  Based on the sessions so far, knowing R is a must, and knowing other languages such as Java, Scala, or Python will certainly be good.

Session on Analytics and Data Visualization in Professional Sports


During the morning I attended a session on statistics in sports.  It was mostly sabermetric presentations with some basketball in there too.  One presentation caught my attention due to the granularity of the data that the presenter used.  Benjamin Baumer presented an R package called openWAR, an open source version of WAR (Wins Above Replacement) in baseball.  With the data the package accesses, it is able to identify every play as well as the spatial location of where the batter hit the ball on the field.  If someone is interested in sports statistics, or just in playing with a lot of publicly available data, then openWAR is a great resource (currently available on GitHub at https://github.com/beanumber/openWAR).  The presentation also discussed the distribution of the players on the field and their ability to field the ball once it was hit.  A different presentation, from Sportvision, covered the location and trajectory of the ball as the pitcher throws it.  Sportvision also shows the location in the strike zone where the batter hits the ball the hardest.  They are the same company that does the 1st & 10 graphics (i.e. the yellow line needed for a 1st down).

Session on Statistical Computing: Software and Graphics

I attended the Statistical Computing session, and 5 of the 6 presentations were on R packages.  The first was a presentation on Muste, the R implementation of Survo.  I have not used Survo before, but I will certainly do some research into it.  The next presentation was by Stephan Ritter, the maintainer of the relaxnet and widenet packages.  The third presentation was by David Dahl, the maintainer of jvmr; with this package one can integrate Scala and Java into R without any special compilation.  TIBCO Spotfire then presented the TIBCO Enterprise Runtime for R (TERR).  This looks to be an interesting solution to some of the data management issues that exist in R, and the presenter indicated that it does a very good job of managing system resources.  The fifth presentation discussed the Rcpp package, and the final presentation, by Christopher Fonnesbeck, was on PyMC, which allows a user to perform Bayesian statistical analysis in Python.

JSM 2013 – Monday

I am currently attending the 2013 Joint Statistical Meeting in Montreal. I will try to share a few of the things that I take away each day.


Last night (Monday) I attended the JSM keynote with Nate Silver, and it proved to be a very interesting discussion.  Silver is best known for his work on http://www.fivethirtyeight.com.  His speech was good and focused on the journalistic component of statistics.  He shared that, in his opinion, statisticians should blog and contribute to journalism.  He also added, though it’s a matter of personal opinion and I don’t agree, that a statistician should get a few years of hands-on work before going on for an advanced degree.  I’m of the philosophy that you just do it all at once; otherwise you might find yourself in a position where, for one reason or another, you simply can’t go back to school.

I think Silver gave the quote of the conference.  He was asked his opinion on the difference between a data scientist and a statistician.  His response was, “Data Scientist is just a sexed up version of Statistician.”  He then added that the terms are effectively redundant: just do good work and call yourself whatever you want.

The question was also asked during the Q&A portion why he feels election exit polls should not be trusted.  I disagree with Silver on this point.  He feels that exit polls are wrong, and his arguments include the sample design of the exit poll (a cluster design).  His argument was that a cluster design increases the margin of error.  This is a true statement, but it misses the whole point of sample design and the fact that the exit poll uses a 99.9% confidence level to call a race, which alone increases the margin of error.  This is because news networks are not in the business of calling races wrong and looking foolish.  Exit polls serve their purpose and have ultimately been quite accurate in calling races.  They also help give context to why a candidate won or lost.
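
To make that argument concrete, here is a small sketch of how both the cluster design and the 99.9% confidence level widen the margin of error (the sample size of 2000 and the design effect of 2 are assumed values for illustration, not actual exit poll figures):

p <- 0.5; n <- 2000   # hypothetical precinct-based exit poll sample
deff <- 2             # assumed design effect from the cluster design
se.srs <- sqrt(p * (1 - p) / n)       # standard error under simple random sampling
se.cluster <- sqrt(deff) * se.srs     # clustering inflates the standard error
qnorm(0.975)  * se.cluster            # margin of error at a 95% confidence level
qnorm(0.9995) * se.cluster            # margin of error at the 99.9% level used to call races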

Imputing Missing Data With Expectation – Maximization

It can be fairly common to find missing values in a dataset. Having only a few missing values isn’t generally a problem and those records can be deleted listwise; in other words, the entire record is simply removed from the analysis. The problem is that even a limited amount of missing data can translate into a significant number of records being omitted. In the example below, about two-thirds of the records would end up being omitted due to missing values.

The distribution of the missing values in the data is very important. If the data are missing at random then that is less serious than when there is a pattern of missing values that is, at least to some extent, dependent on the missing variables.

There are many approaches that can be used to impute missing data. The easiest is to simply calculate the mean of each variable and substitute it for each of the missing values. The problem with this is that it reduces the variance and the absolute value of the covariance. Another common approach is Expectation-Maximization (EM). This technique iteratively works through the data while preserving the covariance structure of the data.
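
Before the EM code below, here is a quick illustration of the variance shrinkage that mean substitution causes (a sketch on made-up normal data, separate from the Poisson example that follows):

set.seed(42)
x <- rnorm(200, mean = 10, sd = 2)
x.miss <- x
x.miss[sample(200, 60)] <- NA                                     # remove 30% of the values
x.mean.imp <- ifelse(is.na(x.miss), mean(x.miss, na.rm=TRUE), x.miss)
var(x)           # variance of the complete data
var(x.mean.imp)  # noticeably smaller after mean substitution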

library(e1071)

# Simulate a dataset and insert missing values at random positions
raw <- replicate(10, rpois(50, 100))
raw.orig <- raw
rand.miss <- rdiscrete(50, probs=rep(1:length(raw)), values=seq(1, length(raw)))
raw[rand.miss] <- NA
raw <- data.frame(raw)

# Covariance matrix using only the complete cases (listwise deletion)
var(na.omit(raw))

EMalg <- function(x, tol=.001){
  missvals <- is.na(x)
  new.impute <- x
  old.impute <- x
  count.iter <- 1
  reach.tol <- 0
  sig <- as.matrix(var(na.exclude(x)))
  mean.vec <- as.matrix(apply(na.exclude(x), 2, mean))

  while(reach.tol != 1){
    for(i in 1:nrow(x)){
      pick.miss <- ( c(missvals[i,]) )
      if( sum(pick.miss) != 0 ){
        # we need the inverse of the covariance of the observed variables
        inv.S <- solve(sig[!pick.miss, !pick.miss])

        # Run the EM: conditional expectation of the missing values
        new.impute[i, pick.miss] <- mean.vec[pick.miss] +
          sig[pick.miss, !pick.miss] %*% inv.S %*%
          (t(new.impute[i, !pick.miss]) - t(t(mean.vec[!pick.miss])))
      }
    }

    # Update the covariance matrix and mean vector with the imputed values
    sig <- var(new.impute)
    mean.vec <- as.matrix(apply(new.impute, 2, mean))

    # we don't want this to run on the first iteration or else it fails
    if(count.iter > 1){
      for(l in 1:nrow(new.impute)){
        for(m in 1:ncol(new.impute)){
          if( abs(old.impute[l,m] - new.impute[l,m]) > tol ){
            reach.tol <- 0
          } else {
            reach.tol <- 1
          }
        }
      }
    }
    count.iter <- count.iter + 1  # used for debugging purposes to ensure the process is iterating properly
    old.impute <- new.impute
  }
  return(new.impute)
}

raw.imputed <- EMalg(raw, tol=.0001)

# Compare the covariance structure after imputation to the complete-case version above
var(raw.imputed)

plot(raw.imputed[,1], raw.imputed[,2], pch=16,
     main="Scatterplot of Missing Data",
     sub="Missing Values in Red", xlab="X", ylab="Y")

# overlay the imputed values on the plot
plot.imputed <- raw.imputed[ row.names( subset(raw, is.na(raw[,2]) | is.na(raw[,3])) ), ]
points(plot.imputed[,2], plot.imputed[,3], pch=16, col='red')

Example Graph of Missing Data

Once the missing values are imputed it is important to review the data and do the standard assumption tests before proceeding with further analysis.  This is one of many approaches for imputing missing data.  Other approaches include random forests or other machine learning approaches that train a model directly over the missing data.
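
For instance, a random-forest-based imputation can be done with the missForest package (a minimal sketch, assuming the package is installed, applied to the raw data frame from the code above):

library(missForest)

# Iteratively imputes each variable using a random forest trained on the other variables
rf.imp <- missForest(raw)
raw.rf.imputed <- rf.imp$ximp   # the completed dataset
rf.imp$OOBerror                 # out-of-bag estimate of the imputation error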

Some Common Approaches for Analyzing Likert Scales and Other Categorical Data

Analyzing Likert scale responses really comes down to what you want to accomplish (e.g. are you trying to provide a formal report with probabilities, or are you simply trying to understand the data better?). Sometimes a couple of graphs are sufficient and a formal statistical test isn’t even necessary. However, with how easy it is to conduct some of these statistical tests, it is often best to just formalize the analysis. There are several approaches that can be used. Here are just a few of them.

The code to set up the data for some testing is as follows.  Note that this is the same code used in Plotting Likert Scales:


set.seed(1234)
library(e1071)

probs <- cbind(c(.4,.2/3,.2/3,.2/3,.4), c(.1/4,.1/4,.9,.1/4,.1/4), c(.2,.2,.2,.2,.2))
my.n <- 100
my.len <- ncol(probs)*my.n
raw <- matrix(NA, nrow=my.len, ncol=2)
raw <- NULL
for(i in 1:ncol(probs)){
  raw <- rbind(raw, cbind(i, rdiscrete(my.n, probs=probs[,i], values=1:5)))
}
raw <- data.frame(raw)
names(raw) <- c("group","value")
raw$group <- as.factor(raw$group)
raw.1.2 <- subset(raw, raw$group %in% c(1,2))

T-TEST

I might as well get this one out of the way. It sure is easy to take this approach, which helps explain why it is probably one of the more controversial approaches. Even something like Excel will spit this out without much thought. You have to stretch the assumptions of the t-test to their outer limits, so if taking this approach one must very carefully verify the t-test assumptions. Most notably:

-Z follows a standard normal distribution
-the variance S^2 follows a chi-square distribution
-the variances of the two populations should be equal (unless using Welch's test)
-the two populations should be independent

The two independent populations assumption gets a little sticky unless you truly are looking at two different populations in your data (e.g. male/female or Hispanic/non-Hispanic). In other words, just because you're comparing two different questions from your questionnaire doesn't necessarily mean you have two independent populations.
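
A quick way to check the normality and equal-variance assumptions on the simulated data above (a sketch; with discrete Likert-type responses the normality test will usually reject, which is part of the controversy):

# Normality within each group
shapiro.test(raw.1.2$value[raw.1.2$group == 1])
shapiro.test(raw.1.2$value[raw.1.2$group == 2])

# Equality of variances between the two groups
var.test(raw.1.2$value ~ raw.1.2$group)       # F test
bartlett.test(raw.1.2$value ~ raw.1.2$group)  # Bartlett's test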


> t.test(raw.1.2$value ~ raw.1.2$group, var.equal=TRUE)

Two Sample t-test

data: raw.1.2$value by raw.1.2$group
t = 2.5622, df = 198, p-value = 0.01114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.1151745 0.8848255
sample estimates:
mean in group 1 mean in group 2
3.4 2.9

CHI SQUARE / FISHER EXACT TEST

This ends up being the better approach and it’s relatively easy to understand. The chi square test is designed to handle categorical frequency data and test the association between two variables.

When the sample size is too small and the assumptions of the chi square test are no longer satisfied, an alternative option is to use Fisher’s Exact Test. The classical example of this is Fisher’s Lady Tasting Tea problem. Though the test is designed for a 2×2 table, there are ways to generalize it to larger tables, and R makes it quite simple.
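
For the 2×2 case, the Lady Tasting Tea counts from the fisher.test documentation make a minimal example (these counts are the classic illustrative ones, not from the simulated survey data):

TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(Guess = c("Milk", "Tea"),
                                     Truth = c("Milk", "Tea")))
fisher.test(TeaTasting, alternative = "greater")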


( c.test <- chisq.test(raw$group, raw$value) )

Pearson's Chi-squared test

data: raw$group and raw$value
X-squared = 195.1726, df = 8, p-value < 2.2e-16



In the above example some of the cells are quite small, which could mean that the chi square approach may not work.  So Fisher's 2×2 test can be extended and we can test this data.  However, keep in mind the assumption that the marginal totals are fixed. Due to limitations on workspace size in R, I have just found it easiest to simulate the p-value and achieve the desired outcome that way.


sim.table <- table(raw$group, raw$value)

fisher.test(sim.table, simulate.p.value=TRUE, B=1e6) # Simulate due to workspace limitations

Fisher's Exact Test for Count Data with simulated p-value (based on 1e+06
replicates)

data: sim.table
p-value = 1e-06
alternative hypothesis: two.sided



WILCOXON SIGNED-RANK TEST

This is used when the data come from a related sample and are from the same population. In other words, this works well on a matched pairs sample.  So assuming group 1 and group 2 come from the same population and are just different measurements, we can take this approach.


wilcox.test(raw.1.2$value[raw.1.2$group==1], raw.1.2$value[raw.1.2$group==2], paired=TRUE)


MANN-WHITNEY

This tests whether two independent samples come from the same distribution.  In this case, the only difference between the Mann-Whitney test and the Wilcoxon signed-rank test is that the paired sample is specified in the Wilcoxon signed-rank test.


wilcox.test(raw.1.2$value[raw.1.2$group==1], raw.1.2$value[raw.1.2$group==2])


KRUSKAL-WALLIS TEST
This is the analysis of variance (ANOVA) equivalent for categorical data.   I feel that it is probably very underused.  This is probably due to ANOVA being beyond the scope of most casual analysts, and throwing in categorical data makes it that much more obscure.  Like ANOVA, it also assumes independent populations.  But once you understand exactly what you're testing and what type of data you're dealing with, the implementation of the test is quite simple:


kruskal.test(raw$value ~ raw$group)

Kruskal-Wallis rank sum test

data: raw$value by raw$group
Kruskal-Wallis chi-squared = 13.9105, df = 2, p-value = 0.0009536


This can be compared to the parametric ANOVA:


fit <- lm(raw$value ~ raw$group)
anova(fit)

Analysis of Variance Table

Response: raw$value
Df Sum Sq Mean Sq F value Pr(>F)
raw$group 2 14.91 7.4533 4.0289 0.01878 *
Residuals 297 549.44 1.8500
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Using R to Produce Scalable Vector Graphics for the Web

Statistical software is normally used during the analysis stage of a project, and a cleaned-up static graphic is created for the presentation.  If the presentation is in a web format then some additional considerations are needed; the trick is to implement the graphs in that format so they are of the highest possible quality.  If all that is needed is an image then simply saving the graph as a JPG or PNG and posting it to a website is quite simple and usually sufficient.  However, if more flexibility and higher quality are needed then some additional work will be required.

One of the best ways to present a graph is using vector graphics (as opposed to raster graphics).  This means that a user can zoom in and there won’t be any degradation in image quality.  Several formats support vector graphics, including PDF and SVG.
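
As a minimal sketch of the difference in practice, the same plot can be written out as a raster PNG or a vector SVG directly from R (the file names here are placeholders):

x <- rnorm(100); y <- rnorm(100)

png("scatter.png", width = 480, height = 480)  # raster: a fixed grid of pixels that degrades when zoomed
plot(x, y, pch = 16)
dev.off()

svg("scatter.svg", width = 7, height = 7)      # vector: drawing instructions that scale cleanly at any zoom
plot(x, y, pch = 16)
dev.off()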

Scalable Vector Graphics are a great way to put together graphs using an XML-based format. This means the graphic can be embedded directly into a website and, as an added bonus, it can become a dynamic image changing with user input.   R will generate the base structure of the graphic, but dynamic SVG requires a bit more work outside of R.   Most modern browsers support this graphic format, including IE 9, Firefox, and Chrome (IE 8 is not considered modern anymore, so it is not supported).

Using the example from a previous post, I can convert the image into a Scalable Vector Graphic.  The code that R writes into the SVG file can be copied and pasted directly into a web page. I have a side-by-side comparison of the graphs using earthquake data from the week prior to June 28, 2013.

Earthquakes Week Prior to 2013-06-08

library(maps)
library(maptools)
library(rgdal)

svg(filename = "c:\\eq.svg", width = 7, height = 7)

eq = read.table(file="http://earthquake.usgs.gov/earthquakes/catalogs/eqs7day-M1.txt",
                fill=TRUE, sep=",", header=T)

plot.new()
my.map <- map("state", interior = FALSE, plot=F)
x.lim <- my.map$range[1:2]; x.lim[1] <- x.lim[1]-1; x.lim[2] <- x.lim[2]+1
y.lim <- my.map$range[3:4]; y.lim[1] <- y.lim[1]-1; y.lim[2] <- y.lim[2]+1

map("state", interior = FALSE, xlim=x.lim, ylim=y.lim)
map("state", boundary = FALSE, col="gray", add = TRUE)
title("Magnitude 1+ Earthquakes Over the Past 7 Days")

# Point size scaled by magnitude
eq$mag.size <- NULL
eq$mag.size[eq$Magnitude>=1 & eq$Magnitude < 2] <- .75
eq$mag.size[eq$Magnitude>=2 & eq$Magnitude < 3] <- 1.0
eq$mag.size[eq$Magnitude>=3 & eq$Magnitude < 4] <- 1.5
eq$mag.size[eq$Magnitude>=4] <- 2.0

# Point color scaled by magnitude
eq$mag.col <- NULL
eq$mag.col[eq$Magnitude>=1 & eq$Magnitude < 2] <- 'blue'
eq$mag.col[eq$Magnitude>=2 & eq$Magnitude < 3] <- 'green'
eq$mag.col[eq$Magnitude>=3 & eq$Magnitude < 4] <- 'orange'
eq$mag.col[eq$Magnitude>=4] <- 'red'

points(x=eq$Lon, y=eq$Lat, pch=16, cex=eq$mag.size, col=eq$mag.col)

# Label only the magnitude 4+ earthquakes
eq$magnitude.text <- eq$Magnitude
eq$magnitude.text[eq$Magnitude < 4] <- NA
text(x=eq$Lon, y=eq$Lat, col='black', labels=eq$magnitude.text, adj=c(2.5), cex=0.5)

legend('bottomright', c('M 1-2','M 2-3','M 3-4','M4+'), ncol=2, pch=16,
       col=c('blue','green','orange','red'))
box()
dev.off()

The package ggplot2 has a function, ggsave, that will identify that one wants an SVG file based on the filename provided.  This next graph shows the depth of the earthquake compared to the magnitude.

 

library(ggplot2)
( g <- ggplot(eq, aes(x = Magnitude, y = Depth)) +
    geom_point() +
    geom_point(data = eq, color = eq$mag.col, size = 3) )
ggsave(g, file="c:\\eqDepthMagnitude.svg")

Software Packages for Graphs and Charts

Graphs can be an important feature of analysis. A graph that has been well designed and put together can make summary statistics much more readable and increase interpretability. It also makes reports and articles look more professional.

There are many software packages available to design great graphs and charts.  This seems to be a constant discussion I have with others over what graphing software should be used. There are some packages out there that simply make a graph; you can even use something like the default Microsoft Excel or PowerPoint, but then you end up with an “Excel graph”. Whenever I see one of those graphs, sadly, my knee-jerk reaction is to investigate further into the analysis. So there is something to be said for having professional-looking graphs.  With that said, given some time and work I can agree that Excel can create some attractive graphs.

There are a lot of packages out there and trying to learn every one of them is almost impossible.  My suggestion is to find one that produces a high quality graph and stick with it. There are many commercial software packages including the most common: SPSS, SAS, Minitab.  There’s even Photoshop/Illustrator for those who want absolute control over a completely custom graph. I’ll highlight a few (open source) packages that I have used for various purposes here.

1) R — This does, of course, do statistical analysis, but it can also produce some very high quality graphs. It will produce just about every graph imaginable, and if it doesn’t already do the graph then chances are you can write a function or package to do it. R graphs are built from the ground up: if you don’t want something in the graph then it won’t be there unless you say otherwise (a short sketch follows this list).

2) Gretl — This is a full-featured statistical software package that handles a handful of graphs.  It’s specifically designed for economics and quite easily performs general statistical tests; it even handles iterative estimation (e.g. Monte Carlo).  A review of the software package is available at the Journal of Statistical Software.  The graphs are quick and easy to make, and there is a command line option for more control.  The graphs are gnuplot-based and are of high enough quality that they can be used in publications.  The range of graphs is limited, but the ones that it does produce are pretty good: there’s the standard boxplot, scatterplot, and qqplot.

3) Orange — I was working with several large datasets several years ago and looked into using Orange.  This is a very powerful data mining package.  It also has a graphing module to produce some very good graphs.  It produces the standard graphs, but it really focuses on data mining graphs and plots, including hierarchical clustering and scatterplots; it also provides a survey plot for identifying correlations between two variables.

4) GGobi  — I was first introduced to GGobi back in about 2000.  It is a great way to visualize multi-dimensional data.  It can be a bit mind-boggling at times trying to wrap your head around a 5-dimensional space being projected onto a two-dimensional plane. I’ve also used it to help spot outliers in a 4+ dimensional space and to just gain a better understanding of the data.

5) gnuplot — This is strictly a graphing program, and it does a good job of creating graphs.  It will make a wide variety of graphs and has many built-in demos that can be easily modified and rerun on a new dataset. It also makes it really easy to rotate three-dimensional scatterplots by simply clicking and dragging the image.  Here is an example from gnuplot.

Example gnuplot
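
Coming back to the first item, here is a small sketch of what building a base R graph from the ground up looks like, starting from an empty canvas and adding each element explicitly:

x <- rnorm(50); y <- rnorm(50)
plot(x, y, type = "n", axes = FALSE, ann = FALSE)  # empty canvas: no points, axes, or labels
points(x, y, pch = 16, col = "steelblue")          # add the points
axis(1); axis(2)                                   # add the axes
box()                                              # add the plot border
title(main = "Built Up Piece by Piece", xlab = "X", ylab = "Y")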

 

These are just a few of the software packages that I have used and with which I have had success.  There are many graphing packages out there and I would be interested in other programs that others have used to create graphs and charts.

 

 

Latent Class Modeling Election Data

Latent class analysis is a useful tool for identifying groups within multivariate categorical data, such as Likert scale responses.  These groups are known as latent classes. As a simple comparison, this can be compared to k-means cluster analysis. There are several key differences between the two methods. First, latent class analysis assigns observations to groups based on probability, while k-means cluster analysis assigns observations to groups absolutely. Second, while k-means is readily available in many software packages, it is only appropriate for continuous data; latent class analysis is not as widely available, but it is designed to handle categorical data.
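
To make the contrast concrete, here is a minimal sketch using built-in example data (the iris data for k-means and the values dataset shipped with poLCA), not the election data analyzed below:

library(poLCA)

# k-means: each observation gets exactly one cluster label (hard assignment)
km <- kmeans(iris[, 1:4], centers = 3)
head(km$cluster)

# Latent class analysis: each observation gets posterior class membership probabilities
data(values)                    # small example dataset included with poLCA
f <- cbind(A, B, C, D) ~ 1
lc <- poLCA(f, values, nclass = 2)
head(lc$posterior)              # each row of probabilities sums to 1
head(lc$predclass)              # modal class, if a hard assignment is needed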

There are a handful of latent class analysis software packages. Probably the best and most common is Latent Gold. However, the license can be somewhat cost prohibitive, particularly if your daily routine does not include latent class modeling. Currently, SPSS does not include latent class analysis; IBM, the company that owns SPSS, has indicated that the enhancement request for latent class analysis has been added to SPSS Development. For SAS users there is proc lca, but once again that is somewhat cost prohibitive. On the open source side of things there are the R packages poLCA and MCLUST. Unless one needs the many features available in Latent Gold, these R packages will generally be sufficient for data analysis.

In general latent class modeling has the following R code structure:

set.seed(1234)

library(e1071)
library(poLCA)
library(reshape2)

## An example of simulating likert scale data
probs = cbind(c(.4,.2/3,.2/3,.2/3,.4),c(.1/4,.1/4,.9,.1/4,.1/4),c(.2,.2,.2,.2,.2))
my.n = 1000
my.len = ncol(probs)*my.n
raw = matrix(NA,nrow=my.len,ncol=3)
raw = NULL
for(i in 1:ncol(probs)){
raw = rbind(raw, cbind(i,rdiscrete(my.n,probs=probs[,i],values=1:5)))
}
raw = data.frame(id = seq(1,my.n),raw)

# An example of how to transform data back from normalized data to a flat file
raw.flat = dcast(raw, id ~ i, value.var="V2")
names(raw.flat) = c("id","A","B","C")

# Simulation example of latent class models
f = cbind(B, C) ~ A
lca.fit1 <- poLCA(f, raw.flat, nclass=1, nrep=5)
lca.fit2 <- poLCA(f, raw.flat, nclass=2, nrep=5)

f = cbind(A, B, C) ~ 1
lca.fit1 <- poLCA(f, raw.flat, nclass=1, nrep=5)
lca.fit2 <- poLCA(f, raw.flat, nclass=2, nrep=5)

ANES 2000

The following is an example of how one can analyze data from the American National Election Study (ANES).  This is an election study conducted for each election year.  The dataset used here is built into the poLCA package and is from 2000.  However, I would recommend going to ElectionStudies and then to their Data Center to get the most recent dataset, from 2012.  Additionally, for great data on election analysis I would strongly encourage the National Election Pool Exit Poll data.  There are some great analyses that can be obtained through those data.  However, the raw data is a bit more difficult to obtain (as of today the Roper Center has disabled all access to the raw data), so the analysis here is fairly limited.

LCA American National Election Studies

 

# Example dataset from the poLCA package
data(election)

# Build the model with PARTY as the covariate
f <- cbind(MORALG,CARESG,KNOWG,LEADG,DISHONG,INTELG,
           MORALB,CARESB,KNOWB,LEADB,DISHONB,INTELB) ~ PARTY

# Run LCA on the ANES 2000 dataset with 3 classes
anes2000 <- poLCA(f, election, nclass=3, nrep=5)

# Build a matrix of covariate values to prepare for graphing
my.mat.max = 15
my.mat <- cbind(1, c(1:my.mat.max))
exb <- exp(my.mat %*% anes2000$coeff)

# Plot the estimated class membership probabilities against party ID
matplot(c(1:my.mat.max), (cbind(1, exb)/(1+rowSums(exb))), ylim=c(0,1), type="l",
        main="Party ID as a predictor of candidate affinity class",
        xlab="Party ID: strong Democratic (1) to strong Republican (7)",
        ylab="Probability of latent class membership", lwd=2, col=c('blue','green','red'))
text(5.9, 0.35, "Other")
text(5.4, 0.7, "Bush affinity")
text(2.5, 0.6, "Gore affinity")

National Election Pool Exit Poll 2012

Here is another example using the 2012 National Election Pool Exit Poll.  In this example I simply pull the data directly from the published tables.  This is meant as a basic example, and there are quite a few caveats (e.g. rounding, weighting, item nonresponse, using candidate vote, etc.) to creating a raw dataset this way, but the latent class model concept remains the same.  Also, the All Other category is not broken out by age, so I simply divide the count out evenly (though probably not a completely accurate approach).  The n is 26565, so that will be the baseline.  Any of the National Election Pool member networks' websites (ABC, CBS, CNN, Fox, NBC) can be used for this data.  Note that for some reason CBS has very wrong marginal data on its site for this table.

Probability of Latent Class Membership

# Cell counts pulled directly from the tables and based on n of 26565

table.raw = rbind(
  cbind( rep('W', 1286), rep('18-29', 1286), rep('O', 1286) ),
  cbind( rep('W', 3395), rep('30-44', 3395), rep('O', 3395) ),
  cbind( rep('W', 5239), rep('45-64', 5239), rep('O', 5239) ),
  cbind( rep('W', 2417), rep('65+', 2417), rep('O', 2417) ),
  cbind( rep('B', 534), rep('18-29', 534), rep('O', 534) ),
  cbind( rep('B', 404), rep('30-44', 404), rep('O', 404) ),
  cbind( rep('B', 404), rep('45-64', 404), rep('O', 404) ),
  cbind( rep('B', 104), rep('65+', 104), rep('O', 104) ),
  cbind( rep('H', 967), rep('18-29', 967), rep('O', 967) ),
  cbind( rep('H', 749), rep('30-44', 749), rep('O', 749) ),
  cbind( rep('H', 741), rep('45-64', 741), rep('O', 741) ),
  cbind( rep('H', 247), rep('65+', 247), rep('O', 247) ),
  cbind( rep('O', 197), rep('18-29', 197), rep('O', 197) ),
  cbind( rep('O', 197), rep('30-44', 197), rep('O', 197) ),
  cbind( rep('O', 197), rep('45-64', 197), rep('O', 197) ),
  cbind( rep('O', 197), rep('65+', 197), rep('O', 197) ),

  cbind( rep('W', 1490), rep('18-29', 1490), rep('R', 1490) ),
  cbind( rep('W', 1339), rep('30-44', 1339), rep('R', 1339) ),
  cbind( rep('W', 2388), rep('45-64', 2388), rep('R', 2388) ),
  cbind( rep('W', 1302), rep('65+', 1302), rep('R', 1302) ),
  cbind( rep('B', 247), rep('18-29', 247), rep('R', 247) ),
  cbind( rep('B', 627), rep('30-44', 627), rep('R', 627) ),
  cbind( rep('B', 648), rep('45-64', 648), rep('R', 648) ),
  cbind( rep('B', 162), rep('65+', 162), rep('R', 162) ),
  cbind( rep('H', 85), rep('18-29', 85), rep('R', 85) ),
  cbind( rep('H', 40), rep('30-44', 40), rep('R', 40) ),
  cbind( rep('H', 56), rep('45-64', 56), rep('R', 56) ),
  cbind( rep('H', 16), rep('65+', 16), rep('R', 16) ),
  cbind( rep('O', 61), rep('18-29', 61), rep('R', 61) ),
  cbind( rep('O', 61), rep('30-44', 61), rep('R', 61) ),
  cbind( rep('O', 61), rep('45-64', 61), rep('R', 61) ),
  cbind( rep('O', 61), rep('65+', 61), rep('R', 61) )
)

exitpoll2012 = data.frame(table.raw)
names(exitpoll2012) = c("RACE","AGE","VOTE")
table(table.raw[,1], table.raw[,2])
table(table.raw[,1], table.raw[,3])

# Candidate vote (Obama 'O' vs Romney 'R') as the covariate
f <- cbind(AGE, RACE) ~ VOTE
xp.lca <- poLCA(f, exitpoll2012, nclass=2)
table(exitpoll2012$AGE)

# Build a matrix of covariate values to prepare for graphing
my.mat.max = 4
my.mat <- cbind(1, c(1:my.mat.max))
exb <- exp(my.mat %*% xp.lca$coeff)

# Plot the estimated class membership probabilities against candidate vote
matplot(c(1:my.mat.max), (cbind(1,exb)/(1+rowSums(exb))), ylim=c(0,1), type="l",
        main="Candidate Vote as a Predictor of Candidate Affinity Class using Voter Race and Age",
        xlab="Candidate Vote: Obama (1) Romney (2)",
        ylab="Probability of latent class membership", lwd=2, col=c('blue','red'))
text(1.4, 0.25, "Romney Leaning")
text(1.4, 0.8, "Obama Leaning")

Hey, I Just Did a Significance Test!

I’ve seen it happen quite often: the sig test. Somebody simply needs to know the p-value, and that one number will provide all of the information about the study that they need. The dataset is presented and the client/boss/colleague/etc. invariably asks “is it significant?” and “what’s the correlation?”. To quote R.A. Fisher: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”

So obviously in the previous example there are probably lots of problems, ranging from the person making the request not knowing what they want, to the bigger problem that the study itself may have core structural problems that undermine the overall integrity of any results. I want to focus on the former and share some thoughts on how to do a couple of hypothesis tests and select the appropriate test. This is mainly a topic of parametric vs. non-parametric tests. I’ll show the test most people are probably most familiar with (the parametric test) and then show an alternative, more appropriate hypothesis test. I won’t get into the philosophical topics of hypothesis testing or whether .04999 is significant and .05001 is not. This simply provides some hypothesis testing options that too often get overlooked in market and/or social research. Yes, parametric tests provide more power, but not everything in this world fits nicely into one package with a bow on top.

Box Plot

What’s the Correlation

I’ve had this happen more often than I would care to count. I’m given a dataset and then asked “what’s the correlation?” After some back and forth I find that they want the “correlation” (with p-value) for all permutations of the questions on the questionnaire. So here goes:

library(e1071)
library(Hmisc)
library(reshape)
n=10
raw = data.frame( id=seq(1,n), replicate(2, rdiscrete(n, probs=c(1,1,1,1,1), values=c(1,2,3,4,5)) ) )

Pearson’s correlation or, more formally, the Pearson product-moment correlation coefficient, is the correlation with which most people are familiar. Be sure to check the assumptions (i.e. homoscedasticity, linearity, normality). However, Spearman’s rho or Kendall’s tau (depending on how you want to interpret the results) may in fact be the better option.


# My example gave the following
> cor(raw$X1,raw$X2, method="kendall")
[1] -0.191943
> cor(raw$X1,raw$X2, method="spearman")
[1] -0.2320708
> cor(raw$X1,raw$X2, method="pearson")
[1] -0.2573766

I Need a T-Test

This is a popular hypothesis test because people want to know if something is better (or worse) than something else, so the initial choice is a t-test. For example, did Group 1 make more money than Group 2, or did more people remember seeing an ad on theater screen 1 versus theater screen 2?  Rather than the parametric t-test, we can use the Mann-Whitney U test on data that doesn’t meet the t-test assumptions.

# Example when data is normalized
raw2 = melt(raw, id=c("id"))

#Mann-Whitney U Test (independent 2-groups)
wilcox.test(raw$X1,raw$X2, paired=FALSE, alternative="two.sided")
wilcox.test(raw2$value~raw2$variable, paired=FALSE, alternative="two.sided")

Is There Any Association in My Table

When testing tables and counts the first go-to test is the chi-square test. But suppose you have a table like this:

2 1
3 4

raw.chisq = data.frame( id=seq(1,n), replicate(2, rdiscrete(n, probs=c(1,1), values=c(1,2)) ) )
table(raw.chisq$X1, raw.chisq$X2)

Sure, you can just run your chi-square test and be done with it. But there is one small problem: the assumptions for a chi-square test are not met. Namely, the cell sizes are way too small. So what’s a researcher to do? This is where Fisher’s Exact Test works well. If we can assume that the marginal totals are given then we can solve the problem this way:

fisher.test(raw.chisq$X1, raw.chisq$X2)

Is There a Difference in My Three Groups

Yes, it’s true, you can run three t-tests on your groups (1 vs 2, 1 vs 3, 2 vs 3). But that causes not only extra work but also problems with the hypothesis test itself (multiple comparisons). Plus, why do multiple tests when you can be more efficient in your testing and do just one ANOVA? Here you can do a non-parametric Kruskal-Wallis rank sum test when you can’t make the assumptions for the parametric analysis of variance.

raw.anova = data.frame( id=seq(1,n),replicate(3, rdiscrete(n, probs=c(1,1,1,1,1), values=c(1,2,3,4,5)) ) )
raw.anova2 = melt(raw.anova, id=c("id"))
kruskal.test(raw.anova2$value ~ raw.anova2$variable)
fit.aov = lm(raw.anova2$value ~ raw.anova2$variable)
anova(fit.aov)
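
If the omnibus test does come back significant and the pairwise comparisons are still of interest, the multiple testing issue mentioned above can be handled with a p-value adjustment rather than three separate unadjusted tests (a sketch continuing from the data above):

# Pairwise Wilcoxon tests with a Bonferroni correction for multiple comparisons
pairwise.wilcox.test(raw.anova2$value, raw.anova2$variable, p.adjust.method = "bonferroni")

# Or, on the parametric side, Tukey's HSD on the ANOVA fit
TukeyHSD(aov(value ~ variable, data = raw.anova2))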

Ultimately, it is important to understand what you’re doing and why. Just because R, SPSS, SAS, etc. gives you a number doesn’t mean it’s correct.  If you find that something is significant, make sure you understand what it is saying and what you’re testing.  You don’t want to run an independent two-sample test only to find out that it should have been a matched pairs test.