Latent class analysis is a tool for identifying groups within multivariate categorical data, such as responses to Likert-scale survey items. In the categorical data literature these groups are known as latent classes. A simple point of comparison is k-means cluster analysis, but there are several key differences between the two methods. First, latent class analysis assigns observations to groups probabilistically, while k-means assigns each observation to exactly one group. Second, while k-means is readily available in many software packages, it is only appropriate for continuous data. Latent class analysis is less widely available, but it is designed to handle categorical data.
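To make the probabilistic assignment concrete, here is a minimal base R sketch. It assumes a hypothetical two-class model with three yes/no items; the class sizes and item probabilities are made up for illustration and are not estimated from any data in this post.

```r
# Hypothetical two-class model with three binary items (all numbers are made up)
class.prior <- c(0.6, 0.4)                 # P(class r)
# item.prob[r, j] = P(item j is answered "yes" | class r)
item.prob <- rbind(c(0.9, 0.8, 0.7),
                   c(0.2, 0.3, 0.1))

# Posterior P(class | responses) by Bayes' rule, assuming the items are
# independent within a class (the usual latent class assumption)
posterior <- function(y) {
  lik <- apply(item.prob, 1, function(p) prod(p^y * (1 - p)^(1 - y)))
  class.prior * lik / sum(class.prior * lik)
}

y <- c(1, 1, 0)            # one respondent: yes, yes, no
post <- posterior(y)       # soft assignment: one probability per class
print(round(post, 3))

hard <- which.max(post)    # a k-means-style hard label discards the uncertainty
print(hard)
```

The vector `post` is what a latent class model reports for each respondent, whereas a hard clustering method would only keep `hard`.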

There are a handful of latent class analysis software packages. Probably the best and most common is Latent GOLD. However, the license can be somewhat cost prohibitive, particularly if your daily routine does not include latent class modeling. Currently, SPSS does not include latent class analysis; IBM, the company that owns SPSS, has indicated that the enhancement request for latent class analysis has been added to SPSS Development. For SAS users there is *proc lca*, but once again that is somewhat cost prohibitive. On the open source side of things there are the R packages poLCA (latent class models for categorical indicators) and mclust (model-based clustering with Gaussian mixtures). Unless one needs the many features available in Latent GOLD, these R packages will generally be sufficient for data analysis.

In general latent class modeling has the following R code structure:

    set.seed(1234)
    library(e1071)     # rdiscrete()
    library(poLCA)     # poLCA()
    library(reshape2)  # dcast()

    ## An example of simulating Likert scale data:
    ## each column of probs holds the response probabilities for one item
    probs = cbind(c(.4,.2/3,.2/3,.2/3,.4),
                  c(.1/4,.1/4,.9,.1/4,.1/4),
                  c(.2,.2,.2,.2,.2))
    my.n = 1000

    # Stack the simulated responses in long format (item index i, response)
    raw = NULL
    for(i in 1:ncol(probs)){
      raw = rbind(raw, cbind(i,rdiscrete(my.n,probs=probs[,i],values=1:5)))
    }
    raw = data.frame(id = seq(1,my.n),raw)

    # An example of how to transform data back from normalized data to a flat file
    raw.flat = dcast(raw, id ~ i, value.var="V2")
    names(raw.flat) = c("id","A","B","C")

    # Simulation example of latent class models
    # Model with A as a covariate
    f = cbind(B, C) ~ A
    lca.fit1 <- poLCA(f,raw.flat,nclass=1, nrep=5)
    lca.fit2 <- poLCA(f,raw.flat,nclass=2, nrep=5)

    # Model with no covariates
    f = cbind(A, B, C)~1
    lca.fit1 <- poLCA(f,raw.flat,nclass=1, nrep=5)
    lca.fit2 <- poLCA(f,raw.flat,nclass=2, nrep=5)
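Once models with different numbers of classes have been fit, a common next step is to compare them. The following is a rough sketch of one way to do that using the bic, posterior and predclass elements that poLCA returns. The simulated two-class population and the names sim, draw.item and fits are assumptions made for this illustration, not part of the example above.

```r
library(poLCA)  # assumes the poLCA package is installed

set.seed(1234)
n <- 500

# Hypothetical population: class 1 favours low answers, class 2 high answers
true.class <- sample(1:2, n, replace = TRUE)
p.low  <- c(0.60, 0.20, 0.10, 0.05, 0.05)
p.high <- rev(p.low)

# Draw one 5-point item for every respondent, conditional on class
draw.item <- function(cls) {
  ifelse(cls == 1,
         sample(1:5, length(cls), replace = TRUE, prob = p.low),
         sample(1:5, length(cls), replace = TRUE, prob = p.high))
}
sim <- data.frame(A = draw.item(true.class),
                  B = draw.item(true.class),
                  C = draw.item(true.class))

# Fit 1-, 2- and 3-class models and collect their BIC values
f <- cbind(A, B, C) ~ 1
fits <- lapply(1:3, function(k) {
  poLCA(f, sim, nclass = k, nrep = 3, verbose = FALSE)
})
bic <- sapply(fits, function(m) m$bic)
names(bic) <- paste(1:3, "class")
print(bic)                          # lower BIC is preferred

# Inspect the 2-class solution
head(fits[[2]]$posterior)           # class membership probabilities per respondent
table(fits[[2]]$predclass)          # modal (most likely) class assignments
```

BIC is only one of several criteria in use, so treat the comparison as a guide rather than a rule.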

ANES 2000

The following is an example of how one can analyze data from the American National Election Study (ANES), an election study conducted for each election year. The 2000 data are available as a built-in data frame, election, in the poLCA package. However, I would recommend going to ElectionStudies and then to their Data Center to get the most recent dataset, from 2012.

Additionally, for great data on election analysis I would strongly encourage looking at the National Election Pool Exit Poll data. There are some great analyses that can be obtained through those data. However, the raw data are a bit more difficult to obtain (as of today the Roper Center has disabled all access to the raw data). Consequently, the analysis is fairly limited.

    library(poLCA)

    # Example dataset from the poLCA package
    data(election)

    # Build the model with PARTY as the covariate
    f <- cbind(MORALG,CARESG,KNOWG,LEADG,DISHONG,INTELG,
               MORALB,CARESB,KNOWB,LEADB,DISHONB,INTELB)~PARTY

    # Run LCA on the ANES 2000 dataset with 3 classes
    anes2000 <- poLCA(f,election,nclass=3,nrep=5)

    # Build a design matrix (intercept, party ID) to prepare for graphing.
    # PARTY is coded 1 to 7, so the matrix covers that range.
    my.mat.max = 7
    my.mat <- cbind(1,c(1:my.mat.max))
    exb <- exp(my.mat %*% anes2000$coeff)

    # Run the matrix plot
    matplot(c(1:my.mat.max),(cbind(1,exb)/(1+rowSums(exb))),ylim=c(0,1),type="l",
            main="Party ID as a predictor of candidate affinity class",
            xlab="Party ID: strong Democratic (1) to strong Republican (7)",
            ylab="Probability of latent class membership",
            lwd=2,col=c('blue','green','red'))

    # Label positions assume a particular class ordering; the ordering can
    # differ between runs, so check the output before trusting the labels
    text(5.9,0.35,"Other")
    text(5.4,0.7,"Bush affinity")
    text(2.5,0.6,"Gore affinity")
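The graphing step above converts the covariate coefficients into class membership probabilities with a multinomial logit, using the first class as the reference. As a small sketch of just that conversion, the following uses a hypothetical coefficient matrix. The numbers are made up and are not estimates from the election data.

```r
# Hypothetical coefficients for a 3-class model with one covariate.
# Rows: intercept, slope. Columns: class 2 vs class 1, class 3 vs class 1.
coeff <- rbind(c(-1.0, -3.0),
               c( 0.2,  0.8))

x <- 1:7                        # covariate values, e.g. a 7-point party ID scale
design <- cbind(1, x)           # add the intercept column
exb <- exp(design %*% coeff)    # odds of classes 2 and 3 relative to class 1

# Class 1 is the reference, so its numerator is exp(0) = 1
class.prob <- cbind(1, exb) / (1 + rowSums(exb))
colnames(class.prob) <- paste0("class", 1:3)
print(round(class.prob, 3))
```

Each row of `class.prob` sums to one, which is why the three curves in the plot always add up to a probability of 1 at every value of the covariate.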

National Election Pool Exit Poll 2012

Here is another example using the 2012 National Election Pool Exit Poll. In this example I simply pull the data directly from the published tables. This is meant as a basic example, and there are quite a few caveats (e.g. rounding, weighting, item nonresponse, using candidate vote, etc.) to creating a raw dataset this way, but the latent class model concept remains the same. Also, the All Other category is not broken out by age, so I simply divide the count evenly across the age groups (though this is probably not a completely accurate approach).

The *n* is 26565, so that will be the baseline. Any member of the National Election Pool’s websites (ABC, CBS, CNN, Fox, NBC) can be used for this data. Note that for some reason CBS has very wrong marginal data on their site for this table.

- http://elections.nbcnews.com/ns/politics/2012/all/president/#exitPoll
- http://www.foxnews.com/politics/elections/2012-exit-poll
- http://www.cbsnews.com/election-results-2012/exit.shtml?state=US&race=P&jurisdiction=0&party=G&tag=contentBody;exitLink
- http://abcnews.go.com/Politics/2012_Elections_Exit_Polls/
- http://www.cnn.com/election/2012/results/race/president

    library(poLCA)

    # Cell counts pulled directly from the tables and based on n of 26565
    # Columns: race, age group, candidate vote
    table.raw = rbind(
      cbind( rep('W', 1286), rep('18-29', 1286), rep('O', 1286) ),
      cbind( rep('W', 3395), rep('30-44', 3395), rep('O', 3395) ),
      cbind( rep('W', 5239), rep('45-64', 5239), rep('O', 5239) ),
      cbind( rep('W', 2417), rep('65+', 2417), rep('O', 2417) ),
      cbind( rep('B', 534), rep('18-29', 534), rep('O', 534) ),
      cbind( rep('B', 404), rep('30-44', 404), rep('O', 404) ),
      cbind( rep('B', 404), rep('45-64', 404), rep('O', 404) ),
      cbind( rep('B', 104), rep('65+', 104), rep('O', 104) ),
      cbind( rep('H', 967), rep('18-29', 967), rep('O', 967) ),
      cbind( rep('H', 749), rep('30-44', 749), rep('O', 749) ),
      cbind( rep('H', 741), rep('45-64', 741), rep('O', 741) ),
      cbind( rep('H', 247), rep('65+', 247), rep('O', 247) ),
      cbind( rep('O', 197), rep('18-29', 197), rep('O', 197) ),
      cbind( rep('O', 197), rep('30-44', 197), rep('O', 197) ),
      cbind( rep('O', 197), rep('45-64', 197), rep('O', 197) ),
      cbind( rep('O', 197), rep('65+', 197), rep('O', 197) ),
      cbind( rep('W', 1490), rep('18-29', 1490), rep('R', 1490) ),
      cbind( rep('W', 1339), rep('30-44', 1339), rep('R', 1339) ),
      cbind( rep('W', 2388), rep('45-64', 2388), rep('R', 2388) ),
      cbind( rep('W', 1302), rep('65+', 1302), rep('R', 1302) ),
      cbind( rep('B', 247), rep('18-29', 247), rep('R', 247) ),
      cbind( rep('B', 627), rep('30-44', 627), rep('R', 627) ),
      cbind( rep('B', 648), rep('45-64', 648), rep('R', 648) ),
      cbind( rep('B', 162), rep('65+', 162), rep('R', 162) ),
      cbind( rep('H', 85), rep('18-29', 85), rep('R', 85) ),
      cbind( rep('H', 40), rep('30-44', 40), rep('R', 40) ),
      cbind( rep('H', 56), rep('45-64', 56), rep('R', 56) ),
      cbind( rep('H', 16), rep('65+', 16), rep('R', 16) ),
      cbind( rep('O', 61), rep('18-29', 61), rep('R', 61) ),
      cbind( rep('O', 61), rep('30-44', 61), rep('R', 61) ),
      cbind( rep('O', 61), rep('45-64', 61), rep('R', 61) ),
      cbind( rep('O', 61), rep('65+', 61), rep('R', 61) )
    )

    # poLCA needs factors (or integers starting at 1), so convert the strings
    exitpoll2012 = data.frame(table.raw, stringsAsFactors = TRUE)
    names(exitpoll2012) = c("RACE","AGE","VOTE")

    # Check the marginals against the published tables
    table(table.raw[,1], table.raw[,2])
    table(table.raw[,1], table.raw[,3])
    table(exitpoll2012$AGE)

    # Two-class model with candidate vote as the covariate
    f <- cbind(AGE, RACE)~VOTE
    xp.lca <- poLCA(f,exitpoll2012,nclass=2)

    # VOTE is a two-level factor, so it enters the model as a 0/1 dummy
    # (O = 0, R = 1). Build the matching design matrix (intercept, dummy).
    my.mat <- cbind(1,c(0,1))
    exb <- exp(my.mat %*% xp.lca$coeff)

    # Run the matrix plot, placing Obama at x = 1 and Romney at x = 2
    matplot(c(1,2),(cbind(1,exb)/(1+rowSums(exb))),ylim=c(0,1),type="l",
            main="Candidate Vote as a Predictor of Candidate Affinity Class using Voter Race and Age",
            xlab="Candidate Vote: Obama (1) Romney (2)",
            ylab="Probability of latent class membership",
            lwd=2,col=c('blue','red'))
    text(1.4,0.25,"Romney Leaning")
    text(1.4,0.8,"Obama Leaning")
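The step of turning table cell counts into one row per respondent can also be written more compactly. Here is a minimal base R sketch of that idea, using a small hypothetical table; the counts and the names cells and raw.rows are made up for illustration and are not the exit poll figures.

```r
# Hypothetical cell counts (made-up numbers, far smaller than the exit poll table)
cells <- data.frame(
  RACE = c("W", "W", "B", "B"),
  AGE  = c("18-29", "30-44", "18-29", "30-44"),
  VOTE = c("O", "R", "O", "R"),
  n    = c(3, 2, 4, 1),
  stringsAsFactors = TRUE
)

# Repeat each row index n times, then keep only the three variables
idx <- rep(seq_len(nrow(cells)), times = cells$n)
raw.rows <- cells[idx, c("RACE", "AGE", "VOTE")]
rownames(raw.rows) <- NULL

nrow(raw.rows)                        # equals sum(cells$n)
table(raw.rows$RACE, raw.rows$VOTE)   # recovers the original cell counts
```

Keeping the counts in one data frame makes them easier to check against the published tables than a long series of rep() calls.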

I have a problem clustering data when there are both continuous and categorical variables. Is it possible to use latent class analysis to cluster that type of data? Is there an R library that can help me with this task?

Thanks!

Yes, use mclust rather than poLCA() in R. There are plenty of places on Google where you can get the code. Remember that it does not handle missing values, so either treat them or remove the respondents.