Occasionally, I will get requests from clients to calculate the mean. Most of the time it’s a simple request but from time-to-time the data was originally from grouped data. A common approach is to take the midpoint of each of the groups and just assume that all respondents within that group average out to the midpoint. The biggest problem I have run into has been trying to decide what to do with two extremes. What does one do with the age ’65+’ category or the ‘over $100,000’ category. Since there is really no mid-point one must make a (somewhat arbitrary) determination on what value to use for those categories. Over the years I have toyed around with various approaches but have found using this Monte Carlo approach has worked the best and seems to be most defensible. In this example I use a sample of voters who were asked their age and were given categorical response options. The final deliverable will be the average age and standard deviation.

Group | Frequency |

18 to <30 | 25 |

30 to <45 | 43 |

45 to <60 | 42 |

60+ | 31 |

I have written some code that I have traditionally used for this type of work but have since found that the *LearnBayes* package contains the necessary functions as well. For the purpose of making this code usable in other settings I have combined some of the code I have written with some of the functions from the *LearnBayes* package. This way I have been able to modify some of the algorithms and graphs to work under a variety of settings.

I welcome any comment on other approaches people have taken to solve similar problems using grouped, categorical data.

###Note that this function is also available through the VGAM package. ###However, it is displayed here as part of the example. library(graphics); laplace.dist <- function (logpost, mode, ...) { options(warn = -1) fit <- optim(mode, logpost, gr = NULL, ..., hessian = TRUE, control = list(fnscale = -1)) options(warn = 0) mode <- fit$par hess <- -solve(fit$hessian) l.mode <- length(mode) int <- l.mode/2 * log(2 * pi) + 0.5 * log(det(hess)) + logpost(mode, ...) ret.val <- list(mode = mode, var = hess, int = int, converge = fit$convergence == 0) return(ret.val) } metropolis.randwalk <- function (logpost, proposal, start, m, ...) { pb <- length(start) Mpar <- array(0, c(m, pb)) b <- matrix(t(start)) lb <- logpost(start, ...) a <- chol(proposal$var) scale <- proposal$scale accept <- 0 for (i in 1:m) { bc <- b + scale * t(a) %*% array(rnorm(pb), c(pb, 1)) lbc <- logpost(t(bc), ...) prob <- exp(lbc - lb) if (is.na(prob) == FALSE) { if (runif(1) < prob) { lb <- lbc b <- bc accept <- accept + 1 } } Mpar[i, ] <- b } accept <- accept/m ret.val <- list(par = Mpar, accept = accept) return(ret.val) } DataGroup <- function(theta, data){ cpoints=data$b freq <- data$f nbins <- length(cpoints) m <- theta[1] logsigma <- theta[2] z <- 0*m s <- exp(logsigma) z <- freq[1]*log(pnorm(cpoints[1],m,s)) for(j in 1:(nbins-1)){ z <- z+freq[j+1]*log(pnorm(cpoints[j+1],m,s)-pnorm(cpoints[j],m,s)) } z <- z+freq[nbins+1]*log(1-pnorm(cpoints[nbins],m,s)) return(z) } contour.plot = function (logf, limits, data, ...) { log.f = function(theta, data) { if (is.matrix(theta) == TRUE) { val = matrix(0, c(dim(theta)[1], 1)) for (j in 1:dim(theta)[1]) { val[j] = logf(theta[j,], data) } } else { val = logf(theta, data) } return(val) } ng <- 50 x0 <- seq(limits[1], limits[2], len = ng) y0 <- seq(limits[3], limits[4], len = ng) X <- outer(x0, rep(1, ng)) Y <- outer(rep(1, ng), y0) n2 <- ng^2 Z <- log.f(cbind(X[1:n2], Y[1:n2]), data) Z <- Z - max(Z) Z <- matrix(Z, c(ng, ng)) contour(x0, y0, Z, levels = seq(-6.9, 0, by = 2.3), lwd = 2) } lo <- c(18,30,45,60) hi <- c(30,45,60, Inf) d <- list(int.lo=lo, int.hi=hi, b = c(30,45,60), f=c(25,43,42,31)) y <- c(rep((18+30)/2,25),rep((30+45)/2,43),rep((45+60)/2,42),rep(61,31)) mean.y <- mean(y) log.sd.y <- log(sd(y)) start <- c(mean.y,log.sd.y) fit <- laplace.dist(DataGroup,start,d) fit modal.sds <- sqrt(diag(fit$var)) proposal <- list(var=fit$var,scale=2) fit2 <- metropolis.randwalk(DataGroup,proposal,start,10000,d) fit2$accept post.means <- apply(fit2$par,2,mean) post.sds <- apply(fit2$par,2,sd) cbind(c(fit$mode),modal.sds) ##These should be similar cbind(post.means,post.sds) contour.plot(DataGroup,c(min(post.means[1])-10,max(post.means[1]+10),min(post.means[2]-.5),max(post.means[2]+.5)),d, xlab="mu",ylab="log sigma") points(fit2$par[1:2000,1],fit2$par[1:2000,2])

I have a few comments about this problem. It’s very common for a client/customer/colleague to ask for something that is not mathematically correct (such as asking for a mean average from categorical data), and I have found that it takes tact and skill to counter with the correct analysis. In this case, it sounds like the customer may want a vague idea of how old most of the respondents are. Rather than give a mean average in this case, I would say that about 60% of the respondents report being between 30 and 60.

However, if we’re really stuck on a mean average, we’re in a tight spot. Categorical responses like these are ordered but not continuous, and we can’t assume any type of predictable distribution. That’s where the simulation starts to fail in real-life meaning. We can approximate a mean average based on a model, but we’re making a large assumption about the type of model is producing the results. We could try to make a Liekert-like assumption (probably not appropriate) and try to treat the categories as continuous, which would produce an average response of 2.56, which basically says the majority of the respondents are between 30 and 60, like I could already see from the 60% spread. This is why I start pushing back on requests like mean averages for categorical data. Providing meaningless numbers to a customer does not improve the customer’s ability to act on the numbers given.

We also have a problem with this particular example, since the categories are not mutually exclusive. The age ranges overlap, so it is theoretically possible that 48% of our respondents are age 30. This would seriously change our estimate of any mean average. In the worst case scenario, we could have a true mean average of just over 45, which could indicate any distribution from perfect uniformity to all respondents being either 30 or 60.

All that being said, there are still some times when you can’t talk a customer off a silly idea, and you have to provide something just to get on to more important things. If I absolutely had to provide a mean average for this data, I would provide the range. In this case, our mean average is between 38.9 and 58.9, assuming a maximum respondent age of 100. Anything more specific than that, and I would be either giving my customer bad information or setting myself up for embarrassment later. Nothing is more disquieting than having one of your non-specific numbers used as a lynch pin fact in an important meeting.

Thank you for the insights and pointing out the flaw in the table. I did adjust the table so that the groups reported are mutually exclusive. I completely agree that it can sometimes be difficult to talk a client out of idea that doesn’t have a solid statistical base. This is just one example and is for those situations where perfect data doesn’t exist. Like with any point estimate from a sample a range of likely values is required. This would be true even if all data were continuous. Whether it be a Frequentist confidence interval or a Bayesian posterior distribution credible interval this range should be provided. In this example the assumption is made that the sample is from a normal population with mean (mu) and standard deviation (sigma). We can also assess this example’s accuracy of the model approximation to the posterior by seeing that there is agreement between the laplace function and the simulated output from the MCMC algorithm.