May 2013 – Statistical Research

True story (no really, this did actually happen). While in grad school, one of the other teaching assistants was approached by one of the students and asked, “will mu go out with median?” The teaching assistant thought the play on words was pretty funny, laughed, and then cluelessly walked away. All of us other grad students were surprised because we knew that really was mean.

There are a lot of ways to calculate a measure of center. Here are several examples that include arithmetic mean, geometric mean, harmonic mean, and for good measure the median.

Arithmetic Mean

By far the most common is the mean (aka the average). This is simply summing a list of numbers and dividing by the count of those numbers. It is useful when there are many numbers that add up to a total. What does this tell us? If you were looking at a teeter-totter with a bunch of kids on it, the mean is where the bar balances. It doesn’t really matter how many kids you have on either side; it’s simply the point where the kids’ weights, each multiplied by its distance from that point, are even on each side.
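
As a rough illustration of the balance-point idea, here is a minimal sketch in R. The positions are made-up numbers, and it assumes every kid weighs the same, so only the positions matter.

```r
# Hypothetical positions (feet from the left end of the bar), one per kid.
# Assumption: all kids weigh the same.
positions = c(1, 2, 4, 9)

# Arithmetic mean: sum the values, then divide by the count
arith.mean = sum(positions) / length(positions)

# The deviations from the mean cancel out, which is why the bar balances there
balance.check = sum(positions - arith.mean)

arith.mean      # 4
balance.check   # 0
```

Note that three kids sit to the left of the balance point and only one to the right; the count on each side does not need to match.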

Geometric Mean

Lesser used is the geometric mean. This is used when quantities multiply together rather than add, which makes it the more appropriate mean when dealing with proportional growth. Take for example an investment in something like a 401k. If you get 8% growth the first year, 12% the second, and 11% the third, you would want to take the geometric mean. The rates can be rewritten as growth factors: 1.08 for the first year, 1.12 for the second, and 1.11 for the third. The geometric mean rate is the cube root of the product of the three growth factors, minus 1: $\left(1.08 \cdot 1.12 \cdot 1.11\right)^{\frac{1}{3}} - 1 \approx 10.32\%$ .

This table shows how the results from the geometric mean match the results when applying the rate year by year.

Year	Rate	Growth factor	Yearly	Geo-Mean
0			1000	1000
1	0.08	1.08	1080	1103.201691
2	0.12	1.12	1209.6	1217.053972
3	0.11	1.11	1342.66	1342.66
Geometric mean rate: 0.103201691
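
The table above can be reproduced with a few lines of R. This is only a sketch using the rates and the starting balance of 1000 from the example; the variable names are mine.

```r
rates  = c(0.08, 0.12, 0.11)   # yearly growth rates from the example
growth = 1 + rates             # growth factors 1.08, 1.12, 1.11
start  = 1000                  # starting balance used in the table

# Geometric mean rate: nth root of the product of the growth factors, minus 1
geo.rate = prod(growth)^(1 / length(growth)) - 1

# Balance when applying each year's own rate
yearly = start * cumprod(growth)

# Balance when applying the geometric mean rate every year
geo.mean.path = start * (1 + geo.rate)^(1:length(growth))

round(geo.rate, 4)                    # 0.1032
cbind(rates, growth, yearly, geo.mean.path)
```

The two balance columns differ in years one and two but agree after year three, which is exactly what makes the geometric mean the right average rate here.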

Harmonic Mean

Harmonic mean, like the arithmetic mean, is additive in nature: it is the reciprocal of the arithmetic mean of the reciprocals. As a result, the larger quantities get dampened down, so it can be used in some situations where there are large outliers. This mean is also useful in a variety of areas, including machine learning, where the F1 score is the harmonic mean of a classifier’s precision and recall.
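
To make the precision and recall case concrete, here is a small sketch in R. The precision and recall values are hypothetical, chosen only so the arithmetic is easy to follow.

```r
# Hypothetical classifier results (made-up numbers for illustration)
precision = 0.5
recall    = 1.0
scores    = c(precision, recall)

# Harmonic mean: count divided by the sum of the reciprocals
harmonic = length(scores) / sum(1 / scores)

# The usual F1 formula gives the same number
f1 = 2 * precision * recall / (precision + recall)

# The arithmetic mean is noticeably more generous
arithmetic = mean(scores)

harmonic     # 0.6666667
f1           # 0.6666667
arithmetic   # 0.75
```

The perfect recall does not rescue the weak precision as much as it would under the arithmetic mean, which is the dampening of larger quantities described above.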

Median

Medians are another example of a measure of center. Unlike the arithmetic mean, the median is less sensitive to outliers. For example, when determining a measure of center for national income, the mean income is pulled toward the very wealthy and so differs from the median income. In that case the median is a better measure of center, as it identifies the middle point where half the observations fall on either side.
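
A quick sketch in R shows how a single very large value pulls the mean but not the median. The incomes below are invented for illustration and are not real data.

```r
# Hypothetical incomes in thousands; the last one is a very wealthy outlier
incomes = c(30, 35, 40, 45, 50, 1000)

mean.income   = mean(incomes)     # pulled toward the outlier
median.income = median(incomes)   # middle of the sorted values

mean.income     # 200
median.income   # 42.5
```

Five of the six incomes are at or below 50, yet the mean lands at 200, while the median stays in the middle of the bulk of the data.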

The following code snippets show the three Pythagorean means (arithmetic, geometric, harmonic) as well as the median.

[sourcecode language="r"]
### Generate some fake data: column 1 = values, column 2 = weights
x = cbind(sort(rnorm(25,10,1)), rpois(25,10))

### Write a function for a weighted median
weighted.median = function(X, w=1){
  ### If a single value of 1 was entered then set up an array of equal weights
  if(length(w)==1){
    w = rep(1,length(X))
  }

  X = cbind(X,w)
  X = X[complete.cases(X), , drop=FALSE]
  y = X[order(X[,1]), , drop=FALSE] # Sort the matrix by value
  y = cbind(y,cumsum(y[,2])) # Attach the cumulative sum of the weights

  ### Locate the position of the weighted median.
  ### If the cumulative weight share hits exactly 0.5 then that point is used;
  ### otherwise the first point whose share passes 0.5 is used.
  ### Note: with equal weights and an even count this returns the lower
  ### middle value rather than averaging the two middle values as median() does.
  share = y[,3]/sum(y[,2])
  which.min.lim = min( which(share >= 0.5) )
  which.max.lim = max( c(0, which(share <= 0.5)) ) # 0 when no share is <= 0.5
  weighted.median = y[max(which.min.lim, which.max.lim),1]
  return(weighted.median)
}

harmonic.mean = function(x, w=1){
  if(length(w)==1){
    w = rep(1,length(x))
  }
  dem = w/x # Set up denominator values
  harmonic.mean = sum(w)/sum(dem) # Calculate harmonic mean
  return(harmonic.mean)
}

geometric.mean = function(x, w=1){
  if(length(w)==1){
    w = rep(1,length(x))
  }
  ### Computed on the log scale to avoid overflow with large weights
  geometric.mean = exp( sum(w * log(x)) / sum(w) )
  ### Same calculation just a different way
  # prod(x^w) ^ (1/sum(w))
  return(geometric.mean)
}

mean(x[,1])
weighted.mean(x[,1],x[,2])
median(x[,1])
weighted.median(x[,1],x[,2])
harmonic.mean(x[,1], x[,2])
harmonic.mean(x[,1])
geometric.mean(x[,1],x[,2])
geometric.mean(x[,1])

### Plot the values only (column 1), not the weights
hist(x[,1], nclass=100, xlim=c(10,11))
abline(v=weighted.mean(x[,1],x[,2]), col='red', lwd=2)
abline(v=weighted.median(x[,1],x[,2]), col='blue', lwd=2)
abline(v=harmonic.mean(x[,1], x[,2]), col='green', lwd=2)
abline(v=geometric.mean(x[,1],x[,2]), col='purple', lwd=2)
[/sourcecode]

While I was attending the American Association for Public Opinion Research (AAPOR) conference in Boston, MA, the topic of non-probability samples was something of a recurring theme. I attended the task force panel review on the topic. However, there is currently no commonly accepted solution.

It was about one year ago that Pew reported (Pew report) that their phone completion rate was down to 9%. I can’t imagine that it will be going up any time soon. That makes one wonder how much longer phone surveys can be considered a probability sample (and that doesn’t mention the whole issue with cell phone adoption). It is certainly not a sustainable method.

One thing is clear: the time has come, and something will need to be done to solve that problem. Some have even suggested that landline surveys be eliminated in favor of strictly cell phone surveys. However, that is probably a band-aid at best and is likely not sustainable either. Some are using sample matching with opt-in Web panels with varying degrees of success. Twitter, Facebook, and other social media are constantly thrown around too.

Reg Baker over at The Survey Geek has been heading up the AAPOR task force for the past couple of years, trying to solve this problem.

George Box stated that “all models are wrong, but some are useful”. I guess the same now applies to samples. It will be interesting to follow this topic. For a recent update, AAPOR just released their report.


Will Mu Go Out With Median

The Future of Non Probability Sampling