Waiting in One Line or Multiple Lines

Whenever I go to the grocery store it always seems to be a lesson in statistics. I gather the things I need to buy and then try to select the checkout register that will minimize the amount of time I have to wait. Inevitably, I pick the one line where there is some sort of problem, and I just stand there and wait and wait. I will often keep an eye on the lines I am not waiting in to see which one I would have made it through faster. So then the question is whether I should get out of line and try to find a faster one. As a statistician I know that wait time is often modeled with a memoryless exponential distribution, so I could very easily choose the line with the fewest people in front of me only to find that the cash register ran out of paper or there is some other delay. It is a cruel statistical game trying to find the one line that minimizes my wait time. But this game could easily be simplified, and it would also remove the anxiety of customers watching others who arrived after them finish their transactions sooner. By eliminating multiple lines and having all customers go through a single line, breaking off to a register only at the very end, customers would generally get through faster and with less variability. However, I do understand that sometimes there are operational limitations due to the physical store layout or other constraints.
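
As a small aside, the memoryless property mentioned above is easy to verify numerically. The sketch below is not part of the simulation itself; it simply checks that, for an exponential wait time with a three-minute mean, the probability of waiting another t minutes does not depend on how long you have already waited.

## Memorylessness check: P(X > s + t | X > s) = P(X > t) for an exponential wait time
set.seed(1)
x.wait = rexp(1e6, rate = 1/3)  ## three-minute mean, matching the simulation below
s = 2; t = 4
mean(x.wait[x.wait > s] > s + t)          ## conditional probability after already waiting s minutes
pexp(t, rate = 1/3, lower.tail = FALSE)   ## unconditional P(X > t); the two agree closely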

I ran a simulation of the two waiting-line strategies. In the first, there is a single line that everyone goes through, and customers break off at the end to the next available cash register. I have often seen this approach at banks, airport security, and U.S. Customs. In the second, there are multiple lines, and once you are in a line you stay there until you complete your transaction. This approach is common at Costco, Walmart, Target, and just about every other grocery store.

Multiple Lines Wait Time Max/Min

The assumption I made for this simulation is that each customer's service time follows an exponential distribution with a mean of three minutes (EXP(\theta=3)). The next graph compares the maximum and minimum total wait times across registers when using the single-line strategy.

Single Line Wait Time Max/Min

This final graph shows the range (maximum minus minimum total time) across the service locations (cash registers). When using multiple lines there is a wide gap between the busiest and the least busy location. When using only one line, that range is dampened because the process naturally equalizes the wait time at each location. Getting every last person out of the system takes roughly the same total amount of time either way, but with multiple lines a whole group of people can get stuck in one problem line, creating higher variability. Anyone involved in process control, such as manufacturing, knows that higher variability is generally not a good thing and is less efficient.
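
Once the simulation code below has been run, the difference in spread can be summarized directly from the two result matrices (aggreg.one.mat and aggreg.multi.mat, defined in that code); this is just a convenience sketch, not additional analysis.

## Summarize the max-minus-min range across registers for each simulation replicate.
## Assumes aggreg.one.mat and aggreg.multi.mat from the simulation code below.
range.one   = apply(aggreg.one.mat,   1, max) - apply(aggreg.one.mat,   1, min)
range.multi = apply(aggreg.multi.mat, 1, max) - apply(aggreg.multi.mat, 1, min)
summary(range.one)
summary(range.multi)
c(sd.one = sd(range.one), sd.multi = sd(range.multi))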

Single and Multiple Wait Time Range Difference


library(reshape2) ## For use in the acast() function
service.locations = 3 ## i.e. Number of cash registers
total.obs = service.locations*10 ## Just needed a number and wanted to make it divisible by the service locations
nsims = 10000 ## Number of simulation replicates
x = replicate(nsims, rexp(total.obs, 1/3))

## Set some loading variables

cum.sum = aggreg.multi.mat = aggreg.one.mat = NULL

for(j in 1:ncol(x)){
  ## Multiple lines: customers are assigned to registers in round-robin order
  ## (equivalent to random assignment since service times are iid)
  wait.multiple = cbind( ((1:total.obs) %% service.locations) + 1, x[,j] )
  aggreg.values = aggregate(wait.multiple[,2], by=list( wait.multiple[,1] ), sum)
  aggreg.multi.mat = rbind(aggreg.multi.mat, t( acast(aggreg.values, Group.1~., value.var="x") ) )

  ## One line: customers break off to a register only at the very end
  wait.bucket = matrix(NA, nrow=nrow(x), ncol=service.locations)
  queue = x[,j]

  ## Preload the first customers, one per service location
  for(d in 1:service.locations){
    wait.bucket[d,d] = x[d,j]
  }

  ## For each remaining customer, find the register with the smallest cumulative
  ## time so far (ignoring NA); that register serves the next customer
  for(i in (service.locations+1):nrow(x)){
    for(k in 1:service.locations){
      cum.sum[k] = sum(wait.bucket[1:i,k], na.rm=TRUE)
    }
    wait.bucket[i, which.min(cum.sum)] = x[i,j]
  }
  aggreg.one.mat = rbind(aggreg.one.mat, apply(wait.bucket, 2, sum, na.rm=TRUE))
}

## View the graphs
my.hist.one.1 = hist( apply(aggreg.one.mat,1,max), nclass=100, plot=FALSE)
my.hist.one.2 = hist( apply(aggreg.one.mat,1,min), nclass=100, plot=FALSE)
max.counts = max(my.hist.one.1$counts, my.hist.one.2$counts)
par(mfrow=c(2,1))
hist( apply(aggreg.one.mat,1,max), nclass=100, xlim=c(0,60), ylim=c(0,max.counts), col=3, main=expression(paste("Single-Line -- Distribution of Max Wait Time With EXP(",theta,"=3)")), xlab="Total Minutes Per Register")
hist( apply(aggreg.one.mat,1,min), nclass=100, xlim=c(0,60), ylim=c(0,max.counts), col=2, main=expression(paste("Single-Line -- Distribution of Min Wait Time With EXP(",theta,"=3)")), xlab="Total Minutes Per Register")

my.hist.data.1 = hist( apply(aggreg.multi.mat,1,max), nclass=100, plot=FALSE)
my.hist.data.2 = hist( apply(aggreg.multi.mat,1,min), nclass=100, plot=FALSE)
max.counts = max(my.hist.data.1$counts, my.hist.data.2$counts)
par(mfrow=c(2,1))
hist( apply(aggreg.multi.mat,1,max), nclass=100, xlim=c(0,60), ylim=c(0,max.counts), col=3, main=expression(paste("Multiple Lines -- Distribution of Max Wait Time With EXP(",theta,"=3)")), xlab="Total Minutes Per Register")
hist( apply(aggreg.multi.mat,1,min), nclass=100, xlim=c(0,60), ylim=c(0,max.counts), col=2, main=expression(paste("Multiple Lines -- Distribution of Min Wait Time With EXP(",theta,"=3)")), xlab="Total Minutes Per Register")

par(mfrow=c(2,1))
hist( apply(aggreg.one.mat,1,max)-apply(aggreg.one.mat,1,min), nclass=100, xlim=c(0,60), col=2, main="Single Line", xlab="Range Difference Between Max and Min in Minutes of Service Locations")
hist( apply(aggreg.multi.mat,1,max)-apply(aggreg.multi.mat,1,min), nclass=100, xlim=c(0,60), col=3, main="Multiple Lines", xlab="Range Difference Between Max and Min in Minutes of Service Locations")

Profile Likelihood for New Jersey U.S. Senate Special Election

As it stands right now Cory Booker has a very good chance of winning the New Jersey special U.S. Senate election on October 16 to replace Frank Lautenberg and fill the remaining 15 months of his term. With the election only about a month away, I took advantage of the pre-election polling data to produce a likelihood estimate based on the four most recent polls listed on www.realclearpolitics.com.

Contour Plot of Joint Likelihood

Though it may not be the perfect choice of distributions, the normal distribution will work well here, and the likelihood function can easily be used to produce the profile likelihood of the data. This makes it easier to describe one parameter at a time while eliminating the nuisance parameter. In this case we're particularly interested in the estimate of the mean, \mu.
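
For reference, profiling out the nuisance parameter in the normal model is straightforward: for a fixed \mu the log-likelihood is maximized at \hat{\sigma}^2(\mu) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2, which gives the profile log-likelihood

\ell_p(\mu) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\hat{\sigma}^2(\mu) - \frac{n}{2}.

This is exactly the quantity computed by the logLikeFun() function in the code below; only the middle term actually varies with \mu.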

Likelihood approaches can be useful because they not only show where the likelihood is maximized but also show other likely values of \theta.  Furthermore, likelihood approaches can have better small-sample properties than those based on asymptotic convergence (e.g. standard errors).

There are some differences between likelihood and Bayesian/Frequentist approaches.  In this case we can produce what is referred to as a likelihood interval.  Likelihood intervals suffer from a calibration problem: a '5% likelihood', say, does not have a specific operational meaning, whereas a 5% or 10% probability does.  So probability-based inference is often the better option.  But in some cases probability-based conclusions can be simply ridiculous and have effectively no useful value, and in those cases probability-based inference is probably not the best choice.

In some cases the likelihood interval is the same as the confidence interval.  In particular, for interval estimates of a normal mean, we can choose a likelihood cutoff of roughly 15% so that the interval corresponds to a 95% confidence interval in Frequentist terms.
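
The roughly 15% figure comes from the likelihood-ratio calibration for a single parameter: the cutoff that matches a 95% confidence interval is exp(-qchisq(0.95, 1)/2), which is about 0.147. A one-line check in R:

## Likelihood cutoff corresponding to a 95% confidence interval for a single parameter
exp(-qchisq(0.95, df = 1)/2)   ## approximately 0.1465, i.e. roughly a 15% cutoff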

How is the Likelihood Interval interpreted?

  • It can be interpreted the same way as a confidence interval if an exact or large-sample approximation is justified.  The likelihood must be reasonably regular, i.e. the log-likelihood can be approximated by a quadratic function (a quick graphical check is sketched below), though this requirement is quite forgiving.
  • If the likelihood is clearly not regular (as can happen in small-sample problems or with non-normal distributions), then the interval should be interpreted as a pure likelihood interval.
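
One informal way to check the regularity condition in the first bullet is to overlay a quadratic approximation on the profile log-likelihood. The sketch below is self-contained and uses the same four poll values that appear in the code later in this post; it is only meant as a visual check.

## Regularity check: profile log-likelihood of mu vs. its quadratic approximation
x  = c(64, 50, 54, 54)              ## the four poll values used below
n  = length(x)
mu = seq(mean(x) - 8, mean(x) + 8, length.out = 200)
ll = sapply(mu, function(m) -n/2 * log(sum((x - m)^2)/n))   ## profile log-likelihood, up to a constant
ll = ll - max(ll)
quad = -0.5 * (mu - mean(x))^2 / (sum((x - mean(x))^2)/n^2) ## quadratic approximation at the MLE
plot(mu, ll, type = "l", lwd = 2, xlab = expression(mu), ylab = "Relative log-likelihood")
lines(mu, quad, lty = 2, col = 2)
legend("topright", c("Profile log-likelihood", "Quadratic approximation"), lty = c(1, 2), col = c(1, 2))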

The following graphs show the estimate for Cory Booker as well as the variance of the four polls.  Given these data (as of Friday, Sep. 13) and assuming this distribution, the maximum likelihood estimate puts Cory Booker at 55.5%. You'll also note that, since this value is the MLE, it is the same value that www.realclearpolitics.com uses on their website. We can also see that the other likely values fall somewhere between 50.5% and 60.5%.
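
Since the model is normal, the MLE of \mu is simply the average of the four poll numbers, which is where the 55.5% comes from:

## The MLE of mu under the normal model is the sample mean of the four polls
mean(c(64, 50, 54, 54))   ## 55.5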

Contour Plot of Joint Likelihood

Cory Booker Profile Likelihood



## Profile Likelihoods

par(mfrow=c(1,1))

x = c(64,50,54,54)

boxplot(x, col=3, main='Boxplot of All Polls', ylab='Cory Booker Percent')

n = length(x)

np = 500
sx = sqrt(var(x))

mu.theta = seq(mean(x)-8,mean(x)+8,len=np)
sigma.theta = seq(sx/2.75,sx*2.75,len=np)

logLikeFun = function(mu, x){
  ## Profile log-likelihood of mu (sigma^2 profiled out).
  ## Only the middle term varies with mu; the first is constant and the last simplifies to -n/2.
  a = -n/2 * log( 2 * pi ) - n/2 * log( sum((mu-x)^2)/n ) - 1/(2 * sum((mu-x)^2)/n )*sum((mu-x)^2)
  a
}
logLikeFunSigma = function(mu,sigma){
  a = -n/2*log(sigma^2) -
    (sum(x^2) - 2*mu*sum(x) + n*mu^2)/(2*sigma^2)
  -a  ## returns the negative log-likelihood
}
li = function(th,like,alpha=0.15){
  that = mean(th[like==max(like)])
  lowth = th[th < that]
  lowlik = like[th < that]
  if (length(lowth) < 2){
    lowval = min(th)
  }
  if (length(lowth) > 1){
    lowval = approx(lowlik,lowth,xout=alpha)$y
  }

  upth = th[th > that]
  if (length(upth) < 2 ){
    return(c(lowval,max(th)))
  }
  if (length(upth) > 1){
    uplik = like[th > that]
    upval = approx(uplik,upth,xout=alpha)$y
    return(c(lowval,upval))
  }
}
## Joint Likelihood
ll.joint = outer(mu.theta, sigma.theta,'logLikeFunSigma')
like.joint = exp(min(ll.joint)-ll.joint)
contour(mu.theta, sigma.theta^2, like.joint,
xlab=expression(mu), ylab=expression(sigma^2),
level=c(.1,.3,.5,.7,.9), lwd=2,
main="Joint Likelihood Contour of\nCory Booker Estimate and Variance of Polls")

## Profile Likelihood for mean mu
log.like = lapply(mu.theta, logLikeFun, x)
log.like = unlist(log.like)
like = exp(log.like - max(log.like)) ## normalize the value to 1.0

## Profile Likelihood (mu, sigma^2=sigma.hat^2)
se = sqrt(var(x)*(n-1)/n)/sqrt(n)
estimate.like = dnorm(mean(x), mean=mu.theta, sd=se)
estimate.like = estimate.like/max(estimate.like)

## Profile Likelihood (mu, sigma=1)
log.like.sigma = logLikeFunSigma(mu.theta, sigma=1)
sigma.like = exp(min(log.like.sigma) - log.like.sigma)

## Plot the profile likelihoods of mu

plot(mu.theta, like, type='n', main=expression(paste('Profile Likelihood of ',mu,' for Cory Booker')),
ylab='Likelihood', xlab=expression(mu), ylim=c(0,1)) ## create blank graph of the correct size
lines(mu.theta, like,lty=1,lwd=2, col=2)
lines(mu.theta, estimate.like, lty=2, lwd=2, col=3)
text(mean(x)+.75,0,format(mean(x),digits=3) )
abline(v=c(mean(x),li(mu.theta, like),li(mu.theta, estimate.like)), lty=1, lwd=c(2.5,1,1,1,1), col=c(1,2,2,3,3))
abline(h=.15)
legend('topright',c(expression(paste('L(',mu,')')),expression(paste('L(',mu,',',sigma^2,'=',hat(sigma)^2,') ' ))), lwd=2, col=c(2,3), bg='#FFFFFF')


A Monty Hall Simulation

I was listening to the Sep. 6th episode of Science Friday, and one of the discussions by Ira Flatow was on the well-known Monty Hall problem.  This problem has been hashed out many times; in fact, I was first introduced to its probability aspects in my introductory statistics class as an undergraduate.  Though the solution is not initially intuitive, it is fairly straightforward, and a discussion is available on Wikipedia. I figured I would take a few minutes to play the game a couple million times.  The problem is so ubiquitous that Mythbusters did an episode on it back in 2011 where they, too, kept playing the game over and over.  So there is no need for me to detail it here, but below is a brief simulation showing that the probability of winning, if you change doors, is about 0.66 (i.e. 2/3).  The following graphs show the cumulative results of the first 100 iterations comparing the two strategies.

Monty Hall Simulation


library(e1071)
doors = c("A","B","C")
nsim=1000000

sim.mat = matrix(NA, nrow=nsim, ncol=5)
winning.loc = rdiscrete(nsim,doors, prob=rep(1,length(doors)))
first.choice = rdiscrete(nsim,doors, prob=rep(1,length(doors)))
sim.mat[,1] = winning.loc
sim.mat[,2] = first.choice

for(i in 1:length(winning.loc)){
  if(winning.loc[i] != first.choice[i]){
    ## Contestant has selected a goat. The host must show the other goat and not the winner.
    sim.mat[i,3] = doors[ !(doors %in% c(winning.loc[i], first.choice[i])) ]
  } else {
    ## Contestant has selected the winner; the host randomly selects one of the other two doors.
    rand.goat.options = doors[ !(doors %in% c(winning.loc[i], first.choice[i])) ]
    rand.goat = rdiscrete(1, rand.goat.options, prob=rep(1,length(rand.goat.options)) )
    sim.mat[i,3] = rand.goat
  }
  ## Second choice when the contestant changes doors
  sim.mat[i,5] = doors[ !(doors %in% c(sim.mat[i,3], first.choice[i])) ]
}
## Second choice is the same as the first choice
sim.mat[,4] = sim.mat[,2]
## names = TRUE.VAL, FIRST.CHOICE, GOAT.LOC.SHOWN, SECOND.CHOICE.STAY
sim.val.stay = as.numeric( (sim.mat[,4]==sim.mat[,1])*1 )
sim.val.change = as.numeric( (sim.mat[,5]==sim.mat[,1])*1 )
mean(sim.val.stay)
mean(sim.val.change)

## Probability by iteration
convergence.prob.change = cumsum(sim.val.change)/cumsum(rep(1,length(sim.val.change)))
convergence.prob.stay = cumsum(sim.val.stay)/cumsum(rep(1,length(sim.val.stay)))

par(mfrow=c(1,2))
plot(head(convergence.prob.stay, 100), pch=16, cex=.5,
ylab="Probability",
main="Probability of Winning\nKeeping the Same Door")
plot(head(convergence.prob.change, 100), pch=16, cex=.5, ylab="Probability",
main="Probability of Winning\nChanging Doors")
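
As a side note, the same switching result can be obtained without the explicit for() loop: switching wins exactly when the first choice was not the winning door, since the host is then forced to reveal the only remaining goat. A compact, vectorized version of the check (a sketch, not a replacement for the simulation above):

## Vectorized check: switching wins exactly when the first pick was a goat
set.seed(42)
n.sim = 1e6
winning = sample(c("A","B","C"), n.sim, replace = TRUE)
first   = sample(c("A","B","C"), n.sim, replace = TRUE)
mean(first == winning)   ## probability of winning by staying, about 1/3
mean(first != winning)   ## probability of winning by switching, about 2/3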

The Eve of the NYC Democratic Mayoral Primary Election

It is the eve of the New York City Democratic mayoral primary election.  This is a simple follow-up to my post from last Friday, as I was curious how the final pre-Election Day polling was going to fall.  It's fairly clear who is going to get the most votes, but even though we have a very good idea who will take the nomination, it is still not guaranteed.  Using the same line of reasoning as in my previous post, we can calculate the probability that Bill de Blasio not only gets the most votes but also reaches the 40% threshold needed to avoid a run-off, which is scheduled for October 1st.  The following shows the five leading candidates and the distribution of their vote shares.  For the distribution shown in the second graph below I proportionally assigned the remaining undecided vote to each of the five main candidates.  There are other candidates who will likely get some votes, but only a few percentage points' worth.  For the purpose of this estimation and the graphs I am only showing the five leading candidates.

The results once again come from www.realclearpolitics.com; I used only the two most recent polls and averaged their results.  The biggest discrepancy between the two polls is in Christine Quinn's numbers, where there is a seven point difference. Another item to note is the difference in methodology: NBC/WSJ/Marist uses live phone interviewers while PPP uses IVR.  The first graph simply shows the results from the two polls combined.  In the second graph I take the undecided voters and proportionally assign them to the top five candidates.  If we believe that the undecided voters will break this way, then de Blasio will almost certainly win and avoid the run-off.
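
The proportional reallocation of the undecided vote is just a rescaling of each candidate's share. The sketch below mirrors what the full code further down does with the weighted poll average (the poll numbers and sample sizes are the same ones used in that code):

## Combine the two polls (weighted by sample size) and reallocate the undecided vote proportionally
polls = rbind(
PPP       = c(.38, .19, .13, .09, .05),
NBCWSJMAR = c(.36, .20, .20, .07, .05)
)
sizes  = c(750, 505)
shares = apply(polls, 2, weighted.mean, w = sizes)   ## combined shares for the five candidates
undecided = 1 - sum(shares)
shares + shares/sum(shares) * undecided              ## shares after assigning the undecided proportionally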

This first graph simply takes the votes as reported by PPP and NBC/WSJ/Marist.  It does not include the undecided voters, though we know that some of the undecided will likely go to each of the candidates.

Final NYC Democratic Estimation

The following graph proportionally assigns the undecided vote to each of the five top candidates.  This shows that Bill de Blasio has an extremely high likelihood of not only getting the most votes but also obtaining the minimum threshold.

NYC Democratic Mayor Primary Election Including Undecided Voters

library(MCMCpack)
## Set up the recent polls; only the most recent ones are used below
raw.1 = NULL
raw.1 = data.frame( rbind(
PPP = c(.38,.19,.13,.09,.05,750),
NBCWSJMAR = c(.36,.20,.20,.07,.05,505)
)
)
raw.1 = rbind(raw.1, c(apply(raw.1[,1:5],2,weighted.mean,raw.1[,6]),sum(raw.1[,6])))
names(raw.1) = c("Cand1","Cand2","Cand3","Cand4","Cand5","size")
raw.1$Other.und = 1-raw.1$Cand1-raw.1$Cand2-raw.1$Cand3-raw.1$Cand4-raw.1$Cand5
raw.1.no.und = data.frame(raw.1[3,1:5] + raw.1[3,1:5]/sum(raw.1[3,1:5])*raw.1[3,7], size=raw.1[3,6], Other.und=0)
raw = rbind(raw.1, raw.1.no.und)
###################################################################
## More than two candidates so Beta distribution won't work
## Function to randomly generate data from a dirichlet distribution
###################################################################
j= 4
prob.win = function(j,export=1){
p=rdirichlet(100000,
raw$size[j] *
c(raw$Cand1[j], raw$Cand2[j], raw$Cand3[j], raw$Cand4[j], raw$Cand5[j], 1-raw$Cand1[j]-raw$Cand2[j]-raw$Cand3[j]-raw$Cand4[j]-raw$Cand5[j])+1
)
if(export==1){
mean(p[,1]>p[,2] & p[,1]>=.40) ## de Blasio needs to beat Thompson AND get >= .40
} else {
return(p)
}
}

( cand1.win.probs = sapply(1:nrow(raw),prob.win) )

sim.dir = prob.win(4,export=2) ## set simulated data for plotting and determining parameters
## The shape parameters (shape1 and shape2) might need some manual adjusting and tweaking.
## But for demonstration purposes this will suffice
fit.distr.1 = fitdistr(sim.dir[,1], "beta",
start=list(shape1=10,shape2=10))
fit.distr.2 = fitdistr(sim.dir[,2], "beta",
start=list(shape1=10,shape2=10))
fit.distr.3 = fitdistr(sim.dir[,3], "beta",
start=list(shape1=10,shape2=10))
fit.distr.4 = fitdistr(sim.dir[,4], "beta",
start=list(shape1=10,shape2=10))
fit.distr.5 = fitdistr(sim.dir[,5], "beta",
start=list(shape1=10,shape2=10))

## Could also draw a histogram of simulated data
curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]),
ylim=c(0,60), xlim=c(0,.50), col=1, lty=1, lwd=2, ylab="Density", xlab="theta",
main="Distribution of the Democratic NYC Mayor Election 2013",
sub=paste("Probability that de Blasio beats Thompson AND gets >= 40% of Vote: ", round(cand1.win.probs[4],2) ) ) ## Candidate 1
curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col=3, lty=2, lwd=2) ## Candidate 2
curve(dbeta(x,fit.distr.3$estimate[1],fit.distr.3$estimate[2]), add=T, col=4, lty=3, lwd=2) ## Candidate 3
curve(dbeta(x,fit.distr.4$estimate[1],fit.distr.4$estimate[2]), add=T, col=5, lty=4, lwd=2) ## Candidate 4
curve(dbeta(x,fit.distr.5$estimate[1],fit.distr.5$estimate[2]), add=T, col=6, lty=5, lwd=2) ## Candidate 5
abline(v=c(median(sim.dir[,1]), median(sim.dir[,2]), median(sim.dir[,3]), median(sim.dir[,4]), median(sim.dir[,5])), col=c(1,3,4,5,6), lwd=2, lty=c(1,2,3,4,5))
legend("topright",c("de Blasio","Thompson","Quinn","Weiner","Liu"), lwd=2, col=c(1,3,4,5,6), lty=c(1,2,3,4,5))

Probability of Avoiding a Run-off in the NYC 2013 Democratic Primary Election

The New York City Democratic mayoral primary election is taking place this coming Tuesday (Sep. 10th) and there are several candidates in the running. Bill de Blasio is the front runner and is expected to win. However, there is a catch: even if he takes a plurality of the vote he may not actually win. In New York, if the leading candidate does not get at least 40% of the vote, then the election goes to a run-off on October 1st, meaning that Bill Thompson or Christine Quinn (whoever finishes second) will have a second chance to win. This type of problem can be a little difficult to handle using standard univariate statistics. When there are only two candidates, a simple Beta distribution can adequately model the data. However, with nine declared candidates the Beta distribution just doesn't work very well. The Dirichlet distribution is the multivariate generalization of the Beta and can be used to simulate the outcome when there are multiple candidates. A convenient feature of the Dirichlet is that its marginal distributions are Beta distributed, making computations on the marginals fairly simple.
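
The marginal property mentioned above is easy to verify by simulation: if (p_1, ..., p_k) follows a Dirichlet(a_1, ..., a_k) distribution, then each p_i is marginally Beta(a_i, a_0 - a_i), where a_0 is the sum of all the parameters. A quick sketch using rdirichlet() from MCMCpack (the parameter vector here is purely illustrative, not taken from any poll):

library(MCMCpack)
## Dirichlet marginal vs. the corresponding Beta density
set.seed(7)
alpha = c(40, 20, 18, 22)   ## illustrative parameter vector (hypothetical, not poll-based)
p = rdirichlet(100000, alpha)
hist(p[,1], breaks=100, freq=FALSE, main="Dirichlet marginal vs. Beta density", xlab="p[,1]")
curve(dbeta(x, alpha[1], sum(alpha)-alpha[1]), add=TRUE, col=2, lwd=2)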

I gave a presentation at an AAPOR conference a while back where I used this approach but expanded the model to all 50 states and then applied winner-take-all electoral votes (except for Maine and Nebraska, which split their electoral votes). In this way I was able to calculate the posterior distribution of the electoral vote and compute credible intervals and other distributional characteristics. This approach has proved to be a very good way to make estimates and a great way to visualize the distribution of the data.

So then the question is whether Bill de Blasio will get more votes than the next candidate, and what the probability is that he will take it all without a run-off. There have been several polls by well-respected polling organizations, including Quinnipiac, the NY Times, and NBC, among others.

This post is designed to be a more technical discussion of the underlying distributional components of the 2013 NYC Democratic mayoral election (and really of elections in general) rather than a comprehensive analysis. For simplicity of demonstration I will take the most recent available poll on www.realclearpolitics.com, which was a Quinnipiac poll fielded from Aug 8th to Sep 1.  More polls are certain to arrive after this weekend, so it will be interesting to see what happens to the probability distributions prior to Election Day.

Here we can see that de Blasio comes very close to obtaining not only a plurality of the vote but also the necessary 40%. However, one item that may not be emphasized in these polls is that there is still a group of undecided voters. Normally that does not make a major difference if the undecided voters are expected to break proportionally (or even somewhat proportionally), because in an ordinary election whoever takes the most votes simply wins. In this case, though, even a small portion of the undecided voters breaking for de Blasio could put him over the top and prevent a run-off. So what is the probability that he can avoid a run-off? By running a simulation we can impose the constraints that he must beat the next candidate AND get more than 40% of the vote. Here I set Thompson as the next candidate to compare against.

Based on the given data his chances of avoiding a run-off are not that great, as can be seen in the graph below. However, if he gains his proportional share of the undecided voters then his probability of avoiding a run-off increases. Not by much, but it ends up being just short of a 50/50 chance.  The given model can, of course, be improved with some work.

Maybe I'll put a little bit of work into the upcoming New Jersey Senate election with Cory Booker, and the New Jersey (Chris Christie) and Virginia (Cuccinelli vs. McAuliffe) gubernatorial elections.  I'll see if time permits.

 2013 NYC Democratic Mayor Distribution



library(MCMCpack)

## Set up several of the recent polls, though only the most recent one is used for the plot
raw.1 = NULL
raw.1 = data.frame( rbind(
Quinnipiac = c(.43,.20,.18,750),
NYT = c(.32,.18,.17,505),
Quinnipiac2= c(.36,.20,.21,602),
amNY = c(.29, .24,.17, 600),
NBC = c(.24, .18,.24, 355),
Quinnipiac.adj = c(.465, .226, .194, 750) ## adjust for undecided
)
)
names(raw.1) = c("Cand1","Cand2","Cand3","size")
raw.1$Other = 1-raw.1$Cand1-raw.1$Cand2-raw.1$Cand3
raw = raw.1

###################################################################
## More than two candidates so Beta distribution won't work
## Function to randomly generate data from a dirichlet distribution
###################################################################

prob.win = function(j,export=1){
p=rdirichlet(100000,
raw$size[j] *
c(raw$Cand1[j],raw$Cand2[j],raw$Cand3[j],1-raw$Cand1[j]-raw$Cand2[j]-raw$Cand3[j])
+1)
if(export==1){
mean(p[,1]>p[,2] & p[,1]>=.4) ## de Blasio needs to beat Thompson AND get >= .40
} else {
return(p)
}
}

( cand1.win.probs = sapply(1:nrow(raw),prob.win) )
sim.dir = prob.win(1,export=2) ## set simulated data for plotting and determining parameters
## The shape parameters (shape1 and shape2) might need some manual adjusting
fit.distr.1 = fitdistr(sim.dir[,1], "beta",
start=list(shape1=5,shape2=5))
fit.distr.2 = fitdistr(sim.dir[,2], "beta",
start=list(shape1=4,shape2=7))
fit.distr.3 = fitdistr(sim.dir[,3], "beta",
start=list(shape1=2,shape2=9))

## Could also draw a histogram of simulated data
curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]),
ylim=c(0,4), col=1, lty=1, lwd=2, ylab="Density", xlab="theta",
main="Distribution of the Democratic NYC Mayor Election 2013",
sub=paste("Probability that de Blasio beats Thompson AND gets > 40% of Vote: ", round(cand1.win.probs[1],2) ) ) ## Candidate 1
curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col=3, lty=2, lwd=2) ## Candidate 2
curve(dbeta(x,fit.distr.3$estimate[1],fit.distr.3$estimate[2]), add=T, col=4, lty=3, lwd=2) ## Candidate 3
abline(v=c(median(sim.dir[,1]), median(sim.dir[,2]), median(sim.dir[,3])), col=c(1,3,4), lwd=2, lty=c(1,2,3))
legend("topright",c("de Blasio","Thompson","Quinn"), lwd=2, col=c(1,3,4), lty=c(1,2,3))

The Beta Prior, Likelihood, and Posterior

The Beta distribution (and more generally the Dirichlet) is probably my favorite distribution.  However, sometimes only limited information is available when trying to set up the distribution.  For example, maybe you only know the lowest likely value, the highest likely value, and the median as a measure of center.  That information is sufficient to construct a basic form of the distribution.  The idea of this post is not to elaborate in detail on Bayesian priors and posteriors but to give a real working example of using a prior built from limited knowledge about the distribution, adding some collected data, and arriving at a posterior distribution along with a measure of its uncertainty.

The example below is a simple demonstration of how a prior distribution and current data can be combined to form a posterior distribution.  Earlier this year I gave a presentation at a conference where I extended this simple version of the code into something substantially more complex, using the Dirichlet distribution to make national predictions based on statewide and local samples.   So this approach has some very useful applied statistical properties and can be modified to handle some very complex problems.

Prior, Likelihood, and Posterior Distributions

A Note on Posterior Intervals

The posterior interval (also called a credible interval or credible region) provides a very intuitive measure of uncertainty.  Unlike a confidence interval (discussed in one of my previous posts), a credible interval does in fact provide the probability that the value lies within the interval.  The interval is based on calculating the probability of different parameter values given the data.  The graph above shows the different values (identified as theta) as well as the simulated posterior interval limits (alpha=.05 in black and .01 in light gray).  In other words, the probability that theta is a member of the 95% credible interval is 0.95 (written as P\left( \theta \in CI \right) = 0.95).
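
As a concrete illustration of how such an interval is read off, when the posterior is a Beta distribution (as in the example below) the equal-tailed credible interval comes directly from the Beta quantile function. The shape parameters here are arbitrary and only for illustration; the code further down computes the interval by simulation instead, which gives essentially the same answer.

## 95% equal-tailed credible interval for an illustrative Beta(30, 10) posterior
qbeta(c(0.025, 0.975), shape1 = 30, shape2 = 10)
## Simulation-based version of the same interval (the approach used later in this post)
quantile(rbeta(1e6, 30, 10), probs = c(0.025, 0.975))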


The Prior and Posterior Distribution: An Example

The beta.select() function is found in the LearnBayes package.  This is a great function because, by providing two quantiles, one can determine the shape parameters of the Beta distribution.  This is useful for finding the parameters (or a close approximation) of the prior distribution when only limited information is available.  If additional quantiles are known, they can be incorporated to better determine the shape parameters of the Beta distribution.
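
As a stand-alone example of what beta.select() does, two quantiles are enough to pin down the two shape parameters. The quantile judgments below are the same ones used in the full code that follows (a median of 0.85 and an effective maximum of 0.95):

library(LearnBayes)
## beta.select() returns the Beta shape parameters (a, b) matching two quantile statements
quantile.median = list(p = 0.5,     x = 0.85)
quantile.max    = list(p = 0.99999, x = 0.95)
beta.select(quantile.median, quantile.max)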


library(LearnBayes)
Q = data.frame(
quantile=c(
median=0.5,
maximum=0.99999,
minimum=0.00001),
prior=c(
median=0.85,
maximum=0.95,
minimum=0.60)
)

optimalBeta = function(Q) {
q1q = Q$quantile[1]
q1p = Q$prior[1]
q2q = Q$quantile[2]
q2p = Q$prior[2]
q3q = Q$quantile[3]
q3p = Q$prior[3]

# find the beta prior using quantile1 and quantile2
q.med = list(p=q1q, x=q1p)
q.max = list(p=q2q, x=q2p)
q.min = list(p=q3q, x=q3p)

# prior parameters using median and max, and median and min
prior.A = beta.select(q.med,q.max)
prior.B = beta.select(q.med,q.min)

prior.Aa = prior.A[1]
prior.Ab = prior.A[2]

prior.Ba = prior.B[1]
prior.Bb = prior.B[2]

## find the best possible beta prior
## Set a start and stop point range to find the best parameters
if (prior.Aa < prior.Ba) {
start.a = prior.Aa
stop.a = prior.Ba
} else {
start.a = prior.Ba
stop.a = prior.Aa
}

if (prior.Ab < prior.Bb) {
start.b = prior.Ab
stop.b = prior.Bb
} else {
start.b = prior.Bb
stop.b = prior.Ab
}
seq.a = seq(from=start.a, to=stop.a, length.out=1000)
seq.b = seq(from=start.b, to=stop.b, length.out=1000)

seq.grid = expand.grid(seq.a, seq.b)

prior.C.q1 = qbeta(q1q, seq.grid[,1], seq.grid[,2])
prior.C.q2 = qbeta(q2q, seq.grid[,1], seq.grid[,2])
prior.C.q3 = qbeta(q3q, seq.grid[,1], seq.grid[,2])

## Different distance measurements, manhattan, euclidean, or otherwise.
## It would be interesting to run a simulation to measure a variety of distance measurements.
prior.C.delta = abs(prior.C.q1 - q1p) + abs(prior.C.q2 - q2p) + abs(prior.C.q3 - q3p)
## prior.C.delta = sqrt( (prior.C.q1 - q1p)^2 + (prior.C.q2 - q2p)^2 + (prior.C.q3 - q3p)^2 )

optimize.seq = cbind(seq.grid, prior.C.q1, prior.C.q2, prior.C.q3, prior.C.delta)

## Minimize the delta; if the min-delta is not unique then choose the first occurrence
best.a = optimize.seq[,1][ optimize.seq[,6]==min(optimize.seq[,6]) ][1]
best.b = optimize.seq[,2][ optimize.seq[,6]==min(optimize.seq[,6]) ][1]

return(list(a=best.a,b=best.b))
}

prior.dist = optimalBeta(Q)

##########################################################
## Take a look at only the prior
##########################################################
curve(dbeta(x,prior.dist$a,prior.dist$b)) # plot the prior
abline(v=Q$prior[1])

##########################################################
## Take a look at only the likelihood with given successes
##########################################################
calcLikelihood = function(successes, total){
curve(dbinom(successes,total,x)) # plot the likelihood
}

calcLikelihood(45, 50) ## e.g. 45/50 successes
## calculate some properties of the Beta distribution
calcBetaMode = function(aa, bb) {
beta.mode = (aa - 1)/(aa + bb - 2)
return(beta.mode)
}
calcBetaMean = function(aa, bb) {
beta.mean = (aa)/(aa + bb)
return(beta.mean)
}
calcBetaVar = function(aa, bb) {
beta.var = (aa * bb)/(((aa + bb)^2) * (aa + bb + 1))
return(beta.var)
}
calcBetaMedian = function(aa, bb) {
beta.med = (aa-1/3)/(aa+bb-2/3)
return(beta.med)
}
calcBetaSkew = function(aa, bb) {
beta.skew = ( 2*(bb-aa)*sqrt(aa+bb+1) ) / ( (aa+bb+2)*sqrt(aa*bb) )
return(beta.skew)
}

##########################################################
## Take a look at the prior, likelihood, and posterior
##########################################################
priorToPosterior = function(successes, total, a, b) {
## Note the rule of succession
likelihood.a = successes + 1
likelihood.b = total - successes + 1

## Create posterior
posterior.a = a + successes;
posterior.b = b + total - successes
theta = seq(0.005, 0.995, length = 500)

## Calc density
prior = dbeta(theta, a, b)
likelihood = dbeta(theta, likelihood.a, likelihood.b)
posterior = dbeta(theta, posterior.a, posterior.b)

## Plot prior, likelihood, and posterior

## Can be used to scale down the graph if desired.
## However, the density is different for each prior, likelihood, posterior
m.orig = apply( cbind(prior, likelihood, posterior), 2, max)
m = max(c(prior, likelihood, posterior))

plot(theta, posterior, type = "l", ylab = "Density", lty = 2, lwd = 3,
main = paste("Prior: beta(", round(a,2), ",", round(b,2), "); Data: B(", total, ",", successes, "); ",
"Posterior: beta(", round(posterior.a,2), ",", round(posterior.b,2), ")", sep=""), ylim = c(0, m), col = 1)
lines(theta, likelihood, lty = 1, lwd = 3, col = 2)
lines(theta, prior, lty = 3, lwd = 3, col = 3)
legend("topleft",y=m, c("Prior", "Likelihood", "Posterior"), lty = c(3, 1, 2),
lwd = c(3, 3, 3), col = c(3, 2, 1))

prior.mode = calcBetaMode(a, b)
likelihood.mode = calcBetaMode(likelihood.a, likelihood.b)
posterior.mode = calcBetaMode(posterior.a, posterior.b)
prior.mean = calcBetaMean(a, b)
likelihood.mean = calcBetaMean(likelihood.a, likelihood.b)
posterior.mean = calcBetaMean(posterior.a, posterior.b)
prior.med = calcBetaMedian(a, b)
likelihood.med = calcBetaMedian(likelihood.a, likelihood.b)
posterior.med = calcBetaMedian(posterior.a, posterior.b)
prior.var = calcBetaVar(a, b)
likelihood.var = calcBetaVar(likelihood.a, likelihood.b)
posterior.var = calcBetaVar(posterior.a, posterior.b)
prior.skew = calcBetaSkew(a, b)
likelihood.skew = calcBetaSkew(likelihood.a, likelihood.b)
posterior.skew = calcBetaSkew(posterior.a, posterior.b)

print(paste("Mode: prior=",prior.mode,"; Likelihood=",likelihood.mode,"; Posterior=",posterior.mode))
print(paste("Mean: prior=",prior.mean,"; Likelihood=",likelihood.mean,"; Posterior=",posterior.mean))
print(paste("~Approx Median (for a and b > 1): prior=",prior.med,"; Likelihood=",likelihood.med,", for Posterior=",posterior.med))
print(paste("Var: prior=",prior.var,"; Likelihood=", likelihood.var,"; Posterior=",posterior.var))
print(paste("Skewness: prior=",prior.skew,"; Likelihood=",likelihood.skew,"; Posterior=",posterior.skew))
return(list(a=posterior.a,b=posterior.b))
}

posterior.out = priorToPosterior(25,50, prior.dist$a, prior.dist$b) # 25/50 is current data
beta.sim = rbeta(1000000,posterior.out$a, posterior.out$b)
abline(v=quantile(beta.sim, prob=c(.05/2, 1-.05/2)), col='#000000', lwd=2)
abline(v=quantile(beta.sim, prob=c(.01/2, 1-.01/2)), col='#EEEEEE', lwd=2)


Passing-Bablok Regression: R code for SAS users

While at the Joint Statistical Meetings a few weeks ago I was talking to a friend about various aspects of clinical trials. He indicated that no current R package reproduced Passing-Bablok (PB) regression in a way that exactly matched SAS. He ultimately wrote a couple of functions and kindly shared them with me so that I could share them here.

There is currently an R package (mcr) that will do Passing-Bablok regression but, as it turns out, the output does not precisely match SAS.  Sadly, I don't currently have a copy of SAS available, so I can't independently run the code and make that comparison myself, though I would be interested in seeing one.  However, I did run some examples below comparing the shared functions to the R mcr package.

These code snippets that he shared with me will only calculate the slope, intercept and the confidence interval for each coefficient. It’s not CRAN package ready but the short function and output are convenient and easy to use.

One interesting characteristic when comparing the two PB regressions (at least in the simulation below) is that the mcr package always produces a slightly smaller slope coefficient.  In the simulation below the maximum difference is .00054 and the minimum difference is essentially zero (1.59e-12). The details of the differences are something I will have to look into at a later date.

Passing-Bablok Comparison Scatter Plot

Passing-Bablok Regression Differences

library(mcr)

# Function to generate some correlated data
genData = function(numobs=100,randval){
R = matrix(cbind(1,.80, .80,1),nrow=2)
U = t(chol(R))
nvars = dim(U)[1]
set.seed(randval)
random.normal = matrix(rnorm(nvars*numobs,100,10), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = newX
return(raw)
}

PB.reg = function(X,Y,alpha=.05) {

  ## (1) calculate all pairwise slopes
  x <- X
  y <- Y
  dat <- cbind(x,y)
  n <- length(x)
  S <- array(NA, dim=rep(n,2))
  for(i in 1:(n-1)){
    for(j in (i+1):n) {
      if(i != j) {
        S[i,j] <- (y[i] - y[j])/(x[i] - x[j])
      }
    }
  }
  S <- sort(na.exclude(as.vector(S)))
  K <- sum(S <= -1) - .5 * sum(S == -1)
  N <- length(S)
  b <- ifelse(N%%2, S[(N+1)/2+K], mean(S[N/2+K+0:1]))

  ### CI for b
  C.gamma <- qnorm(1-alpha/2) * sqrt(n*(n-1)*(2*n+5)/18)
  M1 <- round((N-C.gamma)/2, 0)
  M2 <- N - M1 + 1
  CI.b <- c(LB=S[M1+K], UB=S[M2+K])

  a <- median(y - b*x)
  ### CI for a
  CI.a <- c(LB=median(y - CI.b["UB"]*x), UB=median(y - CI.b["LB"]*x))

  return(list(a=a, CI.a=CI.a, b=b, CI.b=CI.b))
}

# A quick function to give only the b coefficient
PB.fit.red = function(X,Y){
fit.lm = PB.reg(X, Y)
fit.lm$b
}

# Generate some one-time use correlated data
simData = genData(numobs=20, c(12345))

# Updated function
pb.fit = PB.reg(simData[,1], simData[,2])

# Look at just the b coefficient, single example
PB.fit.red(simData[,1], simData[,2])

# mcreg function, single example
pb.fit2 = mcreg(simData[,1], simData[,2], method.reg="PaBa", method.ci="bootstrap")
slot(pb.fit2, "glob.coef")[2]

# Run a small simulation
nsims = 250
sim.data = matrix(NA, nrow=nsims, ncol=2)

# for() loop used for demonstration purposes.
# Not efficient code; applying a function over the vectors is more efficient.
for(i in 1:nsims){
simData = genData(numobs=20, i)

# Look at just the b coefficient, multiple simulations
sim.data[i,1] = PB.fit.red(simData[,1], simData[,2])

# mcreg function, multiple simulations
pb.fit2 = mcreg(simData[,1], simData[,2], method.reg="PaBa", method.ci="bootstrap")
sim.data[i,2] = slot(pb.fit2, "glob.coef")[2]
}

delta = sim.data[,1] - sim.data[,2]
mean(delta)
max(delta)
min(delta)

hist(delta, nclass=50, main="Distribution of Differences in PB Regression")
plot(sim.data, main="Difference Between Two Types of Passing-Bablok Non-Parametric Regression",
xlab="Improved PB Function", ylab="mcr Package")