It is the eve of the New York City Democratic mayoral primary election. This is a quick follow-up to my post from last Friday, as I was curious how the final pre-Election Day polling would fall. It is fairly clear who is going to get the most votes; but even though we have a very good idea who will take the nomination, it is still not guaranteed. Using the same line of reasoning as in my previous post, we can calculate the probability that Bill de Blasio will not only get the most votes but also clear the 40% threshold needed to avoid the run-off scheduled for October 1st. The graphs below show the five leading candidates and their distributions of the vote. For the second graph I proportionally assigned the remaining undecided vote to each of the five main candidates. Other candidates will likely pick up some votes, but only a few percentage points' worth, so for the purpose of this estimation I show only the five leading candidates.

The results once again come from www.realclearpolitics.com; I used only the two most recent polls and averaged their results. The biggest discrepancy between the two polls is in Christine Quinn's numbers, where there is a seven-point difference. Another item to note is the difference in methodology: NBC/WSJ/Marist uses live phone operators while PPP uses IVR. The first graph simply shows the combined results from the two polls. In the second graph I take the undecided voters and proportionally assign them to the top five candidates. If we believe the undecided voters will break this way, then de Blasio will almost certainly win and avoid the run-off.
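As a quick illustration of the averaging step, the combined numbers can be reproduced with a sample-size-weighted mean. The shares and sample sizes below are the ones used in the full code listing at the end of the post:

```r
## Top-five shares and sample sizes from the two polls
## (PPP, n = 750; NBC/WSJ/Marist, n = 505)
ppp <- c(0.38, 0.19, 0.13, 0.09, 0.05); n.ppp <- 750
nbc <- c(0.36, 0.20, 0.20, 0.07, 0.05); n.nbc <- 505

## Sample-size-weighted average of the two polls
combined <- (ppp * n.ppp + nbc * n.nbc) / (n.ppp + n.nbc)
round(combined, 3)  # de Blasio leads at roughly 0.372
```

This is exactly what the `weighted.mean()` call in the full listing computes, just written out explicitly.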

This graph simply takes the votes as reported by PPP and NBC/WSJ/Marist. It does not include the undecided voters, though we know that some of the undecided vote will likely go to each of the candidates.

The following graph proportionally assigns the undecided vote to each of the top five candidates. This shows that Bill de Blasio has an extremely high likelihood of not only getting the most votes but also clearing the 40% minimum threshold.
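The proportional assignment itself is just one line of arithmetic. A minimal sketch, using the (rounded, so illustrative) combined shares from the weighted poll average:

```r
## Combined top-five shares (rounded from the weighted poll average)
shares <- c(deBlasio = 0.372, Thompson = 0.194, Quinn = 0.158,
            Weiner = 0.082, Liu = 0.050)
undecided <- 1 - sum(shares)                           # everything not in the top five
adjusted  <- shares + shares / sum(shares) * undecided # proportional slice to each
round(adjusted, 3)  # now sums to 1
```

Note that after reallocation, de Blasio's point estimate sits above the 40% run-off threshold, which is why the simulated probability for the adjusted scenario comes out so high.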

library(MCMCpack)   ## rdirichlet()
library(MASS)       ## fitdistr()

## Set up several of the recent polls, but only work with the most recent ones
raw.1 = NULL
raw.1 = data.frame(
  rbind(
    PPP       = c(.38, .19, .13, .09, .05, 750),
    NBCWSJMAR = c(.36, .20, .20, .07, .05, 505)
  )
)
raw.1 = rbind(raw.1, c(apply(raw.1[, 1:5], 2, weighted.mean, raw.1[, 6]), sum(raw.1[, 6])))
names(raw.1) = c("Cand1", "Cand2", "Cand3", "Cand4", "Cand5", "size")
raw.1$Other.und = 1 - raw.1$Cand1 - raw.1$Cand2 - raw.1$Cand3 - raw.1$Cand4 - raw.1$Cand5

## Proportionally assign the undecided vote to the five candidates
## (note: raw.1 here, not raw, which is not yet defined)
raw.1.no.und = data.frame(raw.1[3, 1:5] + raw.1[3, 1:5] / sum(raw.1[3, 1:5]) * raw.1[3, 7],
                          size = raw.1[3, 6], Other.und = 0)
raw = rbind(raw.1, raw.1.no.und)

###################################################################
## More than two candidates, so the Beta distribution won't work.
## Function to randomly generate data from a Dirichlet distribution
###################################################################
prob.win = function(j, export = 1){
  p = rdirichlet(100000,
        raw$size[j] * c(raw$Cand1[j], raw$Cand2[j], raw$Cand3[j], raw$Cand4[j], raw$Cand5[j],
                        1 - raw$Cand1[j] - raw$Cand2[j] - raw$Cand3[j] - raw$Cand4[j] - raw$Cand5[j]) + 1)
  if(export == 1){
    mean(p[, 1] > p[, 2] & p[, 1] >= .40)  ## de Blasio needs to beat Thompson AND get >= 40%
  } else {
    return(p)
  }
}

( cand1.win.probs = sapply(1:nrow(raw), prob.win) )

## Simulated data for plotting and determining beta parameters
sim.dir = prob.win(4, export = 2)

## The starting shape parameters (shape1 and shape2) might need some manual
## adjusting and tweaking, but for demonstration purposes this will suffice.
fit.distr.1 = fitdistr(sim.dir[, 1], "beta", start = list(shape1 = 10, shape2 = 10))
fit.distr.2 = fitdistr(sim.dir[, 2], "beta", start = list(shape1 = 10, shape2 = 10))
fit.distr.3 = fitdistr(sim.dir[, 3], "beta", start = list(shape1 = 10, shape2 = 10))
fit.distr.4 = fitdistr(sim.dir[, 4], "beta", start = list(shape1 = 10, shape2 = 10))
fit.distr.5 = fitdistr(sim.dir[, 5], "beta", start = list(shape1 = 10, shape2 = 10))

## Could also draw a histogram of the simulated data
curve(dbeta(x, fit.distr.1$estimate[1], fit.distr.1$estimate[2]),
      ylim = c(0, 60), xlim = c(0, .50), col = 1, lty = 1, lwd = 2,
      ylab = "Density", xlab = "theta",
      main = "Distribution of the Democratic NYC Mayor Election 2013",
      sub  = paste("Probability that de Blasio beats Thompson AND gets >= 40% of Vote:",
                   round(cand1.win.probs[4], 2))
)                                                                               ## de Blasio
curve(dbeta(x, fit.distr.2$estimate[1], fit.distr.2$estimate[2]), add = T, col = 3, lty = 2, lwd = 2)  ## Thompson
curve(dbeta(x, fit.distr.3$estimate[1], fit.distr.3$estimate[2]), add = T, col = 4, lty = 3, lwd = 2)  ## Quinn
curve(dbeta(x, fit.distr.4$estimate[1], fit.distr.4$estimate[2]), add = T, col = 5, lty = 4, lwd = 2)  ## Weiner
curve(dbeta(x, fit.distr.5$estimate[1], fit.distr.5$estimate[2]), add = T, col = 6, lty = 5, lwd = 2)  ## Liu
abline(v = c(median(sim.dir[, 1]), median(sim.dir[, 2]), median(sim.dir[, 3]),
             median(sim.dir[, 4]), median(sim.dir[, 5])),
       col = c(1, 3, 4, 5, 6), lwd = 2, lty = 1:5)
legend("topright", c("de Blasio", "Thompson", "Quinn", "Weiner", "Liu"),
       lwd = 2, col = c(1, 3, 4, 5, 6), lty = 1:5)

I’m somewhat confused by what it means to tweak the “shape1” and “shape2” parameters “by hand” for a beta-distributed statistic. How do I know what good starting and ending parameters are for the algorithm R employs? The documentation is not much help.

“Tweaking” refers to the starting point passed to the fitdistr() function. Different starting parameters can produce different output, so it may take some adjusting until good convergence is reached. I have found that if I don’t have a clue where to start, I just put in a couple of reasonable numbers, look at the output, and then adjust my starting parameters to better match what the fit returned. I may iterate through that process a few times to make sure the estimated parameter values aren’t drastically bouncing around based on the starting input.
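One way to make that iteration concrete (a sketch, not the original analysis — the beta draw below just stands in for one column of the Dirichlet simulation):

```r
library(MASS)  # fitdistr()

set.seed(42)
x <- rbeta(10000, shape1 = 40, shape2 = 60)  # stand-in for one simulated vote share

## First pass: rough starting values
fit1 <- fitdistr(x, "beta", start = list(shape1 = 10, shape2 = 10))

## Second pass: restart from the first pass's estimates and check the fit is stable
fit2 <- fitdistr(x, "beta", start = as.list(fit1$estimate))

round(fit1$estimate, 1)
round(fit2$estimate, 1)
```

If the two passes agree (and, for simulated data like this, land near the true shape parameters), the fit has converged; if the estimates bounce around between passes, keep adjusting the starting values.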