That’s Smooth

I had someone ask me the other day how to take a scatterplot and draw something other than a straight line through the graph using Excel.  Yes, it can be done in Excel and it’s really quite simple, but there are some limitations when using the stock Excel dialog screens. So it is probably in one’s best interest to use a higher quality statistical software package.

There are several ways to add a curve to a scatter plot. This could be something as simple as adding a polynomial to the model or something a little more complex like a kernel curve, LOESS, or splines.

Smoothers

Linear vs. Non-Linear

First, a distinction needs to be made between linear and non-linear regression. Linear regression can still produce "lines" that, when plotted on a graph, have a curved shape. Polynomial regression is a type of linear regression that produces curves and is often referred to as curvilinear regression. Specifically, if the partial derivative of the model with respect to a parameter is still a function of the unknown parameters, then the model is considered nonlinear. A polynomial, however, remains a linear combination of the parameters. Take the following linear model equation.

  y = \beta_{0} + \beta_{1}x + \beta_{2}x^{2} + \beta_{3}x^{3}

The partial derivatives, as seen below, are no longer functions of \beta. This means that even though the model includes polynomial terms it is still a linear model.

\begin{array}{l} \frac{\partial y}{\partial\beta_{0}}=1\\  \frac{\partial y}{\partial\beta_{1}}=x\\  \frac{\partial y}{\partial\beta_{2}}=x^{2} \\  \frac{\partial y}{\partial\beta_{3}}=x^{3}  \end{array}
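By contrast, consider a simple exponential model, where a parameter cannot be separated out as a linear coefficient:

  y = \beta_{0}e^{\beta_{1}x}

Here \frac{\partial y}{\partial\beta_{1}} = \beta_{0}xe^{\beta_{1}x} still contains the unknown parameters, so the model is nonlinear.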

Example Data

An example that I often use is the speed of a car and its fuel economy, which shows a strong curvilinear relationship. Many other types of data, including biological data and data relating speed and weight, often have some form of curved relationship.  This graph is from the 2012 Fuel Economy Guide.

FuelEconomy.gov Speed and MPG

Curvilinear data can also be simulated.  In this case I added a cubic component and show the straight-line regression for comparison.

Simulated Curve

The example I'll use is the cars dataset that ships with R in the datasets package.  It has two variables: the speed of the car and its stopping distance.

Polynomials

Polynomials can be a very good choice because a straightforward closed-form solution can easily be obtained.  Additionally, polynomial models are fairly easy to implement and can be readily compared to other predictive models.

Polynomial
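As a quick standalone sketch, a cubic polynomial fit to the cars data can be obtained directly with lm() and poly():

## Cubic polynomial fit to the cars data (speed vs. stopping distance)
fit.poly = lm(dist ~ poly(speed, 3), data = cars)
summary(fit.poly)
## Plot the data and the fitted curve, ordered by speed
plot(cars$speed, cars$dist, pch=16, cex=.5, xlab="Speed (MPH)", ylab="Stopping Distance (feet)")
lines(sort(cars$speed), predict(fit.poly)[order(cars$speed)], col="light blue")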

 

Smoothing Spline

Like other smoothers, the spline uses a set of points, referred to as knots, to control its smoothness.  The more knots, the tighter the fit of the model; fewer knots produce a smoother curve.  Here the natural spline (green) and the smoothing spline (blue) are fairly similar.

 Splines
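A minimal sketch of the knot effect on the cars data (the knot counts below are arbitrary choices for illustration):

## Fewer knots give a smoother curve; more knots follow the data more closely
fit.sp.smooth = smooth.spline(cars$speed, cars$dist, nknots = 5)   ## arbitrary small knot count
fit.sp.tight  = smooth.spline(cars$speed, cars$dist, nknots = 15)  ## arbitrary larger knot count
plot(cars$speed, cars$dist, pch=16, cex=.5, xlab="Speed (MPH)", ylab="Stopping Distance (feet)")
lines(fit.sp.smooth, col="green", lty=1)
lines(fit.sp.tight, col="blue", lty=2)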

Kernel Smoothing

Kernel smoothing uses a weighted average of the neighboring x/y points. The sensitivity is controlled by a predefined value called the bandwidth.  The greater the bandwidth the smoother the curve will be, as local variation is dampened by the additional data points, but it doesn't handle localized variation very well.  A smaller bandwidth can be very sensitive to local variation and will allow some very dramatic changes in the curve.  For example, here is a comparison of a bandwidth of 1.0 and 5.0.  The smoother curve is, of course, the larger bandwidth.

Kernel Comparison
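A minimal sketch of that bandwidth comparison with ksmooth() on the cars data:

## Smaller bandwidth follows local variation; larger bandwidth smooths it out
plot(cars$speed, cars$dist, pch=16, cex=.5, xlab="Speed (MPH)", ylab="Stopping Distance (feet)")
lines(ksmooth(cars$speed, cars$dist, kernel="normal", bandwidth=1), col="red")
lines(ksmooth(cars$speed, cars$dist, kernel="normal", bandwidth=5), col="purple")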

 

LOESS and LOWESS

This is a non-parametric locally weighted regression using a nearest-neighbor approach.  It too uses a value to control the smoothing.  In R this is referred to as the span, but it can also be referred to as the bandwidth, similar to kernel smoothing.  A nice feature that comes with LOESS is its ability to produce uncertainty around the prediction.  This means that confidence intervals can easily be produced and plotted.

 LOESS

Closing Thoughts

It can be a little troublesome trying to determine which model is best or whether the curve is over- or under-fit.  It may also depend on how much sensitivity to local variation you want to allow.  There are options available to help determine the approach you may want to use.  For example, the mgcv package in R has some features to help with this.  As can be seen in this example all the approaches result in very similar solutions.  If all you need is some sort of smoother through a scatterplot then it may be best to use the simplest approach.  However, if one is trying to make predictions then a lot more care needs to be taken to fit the model properly.

Example Code


library(graphics)
library(splines) # Used for the ns() function -- (natural cubic splines)

R = matrix(cbind(1,.99, .99,1),nrow=2)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 1000
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,10,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1")
raw$predictor1.3 = raw$predictor1^3
raw$predictor1.2 = raw$predictor1^2
fit = lm(raw$response ~ raw$predictor1.3)

plot(raw$response ~ raw$predictor1.3, pch=16, cex=.4, xlab="Predictor", ylab="Response", main="Simulated Data with Slight Curve")
abline(fit)

x = with(cars, speed)
y = with(cars, dist)
eval.length = 50

# This LOESS example shows two different R functions arriving at the same solution.
# Careful using the LOESS defaults as they differ and will produce different solutions.
fit.loess = loess.smooth(x, y, evaluation = eval.length,
family="gaussian", span=.75, degree=1)
fit.loess2= loess(y ~ x, family="gaussian",
span=.75, degree=1)

## Set a simple 95% CI on the fit.loess model
new.x = seq(min(x),max(x), length.out=eval.length)
ci = cbind(
predict(fit.loess2, data.frame(x=new.x)),
predict(fit.loess2, data.frame(x=new.x))+
predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2),
predict(fit.loess2, data.frame(x=new.x))-
predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2)
)

## Linear Model
fit = lm(y ~ x )

## Polynomial
fit.3 = lm(y ~ poly(x,3) )

## Natural Spline
fit.ns.3 = lm(y ~ ns(x, 3) )
## Smoothing Spline
fit.sp = smooth.spline(y ~ x, nknots=15)

plot(x,y, xlim=c(min(x),max(x)), ylim=c(min(y),max(y)), pch=16, cex=.5,
ylab = "Stopping Distance (feet)", xlab= "Speed (MPH)", main="Comparison of Models"
, sub="Splines")
## Add additional models on top of graph. It can get cluttered with all the models.

## LOESS with Confidence Intervals
matplot(new.x, ci, lty = c(1,2,2), col=c(1,2,2), type = "l", add=T)
## Linear
lines(new.x, predict(fit, data.frame(x=new.x)), col='orange', lty=3)
## Polynomial
lines(new.x, predict(fit.3, data.frame(x=new.x)), col='light blue', lty=4)
## Natural Spline
lines(new.x, predict(fit.ns.3, data.frame(x=new.x)), col='green', lty=5)
## Smoothing Spline
lines(fit.sp, col='blue', lty=6)
## Kernel Curve
lines(ksmooth(x, y, "normal", bandwidth = 5), col = 'purple', lty=7)
legend("topleft",c("Linear","Polynomial","Natural Spline","Smoothing Spline","Kernel"),
col=c('black','light blue','green','blue','purple'), lty=c(3,4,5,6,7), lwd=2)

 

The Uncertainty of Predictions

There are many kinds of intervals in statistics.  To name a few of the common ones: confidence intervals, prediction intervals, credible intervals, and tolerance intervals. Each is useful and serves its own purpose.

I’ve been recently working on a couple of projects that involve making predictions from a regression model and I’ve been doing some work on estimation and the variability of the estimates. Recently, in my case, I need to predict an individual outcome drawn from the distribution (this differs from estimating the mean of that distribution). Often I end up estimating several simultaneous individual outcomes. It follows that we need to assume the underlying regression model still applies and is appropriate for this new individual observation.

Prediction vs. Confidence Intervals

Prediction Intervals and Confidence Intervals

Prediction intervals are useful in that they give us an understanding of the likely values an individual observation may take.  There is a subtle, but big, difference between prediction intervals and confidence intervals.  With a confidence interval we are trying to obtain an upper and lower limit for a parameter.  For a prediction interval we are interested in the upper and lower limits on an individual observation.  One key item to note is that prediction intervals will always be wider than the corresponding confidence intervals.  The reason for this is that when making a prediction for a new observation ( \hat{y_{i}} ) we encounter not only the variability in estimating the mean response but also the variation of the individual observation around that mean.
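For the standard simple linear regression case the extra variance component is explicit; the estimated variance for the mean response at x_{h} is

  s^{2}\{\hat{y}_{h}\} = MSE\left[\frac{1}{n}+\frac{(x_{h}-\bar{x})^{2}}{\sum_{i}(x_{i}-\bar{x})^{2}}\right]

while predicting a new individual observation adds the variance of that observation around the mean:

  s^{2}\{pred\} = MSE\left[1+\frac{1}{n}+\frac{(x_{h}-\bar{x})^{2}}{\sum_{i}(x_{i}-\bar{x})^{2}}\right]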

These intervals often get confused with confidence intervals (which have plenty of nuances of their own). But they really need to be kept separate, because calling a prediction interval a confidence interval (or vice versa) will give a false understanding of the variability.

Doing Many Prediction Intervals

Sometimes it is necessary to estimate several new observations.  In this case we need an overall family confidence level of (1-\alpha). To compensate for simultaneously estimating many observations there are a couple of different procedures: Scheffé and Bonferroni.  It seems, based on various conferences that I've attended, that Bonferroni is the most common, probably because it is not only conservative in its application but also very simple to implement.
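A minimal sketch of the Bonferroni adjustment in R: to hold a family confidence level of (1-\alpha) across g simultaneous prediction intervals, each interval simply uses \alpha/g. The number of predictions and the new speed values below are hypothetical, chosen only for illustration.

g = 5                                    ## number of simultaneous predictions (hypothetical)
alpha = 0.05
fit.cars = lm(dist ~ speed, data = cars)
new.obs = data.frame(speed = c(5, 10, 15, 20, 25))
## Bonferroni: each of the g prediction intervals is computed at level 1 - alpha/g
predict(fit.cars, new.obs, interval = "prediction", level = 1 - alpha/g)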

Bayesian Posterior Predictive Distribution

It's worth mentioning the posterior predictive distribution using a Bayesian approach.  In Bayesian statistics the parameter is treated as random and often unknown (as compared to the frequentist view, where it is fixed but unknown).   So here we can use the posterior predictive distribution for new data. It also happens to be a good way to see how well the model performed.

Posterior Prediction Distribution

Example Code

library(LearnBayes)
alpha = 0.05
y = rnorm(100,0, 1)
x1= y+rnorm(100,1,1)
x2= y+rnorm(100,2,2)

##### Least-squares fit
fit=lm(y~x1+x2,x=TRUE,y=TRUE)

new = data.frame(x1 = seq(-5, 5, 0.5), x2=seq(-3, 7, 0.5))

pred.w.plim = predict(fit, new, interval = "prediction")
pred.w.clim = predict(fit, new, interval = "confidence")
matplot(new$x1, cbind(pred.w.clim, pred.w.plim[,-1]),
lty = c(1,2,2,3,3), col=c(1,2,2,3,3), type = "l",
ylab = "predicted y", xlab= "x1", main="Confidence vs. Prediction Intervals")
legend("topleft", c("Confidence","Prediction"), lwd=2, lty=c(2,3), col=c(2,3), cex=.7)

##### Sampling from posterior
theta.sample=blinreg(fit$y,fit$x,5000)

######### Model checking via posterior predictive distribution
pred.draws=blinregpred(fit$x,theta.sample)
pred.sum=apply(pred.draws,2,quantile,c(alpha/2,1-alpha/2))

par(mfrow=c(1,1))
ind=1:ncol( pred.sum )
ind.odd=seq(1,ncol( pred.sum ), by=2)
ind.even=seq(2,ncol( pred.sum ), by=2)
matplot(rbind(ind,ind),pred.sum,type="l",lty=1,col=1,
xlab="Identifier",ylab="Response Variable y", main="Posterior Predictive Distribution"
, xaxt='n')
axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75)
points(ind,y,pch=19, cex=.4)

out = (y > pred.sum[2,] | y < pred.sum[1,]) ## flag observations falling outside the posterior predictive interval

Waiting in One Line or Multiple Lines

Whenever I go to the grocery store it always seems to be a lesson in statistics. I get the things I need to buy and then I try to select the checkout register that will minimize the amount of time I have to wait. Inevitably, I select the one line where there is some sort of problem and I just sit there and wait and wait.  I will often mark the lines that I am not waiting in to see which line I would have made it through faster. So then the question is whether or not to get out of line and try to find a new line that is faster. As a statistician I know that wait time is often modeled using a memory-less exponential distribution, and I could very easily choose the line with the fewest people in front of me only to find that the cash register ran out of paper or there is some other delay. So it's a cruel statistical game trying to find the one line that minimizes my wait time. But this game could easily be simplified, while also removing the anxiety of customers watching others who arrived after them finish their transactions sooner. By eliminating multiple lines and having all customers go through a single line, breaking off only at the very end, customers would generally get through faster and with less variability. However, I do understand that sometimes there are operational limitations due to the physical store layout or otherwise.

I ran a simulation on the two wait-line strategies. The first is when there is only one line that everyone goes through, breaking off at the end to the next available cash register.  I have often seen this approach in places like banks, airport security, and U.S. Customs. The second strategy is that there are multiple lines, and once you're in a line you stay there until you complete your transaction.  This approach is often seen at places like Costco, Walmart, Target, and just about every other grocery store.

Multiple Lines Wait Time Max/Min

The assumption I made for this simulation is that the wait time follows an exponential distribution with a mean of three minutes (EXP(\theta=3)).  This next graph shows a comparison, when using a single-line strategy, between the maximum wait time and the minimum wait time.

Single Line Wait Time Max/Min
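As a side note, the memory-less property mentioned above is easy to verify by simulation; under EXP(\theta=3) the chance of waiting at least another three minutes is the same no matter how long you have already waited (the conditioning times below are arbitrary):

set.seed(1)
wait = rexp(1000000, rate = 1/3)            ## exponential wait times with a mean of 3 minutes
mean(wait[wait > 2] > 5)                    ## P(X > 5 | X > 2), arbitrary conditioning values
mean(wait > 3)                              ## P(X > 3) -- approximately the same
pexp(3, rate = 1/3, lower.tail = FALSE)     ## theoretical value, exp(-1), about 0.368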

This final graph shows the range difference for each of the service locations (cash registers).  When using multiple lines there is a wide range between the maximum total time at one of the locations and the minimum time at another.  However, when using only one line that range gets dampened down because the process tries to equalize the wait time at each of the locations.  Ultimately, in order to get every last person out of the system it will take roughly the same amount of time.  But what ends up happening is that a whole group of people can get stuck in one of the problem lines, creating higher variability.  And anyone who is involved in process control, such as manufacturing, knows that higher variability is generally not a good thing and is less efficient.

Single and Multiple Wait Time Range Difference


library(reshape2) ## For use in the acast() function
service.locations = 3 ## i.e. Number of cash registers
total.obs = service.locations*10 ## Just needed a number and wanted to make it divisible by the service locations
nsims = 10000 ## Number of simulation replicates
x = replicate(nsims, rexp(total.obs, 1/3))

## Set some loading variables

cum.sum = aggreg.multi.mat = aggreg.one.mat = NULL

for(j in 1:ncol(x)){
## Multiple lines assigned randomly (sequentially)
wait.multiple = cbind( (seq(1:total.obs) %% service.locations)+1, x[,j] )
aggreg.values = aggregate(wait.multiple[,2], by=list( wait.multiple[,1] ), sum)
aggreg.multi.mat = rbind(aggreg.multi.mat, t( acast(aggreg.values, Group.1~., value.var="x") ) )

## One line and breaking off at the very end
wait.bucket = matrix(NA, ncol=service.locations,nrow(x))
queue = x[,j]

## Preload the first service locations
for(d in (1:service.locations)){
wait.bucket[d,d] = x[d,j]
}

## Cumulative sum without NA then find min location time as it will serve the next customer
for(i in service.locations+1:(nrow(x)-service.locations) ){
for(k in 1:service.locations){
cum.sum[k] = sum(wait.bucket[1:i,k], na.rm=T)

}
wait.bucket[i,which(cum.sum==min(cum.sum), arr.ind=TRUE)] = x[i,j]
}
aggreg.one.mat = rbind(aggreg.one.mat, apply(wait.bucket, 2, sum, na.rm=T))

}

## View the graphs
my.hist.one.1 = hist( apply(aggreg.one.mat,1,max), nclass=100, plot=FALSE)
my.hist.one.2 = hist( apply(aggreg.one.mat,1,min), nclass=100, plot=FALSE)
max.counts = max(my.hist.one.1$counts, my.hist.one.2$counts)
par(mfrow=c(2,1))
hist( apply(aggreg.one.mat,1,max), nclass=100, xlim=c(0,60), col=3, main=expression(paste("Single-Line -- Distribution of Max Wait Time With EXP(",theta,"=3)")), xlab="Total Minutes Per Register")
hist( apply(aggreg.one.mat,1,min), nclass=100, xlim=c(0,60), col=2, main=expression(paste("Single-Line -- Distribution of Min Wait Time With EXP(",theta,"=3)")), xlab="Total Minutes Per Register")

my.hist.data.1 = hist( apply(aggreg.multi.mat,1,max), nclass=100, plot=FALSE)
my.hist.data.2 = hist( apply(aggreg.multi.mat,1,min), nclass=100, plot=FALSE)
max.counts = max(my.hist.data.1$counts, my.hist.data.2$counts)
par(mfrow=c(2,1))
hist( apply(aggreg.multi.mat,1,max), nclass=100, xlim=c(0,60), ylim=c(0,max.counts), col=3, main=expression(paste("Multiple Lines -- Distribution of Max Wait Time With EXP(",theta,"=3) ")), xlab="Total Minutes Per Register")
hist( apply(aggreg.multi.mat,1,min), nclass=100, xlim=c(0,60), col=2, main=expression(paste("Multiple Lines -- Distribution of Min Wait Time With EXP(",theta,"=3) ")), xlab="Total Minutes Per Register")

par(mfrow=c(2,1))
hist( apply(aggreg.one.mat,1,max)-apply(aggreg.one.mat,1,min), nclass=100, xlim=c(0,60), col=2, main="Single Line", xlab="Range Difference Between Max and Min in Minutes of Service Locations")
hist( apply(aggreg.multi.mat,1,max)-apply(aggreg.multi.mat,1,min), nclass=100, xlim=c(0,60), col=3, main="Multiple Lines", xlab="Range Difference Between Max and Min in Minutes of Service Locations")

Profile Likelihood for New Jersey U.S. Senate Special Election

As it stands right now Cory Booker has a very good chance of winning the New Jersey Special U.S. Senate election on October 16 to replace Frank Lautenberg and fill the remainder of his term for the next 15 months.  So with the election only about a month away I took advantage of some of the data from the pre-election polls to produce a likelihood estimate based on the four most recent polls listed on www.realclearpolitics.com.

Contour Plot of Joint Likelihood

Though it may not be the perfect choice of distributions, the normal distribution will work well and we can easily use the likelihood function to produce the profile likelihood of the data.  This makes it easier to describe one parameter at a time while eliminating the nuisance parameter.  In this case we’re particularly interested in the estimate, \mu.
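For the normal model used here the profile likelihood of \mu is obtained by maximizing out the nuisance parameter \sigma^{2} at each fixed value of \mu:

  L_{p}(\mu) = \max_{\sigma^{2}} L(\mu,\sigma^{2}) = L\left(\mu,\hat{\sigma}^{2}(\mu)\right), \qquad \hat{\sigma}^{2}(\mu)=\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\mu)^{2}

This is what the logLikeFun() function in the code below evaluates.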

Likelihood approaches can be useful as they not only show where the likelihood is maximized but also show other likely values of \theta.  Furthermore, likelihood approaches can have better small-sample properties than those based on asymptotic approximations (e.g. standard errors).

There are some differences between likelihood and Bayesian/frequentist approaches.  In this case we can produce what is referred to as a likelihood interval. However, likelihood intervals suffer from a calibration problem: a '5% likelihood', say, does not have a specific meaning, whereas a 5% or 10% probability does.  So probability-based inference can often be the better option.  But in some cases probability-based conclusions can simply be ridiculous and effectively have no useful value.  In those cases probability-based inference is probably not the best choice.

In some cases the likelihood interval is the same as the confidence interval.  Particularly for interval estimates, using a normal mean case, we can choose a cutoff of 15% such that it corresponds to a 95% confidence interval in Frequentist terms.
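That correspondence can be checked directly; under the usual large-sample result that -2\log of the likelihood ratio is approximately chi-square with one degree of freedom, the cutoff is

## likelihood cutoff that matches a 95% confidence interval in the regular, one-parameter case
exp(-qchisq(0.95, df = 1)/2)   ## roughly 0.147, i.e. about 15%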

How is the Likelihood Interval Interpreted?

  • It can be interpreted the same way as a confidence interval if an exact or large-sample approximation is justified.  The likelihood must be reasonably regular (i.e. the log-likelihood can be approximated by a quadratic function), though this requirement can be quite forgiving.
  • If the likelihood is clearly not regular (as can happen in small-sample problems or with non-normal distributions) then the interval can be interpreted as a pure likelihood interval.

The following graphs show the estimate for Cory Booker as well as the variance of the four polls.  Given these data (as of Friday, Sep. 13) and assuming this distribution, the maximum likelihood estimate puts Cory Booker at 55.5%. You'll also note that since this value is the MLE it is the same value that www.realclearpolitics.com uses on their website. We can also see that the other likely values are somewhere between 50.5% and 60.5%.

Contour Plot of Joint Likelihood

Cory Booker Profile Likelihood

 


## Profile Likelihoods

par(mfrow=c(1,1))

x = c(64,50,54,54)

boxplot(x, col=3, main='Boxplot of All Polls', ylab='Cory Booker Percent')

n = length(x)

np = 500
sx = sqrt(var(x))

mu.theta = seq(mean(x)-8,mean(x)+8,len=np)
sigma.theta = seq(sx/2.75,sx*2.75,len=np)

logLikeFun <- function(mu, x){
## However, only the middle portion of this equation is necessary
a = -n/2 * log( 2 * pi ) - n/2 * log( sum((mu-x)^2)/n ) - 1/(2 * sum((mu-x)^2)/n )*sum((mu-x)^2)
a
}
logLikeFunSigma = function(mu,sigma){
a = -n/2*log(sigma^2) -
(sum(x^2) - 2*mu*sum(x) + n*mu^2)/(2*sigma^2)
-a
}
li = function(th,like,alpha=0.15){
that = mean(th[like==max(like)])
lowth = th[th < that]
lowlik = like[th < that]
if (length(lowth) < 2){
lowval = min(th)
}
if (length(lowth) > 1){
lowval = approx(lowlik,lowth,xout=alpha)$y
}

upth = th[th > that]
if (length(upth) < 2 ){
return(c(lowval,max(th)))
}
if (length(upth) > 1){
uplik = like[th > that]
upval = approx(uplik,upth,xout=alpha)$y
return(c(lowval,upval))
}
}
## Joint Likelihood
ll.joint = outer(mu.theta, sigma.theta,'logLikeFunSigma')
like.joint = exp(min(ll.joint)-ll.joint)
contour(mu.theta, sigma.theta^2, like.joint,
xlab=expression(mu), ylab=expression(sigma^2),
level=c(.1,.3,.5,.7,.9), lwd=2,
main="Joint Likelihood Contour of\nCory Booker Estimate and Variance of Polls")

## Profile Likelihood for mean mu
log.like = lapply(mu.theta, logLikeFun, x)
log.like = unlist(log.like)
like = exp(log.like - max(log.like)) ## normalize the value to 1.0

## Profile Likelihood (mu, sigma^2=sigma.hat^2)
se = sqrt(var(x)*(n-1)/n)/sqrt(n)
estimate.like = dnorm(mean(x), mean=mu.theta, sd=se)
estimate.like = estimate.like/max(estimate.like)

## Profile Likelihood (mu, sigma=1)
log.like.sigma = logLikeFunSigma(mu.theta, sigma=1)
sigma.like = exp(min(log.like.sigma) - log.like.sigma)

## Profile likelihood for sigma^2

plot(mu.theta, like, type='n', main=expression(paste('Profile Likelihood of ',mu,' for Cory Booker')),
ylab='Likelihood', xlab=expression(mu), ylim=c(0,1)) ## create blank graph of the correct size
lines(mu.theta, like,lty=1,lwd=2, col=2)
lines(mu.theta, estimate.like, lty=2, lwd=2, col=3)
text(mean(x)+.75,0,format(mean(x),digits=3) )
abline(v=c(mean(x),li(mu.theta, like),li(mu.theta, estimate.like)), lty=1, lwd=c(2.5,1,1,1,1), col=c(1,2,2,3,3))
abline(h=.15)
legend('topright',c(expression(paste('L(',mu,')')),expression(paste('L(',mu,',',sigma^2,'=',hat(sigma)^2,') ' ))), lwd=2, col=c(2,3), bg='#FFFFFF')

 

A Monty Simulation

I was listening to Science Friday from Sep 6th, and one of the discussions by Ira Flatow was on the well-known Monty Hall Problem.  This problem has been hashed out many times; in fact I was first introduced to the probability aspects of the problem in my introductory statistics class as an undergraduate.  Though the solution is not initially intuitive it is fairly straightforward, and a discussion is available on Wikipedia. I figured I would take a few minutes to play the game a couple million times.  This problem is so ubiquitous that Mythbusters did an episode on it back in 2011 where they too kept playing the game over and over.  So there is no need for me to detail it out, but here's a brief simulation showing that the probability of winning, if you change doors, is about 2/3.  The following graphs show the cumulative results of the first 100 iterations comparing the two different strategies.

Monty Hall Simulation


library(e1071)
doors = c("A","B","C")
nsim=1000000

sim.mat = matrix(NA, nrow=nsim, ncol=5)
winning.loc = rdiscrete(nsim,doors, prob=rep(1,length(doors)))
first.choice = rdiscrete(nsim,doors, prob=rep(1,length(doors)))
sim.mat[,1] = winning.loc
sim.mat[,2] = first.choice

for(i in 1:length(winning.loc)){
if(winning.loc[i] != first.choice[i]){
## Contestant has selected a goat. Therefore the host must show the other goat and not the winner.
sim.mat[i,3] = doors[ !(doors %in% c(winning.loc[i], first.choice[i])) ]
} else {
rand.goat.options = doors[ !(doors %in% c(winning.loc[i], first.choice[i])) ]
## Contestant has selected the winner and the host then randomly selects one of the other two.
rand.goat = rdiscrete(1, rand.goat.options , prob=rep(1,length(rand.goat.options)) )
sim.mat[i,3] = rand.goat
}
## Second choice changes
sim.mat[i,5] = doors[ !(doors %in% c(sim.mat[i,3], first.choice[i])) ]
}
## Second choice is the same as the first choice
sim.mat[,4] = sim.mat[,2]
## names = TRUE.VAL, FIRST.CHOICE, GOAT.LOC.SHOWN, SECOND.CHOICE.STAY
sim.val.stay = as.numeric( (sim.mat[,4]==sim.mat[,1])*1 )
sim.val.change = as.numeric( (sim.mat[,5]==sim.mat[,1])*1 )
mean(sim.val.stay)
mean(sim.val.change)

## Probability by iteration
covergence.prob.change = cumsum(sim.val.change)/cumsum(rep(1,length(sim.val.change)))
covergence.prob.stay = cumsum(sim.val.stay)/cumsum(rep(1,length(sim.val.stay)))

par(mfrow=c(1,2))
plot(head(covergence.prob.stay, 100), pch=16, cex=.5,
ylab="Probability",
main="Probability of Winning\nKeeping the Same Door")
plot(head(covergence.prob.change, 100), pch=16, cex=.5, ylab="Probability",
main="Probability of winning\nChanging Doors")

The Eve of the NYC Democratic Mayoral Primary Election

It is the eve of the New York City Democratic mayoral primary election.  This is a simple follow-up to my post from last Friday, as I was curious how the final pre-Election Day polling was going to fall.  It's fairly clear who is going to get the most votes.  Though we have a very good idea who is going to take the nomination, it is still not guaranteed.  Using the same line of reasoning as in my previous post we can calculate the probability that Bill de Blasio will not only get the most votes but also reach the 40% threshold needed to avoid a run-off scheduled for October 1st.  The following shows the five leading candidates and their distribution of the vote.  For the distribution shown in the graph below I proportionally assigned the remaining undecided vote to each of the five main candidates.  There are other candidates who will likely get a few votes, but their share will only be a few percentage points.  For the purpose of this estimation and the graph I am only showing the five candidates.

The results once again come from www.realclearpolitics.com; I only used the two most recent polls and averaged their results.  The biggest discrepancy between the two polls is for Christine Quinn, where there is a seven-point difference. Another item to note is the difference in methodology between the two polls: NBC/WSJ/Marist uses live phone interviewers while PPP uses IVR.  The first graph simply shows the results from the two polls combined.  In the second graph I take the undecided voters and proportionally assign them to the top five candidates.  If we believe that the undecided voters will break this way then de Blasio will almost certainly win and avoid the run-off.

This graph simply takes the votes as provided by PPP and NBC/WSJ/Marist.  This does not include the undecided voters though we know that some of the undecided will likely go to each of the candidates.

FinalNYCDemocraticEstimation

The following graph proportionally assigns the undecided vote to each of the five top candidates.  This shows that Bill de Blasio has an extremely high likelihood of not only getting the most votes but also obtaining the minimum threshold.

NYC Democratic Mayor Primary Election Including Undecided Voters

library(MCMCpack)
## Set up several of the recent polls but will only work with the most recent ones
raw.1 = NULL
raw.1 = data.frame( rbind(
PPP = c(.38,.19,.13,.09,.05,750),
NBCWSJMAR = c(.36,.20,.20,.07,.05,505)
)
)
raw.1 = rbind(raw.1, c(apply(raw.1[,1:5],2,weighted.mean,raw.1[,6]),sum(raw.1[,6])))
names(raw.1) = c("Cand1","Cand2","Cand3","Cand4","Cand5","size")
raw.1$Other.und = 1-raw.1$Cand1-raw.1$Cand2-raw.1$Cand3-raw.1$Cand4-raw.1$Cand5
raw.1.no.und = data.frame(raw.1[3,1:5] + raw.1[3,1:5]/sum(raw.1[3,1:5])*raw.1[3,7],size=raw.1[3,6],Other.und=0)
raw = rbind(raw.1, raw.1.no.und)
###################################################################
## More than two candidates so Beta distribution won't work
## Function to randomly generate data from a dirichlet distribution
###################################################################
j= 4
prob.win = function(j,export=1){
p=rdirichlet(100000,
raw$size[j] *
c(raw$Cand1[j], raw$Cand2[j], raw$Cand3[j], raw$Cand4[j], raw$Cand5[j], 1-raw$Cand1[j]-raw$Cand2[j]-raw$Cand3[j]-raw$Cand4[j]-raw$Cand5[j])+1
)
if(export==1){
mean(p[,1]>p[,2] & p[,1]>=.40) ## de Blasio need to beat Thompson AND get >= .40
} else {
return(p)
}
}

( cand1.win.probs = sapply(1:nrow(raw),prob.win) )

sim.dir = prob.win(4,export=2) ## set simulated data for plotting and determining parameters
## The shape parameters (shape1 and shape2) might need some manual adjusting and tweaking.
## But for demonstration purposes this will suffice
fit.distr.1 = fitdistr(sim.dir[,1], "beta",
start=list(shape1=10,shape2=10))
fit.distr.2 = fitdistr(sim.dir[,2], "beta",
start=list(shape1=10,shape2=10))
fit.distr.3 = fitdistr(sim.dir[,3], "beta",
start=list(shape1=10,shape2=10))
fit.distr.4 = fitdistr(sim.dir[,4], "beta",
start=list(shape1=10,shape2=10))
fit.distr.5 = fitdistr(sim.dir[,5], "beta",
start=list(shape1=10,shape2=10))

## Could also draw a histogram of simulated data
curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]),
ylim=c(0,60), xlim=c(0,.50), col=1, lty=1, lwd=2, ylab="Density", xlab="theta",
main="Distribution of the Democratic NYC Mayor Election 2013",
sub=paste("Probability that de Blasio beats Thompson AND gets >= 40% of Vote: ", round(cand1.win.probs[4],2) ) ) ## Candidate 1
curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col=3, lty=2, lwd=2) ## Candidate 2
curve(dbeta(x,fit.distr.3$estimate[1],fit.distr.3$estimate[2]), add=T, col=4, lty=3, lwd=2) ## Candidate 3
curve(dbeta(x,fit.distr.4$estimate[1],fit.distr.4$estimate[2]), add=T, col=5, lty=3, lwd=2) ## Candidate 4
curve(dbeta(x,fit.distr.5$estimate[1],fit.distr.5$estimate[2]), add=T, col=6, lty=3, lwd=2) ## Candidate 5
abline(v=c(median(sim.dir[,1]), median(sim.dir[,2]), median(sim.dir[,3]), median(sim.dir[,4]), median(sim.dir[,5])), col=c(1,3,4,5,6), lwd=2, lty=c(1,2,3))
legend("topright",c("de Blasio","Thompson","Quinn","Weiner","Liu"), lwd=2, col=c(1,3,4,5,6), lty=c(1,2,3,4,5))

Probability of Avoiding a Run-off in the NYC 2013 Democratic Primary Election

The New York City mayoral Democratic primary election is taking place this coming Tuesday (Sep. 10th) and there are several candidates in the running. Bill de Blasio is the front runner and is expected to win. However, there is a catch. Even if he takes the plurality of the vote he may not actually win. In New York, if the candidate does not get at least 40% of the vote then the election goes into a run-off on October 1st, meaning that Bill Thompson or Christine Quinn (whoever takes second) will then have a second chance to win. This type of problem can be a little difficult to handle using standard univariate statistics. When there are only two candidates a simple Beta distribution can be used to adequately model the data. However, with nine declared candidates the Beta distribution just doesn't work very well. The Dirichlet distribution is the multivariate generalization of the Beta, and it can be used to simulate the outcome when there are multiple candidates. A convenient feature of the Dirichlet is that its marginal distributions are Beta distributed, making computations on the marginals fairly simple.
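A quick sketch of that marginal property: if (p_{1},...,p_{k}) is Dirichlet with parameters (a_{1},...,a_{k}) then p_{1} alone is Beta(a_{1}, \sum a_{j} - a_{1}). The parameter values below are hypothetical and chosen only for illustration.

library(MCMCpack)
set.seed(1)
a = c(40, 20, 18, 22)                        ## hypothetical Dirichlet parameters
p = rdirichlet(100000, a)
## Compare the simulated marginal of the first component with the implied Beta
mean(p[,1]); var(p[,1])
a[1]/sum(a)                                  ## Beta(a1, sum(a)-a1) mean
(a[1]*(sum(a)-a[1]))/(sum(a)^2*(sum(a)+1))   ## Beta(a1, sum(a)-a1) variance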

I gave a presentation at an AAPOR conference a while back where I used this approach but expanded the model to all 50 states and then applied winner-take-all electoral votes (except for Maine and Nebraska, which can split their electoral votes). In this way I was able to calculate the posterior distribution of the electoral vote and compute credible intervals and other distributional characteristics. This approach has proved to be a very good way to make estimates and a great way to visualize the distribution of the data.

So then the question is whether Bill de Blasio will get more votes than the next candidate and what is the probability that he will take it all without a run-off. There have been several polls by many well respected polling organizations including Quinnipiac, NY Times and NBC, among others.

This post is designed to be a more technical discussion of the underlying distributional components of the 2013 NYC Democratic mayoral election (and really elections in general) rather than a comprehensive discussion. For simplicity of demonstration I will take the most recent available poll on www.realclearpolitics.com, which was Quinnipiac, fielded from Aug 8th to Sep 1.  More polls are certain to arrive after this weekend so it will be interesting to see what happens to the probability distributions prior to Election Day.

Here we can see that de Blasio comes very close to obtaining not only a plurality of the vote but the necessary 40%. However, one item that may not be emphasized in these polls is that there is still a group of undecided voters. Normally that does not make a major difference if the undecided voters are expected to break proportionally (or even somewhat proportionally) across the candidates, because whichever candidate takes the most votes simply wins. In this case, though, even if a small portion of the undecided voters break for de Blasio that could put him over the top and prevent a run-off. So what is the probability that he can avoid a run-off? By running a simulation we can put in the constraints that he must beat the next candidate AND get more than 40% of the vote. In this case I set Thompson as the next candidate to compare.

Based on the given data his chances of avoiding a run-off are not that great, as can be seen in the graph below. However, if he gains his proportional share of the undecided voters then that will increase his probability of avoiding a run-off. Not by much, but it ends up being just short of a 50/50 chance.  The given model can, of course, be improved with some work.

Maybe I’ll put a little bit of work into the upcoming New Jersey Senate election with Cory Booker and the New Jersey (with Chris Christie) & Virginia (with Cuccinelli & McAuliffe) Governor elections.  I’ll see if time permits.

 2013 NYC Democratic Mayor Distribution

 


library(MCMCpack)

## Set up several of the recent polls but will only work with the most recent one
raw.1 = NULL
raw.1 = data.frame( rbind(
Quinnipiac = c(.43,.20,.18,750),
NYT = c(.32,.18,.17,505),
Quinnipiac2= c(.36,.20,.21,602),
amNY = c(.29, .24,.17, 600),
NBC = c(.24, .18,.24, 355),
Quinnipiac.adj = c(.465, .226, .194, 750) ## adjust for undecided
)
)
names(raw.1) = c("Cand1","Cand2","Cand3","size")
raw.1$Other = 1-raw.1$Cand1-raw.1$Cand2-raw.1$Cand3
raw = raw.1

###################################################################
## More than two candidates so Beta distribution won't work
## Function to randomly generate data from a dirichlet distribution
###################################################################

prob.win = function(j,export=1){
p=rdirichlet(100000,
raw$size[j] *
c(raw$Cand1[j],raw$Cand2[j],raw$Cand3[j],1-raw$Cand1[j]-raw$Cand2[j]-raw$Cand3[j])
+1)
if(export==1){
mean(p[,1]>p[,2] & p[,1]>=.4) ## de Blasio needs to beat Thompson AND get >= .40
} else {
return(p)
}
}

( cand1.win.probs = sapply(1:nrow(raw),prob.win) )
sim.dir = prob.win(1,export=2) ## set simulated data for plotting and determining parameters
## The shape parameters (shape1 and shape2) might need some manual adjusting
fit.distr.1 = fitdistr(sim.dir[,1], "beta",
start=list(shape1=5,shape2=5))
fit.distr.2 = fitdistr(sim.dir[,2], "beta",
start=list(shape1=4,shape2=7))
fit.distr.3 = fitdistr(sim.dir[,3], "beta",
start=list(shape1=2,shape2=9))

## Could also draw a histogram of simulated data
curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]),
ylim=c(0,4), col=1, lty=1, lwd=2, ylab="Density", xlab="theta",
main="Distribution of the Democratic NYC Mayor Election 2013",
sub=paste("Probability that de Blasio beats Thompson AND gets > 40% of Vote: ", round(cand1.win.probs[1],2) ) ) ## Candidate 1
curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col=3, lty=2, lwd=2) ## Candidate 2
curve(dbeta(x,fit.distr.3$estimate[1],fit.distr.3$estimate[2]), add=T, col=4, lty=3, lwd=2) ## Candidate 3
abline(v=c(median(sim.dir[,1]), median(sim.dir[,2]), median(sim.dir[,3])), col=c(1,3,4), lwd=2, lty=c(1,2,3))
legend("topright",c("de Blasio","Thompson","Quinn"), lwd=2, col=c(1,3,4), lty=c(1,2,3))

The Beta Prior, Likelihood, and Posterior

The Beta distribution (and more generally the Dirichlet) is probably my favorite distribution.  However, sometimes only limited information is available when trying to set up the distribution.  For example, maybe you only know the lowest likely value, the highest likely value, and the median as a measure of center.  That information is sufficient to construct a basic form of the distribution.  The idea of this post is not to elaborate in detail on Bayesian priors and posteriors but to give a real working example of using a prior with limited knowledge about the distribution, adding some collected data, and arriving at a posterior distribution along with a measure of its uncertainty.

The example below is a simple demonstration on how a prior distribution and current data can be combined and form a posterior distribution.  Earlier this year I gave a presentation at a conference where I modified this simple version of my code to be substantially more complex and I used the Dirichlet distribution to make national predictions based on statewide and local samples.   So this approach has some very useful applied statistical properties and can be modified to handle some very complex distributions.

Prior, Likelihood, and Posterior Distributions

A Note on Posterior Intervals

The posterior interval (also called a credible interval or credible region) provides a very intuitive way to describe the measure of uncertainty.  Unlike a confidence interval (discussed in one of my previous posts), a credible interval does in fact provide the probability that the value lies within the interval.  The interval is based on calculating the probability of different parameter values given the data.  The graph above shows the different values (identified as theta) as well as the simulated posterior interval limits (alpha=.05 in black and .01 in light gray).  In other words, the probability that theta is a member of the 95% credible interval is 0.95 (written as P\left( \theta \in CI \right) = 0.95).
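As a small illustration of that interpretation, with a Beta posterior the interval can be read straight off the posterior quantiles (the shape parameters below are hypothetical):

## 95% equal-tailed credible interval for a hypothetical Beta(20, 10) posterior
qbeta(c(.05/2, 1 - .05/2), 20, 10)
## The posterior probability that theta lies inside this interval is, by construction, 0.95
diff(pbeta(qbeta(c(.05/2, 1 - .05/2), 20, 10), 20, 10))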

 

The Prior and Posterior Distribution: An Example

The beta.select() function is found in the LearnBayes package.  This is a great function because, by providing two quantiles, one can determine the shape parameters of the Beta distribution.  This is useful for finding the parameters (or a close approximation) of the prior distribution given only limited information.  If additional quantiles are known then they can be incorporated to better determine the shape parameters of the Beta distribution.
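As a quick illustration of the call itself, using the median and upper quantile that are set up in the code below:

library(LearnBayes)
## Beta shape parameters for a prior with median 0.85 and 0.99999 quantile equal to 0.95
beta.select(list(p=0.5, x=0.85), list(p=0.99999, x=0.95))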


library(LearnBayes)
Q = data.frame(
quantile=c(
median=0.5,
maximum=0.99999,
minimum=0.00001),
prior=c(
median=0.85,
maximum=0.95,
minimum=0.60)
)

optimalBeta = function(Q) {
q1q = Q$quantile[1]
q1p = Q$prior[1]
q2q = Q$quantile[2]
q2p = Q$prior[2]
q3q = Q$quantile[3]
q3p = Q$prior[3]

# find the beta prior using quantile1 and quantile2
q.med = list(p=q1q, x=q1p)
q.max = list(p=q2q, x=q2p)
q.min = list(p=q3q, x=q3p)

# prior parameters using median and max, and median and min
prior.A = beta.select(q.med,q.max)
prior.B = beta.select(q.med,q.min)

prior.Aa = prior.A[1]
prior.Ab = prior.A[2]

prior.Ba = prior.B[1]
prior.Bb = prior.B[2]

## find the best possible beta prior
## Set a start and stop point range to find the best parameters
if (prior.Aa < prior.Ba) {
start.a = prior.Aa
stop.a = prior.Ba
} else {
start.a = prior.Ba
stop.a = prior.Aa
}

if (prior.Ab < prior.Bb) {
start.b = prior.Ab
stop.b = prior.Bb
} else {
start.b = prior.Bb
stop.b = prior.Ab
}
seq.a = seq(from=start.a, to=stop.a, length.out=1000)
seq.b = seq(from=start.b, to=stop.b, length.out=1000)

seq.grid = expand.grid(seq.a, seq.b)

prior.C.q1 = qbeta(q1q, seq.grid[,1], seq.grid[,2])
prior.C.q2 = qbeta(q2q, seq.grid[,1], seq.grid[,2])
prior.C.q3 = qbeta(q3q, seq.grid[,1], seq.grid[,2])

## Different distance measurements, manhattan, euclidean, or otherwise.
## It would be interesting to run a simulation to measure a variety of distance measurements.
prior.C.delta = abs(prior.C.q1 - q1p) + abs(prior.C.q2 - q2p) + abs(prior.C.q3 - q3p)
## prior.C.delta = sqrt( (prior.C.q1 - q1p)^2 + (prior.C.q2 - q2p)^2 + (prior.C.q3 - q3p)^2 )

optimize.seq = cbind(seq.grid, prior.C.q1, prior.C.q2, prior.C.q3, prior.C.delta)

## Minimize the delta, if the min-delta is not unique then choose the first occurence
best.a = optimize.seq[,1][ optimize.seq[,6]==min(optimize.seq[,6]) ][1]
best.b = optimize.seq[,2][ optimize.seq[,6]==min(optimize.seq[,6]) ][1]

return(list(a=best.a,b=best.b))
}

prior.dist = optimalBeta(Q)

##########################################################
## Take a look at only the prior
##########################################################
curve(dbeta(x,prior.dist$a,prior.dist$b)) # plot the prior
abline(v=Q$prior[1])

##########################################################
## Take a look at only the likelihood with given successes
##########################################################
calcLikelihood = function(successes, total){
curve(dbinom(successes,total,x)) # plot the likelihood
}

calcLikelihood(45, 50) ## e.g. 45/50 successes
## calculate some properties of the Beta distribution
calcBetaMode = function(aa, bb) {
beta.mode = (aa - 1)/(aa + bb - 2)
return(beta.mode)
}
calcBetaMean = function(aa, bb) {
beta.mean = (aa)/(aa + bb)
return(beta.mean)
}
calcBetaVar = function(aa, bb) {
beta.var = (aa * bb)/(((aa + bb)^2) * (aa + bb + 1))
return(beta.var)
}
calcBetaMedian = function(aa, bb) {
beta.med = (aa-1/3)/(aa+bb-2/3)
return(beta.med)
}
calcBetaSkew = function(aa, bb) {
beta.skew = ( 2*(bb-aa)*sqrt(aa+bb+1) ) / ( (aa+bb+2)*sqrt(aa*bb) )
return(beta.skew)
}

##########################################################
## Take a look at the prior, likelihood, and posterior
##########################################################
priorToPosterior = function(successes, total, a, b) {
## Note the rule of succession
likelihood.a = successes + 1
likelihood.b = total - successes + 1

## Create posterior
posterior.a = a + successes;
posterior.b = b + total - successes
theta = seq(0.005, 0.995, length = 500)

## Calc density
prior = dbeta(theta, a, b)
likelihood = dbeta(theta, likelihood.a, likelihood.b)
posterior = dbeta(theta, posterior.a, posterior.b)

## Plot prior, likelihood, and posterior

## Can be used to scale down the graph if desired.
## However, the density is different for each prior, likelihood, posterior
m.orig = apply( cbind(prior, likelihood, posterior), 2, max)
m = max(c(prior, likelihood, posterior))

plot(theta, posterior, type = "l", ylab = "Density", lty = 2, lwd = 3,
main = paste("Prior: beta(", round(a,2), ",", round(b,2), "); Data: B(", total, ",", successes, "); ",
"Posterior: beta(", round(posterior.a,2), ",", round(posterior.b,2), ")", sep=""), ylim = c(0, m), col = 1)
lines(theta, likelihood, lty = 1, lwd = 3, col = 2)
lines(theta, prior, lty = 3, lwd = 3, col = 3)
legend("topleft",y=m, c("Prior", "Likelihood", "Posterior"), lty = c(3, 1, 2),
lwd = c(3, 3, 3), col = c(3, 2, 1))

prior.mode = calcBetaMode(a, b)
likelihood.mode = calcBetaMode(likelihood.a, likelihood.b)
posterior.mode = calcBetaMode(posterior.a, posterior.b)
prior.mean = calcBetaMean(a, b)
likelihood.mean = calcBetaMean(likelihood.a, likelihood.b)
posterior.mean = calcBetaMean(posterior.a, posterior.b)
prior.med = calcBetaMedian(a, b)
likelihood.med = calcBetaMedian(likelihood.a, likelihood.b)
posterior.med = calcBetaMedian(posterior.a, posterior.b)
prior.var = calcBetaVar(a, b)
likelihood.var = calcBetaVar(likelihood.a, likelihood.b)
posterior.var = calcBetaVar(posterior.a, posterior.b)
prior.skew = calcBetaSkew(a, b)
likelihood.skew = calcBetaSkew(likelihood.a, likelihood.b)
posterior.skew = calcBetaSkew(posterior.a, posterior.b)

print(paste("Mode: prior=",prior.mode,"; Likelihood=",likelihood.mode,"; Posterior=",posterior.mode))
print(paste("Mean: prior=",prior.mean,"; Likelihood=",likelihood.mean,"; Posterior=",posterior.mean))
print(paste("~Approx Median (for a and b > 1): prior=",prior.med,"; Likelihood=",likelihood.med,", for Posterior=",posterior.med))
print(paste("Var: prior=",prior.var,"; Likelihood=", likelihood.var,"; Posterior=",posterior.var))
print(paste("Skewness: prior=",prior.skew,"; Likelihood=",likelihood.skew,"; Posterior=",posterior.skew))
return(list(a=posterior.a,b=posterior.b))
}

posterior.out = priorToPosterior(25,50, prior.dist$a, prior.dist$b) # 25/50 is current data
beta.sim = rbeta(1000000,posterior.out$a, posterior.out$b)
abline(v=quantile(beta.sim, prob=c(.05/2, 1-.05/2)), col='#000000', lwd=2)
abline(v=quantile(beta.sim, prob=c(.01/2, 1-.01/2)), col='#EEEEEE', lwd=2)

 

Passing-Bablok Regression: R code for SAS users

While at the Joint Statistical Meetings a few weeks ago I was talking to a friend about various aspects of clinical trials. He indicated that no current R package was able to perfectly reproduce Passing-Bablok (PB) regression so that it exactly matched SAS. He ultimately wrote a couple of functions and kindly shared them with me so that I could share them here.

There is currently an R package (mcr) that will do Passing-Bablok regression but, as it turns out, the output does not precisely match SAS.  Sadly, I don't currently have a copy of SAS available for this purpose so I can't independently run the code and comparisons, though I would be interested in an equivalent comparison.  However, I ran some examples below to compare against the R mcr package.

These code snippets that he shared with me will only calculate the slope, intercept and the confidence interval for each coefficient. It’s not CRAN package ready but the short function and output are convenient and easy to use.

One interesting characteristic when comparing the two PB regressions (at least in the simulation below) is that the mcr package always gives a slightly smaller coefficient.  In the simulation below the maximum difference is .00054 and the minimum difference is essentially zero (1.59e-12). The details of the differences are something I will look into at a later date.

Passing-Bablok Comparison Scatter Plot

Passing-Bablok Regression Differences

library(mcr)

# Function to generate some correlated data
genData = function(numobs=100,randval){
R = matrix(cbind(1,.80, .80,1),nrow=2)
U = t(chol(R))
nvars = dim(U)[1]
set.seed(randval)
random.normal = matrix(rnorm(nvars*numobs,100,10), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = newX
return(raw)
}

PB.reg = function(X,Y,alpha=.05) {

## (1) calculate all pairwise slopes
x <- X
y <- Y
dat <- cbind(x,y)
n <- length(x)
S <- array(NA,dim=rep(n,2))
for(i in 1:(n-1)){
for(j in (i+1):n) {
if(i != j) {
S[i,j] <- (y[i] - y[j])/(x[i] - x[j])
}
}
}
S <- sort(na.exclude(as.vector(S)))
K <- sum(S <= -1) - .5 * sum(S == -1)
N <- length(S)
b <- ifelse(N%%2, S[(N+1)/2+K], mean(S[N/2+K+0:1]))

### CI for b
C.gamma <- qnorm(1-alpha/2) * sqrt(n*(n-1)*(2*n+5)/18)
M1 <- round((N-C.gamma)/2,0)
M2 <- N - M1 + 1
CI.b <- c(LB=S[M1+K], UB=S[M2+K])

a <- median(y - b*x)

### CI for a
CI.a <- c(LB=median(y - CI.b["UB"]*x), UB=median(y - CI.b["LB"]*x))

return(list(a=a, CI.a=CI.a, b=b, CI.b=CI.b))
}

# A quick function to give only the b coefficient
PB.fit.red = function(X,Y){
fit.lm = PB.reg(X, Y)
fit.lm$b
}

# Generate some one-time use correlated data
simData = genData(numobs=20, c(12345))

# Updated function
pb.fit = PB.reg(simData[,1], simData[,2])

# Look at just the b coefficient, single example
PB.fit.red(simData[,1], simData[,2])

# mcreg function, single example
pb.fit2 = mcreg(simData[,1], simData[,2], method.reg="PaBa", method.ci="bootstrap")
slot(pb.fit2, "glob.coef")[2]

# Run a small simulation
nsims = 250
sim.data = matrix(NA, nrow=nsims, ncol=2)

# for() loop used for demonstration purposes.
# Applying a function over the vectors would be more efficient.
for(i in 1:nsims){
simData = genData( numobs=20, i )

# Look at just the b coefficient, multiple simulations
sim.data[i,1] = PB.fit.red(simData[,1], simData[,2])

# mcreg function, multiple simulations
pb.fit2 = mcreg(simData[,1], simData[,2], method.reg="PaBa", method.ci="bootstrap")
sim.data[i,2] = slot(pb.fit2, "glob.coef")[2]
}

delta = sim.data[,1] - sim.data[,2]
mean(delta)
max(delta)
min(delta)

hist(delta, nclass=50, main="Distribution of Differences in PB Regression")
plot(sim.data, main="Difference Between Two Types of Passing-Bablok Non-Parametric Regression",
xlab="Improved PB Function", ylab="mcr Package")

When Discussing Confidence Level With Others…

This post spawned from a discussion I had the other day. Confidence intervals are notoriously a difficult topic for those unfamiliar with statistics. I can’t really think of another statistical topic that is so widely published in newspaper articles, television, and elsewhere that so few people really understand. It’s been this way since the moment that Jerzy Neyman proposed the idea (in the appendix no less) in 1937.

What the Confidence Interval is Not

There are a lot of things that the confidence interval is not. Unfortunately many of these are often used to define confidence interval.

  • It is not the probability that the true value is in the confidence interval.
  • It is not that we will obtain the true value 95% of the time.
  • We are not 95% sure that the true value lies within the interval of the one sample.
  • It is not the probability that we are correct.
  • It does not say anything about how accurate the current estimate is.
  • It does not mean that if we calculate a 95% confidence interval then the true value is, with certainty, contained within that one interval.

The Confidence Interval

There are several core assumptions that need to be met to use confidence intervals, often including random selection and independent and identically distributed (IID) data, among others. When one computes a confidence interval repeatedly, the true value will lie within the computed interval 95 percent of the time. That means in the long run, if we keep computing these confidence intervals, 95% of those intervals will contain the true value.

Margin of Error by Proportion

The other 5%

When we have a "95% Confidence Interval" it means that if we repeatedly conduct this survey using the exact same procedures then, in the long run, 95% of the intervals would contain the actual "true value". But that leaves a remaining 5%. Where did that go? This gets into hypothesis testing: rejecting the null (H_{0}) and concluding the alternative (H_{a}). The 5% that is often used is known as the Type I error rate. It is often identified by the Greek letter alpha (\alpha) and is also called the significance level. It is the probability of making a Type I error, that is, rejecting the null hypothesis when the null hypothesis is in fact true.

The Population

Simply looking at the formulas used to calculate a confidence interval we can see that it is a function of the data (variance and mean). Unless the finite population correction (FPC) is used, it is otherwise not related to the population size. If we have a population of one hundred thousand or one hundred million the confidence interval will be the same. With a population of that size the FPC is so minuscule that it won’t really change anything anyway.
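A minimal sketch of just how small that correction is for large populations (the sample size below is an illustrative value):

n = 1000                                     ## illustrative sample size
fpc = function(N, n) sqrt((N - n)/(N - 1))   ## finite population correction factor
fpc(1e5, n)   ## population of one hundred thousand: about 0.995
fpc(1e8, n)   ## population of one hundred million: essentially 1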

The Margin of Error

A direct component of the confidence interval is the margin of error. This is the number most widely seen in the news, whether in print, on TV, or otherwise. Often, however, the confidence level is excluded and not mentioned in these articles. One can normally assume a 95% confidence level most of the time. What makes the whole thing difficult is that the margin of error could be based on a 90% confidence level, making the margin of error smaller and giving an artificial impression of the survey's accuracy.  The graph below shows the sample size needed for a given margin of error.  This graph is based on the conservative 50% proportion.  Other proportions will give a smaller margin of error: .5*.5 maximizes the margin of error (as seen in the graph above), and any other combination of numbers will decrease it.  Often the "magic number" for sample size seems to be in the neighborhood of 1000 respondents (with, according to Pew, roughly a 9% response rate for telephone surveys).
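The underlying calculation is straightforward; for a proportion p and sample size n the margin of error is

  MOE = z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}

so with the conservative p = 0.5, n = 1000, and a 95% confidence level:

qnorm(1 - .05/2) * sqrt(.5 * .5 / 1000)   ## about 0.031, i.e. +/- 3.1 points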

 

The Other Error

Margin of error isn't the only error.  Keep in mind that the word error should not be confused with there being a mistake in the research.  Error simply means random variation due to sampling.  So when a survey or other study indicates a margin of error of +/- 3%, that is simply the error (variation) due to random sampling.  There are all sorts of other types of error that can work their way into the research including, but not limited to, differential response, question wording on surveys, weather, and the list could go on.  Many books have been written on this topic.

Some Examples

ci-plot

Margin of Error

alpha = .01
reps = 100000
true.mean = 0
true.var = 1

true.prop = .25

raw = replicate(reps, rnorm(100,true.mean,true.var))

# Calculate the mean and standard error for each of the replicates
raw.mean = apply(raw, 2, mean)
raw.se = apply(raw, 2, sd)/sqrt( nrow(raw) )

# Calculate the margin of error
raw.moe = raw.se * qnorm(1-alpha/2)

# Set up upper and lower bound matrix. This format is useful for the graphs
raw.moe.mat = rbind(raw.mean+raw.moe, raw.mean-raw.moe)
row.names(raw.moe.mat) = c(alpha/2, 1-alpha/2)

# Calculate the confidence level
( raw.CI = (1-sum(
as.numeric( apply(raw.moe.mat, 2, min) > 0 | apply(raw.moe.mat, 2, max) < 0 ) )/reps)*100 )

# Try some binomial distribution data
raw.bin.mean = rbinom(reps, 50, prob=true.prop)/50
raw.bin.moe = sqrt(raw.bin.mean*(1-raw.bin.mean)/50)*qnorm(1-alpha/2)
raw.bin.moe.mat = rbind(raw.bin.mean+raw.bin.moe, raw.bin.mean-raw.bin.moe)
row.names(raw.bin.moe.mat) = c(alpha/2, 1-alpha/2)
( raw.bin.CI = (1-sum(
as.numeric( apply(raw.bin.moe.mat, 2, min) > true.prop | apply(raw.bin.moe.mat, 2, max) <= true.prop ) )/reps)*100 )

par(mfrow=c(1,1))
ind = 1:100
ind.odd = seq(1,100, by=2)
ind.even = seq(2,100, by=2)
matplot(rbind(ind,ind),raw.moe.mat[,1:100],type="l",lty=1,col=1,
xlab="Sample Identifier",ylab="Response Value",
main=expression(paste("Confidence Intervals with ",alpha,"=.01")),
sub=paste("Simulated confidence Level: ",raw.CI,"%", sep=""), xaxt='n')
axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75)
points(ind,raw.mean[1:100],pch=19, cex=.4)
abline(h=0, col="#0000FF")

size.seq = seq(0, 10000, by=500)[-1]
moe.seq = sqrt( (.5*(1-.5))/size.seq ) * qnorm(1-alpha/2)
plot(size.seq, moe.seq, xaxt='n', yaxt='n', main='Margin of Error and Sample Size',
ylab='Margin of Error', xlab='Sample Size', sub='Based on 50% Proportion')
lines(size.seq, moe.seq)
axis(side=1, at=size.seq, tcl = -1.0, lty = 1, lwd = 0.5, labels=size.seq, cex.axis=.75)
axis(side=2, at=seq(0,15, by=.005), tcl = -0.7, lty = 1, lwd = 0.5, labels=seq(0,15, by=.005), cex.axis=.75)
abline(h=seq(0,15,by=.005), col='#CCCCCC')
abline(v=size.seq, col='#CCCCCC')

size.seq = seq(0,1, by=.01)
moe.seq = sqrt( (size.seq*(1-size.seq))/1000 ) * qnorm(1-alpha/2)
plot(size.seq, moe.seq, xaxt='n', yaxt='n', main='Margin of Error and Sample Size',
ylab='Margin of Error', xlab='Proportion', sub='Based on 50% Proportion')
lines(size.seq, moe.seq)
axis(side=1, at=size.seq, tcl = -1.0, lty = 1, lwd = 0.5, labels=size.seq, cex.axis=.75)
axis(side=2, at=seq(0,15, by=.005), tcl = -0.7, lty = 1, lwd = 0.5, labels=seq(0,15, by=.005), cex.axis=.75)
abline(h=seq(0,15,by=.005), col='#CCCCCC')
abline(v=.5, col="#CCCCCC")