Tracking the 2013 Hurricane Season

With it nearing the end of hurricane season it’s only appropriate to do a brief summary of the activity this year.   It’s been a surprisingly low-key season as far as hurricanes are concerned.  There have been only a few hurricanes and the barometric pressure of any hurricane this season has not even come close to hurricane Sandy (which broke the lowest barometric pressure record).

2013 Hurricane Season

A lot more analysis and visualization could be done with these data.  I will do some further statistical analysis and maybe some time series comparing past recorded hurricanes or maybe some other kriging approaches. Perhaps when I have some more spare time I’ll play with the data even more and look into the sustained wind speed and some of the other hurricane indices (e.g. the Power Dissipation Index).

The following graph shows a histogram of the barometric pressure and how it compares to hurricane Sandy.

2013 Barometric Pressure

In the mean time here is a short piece of code to produce the graphics for the 2013 hurricanes.   It produces a graph of the area around the Atlantic Basin and pulls the data down from Unisys.  It also highlights the East Coast states of the United States.  I have a text file that I put together that lists all the hurricanes by year back through 2007.

Example Code

library(maps)
library(maptools)
library(rgdal)
library(OpenStreetMap) ## for future use to produce aerial image maps
library(raster)

year = 2013
hurricanes = read.table(file=paste(“http://statistical-research.com/wp-content/uploads/2013/10/hurricanes.txt”,sep=””), sep=”,”, fill=TRUE, header=T)
hurricanes = as.vector( subset(hurricanes, hurricanes$YEAR==year, select=c(“NAME”)) )

hurr.dat = list()
max.lon = max.lat = min.lat = min.lon = NULL
b.press = NULL

for(i in 1:nrow(hurricanes)){
raw = read.table(file=paste(“http://weather.unisys.com/hurricane/atlantic/”,year,”/”,hurricanes[i,],”/track.dat”,sep=””), skip=2,fill=TRUE)
colnames(raw) = c(“Latitude”,”Longitude”,”Time”,”WindSpeed”,”Pressure”,”Status”)
raw$Pressure = as.character(raw$Pressure)
raw$Pressure[raw$Pressure==”-“] = NA
raw$Pressure = as.numeric(raw$Pressure)

hurr.dat[[i]] = cbind(raw$Latitude, raw$Longitude, raw$Pressure)
b.press = c(b.press, min(raw$Pressure, na.rm=T))

if(is.null(max.lat)){
max.lat = max(raw$Latitude)
} else if(max.lat < max(raw$Latitude)) { max.lat = max(raw$Latitude) } if(is.null(min.lat)){ min.lat = min(raw$Latitude) } else if (min.lat > min(raw$Latitude)){
min.lat = min(raw$Latitude)
}
if(is.null(max.lon)){
max.lon = max(raw$Longitude)
} else if (max.lon < max(raw$Longitude)){ max.lon = max(raw$Longitude) } if(is.null(min.lon)){ min.lon = min(raw$Longitude) } else if (min.lon > min(raw$Longitude)){
min.lon = min(raw$Longitude)
}

}
xlim <- c(min.lon-5,max.lon+10) ylim <- c(min.lat-5,max.lat+10) state.list <- c('new york','new jersey','virginia','massachusetts','connecticut','delaware','pennsylvania','maryland','north carolina','south carolina','georgia','florida', 'new hampshire','maine','district of columbia','west virginia','vermont') my.map <- map("state", region=state.list, interior = FALSE, xlim=xlim, ylim=ylim) map("state", region=state.list, boundary = TRUE, col="gray", add = TRUE,xlim=xlim) map("world", boundary = TRUE, col="gray", add = TRUE,xlim=xlim) for(j in 1:nrow(hurricanes)){ lines(x=hurr.dat[[j]][,2],y=hurr.dat[[j]][,1],col=j,cex=0.75) points(x=hurr.dat[[j]][,2],y=hurr.dat[[j]][,1],pch=15,cex=0.4, col=j) text(hurr.dat[[j]][1,2], hurr.dat[[j]][1,1],hurricanes[j,]) } title("Path of 2013 Hurricane Season") box() hist(b.press, xlim=c(920,1020), main="Histogram of Barometric Pressure for 2013", xlab="Barometric Pressure (mb)", ylab="Frequency") abline(v=940, col='blue',lwd=3) text(958,.5,"<&lt;2012 Hurricane Sandy") [/sourcecode]

Beta Distribution and the NJ U.S. Senate Election

The beta distribution is highly flexible distribution and applies to many situations and environments. The beta distribution applies well when there are percentages. The upcoming New Jersey U.S. Senate election on Wednesday fits that criterion quite well. So here I applied the beta distribution to some pre-election polls where the numbers were obtained through the poll aggregator www.realclearpolitics.com.

The candidates for New Jersey election this Wednesday — to fill the vacant seat left by the death of Frank Lautenberg — are Cory Booker and Steve Lonegan. Though there are other third-party candidates running the race it is effectively between Booker and Lonegan. Though more complex models can be used reducing the candidates to two the beta distribution can be applied to these data and the outcomes and a simple simulation can be achieved using the given data.

Some Historical Notes

This general election is on a non-standard Election Day (Wednesday, October 16th). It happens to be the first time that a New Jersey general election has been held on a Wednesday. Aside from the current Republican senator who was appointed by Chris Christie the last time there was a Republican U.S. Senator in New Jersey was back in the early 1980’s and even then he too was appointed to the office.

2013 NJ US Senate -- Booker v. Lonegan

The Beta Distribution

As can be seen from the elections since 1990 the Democratic candidate has won by an average of about 8.9%.

2012 — Menendez: 58.9% v. Kyrillos: 39.4%
2008 — Lautenberg: 55.5% v. Zimmer: 42.5%
2006 — Menendez: 53.3% v. Kean Jr.: 44.3%
2002 — Lautenberg: 53.9% v. Forrester: 44.0%
2000 — Corzine: 50.1% v. Franks: 47.1%
1996 — Torricelli: 52.7% v. Zimmer: 42.6%
1994 — Lautenberg: 50.3% v. Haytaian: 47.0%
1990 — Bradley: 50.5% v. Whitman: 47.4%

Based on recent pre-election polling it looks like Booker will likely win by a similar margin and maybe a little higher than the average of 8.9% and, based on pre-election polls, closer to 12 percentage points.  The marginal difference between Booker and Lonegan is distributed as a beta distribution and we can see that the threshold of zero (0) is out in the far tail of the distribution.  So based on historical elections and current pre-election polling it seems that the likelihood that Booker will win is very high.

Histogram of Simulated Differences for 2013 U.S. Senate Election

Example Code


library(MCMCpack)
## Set up several of the recent polls but will only work with the most recent on
raw.1 = NULL
raw.1 = data.frame( rbind(
Quinnipiac = c(.53,.41,899),
RSC = c(.50,.39,729),
FD= c(.45,.29,702),
Mon = c(.53, .40,571)
)
)
raw.1 = rbind(raw.1, c(apply(raw.1[,1:2],2,weighted.mean,raw.1[,3]),sum(raw.1[,3])))
names(raw.1) = c("Cand1","Cand2","size")
raw.1$Other.und = 1-raw.1$Cand1-raw.1$Cand2
raw.1.no.und = data.frame(raw.1[5,1:2] + raw.1[5,1:2]/sum(raw.1[5,1:2])*raw.1[5,4],size=raw.1[5,3],Other.und=0)
raw = rbind(raw.1, raw.1.no.und)
###################################################################
## More than two candidates so Beta distribution won't work
## Function to randomly generate data from a dirichlet distribution
###################################################################
j= 4
prob.win = function(j,export=1){
p=rdirichlet(100000,
raw$size[j] *
c(raw$Cand1[j], raw$Cand2[j], 1-raw$Cand1[j]-raw$Cand2[j])+1
)
if(export==1){
mean(p[,1]>p[,2])
} else {
return(p)
}
}

( cand1.win.probs = sapply(1:nrow(raw),prob.win) )

sim.dir = prob.win(4,export=2) ## set simulated data for plotting and determining parameters
sim.dir.diff = sim.dir[,1]-sim.dir[,2] ## Get the marginal. From a Dirichlet the is distributed as a Beta.
sim.dir = cbind(sim.dir, sim.dir[,1]-sim.dir[,2])
## The shape parameters (shape1 and shape2) might need some manual adjusting and tweaking.
## In this case I ran the function a few time to set the start value close to the output
fit.distr.1 = fitdistr(sim.dir[,1], "beta",
start=list(shape1=302,shape2=270))
fit.distr.2 = fitdistr(sim.dir[,2], "beta",
start=list(shape1=229,shape2=343))
fit.distr.margin = fitdistr(sim.dir[,4], "beta",
start=list(shape1=5,shape2=5))
## Could also draw a histogram of simulated data
curve(dbeta(x,fit.distr.1$estimate[1],fit.distr.1$estimate[2]),
ylim=c(0,20), xlim=c(.3,.6), col='blue', lty=1, lwd=2, ylab="Density", xlab="theta",
main="Distribution of the NJ U.S. Senate Election 2013",
sub=paste("Probability that Booker beats Lonegan: ", round(cand1.win.probs[6],2) ) ) ## Candidate 1
curve(dbeta(x,fit.distr.2$estimate[1],fit.distr.2$estimate[2]), add=T, col='red', lty=2, lwd=2) ## Candidate 2

abline(v=c(median(sim.dir[,1]), median(sim.dir[,2])), col=c('blue','red'), lwd=2, lty=c(1,2,3))
legend("topleft",c("Booker","Lonegan"), lwd=2, col=c('blue','red'), lty=c(1,2))
## Draw a histogram of simulated data
hist(sim.dir[,4], nclass=100, main="Histogram of the Candidate Differences", xlab="Candidate Difference")
abline(v=0, col=c('black'), lwd=2, lty=c(1))

Random Sequence of Heads and Tails: For R Users

Rick Wicklin on the SAS blog made a post today on how to tell if a sequence of coin flips were random.  I figured it was only fair to port the SAS IML code over to R.  Just like Rick Wicklin did in his example this is the Wald-Wolfowitz test for randomness.  I tried to match his code line-for-line.

flips = matrix(c('H','T','T','H','H','H','T','T','T','T','T','T','T','H','H','H','T','H','T','H','H','H','T','H','H','H','T','H','T','H'))

RunsTest = function(flip.seq){
  u = unique(flip.seq) # unique value (should be two)
  
  d = rep(-1, nrow(flip.seq)*ncol(flip.seq)) # recode as vector of -1, +1
  d[flip.seq==u[1]] = 1
  
  n = sum(d > 0) # count +1's
  m = sum(d < 0) # count -1's
  
  dif = c(ifelse(d&#91;1&#93; < 0, 2, -2), diff( sign(d) )) # take the lag and find differences
  
  R = sum(dif==2 | dif==-2) # count up the number of runs
  
  ww.mu = 2*n*m / (n+m) + 1 # get the mean
  ww.var = (ww.mu-1)*(ww.mu-2)/(n+m-1) # get the variance
  sigma = sqrt(ww.var) # standard deviation
  
  # compute test statistics
  if((n+m) > 50){
    Z  = (R-ww.mu) / sigma
  } else if ((R-ww.mu) < 0){
    Z = (R-ww.mu+0.5) / sigma
  } else {
    Z = (R-ww.mu-0.5)/sigma
  }
  
  
  pval = 2*(1-pnorm(abs(Z))) # compute a two-sided p-value 
  
  ret.val = list(Z=Z, p.value=pval)
return(ret.val)
}

runs.test = RunsTest(flips)
runs.test


> runs.test
$Z
[1] -0.1617764

$p.value
[1] 0.8714819

That’s Smooth

I had someone ask me the other day how to take a scatterplot and draw something other than a straight line through the graph using Excel.  Yes, it can be done in Excel and it’s really quite simple, but there are some limitations when using the stock Excel dialog screens. So it is probably in one’s best interest to use a higher quality statistical software package.

There are several ways to add a curve to a scatter plot. This could be something as simple as adding a polynomial to the model or something a little more complex like a kernel curve, LOESS, or splines.

Smoothers

Linear vs. Non-Linear

First, a difference needs to be made between linear and non-linear regression. Linear regression can also include drawing “lines” that, when plotted on a graph, actually have a curved shape. Polynomial regression is a type linear regression that produces curves and is referred to a curvilinear regression. Specifically, if you can take the partial derivative of the equation and the result is a function that still includes the unknown parameters then the model is considered nonlinear. Polynomials continue to remain a linear combination. Take the following linear model equation.

  y = B_{0} + x\beta_{1} + x^{2}\beta_{2} + x^{3}\beta_{3}

The partial derivatives, as seen below, are no longer functions of \beta. This means that even though it is a polynomial this is a linear model.

\begin{array}{l} \frac{dy}{d\beta_{0}}=1\\  \frac{dy}{d\beta_{1}}=x\\  \frac{dy}{d\beta_{2}}=x^{2} \\  \frac{dy}{d\beta_{3}}=x^{3}  \end{array}

Example Data

An example that I often use is the speed of a car and the fuel economy. This often shows a strong curvilinear relationship. A lot of other data including biological data and data relating to speed and weight can often have a form of curved relationship.  This graph is from the 2012 Fuel Economy Guide.

FuelEconomy.gov Speed and MPG

Data can also be simulated that is curvilinear.   In this case I added a cubic component and show the straight line regression.

Simulated Curve

 The example I’ll use is the cars dataset that is shipped with R in the datasets library.  It has two variables and is the speed of the car and the stopping distance.

Polynomials

Polynomials can be a very good choice because a straight forward closed-form solution can easily be obtained.  Additionally, polynomial models are fairly easy to implement and the models can be easily compared to other predictive models.

Polynomial

 

Smoothing Spline

Like other smoothers the spline uses a range of the value to determine its smoothness.  This is referred to as the knot.  The more knots the tighter the fit of the model.  Fewer knots produce a smoother curve.  Here the natural spline (green) and the smoothing spline (blue) are fairly similar.

 Splines

Kernel Smoothing

Kernel smoothing uses a weighted average of the corresponding x/y point. The sensitivity is based on a predefined value called bandwidth.  The greater the bandwidth the smoother the curve will be as it dampened down by the additional data points but it doesn’t handle localized variation very well.  A smaller bandwidth can be very sensitive to local variation and will allow some very dramatic changes in the curve.  For example here is a comparison of a bandwidth of 1.0 and 5.0.  The smoother curve is, of course, the larger bandwidth.

Kernel Comparison

 

LOESS and LOWESS

This is a non-parametric locally weighted regression using a nearest neighbor approach.  It too uses a value to control the smoothing.  In R this is referred to as span but can also be referred to as bandwidth, similar to kernel smoothing.  I nice feature the comes with LOESS is it’s ability to produce uncertainly around the prediction.  This means that confidence intervals can easily be produced and plotted.

 LOESS

Closing Thoughts

It can be a little troublesome trying to determine which is best model or whether the curve is over/under fit.  It may also depend on how sensitive to local variation you want to allow.  There are options available to help determine the approach you may want to use.  For example the MGCV package in R has some features to help with this.  As can be seen in this example all the approaches result in very similar solutions.  If all you need is some sort of smoother through a scatterplot then it may be best to use the simplest approach.  However, if one is trying to make predictions then a lot more care needs to be taken to fit the model properly.

Example Code


library(graphics)
library(splines) # Used for the ns() function -- (natural cubic splines)

R = matrix(cbind(1,.99, .99,1),nrow=2)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 1000
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,10,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1")
raw$predictor1.3 = raw$predictor1^3
raw$predictor1.2 = raw$predictor1^2
fit = lm(raw$response ~ raw$predictor1.3)

plot(raw$response ~ raw$predictor1.3, pch=16, cex=.4, xlab="Predictor", ylab="Response", main="Simulated Data with Slight Curve")
abline(fit)

x = with(cars, speed)
y = with(cars, dist)
eval.length = 50

# This LOESS shows two different R function arriving at the same solution.
# Careful using the LOESS defaults as they differ and will produce different solutions.
fit.loess = loess.smooth(x, y, evaluation = eval.length,
family="gaussian", span=.75, degree=1)
fit.loess2= loess(y ~ x, family="gaussian",
span=.75, degree=1)

## Set a simple 95% CI on the fit.loess model
new.x = seq(min(x),max(x), length.out=eval.length)
ci = cbind(
predict(fit.loess2, data.frame(x=new.x)),
predict(fit.loess2, data.frame(x=new.x))+
predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2),
predict(fit.loess2, data.frame(x=new.x))-
predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2)
)

## Linear Model
fit = lm(y ~ x )

## Polynomial
fit.3 = lm(y ~ poly(x,3) )

## Natural Spline
fit.ns.3 = lm(y ~ ns(x, 3) )
## Smoothing Spline
fit.sp = smooth.spline(y ~ x, nknots=15)

plot(x,y, xlim=c(min(x),max(x)), ylim=c(min(y),max(y)), pch=16, cex=.5,
ylab = "Stopping Distance (feet)", xlab= "Speed (MPH)", main="Comparison of Models"
, sub="Splines")
## Add additional models on top of graph. It can get cluttered with all the models.

## LOESS with Confidence Intervals
matplot(new.x, ci, lty = c(1,2,2), col=c(1,2,2), type = "l", add=T)
## Linear
lines(new.x, predict(fit, data.frame(x=new.x)), col='orange', lty=3)
## Polynomial
lines(new.x, predict(fit.3, data.frame(x=new.x)), col='light blue', lty=4)
## Natural Spline
lines(new.x, predict(fit.ns.3, data.frame(x=new.x)), col='green', lty=5)
## Smoothing Spline
lines(fit.sp, col='blue', lty=6)
## Kernel Curve
lines(ksmooth(x, y, "normal", bandwidth = 5), col = 'purple', lty=7)
legend("topleft",c("Linear","Polynomial","Natural Spline","Smoothing Spline","Kernel"),
col=c('black','light blue','green','blue','purple'), lty=c(3,4,5,6,7), lwd=2)

 

The Uncertainty of Predictions

There are many kinds of intervals in statistics.  To name a few of the common intervals: confidence intervals, prediction intervals, credible intervals, and tolerance intervals. Each are useful and serve their own purpose.

I’ve been recently working on a couple of projects that involve making predictions from a regression model and I’ve been doing some work on estimation and the variability of the estimates. Recently, in my case, I need to predict an individual outcome drawn from the distribution (this differs from estimating the mean of that distribution). Often I end up estimating several simultaneous individual outcomes. It follows that we need to assume the underlying regression model still applies and is appropriate for this new individual observation.

Prediction vs. Confidence Intervals

Prediction Intervals and Confidence Intervals

Prediction intervals are useful in that it gives us an understanding of likely values that an individual observation may be.  There is a subtle, but big, difference between prediction intervals and confidence intervals.  With a confidence interval we are trying to obtain an upper and lower limit for the parameter.  For a prediction interval we are interested in the upper and lower limits on an individual observation.  One key item to note is that prediction intervals will always be wider than the corresponding confidence intervals.  The reason for this is that when making a new prediction on an observation ( \hat{y_{i}} ) we encounter the variability in the observation estimate as well as variation within the probability distribution of y.

These intervals often get confused with confidence intervals (which those alone have many nuances). But they really need to be kept separate because calling a prediction interval a confidence interval (or vice versa) will provide a false understanding of the variability.

Doing Many Prediction Intervals

Sometimes it is necessary to estimate several new observations.  In this case we need an overall family confidence level of (1-\alpha). To compensate for simultaneously estimating many observations there are a couple of difference procedures: Scheffé and Bonferroni.  It seems, based on various conferences that I’ve attended, that Bonferroni is most common, probably because it is not only conservative in its application but because it is so simple to implement.

Bayesian Posterior Predictive Distribution

It’s worth mentioning the posterior predictive distribution using a Bayesian approach.  In Bayesian statistics the parameter is treated as random and often unknown (as compared to frequentists where it’s fixed and often unknown).   So here we can use the posterior predictive distribution for new data. It also happens to be a good way to see how well the model performed.

Posterior Prediction Distribution

Example Code

library(LearnBayes)
alpha = 0.05
y = rnorm(100,0, 1)
x1= y+rnorm(100,1,1)
x2= y+rnorm(100,2,2)

##### Least-squares fit
fit=lm(y~x1+x2,x=TRUE,y=TRUE)

new = data.frame(x1 = seq(-5, 5, 0.5), x2=seq(-3, 7, 0.5))

pred.w.plim = predict(fit, new, interval = “prediction”)
pred.w.clim = predict(fit, new, interval = “confidence”)
matplot(new$x1, cbind(pred.w.clim, pred.w.plim[,-1]),
lty = c(1,2,2,3,3), col=c(1,2,2,3,3), type = “l”,
ylab = “predicted y”, xlab= “x1″, main=”Confidence vs. Prediction Intervals”)
legend(“topleft”, c(“Confidence”,”Prediction”), lwd=2, lty=c(2,3), col=c(2,3), cex=.7)

##### Sampling from posterior
theta.sample=blinreg(fit$y,fit$x,5000)

######### Model checking via posterior predictive distribution
pred.draws=blinregpred(fit$x,theta.sample)
pred.sum=apply(pred.draws,2,quantile,c(alpha/2,1-alpha/2))

par(mfrow=c(1,1))
ind=1:ncol( pred.sum )
ind.odd=seq(1,ncol( pred.sum ), by=2)
ind.even=seq(2,ncol( pred.sum ), by=2)
matplot(rbind(ind,ind),pred.sum,type=”l”,lty=1,col=1,
xlab=”Identifier”,ylab=”Response Variable y”, main=”Posterior Predictive Distribution”
, xaxt=’n’)
axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep(“”,length(ind.even)), cex.axis=.75)
points(ind,y,pch=19, cex=.4)

out=(y>pred.sum[2,] | y