That’s Smooth

I had someone ask me the other day how to take a scatterplot and draw something other than a straight line through the graph using Excel.  Yes, it can be done in Excel and it’s really quite simple, but there are some limitations when using the stock Excel dialog screens. So it is probably in one’s best interest to use a higher quality statistical software package.

There are several ways to add a curve to a scatter plot. This could be something as simple as adding a polynomial to the model or something a little more complex like a kernel curve, LOESS, or splines. Linear vs. Non-Linear

First, a difference needs to be made between linear and non-linear regression. Linear regression can also include drawing “lines” that, when plotted on a graph, actually have a curved shape. Polynomial regression is a type linear regression that produces curves and is referred to a curvilinear regression. Specifically, if you can take the partial derivative of the equation and the result is a function that still includes the unknown parameters then the model is considered nonlinear. Polynomials continue to remain a linear combination. Take the following linear model equation. $y = B_{0} + x\beta_{1} + x^{2}\beta_{2} + x^{3}\beta_{3}$

The partial derivatives, as seen below, are no longer functions of $\beta$. This means that even though it is a polynomial this is a linear model. $\begin{array}{l} \frac{dy}{d\beta_{0}}=1\\ \frac{dy}{d\beta_{1}}=x\\ \frac{dy}{d\beta_{2}}=x^{2} \\ \frac{dy}{d\beta_{3}}=x^{3} \end{array}$

Example Data

An example that I often use is the speed of a car and the fuel economy. This often shows a strong curvilinear relationship. A lot of other data including biological data and data relating to speed and weight can often have a form of curved relationship.  This graph is from the 2012 Fuel Economy Guide. Data can also be simulated that is curvilinear.   In this case I added a cubic component and show the straight line regression. The example I’ll use is the cars dataset that is shipped with R in the datasets library.  It has two variables and is the speed of the car and the stopping distance.

Polynomials

Polynomials can be a very good choice because a straight forward closed-form solution can easily be obtained.  Additionally, polynomial models are fairly easy to implement and the models can be easily compared to other predictive models. Smoothing Spline

Like other smoothers the spline uses a range of the value to determine its smoothness.  This is referred to as the knot.  The more knots the tighter the fit of the model.  Fewer knots produce a smoother curve.  Here the natural spline (green) and the smoothing spline (blue) are fairly similar.

Kernel Smoothing

Kernel smoothing uses a weighted average of the corresponding x/y point. The sensitivity is based on a predefined value called bandwidth.  The greater the bandwidth the smoother the curve will be as it dampened down by the additional data points but it doesn’t handle localized variation very well.  A smaller bandwidth can be very sensitive to local variation and will allow some very dramatic changes in the curve.  For example here is a comparison of a bandwidth of 1.0 and 5.0.  The smoother curve is, of course, the larger bandwidth. LOESS and LOWESS

This is a non-parametric locally weighted regression using a nearest neighbor approach.  It too uses a value to control the smoothing.  In R this is referred to as span but can also be referred to as bandwidth, similar to kernel smoothing.  I nice feature the comes with LOESS is it’s ability to produce uncertainly around the prediction.  This means that confidence intervals can easily be produced and plotted.

Closing Thoughts

It can be a little troublesome trying to determine which is best model or whether the curve is over/under fit.  It may also depend on how sensitive to local variation you want to allow.  There are options available to help determine the approach you may want to use.  For example the MGCV package in R has some features to help with this.  As can be seen in this example all the approaches result in very similar solutions.  If all you need is some sort of smoother through a scatterplot then it may be best to use the simplest approach.  However, if one is trying to make predictions then a lot more care needs to be taken to fit the model properly.

Example Code

library(graphics)
library(splines) # Used for the ns() function -- (natural cubic splines)

R = matrix(cbind(1,.99, .99,1),nrow=2)
U = t(chol(R))
nvars = dim(U)
numobs = 1000
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,10,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1")
raw$predictor1.3 = raw$predictor1^3
raw$predictor1.2 = raw$predictor1^2
fit = lm(raw$response ~ raw$predictor1.3)

plot(raw$response ~ raw$predictor1.3, pch=16, cex=.4, xlab="Predictor", ylab="Response", main="Simulated Data with Slight Curve")
abline(fit)

x = with(cars, speed)
y = with(cars, dist)
eval.length = 50

# This LOESS shows two different R function arriving at the same solution.
# Careful using the LOESS defaults as they differ and will produce different solutions.
fit.loess = loess.smooth(x, y, evaluation = eval.length,
family="gaussian", span=.75, degree=1)
fit.loess2= loess(y ~ x, family="gaussian",
span=.75, degree=1)

## Set a simple 95% CI on the fit.loess model
new.x = seq(min(x),max(x), length.out=eval.length)
ci = cbind(
predict(fit.loess2, data.frame(x=new.x)),
predict(fit.loess2, data.frame(x=new.x))+
predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2), predict(fit.loess2, data.frame(x=new.x))- predict(fit.loess2, data.frame(x=new.x), se=TRUE)$se.fit*qnorm(1-.05/2)
)

## Linear Model
fit = lm(y ~ x )

## Polynomial
fit.3 = lm(y ~ poly(x,3) )

## Natural Spline
fit.ns.3 = lm(y ~ ns(x, 3) )
## Smoothing Spline
fit.sp = smooth.spline(y ~ x, nknots=15)

plot(x,y, xlim=c(min(x),max(x)), ylim=c(min(y),max(y)), pch=16, cex=.5,
ylab = "Stopping Distance (feet)", xlab= "Speed (MPH)", main="Comparison of Models"
, sub="Splines")
## Add additional models on top of graph. It can get cluttered with all the models.

## LOESS with Confidence Intervals
matplot(new.x, ci, lty = c(1,2,2), col=c(1,2,2), type = "l", add=T)
## Linear
lines(new.x, predict(fit, data.frame(x=new.x)), col='orange', lty=3)
## Polynomial
lines(new.x, predict(fit.3, data.frame(x=new.x)), col='light blue', lty=4)
## Natural Spline
lines(new.x, predict(fit.ns.3, data.frame(x=new.x)), col='green', lty=5)
## Smoothing Spline
lines(fit.sp, col='blue', lty=6)
## Kernel Curve
lines(ksmooth(x, y, "normal", bandwidth = 5), col = 'purple', lty=7)
legend("topleft",c("Linear","Polynomial","Natural Spline","Smoothing Spline","Kernel"),
col=c('black','light blue','green','blue','purple'), lty=c(3,4,5,6,7), lwd=2)