## When Discussing Confidence Level With Others…

This post spawned from a discussion I had the other day. Confidence intervals are a notoriously difficult topic for those unfamiliar with statistics. I can't really think of another statistical topic that is so widely published in newspaper articles, television, and elsewhere, yet that so few people really understand. It's been this way since the moment Jerzy Neyman proposed the idea (in an appendix, no less) in 1937.

### What the Confidence Interval is Not

There are a lot of things that the confidence interval is not. Unfortunately, many of these are often used to define a confidence interval.

• It is not the probability that the true value is in the confidence interval.
• It is not that we will obtain the true value 95% of the time.
• We are not 95% sure that the true value lies within the interval of the one sample.
• It is not the probability that we are correct.
• It does not say anything about how accurate the current estimate is.
• It does not mean that if we calculate a 95% confidence interval then the true value is, with certainty, contained within that one interval.

### The Confidence Interval

There are several core assumptions that must be met to use confidence intervals, often including random selection and independent and identically distributed (IID) data, among others. When one computes a confidence interval repeatedly, one will find that the true value lies within the computed interval 95 percent of the time. That means in the long run, if we keep on computing these confidence intervals, then 95% of those intervals will contain the true value.

When we have a "95% Confidence Interval" it means that if we repeatedly conduct this survey using the exact same procedures, then in the long run 95% of the intervals would contain the actual "true value". But that leaves a remaining 5%. Where did that go? This gets into hypothesis testing: rejecting the null hypothesis ( $H_{0}$) and concluding the alternative ( $H_{a}$). That 5% is the probability of making a Type I error, is identified by the Greek letter alpha ( $\alpha$), and is often called the significance level. In other words, the probability of rejecting the null hypothesis when the null hypothesis is in fact true is 5%.
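To make the long-run interpretation concrete, here is a minimal R sketch (the variable names and the choice of a standard normal population are mine, for illustration) that repeatedly draws samples and records how often the computed 95% interval covers the true mean:

```r
# Simulate many samples and check how often the 95% CI for the mean
# covers the true value (here the true mean is 0)
set.seed(42)
reps <- 10000
n <- 50
alpha <- 0.05

covered <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = 1)             # sample from the population
  moe <- qnorm(1 - alpha/2) * sd(x)/sqrt(n)   # margin of error
  lower <- mean(x) - moe
  upper <- mean(x) + moe
  lower < 0 & 0 < upper                       # did this interval cover 0?
})

mean(covered)  # long-run coverage, close to 0.95
```

The key point is that `covered` is a property of the procedure across repetitions, not of any single interval.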

### The Population

Simply looking at the formulas used to calculate a confidence interval, we can see that it is a function of the data (the mean and variance). Unless the finite population correction (FPC) is used, it is otherwise unrelated to the population size: whether we have a population of one hundred thousand or one hundred million, the confidence interval will be the same. And with populations of that size, the FPC is so minuscule that it won't really change anything anyway.
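As a quick sketch of why the FPC barely matters at these scales (the function name `fpc` is mine), using the common correction factor sqrt((N - n)/(N - 1)):

```r
# Finite population correction factor for a sample of n = 1000
# drawn from two very different population sizes
fpc <- function(N, n) sqrt((N - n) / (N - 1))

fpc(1e5, 1000)  # population of one hundred thousand, roughly 0.995
fpc(1e8, 1000)  # population of one hundred million, essentially 1
```

Multiplying a standard error by a factor this close to 1 leaves the resulting confidence interval effectively unchanged.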

### The Margin of Error

A direct component of the confidence interval is the margin of error. This is the number most widely seen in the news, whether it be print, TV, or otherwise. Often, however, the confidence level is excluded and not mentioned in these articles. One can normally assume a 95% confidence level, most of the time. What makes the whole thing difficult is that the margin of error could be based on a 90% confidence level, making the margin of error smaller and giving an artificial impression of the survey's accuracy. The graph below shows the sample size needed for a given margin of error. This graph is based on the conservative 50% proportion; different proportions will produce a smaller margin of error due to the math. In other words, .5*.5 maximizes the margin of error, and any other proportion will decrease it. Often the "magic number" for sample size seems to be in the neighborhood of 1000 respondents (with, according to Pew, a 9% response rate for telephone surveys).
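A small R sketch of this point (the helper name `moe` is mine): the margin of error for a 50% proportion with 1000 respondents, at 95% versus 90% confidence:

```r
# Margin of error for a proportion p with sample size n at a given
# confidence level, using the normal approximation
moe <- function(p, n, conf) qnorm(1 - (1 - conf)/2) * sqrt(p * (1 - p)/n)

moe(.5, 1000, .95)  # roughly 0.031, the familiar "+/- 3%"
moe(.5, 1000, .90)  # roughly 0.026, smaller only because the level is lower
```

The data and sample size are identical in both calls; the smaller number comes entirely from the lower confidence level, which is exactly why reporting a margin of error without its level can mislead.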

### The Other Error

Margin of error isn't the only error. Keep in mind that the word error should not be confused with there being a mistake in the research. Error simply means random variation due to sampling. So when a survey or other study indicates a margin of error of +/- 3%, that is simply the error (variation) due to random sampling. There are all sorts of other types of error that can work their way into the research including, but not limited to, differential response, question wording on surveys, weather, and the list could go on. Many books have been written on this topic.

### Some Examples

The following R code simulates confidence intervals for normal and binomial data, checks the simulated coverage, and produces the graphs discussed above.

```r
alpha = .01
reps = 100000
true.mean = 0
true.var = 1
true.prop = .25

raw = replicate(reps, rnorm(100, true.mean, true.var))

# Calculate the mean and standard error for each of the replicates
raw.mean = apply(raw, 2, mean)
raw.se = apply(raw, 2, sd)/sqrt( nrow(raw) )

# Calculate the margin of error
raw.moe = raw.se * qnorm(1-alpha/2)

# Set up upper and lower bound matrix. This format is useful for the graphs
raw.moe.mat = rbind(raw.mean+raw.moe, raw.mean-raw.moe)
row.names(raw.moe.mat) = c(alpha/2, 1-alpha/2)

# Calculate the simulated confidence level
( raw.CI = (1-sum(
  as.numeric( apply(raw.moe.mat, 2, min) > 0 | apply(raw.moe.mat, 2, max) < 0 )
)/reps)*100 )

# Try some binomial distribution data
raw.bin.mean = rbinom(reps, 50, prob=true.prop)/50
raw.bin.moe = sqrt(raw.bin.mean*(1-raw.bin.mean)/50)*qnorm(1-alpha/2)
raw.bin.moe.mat = rbind(raw.bin.mean+raw.bin.moe, raw.bin.mean-raw.bin.moe)
row.names(raw.bin.moe.mat) = c(alpha/2, 1-alpha/2)
( raw.bin.CI = (1-sum(
  as.numeric( apply(raw.bin.moe.mat, 2, min) > true.prop |
              apply(raw.bin.moe.mat, 2, max) <= true.prop )
)/reps)*100 )

# Plot the first 100 simulated confidence intervals
par(mfrow=c(1,1))
ind = 1:100
ind.odd = seq(1, 100, by=2)
ind.even = seq(2, 100, by=2)
matplot(rbind(ind,ind), raw.moe.mat[,1:100], type="l", lty=1, col=1,
        xlab="Sample Identifier", ylab="Response Value",
        main=expression(paste("Confidence Intervals with ", alpha, "=.01")),
        sub=paste("Simulated Confidence Level: ", raw.CI, "%", sep=""), xaxt='n')
axis(side=1, at=ind.odd, tcl=-1.0, lty=1, lwd=0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl=-0.7, lty=1, lwd=0.5,
     labels=rep("", length(ind.even)), cex.axis=.75)
points(ind, raw.mean[1:100], pch=19, cex=.4)
abline(h=0, col="#0000FF")

# Margin of error as a function of sample size (based on a 50% proportion)
size.seq = seq(0, 10000, by=500)[-1]
moe.seq = sqrt( (.5*(1-.5))/size.seq ) * qnorm(1-alpha/2)
plot(size.seq, moe.seq, xaxt='n', yaxt='n', main='Margin of Error and Sample Size',
     ylab='Margin of Error', xlab='Sample Size', sub='Based on 50% Proportion')
lines(size.seq, moe.seq)
axis(side=1, at=size.seq, tcl=-1.0, lty=1, lwd=0.5, labels=size.seq, cex.axis=.75)
axis(side=2, at=seq(0,15, by=.005), tcl=-0.7, lty=1, lwd=0.5,
     labels=seq(0,15, by=.005), cex.axis=.75)
abline(h=seq(0,15, by=.005), col='#CCCCCC')
abline(v=size.seq, col='#CCCCCC')

# Margin of error as a function of the proportion (n = 1000)
prop.seq = seq(0, 1, by=.01)
moe.seq = sqrt( (prop.seq*(1-prop.seq))/1000 ) * qnorm(1-alpha/2)
plot(prop.seq, moe.seq, xaxt='n', yaxt='n', main='Margin of Error and Proportion',
     ylab='Margin of Error', xlab='Proportion', sub='Based on n = 1000')
lines(prop.seq, moe.seq)
axis(side=1, at=prop.seq, tcl=-1.0, lty=1, lwd=0.5, labels=prop.seq, cex.axis=.75)
axis(side=2, at=seq(0,15, by=.005), tcl=-0.7, lty=1, lwd=0.5,
     labels=seq(0,15, by=.005), cex.axis=.75)
abline(h=seq(0,15, by=.005), col='#CCCCCC')
abline(v=.5, col="#CCCCCC")
```

## Data Scientists and Statisticians: Can’t We All Just Get Along

It seems that the title "data scientist" has taken the world by storm.  It's a title that conjures up almost mystical abilities of a person garnering information from oceans of data with ease.  It's where a data scientist can wave his or her hand like a Jedi Knight and simply tell the data what it should be.

What is interesting about the field of data science is its perceived (possibly real) threat to other fields, namely statistics.  It seems to me that the two fields are distinct areas.  Though the two fields can exist separately on their own, each is weak without the other.  Hilary Mason (of Bitly) shares her definition of a data scientist.  I suppose my definition differs from Hilary Mason's data science definition.  Statisticians need to understand the science and structure of data, and data scientists need to understand statistics.  Larry Wasserman over at the Normal Deviate blog shares his thoughts on statistics and data science.  There are other blogs, but these two are probably sufficient.

Data science is emerging as a field of absolutes, and that is something that the general public can wrap their heads around.  It's no wonder that statisticians are feeling threatened by data scientists.  Here are two (albeit extreme) examples:

If a statistician presents an estimate to a journalist and says "here is the point estimate of the number of people listening to a given radio station, and the margin of error is +/- 3% at a 90% confidence level," there is almost always a follow-up discussion about the margin of error, how the standard error was calculated (simple random, stratified, cluster), and why it is a 90% confidence interval rather than a 95% confidence interval.  And then someone is bound to ask what a confidence interval is anyway.  Then extend this even further and the statistician gives the journalist a p-value.  Now there is an argument among statisticians about hypothesis testing, and the terms "frequentist" and "Bayesian" start getting thrown around.

It's no wonder that people don't want to work with statisticians.  Not only are they confusing to the general public, but the statisticians can't even agree (even if it's a friendly disagreement) on what is correct.  Now take the following data scientist example:

A data scientist looks through a small file of 50 billion records where people have listened to songs through a registration-based online radio station (e.g. Spotify, Pandora, TuneIn, etc.).  This data scientist then merges and matches the records to a handful of public data sources to give the dataset a dimensionality of 10000.  The data scientist then simply reports that there are X number of listeners in a given metro area listening for Y amount of time, and produces a great SVG graph that can be dynamically updated each week with the click of a button on a website.  It is a fairly simple task, and just about everyone can understand what it means.

I feel that there will always be a need for a solid foundation in statistics.  There will always exist natural variation that must be measured and accounted for.  There will always be data that is so expensive that only a limited number of observations can feasibly be collected.  Or suppose that a certain set of data is so difficult to actually obtain that only a handful of observations can even be collected.  I would conjecture that a data scientist would not have a clue what to do with that data without help from someone with a background in statistics.  At the same time, if a statistician was told that there is a 50 billion by 10000 dimension dataset sitting on a Hadoop cluster, then I would also guess that many statisticians would be hard pressed to set the data up for analysis without consulting a data scientist.  But at the same time, a data scientist would probably struggle if asked to take those 10000 dimensions and reduce them down to a digestible and understandable set.

Take another example: genetic sequencing.  A data scientist could work the data and discover that in one sequence there is something different.  Then a domain expert can come in and find that the mutation is in the BRCA1 gene and that the BRCA1 gene relates to breast cancer.  A statistician can then be consulted to find the risk and probability that the particular mutation will result in increased mortality, and the probability that the patient will ultimately get breast cancer.

Ultimately, the way I see it, the two disciplines need to come together and become one.  I see no reason why it can't be part of the curriculum in statistics departments to teach students how to work with real-world data.  Those working in the data science and statistics fields need to have statistical training while having the ability to work with data regardless of its location, format, or size.

## JSM 2013 – Wednesday

I was able to attend one of the continuing education short course workshops at the JSM conference, and it proved to be quite insightful.  The discussion was on data mining and was titled "Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets".  The presentation was given by Dan Steinberg, and the examples that he gave were based on proprietary software called SPM (Salford Predictive Modeler).  I have not personally used the software, so I'm in no position to endorse or discourage its use.  I generally prefer open source solutions unless there is resounding evidence to use commercial products, so I'm interested in seeing how this software operates.  The slides for this presentation (as well as other continuing education courses) are available at http://info.salford-systems.com/jsm-2013-ctw.  Much of the workshop dealt with a dataset relating to car insurance fraud and how to use CART models and Random Forests.  As an aside, I made a post a while back giving some examples in R on those models.  The workshop was educational and informative on how to approach these types of problems using a different software package.  I'm particularly interested in comparing SPM to R, or seeing if others have already run some comparisons.

## JSM 2013 – Tuesday

The Joint Statistical Meeting in Montreal has proven to be very good.  Here are a few highlights from Tuesday's sessions.  There is one major problem: there are too many good sessions to attend.  During one time block I had six sessions that I wanted to go to.  Unfortunately, it is simply not possible to make it to all of them.  However, the recurring theme is that if you don't know at least R then you will quickly be left in the dust.  Based on the sessions so far, knowing R is a must, and knowing other languages such as Java, Scala, or Python will certainly be good.

### Session on Analytics and Data Visualization in Professional Sports

During the morning I attended a session on statistics in sports.  It was mostly several sabermetric presentations with some basketball in there too.  One presentation caught my attention due to the granularity of the data that the presenter used.  Benjamin Baumer's R package, openWAR, is an open source version of WAR (Wins Above Replacement) in baseball.  With the data that the package accesses, it is able to identify every play as well as the spatial location of where the batter hit the ball on the field.  If someone is interested in sports statistics, or just interested in playing with a lot of publicly available data, then openWAR is a great resource (currently available on GitHub at https://github.com/beanumber/openWAR).  This presentation also discussed the distribution of the players on the field and their ability to field the ball once it was hit.  A different presentation, from Sportvision, covered the location and trajectory of the ball as the pitcher throws it.  Sportvision also shows the location in the strike zone where the batter hits the ball the hardest.  They are the same company that does the 1st & 10 graphics (i.e. the yellow line needed for a 1st down).

### Session on Statistical Computing: Software and Graphics

I attended the Statistical Computing session, and 5 of the 6 presentations were on R packages.  The first was a presentation on Muste, the R implementation of Survo.  I have not used Survo before, but I will certainly do some research into it.  The next presentation was by Stephan Ritter, the maintainer of the relaxnet and widenet packages.  The third presentation was by David Dahl, the maintainer of jvmr.  With this package one can integrate Scala and Java into R without any special compilation.  TIBCO Spotfire then presented the TIBCO Enterprise Runtime for R (TERR).  This looks to be an interesting solution to some of the data management issues that exist in R; the presenter indicated that it does a very good job of managing system resources.  The fifth presentation discussed the Rcpp package, and the final presentation, by Christopher Fonnesbeck, was on PyMC, which allows a user to perform Bayesian statistical analysis in Python.

## JSM 2013 – Monday

I am currently attending the 2013 Joint Statistical Meeting in Montreal.  I will try to share a few of the things that I take away each day.  Last night (Monday) I attended the JSM keynote address by Nate Silver, and it proved to be a very interesting discussion.  Silver is best known for his work on http://www.fivethirtyeight.com.  His speech was good and focused on the journalistic component of statistics.  He shared that, in his opinion, statisticians should blog and contribute to journalism.  He also added, though it's a matter of personal opinion, and I don't agree, that a statistician should get a few years of hands-on work before going on for an advanced degree.  I'm of the philosophy that you just do it all at once; otherwise you might find yourself in a position where, for one reason or another, you simply can't go back to school.

I think Silver gave the quote of the conference.  Silver was asked his opinion on the difference between a data scientist and a statistician.  His response was, "Data scientist is just a sexed up version of statistician."  He then added that the titles are effectively redundant: just do good work and call yourself whatever you want.

A question was also asked during the Q&A portion about why he feels election exit polls should not be trusted.  I disagree with Silver on this point.  He feels that exit polls are wrong, and his arguments include the sample design of the exit poll (a cluster design).  His argument was that a cluster design increases the margin of error.  This is a true statement, but it misses the whole point of sample design and the fact that the exit poll uses a 99.9% confidence level to call a race, which by itself increases the margin of error.  This is because news networks are not in the business of calling races wrong and looking foolish.  Exit polls serve their purpose and have ultimately been quite accurate in calling races.  Exit polls also serve the purpose of helping give context to why a candidate won or lost.