I had the privilege of working on this with our Edison Research Podcast Consumer Tracker team. Edison released the top 50 podcasts for Q1 2021 through Q4 2021. https://www.edisonresearch.com/the-top-50-most-listened-to-podcasts-in-the-u-s-q4-2021/
Spring 2021 CO/WY American Statistical Association Presentation
R Code for Election Posterior Distribution From a Random Sample
I wrote a summary article a couple of years ago discussing some probability aspects of the 2012 Presidential general election with a particular focus on exit polling. I’ve had a few people email me asking for the code I used in some if the examples. I have used this code since before the 2008 elections so I’ve made several changes over the years and now use it for many projects. But here is the basic code to take the state estimates and compute the posterior distribution of the electoral votes. This code should run “right out of the box”. This approach works for methodologies considered simple random samples such as a landline/cell phone poll (e.g. surveying absentee voters by phone). However, applying this to an exit poll methodology is more complex than a phone poll as an exit poll is actually a stratified cluster sample design. For now I am posting the simple random sample code where if one wants they can extend it to more complex designs and models.
The following snippet of code simulates the distribution of vote using the Dirichlet probability distribution. Though in this instance the Beta distribution would also work well enough as the Dirichlet is a multivariate generalization of the Beta. Using the Dirichlet distribution allows the distribution to be built using the top two candidates and then all other third-party candidates. Here the third-party candidates generally make up an insignificant portion of the vote. The example data is artificially generated but is based on true data. The data can easily be replaced with any other data.
In addition to extending this code to more complex sample designs this code can be adjusted to accommodate more complex models or alternate distributions. This way other known variables can be applied to the model.
## This data was simulated during an earlier process for purposes of this example. ## The simulation is based on actual data collected from other surveys. ## "Probability-Based Estimation and the 2012 Presidential Election Exit Poll" ## Use gtools for rdirichlet() library(gtools) raw_txt = "Dem.pct EV size State Rep.pct AK 40.50108 3 732 AK 55.49892 AL 38.62712 9 405 AL 60.37288 AR 40.56291 6 504 AR 59.43709 AZ 46.89617 11 859 AZ 53.10383 CA 61.49733 55 1005 CA 36.50267 CT 61.98948 7 900 CT 38.01052 DC 96.70255 3 501 DC 03.29745 DE 64.25041 3 588 DE 35.74959 FL 51.73604 29 738 FL 48.26396 GA 46.59993 16 596 GA 52.40007 HI 70.62515 4 761 HI 27.37485 IA 52.72411 6 1008 IA 46.27589 ID 39.34329 4 603 ID 60.65671 IL 63.53229 20 225 IL 36.46771 IN 49.95051 11 937 IN 50.04949 KS 37.85524 6 819 KS 62.14476 KY 39.42100 8 567 KY 60.57900 LA 40.41514 8 541 LA 57.58486 MA 64.06700 11 741 MA 35.93300 MD 68.46216 10 257 MD 31.53784 ME 60.95495 4 573 ME 39.04505 MI 59.32956 16 152 MI 40.67044 MN 56.39913 10 897 MN 43.60087 MO 46.92633 10 906 MO 53.07367 MS 48.60395 6 573 MS 51.39605 MT 42.38009 3 728 MT 55.61991 NC 48.08133 15 782 NC 49.91867 ND 39.72801 3 514 ND 60.27199 NE 32.55075 5 484 NE 67.44925 NH 56.95381 4 932 NH 43.04619 NJ 60.31767 14 388 NJ 39.68233 NM 56.23601 5 282 NM 43.76399 NV 50.96606 6 918 NV 49.03394 NY 66.55607 29 955 NY 33.44393 OH 51.46216 18 929 OH 47.53784 OK 34.13442 7 708 OK 65.86558 PA 52.38386 20 970 PA 45.61614 RI 66.38853 4 411 RI 33.61147 SC 49.62293 9 663 SC 50.37707 SD 39.74359 3 546 SD 60.25641 TN 46.22093 11 344 TN 53.77907 TX 41.57170 38 918 TX 57.42830 UT 43.70502 6 510 UT 56.29498 VA 58.95710 13 642 VA 41.04290 VT 73.74302 3 760 VT 26.25698 WI 55.49687 10 436 WI 44.50313 WV 36.40509 5 427 WV 62.59491 WY 27.88871 3 503 WY 68.11129 CO 52.02242 9 729 CO 46.97758 OR 54.95128 7 437 OR 42.04872 WA 56.7056 12 437 WA 42.2944"; raw_data = textConnection(raw_txt) raw = read.table(raw_data, header=TRUE, comment.char="#", sep="") close.connection(raw_data) ## Function to simulate each state outcome p.win = function(state){ #Dirichlet distribution because there can be multiple candidates p=rdirichlet(1000000, raw$size[state]*c(raw$Rep.pct[state],raw$Dem.pct[state],(100-raw$Rep.pct[state]-raw$Dem.pct[state]))/100+1) mean(p[,2]>p[,1]) } run.simulation=function(){ ## Binomial distribution because in all states except Nebraska and Maine it is winner takes all. ## In NE and ME they use the Congressional District Method. But often all votes go to the same candidate. ## During 2008 election was the first and last time NE split it's vote when Obama received 1 electoral vote. winner=rbinom(nrow(raw),1,probability.of.win) sum(raw$EV*winner) } ## Iterate over the states ## This can be adjusted to further account for within state sample design rather than assuming an SRS within each state. ## Note that an exit poll style sample design is not an SRS but is a stratified, cluster design p.win.state = sapply(1:nrow(raw),p.win) ## Renamed to perform other functions if desired ## Set Obama.win.probs for the sim.election() function probability.of.win = p.win.state # ## Replicate the simulation. A greater number of replicates will smooth out the distribution. electoral.vote.simulation = replicate(500000,run.simulation()) ( sim.median = median(electoral.vote.simulation) ) #Calculate the median from the simulation ## Graph it. hist(electoral.vote.simulation, nclass=1000, main='Simulation of Electoral Vote', xlab='Electoral Votes') credible.interval = quantile(electoral.vote.simulation, prob=c(.025, .975)) abline(v=credible.interval, col=2) abline(v=sim.median, col=3, lwd=3) ## Also, an individual state can be examined my.state = 'CA' ## Enter 2-letter state abbreviation j.state = which(raw$State==my.state) p.state=rdirichlet(100000, raw$size[j.state]* c(raw$Rep.pct[j.state],raw$Dem.pct[j.state],100-raw$Rep.pct[j.state]-raw$Dem.pct[j.state])/100+1) hist(p.state[,2], nclass=1000, main=paste('Simulation for',my.state), xlab='Proportion of Vote') median(p.state[,2]) abline(v=median(p.state[,2]), lwd=3, col=3) abline(v=quantile(p.state[,2], prob=c(.025,.975)), col=2)
Recent Articles
I have uploaded a few papers I have written and presented at some national conferences over the past several years. Currently, all the articles relate to election research.
The Birthday Simulation
Nothing novel or unique about this problem. This just extends the problem to measure the probability to three or more people sharing the same birthday using simulation approaches. Though there are other ways to approach this problem with built-in functions the example below shows some of the individual steps.
For two people it’s fairly straight forward and with a group of about 22 people the probability that two people share the same birthday is about 0.5. For groups approaching 50 there is an extremely high probability that two people share the same birthday.
When determining that three (or more) people have the same birthday the probability decreases fairly quickly compared to measuring only two people. A fairly large group would be needed to find three people with the same birthday.
Here is some R code to determine these probabilities.
n.rep = 5000 theta.val = 75 doy = seq(from=1, to=365, by=1) sim.mat = matrix(NA, nrow=theta.val, ncol=4) getProb = function(n){ q = 1 - seq(0,n-1)/365 p = 1 - prod(q) } theta.list = seq(from=2, to=75, by=1) p.graph = sapply(theta.list, getProb) fifty.fifty = which(p.graph >.5)[1] plot(p.graph, main="Probability Two People Have the Same Birthday", ylab='Probability', xlab="Number of People in Group") lines(p.graph) abline(h=.5, v=fifty.fifty) ## For matching multiple people ## Runs a little slow. for(i in 2:theta.val){ bday = replicate(n.rep, sample(doy, size=i, replace=T) ) bday[1,] bday.table = apply(bday, 2, table) sim.2 = ifelse( unlist( lapply(bday.table, max) ) >=2, 1, 0) sim.3 = ifelse( unlist( lapply(bday.table, max) ) >=3, 1, 0) sim.4 = ifelse( unlist( lapply(bday.table, max) ) >=4, 1, 0) sim.mat[i,1] = i sim.mat[i,2] = sum(sim.2)/length(sim.2) sim.mat[i,3] = sum(sim.3)/length(sim.3) sim.mat[i,4] = sum(sim.4)/length(sim.4) } graph.sim = t( sim.mat[,2:4] ) colnames(graph.sim) = sim.mat[,1] barplot(graph.sim[1,], ylim=c(0,1), col="red", main="Probability of Having Multiple People with the Same Birthday", xlab="People with Birthday", ylab="Probability") barplot(graph.sim[2,], ylim=c(0,1), col="blue", add=T) barplot(graph.sim[3,], ylim=c(0,1), col="black", add=T) abline(h=.5) legend("topleft", c("2","3","4"), col=c("red","blue","black"), lwd=3)
Connecting TOAD For MySQL, MySQL Workbench, and R to Amazon AWS EC2 Using SSH Tunneling
I often use Amazon EC2 to store and retrieve data when I need either additional storage or higher computing capacity. In this tutorial I’ll share how to connect to a MySQL database so that one can retrieve the data and do the analysis. I tend to use either TOAD for MySQL or MySQL Workbench to run and test queries against a MySQL database. I generally use MySQL Workbench when I’m sitting on a Linux-based operating system and TOAD when I’m on Windows. It’s not terribly difficult to connect to EC2 but it is also not as simple as typing localhost as it requires a few additional steps.
If you’re reading this article then I’ll assume that you already know about Amazon EC2 and are familiar with the basics. Make sure you have all the security groups (i.e. IP Addresses) set up so that a database connection (for MySQL usually through port 3306) can be made from wherever you are located.
All that is needed is the Public DNS that is available once you start your Amazon instance and the key pair file (.pem) used for the Amazon instance (the one Amazon tells you not to lose). The Public DNS can just be copied and pasted from the specific AWS instance you want to connect. The key pair .pem file should already be saved to your local hard drive. I generally use an Ubuntu operating system on an Amazon instance so some of the connection information is specific to that instance type (e.g. username of ubuntu). One additional tip is to turn off any locally running MySQL database on your desktop.
MySQL Workbench
MySQL Workbench is probably the easiest as everything can be done right from within MySQL WorkBench. The Public DNS from the instance can be added to the SSH Hostname line. This particular Amazon instance uses the username ubuntu. The SSH Key File is the file that is generated from Amazon. Next is the MySQL Hostname. The database host is relative to the SSH connection. Once you have connected to the remote location it is now relative to the remote location and MySQL Hostname will be on localhost (though you can create more complex connections). Then you can use the database username and password you created (or were provided from the database administrator). Using MySQL Workbench is probably the easiest way to connect as the connection process is all self-contain. However, this will only work for MySQL Workbench and if you want to use other software (e.g. R) to connect to the database then this approach alone will not work.
TOAD for MySQL
TOAD for MySQL requires an additional step as is it does not have a built-in SSH tunneling option. So the tunneling process requires separate software. There are several ways to do this but two freely available options are either PuTTY or Bitvise SSH Client (Bitvise is free for individual use but otherwise there is a fee). PuTTY is useful and completely free however it requires that the .pem file be converted to a PuTTY specific .ppk file using the puttygen.exe application.
PuTTY
First, take the Public DNS and add it to the PuTTY session. Though not required I would suggest going to the Connection category on the left and changing the keepalives to something like 120 and then check the Enable TCP keepalives. Otherwise, after a few minutes your connection will drop with inactivity.
Second, you need to select the SSH authentication (located within the Connection category). This will be from the .ppk file created from the puttygen.exe file. Select the .ppk file and add it to the Private key file for authentication.
Third, you need to enable SSH port forwarding. So that way you can connect to your local desktop but have all that forwarded on to Amazon. This way when connecting to your localhost at port 3306 you will actually be connecting to your Amazon Public DNS location.
Bitvise
Like PuTTY you will enter your Public DNS and username. Then click on the User keypair manager. From the keypair manager import the .pem file into Bitvise. Take note of the slot column as that will be used to select the key pair you want to use. Once imported use the Initial method located directly below the username and select the keypair slot (e.g. “public-key slot 1”).
Then to do the SSH port forwarding you will use the C2S tab across the top. Simply set the Listening interface and port to 127.0.0.1:3306 and the destination port to 127.0.0.1:3306. Once completed you can save the profile and then login.
Connecting Using TOAD
Finally, once you have opened an SSH connection using PuTTY or Bitvise you can then open up TOAD and create a new database connection. Because everything on 127.0.0.1 is being forwarded you will want to connect using 127.0.0.1 as your Host. Then enter your database username and password.
Using R
A while back I wrote up an article on how to connect to a database using R (http://statistical-research.com/connecting-to-a-sql-server-and-mysql-database-using-ms-windows/). To connect to an Amazon EC2 MySQL database from R the same process is used. PuTTY or Bitvise will need to be running and since it has enabled SSH port forwarding, anything on 127.0.0.1 using port 3306 will be forwarded on to the Amazon Public DNS. So all that is needed is to create an ODBC connection pointing to 127.0.0.1 port 3306 and that connection will be forwarded on to Amazon. So if you ever change the Public DNS by stopping the instance you don’t need to change anything in the ODBC configuration. Only the PuTTY or Bitvise host will need to be updated.
Probabilities and P-Values
P-values have been an issue for statistician for an extremely long time. I’ve seen situations where entire narratives are written without p-values and only provide the effects. It can also be used as a data reduction tool but ultimately it reduces the world into a binary system: yes/no, accept/reject. Not only that but the binary threshold is determined on a roughly subjective criterion. It reminds me of Jon Stewart’s recent discussion “good thing/bad thing“. Taking the most complex of issues and reducing them down to two outcomes.
Below is a simple graph (nothing elaborate) that shows how p-values alone don’t tell the whole story. Sometimes, data is reduced so much that solid decisions are difficult to make. The graph on the left shows a simulated situation where there are identical p-values but very different effects. The graph on the right shows where the p-values are very different, and one is quite low, but the simulated effects are the same.
P-values and confidence intervals have quite a bit in common and when interpreted incorrectly can be misleading. Simply put a p-value is the probability of the observed data (e.g. ), or more extreme data, given the null hypothesis is true[1, 2, 3, see also Moore’s book The Basic Practice of Statistics, 2nd ed. p 321-322].
Ultimately, I could write a fairly lengthy discussion, and there are several others (e.g. Gelman), on the topic. However, for this blog post I’ll highlight this one idea that with p-values it’s possible to get differing effects with the same p-values and differing p-values with the same effect. In the example below I opted for extreme data to highlight the point. Here’s a quick matplot of the example…
In conclusion, P-values are useful and can be useful. But we sometimes need to look beyond the test statistic and p-value to understand the data. Below is some code to simulate some extreme datasets.
set.seed(1234) # fix the one sample. However, replicate is randomized. So exact replication of these data not possible. Need a lot and sometimes it doesn't always work so it may need to be rerun. x1 = rnorm(10, 0, 1) x2 = replicate(500000, rnorm(10, 0, 5)) set.seed(1234) # same as previous seed. x3 = rnorm(50, 0, 1) x4 = replicate(500000, rnorm(10, 0, 4)) get.ttest = function(x){ ## This is equivelent to the one sample t-test from t.test() ## just explicitly showing the formula t.x1 = abs( ( mean(x) - 0 ) / ( sd(x)/sqrt(length(x)) ) ) p.value = pt(t.x1, 9, lower.tail=FALSE)*2 # two-sided return(p.value) } get.ttest.ci = function(x){ ## This is equivelent to the one sample t-test from t.test() ## just explicitly showing the formula me = qt(1-.05/2, length(x)-1) * sd(x)/sqrt(length(x)) ll = mean(x)-me ul = mean(x)+me return(rbind(ll, ul)) } ### Find data with the greatest difference in effect but yet the same p-value. ## No real reason for this approach it just helps finding extreme sets of data. ## Need a very high number of simulations to ensure the constraints on effect.match are meet. sim.p = apply(x2, 2, get.ttest) sim.ci = apply(x2, 2, get.ttest.ci) sim.dif = sim.ci[1,]-sim.ci[2,] effect.match = x2[,round(get.ttest(x1),3) == round(sim.p,3) & sim.dif==min(sim.dif)] sim.max.effect = apply(effect.match, 2, mean) - mean(x1) pick.max.effect = which( sim.max.effect == max(sim.max.effect) ) pick.small.ci = effect.match[,pick.max.effect] ci.matrix = cbind( get.ttest.ci(x1), get.ttest.ci(pick.small.ci) ) ### Find data with the same effect and has the greatest difference in p-value sim.mean = apply(x4, 2, mean) effect.match.mean = x4[, round(mean(x3),1) == round(sim.mean, 1)] sim.max.p = apply(effect.match.mean, 2, get.ttest) - get.ttest(x3) pick.max.p = which( sim.max.p == max(sim.max.p) ) pick.small.effect = effect.match.mean[,pick.max.p] ci.matrix.effect = cbind( get.ttest.ci(x3), get.ttest.ci(pick.small.effect) ) ###Plot the graph par(mfrow=c(1,2)) ind=1:ncol( ci.matrix ) ind.odd=seq(1,ncol( ci.matrix ), by=1) ind.even=seq(2,ncol( ci.matrix ), by=1) matplot(rbind(ind,ind),ci.matrix,type="l",lty=1, lwd=1, col=1, xlab="Group",ylab="Response Variable, y", main=paste("Comparison of data with the same p-value of ", round(get.ttest(x1),2),"\nbut different effects", sep="") , xaxt='n') axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75) axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75) points(ind,c(mean(x1),mean(pick.small.ci)),pch=19, cex=1, col='red') ###Plot the graph ind=1:ncol( ci.matrix.effect ) ind.odd=seq(1,ncol( ci.matrix.effect ), by=1) ind.even=seq(2,ncol( ci.matrix.effect ), by=1) matplot(rbind(ind,ind),ci.matrix.effect,type="l",lty=1, lwd=1, col=1, xlab="Group",ylab="Response Variable, y", main=paste("Comparison of data with the same effect of ", round(mean(x3),1), "\n but different p-values ", sprintf("%.3f", get.ttest(x3)), " and ", sprintf("%.3f", get.ttest(x4) ) , sep="") , xaxt='n') axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75) axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75) points(ind,c(mean(x3),mean(pick.small.effect)),pch=19, cex=1, col='red')
Some Options for Testing Tables
Contingency tables are a very good way to summarize discrete data. They are quite easy to construct and reasonably easy to understand. However, there are many nuances with tables and care should be taken when making conclusions related to the data. Here are just a few thoughts on the topic.
Dealing with sparse data
On one end of the spectrum there is a struggle to deal with the mass amounts of available data and trying to make sense of data in the petabyte (and larger) range. At the other end of the spectrum lacking sufficient data has its own problems.
Collapsing row or column levels
Sometimes when working with discrete data if certain factor levels lack adequate data then it may be possible to combine the factor levels together. This may be done fairly easily with 4 and 5-scale Likert scales and, in fact, happens quite frequently. Taking this approach may allow for sufficient data to make conclusions without violating underlying assumption. The following tables of some random data show how cells can be collapse such that the basic assumptions are met.
18-29 | 30-45 | 46-59 | 60+ | |
Very Strongly Agree | 1 | 1 | 12 | 10 |
Strongly Agree | 17 | 13 | 16 | 18 |
Undecided | 13 | 6 | 15 | 2 |
Strongly Disagree | 10 | 7 | 8 | 19 |
Very Strongly Disagree | 0 | 11 | 10 | 2 |
18-45 | 46+ | |
Agree | 32 | 56 |
Undecided | 19 | 17 |
Disagree | 28 | 39 |
Sanity Checks
Though it’s not a formal test just doing some sanity checks and some other sensitivity analysis is a good way to check the stability of the test. If one can take a single observation and move it to a different cell and that changes the decision then one should reevaluate the criteria for making a conclusion.
Using a basic chi square test (though other statistical tests would help with this problem, including the chi square correction for continuity) gives a p-value of .0363 for the following table of some made-up data and would be considered significant at . However, by simply moving one observation from the Cold/Fast group to the Cold/Slow group the p-value of .1575 is no longer significant at . The volatility of the data is suspect and decisions should be taken with caution.
Fast | Slow | Total | |
Hot | 7 | 2 | 9 |
Cold | 1 | 4 | 5 |
Total | 8 | 6 | 14 |
Fast | Slow | Total | |
Hot | 7 | 2 | 9 |
Cold | 2 | 3 | 5 |
Total | 9 | 5 | 14 |
Sparse Data Tests
There are many tests to handle many different categorical data situations. Listed here are a few of the common approaches.
Chi Square With Yates Correction
Sometimes decreasing the chi square statistics (and increasing the p-value) is sufficient for the specific case of a 2 x 2 table. In R, for example, this is applied by default.
Fisher’s Exact Test
This is often the immediate fall back for 2 x 2 tables. This test is based on the hypergeometric distribution. However, one important rule for this test is that it is conditioned on the marginal totals. An example counter to this rule is to take a random sample of 15 people. Suppose 5 are male and 10 are female. Here the chi square starts to break down. But the other problem is that Fishers Exact Test calls for the marginals to be fixed and that is not the case. If another random sample of 15 people is selected we could get a different number of males and females.
Fisher’s Exact Test was developed in a time (1934) when a computer just wasn’t available to play around with really complex hypergeometric distributions. A 2 x 2 table was really the only feasible sized table. Consequently, Fisher’s Exact Test was designed for 2 x 2 tables but can be used on any m x n sized table.
So why not always use Fisher’s Exact Test? At some point the two begin to converge and using the exact test may just be too exact. Alan Agresti and Brent Coull write an article here (pdf) that discusses this topic in the context of interval estimation.
Barnard Test
Is similar to the Fisher Test but this test overcomes the problem of conditioning on the marginal. Like many tests this applies only to a 2 x 2 table.
McNemar Exact Test
McNemar’s test is used when the data are correlated. For example, a matched pairs design where there is a before and after treatment or when a question is asked on a repeat survey. Like Fishers Test this provide options for smaller sample sizes.
These are only a few of the options available and great care should be taken in any analysis but smaller sample sizes add a bit more complexity. There are many other options including logistic regression as well as other non-parametric tests that are available for small tables.
Examples
Here is some R code that shows some of the tests I described: Chi Square, Fishers Exact Test, Barnards Test, and McNemars Exact Test.
x1 = matrix(c(7, 1, 2, 4), nrow = 2, dimnames = list(c("Hot", "Cold"), c("Fast", "Slow"))) x2 = matrix(c(7, 2, 2, 3), nrow = 2, dimnames = list(c("Hot", "Cold"), c("Fast", "Slow"))) chisq.test(x1, correct=FALSE) chisq.test(x2, correct=FALSE) fisher.test(x1, alternative="two.sided", conf.int=TRUE, conf.level=0.95) fisher.test(x2, alternative="two.sided", conf.int=TRUE, conf.level=0.95) x3 = matrix(c(50, 5, 3, 15), nrow = 2, dimnames = list("Replication 1" = c("Hot", "Cold"), "Replication 2" = c("Hot", "Cold"))) mcnemar.test(x3, correct=TRUE) library(Barnard) barnardw.test(x1[1,1],x1[1,2],x1[2,1],x1[2,2]) library(ade4) table.cont(x1, csi = 2, col.labels = colnames(x1), clabel.r = 1.5, clabel.c = 1.5)
Spatial Clustering With Equal Sizes
This is a problem I have encountered many times where the goal is to take a sample of spatial locations and apply constraints to the algorithm. In addition to providing a pre-determined number of K clusters a fixed size of elements needs to be held constant within each cluster. An application of this algorithm is when one needs to geographically stratify and pre-allocate the sample frame but keep the sizes the same (or constant) to facilitate operational fielding of a study.
I have done a cursory look for other approaches to this problem but have come up fairly empty. I would certainly be interested in other approaches that are used. However, in general, this is somewhat counter to the textbook teaching of k-means clustering where cluster sizes naturally form based on the specified criteria.
This is one of several approaches to determine the optimal clustering when dealing with spatial data. Other cluster assignment approaches could be used. One in particular is the CLARANS algorithm, but like other clustering approaches it does not constrain the sizes of the clusters. Ultimately the goal here is to keep the clusters the same size and to reduce the total spatial distance from the center of the cluster.
I created a random dataset with just under 10000 randomly selected geographic coordinates (removing Alaska and Hawaii) in the 48 states. Based on the latitude and longitude the locations can be clustered and the sizes constrained. In this example I use exactly equal sized clusters (except when n is not divisible by K), . However, the option exists where one could pre-allocated the cluster sizes so they are fixed in advance but are different from cluster to cluster and then constrained to those sizes if desired.
Centers | ||
Latitude | Longitude | Cluster |
37.46644 | -113.412 | 1 |
40.24648 | -74.7457 | 2 |
31.89746 | -85.5054 | 3 |
41.08111 | -85.3031 | 4 |
As for the distance function I take a couple things into account. First, the earth does not have a constant radius. Because the earth is spinning around so fast it flattens a bit. So I built into the distance function those two radii. This way when traveling over a greater distance of latitude the correct distance can be calculated based on the flattening earth. Second, because the earth is mostly round the Pythagorean theorem doesn’t exactly apply and a more accurate distance around a curved earth is needed. Consequently, the flattening of the earth as well as the curvature of the earth is combined as a more complex formula that is used in the function.
I’m still working on fine tuning and making the algorithm better but my initial algorithm is as follows:
1) set equal cluster size, e.g. n/k, or assign specified sizes.
2) initialize cluster assignment. I’m still working on a better approach but for now I just randomly select, order and systematically assign it through all observations.
3) calculate the center of the clusters.
4) take the first observation and assign it to the closest cluster.
5) since one cluster now has and another has establish a trade to even out the sizes. The closest observation to the giving cluster is then traded.
6) this process continues through all locations.
7) the sum of the distance from each observation to its assigned centroid is calculated.
8) if the next iteration doesn’t decrease that distance (within the tolerance threshold) then stop.
9) continue the process with step 3 until the maximum iteration is meet.
The following code is what I used for my prototype and is not strictly optimized and may take several minutes (~15) on datasets with many thousands of observations. I’ll provide optimized R, Python, and maybe some PHP code at a later time. I’ve included a verbose version of the R code where it will provide convergence information as well as a timer indicating how long it will take before the maximum iteration is met.
# Convert to radian as_radians = function(theta=0){ return(theta * pi / 180) } calc_dist = function(fr, to) { lat1 = as_radians(fr$lat) lon1 = as_radians(fr$lon) lat2 = as_radians(to$lat) lon2 = as_radians(to$lon) a = 3963.191; b = 3949.903; numerator = ( a^2 * cos(lat2) )^2 + ( b^2 * sin(lat2) ) ^2 denominator = ( a * cos(lat2) )^2 + ( b * sin(lat2) )^2 radiusofearth = sqrt(numerator/denominator) #Accounts for the ellipticity of the earth. d = radiusofearth * acos( sin(lat1) * sin(lat2) + cos(lat1)*cos(lat2)*cos(lon2 - lon1) ) d.return = list(distance_miles=d) return(d.return) } raw.og = read.csv("http://statistical-research.com/wp-content/uploads/2013/11/sample_geo.txt", header=T, sep="\t") orig.data = raw.og[,1:3] dirichletClusters_constrained = function(orig.data, k=5, max.iter =50, tolerance = 1, plot.iter=TRUE) { fr = to = NULL r.k.start = sample(seq(1:k)) n = nrow( orig.data ) k.size = ceiling(n/k) initial.clusters = rep(r.k.start, k.size) if(n%%length(initial.clusters)!=0){ exclude.k = length(initial.clusters) - n%%length(initial.clusters) } else { exclude.k = 0 } orig.data$cluster = initial.clusters[1:(length(initial.clusters)-exclude.k)] orig.data$cluster_original = orig.data$cluster ## Calc centers and merge mu = cbind( by(orig.data$Latitude, orig.data$cluster, mean), by(orig.data$Longitude, orig.data$cluster, mean), seq(1:k) ) tmp1 = matrix( match(orig.data$cluster, mu[,3]) ) orig.data.centers = cbind(as.matrix(orig.data), mu[tmp1,])[,c(1:2,4:6)] ## Calc initial distance from centers fr$lat = orig.data.centers[,3]; fr$lon = orig.data.centers[,4] to$lat = orig.data.centers[,1]; to$lon = orig.data.centers[,2] orig.data$distance.from.center = calc_dist(fr, to)$distance_miles orig.data$distance.from.center_original = orig.data$distance.from.center ## Set some initial configuration values is.converged = FALSE iteration = 0 error.old = Inf error.curr = Inf while ( !is.converged &amp;&amp; iteration &lt; max.iter ) { # Iterate until threshold or maximum iterations if(plot.iter==TRUE){ plot(orig.data$Longitude, orig.data$Latitude, col=orig.data$cluster, pch=16, cex=.6, xlab="Longitude",ylab="Latitude") } iteration = iteration + 1 start.time = as.numeric(Sys.time()) cat("Iteration ", iteration,sep="") for( i in 1:n ) { # Iterate over each observation and measure the distance each observation' from its mean center # Produces an exchange. It takes the observation closest to it's mean and in return it gives the observation # closest to the giver, k, mean fr = to = distances = NULL for( j in 1:k ){ # Determine the distance from each k group fr$lat = orig.data$Latitude[i]; fr$lon = orig.data$Longitude[i] to$lat = mu[j,1]; to$lon = mu[j,2] distances[j] = as.numeric( calc_dist(fr, to) ) } # Which k cluster is the observation closest. which.min.distance = which(distances==min(distances), arr.ind=TRUE) previous.cluster = orig.data$cluster[i] orig.data$cluster[i] = which.min.distance # Replace cluster with closest cluster # Trade an observation that is closest to the giving cluster if(previous.cluster != which.min.distance){ new.cluster.group = orig.data[orig.data$cluster==which.min.distance,] fr$lat = mu[previous.cluster,1]; fr$lon = mu[previous.cluster,2] to$lat = new.cluster.group$Latitude; to$lon = new.cluster.group$Longitude new.cluster.group$tmp.dist = calc_dist(fr, to)$distance_miles take.out.new.cluster.group = which(new.cluster.group$tmp.dist==min(new.cluster.group$tmp.dist), arr.ind=TRUE) LocationID = new.cluster.group$LocationID[take.out.new.cluster.group] orig.data$cluster[orig.data$LocationID == LocationID] = previous.cluster } } # Calculate new cluster means mu = cbind( by(orig.data$Latitude, orig.data$cluster, mean), by(orig.data$Longitude, orig.data$cluster, mean), seq(1:k) ) tmp1 = matrix( match(orig.data$cluster, mu[,3]) ) orig.data.centers = cbind(as.matrix(orig.data), mu[tmp1,])[,c(1:2,4:6)] mu = cbind( by(orig.data$Latitude, orig.data$cluster, mean), by(orig.data$Longitude, orig.data$cluster, mean), seq(1:k) ) ## Calc initial distance from centers fr$lat = orig.data.centers[,3]; fr$lon = orig.data.centers[,4] to$lat = orig.data.centers[,1]; to$lon = orig.data.centers[,2] orig.data$distance.from.center = calc_dist(fr, to)$distance_miles # Test for convergence. Is the previous distance within the threshold of the current total distance from center error.curr = sum(orig.data$distance.from.center) error.diff = abs( error.old - error.curr ) error.old = error.curr if( !is.nan( error.diff ) & error.diff < tolerance ) { is.converged = TRUE } # Set a time to see how long the process will take is going through all iterations stop.time = as.numeric(Sys.time()) hour.diff = (((stop.time - start.time) * (max.iter - iteration))/60)/60 cat("\n Error ",error.diff," Hours remain from iterations ", hour.diff,"\n") # Write out iterations. Can later be used as a starting point if iterations need to pause write.table(orig.data, paste("C:\\optimize_iteration_",iteration,"_data.csv", sep=""), sep=",", row.names=F) } centers = data.frame(mu) ret.val = list("centers" = centers, "cluster" = factor(orig.data$cluster), "LocationID" = orig.data$LocationID, "Latitude" = orig.data$Latitude, "Longitude" = orig.data$Longitude, "k" = k, "iterations" = iteration, "error.diff" = error.diff) return(ret.val) } # Constrained clustering cl_constrain = dirichletClusters_constrained(orig.data, k=4, max.iter=5, tolerance=.0001, plot.iter=TRUE) table( cl_constrain$cluster ) plot(cl_constrain$Longitude, cl_constrain$Latitude, col=cl_constrain$cluster, pch=16, cex=.6, xlab="Longitude",ylab="Latitude") library(maps) map("state", add=T) points(cl_constrain$centers[,c(2,1)], pch=4, cex=2, col='orange', lwd=4)
Tracking the 2013 Hurricane Season
With it nearing the end of hurricane season it’s only appropriate to do a brief summary of the activity this year. It’s been a surprisingly low-key season as far as hurricanes are concerned. There have been only a few hurricanes and the barometric pressure of any hurricane this season has not even come close to hurricane Sandy (which broke the lowest barometric pressure record).
A lot more analysis and visualization could be done with these data. I will do some further statistical analysis and maybe some time series comparing past recorded hurricanes or maybe some other kriging approaches. Perhaps when I have some more spare time I’ll play with the data even more and look into the sustained wind speed and some of the other hurricane indices (e.g. the Power Dissipation Index).
The following graph shows a histogram of the barometric pressure and how it compares to hurricane Sandy.
In the mean time here is a short piece of code to produce the graphics for the 2013 hurricanes. It produces a graph of the area around the Atlantic Basin and pulls the data down from Unisys. It also highlights the East Coast states of the United States. I have a text file that I put together that lists all the hurricanes by year back through 2007.
Example Code
library(maps)
library(maptools)
library(rgdal)
library(OpenStreetMap) ## for future use to produce aerial image maps
library(raster)
year = 2013
hurricanes = read.table(file=paste(“http://statistical-research.com/wp-content/uploads/2013/10/hurricanes.txt”,sep=””), sep=”,”, fill=TRUE, header=T)
hurricanes = as.vector( subset(hurricanes, hurricanes$YEAR==year, select=c(“NAME”)) )
hurr.dat = list()
max.lon = max.lat = min.lat = min.lon = NULL
b.press = NULL
for(i in 1:nrow(hurricanes)){
raw = read.table(file=paste(“http://weather.unisys.com/hurricane/atlantic/”,year,”/”,hurricanes[i,],”/track.dat”,sep=””), skip=2,fill=TRUE)
colnames(raw) = c(“Latitude”,”Longitude”,”Time”,”WindSpeed”,”Pressure”,”Status”)
raw$Pressure = as.character(raw$Pressure)
raw$Pressure[raw$Pressure==”-“] = NA
raw$Pressure = as.numeric(raw$Pressure)
hurr.dat[[i]] = cbind(raw$Latitude, raw$Longitude, raw$Pressure)
b.press = c(b.press, min(raw$Pressure, na.rm=T))
if(is.null(max.lat)){
max.lat = max(raw$Latitude)
} else if(max.lat < max(raw$Latitude)) {
max.lat = max(raw$Latitude)
}
if(is.null(min.lat)){
min.lat = min(raw$Latitude)
} else if (min.lat > min(raw$Latitude)){
min.lat = min(raw$Latitude)
}
if(is.null(max.lon)){
max.lon = max(raw$Longitude)
} else if (max.lon < max(raw$Longitude)){
max.lon = max(raw$Longitude)
}
if(is.null(min.lon)){
min.lon = min(raw$Longitude)
} else if (min.lon > min(raw$Longitude)){
min.lon = min(raw$Longitude)
}
}
xlim <- c(min.lon-5,max.lon+10)
ylim <- c(min.lat-5,max.lat+10)
state.list <- c('new york','new jersey','virginia','massachusetts','connecticut','delaware','pennsylvania','maryland','north carolina','south carolina','georgia','florida',
'new hampshire','maine','district of columbia','west virginia','vermont')
my.map <- map("state", region=state.list, interior = FALSE, xlim=xlim, ylim=ylim)
map("state", region=state.list, boundary = TRUE, col="gray", add = TRUE,xlim=xlim)
map("world", boundary = TRUE, col="gray", add = TRUE,xlim=xlim)
for(j in 1:nrow(hurricanes)){
lines(x=hurr.dat[[j]][,2],y=hurr.dat[[j]][,1],col=j,cex=0.75)
points(x=hurr.dat[[j]][,2],y=hurr.dat[[j]][,1],pch=15,cex=0.4, col=j)
text(hurr.dat[[j]][1,2],
hurr.dat[[j]][1,1],hurricanes[j,])
}
title("Path of 2013 Hurricane Season")
box()
hist(b.press, xlim=c(920,1020), main="Histogram of Barometric Pressure for 2013",
xlab="Barometric Pressure (mb)", ylab="Frequency")
abline(v=940, col='blue',lwd=3)
text(958,.5,"<<2012 Hurricane Sandy")
[/sourcecode]