The Birthday Simulation

Nothing novel or unique about this problem.  This just extends the problem to measure the probability to three or more people sharing the same birthday using simulation approaches. Though there are other ways to approach with problem with built-in functions the example below show some of the individual steps.

For two people it’s fairly straight forward and with a group of about 22 people the probability that two people share the same birthday is about 0.5.  For groups approaching 50 there is an extremely high probability that two people share the same birthday

Prob2Birthday

 

When determining that three (or more) people have the same birthday the probability decreases fairly quickly compared to measuring only two people.  A fairly large group would be needed to find three people with the same birthday.

Birthday Plot

 

Here is some R code to determine these probabilities.


n.rep = 5000
theta.val = 75
doy = seq(from=1, to=365, by=1)
sim.mat = matrix(NA, nrow=theta.val, ncol=4)

getProb = function(n){
q = 1 - seq(0,n-1)/365
p = 1 - prod(q)
}

theta.list = seq(from=2, to=75, by=1)
p.graph = sapply(theta.list, getProb)
fifty.fifty = which(p.graph >.5)[1]
plot(p.graph, main="Probability Two People Have the Same Birthday", ylab='Probability', xlab="Number of People in Group")
lines(p.graph)
abline(h=.5, v=fifty.fifty)

 

## For matching multiple people

## Runs a little slow.
for(i in 2:theta.val){
bday = replicate(n.rep, sample(doy, size=i, replace=T) )
bday[1,]
bday.table = apply(bday, 2, table)

sim.2 = ifelse( unlist( lapply(bday.table, max) ) >=2, 1, 0)
sim.3 = ifelse( unlist( lapply(bday.table, max) ) >=3, 1, 0)
sim.4 = ifelse( unlist( lapply(bday.table, max) ) >=4, 1, 0)

sim.mat[i,1] = i
sim.mat[i,2] = sum(sim.2)/length(sim.2)
sim.mat[i,3] = sum(sim.3)/length(sim.3)
sim.mat[i,4] = sum(sim.4)/length(sim.4)

}

graph.sim = t( sim.mat[,2:4] )
colnames(graph.sim) = sim.mat[,1]

barplot(graph.sim[1,], ylim=c(0,1), col="red",
main="Probability of Having Multiple People with the Same Birthday",
xlab="People with Birthday",
ylab="Probability")
barplot(graph.sim[2,], ylim=c(0,1), col="blue", add=T)
barplot(graph.sim[3,], ylim=c(0,1), col="black", add=T)
abline(h=.5)
legend("topleft", c("2","3","4"), col=c("red","blue","black"), lwd=3)

Connecting TOAD For MySQL, MySQL Workbench, and R to Amazon AWS EC2 Using SSH Tunneling

awsI often use Amazon EC2 to store and retrieve data when I need either additional storage or higher computing capacity.  In this tutorial I’ll share how to connect to a MySQL database so that one can retrieve the data and do the analysis.  I tend to use either TOAD for MySQL or MySQL Workbench to run and test queries against a MySQL database.  I generally use MySQL Workbench when I’m sitting on a Linux-based operating system and TOAD when I’m on Windows.  It’s not terribly difficult to connect to EC2 but it is also not as simple as typing localhost as it requires a few additional steps.

If you’re reading this article then I’ll assume that you already know about Amazon EC2 and are familiar with the basics.  Make sure you have all the security groups (i.e. IP Addresses) set up so that a database connection (for MySQL usually through port 3306) can be made from wherever you are located.

All that is needed is the Public DNS that is available once you start your Amazon instance and the key pair file (.pem) used for the Amazon instance (the one Amazon tells you not to lose).  The Public DNS can just be copied and pasted from the specific AWS instance you want to connect. The key pair .pem file should already be saved to your local hard drive.  I generally use an Ubuntu operating system on an Amazon instance so some of the connection information is specific to that instance type (e.g. username of ubuntu).  One additional tip is to turn off any locally running MySQL database on your desktop.

MySQL Workbench

MySQL Workbench is probably the easiest as everything can be done right from within MySQL WorkBench.  The Public DNS from the instance can be added to the SSH Hostname line.  This particular Amazon instance uses the username ubuntu.  The SSH Key File is the file that is generated from Amazon. Next is the MySQL Hostname.  The database host is relative to the SSH connection.  Once you have connected to the remote location it is now relative to the remote location and MySQL Hostname will be on localhost (though you can create more complex connections).  Then you can use the database username and password you created (or were provided from the database administrator).  Using MySQL Workbench is probably the easiest way to connect as the connection process is all self-contain.  However, this will only work for MySQL Workbench and if you want to use other software (e.g. R) to connect to the database then this approach alone will not work.

TOAD for MySQL

TOAD for MySQL requires an additional step as is it does not have a built-in SSH tunneling option.  So the tunneling process requires separate software.  There are several ways to do this but two freely available options are either PuTTY or Bitvise SSH Client (Bitvise is free for individual use but otherwise there is a fee). PuTTY is useful and completely free however it requires that the .pem file be converted to a PuTTY specific .ppk file using the puttygen.exe application.

PuTTY

First, take the Public DNS and add it to the PuTTY session.  Though not required I would suggest going to the Connection category on the left and changing the keepalives to something like 120 and then check the Enable TCP keepalives.  Otherwise, after a few minutes your connection will drop with inactivity.

PuTTY

 

Second, you need to select the SSH authentication (located within the Connection category).  This will be from the .ppk file created from the puttygen.exe file.  Select the .ppk file and add it to the Private key file for authentication.

PuTTY Auth

 

Third, you need to enable SSH port forwarding.  So that way you can connect to your local desktop but have all that forwarded on to Amazon.  This way when connecting to your localhost at port 3306 you will actually be connecting to your Amazon Public DNS location.

PuTTY Tunnels

Bitvise

Like PuTTY you will enter your Public DNS and username.  Then click on the User keypair manager. From the keypair manager import the .pem file into Bitvise.  Take note of the slot column as that will be used to select the key pair you want to use.  Once imported use the Initial method located directly below the username and select the keypair slot (e.g. “public-key slot 1”).

Bitvise

 

Bitvise Key Pair

 

Then to do the SSH port forwarding you will use the C2S tab across the top.  Simply set the Listening interface and port to 127.0.0.1:3306 and the destination port to 127.0.0.1:3306.  Once completed you can save the profile and then login.

Bitvise C2S

 

Connecting Using TOAD

Finally, once you have opened an SSH connection using PuTTY or Bitvise you can then open up TOAD and create a new database connection.  Because everything on 127.0.0.1 is being forwarded you will want to connect using 127.0.0.1 as your Host.  Then enter your database username and password.

TOAD Connect

Using R

A while back I wrote up an article on how to connect to a database using R (http://statistical-research.com/connecting-to-a-sql-server-and-mysql-database-using-ms-windows/).  To connect to an Amazon EC2 MySQL database from R the same process is used.  PuTTY or Bitvise will need to be running and since it has enabled SSH port forwarding, anything on 127.0.0.1 using port 3306 will be forwarded on to the Amazon Public DNS.  So all that is needed is to create an ODBC connection pointing to 127.0.0.1 port 3306 and that connection will be forwarded on to Amazon.  So if you ever change the Public DNS by stopping the instance you don’t need to change anything in the ODBC configuration.  Only the PuTTY or Bitvise host will need to be updated.

AWS ODBC

Probabilities and P-Values

P-values seem to be the bane of a statistician’s existence.  I’ve seen situations where entire narratives are written without p-values and only provide the effects. It can also be used as a data reduction tool but ultimately it reduces the world into a binary system: yes/no, accept/reject. Not only that but the binary threshold is determined on a roughly subjective criterion.  It reminds me of Jon Stewart’s recent discussion “good thing/bad thing“.  Taking the most complex of issues and reducing them down to two outcomes.

Below is a simple graph (nothing elaborate) that shows how p-values alone don’t tell the whole story.  Sometimes, data is reduced so much that solid decisions are difficult to make. The graph on the left shows a simulated situation where there are identical p-values but very different effects. The graph on the right shows where the p-values are very different, and one is quite low, but the simulated effects are the same.

P-values and confidence intervals have quite a bit in common and when interpreted incorrectly can be misleading. Simply put a p-value is the probability of the observed data (e.g. \bar{x}), or more extreme data, given the null hypothesis is true[1, 2, 3, see also Moore's book The Basic Practice of Statistics, 2nd ed. p 321-322].

Ultimately, I could write a fairly lengthy discussion, and there are several others (e.g. Gelman), on the topic. However, for this blog post I’ll highlight this one idea that with p-values it’s possible to get differing effects with the same p-values and differing p-values with the same effect. In the example below I opted for extreme data to highlight the point. Here’s a quick little matplot of the example…

p-value_and_effects

In conclusion, don’t get me wrong. P-values are useful and I actually use them quite often. But we sometimes need to look beyond the test statistic and p-value to understand the data. Below is some code to simulate some extreme datasets.

set.seed(1234) # fix the one sample. However, replicate is randomized. So exact replication of these data not possible. Need a lot and sometimes it doesn't always work so it may need to be rerun.
x1 = rnorm(10, 0, 1)
x2 = replicate(500000, rnorm(10, 0, 5))
set.seed(1234) # same as previous seed.
x3 = rnorm(50, 0, 1)
x4 = replicate(500000, rnorm(10, 0, 4))

get.ttest = function(x){
## This is equivelent to the one sample t-test from t.test()
## just explicitly showing the formula
t.x1 = abs( ( mean(x) - 0 ) / ( sd(x)/sqrt(length(x)) ) )
p.value = pt(t.x1, 9, lower.tail=FALSE)*2 # two-sided
return(p.value)
}
get.ttest.ci = function(x){
## This is equivelent to the one sample t-test from t.test()
## just explicitly showing the formula
me = qt(1-.05/2, length(x)-1) * sd(x)/sqrt(length(x))
ll = mean(x)-me
ul = mean(x)+me
return(rbind(ll, ul))
}
### Find data with the greatest difference in effect but yet the same p-value.
## No real reason for this approach it just helps finding extreme sets of data.
## Need a very high number of simulations to ensure the constraints on effect.match are meet.
sim.p = apply(x2, 2, get.ttest)
sim.ci = apply(x2, 2, get.ttest.ci)
sim.dif = sim.ci[1,]-sim.ci[2,]
effect.match = x2[,round(get.ttest(x1),3) == round(sim.p,3) & sim.dif==min(sim.dif)]
sim.max.effect = apply(effect.match, 2, mean) - mean(x1)
pick.max.effect = which( sim.max.effect == max(sim.max.effect) )
pick.small.ci = effect.match[,pick.max.effect]
ci.matrix = cbind(
get.ttest.ci(x1),
get.ttest.ci(pick.small.ci)
)
### Find data with the same effect and has the greatest difference in p-value
sim.mean = apply(x4, 2, mean)
effect.match.mean = x4[, round(mean(x3),1) == round(sim.mean, 1)]
sim.max.p = apply(effect.match.mean, 2, get.ttest) - get.ttest(x3)
pick.max.p = which( sim.max.p == max(sim.max.p) )
pick.small.effect = effect.match.mean[,pick.max.p]
ci.matrix.effect = cbind(
get.ttest.ci(x3),
get.ttest.ci(pick.small.effect)
)
###Plot the graph
par(mfrow=c(1,2))
ind=1:ncol( ci.matrix )
ind.odd=seq(1,ncol( ci.matrix ), by=1)
ind.even=seq(2,ncol( ci.matrix ), by=1)
matplot(rbind(ind,ind),ci.matrix,type="l",lty=1, lwd=1, col=1,
xlab="Group",ylab="Response Variable, y", main=paste("Comparison of data with the same p-value of ", round(get.ttest(x1),2),"\nbut different effects", sep="")
, xaxt='n')
axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75)
points(ind,c(mean(x1),mean(pick.small.ci)),pch=19, cex=1, col='red')

###Plot the graph
ind=1:ncol( ci.matrix.effect )
ind.odd=seq(1,ncol( ci.matrix.effect ), by=1)
ind.even=seq(2,ncol( ci.matrix.effect ), by=1)
matplot(rbind(ind,ind),ci.matrix.effect,type="l",lty=1, lwd=1, col=1,
xlab="Group",ylab="Response Variable, y", main=paste("Comparison of data with the same effect of ", round(mean(x3),1), "\n but different p-values ", sprintf("%.3f", get.ttest(x3)), " and ", sprintf("%.3f", get.ttest(x4) ) , sep="")
, xaxt='n')
axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75)
points(ind,c(mean(x3),mean(pick.small.effect)),pch=19, cex=1, col='red')
%d bloggers like this: