# Probabilities and P-Values

P-values seem to be the bane of a statistician’s existence.  I’ve seen situations where entire narratives are written without p-values and only provide the effects. It can also be used as a data reduction tool but ultimately it reduces the world into a binary system: yes/no, accept/reject. Not only that but the binary threshold is determined on a roughly subjective criterion.  It reminds me of Jon Stewart’s recent discussion “good thing/bad thing“.  Taking the most complex of issues and reducing them down to two outcomes.

Below is a simple graph that shows how p-values don’t tell the whole story.  Sometimes, data is reduced so much that solid decisions are difficult to make. The graph on the left shows a situation where there are identical p-values but very different effects. The graph on the right shows where the p-values are very different, and one is quite low, but the effects are the same.

P-values and confidence intervals have quite a bit in common and when interpreted incorrectly can be misleading. Simply put a p-value is the probability of the observed data, or more extreme data, given the null hypothesis is true.

I could write a fairly lengthy discussion, and there are several others (e.g. Gelman), on the topic. However, for this blog post I’ll highlight this one idea that with p-values it’s possible to get differing effects with the same p-values and differing p-values with the same effect. In the example below I opted for extreme data to highlight the point.

set.seed(1234)
x1 = rnorm(10, 0, 1)
x2 = replicate(500000, rnorm(10, 0, 5))
set.seed(1234)
x3 = rnorm(50, 0, 1)
x4 = replicate(500000, rnorm(10, 0, 4))

get.ttest = function(x){
## This is equivelent to the one sample t-test from t.test()
## just explicitly showing the formula
t.x1 = abs( ( mean(x) - 0 ) / ( sd(x)/sqrt(length(x)) ) )
p.value = pt(t.x1, 9, lower.tail=FALSE)*2 # two-sided
return(p.value)
}
get.ttest.ci = function(x){
## This is equivelent to the one sample t-test from t.test()
## just explicitly showing the formula
me = qt(1-.05/2, length(x)-1) * sd(x1)/sqrt(length(x))
ll = mean(x)-me
ul = mean(x)+me
return(rbind(ll, ul))
}
### Find data with the greatest difference in effect but yet the same p-value.
## No real reason for this approach it just helps finding extreme sets of data.
sim.p = apply(x2, 2, get.ttest)
sim.ci = apply(x2, 2, get.ttest.ci)
sim.dif = sim.ci[1,]-sim.ci[2,]
effect.match = x2[,round(get.ttest(x1),3) == round(sim.p,3) &amp; sim.dif==min(sim.dif)]
sim.max.effect = apply(effect.match, 2, mean) - mean(x1)
pick.max.effect = which( sim.max.effect == max(sim.max.effect) )
pick.small.ci = effect.match[,pick.max.effect]
ci.matrix = cbind(
get.ttest.ci(x1),
get.ttest.ci(pick.small.ci)
)
### Find data with the same effect and has the greatest difference in p-value
sim.mean = apply(x4, 2, mean)
effect.match.mean = x4[, round(mean(x3),1) == round(sim.mean, 1)]
sim.max.p = apply(effect.match.mean, 2, get.ttest) - get.ttest(x3)
pick.max.p = which( sim.max.p == max(sim.max.p) )
pick.small.effect = effect.match.mean[,pick.max.p]
ci.matrix.effect = cbind(
get.ttest.ci(x3),
get.ttest.ci(pick.small.effect)
)
###Plot the graph
par(mfrow=c(1,2))
ind=1:ncol( ci.matrix )
ind.odd=seq(1,ncol( ci.matrix ), by=1)
ind.even=seq(2,ncol( ci.matrix ), by=1)
matplot(rbind(ind,ind),ci.matrix,type="l",lty=1, lwd=1, col=1,
xlab="Group",ylab="Response Variable, y", main=paste("Comparison of data with the same p-value of ", round(get.ttest(x1),2),"\nbut different effects", sep="")
, xaxt='n')
axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75)
points(ind,c(mean(x1),mean(pick.small.ci)),pch=19, cex=1, col='red')

###Plot the graph
ind=1:ncol( ci.matrix.effect )
ind.odd=seq(1,ncol( ci.matrix.effect ), by=1)
ind.even=seq(2,ncol( ci.matrix.effect ), by=1)
matplot(rbind(ind,ind),ci.matrix.effect,type="l",lty=1, lwd=1, col=1,
xlab="Group",ylab="Response Variable, y", main=paste("Comparison of data with the same effect of ", round(mean(x3),1), "\n but different p-values ", sprintf("%.3f", get.ttest(x3)), " and ", sprintf("%.3f", get.ttest(x4) ) , sep="")
, xaxt='n')
axis(side=1, at=ind.odd, tcl = -1.0, lty = 1, lwd = 0.5, labels=ind.odd, cex.axis=.75)
axis(side=1, at=ind.even, tcl = -0.7, lty = 1, lwd = 0.5, labels=rep("",length(ind.even)), cex.axis=.75)
points(ind,c(mean(x3),mean(pick.small.effect)),pch=19, cex=1, col='red')


# Some Options for Testing Tables

Contingency tables are a very good way to summarize discrete data.  They are quite easy to construct and reasonably easy to understand. However, there are many nuances with tables and care should be taken when making conclusions related to the data. Here are just a few thoughts on the topic.

## Dealing with sparse data

On one end of the spectrum there is a struggle to deal with the mass amounts of available data and trying to make sense of data in the petabyte (and larger) range. At the other end of the spectrum lacking sufficient data has its own problems.

### Collapsing row or column levels

Sometimes when working with discrete data if certain factor levels lack adequate data then it may be possible to combine the factor levels together. This may be done fairly easily with 4 and 5-scale Likert scales and, in fact, happens quite frequently. Taking this approach may allow for sufficient data to make conclusions without violating underlying assumption.  The following tables of some random data show how cells can be collapse such that the basic assumptions are met.

 18-29 30-45 46-59 60+ Very Strongly Agree 1 1 12 10 Strongly Agree 17 13 16 18 Undecided 13 6 15 2 Strongly Disagree 10 7 8 19 Very Strongly Disagree 0 11 10 2
 18-45 46+ Agree 32 56 Undecided 19 17 Disagree 28 39

### Sanity Checks

Though it’s not a formal test just doing some sanity checks and some other sensitivity analysis is a good way to check the stability of the test. If one can take a single observation and move it to a different cell and that changes the decision then one should reevaluate the criteria for making a conclusion.

Using a basic chi square test (though other statistical tests would help with this problem, including the chi square correction for continuity) gives a p-value of .0363 for the following table of some made-up data and would be considered significant at $\alpha=.05$.  However, by simply moving one observation from the Cold/Fast group to the Cold/Slow group the p-value of .1575 is no longer significant at $\alpha=.05$.  The volatility of the data is suspect and decisions should be taken with caution.

 Fast Slow Total Hot 7 2 9 Cold 1 4 5 Total 8 6 14
 Fast Slow Total Hot 7 2 9 Cold 2 3 5 Total 9 5 14

### Sparse Data Tests

There are many tests to handle many different categorical data situations. Listed here are a few of the common approaches.

Chi Square With Yates Correction
Sometimes decreasing the chi square statistics (and increasing the p-value) is sufficient for the specific case of a 2 x 2 table.  In R, for example, this is applied by default.

$\chi^2 = \sum_{i=1}^N \frac{(|O_i - E_i|-0.5)^2}{E_i}$

Fisher’s Exact Test

This is often the immediate fall back for 2 x 2 tables. This test is based on the hypergeometric distribution. However, one important rule for this test is that it is conditioned on the marginal totals. An example counter to this rule is to take a random sample of 15 people. Suppose 5 are male and 10 are female. Here the chi square starts to break down. But the other problem is that Fishers Exact Test calls for the marginals to be fixed and that is not the case. If another random sample of 15 people is selected we could get a different number of males and females.

Fisher’s Exact Test was developed in a time (1934) when a computer just wasn’t available to play around with really complex hypergeometric distributions. A 2 x 2 table was really the only feasible sized table. Consequently, Fisher’s Exact Test was designed for 2 x 2 tables but can be used on any m x n sized table.

So why not always use Fisher’s Exact Test? At some point the two begin to converge and using the exact test may just be too exact. Alan Agresti and Brent Coull write an article here (pdf) that discusses this topic in the context of interval estimation.

Barnard Test
Is similar to the Fisher Test but this test overcomes the problem of conditioning on the marginal.  Like many tests this applies only to a 2 x 2 table.

McNemar Exact Test
McNemar’s test is used when the data are correlated. For example, a matched pairs design where there is a before and after treatment or when a question is asked on a repeat survey.  Like Fishers Test this provide options for smaller sample sizes.

These are only a few of the options available and great care should be taken in any analysis but smaller sample sizes add a bit more complexity. There are many other options including logistic regression as well as other non-parametric tests that are available for small tables.

## Examples

Here is some R code that shows some of the tests I described: Chi Square, Fishers Exact Test, Barnards Test, and McNemars Exact Test.

x1 = matrix(c(7, 1, 2, 4),
nrow = 2,
dimnames =
list(c("Hot", "Cold"),
c("Fast", "Slow")))

x2 = matrix(c(7, 2, 2, 3),
nrow = 2,
dimnames =
list(c("Hot", "Cold"),
c("Fast", "Slow")))

chisq.test(x1, correct=FALSE)
chisq.test(x2, correct=FALSE)

fisher.test(x1, alternative="two.sided", conf.int=TRUE, conf.level=0.95)
fisher.test(x2, alternative="two.sided", conf.int=TRUE, conf.level=0.95)
x3 = matrix(c(50, 5, 3, 15),
nrow = 2,
dimnames =
list("Replication 1" = c("Hot", "Cold"),
"Replication 2" = c("Hot", "Cold")))

mcnemar.test(x3, correct=TRUE)
library(Barnard)
barnardw.test(x1[1,1],x1[1,2],x1[2,1],x1[2,2])
table.cont(x1, csi = 2,
col.labels = colnames(x1),
clabel.r = 1.5, clabel.c = 1.5)



# Spatial Clustering With Equal Sizes

This is a problem I have encountered many times where the goal is to take a sample of spatial locations and apply constraints to the algorithm.  In addition to providing a pre-determined number of K clusters a fixed size of $m_k$ elements needs to be held constant within each cluster. An application of this algorithm is when one needs to geographically stratify and pre-allocate the sample frame but keep the sizes the same (or constant) to facilitate operational fielding of a study.

I have done a cursory look for other approaches to this problem but have come up fairly empty. I would certainly be interested in other approaches that are used. However, in general, this is somewhat counter to the textbook teaching of k-means clustering where cluster sizes naturally form based on the specified criteria.

This is one of several approaches to determine the optimal clustering when dealing with spatial data. Other cluster assignment approaches could be used. One in particular is the CLARANS algorithm, but like other clustering approaches it does not constrain the sizes of the clusters. Ultimately the goal here is to keep the clusters the same size and to reduce the total spatial distance from the center of the cluster.

I created a random dataset with just under 10000 randomly selected geographic coordinates (removing Alaska and Hawaii) in the 48 states. Based on the latitude and longitude the locations can be clustered and the sizes constrained. In this example I use exactly equal sized clusters (except when n is not divisible by K), $m_k$. However, the option exists where one could pre-allocated the cluster sizes so they are fixed in advance but are different from cluster to cluster and then constrained to those sizes if desired.

 Centers Latitude Longitude Cluster 37.46644 -113.412 1 40.24648 -74.7457 2 31.89746 -85.5054 3 41.08111 -85.3031 4

As for the distance function I take a couple things into account. First, the earth does not have a constant radius. Because the earth is spinning around so fast it flattens a bit. So I built into the distance function those two radii. This way when traveling over a greater distance of latitude the correct distance can be calculated based on the flattening earth. Second, because the earth is mostly round the Pythagorean theorem doesn’t exactly apply and a more accurate distance around a curved earth is needed. Consequently, the flattening of the earth as well as the curvature of the earth is combined as a more complex formula that is used in the function.

I’m still working on fine tuning and making the algorithm better but my initial algorithm is as follows:

1) set equal cluster size, e.g. n/k, or assign specified sizes.
2) initialize cluster assignment. I’m still working on a better approach but for now I just randomly select, order and systematically assign it through all observations.
3) calculate the center of the clusters.
4) take the first observation and assign it to the closest cluster.
5) since one cluster now has $m_k+1$ and another has $m_k-1$ establish a trade to even out the sizes. The closest observation to the giving cluster is then traded.
6) this process continues through all locations.
7) the sum of the distance from each observation to its assigned centroid is calculated.
8) if the next iteration doesn’t decrease that distance (within the tolerance threshold) then stop.
9) continue the process with step 3 until the maximum iteration is meet.

The following code is what I used for my prototype and is not strictly optimized and may take several minutes (~15) on datasets with many thousands of observations. I’ll provide optimized R, Python, and maybe some PHP code at a later time.  I’ve included a verbose version of the R code where it will provide convergence information as well as a timer indicating how long it will take before the maximum iteration is met.


return(theta * pi / 180)
}

calc_dist = function(fr, to) {
lat1 = as_radians(fr$lat) lon1 = as_radians(fr$lon)
lat2 = as_radians(to$lat) lon2 = as_radians(to$lon)
a = 3963.191;
b = 3949.903;
numerator = ( a^2 * cos(lat2) )^2 + ( b^2 * sin(lat2) ) ^2
denominator = ( a * cos(lat2) )^2 + ( b * sin(lat2) )^2
radiusofearth = sqrt(numerator/denominator) #Accounts for the ellipticity of the earth.
d = radiusofearth * acos( sin(lat1) * sin(lat2) + cos(lat1)*cos(lat2)*cos(lon2 - lon1) )
d.return = list(distance_miles=d)
return(d.return)
}

orig.data = raw.og[,1:3]

dirichletClusters_constrained = function(orig.data, k=5, max.iter =50, tolerance = 1, plot.iter=TRUE) {
fr = to = NULL

r.k.start = sample(seq(1:k))
n = nrow( orig.data )
k.size = ceiling(n/k)
initial.clusters = rep(r.k.start, k.size)

if(n%%length(initial.clusters)!=0){
exclude.k = length(initial.clusters) - n%%length(initial.clusters)
} else {
exclude.k = 0
}
orig.data$cluster = initial.clusters[1:(length(initial.clusters)-exclude.k)] orig.data$cluster_original = orig.data$cluster ## Calc centers and merge mu = cbind( by(orig.data$Latitude, orig.data$cluster, mean), by(orig.data$Longitude, orig.data$cluster, mean), seq(1:k) ) tmp1 = matrix( match(orig.data$cluster, mu[,3]) )
orig.data.centers = cbind(as.matrix(orig.data), mu[tmp1,])[,c(1:2,4:6)]

## Calc initial distance from centers
fr$lat = orig.data.centers[,3]; fr$lon = orig.data.centers[,4]
to$lat = orig.data.centers[,1]; to$lon = orig.data.centers[,2]
orig.data$distance.from.center = calc_dist(fr, to)$distance_miles
orig.data$distance.from.center_original = orig.data$distance.from.center

## Set some initial configuration values
is.converged = FALSE
iteration = 0
error.old = Inf
error.curr = Inf

while ( !is.converged && iteration < max.iter ) { # Iterate until threshold or maximum iterations

if(plot.iter==TRUE){
plot(orig.data$Longitude, orig.data$Latitude, col=orig.data$cluster, pch=16, cex=.6, xlab="Longitude",ylab="Latitude") } iteration = iteration + 1 start.time = as.numeric(Sys.time()) cat("Iteration ", iteration,sep="") for( i in 1:n ) { # Iterate over each observation and measure the distance each observation' from its mean center # Produces an exchange. It takes the observation closest to it's mean and in return it gives the observation # closest to the giver, k, mean fr = to = distances = NULL for( j in 1:k ){ # Determine the distance from each k group fr$lat = orig.data$Latitude[i]; fr$lon = orig.data$Longitude[i] to$lat = mu[j,1]; to$lon = mu[j,2] distances[j] = as.numeric( calc_dist(fr, to) ) } # Which k cluster is the observation closest. which.min.distance = which(distances==min(distances), arr.ind=TRUE) previous.cluster = orig.data$cluster[i]
orig.data$cluster[i] = which.min.distance # Replace cluster with closest cluster # Trade an observation that is closest to the giving cluster if(previous.cluster != which.min.distance){ new.cluster.group = orig.data[orig.data$cluster==which.min.distance,]

fr$lat = mu[previous.cluster,1]; fr$lon = mu[previous.cluster,2]
to$lat = new.cluster.group$Latitude; to$lon = new.cluster.group$Longitude
new.cluster.group$tmp.dist = calc_dist(fr, to)$distance_miles

take.out.new.cluster.group = which(new.cluster.group$tmp.dist==min(new.cluster.group$tmp.dist), arr.ind=TRUE)
LocationID = new.cluster.group$LocationID[take.out.new.cluster.group] orig.data$cluster[orig.data$LocationID == LocationID] = previous.cluster } } # Calculate new cluster means mu = cbind( by(orig.data$Latitude, orig.data$cluster, mean), by(orig.data$Longitude, orig.data$cluster, mean), seq(1:k) ) tmp1 = matrix( match(orig.data$cluster, mu[,3]) )
orig.data.centers = cbind(as.matrix(orig.data), mu[tmp1,])[,c(1:2,4:6)]
mu = cbind( by(orig.data$Latitude, orig.data$cluster, mean), by(orig.data$Longitude, orig.data$cluster, mean), seq(1:k) )

## Calc initial distance from centers
fr$lat = orig.data.centers[,3]; fr$lon = orig.data.centers[,4]
to$lat = orig.data.centers[,1]; to$lon = orig.data.centers[,2]
orig.data$distance.from.center = calc_dist(fr, to)$distance_miles

# Test for convergence. Is the previous distance within the threshold of the current total distance from center
error.curr = sum(orig.data$distance.from.center) error.diff = abs( error.old - error.curr ) error.old = error.curr if( !is.nan( error.diff ) && error.diff < tolerance ) { is.converged = TRUE } # Set a time to see how long the process will take is going through all iterations stop.time = as.numeric(Sys.time()) hour.diff = (((stop.time - start.time) * (max.iter - iteration))/60)/60 cat("\n Error ",error.diff," Hours remain from iterations ",hour.diff,"\n") # Write out iterations. Can later be used as a starting point if iterations need to pause write.table(orig.data, paste("C:\\optimize_iteration_",iteration,"_instore_data.csv", sep=""), sep=",", row.names=F) } centers = data.frame(mu) ret.val = list("centers" = centers, "cluster" = factor(orig.data$cluster), "LocationID" = orig.data$LocationID, "Latitude" = orig.data$Latitude, "Longitude" = orig.data$Longitude, "k" = k, "iterations" = iteration, "error.diff" = error.diff) return(ret.val) } # Constrained clustering cl_constrain = dirichletClusters_constrained(orig.data, k=4, max.iter=5, tolerance=.0001, plot.iter=TRUE) table( cl_constrain$cluster )
plot(cl_constrain$Longitude, cl_constrain$Latitude, col=cl_constrain$cluster, pch=16, cex=.6, xlab="Longitude",ylab="Latitude") library(maps) map("state", add=T) points(cl_constrain$centers[,c(2,1)], pch=4, cex=2, col='orange', lwd=4)