Significant P-Values and Overlapping Confidence Intervals

There are all sorts of problems with p-values and confidence intervals and I have no intention (or the time) to cover all those problems right now.  However, a big problem is that most people have no idea what p-values really mean. Here is one example of a common problem with p-values and how it relates to confidence intervals.  The problem arises when there are two  random variables (from independent populations), each with a mean and variance.  A confidence interval can be constructed around each sample mean.  Using these confidence intervals might be a tempting way to explain whether two values are statistically different.  The issue is that a person may see that the confidence intervals overlap and therefore declare that there is no difference.  Simply put this is not one of those iff (if and only if) situations.  If the confidence intervals do not overlap then one can conclude that there is a statistical difference between means.  However, the opposite can not be concluded.  When the confidence intervals do overlap then there still may be a difference.

Here are simulated data from two independent normally distributed populations testing the confidence intervals and the p-values.  It can easily be seen that this is a fairly frequent event. So don’t make the mistake and make conclusions solely on confidence intervals.


###
#Set some constants
###
alpha = .05
m = 15
n = 15
nsim = 20000

###
#Function to calculate the t statistic. Same as t.test(x,y, var.eqal=T)
# spooled can easiliy be modified.
###
tstatfunc=function(x,y){
m = length(x)
n = length(y)
spooled = sqrt(((m-1)*sd(x)^2+(n-1)*sd(y)^2)/(m+n-2))
tstat = (mean(x)-mean(y))/(spooled*sqrt(1/m+1/n))

return(tstat)
}

calcInterval = function(){
x = rnorm(m,0,1)
y = rnorm(n,1,.8)
se.x = sd(x)/sqrt(m)
se.y = sd(y)/sqrt(n)

t.stat = tstatfunc(x,y)

p.val = (1-pt(abs(t.stat),m+n-2))*2 #Pooled Variance, two sided hypothesis
ci.x.ll = mean(x)-abs(qt(alpha/2,m-1))*se.x
ci.x.ul = mean(x)+abs(qt(alpha/2,m-1))*se.x
ci.y.ll = mean(y)-abs(qt(alpha/2,n-1))*se.y
ci.y.ul = mean(y)+abs(qt(alpha/2,n-1))*se.y
TTest = t.test(x,y, var.equal=TRUE) #Run the t.test() function for comparison
ret.val = c(p=p.val, t.p=TTest$p.value, ci.x.ll=ci.x.ll, ci.x.ul=ci.x.ul, ci.y.ll=ci.y.ll, ci.y.ul=ci.y.ul) return(ret.val) } ### #Replicate a few times ### my.sims = as.data.frame( t(replicate(nsim,calcInterval())) ) ### #Do the intervals overlap and it's significant? ### ci.vals = cbind(my.sims$ci.x.ll - my.sims$ci.y.ul, my.sims$ci.x.ul-my.sims$ci.y.ul, my.sims$ci.x.ul-my.sims$ci.y.ll) overlapTest = (ci.vals[,1] > 0 & ci.vals[,2] > 0 & ci.vals[,3] > 0) | (ci.vals[,1] < 0 & ci.vals&#91;,2&#93; < 0 & ci.vals&#91;,3&#93; < 0) my.sims = cbind(my.sims,ci.vals,NotOverlap=overlapTest) sum(my.sims$CI.p)/nsim

hist(
rbeta(100000,sum(my.sims$CI.p)+1, nsim-sum(my.sims$CI.p)+1) ##CI.p is a binomial distribution.
, nclass=100, xlab="Proportion", freq=F, main=expression("Histogram of Overlapping Confidence Intervals and p<"*alpha*" from Beta Distribution"))
###
#What percent have overlapping CI and p < alpha? -- "Significant" but yet CI indicate otherwise.
#Multiple comparisons
###
my.sims$CI.p = as.numeric( !my.sims$NotOverlap & my.sims$p < alpha ) my.sims$CI.p2 = as.numeric( my.sims$NotOverlap & my.sims$p > alpha )
my.sims$my.diff = my.sims$p-my.sims$t.p # Check calculations for consistency sum(my.sims$CI.p)/length(my.sims$CI.p) ### #How many have confidence intervals that do not overlap but yet are still "Significant"? ### my.sims$CI.p2 = as.numeric( my.sims$NotOverlap & my.sims$p > alpha )
sum(my.sims$CI.p2) ### #Histograms of the intervals ### hist(my.sims$t.p, nclass=100)
subset(my.sims,my.sims$p > .5) ### #Histograms of the intervals ### hist(my.sims$ci.y.ul, nclass=100, xlim=c(-2,3), ylim=c(0,2), col=4, freq=F,
main="Histogram of Confidence Intervals", xlab="Value")
hist(my.sims$ci.y.ll, nclass=100, add=T, col=4, freq=F) hist(my.sims$ci.x.ul, nclass=100, add=T, col=2, freq=F)
hist(my.sims\$ci.x.ll, nclass=100, add=T, col=2, freq=F)



Simulating Random Multivariate Correlated Data (Categorical Variables)

This is a repost of the second part of an example that I posted last year but at the time I only had the PDF document (written in $\LaTeXe$).

This is the second example to generate multivariate random associated data. This example shows how to generate ordinal, categorical, data. It is a little more complex than generating continuous data in that the correlation matrix and the marginal distribution is required.  This uses the R library GenOrd.

The graph above plots out the randomly generated data with the given correlation matrix and groups it  by the second variable.  Though there are many other approaches on graphing categorical data available.  One source is available here.

This example creates a 2-variable dataset. However, this can easily be extended to many more variables. The correlation matrix R for this 2-dimensional example.

$R = \left( \begin{smallmatrix} 1&-0.6\\ -0.6&1 \end{smallmatrix} \right)$

The R code below will generate an ordinal dataset with a correlation matrix of:

$R = \left( \begin{smallmatrix} 1&-0.5469243\\ -0.5469243&1 \end{smallmatrix} \right)$

Increasing the sample size will let the correlation coefficients converge on the target correlations.

library(GenOrd)
set.seed(1)
# Sets the marginals.
# The values are cumulative so for the first variable the first marginal will be .1, the second is .2, the third is .3, and the fourth is .4
marginal < - list(c(0.1,0.3,0.6),c(0.4,0.7,0.9)) # Checks the lower and upper bounds of the correlation coefficients. corrcheck(marginal) # Sets the correlation coefficients R <- matrix(c(1,-0.6,-0.6,1),2,2) # Correlation matrix n <- 100 ##Selects and ordinal sample with given correlation R and given marginals. m <- ordsample(n, marginal, R) ##compare it with the pre-defined R cor(m) table(m[,1],m[,2]) chisq.test(m) gbar < - tapply(m[,1], list(m[,1], m[,2]), length) par(mfrow=c(1,1)) barplot(gbar, beside=T, col=cm.colors(4), main="Example Bar Chart of Counts by Group",xlab="Group",ylab="Frequency") [/sourcecode]

Simulating Random Multivariate Correlated Data (Continuous Variables)

This is a repost of an example that I posted last year but at the time I only had the PDF document (written in $\LaTeXe$).  I’m reposting it directly into WordPress and I’m including the graphs.

From time-to-time a researcher needs to develop a script or an application to collect and analyze data. They may also need to test their application under a variety of scenarios prior to data collection. However, because the data has not been collected yet it is necessary to create test data. Creating continuous data is relatively simple and is fairly straight forward using the Cholesky (pronounced kol-eh-ski) decomposition. This approach takes an original X variable (or matrix) and uses the Cholesky transformation to create a new, correlated, Y variable. To make things simple and straight forward this example will generate data from the a random normal distribution N(0,1).

The reason this approach is so useful is that that correlation structure can be specifically defined. The scripts can be used to create many different variables with different correlation structures. The method to transform the data into correlated variables is seen below using the correlation matrix R.

$R = \left( \begin{smallmatrix} 1&0.8&0.2\\ 0.8&1&0.7\\0.2&0.7&1 \end{smallmatrix} \right)$$R = \left( \begin{smallmatrix} 1&0.8&0.2\\ 0.8&1&0.7\\0.2&0.7&1 \end{smallmatrix} \right)$

Once the correlation matrix is set the researcher takes the Cholesky decomposition of the correlation matrix. Multiplying the Cholesky decomposition of the correlation matrix by the data matrix the resulting matrix is a transformed dataset with the specified correlation.

$W = \left[Cholesky (R)\right]\left[X\right]$
The R code from below will generate a correlation matrix of:

$R = \left( \begin{smallmatrix} 1&0.7997999&0.1998661\\ 0.7997999&1&0.7000217\\0.1998661&0.7000217&1 \end{smallmatrix} \right)$$R = \left( \begin{smallmatrix} 1&0.7997999&0.1998661\\ 0.7997999&1&0.7000217\\0.1998661&0.7000217&1 \end{smallmatrix} \right)$


R = matrix(cbind(1,.80,.2,  .80,1,.7,  .2,.7,1),nrow=3)
U = t(chol(R))
nvars = dim(U)[1]
numobs = 100000
set.seed(1)
random.normal = matrix(rnorm(nvars*numobs,0,1), nrow=nvars, ncol=numobs);
X = U %*% random.normal
newX = t(X)
raw = as.data.frame(newX)
orig.raw = as.data.frame(t(random.normal))
names(raw) = c("response","predictor1","predictor2")
cor(raw)
plot(head(raw, 100))
plot(head(orig.raw,100))




Distribution of T-Scores

Like most of my post these code snippets derive from various other projects.  In this example it shows a simulation of how one can determine if a set of t statistics are distributed properly.  This can be useful when sampling known populations (e.g. U.S. census or hospital populations) or populations that will soon be known (e.g. pre-election, exit polling).  This is a simple example but the concept can be expanded upon to include varying sample sizes and varying known mean values.  When collecting data in real life the nsim value will likely be only a handful of random samples rather than a million.  In this example a fixed constant sample size of 50 is used.

If you’re collecting data and you begin to see that your distribution of t scores begins to deviate from the known distribution then it might be time to tweak some of the algorithms.

set.seed(1234)
nsims <- 1000000 n <- 50 x <- replicate(nsims, rexp(n, 5)) x.sd <- apply(x, 2, sd) x.mean <- apply(x, 2, mean) x.t <- (x.mean - 0)/(x.sd/sqrt(nrow(x))) qqnorm(x.t) # follows a normal distribution (x.grand.mean <- mean(x.t)) # ~0 median(x.t) # ~0 var(x.t) # v/(v-2) skewness(x.t) # ~0 library(e1071) kurtosis(x.t, type=1) theta <- seq(-4,4, by=.01) p <- dt(theta, n) p <- p/max(p) d <- density(x.t) plot(d) plot(theta, p, type = "l", ylab = "Density", lty = 2, lwd = 3) abline(v=x.grand.mean, col="red") [/sourcecode]