This post stems from an interesting discussion in the course W203: Exploring and Analyzing Data that took place on July 16, 2015. The premise is a simple enough statistical puzzle at first glance - yet mainstream parametric hypothesis testing procedures prove inadequate.
Do Facebook users have higher median number of children than non-Facebook users?
(Data: 5000 sample-size survey)
As noted in class, the research question here asks about the population median, as opposed to the mean found in typical statistical textbooks, so the usual techniques like the T test and ANOVA are out of the window. We talked about possibly using the Wilcoxon rank sum test, Mood's median test, or bootstrapping.
In this post, we ask a slightly different (easier) version of the question: do Facebook users and non-Facebook users have the same median number of children?
And we try to evaluate different tests in answering this question through simulation.
Denote by $X$ the r.v. for the number of children of individuals who are Facebook users, and by $Y$ that of individuals who are not Facebook users. Assume we have 2500 observations of $X$ and 2500 of $Y$, i.e. 2500 observations from each group. Writing $m_X$ and $m_Y$ for the two population medians, we test:

$$H_0: m_X = m_Y \quad \text{vs.} \quad H_1: m_X \neq m_Y$$
For a real-valued random variable $X$ with CDF $F_X$, a median is a real number $m$ such that:

$$P(X \le m) \ge \frac{1}{2} \quad \text{and} \quad P(X \ge m) \ge \frac{1}{2}$$

or:

$$F_X(m^-) \le \frac{1}{2} \le F_X(m)$$

For $X$ with an absolutely continuous probability distribution, we have:

$$F_X(m) = \frac{1}{2}$$
In general, the median is not necessarily unique. For example, Bernoulli(0.5) does not have a single unique median: any $m \in [0, 1]$ satisfies the above defining properties and is therefore a valid median. Also, at least one median always exists for any random variable; the smallest median is $\inf\{x \in \mathbb{R} : F_X(x) \ge 1/2\}$ and the largest median is $\sup\{x \in \mathbb{R} : P(X \ge x) \ge 1/2\}$. See [2]. In addition, the set of medians is always a closed and bounded interval in $\mathbb{R}$ in case it is not unique.
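As a quick numeric illustration of the non-uniqueness (a sketch, not part of the analysis below), for $X \sim \text{Bernoulli}(0.5)$ both defining inequalities hold for every $m$ in $[0, 1]$:

# Illustrative check for X ~ Bernoulli(0.5): P(X <= m) and P(X >= m) are both
# at least 1/2 for every m in [0, 1], so every such m is a valid median.
m <- seq(0, 1, by = 0.25)
p_le <- ifelse(m < 1, 0.5, 1)   # P(X <= m): 0.5 for m in [0, 1), 1 at m = 1
p_ge <- ifelse(m > 0, 0.5, 1)   # P(X >= m): 0.5 for m in (0, 1], 1 at m = 0
all(p_le >= 0.5 & p_ge >= 0.5)  # TRUE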
The T test assumes a normal distribution with the same variance for both groups of observations. The normal distribution's median is the same as its mean. Therefore, if we are reasonably certain the two populations are normal, we can use the T test to test the equality of population medians.
For our specific problem outlined previously, however, we cannot assume a normal distribution - even before any normality test invalidates the assumption. The number of children for a given individual is discrete, non-negative, bounded and most likely not symmetrically distributed. Hence, we suspect the T test will perform poorly in answering the research question.
We start our exploration simple: by assuming the populations follow Binomial distributions - more to the point, a special class of Binomial distribution where the median is unique and equals $np$ (when $np$ is an integer). Modelling the number of children as a Binomial distribution is questionable; however, a Binomial r.v. does have the desired properties of discrete, non-negative values, and we can easily tweak $n$ and $p$ to get different distributions with identical medians.
Here are the assumptions for the population distributions:

$$X \sim \text{Binomial}(100, 0.15) \quad \text{and} \quad Y \sim \text{Binomial}(20, 0.75)$$

We have:

$$m_X = 100 \times 0.15 = 15 \quad \text{and} \quad m_Y = 20 \times 0.75 = 15$$

and $H_0: m_X = m_Y$ is satisfied. The two populations are purposefully made distinct enough: $X$ has a positive skewness and $Y$ a negative skewness by design, which further invalidates the normality assumption required for the T test to be effective.
Now let's prepare for the simulation. In R a test is usually a function that returns some object, typically with the attribute $p.value. We can take advantage of this in designing our simulation interface. sim.run is just a dumb function that takes a test function, does a random sampling from our $X$ and $Y$ populations (by specifying a sample function sample.fun, which returns a matrix/data.frame of dimension $n \times 2$, the first column being $X$'s and the second being $Y$'s) and returns the test's p.value.
# single run
sim.run <- function(test, sample.fun, sample_size = 2500) {
data <- sample.fun(sample_size)
x <- data[ ,1]
y <- data[ ,2]
test(x,y)$p.value
}
For our Binomial populations, here’s the implementation of sample.fun
sample.binom <- function(n) {
cbind(
rbinom(n, 100, 0.15),
rbinom(n, 20, 0.75)
)
}
The pmfs for $X$ and $Y$ are plotted below:
pmf.X <- data.frame(k = 0:100, prob = dbinom(0:100, size = 100, prob = 0.15), variable = 'X')
pmf.Y <- data.frame(k = 0:20, prob = dbinom(0:20, size = 20, prob = 15/20), variable = 'Y')
dat <- rbind(pmf.X, pmf.Y)
p <- ggplot(dat, aes(x = k, y = prob))
p + geom_segment(aes(xend = k, yend = 0), size=0.5) + ylab('p(k)') + facet_grid(. ~ variable)
For the T test and Wilcoxon rank sum test, we have readily available test functions in base R: t.test and wilcox.test. For Mood's median test and the bootstrapped T test, we need to implement them ourselves.
Let’s quickly go over what Mood’s Test is as it is lesser known (even its Wikipedia page is a stub!). This is a good intro. To recap:
Determine the overall median.
For each sample, count how many observations are greater than the overall median, and how many are equal to or less than it.
Put the counts from step 2 into a 2×k contingency table (2×2 in our two-sample case):
|  | sample 1 | sample 2 |
| --- | --- | --- |
| greater than overall median |  |  |
| less than or equal to overall median |  |  |
And here’s the code:
mood.test <- function(x, y) {
overall_mid <- median( c(x, y) )
nx <- length(x)
ny <- length(y)
nxbelow <- sum(x <= overall_mid)
nxabove <- sum(x > overall_mid)
nybelow <- sum(y <= overall_mid)
nyabove <- sum(y > overall_mid)
chisq.test( rbind( c(nxabove, nxbelow), c(nyabove, nybelow) ) )
}
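As a quick sanity check of the function (a single illustrative draw, not part of the simulation below), we can run it on one pair of samples from the two Binomial populations defined earlier:

# One illustrative draw from the two Binomial populations above; both have
# population median 15.
set.seed(1)
x <- rbinom(2500, 100, 0.15)
y <- rbinom(2500, 20, 0.75)
mood.test(x, y)$p.value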
Given a sufficiently large sample size (in our case 2500 for both $X$ and $Y$), we can generate new samples by random sampling with replacement; this is the central idea of bootstrapping. For each set of re-sampled data, we can compute a sample median $\hat{m}^*_i$. Assuming the sample median is an unbiased estimator of the population median, the mean of all the $\hat{m}^*_i$'s asymptotically follows a normal distribution centered at the population median, due to the CLT - and this is a statistic we can perform a T test on.
In addition to bootstrapping, we also consider subsampling. The key difference is that when resampling, subsampling draws a smaller sample size and samples without replacement. The advantage of subsampling is that it is valid under much weaker conditions than the bootstrap.
The code:
medians.bootstrap <- function(x, k=1e3) {
n <- length(x)
replicate(k, expr=median(sample(x, n, replace=TRUE)))
}
medians.subsample <- function(x, k=1e3) {
n <- 100
replicate(k, expr=median(sample(x, n, replace=FALSE)))
}
t.test.bootstrap <- function(x, y) {
tryCatch({
t.test(medians.bootstrap(x), medians.bootstrap(y))
}, error=function(err) {
list(
p.value=NA
)
})
}
t.test.subsample <- function(x, y) {
tryCatch({
t.test(medians.subsample(x), medians.subsample(y))
}, error=function(err) {
list(
p.value=NA
)
})
}
With all the building blocks ready, we now run all five tests (T, Wilcoxon, Mood, bootstrapped T, subsampled T) on 10000 random samples, and record the p-values. What should the distribution of simulated p-values look like? Recall the relation between the Type I error probability $\alpha$ and the p-value:
$$P(P \le \alpha \mid H_0) = \alpha, \quad \text{for } \alpha \in [0, 1]$$

where $P$ is the r.v. for the p-value in performing repeated tests on different samples from the same population. That's the distribution function of a uniform r.v. on $[0, 1]$.
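As a small side simulation (illustrative only, using a case where the T test's assumptions hold exactly), the empirical proportion of p-values below each $\alpha$ should land near $\alpha$ itself:

# Illustrative check of P(p-value <= alpha) = alpha under H0: draw two normal
# samples with identical distributions, record the T test p-value, and compare
# the empirical CDF at a few alpha levels against alpha.
set.seed(42)
pv <- replicate(2000, t.test(rnorm(50), rnorm(50))$p.value)
sapply(c(0.01, 0.05, 0.1, 0.5), function(a) mean(pv <= a))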
At this step, a problem surfaced. Since there are so few possible values the sample observations can occupy, resampling via bootstrapping essentially produces the same sample median every time, halting the subsequent T test. The error message in R:
Error in t.test.default(c(1, 1), c(1, 1)) : data are essentially constant
Therefore, we have to drop Bootstrapping in favor of Subsampling.
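To see why, here is a minimal illustration (assuming sample.binom and medians.bootstrap defined above): with 2500 observations from such a coarse discrete distribution, essentially every bootstrap resample has the same median.

# With n = 2500 draws from Binomial(100, 0.15), the resample median is almost
# always exactly 15, so the vector of bootstrap medians is (nearly) constant
# and t.test() errors out with "data are essentially constant".
set.seed(1)
x <- sample.binom(2500)[, 1]
length(unique(medians.bootstrap(x)))  # typically 1, or very close to it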
p.values <- cbind(
replicate(n=1e4, expr=sim.run(t.test, sample.binom)),
replicate(n=1e4, expr=sim.run(wilcox.test, sample.binom)),
replicate(n=1e4, expr=sim.run(mood.test, sample.binom)),
replicate(n=1e4, expr=sim.run(t.test.subsample, sample.binom))
)
p.values <- as.data.frame(p.values)
names(p.values) <- c('T', 'Wilcox', 'Mood', 'Subsample')
par(mfrow=c(2,2))
for(i in 1:4) {
p.value <- p.values[!is.na(p.values[ ,i]), i]
plot(density(p.value, from=0 ,to = 1), main=names(p.values)[i])
}
par(mfrow=c(2,2))
for(i in 1:4) {
p.value <- p.values[!is.na(p.values[ ,i]), i]
plot.ecdf(p.value, main=names(p.values)[i])
}
For our specifically chosen populations, the T test looks surprisingly good: its empirical p-value distribution is the closest to uniform among the four tests. The reason, of course, is that for these Binomial distributions the mean and median coincide at $np$, and the T test turns out to be very robust. A typical method to evaluate tests is the power function. In our context, rephrase the hypotheses as follows:
$$H_0: \theta = 0 \quad \text{vs.} \quad H_1: \theta \neq 0, \qquad \text{where } \theta = m_X - m_Y$$
The power function is defined as:

$$\beta(\theta) = P_\theta(\text{reject } H_0) = P_\theta(\text{data} \in R)$$

$\beta(\theta)$ is the Type I error probability when $H_0$ is true, i.e. $\theta = 0$;

$\beta(\theta)$ is the power when $H_1$ is true, i.e. $\theta \neq 0$;

where $R$ is the rejection region of the test.
The ideal power function should have a value of 0 when $\theta = 0$ and 1 when $\theta \neq 0$. For example, if our test is a coin flip, the corresponding power function is:

$$\beta(\theta) = 0.5 \quad \text{for all } \theta$$

which is not very good.
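For concreteness, such a coin-flip "test" could be written against the same sim.run interface (a hypothetical helper, only for illustration):

# A hypothetical coin-flip "test": it ignores the data and returns a p-value
# of 0 or 1 with equal probability, so at any alpha in (0, 1) it rejects with
# probability 0.5 -- a flat power function of 0.5 everywhere.
coinflip.test <- function(x, y) {
list(p.value = rbinom(1, 1, 0.5))
}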
The empirical power function values when $H_0$ is true ($\theta = 0$), assuming a 5% $\alpha$, are (see the sketch after this list for how they are computed):
T Test: 0.0485
Wilcox Test: 0.2614
Mood Test: 0.2128
Subsampling T Test: 0.9255
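These numbers come straight from the recorded p-values; a minimal sketch of the computation (assuming the p.values data frame from the previous chunk is still in scope):

# Empirical power at theta = 0, i.e. the Type I error rate: the fraction of
# simulated p-values falling below alpha = 0.05, ignoring NA runs.
alpha <- 0.05
sapply(p.values, function(p) mean(p[!is.na(p)] < alpha))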
Clearly, the T test is the winner here: the other tests have larger Type I error probabilities, and the subsampling T test is even worse than a coin flip.
Now we consider populations where $H_1$ is true:
sample.binom2 <- function(n, diff) {
cbind(
rbinom(n, 100, (15 + diff) / 100),
rbinom(n, 20, 0.75)
)
}
The goal of this section is to get a good picture of the shape of the power functions by controlling $\theta = m_X - m_Y$. We create a different sample function, with the additional parameter diff as the difference of population medians. We shall consider $\theta \in \{-5, -4, \ldots, -1, 1, \ldots, 4, 5\}$.
power <- function(diff, runs=1e4) {
sample.fun <- function(n) {
sample.binom2(n, diff)
}
p.values <- cbind(
replicate(n=runs, expr=sim.run(t.test, sample.fun)),
replicate(n=runs, expr=sim.run(wilcox.test, sample.fun)),
replicate(n=runs, expr=sim.run(mood.test, sample.fun)),
replicate(n=runs, expr=sim.run(t.test.subsample, sample.fun))
)
p.values <- as.data.frame(p.values)
names(p.values) <- c('T', 'Wilcox', 'Mood', 'Subsample')
summarize(p.values,
T=mean(T < 0.05),
Wilcox=mean(Wilcox < 0.05),
Mood=mean(Mood < 0.05),
Subsample=mean(Subsample[!is.na(Subsample)] < 0.05))
}
Running the simulation for all $\theta$ values:
powers <- ldply(c(-5:-1, 1:5), power)
powers <- cbind(powers, c(-5:-1, 1:5))
names(powers)[5] <- 'Median.Diff'
print(powers)
## T Wilcox Mood Subsample Median.Diff
## 1 1 1 1 1 -5
## 2 1 1 1 1 -4
## 3 1 1 1 1 -3
## 4 1 1 1 1 -2
## 5 1 1 1 1 -1
## 6 1 1 1 1 1
## 7 1 1 1 1 2
## 8 1 1 1 1 3
## 9 1 1 1 1 4
## 10 1 1 1 1 5
All tests have very high power: with 2500 observations per group, even a median difference of 1 is detected essentially every time.
There are infinitely many ways of specifying the population distributions of $X$ and $Y$. In this section, we restrict the potential values of both $X$ and $Y$ to $\{0, 1, \ldots, 10\}$. The pmf of $X$ can then be specified by the probabilities $p_0, p_1, \ldots, p_{10}$, where $\sum_{k=0}^{10} p_k = 1$ (10 free parameters). The same goes for $Y$. Note the population median is not necessarily unique.
If there exists an integer $k$ such that:

$$P(X \le k) = \sum_{i=0}^{k} p_i = 0.5$$

then the median is not unique: any number in $[k, k+1]$ is a population median. If we define the median as $k + 0.5$, the sample median is an unbiased estimator of the population median.

If no such $k$ exists, i.e. there is a $k$ with $P(X \le k - 1) < 0.5$ and $P(X \le k) > 0.5$, then the population median is $k$. In this case, the sample median is a biased estimator (but asymptotically unbiased).
Following the above definition of the median (in case it's not unique), in this section we shall generate $p_0, \ldots, p_{10}$ for $X$ from a uniform distribution, then normalize them such that $\sum_k p_k = 1$. Once the population of $X$ is chosen, we use the following procedure to choose $Y$'s probabilities such that $m_X = m_Y$: scenario 1 handles a non-integer median of $X$ (the non-unique case above), and scenario 2 handles an integer median.

Note that in this process, scenario 1 will almost never happen, since the normalized uniform draws will almost never make the CDF hit 0.5 exactly. Also, since in general the sample median is biased, the subsampling method is probably not reliable.
The following function generates the probabilities for $X$:
x.probs <- function() {
ps <- runif(11)
ps / sum(ps)
}
The following functions pop.median and pop.mean compute the population median and mean given a vector of probabilities:
# S is lower-triangular (TRUE on and below the diagonal), so S %*% probs gives the CDF
S <- lower.tri(diag(11), diag=TRUE)
pop.median <- function(probs) {
cdf <- as.vector(S %*% probs)
lower.half <- cdf[cdf < 0.5]
upper.half <- cdf[cdf > 0.5]
if(length(lower.half) + length(upper.half) < length(cdf)) {
# the CDF hits 0.5 exactly at some value k: the median is not unique, return the midpoint k + 0.5
m <- which(cdf == 0.5)
return(m - 0.5)
}
# otherwise the median is the smallest value whose CDF exceeds 0.5,
# i.e. the count of values whose CDF is below 0.5
length(lower.half)
}
pop.mean <- function(probs) {
sum( probs*seq(0, 10) )
}
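A quick sanity check on two hand-picked probability vectors (illustrative values, not drawn from x.probs):

# A CDF that hits 0.5 exactly at k = 1 gives the half-integer median 1.5;
# one that jumps over 0.5 gives the integer median 1.
pop.median(c(0.25, 0.25, 0.5, rep(0, 8)))   # 1.5
pop.median(c(0.25, 0.35, 0.4, rep(0, 8)))   # 1
pop.mean(c(0.25, 0.35, 0.4, rep(0, 8)))     # 0.25*0 + 0.35*1 + 0.4*2 = 1.15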
The following function takes care of scenario 1, generating probabilities for $Y$ given a non-integer median:
y.probs.1 <- function(m) {
m <- floor(m)
ps.lower <- runif(m)
ps.upper <- runif(11 - m)
ps.lower <- ps.lower / sum(ps.lower) / 2
ps.upper <- ps.upper / sum(ps.upper) / 2
c(ps.lower, ps.upper)
}
The following function takes care of scenario 2, generating probabilities for $Y$ given an integer median:
y.probs.2 <- function(m) {
pa <- runif(1, 0, 0.5)
pb <- runif(1, 0.5, 1)
ps.lower <- runif(m)
ps.upper <- runif(10 - m)
ps.lower <- ps.lower / sum(ps.lower) * pa
ps.upper <- ps.upper / sum(ps.upper) * (1 - pb)
c(ps.lower, pb-pa, ps.upper)
}
And together, they form the function to generate probabilities for $Y$:
y.probs <- function(m) {
if(m == floor(m)) {
return(y.probs.2(m))
} else {
return(y.probs.1(m))
}
}
Here’s how two randomly generated populations look like in their pmf:
pxs <- x.probs()
mx <- pop.median(pxs)
pys <- y.probs(mx)
pmf.X <- data.frame(k = 0:10, prob = pxs, variable = 'X')
pmf.Y <- data.frame(k = 0:10, prob = pys, variable = 'Y')
dat <- rbind(pmf.X, pmf.Y)
dat$k <- as.factor(dat$k)
p <- ggplot(dat, aes(x = k, y = prob))
p + geom_segment(aes(xend = k, yend = 0), size=4) + ylab('p(k)') + facet_grid(. ~ variable)
print(pop.median(pxs))
## [1] 5
print(pop.median(pys))
## [1] 5
And we create a new sample function, which creates a pair of randomized populations with the same median, and samples from them:
sample.random_pop <- function(n) {
px <- x.probs()
m <- pop.median(px)
py <- y.probs(m)
cbind(
sample(x = seq(0, 10), n, replace = T, prob=px),
sample(x = seq(0, 10), n, replace = T, prob=py)
)
}
p.values <- cbind(
replicate(n=1e4, expr=sim.run(t.test, sample.random_pop)),
replicate(n=1e4, expr=sim.run(wilcox.test, sample.random_pop)),
replicate(n=1e4, expr=sim.run(mood.test, sample.random_pop)),
replicate(n=1e4, expr=sim.run(t.test.subsample, sample.random_pop))
)
p.values <- as.data.frame(p.values)
names(p.values) <- c('T', 'Wilcox', 'Mood', 'Subsample')
par(mfrow=c(2,2))
for(i in 1:4) {
p.value <- p.values[!is.na(p.values[ ,i]), i]
plot(density(p.value, from=0 ,to = 1), main=names(p.values)[i])
}
par(mfrow=c(2,2))
for(i in 1:4) {
p.value <- p.values[!is.na(p.values[ ,i]), i]
plot.ecdf(p.value, main=names(p.values)[i])
}
All of these look pretty bad.
Let’s generate two random populations:
set.seed(123)
pxs <- x.probs()
m <- pop.median(pxs)
pys <- y.probs(m)
Both populations have a median of 6; the mean is 5.3046801 for $X$ and 5.4742832 for $Y$. The mean and median are not the same for either population; therefore, we expect the T test to perform poorly. To get a sense of the population distributions, we plot their pmfs:
pmf.X <- data.frame(k = 0:10, prob = pxs, variable = 'X')
pmf.Y <- data.frame(k = 0:10, prob = pys, variable = 'Y')
dat <- rbind(pmf.X, pmf.Y)
dat$k <- as.factor(dat$k)
p <- ggplot(dat, aes(x = k, y = prob))
p + geom_segment(aes(xend = k, yend = 0), size=4) + ylab('p(k)') + facet_grid(. ~ variable)
$X$ and $Y$ have very different distributions, even though they share the same median. Keeping these two populations fixed, we repeatedly sample from them using the following sample function:
sample.random_pop2 <- function(n) {
cbind(
sample(x = seq(0, 10), n, replace = T, prob=pxs),
sample(x = seq(0, 10), n, replace = T, prob=pys)
)
}
And run the simulation:
p.values <- cbind(
replicate(n=1e4, expr=sim.run(t.test, sample.random_pop2)),
replicate(n=1e4, expr=sim.run(wilcox.test, sample.random_pop2)),
replicate(n=1e4, expr=sim.run(mood.test, sample.random_pop2)),
replicate(n=1e4, expr=sim.run(t.test.subsample, sample.random_pop2))
)
p.values <- as.data.frame(p.values)
names(p.values) <- c('T', 'Wilcox', 'Mood', 'Subsample')
par(mfrow=c(2,2))
for(i in 1:4) {
p.value <- p.values[!is.na(p.values[ ,i]), i]
plot(density(p.value, from=0 ,to = 1), main=names(p.values)[i])
}
One thing we notice is that many $X$ and $Y$ observations take the value of their common population median, 6. In the case of the Mood test, under $H_0$ ($m_X = m_Y = m$), the following is assumed to be true for the Chi-squared test to work:

$$P(X \le m) = P(Y \le m) = 0.5$$

This would hold for absolutely continuous distributions, but it's easy to verify it is not the case in our randomly generated populations; instead, all we can say is:

$$P(X \le m) \ge 0.5 \quad \text{and} \quad P(Y \le m) \ge 0.5$$

The difference between $P(X \le 6)$ (0.5333528) and $P(Y \le 6)$ (0.7795565) is non-zero and quite large.
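A minimal check of the two quantities quoted above, assuming pxs, pys and m from the set.seed(123) populations are still in scope:

# P(X <= m) and P(Y <= m) for the common median m = 6: both exceed 0.5 as
# required, but they are far apart, which is what breaks the Mood test here.
sum(pxs[1:(m + 1)])  # P(X <= 6)
sum(pys[1:(m + 1)])  # P(Y <= 6)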
The first thing to notice is that we chose a subsample size of 100 throughout this exercise, and we know the sample median can be biased, which could cause problems. To illustrate the bias of the sample median, we draw samples from $X$ with sizes ranging from 50 to 100,000, and plot the average sample median:
sizes <- floor(c(seq(50, 950, by=50), 10^(seq(3,5,by=0.2))))
medians <- sapply(sizes,
function(n) mean(replicate(1e3, expr=median(sample(x = seq(0, 10), n, replace = T, prob=pxs)))))
sample.medians <- data.frame(
size=sizes,
median=medians
)
p <- ggplot(sample.medians)
p <- p + geom_point(aes(x=size, y=median)) +
geom_line(aes(x=size, y=median)) +
geom_hline(yintercept=m) +
geom_text(aes(0,m,label = 'Population Median', hjust = -2, vjust=1))
plot(p)
For $Y$, however, the bias is pretty much non-existent, due to the fact that $P(Y = 6)$ (0.6121182) is very large.
medians <- sapply(sizes,
function(n) mean(replicate(1e3, expr=median(sample(x = seq(0, 10), n, replace = T, prob=pys)))))
sample.medians <- data.frame(
size=sizes,
median=medians
)
p <- ggplot(sample.medians)
p <- p + geom_point(aes(x=size, y=median)) +
geom_line(aes(x=size, y=median)) +
geom_hline(yintercept=m) +
geom_text(aes(0,m,label = 'Population Median', hjust = -2, vjust=1))
plot(p)
Due to the bias of the sample median of $X$, the subsampling T test will reject $H_0$ for any reasonable subsample size we choose.
The failure in the previous section hints at another line of experimentation: populations whose sample median statistic is unbiased, in which case the subsampling method would be much more potent. If we assume some population $X$, with probabilities $p_k$, whose sample median is an unbiased estimator of $m_X$ for all sample sizes $n$, this must hold true when $n = 1$. The sample median when $n = 1$ is $X$ itself, which is also the sample mean:

$$E[\hat{m}] = E[X] = \mu_X$$

where $\mu_X$ is the population mean. Therefore, we must have $m_X = \mu_X$: the population median and mean are the same. It turns out the unbiased sample median assumption is a pretty strong restriction (under which the T test would also have performed better).
It’s easy to prove if the distribution of is symmetrical, the above property is satisfied, some asymmetrical distributions also satisfies the same property, for example, Binomial distribution where is an integer.
To satisfy the above restriction, one quick example we can think of: $X$ follows a discrete uniform distribution on $\{0, 1, \ldots, 10\}$, and $Y$ follows Binomial(10, 0.5). Both have mean and median equal to 5, yet their distributions are very different.
pxs <- rep(1/11, 11)
pys <- dbinom(c(0:10), 10, 0.5)
Their pmfs are plotted below:
pmf.X <- data.frame(k = 0:10, prob = pxs, variable = 'X')
pmf.Y <- data.frame(k = 0:10, prob = pys, variable = 'Y')
dat <- rbind(pmf.X, pmf.Y)
dat$k <- as.factor(dat$k)
p <- ggplot(dat, aes(x = k, y = prob))
p + geom_segment(aes(xend = k, yend = 0), size=4) + ylab('p(k)') + facet_grid(. ~ variable)
sample.random_pop3 <- sample.random_pop2 # global variable pxs and pys have changed, so we can reuse this function
p.values <- cbind(
replicate(n=1e4, expr=sim.run(t.test, sample.random_pop3)),
replicate(n=1e4, expr=sim.run(wilcox.test, sample.random_pop3)),
replicate(n=1e4, expr=sim.run(mood.test, sample.random_pop3)),
replicate(n=1e4, expr=sim.run(t.test.subsample, sample.random_pop3))
)
p.values <- as.data.frame(p.values)
names(p.values) <- c('T', 'Wilcox', 'Mood', 'Subsample')
par(mfrow=c(2,2))
for(i in 1:4) {
p.value <- p.values[!is.na(p.values[ ,i]), i]
plot(density(p.value, from=0 ,to = 1), main=names(p.values)[i])
}
par(mfrow=c(2,2))
for(i in 1:4) {
p.value <- p.values[!is.na(p.values[ ,i]), i]
plot.ecdf(p.value, main=names(p.values)[i])
}
The empirical power function values when $H_0$ is true ($m_X = m_Y = 5$), assuming a 5% $\alpha$, are:
T Test: 0.049
Wilcox Test: 0.0703
Mood Test: 0.9999
Subsampling T Test: 0.7335
The T test and Wilcoxon rank sum test both look good. For the same reason we identified before, the Mood test proves to be inadequate. Surprisingly, the subsampling method does not do well at all; the reasons are yet to be analyzed.
Throughout this analysis, we have tried various setups of discrete populations, and run simulations on data randomly sampled from them. Here are a few takeaways:
[1] MIDS program: http://datascience.berkeley.edu/
[2] http://math.stackexchange.com/questions/85696/does-a-median-always-exist-for-a-random-variable