riccardo
Testing and polling a sampled population
Updated: Feb 16, 2021
#poll #confidenceInterval #statistics

Say we have a population of N individuals and we know that among that pool there is an unknown number of cases with a specific characteristic or preference (e.g. 1%, 10%, etc.); how indicative of the entire population is the number of positive cases obtained by testing only a sampled group? Going through this topic, alongside some intuitive conclusions, there are a couple of considerations that we usually get wrong and that can invalidate our conclusions. This argument often applies to our daily professional lives and, even if it is related to the unfortunate situation we are experiencing and the daily data we are exposed to, it is not meant to be directly linked to that - I think I should not publicly talk about things I am not directly involved with; those cases might also require conditional probability, which will not be part of this.
To understand real applications, where we cannot test, nor ask, every single individual of a given population, we first need to understand ideal ones where we have access to complete information. Say we have a country with a population of 30 million people and we need a measure of how many of them eat gluten-free, assuming that the true value is 1% (i.e. the trivial case). We can poll all 30 million people, but for some reason we can do it only in 3,000 groups of 10k people each (it could be for geographical reasons or other practical and statistical ones). We can execute a simulation in Python (or Excel): build an array of 30M elements with 300k positives (1%), randomly pick 10k samples at a time, register their average percentage of positive cases [%], and exclude those samples from the total 30M population before conducting the subsequent 10k draws. Finally, plot the distribution of the 3,000 group averages obtained. The result would be something like the picture below.
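A minimal sketch of that simulation in Python might look like the following (assuming numpy and matplotlib are available; shuffling the whole population once and partitioning it is equivalent to the draw-and-exclude loop described above):

```python
import numpy as np
import matplotlib.pyplot as plt

N = 30_000_000              # total population
true_p = 0.01               # true gluten-free rate (1%)
group_size = 10_000         # people per group
n_groups = N // group_size  # 3,000 non-overlapping groups

rng = np.random.default_rng(0)

# Population: 1 = gluten-free, 0 = otherwise.
population = np.zeros(N, dtype=np.int8)
population[: int(N * true_p)] = 1
rng.shuffle(population)

# Shuffling once and reshaping gives 3,000 disjoint groups of 10k people,
# the same as drawing groups without replacement one after the other.
group_means = population.reshape(n_groups, group_size).mean(axis=1)

print(f"overall mean:          {group_means.mean():.4%}")  # exactly 1%
print(f"SD of group averages:  {group_means.std():.4%}")   # ~0.10%
print(f"sqrt(p*(1-p)/n):       {np.sqrt(true_p * (1 - true_p) / group_size):.4%}")

plt.hist(group_means * 100, bins=40)
plt.axvline(true_p * 100, color="red")
plt.xlabel("group average of positive cases [%]")
plt.ylabel("number of groups")
plt.show()
```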

We would obviously get the exact 1% average, since we are asking every single individual of the entire 30M population their preference, and we are doing it only once (non-overlapping groups). That result would not depend on the specific grouping, meaning that even if we polled the 30M population by dividing them into 100 separate groups of 300k people each (instead of 3,000 groups of 10k), we would get the same 1%. However, the size of the groups would still affect the result: it would affect the deviation of every single group's result from the overall true 1% average (red lines in the picture above). In particular, the deviation - how far each sub-group's average ends up being from the overall 1% true average - shrinks as the size n of each group grows (inversely proportional to the square root of n, as we will see below). Let us also recall that the Standard Deviation is the number that tells us how far from the mean we should go to find about 68% of the possible single outcomes (1 SD), 95% (2 SD), and 99.7% (3 SD):

Brief mathematical parenthesis on the group's size (skip to "***" if you wish)
The independence of the average from the size of the groups, and the dependence of the Variance (or its square root, the Standard Deviation) on it, are confirmed by the theory - though some might prefer to frame the following in terms of unbiased (Mean) and biased (Variance/StDev) estimators. Let us consider every single individual of the entire population N a random variable X valued 1 if the individual is positive and 0 otherwise (a Bernoulli random variable). Let us also take as an estimate of the true rate (1%) the parameter p = Sum(X)/n, that is the average of the n sampled cases: we could show that the Expectation [i.e. the average] of this estimate is exactly equal to the true rate p, or:
E[Sum(X)/n] = (1/n) * Sum(E[X]) = (1/n) * (n * p) = p
This is nothing more than computing the expectation of the underlying Bernoulli random variable. The result is independent of n (the size of the group) and, in our case, exactly equal to the true p_true = 1% because we are considering the entire population, i.e. n = N. Intuitively, if we polled a sampled population of y individuals, their overall average would not depend on the number of (equal-size) groups we divide y into; the final average would always stay the same because it is ultimately computed over all the y individuals, whether we first compute sub-averages from sub-groups and then average them, or we compute the overall average from all the y elements directly.
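As a quick sanity check, here is a tiny Python sketch of that grouping invariance (the 30k-person sample and the Bernoulli draw are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.01, size=30_000)  # 30k Bernoulli(1%) individuals

for group_size in (100, 1_000, 10_000):
    # Average of the group averages vs. overall average:
    # they coincide (up to floating-point rounding) for equal-size groups.
    group_means = y.reshape(-1, group_size).mean(axis=1)
    print(group_size, group_means.mean(), y.mean())
```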
On the contrary, the deviation [from p] of the single group averages would be inversely proportional to the square root of the size of those groups. In short:
StdDev -> SquareRoot { [p * (1-p)] / n } (the "->" is because this holds when the population N is large compared to n)
Therefore, the Standard Deviation is inversely proportional to the square root of the size n of the groups. Applying the equation above to the example we just examined, we would get:
SD = Sqrt (1% * (1 - 1%) / 10k) ≈ 0.10% (which is also what the simulation in Python returned and what is shown in the first picture above)
***
How can we leverage this in real applications?
In practice, we usually cannot poll the entire population and we are forced to use sampled groups. Wait a second, is it really “groups” [plural] or “group” [singular]? Here is the crucial point:
Say we want to know the percentage of people eating gluten-free among a total population of 30 million individuals (same as the example above), and say we can poll only 10k of them (only one single group of the previous example). Even if we decided to poll those 10k people in separate smaller sub-groups for practical or statistical reasons (e.g. geographical coverage), we should still think about all those 10k individuals as one unique big group (regardless of how they are sub-grouped).
Thinking about one unique group of 10k people is like saying that we are polling the entire 30M population divided into groups of 10k people each (so far exactly the same situation as the initial ideal example) with visibility of the outcome limited to only one of those groups. We are basically looking at only one thin line of the entire distribution in the initial picture above. In doing that, we can now reuse the considerations made before, when we had complete knowledge. Whatever the gluten-free percentage within our group of 10k elements, we could say that our result is "probably" within 1x StdDev of the truth with about 68% confidence - or within 2x SD with 95% confidence, or within 3x SD with about 99.7% confidence. We would basically be leveraging the Central Limit Theorem, which tells us that, if we could repeat our poll on different 10k-person groups, our estimated mean would dance around the true mean according to a normal distribution, with a spread inversely proportional to the square root of the groups' size (10k in our case). So, for example, say we want 95% confidence on our estimate from the 10k samples, which we calculated to be 1.15% (just a made-up number): we can say that the gluten-free percentage of the entire 30M population is 1.15% +- 2x SD with 95% confidence (as said in the mathematical parenthesis above, the SD of this example is about 0.10%, equal to what we can calculate through the equation and what we obtained through the simulation). As we will see immediately below, real applications require bigger margins because of incomplete information.
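In code, the intervals for that made-up 1.15% result, using the SD known from the ideal case above (a sketch only; in real applications the SD itself has to be estimated, as discussed next):

```python
# Confidence intervals for the (hypothetical) 1.15% result on a 10k-person group.
p_hat = 0.0115
sd = 0.000995  # sqrt(0.01 * 0.99 / 10_000), known only in the ideal case

for z, conf in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    lo, hi = p_hat - z * sd, p_hat + z * sd
    print(f"{conf} confidence: {lo:.3%} .. {hi:.3%}")
```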
Even though we are now considering only one sampled group rather than the entire population, we are still using parameters that in practice are not known a priori (i.e. the true underlying p, and therefore also the true underlying Standard Deviation). Above, we knew from the beginning the exact percentage p = 1% of people eating gluten-free, and we used that number either to compute the SD through the equation (shown above in the mathematical parenthesis) or to obtain it through the complete simulation. In practice, we do not know p because it is exactly what we want to figure out. In short, if it were a real application where we did not know the exact p = 1% (say a political election poll), we could use the result obtained from the poll, assumed above to be p = 1.15%, and calculate SD = Sqrt { 1.15% * (1 - 1.15%) / 10k } ≈ 0.11% > 0.10%. However, in practice, in order to overcome the approximation of using the observed p = 1.15% as the true p, a conservative value is often used for p, and that is 50%, which would yield SD = 0.5% in our case. This is because the SD is a concave function of p, maximized at p = 0.5, as shown by the picture below (calculated over n = 10k as per our case); a quick numerical comparison follows.
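A small sketch comparing the three choices of p (the 1.15% is the made-up poll result from above; the helper function sd is just for illustration):

```python
import math

n = 10_000
p_hat = 0.0115  # observed rate (made-up number from the text)

def sd(p, n):
    # Standard deviation of the sample proportion: sqrt(p * (1 - p) / n).
    return math.sqrt(p * (1 - p) / n)

print(f"SD with true p = 1%:          {sd(0.01, n):.3%}")   # ~0.10%
print(f"SD with observed p = 1.15%:   {sd(p_hat, n):.3%}")  # ~0.11%
print(f"SD with conservative p = 50%: {sd(0.5, n):.3%}")    # 0.50% (worst case)
```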

When we hear poll results being announced with a "2% [or 3%] margin of error", something similar is being done: after calculating the average result from the polled sample (as we did above), pollsters make assumptions on p to figure out the Standard Deviation. They probably assume a conservative value for p as well, and they compute the equation previously seen, StdDev = Sqrt { [p * (1-p)] / n }. They then multiply that value by 2x or 3x depending on their target confidence level, 95% or 99.7% (usually omitted). Often communications report a margin of error of +- 3%, implying 95% confidence, or 2x SD; this could be the result of assuming something close to p = 0.5 and polling 1k sampled people, which would give 2x SD = 2x Sqrt(0.5 * 0.5 / 1000) = 3.2%. Please note, sometimes those confidence levels are communicated in terms of derived "scores" and can differ a bit; however, the main concept underlying the calculation stays the same - there are dedicated resources on how to interpret communications about poll results, one is linked below.
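The same back-of-the-envelope calculation in Python (assuming, as above, a 1,000-person sample and the conservative p = 0.5):

```python
import math

# Margin of error at ~95% confidence (2 SD) for a 1,000-person poll,
# using the conservative p = 0.5.
n, p, z = 1_000, 0.5, 2
margin = z * math.sqrt(p * (1 - p) / n)
print(f"margin of error: +/- {margin:.1%}")  # ~3.2%
```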
Everything above leads us to a final thought. Understanding the theory of probability & statistics - which can effectively be done through simple simulations in Excel or Python - is crucial in order to then apply those arguments to real-world scenarios. That is because real situations usually involve missing information and require assumptions. In general, once we understand the underlying math and concepts, we should test them on ideal simulations, and then add logical considerations to structure practical and effective analyses.
Here is a brief reference summary on the confidence interval: