Assuming our initial assumptions are wrong is usually correct
Updated: Feb 16, 2021
Part 1 of the series Make Better Decisions Through Probability & Statistics.
As per the intro post of the series, this post will focus on: "Bayesian probability made me realize that in any scenario we are more likely to be wrong in our initial hypotheses than right. We can make better decisions only by trying to disprove ourselves rather than validate our initial thinking."
The most common approach to probability and statistics is arguably the frequentist one, which uses historical data to compute the parameters of distributions, like the average and the standard deviation. While we usually think this approach gives us objective analyses, we may still be introducing important subjective assumptions. Frequentist approaches do not always make that subjective component explicit, and that may be one of their main limitations.
Opposed to the frequentist method is Bayesian probability & statistics (from now on also BP&S), where data are used not to obtain parameters directly, but to test prior assumptions made about them. We'll clarify with a simple example.
BP&S has many current applications, like testing the results of marketing campaigns or the effectiveness of new drugs. BP&S is based on conditional probability; one simple example I found online is the following:
Consider a university campus made up of only two groups of students: math grads (estimated 20% of the population) and business grads (estimated 80% of the population). What is the probability that a student we encounter in the campus cafeteria is a math grad, given that we notice the student is shy? We estimate that among the math grads 50% are on average shy people, while among the business grads only 10% are shy. The probability that a student is a math grad given that the student is shy is given by the Bayesian relation:
Probability that a student is a math grad given that s/he is shy = P(M|S) =
= P(S|M) x P(M) / P(S) =
= P(S|M) x P(M) / [P(S|M) x P(M) + P(S|B) x P(B)] =
= 50% x 20% / (50% x 20% + 10% x 80%) =
= 10% / 18% ≈ 56%
If you have never seen a Bayesian expression like the one above and did not follow what we just wrote, think about it intuitively: it is basically a weighted probability. The probability that a shy student is a math grad is the percentage of all students who are both math grads and shy, divided by the total percentage of shy students (math & shy + business & shy).
That might seem an almost trivial example, since it looks very similar to the common frequentist approach; the difference is that the percentages we used in the calculation were our own estimates. In essence, BP&S lets us start from the evidence, that is the data (the student is shy), and use it to test our assumed parameters (the prior expectation). In that sense BP&S could be considered the opposite of the frequentist approach. To see this at work on a less trivial case we need a simulation, and we'll run it in Python.
The following example was inspired by a similar one found online; however, as you'll see, my analysis drifts away from it until it reaches the turning point and the core idea of this post.
I often feel clumsy while trying to pair socks back up after a laundry; usually, the higher the number of socks, the higher the probability of picking many unpaired ones (i.e. singles) at the start. A good example of the applicability of BP&S is predicting how many pairs of socks we have in the laundry given that the first, let's say, 17 socks we collected were all singles.
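The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function and parameter names are hypothetical, and the default values are simply the estimates used in the example.

```python
def posterior_math_given_shy(p_math=0.20,
                             p_shy_given_math=0.50,
                             p_shy_given_business=0.10):
    """Bayes' rule for P(math grad | shy) with two student groups."""
    p_business = 1.0 - p_math
    # Total share of shy students: math & shy + business & shy
    p_shy = p_shy_given_math * p_math + p_shy_given_business * p_business
    return p_shy_given_math * p_math / p_shy


print(f"P(M|S) = {posterior_math_given_shy():.3f}")  # P(M|S) = 0.556
```

Changing any of the three estimates and rerunning shows how strongly the answer depends on them, which is the point of calling them priors.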
The simplified process is the following: we assume a distribution for the total number of socks. Let's say we have on average about 30 socks in a single laundry, with some variation between laundries described by the beta distribution at the bottom of figure_1 (the "prior" distribution). Let's also assume the socks are all paired, so on average we have 15 pairs. We then run a kind of Monte Carlo simulation in Python which repeatedly extracts a total number of socks from that prior distribution centered around 30, then draws 17 socks, and finally records the extracted total only if the 17 socks drawn were all singles. All the recorded totals then build the new "posterior" distribution of the possible total number of socks, which is a corrected version of our initial distribution centered around 30. The result is shown at the top of figure_1, and it is now centered around 60. The Bayesian analysis is basically telling us the following: "Look, while you assumed the total to be about 30 socks, given that your first 17 are all singles, it is more probable that you have about twice that number in your laundry, that is 60."
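The procedure just described can be sketched as a simple rejection simulation. This is not the original code behind figure_1: the prior below (a Beta(2, 6) stretched over 0 to 60 pairs, so that the mean is about 15 pairs or 30 socks) and all names are assumptions chosen only to mimic the setup.

```python
import random
from statistics import mean


def simulate_posterior(n_sims=20000, n_drawn=17, max_pairs=60,
                       alpha=2.0, beta=6.0, seed=0):
    """Keep a simulated sock total only if the first n_drawn socks are all singles."""
    rng = random.Random(seed)
    prior, posterior = [], []
    for _ in range(n_sims):
        # Assumed prior: scaled beta on the number of pairs (mean about 15 pairs)
        n_pairs = round(max_pairs * rng.betavariate(alpha, beta))
        total = 2 * n_pairs
        prior.append(total)
        if n_pairs < n_drawn:
            continue  # fewer pairs than socks drawn: a match is unavoidable
        # Each sock carries the label of its pair; every label appears twice
        socks = [pair for pair in range(n_pairs) for _ in range(2)]
        drawn = rng.sample(socks, n_drawn)
        if len(set(drawn)) == n_drawn:  # all labels distinct: all singles
            posterior.append(total)
    return prior, posterior


prior, posterior = simulate_posterior()
print(f"prior mean:     {mean(prior):.1f} socks")
print(f"posterior mean: {mean(posterior):.1f} socks ({len(posterior)} runs kept)")
```

Only a small fraction of the runs survives the "17 singles" filter, so the number of simulations has to be large enough for the posterior histogram to be meaningful.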
That may seem a remarkable result; however, we must be careful. When I went through a similar calculation and got a similar first result, I tried to stress the process in order to understand its limits, and I found some interesting ones. The problem, or the solution, should be thought of in a different way: while the simulation is indeed giving us an important and correct indication, warning us that 30 seems pretty low given that our first 17 socks right out of the laundry are singles, it is giving us a sort of minimum rather than the exact solution. Any total above 60 could give us, with good probability, 17 singles as the first socks; the higher the total, the more capable of that it would be. Say we have a laundry with 1000 socks (500 pairs): it should be even easier for the first 17 to be singles.
There is more. The dynamics of the experiment are affected by our starting point being below the probable limit (average of the prior 30 vs. average of the posterior 60). Below, in figure_2, I try to show in Excel what is happening by computing the probability of extracting 1, 2, 3, … single socks for different numbers of total socks. Some readers may notice this is similar to a Binomial random variable reducing to p^i, even though here the draws are dependent. If we wanted 17 single socks, the total number of socks should be at least about 60 for the 17th sock to still have a good chance of being a single (about 64%, given that the first 16 were singles). If we reduced our request to 12 singles, the total should be at least 40. If we just wanted the first 6 to be singles, the total number of socks should be at least 20.
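Since figure_2 is a spreadsheet, here is a hedged Python sketch of the same kind of calculation, assuming all socks are paired and drawn without replacement. The function names are hypothetical. It separates two quantities that are easy to confuse: the chance that the next sock is a single given that all previous ones were, and the chance that the whole run of k socks is made of singles. For 60 socks, the first one is 28/44, about 64% at the 17th draw, while the second is only about 4%.

```python
def p_next_single(total_socks, already_single):
    """P(next sock is a single | the first `already_single` draws were all singles)."""
    remaining = total_socks - already_single
    if remaining <= 0:
        return 0.0
    # Socks left whose partner has not been drawn yet
    unmatched_left = max(total_socks - 2 * already_single, 0)
    return unmatched_left / remaining


def p_all_single(total_socks, k):
    """P(the first k socks drawn are all singles): product of the dependent draws."""
    p = 1.0
    for i in range(k):
        p *= p_next_single(total_socks, i)
    return p


for total, k in [(20, 6), (40, 12), (60, 17), (1000, 17)]:
    print(f"total={total:4d} k={k:2d} "
          f"P(k-th is single | previous singles)={p_next_single(total, k - 1):.3f} "
          f"P(all k singles)={p_all_single(total, k):.3f}")
```

Both quantities grow with the total number of socks, which is why a very large laundry makes 17 initial singles ever easier rather than pointing to one specific total.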
The critical point to notice is that, if our prior started from a number higher than the minimum of 60, BP&S would in some way fool us by telling us that we were just about right. I show this result in figure_3. Starting from a prior assumption of about 100 total socks (mean = 96), the algorithm tells us that we are probably very close, and it updates our belief just a bit by centering the posterior distribution on 109. Prior assumptions above 100 would yield similar results and, while not wrong, they would basically just confirm that all those assumed priors are highly capable of producing 17 singles as the first socks. It is important to stress that they would not be telling us the probable total number of socks.
We must always approach problems from different directions; only by triangulating results can we be more confident about them. Our initial assumptions are likely to be wrong, and even when we can spot the error, it may not tell the entire story. Spotting an error often tells us something only about what we considered initially; whatever our initial assumptions left out may still be playing an important role and still affecting the result negatively.
We can make better decisions only by trying to disprove ourselves rather than validate our initial thinking. In some ways, Bayesian approaches legitimize investigative processes.
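The dependence on the starting point can be sketched without simulation by multiplying a prior over the number of pairs by the exact probability of 17 initial singles. The two priors below are assumptions, not the ones behind figure_3: a Beta(2, 6) shape stretched over 60 pairs (mean about 30 socks) and a Beta(5, 5) shape stretched over 100 pairs (mean about 100 socks). All names are hypothetical.

```python
def likelihood_all_single(n_pairs, k=17):
    """P(first k socks are all singles | n_pairs pairs, all socks paired)."""
    if n_pairs < k:
        return 0.0  # not enough distinct pairs for k singles
    total = 2 * n_pairs
    p = 1.0
    for i in range(k):
        p *= (total - 2 * i) / (total - i)
    return p


def prior_and_posterior_mean(max_pairs, alpha, beta, k=17):
    """Mean total socks before and after seeing k initial singles (grid approximation)."""
    grid = range(1, max_pairs)  # interior points of the scaled beta shape
    prior_w, post_w = [], []
    for n in grid:
        x = n / max_pairs
        w = x ** (alpha - 1) * (1 - x) ** (beta - 1)  # unnormalized beta density
        prior_w.append(w)
        post_w.append(w * likelihood_all_single(n, k))
    prior_mean = 2 * sum(n * w for n, w in zip(grid, prior_w)) / sum(prior_w)
    post_mean = 2 * sum(n * w for n, w in zip(grid, post_w)) / sum(post_w)
    return prior_mean, post_mean


low = prior_and_posterior_mean(max_pairs=60, alpha=2, beta=6)
high = prior_and_posterior_mean(max_pairs=100, alpha=5, beta=5)
print(f"low prior:  {low[0]:.0f} -> {low[1]:.0f} socks")
print(f"high prior: {high[0]:.0f} -> {high[1]:.0f} socks")
```

Under these assumed priors the low starting point is pulled up strongly while the high one moves only slightly, because the likelihood keeps rising with the total and so never singles out one value.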
As the reader may have already perceived, BP&S focuses on distributions rather than numbers, and that may allow for a more open analysis and investigation of the assumptions.
The focus on the distribution of the data leads us to the second part of this series.
See you in part_2