The Debugger's Theorem

This reminder has been on a wall in my house for ~15 years


As a programmer for 22 years I've fixed thousands of bugs, and created many times more.  Very often it appeared that the problem I was trying to fix had multiple independent causes.  However I have found - almost invariably - that if you dig long enough you'll find a single cause for all the problems you are seeing.  In fact the moment you hit on the right theory is often really obvious because it suddenly explains a whole bunch of other things that have been going wrong!  But, I wondered, can this observation be proven mathematically?  It turns out it can!

The Debugger's Theorem

If a system that usually works currently isn't working, then it is more likely than not that there's just one thing causing all the observed problems.

Proof

Let $P_0$ be the probability that the system has no problems, $P_1$ be the probability that one independent problem has occurred, $P_{2+}$ the probability that two or more have occurred, and $p_i$ be the probability that the $i$th independent problem has occurred
$$
\begin{align}
P_0 &= \prod_{i=1}^{n}{(1-p_i)} \\
P_1 &= \sum_{i=1}^{n}{(1-p_1)(1-p_2)...(1-p_{i-1})p_i(1-p_{i+1})...(1-p_n)}\\
&=P_0 \sum_{i=1}^{n}{\frac{p_i}{(1-p_i)}}\\
P_{2+} &= 1 - P_0 - P_1 \\
\end{align}
$$

Let's try to work out how to maximize $P_{2+} - P_1$ subject to the constraint that the probability the system works is fixed, i.e. $P_0 = C$.   Since,
$$
P_{2+} - P_1 = 1 - P_0 \left( 1+ 2\sum{\frac{p_i}{1-p_i}}\right)
$$
maximizing $P_{2+} - P_1$ subject to $P_0$ constant is the same as minimizing $\sum{\frac{p_i}{1-p_i}}$ subject to $P_0$ constant.  To do this we need to use the method of Lagrange mulitipliers.  This involves a few steps.  First we define
$$
L(p_1,...,p_n) = \sum{\frac{p_i}{1-p_i}} + \lambda P_0(p_1,...,p_n)
$$
where $\lambda$ is treated like an unknown constant.  Next we need to find a stationary point of $L$ giving $n$ equations in $n+1$ unknowns, including the $\lambda$.  The last step is to add the equation
$$
P_0(p_1,...,p_n) = C
$$
to get $n+1$ equations in $n+1$ unknowns, and then solve for $p_1,...,p_n,\lambda$.
The $n$ equations for the stationary point are
$$
\begin{align}
0 &= \frac{\partial L}{\partial p_k}\\
&= \frac{1}{(1-p_k)^2} - \lambda \prod_{i \neq k}{1-p_i} \\
&= \frac{1}{(1-p_k)^2} - \lambda \frac{P_0}{1-p_k} \\
\iff p_k &= 1- \frac{1}{\lambda P_0} \tag{1}
\end{align}
$$
The last step adds the following equation
$$
\begin{align}
\left(\frac{1}{\lambda C}\right)^n &= C \\
\iff  \lambda &= C^{-\frac{n+1}{n}} \tag{2}
\end{align}
$$
Equations (1) and (2) combine to give
$$
p_k = 1 - \sqrt[n]{C}
$$
Note that, apart from giving us a formula for each independent probability, this tells us $P_{2+} - P_1$ is maximized when $p_1,...,p_n$ are all equal.  Which is kind of obvious I suppose.  Anyway, let's now work out whether $P_{2+}$ is bigger or smaller than $P_1$ when $P_{2+} - P_1$ is as big as possible
$$
\begin{align}
P_{2+}-P_1 &= 1 - P_0 -2 P_1 \\
&= 1 - C - 2C \sum{\frac{p_i}{1-p_i}} \\
&= 1 - C \left( 1 + 2n \frac{1- \sqrt[n]C}{\sqrt[n]C} \right) \\
&= 1 - C ( 1 - 2n + 2nC^{-\frac{1}{n}})
\end{align}
$$
Now, what if $C = \frac{1}{2}$?  In that case it simplifies to
$$
\begin{align}
P_{2+}-P_1 &= \frac{1}{2} + n - n C^{-\frac{1}{n}}\\
&= \frac{1}{2} + n(1 - \sqrt[n]2)
\end{align}
$$
So we just need to check that the second term $n(1 - \sqrt[n]2)$ is always below -0.5.  Now, differentiating w.r.t. $n$ shows that it is an increasing function of $n$.  That's okay as long as it converges something below $-0.5$. Let's check
$$
\begin{align}
\lim_{n\to \infty} n(1 - 2^{\frac{1}{n}}) &= \lim_{x\to 0} \frac{1 - 2^x}{x} \\
&= \lim_{x\to 0} \frac{1 - e^{x ln2}}{x}\\
&= -ln2 \\
&\approx -0.69
\end{align}
$$
Hey presto! It turns out that $P_1$ is always bigger than $P_{2+}$ if $P_0 = \frac{1}{2}$ which is what we set out to prove

Notes

It turns out that we don't even need the system to work most of the time for it to be more likely than not that there's just one thing wrong with it when it's broken.  In fact, it's sufficient for the system to be working just 28.5% of the time for the theorem to be true $^\dagger$.

Also - and this is important - normally the $n$ independent things that can go wrong with the system are not equally likely.  This means that it is even rarer to find multiple independent things go wrong at the same time than this suggests.

The message is: if your system has stopped working and it looks like lots of different things have gone wrong with it, keep looking until you find a single ultimate cause for all of them.  At least spend longer looking for it than you usually do.

FOOTNOTES

  • $\dagger$: Thanks wolfram alpha, I couldn't have worked this out without you!

Comments

Popular posts from this blog

How To Make ASCII Diagrams Beautifuller

Calculation of ECS using simple convection model

Why growth is falling in all developed countries (as a long term trend)