Numbers-Only Hash

2019-10-09

Maybe two years ago, I encountered something that made me think. (I mean, that
wasn't the only time, I'm just... look, never mind.) I was committing something
to some `git` repository, and the truncated commit hash (e.g. `f151b8e`) was all
numbers. Not one of the possible letters `a` through `f` appeared in it. Of
course, the full hash (e.g. `f151b8e71a17fd7c7e074bc4c1500f2dc05da742`) had
plenty of letters, but by chance those first 7 nibbles were all less than `a`.

*Hmm*, I thought. *How likely is that?* Well, that's a pretty easy problem.
Let's assume all hashes are equally likely. That means we can consider each
character of the hash independently. Each character can be one of sixteen
possible values (`0` through `f`), but we want the cases where each character is
a member of the subset `0` through `9`. That means that the probability of
getting a digit for a given character in the hash is given by the following.

$$ P_{1} = \frac{10}{16} =  0.625 $$

Simple enough, but that's just for one character. To get all seven characters as
digits, we need to raise this value to the power 7.

$$ P_{7} = \left(\frac{10}{16}\right)^{7} \approx 0.0373 $$

So about 1 out of every 27 commits or so should contain nothing but digits in
the first seven characters. To my mind, that's infrequently enough to be
interesting while still frequently enough to be fun.

### Other Interesting Parameters

What other neat constraints can we put on?

Well, an obvious one is to expand our digits-only hash from 7 characters to the
full 40. In that case, the probability is considerably smaller.

$$ P_{40} = \left(\frac{10}{16}\right)^{40} \approx 6.8 \cdot 10^{-9} $$

That's about 1 in 146 million, so we can probably say that there might be **a**
commit out there that only has digits. I bet a focused effort could even find
it, but I leave that as an exercise for the reader (or maybe me in the future).

What about a short hash that's in ascending order? We'll include letters again,
so it could be something like `24569bf`. Notice that each character's value is
less than the one after it.[^1]

[^1]:
    Now, some of you may be smugly thinking that you know the term for this
    property---as I did mere moments before writing these words---but you're
    (probably) wrong. The parameter I've arbitrarily stated explicitly disallows
    repeating the same value, but a *monotonic* function can have plenty of
    repeated values if it wants, as long as the values never start going back
    the way they came.

How could we go about solving this? Well, somebody told me once that probability
is just fancy counting, so let's count. We know there are sixteen possible
values and seven spots for them to go in. When we pick the first character, it
can be any character up to `9`, inclusive, since anything more wouldn't leave us
enough room to fit strictly-increasing characters.

That means that the only case for `9` looks like this: `9abcdef`. For `8`, there
can be one "jump" in the sequence, and that jump can be anywhere, so `8abcdef`
and `89abcde` are both valid. That jump can be in any of seven positions (after
the first, after the second, etc.), so for `8` there are seven cases.

`7` is where it starts to get more difficult. Here, there can be *two* jumps of
value 1 or *one* jump of value 2. We know from `8` that the jump of value 2 can
be in any of seven positions (e.g. `789cdef`). A better way to count these along
with the jumps of value 1 might be to think of the jumps as additive. That is,
jumps are always of value 1, but if more than one happen to occupy the same
position, they "stack up" and become a jump of a larger value. This makes the
counting easier, because now we can consider the jumps individually.

Let's try it. Each jump can occupy any of seven positions, and for `7` there are
two jumps. That's a total of $7^{2} = 49$ cases. For `6`, there are three
possible jumps, so that's $7^{3} = 343$. In general, we can sum everything up
using the following expression.

$$ N_{\text{ascending}} = \sum_{i=1}^9 7^{n} = \text{47,079,207} $$

As you might expect, most of the cases come from the last term of the sum, when
$n=9$.

Now, all we need to do is count up the number of possible arrangements of any
sort and divide the two values to come up with the probability. Since there are
sixteen possible values for seven different positions, finding the total number
of possibilities is pretty straightforward.

$$ N_{\text{all}} = 16^{7} = \text{268,435,456} $$

And so is dividing the two values.

$$ P_{\text{ascending}} = \frac{N_{\text{ascending}}}{N_{\text{all}}} = \frac{\text{47,079,207}}{\text{268,435,456}} \approx 0.175 $$

This value is surprisingly high to me. I wouldn't have guessed that nearly 1 in
5 truncated commit hashes meet my strictly-increasing criterion, but apparently
they do.

### We could continue on like this forever.

There are endless criteria we could place on commit hashes (or other
nearly-meaningless strings). I encourage the reader to set one or more and try
to work out the probability. It's good for the soul, or whatever.