I started thinking about the SHA-1 hashes used in Git today. The hashes are 160 bits long, and are usually represented by 40 character hex strings. Why 40?

If one character represents *x* bits, the character set must include 2^x distinct characters. And with 2^x distinct characters (or numerals), we have a base-2^x system. With just two distinct characters (A and B, say), one character represents one bit, *x* = 1, and we have a base-2 system. With four distinct characters (1, 2, 3, 4, say), we have a base-4 system, and one character represents two bits, since 2^2 = 4.
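The relation between alphabet size and bits per character can be sketched in Python (the bases chosen here are just the ones discussed below):

```python
from math import log2

# One character of a base-b alphabet carries log2(b) bits.
for base in (2, 4, 16, 256):
    print(f"base-{base}: {int(log2(base))} bits per character")
```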

One hex (or hexadecimal) character is a character in a base-16 system (hexa – six, deka – ten), so a full set must include 16 distinct characters, and each character represents 4 bits. Most often, the set is taken to consist of the numerals 0–9 plus the six letters A–F. This makes a hex string easy for human beings to read and use.
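Since one hex character carries 4 bits, a single byte always formats as exactly two hex characters; a quick Python check:

```python
# 8 bits -> two hex characters of 4 bits each: 1010 -> a, 1111 -> f.
value = 0b10101111
print(format(value, "02x"))  # -> "af"
```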

The number *n* of *x*-bit characters needed to store a 160-bit hash is found by solving 2^160 = (2^x)^n, or *n* = 160/*x*. The number of hex characters needed is therefore *n* = 160/4 = 40.
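This matches what Python's standard `hashlib` reports for a SHA-1 digest (the input string here is arbitrary):

```python
import hashlib

# SHA-1 produces 160 bits; its hex form is 160/4 = 40 characters.
hex_digest = hashlib.sha1(b"hello").hexdigest()
print(len(hex_digest))  # -> 40
```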

On the other hand, a *byte* character represents 8 bits, which means a full set must include 2^8 = 256 distinct characters. A byte string would therefore be only 160/8 = 20 characters long, but most 256-character sets include characters that are not human readable, such as control characters. A character set of 256 human-readable characters drawing on alphabets richer than the basic Latin one would be possible, but then issues with transferability arise.
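The raw 20-byte form, and its readability problem, can be seen directly (again using `hashlib` with an arbitrary input):

```python
import hashlib

# The raw digest is 160/8 = 20 bytes, many outside printable ASCII.
raw = hashlib.sha1(b"hello").digest()
print(len(raw))  # -> 20
printable = sum(32 <= b < 127 for b in raw)
print(f"{printable} of {len(raw)} bytes are printable ASCII")
```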

What about something in between? A 5-bit string would be only 160/5 = 32 characters long, and would need 2^5 = 32 unique characters, a base-32 system. That could be the 26 letters A–Z plus 6 numerals. It would still be case-insensitive and easily readable, yet 8 characters shorter. Never mind…
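This is essentially the base32 encoding of RFC 4648, whose alphabet is A–Z plus the digits 2–7; Python's `base64` module implements it, so the 32-character form is easy to try (arbitrary input again):

```python
import base64
import hashlib

raw = hashlib.sha1(b"hello").digest()           # 20 bytes = 160 bits
b32 = base64.b32encode(raw).decode("ascii")
print(len(b32))  # -> 32 characters, no padding (160 is divisible by 5)
```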
