Not a blog about the science or practice of travel through books, but perhaps rather about science, practice and travel through books.

The Git hashes

I started thinking about the SHA-1 hashes used in Git today. The hashes are 160 bits long, and are usually represented by 40 character hex strings. Why 40?

If one character represents x bits, the character set must include 2x distinct characters. And with 2x distinct characters (or numerals), we have a base-2x system. With just two distinct characters (A and B, say), one character represents one bit, x = 1, and we have a base-2 system. With four distinct characters (1,2,3,4, say), we have a base-4 system, and one character represents two bits since 22 = 4.

One hex character (or hexadecimal), is a character in a base-16 system (hexa – 6, decimal – 10), so a full set must include 16 distinct characters, and each character represent 4 bits. Most often, the set are taken to consist of the numerals 0-9 plus the six letters A-F. This makes a hex string easy to read and use for human beings.

The number n of x bit characters needed to store a 160 bit hash is found by solving 2160 = (2x)n, or n = 160/x. The number of hexadecimals needed are therefore n = 160/4 = 40.

On the other hand the byte character represents 8 bits, which means a full set must include 28 = 256 different characters. A byte string would therefore only be 160/8 = 20 characters long, but most byte character sets include characters that are not human readable, such as control characters. A character set of 256 human readable characters using some of the richer alphabets than the basic Latin one would be possible, but then come issues with transferability.

What about something inbetween? A 5 bit string would only be 160/5 = 32 characters long, and would need 2^5=32 unique characters, a base-32 system. That could be the 26 letters A-Z plus 6 numerals. It could still be  case-insensitive and easy readable, yet 8 characters shorter. Nevermind…

Leave a comment