What are git hashes and why are they alaphanumeric?
Contents
10. What are git hashes and why are they alaphanumeric?#
10.1. What is a hash?#
a hash is a fixed size value that can be used to represent data of arbitrary sizes
the output of a hashing function
often fixed to a hash table
Common examples of hashing are lookup tables and encryption with a cyrptographic hash.
A cyrptographic hash is additionally:
unique
not reversible
similar inputs hash to very different values so they appear uncorrelated
Hashes can then be used for a lot of purposes:
message integrity (when sending a message, the unhashed message and its hash are both sent; the message is real if the sent message can be hashed to produce the same has)
password verification (password selected by the user is hashed and the hash is stored; when attempting to login, the input is hashed and the hashes are compared)
file or data identifier (eg in git)
10.2. Hashing in Git#
in git, 40 characters that uniquely represent either each object and are used as the key to retrieve the object as a value. Recall there are multiple types of objects: tree, commit, blob.
Git as originally designed to use SHA-1
SHA-1 is weak. Git switched to hardened HSA-1 in response to a collision
In that case it adjusts the SHA-1 computation to result in a safe hash. This means that it will compute the regular SHA-1 hash for files without a collision attack, but produce a special hash for files with a collision attack, where both files will have a different unpredictable hash. from.
Learn more about the SHA-1 collision attach
For now though it’s still SHA-1 like.
cd ../github-inclass-brownsarahm/
Mostly, a shorter version of the commit is sufficient to be unique, so we can use those to refer to commits by just a few characters:
minimum 4
must be unique
For most project 7 characters is enough and by default, git will give you 7 digits if you use --abbrev-commit
and git will automatically use more if needed.
git log --abbrev-commit --pretty=oneline
313c1d7 (HEAD -> main) Revert "c4"
a47d362 c5
94d236b c4
7d17fdc c3
c1807f4 revert chang 1
948eda1 change 2
83690bc change 1
a399978 (origin/new_feature, new_feature) new feature ready for PR
3c9980a (origin/main, origin/HEAD) create about
ec3dd02 Merge pull request #4 from introcompsys/create_readme
db2e41d (origin/create_readme) Create README.md
1613072 Initial commit
git uses the SHA hash primarily for uniuqeness, not privacy
It does provide some security assurances, because we can check the content against the hash to make sure it is what it matches.
We can use git to hash things for us, without writing them:
echo "learning hashes" | git hash-object --stdin
3ba345cea32208d6612e97830fdb0b1ae70ea8bd
The SHA-1 digest is 20 bytes or 160 bits, which is 40 characters in hexadecimal. The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about \(2^{80}\) (the formula for determining collision probability is p = (n(n-1)/2) * (1/2^160)). \(2^{80}\) is 1.2 x 1024 or 1 million billion billion. That’s 1,200 times the number of grains of sand on the earth.
10.3. What is a Number ?#
a mathematical object used to count, measure and label
10.4. What is a number system?#
While numbers represent are quantities that conceptually, exist all over, the numbers themselves are a cultural artifact. For example, we all have a value representing a single item.
In modern, western cultures our is a a hindu-arabic
invented by Hindu mathematicians in India 600 or earlier
called “Arabic” numerals in the West because Arab merchants introduced them to Europeans
slow adoption
We use a place based system. That means that the position or place of the symbol changes its meaning. So 1, 10, and 100 are all different values. This system is also a decimal system, or base 10. So we refer to the places and the ones (\(10^0\)), the tens (\(10^1\)), the hundreds(\(10^2\)), etc for all powers of 10.
10.4.1. Roman Numerals#
Not all systems are place based, for example Roman numerals. In this system the symbols are added if they are the same value or decreasing and subtracted if increasing. There are symbols for specific values: 1=I, V=5, X=10, L =50, C = 100, D=500, M = 1000. Then III = 1+1+1 = 3 and IV = -1 + 5 = 4, VI = 5+1 = 6 and XLIX = -10 + 50 -1 +10 = 49.
10.4.2. Binary#
Binary is any base two system, and it can be represented using any different characters.
Binary number systems have origins in ancient cultures:
Egypt (fractions) 1200 BC
China 9th century BC
India 2nd century BC
In computer science we use binary because mechanical computers began using relays (open/closed) to implement logical (boolean) operations and then digital computers use on and off in their circuits.
We represent binary using the same hindu-arabic symbols that we use for other numbers, but only the 0 and 1(the first two). We also keep it as a place-based number system so the places are the ones(\(2^0\)), twos (\(2^1\)), fours (\(2^2\)), eights (\(2^3\)), etc for all powers of 2.
so in binary, the number of characters in the word binary is 110.
10.4.3. Octal#
Is base 8. This too has history in other cultures, not only in computer science. It is rooted in cultures that counted using the spaces between fingers instead of counting using fingers.
use by native americans from present day CA
and
As in binary we use hindu-arabic symbols, 0,1,2,3,4,5,6,7 (the first eight). Then nine is 11.
In computer science we use octal a lot because it reduces every 3 bits of a number in binary to a single character. So for a large number, in binary say 101110001100
we can change to 5614
which is easier to read, for a person.
10.4.4. Hexadecimal#
base 16, commonin CS because its 4 bits. we use 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F.
This is how the git hash is 160 bits, or 20 bytes (one byte is 8 bits) but we represent it as 40 characters. 160/4=40.
10.5. Review today’s class#
review the notes
find 2 more real world examples of using other number systems (either different bases or different symbols and bases) that are current. Describe them in
numbers.md
Read about hexpeak from Wikipedia for an overview and one additional source in your kwl repo in
hexspeak.md
. Come up with a word or two on your own.Create a single branch for all of the work related to today’s class in your KWL
10.6. Prepare for Next Class#
Bring to class a scenario where you think a small command line program or bash script could be useful. A command line program is a program that we execute on the command line. For example the courseutils kwlcheck is one I wrote.
Bring one scenario in git that you have seen or anticipate that we have not seen the solution for
10.7. More Practice#
Learn more about how git is working on changing from SHA-1 to SHA-256 and answer the transition questions below)
Research the use of non base 10 numbers systems in another culture and contribute one to your group repo
(priority) In a language of your choice or pseudocode, write a short program to convert, without using libraries, between all pairs of (binary, decimal, hexidecimal) in
numbers.md
. Test your code, but include in the markdown file enclosed in three backticks so that it is a “code block” write the name of the language after the ticks like:
```python
# python code
```
## transition questions
1. Why make the switch?
2. Learn more about one collision
3. What impact will the swith have on how git works?
4. If you have scripts that operate on git repos, what might you do to prepare for the switch?
10.8. Questions After Class#
10.8.1. Are git commit hashes predictable/significant in any way? Or is the risk of using a low-bit hash generator negligible for github usage?#
10.8.2. I am still confused on what TRULY a git hash is. Is it an object? Is it the string to the object?#
This is a place where being casual with language makes it hard. A hash, in general is the output of a hashing algorithm. In git, everything that is stored gets hashed and git uses the hashes as the unique values to refer to each stored object. For example, if I take the hash that is the commit number from the most recent commit above, I can use git cat-file
to look at the file, or git object that for the commit.
git cat-file -p 313c1d7
tree 47182af2099fc6075832c13798ee16b713ebb285
parent a47d362409bc4ad0cb57e73b0e7a4c1a1a586f43
author Sarah M Brown <brownsarahm@uri.edu> 1664849752 -0400
committer Sarah M Brown <brownsarahm@uri.edu> 1664849752 -0400
Revert "c4"
This reverts commit 94d236b459d5035ab3f2c2676a888b27cd77e80d.
The “contents” of the commit are several things:
the hash of the snapshot of the contents as a tree object
the hash of the previous commit
author information
commit message
We can also check the type:
git cat-file -t 313c1d7
commit
We can check the type of the parent to verify it points to the last commit.
git cat-file -t a47d3624
commit
We can then look at the contents of the tree as well, using its has has the file name to first check the type of
git cat-file -t 47182af
tree
then display:
git cat-file -p 47182af
040000 tree 95b60ce8cdec1bc4e1df1416e0c0e6ecbd3e7a8c .github
100644 blob bcc1d9287b5cae329629cf3e0779b065b72dad7a README.md
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 about.md
100644 blob 96d35c05b19fc42bc44153b8a97512001c37a09e new_feature.md
this is all of the contents of the directory at the time of that commit, one blob object for each file and one tree object for the single directory.
We can also look at a file
git cat-file -p bcc1d9
# github-inclass-brownsarahm
github-inclass-brownsarahm created by GitHub Classroom
and verify it is the same contents as on disk
cat README.md
# github-inclass-brownsarahm
github-inclass-brownsarahm created by GitHub Classroom
(base) brownsarahm@github-inclass-brownsarahm $
10.8.3. What are uses of Octal#
Octal is convenient because it is consistent with binary representations easily. 1 place in octal is 3 bits (and 4 in hexadecimal is 4 bits).
We will see it for file representations, because they have 3 parts, each of which can be on or off (ie it’s 3 bits of information).
10.8.4. What is the most interesting number system created?#
That is a good thing to explore, but interesting is relative.
10.8.5. the -w
option to git hash-object
writes the hash number to a file?#
No, it writes the content to a file named based on the hash in the .git/objects
folder.
10.8.6. With the information given to us last class, is it possible to complete the work for gitplumbingreview.md and gitplumbingdetail.md?#
Do what you can and that task will come back after next class.
Important
I revised the text of the the activities as posted here relative to what I posted in class to clarify