How do hashes work in git?
Contents
9. How do hashes work in git?¶
In this class we will cover:
what type of hashes are used in git
what a hash is
number systems
9.1. Git Plumbing Q&A¶
Plumbing commands work with the .git directory as a file system directly. Porcelain commands provide higher level interactions with git as a version control system.
9.2. What is a hash?¶
a hash is a fixed size value that can be used to represent data of arbitrary sizes
the output of a hashing function
often fixed to a hash table
A hasing function could be really simple, to read off a hash table, or it can be more complex.
For example:
Hash |
content |
0 |
Success |
1 |
Failure |
If we want to represent the status of a program running it has two possible outcomes: success or failure. We can use the following hash table and a function that takes in the content and returns the corresponding hash. Then we could pass around the 0 and 1 as a single bit of information that corresponds to the outcomes.
This lookup table hash works here.
In a more complex scenario, imagine trying to hash all of the new terms you learn in class. A table would be hard for this, because until you have seen them all, you do not know how many there will be. A more effective way to hash this, is to derive a hashing function that is a general strategy.
9.3. When does hashing occur in git?¶
In git we hash both the content directly to store it in the database (.git) directory and the commit information.
Recall, when we were working in our toy repo we created an empty repository and
then added content directly, we all got the same hash, but when we used git
commit our commits had different hashes because we have different names and
made the commits at different seconds. We also saw that two entries were
created in the .git
directory for the commit.
In git, 40 characters that uniquely represent either the content or a commit.
Mostly, a shorter version of the commit is sufficient to be unique, so we can use those to refer to commits by just a few characters:
minimum 4
must be unique
Note
You can view commits shorter with options to git log
git log --abbrev-commit --pretty=oneline
9.4. What hashing function does git use?¶
Git uses SHA-1. See a generator
This is a Secure Hashing Algorithm that is derived from cryptography. Because it is secure, no set of mathematical options can directly decrypt an SHA-1 hash. It is designed so that any possible content that we put in it returns a unique key. It uses a combination of bit level operations on the content to produce the unique values.
The SHA-1 Algorithm hashes content into a fixed length of 160 bits. This means it can produce \(2^160\) different hashes. Which makes the probability of a collision very low.
This output, 160 bits (a bit is a unit of information in base 2; a 0 or 1) can be interpreted as a number and represented in diffrent ways. Since 160 characters is really long, git represents it as 40 characters in hexadecimal.
However, SHA-1 is subjec to collision attacks.
We have broken SHA-1 in practice.
…
It is now practically possible to craft two colliding PDF files and obtain a SHA-1 digital signature on the first PDF file which can also be abused as a valid signature on the second PDF file.
…
GIT strongly relies on SHA-1 for the identification and integrity checking of all file objects and commits. It is essentially possible to create two GIT repositories with the same head commit hash and different contents, say a benign source code and a backdoored one. An attacker could potentially selectively serve either repository to targeted users. This will require attackers to compute their own collision. — shattered it
Git switched to hardended SHA-1 in response to a collision
In that case it adjusts the SHA-1 computation to result in a safe hash. This means that it will compute the regular SHA-1 hash for files without a collision attack, but produce a special hash for files with a collision attack, where both files will have a different unpredictable hash. from
9.5. What is a number?¶
a mathematical object used to count, meaure and label
a mathematical object used to count, meaure and label
What types of numbers are you familiar with?
9.6. Number Representation¶
Numbers are a cultural artifact: We use a base 10 system most commonly becuase we count with our hands and have 10 fingers. Computers use base 2 because they are digital:on & off. The current representation system we use (0,1, 2, 3, 4,5,6,7,8,9) is called hindu-arabic.
9.6.1. Decimal¶
To represent larger numbers than we have digits on we have a base (10) and then.
we have the ones (\(10^0) place, tens (\)10^1\() place, hundreds (\)10^2) place etc.
9.6.2. Binary¶
uses base 2, but the same characters. So the place values are the ones(\(2^0\)), the twos(\(2^1\)), the fours(\(2^2\)), the eights(\(2^3\)), etc.
so this 10 in binary is 2 in decimal
Binary numbers have been discovered in ancient egyptian, chinese and Indian texts.
9.6.3. Octal¶
uses base 8, but the same characters. So the place values are the ones(\(8^0\)), the eights(\(8^1\)), the 64s (\(8^2\)), etc.
so 10 in octal is 8 in decimal
This numbering system was popular in 6 bit and 12 bit computers, but is has origins before that. Native Americans using the Yuki Language (based in what is now California)used an octal system because they count using the spaces between fingers and speakers of the Pamean languages in Mexico count on knuckles in a closed fist. Europeans debated using decimal vs octal in the 1600-1800s for various reasons because 8 is better for math mostly. It is also found in Chinese texts dating to 1000BC.
9.6.4. Hexadecimal¶
uses base 16, but the same characters plus letters A-F. The letters fill in for the numbers aft 9 so A is 10, B is 11, etc. So the place values are the ones(\(16^0\)), the sixteens(\(16^1\)), the two hundred fifty sixes(\(16^2\)),etc.
so 10 in hex is 16 in decimal
\( CD => 16*12 + 1*13 = 205\)$
There was debate a number of different proposals for how the characters beyond 9 should be represented.
9.6.5. Roman numerals¶
Use different representation completely. It doesn’t have places the same way. But it is still a way to represent numbers.
I = 1, V = 5, X = 10, L = 50, C = 100, D = 500, M =1000
This repesentation concatenates symbols and adds them if they are in desceding order and if a smaller digit before a larger then it is subtracted.
III = 1 + 1 + 1= 3
IX = 10 -1 = 9
XL = 50 - 10 = 40
Learn more
Learning more about other number systems or the history of these number systems is a good topic for a deeper exploration.
9.7. Prepare for next Class¶
review the past two classes of notes
find 3 more examples of using other number systems, list them in
numbers.md
Read about hexpeak from Wikipedia for an overview and one additional source. In your kwl repo in
hexspeak.md
summarize it in your own words, one interesting fact from your additional source and link to the source you found. Come up with a word or two on your own.(priority) Bring to class a scenario where you think git could help, but we haven't covered yet how to do it. (be prepared to post it to Prismia at the start of class)
9.8. More Practice¶
Read more about git's change from SHA-1 to SHA-256 and reflect on what you learned (questions provided)
(priority) In a language of your choice or pseudocode, write a short program to convert, without using libraries, between all pairs of (binary, decimal, hexidecimal) in
numbers.md
. Test your code, but include in the markdown file enclosed in three backticks so that it is a "code block" write the name of the language after the ticks like:
```python
# python code
```
9.8.1. transition questions¶
Why is the switch important?
Summarize one vulnerability of SHA-1.
What impact will the swith have on how git works?
If you have scripts that operate on git repos, what might you do to future proof them so that the switch won’t break your code.
9.9. Questions After Class¶
9.9.1. Why did git use SHA-1 isntead of SHA 256 from the start?¶
The shorter the hash the more efficient the system is and git was 12 years old when SHA-1 was determined to be meaningfully vulnerable to collsion attack.
9.9.2. If there is ever a class cancellation or any kind of issue with the notes again, will that be emailed to us or only sent through Slack?¶
It will be announced on slack, but with an @channel mention, you can set your notification preferences so that you get e-mails for direct messages and @ mentions and the @channel counts as an @ mention for all members of the slack workspace (in this case our class). You coudl also opt for mobile notifications if you prefer. The goal with slack is to make it so that you all can better customize how you are notified and I can post to one place.
9.9.3. Are both git and github switchig from sha1¶
GitHub uses git, it is not an alternative implementation or a fork, so yes it will switch too. The developers at GitHub an other git hosts are among the most impacted by the change since they write code that directly interacts with git objects.
9.9.4. Are octal numbers ever used in computing?¶
We will see it when we talk about file permissions in unix systems.
More broadly, this would be a good topic for a deeper exploration report.
9.9.5. Questions left as an exercise¶
Warning
these are for you to look up/hypothesize about and then we will discuss with you in your KWL repo
What are the possible dangers of git switching to a better encryption scheme immediately?