11. What is a commit number?#

11.1. Announcements#

Important

follow the prepare for 10/19 to get announcements from the repo commit messages

11.2. What is a hash?#

a hash is:

  • a fixed size value that can be used to represent data of arbitrary sizes

  • the output of a hashing function

  • often fixed to a hash table

Common examples of hashing are lookup tables and encryption with a cyrptographic hash.

A hashing function could be really simple, to read off a hash table, or it can be more complex.

For example:

Hash

content

0

Success

1

Failure

If we want to represent the status of a program running it has two possible outcomes: success or failure. We can use the following hash table and a function that takes in the content and returns the corresponding hash. Then we could pass around the 0 and 1 as a single bit of information that corresponds to the outcomes.

This lookup table hash works here.

In a more complex scenario, imagine trying to hash all of the new terms you learn in class.

A table would be hard for this, because until you have seen them all, you do not know how many there will be. A more effective way to hash this, is to derive a hashing function that is a general strategy.

A cyrptographic hash is additionally:

  • unique

  • not reversible

  • similar inputs hash to very different values so they appear uncorrelated

What are some ways a hash could be used?

Hashes can then be used for a lot of purposes:

  • message integrity (when sending a message, the unhashed message and its hash are both sent; the message is real if the sent message can be hashed to produce the same has)

  • password verification (password selected by the user is hashed and the hash is stored; when attempting to login, the input is hashed and the hashes are compared)

  • file or data identifier (eg in git)

11.3. Hashing in passwords#

Passowrds can be encrypted and the encrypted information is stored, then when you submit a candidate password it can compare the hash of the submitted password to the hash that was stored. Since the hashing function is nonreversible, they cannot see the password.

Some sites are negligent and store passwords unencrypted, if your browser warns you about such a site, proceed with caution and definitely do not reuse a password you ever use. (you should never reuse passwords, but especially do not if there is a warning)

An attacker who gets one of those databases, cannot actually read the passwords, but they could build a lookup table. For example, “password” is a bad password because it has been hashed in basically every algorithm and then the value of it can be reversed. Choosing an uncommon password makes it less likely that your password exists in a lookup table.

echo "password" | git hash-object --stdin
f3097ab13082b70f67202aab7dd9d1b35b7ceac2

11.4. Hashing in Git#

In git we hash both the content directly to store it in the database (.git) directory and the commit information.

Recall, when we were working in our toy repo we created an empty repository and then added content directly, we all got the same hash, but when we used git commit our commits had different hashes because we have different names and made the commits at different seconds. We also saw that two entries were created in the .git directory for the commit.

11.4.1. Git hashes are cryptographic#

This is a Secure Hashing Algorithm that is derived from cryptography. Because it is secure, no set of mathematical options can directly decrypt an SHA-1 hash. It is designed so that any possible content that we put in it returns a unique key. It uses a combination of bit level operations on the content to produce the unique values.

We can use the git hashing algorithm without writing to the repo too:

echo "what's up" | git hash-object --stdin
bd26a07a5f7b350d3bce48721cdc5a8c51f65306
echo "what's up " | git hash-object --stdin
1b0095f6628174ff6b4c9341e19230db33071fce
echo "what up " | git hash-object --stdin
d562044d1a61d1b86e3b15e70d6c7f5922e54498
echo "password" | git hash-object --stdin
f3097ab13082b70f67202aab7dd9d1b35b7ceac2
echo "passworksjfwklsjfd" | git hash-object --stdin
251c2d02f1fcbe5230648f4c386ba7d9afdd2cbe
echo "passworksjfwklsjfd" | git hash-object --stdin
251c2d02f1fcbe5230648f4c386ba7d9afdd2cbe

11.4.2. Git hashing algorithm#

Git was originally designed to use SHA-1.

The SHA-1 Algorithm hashes content into a fixed length of 160 bits.

This means it can produce \(2^160\) different hashes. Which makes the probability of a collision very low.

The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about \(2^{80}\) (the formula for determining collision probability is \(p = (n(n-1)/2) * (1/2^160))\). \(2^{80}\) is 1.2 x 1024 or 1 million billion billion. That’s 1,200 times the number of grains of sand on the earth.

A SHORT NOTE ABOUT SHA-1 in the Git Documentation

git uses the SHA hash primarily for uniuqeness, not privacy

It does provide some security assurances, because we can check the content against the hash to make sure it is what it matches.

Then the SHA-1 collision attack was discovered

Git switched to hardened HSA-1 in response to a collision.

In that case it adjusts the SHA-1 computation to result in a safe hash. This means that it will compute the regular SHA-1 hash for files without a collision attack, but produce a special hash for files with a collision attack, where both files will have a different unpredictable hash. from.

they will change again soon

GitHub uses git, it is not an alternative implementation or a fork, so yes it will switch too. The developers at GitHub an other git hosts are among the most impacted by the change since they write code that directly interacts with git objects.

11.4.3. Remember we have many more git objects than commits#

find github-inclass-fa23-brownsarahm/.git/objects -type f
github-inclass-fa23-brownsarahm/.git/objects/04/2a42eb47c33ee43d793feb4d891a93e7460527
github-inclass-fa23-brownsarahm/.git/objects/04/ab89e167ed77bc2a95710f69f68d91a6219471
github-inclass-fa23-brownsarahm/.git/objects/69/3a2b5b9ad4c27eb3b50571b3c93dde353320a1
github-inclass-fa23-brownsarahm/.git/objects/93/4c15dc2655c988c981d9a836783afebda77355
github-inclass-fa23-brownsarahm/.git/objects/5a/a1ed29b82e1cebb8527019b0e594ba71dda214
github-inclass-fa23-brownsarahm/.git/objects/5f/e5a9821625fad2cca4c500e497e6694132c303
github-inclass-fa23-brownsarahm/.git/objects/d7/6bc523443bda5a5daae2fe7fcfbf6fba71ae6d
github-inclass-fa23-brownsarahm/.git/objects/b3/78bd148e53dfa7195c58123362e40ae12ef3e7
github-inclass-fa23-brownsarahm/.git/objects/bc/281792d6ab62b153d7bf44f7985ec7cfc3b850
github-inclass-fa23-brownsarahm/.git/objects/ca/eacb503cf4776f075b848f0faff535671f2887
github-inclass-fa23-brownsarahm/.git/objects/ca/feca302e31c65139b4a5294356e1ea8595dcb1
github-inclass-fa23-brownsarahm/.git/objects/pack/pack-8631fedd908bc07c0b64786e9a83f5bf7a4de110.rev
github-inclass-fa23-brownsarahm/.git/objects/pack/pack-8631fedd908bc07c0b64786e9a83f5bf7a4de110.pack
github-inclass-fa23-brownsarahm/.git/objects/pack/pack-8631fedd908bc07c0b64786e9a83f5bf7a4de110.idx
github-inclass-fa23-brownsarahm/.git/objects/11/d53c24bb5d2bf2e3f645ef188f8bc75fa9c911
github-inclass-fa23-brownsarahm/.git/objects/45/fcb1dd311e5e45af759cb3627dca5f47f58f04
github-inclass-fa23-brownsarahm/.git/objects/75/6c4879c0447db20980f73a26bc2ba072e08a6d
github-inclass-fa23-brownsarahm/.git/objects/44/3f164cdde5059d78df6a61ca3f07bc6a605eb0
github-inclass-fa23-brownsarahm/.git/objects/43/a1267370f1af98071d53f8508abbc56fa3abde
github-inclass-fa23-brownsarahm/.git/objects/88/5588412d138cceb89f06ffed5e83c316c2b593
github-inclass-fa23-brownsarahm/.git/objects/5c/8aaa9f2a129d551b8cb2cb294676f63c4af410
github-inclass-fa23-brownsarahm/.git/objects/65/e9e39935be8400ef12cc9003592f12244b50da
github-inclass-fa23-brownsarahm/.git/objects/3a/cf0fb1c2febd24561294bfb966e1ad1f033eb8
github-inclass-fa23-brownsarahm/.git/objects/98/96f7a7000a7b9d2fdb12047a141524358286c3
github-inclass-fa23-brownsarahm/.git/objects/37/0e04baf4f62d1e62f4949208bc5e4d33af5336
github-inclass-fa23-brownsarahm/.git/objects/6d/4dbd33860fceb9c87bd3c4509deff8cecb3f45
github-inclass-fa23-brownsarahm/.git/objects/39/f1c5eabb1458fa6cf9042611599b69665cf288
github-inclass-fa23-brownsarahm/.git/objects/55/56d17391aeeec9b2b86d2821c011d7ed5377aa
github-inclass-fa23-brownsarahm/.git/objects/b8/6eb90ba1ae5504edfcdc9ef8879e1c6d7a1b75
github-inclass-fa23-brownsarahm/.git/objects/b6/2d570421c3096d8c80c7df56357cdd3203fd3a
github-inclass-fa23-brownsarahm/.git/objects/ea/c84c8320a3ab4f37a441a332b828c45ecedcc9
github-inclass-fa23-brownsarahm/.git/objects/e1/82616690a91a8d0e363f4143e68dd9e136ccee
github-inclass-fa23-brownsarahm/.git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
github-inclass-fa23-brownsarahm/.git/objects/8c/7cefb877c62a46a3b71c68a858c24075b379fe
github-inclass-fa23-brownsarahm/.git/objects/76/8dec80c5e0734476d476ae83376c9c786b6450
github-inclass-fa23-brownsarahm/.git/objects/2b/cb5d446129df427a1ce09e8ba658a5bb8ceca3
echo "passworksjfwklsjfd" | git hash-object --stdin
251c2d02f1fcbe5230648f4c386ba7d9afdd2cbe
python
Python 3.11.4 (main, Jul  5 2023, 09:00:44) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 16*16
256
>>> exit()
python
Python 3.11.4 (main, Jul  5 2023, 09:00:44) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> len('251c2d02f1fcbe5230648f4c386ba7d9afdd2cbe')
40
>>> exit()

11.4.4. Workign with git hashes#

Mostly, a shorter version of the commit is sufficient to be unique, so we can use those to refer to commits by just a few characters:

  • minimum 4

  • must be unique

Git log by default shows us the full length

cd github-inclass-fa23-brownsarahm/
git log
commit 042a42eb47c33ee43d793feb4d891a93e7460527 (HEAD -> main, origin/main, origin/HEAD)
Author: Sarah M Brown <brownsarahm@uri.edu>
Date:   Thu Sep 21 13:26:59 2023 -0400

    begin organizing

commit d76bc523443bda5a5daae2fe7fcfbf6fba71ae6d
Author: Sarah M Brown <brownsarahm@uri.edu>
Date:   Thu Sep 21 12:53:14 2023 -0400

    start organizng for real

commit bc281792d6ab62b153d7bf44f7985ec7cfc3b850
Author: Sarah M Brown <brownsarahm@uri.edu>
Date:   Thu Sep 21 12:51:22 2023 -0400

    start organizing

commit 756c4879c0447db20980f73a26bc2ba072e08a6d (origin/fun_fact, fun_fact)
Author: Sarah M Brown <brownsarahm@uri.edu>
Date:   Tue Sep 19 13:26:20 2023 -0400

    second fun fact

commit 768dec80c5e0734476d476ae83376c9c786b6450

For most project 7 characters is enough and by default, git will give you 7 digits if you use --abbrev-commit and git will automatically use more if needed.

git log --abbrev-commit --pretty=oneline
042a42e (HEAD -> main, origin/main, origin/HEAD) begin organizing
d76bc52 start organizng for real
bc28179 start organizing
756c487 (origin/fun_fact, fun_fact) second fun fact
768dec8 Update about.md
6d4dbd3 add fun fact
5c8aaa9 Merge pull request #5 from introcompsys/add-name
65e9e39 (origin/add-name) closes #2
caeacb5 Merge pull request #4 from introcompsys/1-create-an-about-file
693a2b5 (origin/1-create-an-about-file, 1-create-an-about-file) create and complete about file closes #1
6a12db0 Add online IDE url
cfe32e5 Initial commit
git checkout bc28179
Note: switching to 'bc28179'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at bc28179 start organizing

Then we can check and the HEAD is moved to that commit

git log --abbrev-commit --pretty=oneline
bc28179 (HEAD) start organizing
756c487 (origin/fun_fact, fun_fact) second fun fact
768dec8 Update about.md
6d4dbd3 add fun fact
5c8aaa9 Merge pull request #5 from introcompsys/add-name
65e9e39 (origin/add-name) closes #2
caeacb5 Merge pull request #4 from introcompsys/1-create-an-about-file
693a2b5 (origin/1-create-an-about-file, 1-create-an-about-file) create and complete about file closes #1
6a12db0 Add online IDE url
cfe32e5 Initial commit
git checkout main
Previous HEAD position was bc28179 start organizing
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
git cat-file -p 6d4dbd3
tree 370e04baf4f62d1e62f4949208bc5e4d33af5336
parent 5c8aaa9f2a129d551b8cb2cb294676f63c4af410
author Sarah M Brown <brownsarahm@uri.edu> 1695143214 -0400
committer Sarah M Brown <brownsarahm@uri.edu> 1695143214 -0400

add fun fact

11.5. How do we get to the alphanumeric hashes we see?#

Let’s build this up with some review

11.5.1. What is a Number ?#

a mathematical object used to count, measure and label

11.6. What is a number system?#

While numbers represent quantities that conceptually, exist all over, the numbers themselves are a cultural artifact. For example, we all have a way to represent a single item, but that can look very different.

for example I could express the value of a single item in different ways:

  • 1

  • I

11.6.1. Hindu Arabic#

In modern, western cultures, our number system is called the hindu-arabic system, it consists of a set of numerals: 0,1,2,3,4,5,6,7,8,9 and uses a place based system with base 10.

11.6.1.1. Where does the name come from?#

  • invented by Hindu mathematicians in India 600 or earlier

  • called “Arabic” numerals in the West because Arab merchants introduced them to Europeans

  • slow adoption

11.6.1.2. Numerals#

are the visual characters used to represent quantities

11.6.1.3. Place based#

We use a place based system. That means that the position or place of the symbol changes its meaning. So 1, 10, and 100 are all different values. This system is also a decimal system, or base 10. So we refer to the places and the ones (\(10^0\)), the tens (\(10^1\)), the hundreds(\(10^2\)), etc for all powers of 10.

Number systems can use different characters, use different strategies for representing larger quantities, or both.

11.6.2. Roman Numerals#

is a number system that uses different numerals and is not a place based system

There are symbols for specific values: 1=I, V=5, X=10, L =50, C = 100, D=500, M = 1000.

Not all systems are place based, for example Roman numerals. In this system the subsequent symbols are either added or subtracted, with no (nonidentity) multipliers based on position. Instead if the symbol to right is the same or smaller, add the two together, if the left symbol is smaller, subtract it from the one on the right.

Then

  • III = 1+1+1 = 3

  • IV = -1 + 5 = 4

  • VI = 5+1 = 6

  • XLIX = -10 + 50 -1 +10 = 49.

This feel hard because it is unfamiliar

11.7. Different Bases#

11.7.1. Decimal#

To represent larger numbers than we have digits on we have a base (10) and then.

\[10 = 10*1 + 1*0\]
\[22 = 10*2 + 1*2 \]

we have the ones (\(10^0) place, tens (\)10^1\() place, hundreds (\)10^2) place etc.

11.7.2. Binary#

Binary is any base two system, and it can be represented using any different characters.

Binary number systems have origins in ancient cultures:

  • Egypt (fractions) 1200 BC

  • China 9th century BC

  • India 2nd century BC

In computer science we use binary because mechanical computers began using relays (open/closed) to implement logical (boolean) operations and then digital computers use on and off in their circuits.

We represent binary using the same hindu-arabic symbols that we use for other numbers, but only the 0 and 1(the first two). We also keep it as a place-based number system so the places are the ones(\(2^0\)), twos (\(2^1\)), fours (\(2^2\)), eights (\(2^3\)), etc for all powers of 2.

so in binary, the number of characters in the word binary is 110.

\[ 10 => 2*1 + 1*0 = 2\]

so this 10 in binary is 2 in decimal

\[ 1001 => 8*1 + 4*0 + 2*0 + 1*1 = 9\]

11.7.3. Octal#

Is base 8. This too has history in other cultures, not only in computer science. It is rooted in cultures that counted using the spaces between fingers instead of counting using fingers.

This numbering system was popular in 6 bit and 12 bit computers, but is has origins before that. Native Americans using the Yuki Language (based in what is now California)used an octal system because they count using the spaces between fingers and speakers of the Pamean languages in Mexico count on knuckles in a closed fist. Europeans debated using decimal vs octal in the 1600-1800s for various reasons because 8 is better for math mostly. It is also found in Chinese texts dating to 1000BC.

\[ 10 = > 8*1 + 1*0 = 8\]

so 10 in octal is 8 in decimal

\[ 401 => 64*4 + 8*0 + 1*1 = 257\]

As in binary we use hindu-arabic symbols, 0,1,2,3,4,5,6,7 (the first eight). Then nine is 11.

In computer science we use octal a lot because it reduces every 3 bits of a number in binary to a single character. So for a large number, in binary say 101110001100 we can change to 5614 which is easier to read, for a person.

11.7.4. Hexadecimal#

base 16, commonin CS because its 4 bits. we use 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F.

This is commonly used for representing colors

This is how the git hash is 160 bits, or 20 bytes (one byte is 8 bits) but we represent it as 40 characters. 160/4=40.

11.8. Prepare for Next Class#

  1. Review the course website releases to get familiar with what content appears there

  2. Review the GitHub Action files in your KWL repo and make note of what if any syntax in there is unfamilar. (note that link may not work on the rendered website, but will work on issues)

  3. Use quote reply or edit to see how I made a relative path to a location within the repo in this issue.

  4. Use git log to view recent updates to the course website from the October 12 class to the October 17 class release

  5. review the open PR about a build badge if you are intersted in that option and comment if anything is unclear or approve if it is helpful.

11.9. Review today’s class#

  1. find 2 more real world examples of using other number systems (either different bases or different symbols and bases) that are current. Describe them in numbers.md. Include links to your sources and be sure that the sources are trust worthy.

  2. Calculate the maximum number of git objects that a repo can have without needing to use more than the minimum number of characters to refer to any object and include that number in gitcounts.md with a title # Git counts. Describe the scenario that would get you to that number of objects with the maximum or minimum number of commits in terms of what types of objects would exist. Assume normal git use with porcelain commands, not atypical cases with plubming commands. If you get stuck, outline what you know and then request a review.

11.10. More Practice#

  1. Read about the Learn more about the SHA-1 collision attach

  2. Learn more about how git is working on changing from SHA-1 to SHA-256 and answer the transition questions below gittransition.md

  3. find 2 more real world examples of using other number systems (either different bases or different symbols and bases) that are current. Describe them in numbers.md

  4. Calculate the maximum number of git objects that a repo can have without needing to use more than the minimum number of characters to refer to any object and include that number in gitcounts_scenarios.md with a title # Git counts. Describe 3 scenarios that would get you to that number of objects in terms of what types of objects would exist. For example, what is the maximum number of commits you could have without exceeding that number, how many files could you have? How could you get to that number of objects in the fewest number of commits? What might be a typical way to get there? Assume normal git use with porcelain commands, not atypical cases with plubming commands. If you get stuck, outline what you know and then request a review.

11.11. gittransition#

## transition questions

1. Why make the switch?
2. What impact will the switch have on how git works?
3. Which developers will have the most work to do because of the switch?

11.12. Experience Report Evidence#

none

11.13. Questions After Today’s Class#

11.13.1. What might the password table looklike?#

You can check your own passwords against a collection of breech-revealed passwords, or download such a table at haveibeenpwned

Conceptually it could be as simmpel as 2 columns:

password

hash

11.13.2. Will we have to keep creating bigger and bigger number systems when computers get too fast at brute forcing hashes?#

Most systems lock people out after a few tries to try to contol this, but potentially.

This is also why we typically have 2 factors now.

11.13.3. Are the letters used in hexadecimal representation global? Like if all countries that may not use english have to use the english characters in hex A,B,C,D,E,F?#

The letters we use are not limited to English. The English alphabet descends largely from the latin alphabet, as do many other European languages. Further, since globalization, even some languages that historically had their own alphabets have adopted the English alphabet or one that is very similar.

Historically people may have used others for hex, but in programming contexts we all use the same latin alphabet .

11.13.4. Why are numbers so complicated?#

Fundamentally, counting can happen in many different ways. What we talked about today is abstractions and representations that have been developed over 1000s of years over the entire planet.

People are complicated and vary a alot

11.13.5. What is the main difference between hash tables and hash functions?#

Strictly speaking, a hash function could be used to produce a hash table, a hash table is sort of a special case.

A hash function tells how to produce the hash from the input. A hash table could assign hashes to content arbitrarily.

11.13.6. If the word “password” hash has been easily reverised, couldn’t that just be done with every word in a language?#

Yes, it could, that is why we have rules about what makes a good password. That increases the possible options.

There are a lot of words in a language, but it’s a lot less than the total number of valid pass words, which do not have to be meaningful and typically can be any mix of letters in upper (26) or lower (+26) case, numbers (+10), or special characters (maybe another +20 or more). This means a password of length \(d\) can be \(26*2+10+20 = 82^d\) options. Even for just length 2 there are then 6724 possible options. For length 4 there are over 45 million. So once you allow long passwords that are 30+ characters, it becomes really hard to store all of those options.

11.13.7. What exactly is a relay?#

We will see this later when we talk more about hardware. It is an electrically operated switch.

Relay from wikipedia

the wikipedia article also has an animation

11.13.8. Are there any hashing algorithms that use numerical systems greater than base 16?#

The hashing algorithm does not necessarily select a specific base. For example the SHA-1 algorithm is defined using a series of bitwise and mathematical operations on the binary representation of the content:

  • split the messages into batches of length 512 bits

  • break into 32 bit words

  • xor

  • and

  • comparisons (if statments)

  • bit shifts

  • add

  • or

you can see the full psuedocode for the algorithm

Any base can be used to write out the strings. 16 is the highest commonly used, but for example the wikipedia article on SHA-1 uses both base 64 and 16 in the examples

11.13.9. How would a developer create a function to hash passwords and store them securely?#

In general, a developer would not develop a new hash function, you wouls use an existing one. Increasingly, authentication standards or using other authentication tools are common.

11.14. Questions that are a good explore badge#

Where to collate these?

Ideas for where/how to collate questions like this are welcome and worth a community badge

11.14.1. What are good practices for adding authentication to my program?#

This could be a good explore badge or potentially even a build on this topic

11.14.2. I want to learn more about encryption#

If you make this a little more specific it could be a great explore badge. You could try out an encryption library or read and write a summary or something else that makes sense to you.

11.14.3. Hashing and Quantum computing#

  • Once quantum computing is more prominent, will hashing algorithms change because now it will can be both 0 and 1?

  • Will current hashing algorithms be safe against quantum computing?

A good answer to this would rely on multiple sources and some independent reasoning.

Where to collate these

Ideas for where/how to collate questions like this are welcome and worth a community badge