GIT from the inside

or: How GIT sees the world

Jens Neuhalfen

@gh-pages@printsource

LICENSE

  • Press s for speakers view
  • Press o for overview

Why SCM?

SCM System
*S*ource *C*code *M*anagement System
Tracks changes
Who, When, What and Why
Time Machine
Go back to any point in the past
Collaborative
Allows multiple people to collaborate on a source set

What is a hash?

A hash \(h(x)\) is a fixed length value derived from some data \(x\).

\begin{align} A == B &\Rightarrow h(A) == h(B) & \text{always} \\ h(A) == h(B) &\Rightarrow A == B & \text{almost always (*)} \end{align}

(*) It is very unlikely that different values hashes to the same hash. If this happens it is called a hash collision.

How likely is a hash collision?

Sorry, your browser does not support SVG.

Collisions depend on the number of changed files & commits.

Rule of thumb: Don't create more than \(10^{10}\) (ten US-billion) files with \(10^{10}\) commits per repository

Git basics

Mental model

Think of git as a filesystem with extra dimensions , or, if you like math, a directed acyclic graph.

Sorry, your browser does not support SVG.

The "git file system"

The following pages show how git implements the "file system" used for its magic. Files, directories and commits are handled all in the same way!

What is content addressed storage?

A content addressed storage is a very simple database

PRO CON
- simple - no query method besides hash
- data not duplicated  

Sorry, your browser does not support SVG.

put
\(put(data) \to hash_{data}\)
get
\(get(hash_{data}) \to data\)

sha1(data) == hash

Empty Object Storage

This is the empty object store. It is located in the .git/ directory

File Content

We can store the content of README.txt but not the file name. Git calls this a blob.

Directories

A directory (tree in git-speak) is a special file that contains file names and links to the content via the hash. E.g. B 0xafde README.txt is the README.

The First Commit

A commit (circle) points to a tree (the files) and has e.g. a commit message.

The Second Commit

  • Most of the time a commit points to a parent commit.
  • See how someone edited README.txt ( B 0xdead README.txt )?

What do changes mean?

Changing README.txt also changes the tree! The hash changed from 0x4711 to 0x0815 .

This is very important : Changing a file will change its hash. This will change the content of the parent tree. This will change the hash of the parent tree. This will change all the tree hashes up to the top.

The first branch

Starting out from the filesystem, let's have a look at how a branch can be constructed.

In order to to so, we need to answer a very important question:

How does git know which commit is the current commit?

How git finds the parent commit

Let's recapitulate:

  • Git heavily relies on content addressed storage
  • Content addressed storage is like chaotic storage
    • Very efficient
    • no query method besides get(hash)
    • Needs external paperwork

In order to know the current commit, we need to look at the paperwork.

Initialize repository

Create a fresh repository:


mkdir -p "${repo}" && cd "${repo}"
git init
Initialized empty Git repository in /tmp/git-from-the-inside/branches/.git/

and commit something:

echo "Please read me" >README.txt
git add README.txt
git commit -m"1st commit"
[master (root-commit) 9d1704e] 1st commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.txt

The value 9d1704e in the first line of the output is the ID of the new commit.

External storage in .git/

Question: How does git know which commit is the current commit?

Answer: The .git/ directory provides additional context:

tree -L 1 -hF  "${repo}/.git"
/tmp/git-from-the-inside/branches/.git
|-- [  11]  COMMIT_EDITMSG
|-- [  23]  HEAD
|-- [4.0K]  branches/
|-- [ 143]  config
|-- [  73]  description
|-- [4.0K]  hooks/
|-- [ 145]  index
|-- [4.0K]  info/
|-- [4.0K]  logs/
|-- [4.0K]  objects/
`-- [4.0K]  refs/

6 directories, 5 files
  • .git/HEAD - Tells git what the current commit is
  • .git/refs/.. and .git/branches/.. - later…

Let's see what .git/HEAD contains.

cat .git/HEAD
ref: refs/heads/master

What does git make of ref: refs/heads/master?

git rev-parse  refs/heads/master
9d1704e1300fe89c47cca353569e0377609fcc7f

What is the last commit?

git log
commit 9d1704e1300fe89c47cca353569e0377609fcc7f
Author: Alice <alice@neuhalfen.name>
Date:   Wed Apr 28 06:45:54 2021 +0000

    1st commit

./git/HEAD part II

Question: How does git know which commit is the current commit?

Answer: HEAD points to the current branch. The branch resolves to the current commit.

Sorry, your browser does not support SVG.

Commit

Working on one branch

git commit -m'2nd commit' --allow-empty
[master dcddac0] 2nd commit

Multiple Branches: Theory

Sorry, your browser does not support SVG.

Multiple Branches: practical I/II

git checkout -b devel HEAD^1 # Start "devel" from "2nd commit"
git commit -m'Commit 1 on devel' --allow-empty >/dev/null # Ignore output
git commit -m'Commit 2 on devel' --allow-empty >/dev/null
git log --oneline  # since we have "devel" checked out, this shows the "devel" branch
cf44312 Commit 2 on devel
5542c3f Commit 1 on devel
dcddac0 2nd commit
9d1704e 1st commit

Multiple Branches: practical II/II

Sorry, your browser does not support SVG.

#  --topo-order: Sort by graph layout, not date.
#  --decorate: Print out the ref names of any commits that are shown.
git log devel --oneline --decorate --topo-order
cf44312 (HEAD -> devel) Commit 2 on devel
5542c3f Commit 1 on devel
dcddac0 2nd commit
9d1704e 1st commit

Merge

A merge commit has more than one parent and includes the commits of multiple branches.

Merge: theory

Sorry, your browser does not support SVG.

Merge: how it works

git checkout master

# GIT_MERGE_AUTOEDIT=no  uses the automatically created  commit message
GIT_MERGE_AUTOEDIT=no git merge devel
Already up to date!
Merge made by the 'recursive' strategy.

Merge: all commits belong to the graph

Sorry, your browser does not support SVG.

git log --oneline --decorate --topo-order
3cbe7ef (HEAD -> master) Merge branch 'devel'
cf44312 (devel) Commit 2 on devel
5542c3f Commit 1 on devel
3fd43b5 3rd commit - only master
dcddac0 2nd commit
9d1704e 1st commit

Merge: a branch can be merged more than once

git checkout devel
git commit --allow-empty -m"Hotfix on devel"
git checkout master

GIT_MERGE_AUTOEDIT=no git merge devel
[devel 03a47b3] Hotfix on devel
Already up to date!
Merge made by the 'recursive' strategy.

Merge Summary

Sorry, your browser does not support SVG.

  • PRO
    • Explicit, merge stays a part of the graph
  • CON
    • Graph gets complex
    • Not optimal for work in progress

Rebase

Rebasing "transplants" commits and can be a better way to merge.

Rebase: theory

Sorry, your browser does not support SVG.

Rebase: how it works

git checkout devel

git rebase master
First, rewinding head to replay your work on top of it...
Applying: devel: 1st commit
Applying: devel: 2nd commit

Rebase: All rebased commits are changed

Sorry, your browser does not support SVG.

git log devel --oneline --decorate
02b45a9 (HEAD -> devel) devel: 2nd commit
81e0bb3 devel: 1st commit
b170544 (master) master: 3rd commit
09f2508 master: 2nd commit
603c85a master: 1st commit

Rebase: Merging gets easy

git checkout master

GIT_MERGE_AUTOEDIT=no git merge devel
Updating b170544..02b45a9
Fast-forward
 change_devel | 2 ++
 1 file changed, 2 insertions(+)
 create mode 100644 change_devel

Rebase Summary

Sorry, your browser does not support SVG.

  • PRO
    • Simplified graph
    • Suitable for work in progress
  • CON
    • Merge no longer explicit
    • CAVE push -f

Remote

  • A remote repository is just a normal repository
  • Both repositories (remote and local) work with the same DAG
  • The remote repository can have any name
    • Most often it is called origin
    • The branches pulled from origin are named origin/.... E.g. origin/master
    • Your branches follow remote branches. E.g. master follows origin/master
  • You can have as many remotes as you like!

Remote: how it works

Sorry, your browser does not support SVG.

Remote: non fast forward

Sorry, your browser does not support SVG.

Remote: push -f - a tragedy in V acts

Why is git push -f considered a bad idea? And when would you need it?

  • git rebase can change history
  • So does e.g. git reset HEAD^
  • If you change published history you'd need git push -f
  • But what happens when Bob bases his work on yours?

git push -f Act I: Bob added a feature!

Sorry, your browser does not support SVG.

git push -f Act II: But my history is ugly!

Sorry, your browser does not support SVG.

git push -f Act III: It's just a little bit of history rewriting

Sorry, your browser does not support SVG.

git push -f Act IV: Drama foreshadowing

Sorry, your browser does not support SVG.

git push -f Act V: Explanations are needed

Sorry, your browser does not support SVG.

Wrapping it up

Sorry, your browser does not support SVG.

  • Git relies on a directed acyclic graph (DAG)
    • Adding things to the graph is easy
    • Removing things from the graph (push -f) can get messy
  • Git relies on a content addressable storage
    • All hex numbers (0abef12...) are hashes used to build the DAG

Merge vs. Rebase

Sorry, your browser does not support SVG.

Great Resources

Git and BitCoin

  • Git and BitCoin are based on a DAG linked via content addressing
  • In BitCoin the longest "branch wins"
  • Generating "commits" in BitCoin ("blocks")
    • Gives you money
    • Is artificially made very difficult
fiddle = 0  # A cryptographer/BitCoiner calls this the "nonce"
while True:
    my_block = build_block(last_block_hash, fiddle, my_transactions)

    my_block_hash = hash(my_block)
    if my_block_hash.startswith("0"* 19):
	break
    else:
	fiddle = fiddle + 1        # Bad luck - No new block
print("Yay, I found a new Block!")

To find a hash that starts with 19 zeroes in hexadecimal you need a lot of luck. Or ~ \(10^{11}\) guesses.

BitCoin: Git

  • Data is written in commits
  • The author writes the commit
  • Commits are linked via their hash
  • no magic money

BitCoin: BitCoin

  • Transactions are written in blocks
    • One block can hold 10.000 transactions
  • The miner writes the block
    • With some extra work - the 64 (hex-) digit long hash of the new block needs to start with 19 zeros (04/2021)
    • The first transaction of each block is to magically give the miner some BTC
    • From each transaction a bit of BTC goes to the miner
  • Each block links to the previous block via its hash
    • Only the longest chain is "the truth"

Licensing