GIT from the inside

or: How GIT sees the world

Jens Neuhalfen

Why SCM?

SCM: *S*ource *C*ode *M*anagement system
Tracks changes: who, when, what and why
Time machine: go back to any point in the past
Collaborative: allows multiple people to collaborate on a source set

What is a hash?

A hash \(h(x)\) is a fixed-length value derived from some data \(x\).

\begin{align} A == B &\Rightarrow h(A) == h(B) & \text{always} \\ h(A) == h(B) &\Rightarrow A == B & \text{almost always (*)} \end{align}

(*) It is very unlikely that two different values hash to the same value. When this happens, it is called a hash collision.

How likely is a hash collision?

The probability of a collision grows with the number of objects in the repository, that is, with the number of changed files and commits.

Rule of thumb: Don't create more than \(10^{10}\) (ten billion) files with \(10^{10}\) commits per repository.
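
One way to make this rule of thumb plausible is the birthday bound. The following is a rough estimate, not an exact result. It assumes that SHA-1 behaves like a uniformly random 160-bit function, that only accidental collisions matter, and that the rule corresponds to about \(n = 10^{10} \cdot 10^{10} = 10^{20}\) stored objects (this reading of the rule is an assumption):

\begin{align} p(n) &\approx \frac{n^2}{2 \cdot 2^{160}} = \frac{n^2}{2^{161}} \\ p(10^{20}) &\approx \frac{10^{40}}{2.9 \cdot 10^{48}} \approx 3.4 \cdot 10^{-9} \end{align}

Under these assumptions, even such an enormous repository would see an accidental collision with a probability of only a few in a billion.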

Git basics

Mental model

Think of git as a filesystem with extra dimensions or, if you like math, as a directed acyclic graph.

The "git file system"

The following pages show how git implements the "file system" used for its magic. Files, directories and commits are all handled in the same way!

What is content addressed storage?

Content-addressed storage is a very simple database: each entry is stored and found under the hash of its content.

  • PRO
    • Simple
    • Data is not duplicated
  • CON
    • No query method besides the hash

put: \(put(data) \to hash_{data}\)
get: \(get(hash_{data}) \to data\)

The hash is derived from the data alone: sha1(data) == hash
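
The put and get operations can be tried out with git's low-level commands. This is a minimal sketch, assuming git is installed; the temporary directory and the variable names (repo, hash, data, hash_again) are made up for this illustration.

```shell
# Scratch repository in a temporary directory (hypothetical location).
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# put(data) -> hash: write the data into the object store.
hash="$(printf 'Please read me\n' | git hash-object -w --stdin)"

# get(hash) -> data: read the data back by its hash.
data="$(git cat-file -p "$hash")"

# Putting identical data again returns the same hash; nothing is stored twice.
hash_again="$(printf 'Please read me\n' | git hash-object -w --stdin)"

echo "hash: $hash"
echo "data: $data"
```

Note that there is no way to ask the store for "the README"; the only key is the hash.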

Empty Object Storage

This is the empty object store. It is located in the .git/ directory (in .git/objects/).

File Content

We can store the content of README.txt but not the file name. Git calls this a blob.
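
The blob hash depends only on the content, never on a file name. The sketch below computes a blob ID with git and then by hand. It relies on the fact that git hashes a short header ("blob", the size in bytes, and a NUL byte) followed by the content. The sha1sum tool is an assumption (GNU coreutils; other systems may ship shasum instead), and the variable names are made up for this illustration.

```shell
content='Please read me'

# Blob ID as computed by git; no file name is involved.
blob_id="$(printf '%s\n' "$content" | git hash-object --stdin)"

# Size of the content in bytes, including the trailing newline.
size="$(printf '%s\n' "$content" | wc -c | tr -d ' ')"

# The same ID by hand: SHA-1 over "blob <size>\0" followed by the content.
manual_id="$(printf 'blob %s\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)"

echo "git:    $blob_id"
echo "manual: $manual_id"
```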

Directories

A directory (tree in git-speak) is a special file that contains file names and links to their content via the hash. For example, the entry B 0xafde README.txt says that the blob (B) with the hash 0xafde holds the content of README.txt.
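
A real tree object can be inspected in the same way. This is a small sketch, assuming git is installed; the temporary repository and the variable names are hypothetical. The real listing also contains the file mode, which the simplified notation above leaves out.

```shell
# Scratch repository in a temporary directory (hypothetical location).
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

echo "Please read me" >README.txt
git add README.txt

# Write the staged files as a tree object and remember its hash.
tree_id="$(git write-tree)"

# Each line of a tree: <mode> <type> <hash> <file name>
tree_listing="$(git cat-file -p "$tree_id")"
echo "$tree_listing"

# The hash in the listing is the blob hash of the file content.
blob_id="$(git hash-object README.txt)"
```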

The First Commit

A commit (drawn as a circle) points to a tree (the files) and carries metadata, e.g. a commit message.

The Second Commit

  • Most of the time a commit points to a parent commit.
  • Someone edited README.txt, so the new tree points to a new blob (B 0xdead README.txt).
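
The tree and parent links are plain text inside the commit object. The sketch below creates two commits and prints the second one. It assumes git is installed; the identity, the temporary repository and the variable names are hypothetical.

```shell
# Scratch repository in a temporary directory (hypothetical location).
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Identity for the example commits (hypothetical values).
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

echo "Please read me" >README.txt
git add README.txt
git commit -q -m "1st commit"

echo "Please read me, too" >README.txt
git commit -q -a -m "2nd commit"

# A commit object lists its tree, its parent(s), author, committer and message.
commit_object="$(git cat-file -p HEAD)"
echo "$commit_object"

first_commit="$(git rev-parse HEAD~1)"
```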

What do changes mean?

Changing README.txt also changes the tree! The hash of the tree changed from 0x4711 to 0x0815.

This is very important: changing a file changes its hash. This changes the content of the parent tree, which changes the hash of the parent tree, and so on for every tree up to the top.
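
This ripple effect can be observed directly. The following sketch puts README.txt into a subdirectory, records the tree hashes, edits the file and records them again. It assumes git is installed; the directory name docs and all variable names are made up for this illustration.

```shell
# Scratch repository in a temporary directory (hypothetical location).
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

mkdir -p docs
echo "Please read me" >docs/README.txt
git add docs/README.txt
root_before="$(git write-tree)"
docs_before="$(git rev-parse "${root_before}:docs")"

# Edit the file: the blob changes, therefore docs/ changes, therefore the root changes.
echo "Please read me again" >docs/README.txt
git add docs/README.txt
root_after="$(git write-tree)"
docs_after="$(git rev-parse "${root_after}:docs")"

echo "docs tree: $docs_before -> $docs_after"
echo "root tree: $root_before -> $root_after"
```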

The first branch

Starting out from the filesystem, let's have a look at how a branch can be constructed.

In order to do so, we need to answer a very important question:

How does git know which commit is the current commit?

How git finds the parent commit

Let's recapitulate:

  • Git relies heavily on content-addressed storage
  • Content-addressed storage is like chaotic storage
    • Very efficient
    • No query method besides get(hash)
    • Needs external paperwork

In order to know the current commit, we need to look at the paperwork.

Initialize repository

Create a fresh repository:


repo=/tmp/git-from-the-inside/branches  # path as shown in the output below
mkdir -p "${repo}" && cd "${repo}"
git init
Initialized empty Git repository in /tmp/git-from-the-inside/branches/.git/

and commit something:

echo "Please read me" >README.txt
git add README.txt
git commit -m"1st commit"
[master (root-commit) 50ad332] 1st commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.txt

The value 50ad332 in the first line of the output is the abbreviated ID (hash) of the new commit.

External storage in .git/

Question: How does git know which commit is the current commit?

Answer: The .git/ directory provides additional context:

tree -L 1 -hF  "${repo}/.git"
/tmp/git-from-the-inside/branches/.git
|-- [  11]  COMMIT_EDITMSG
|-- [  23]  HEAD
|-- [4.0K]  branches/
|-- [ 143]  config
|-- [  73]  description
|-- [4.0K]  hooks/
|-- [ 145]  index
|-- [4.0K]  info/
|-- [4.0K]  logs/
|-- [4.0K]  objects/
`-- [4.0K]  refs/

6 directories, 5 files

  • .git/HEAD - tells git what the current commit is
  • .git/refs/.. and .git/branches/.. - later…

Let's see what .git/HEAD contains.

cat .git/HEAD
ref: refs/heads/master

What does git make of ref: refs/heads/master?

git rev-parse  refs/heads/master
50ad33284a01b5c440ffa1c1ac0b848100943039

What is the last commit?

git log
commit 50ad33284a01b5c440ffa1c1ac0b848100943039
Author: Alice <alice@neuhalfen.name>
Date:   Fri Feb 19 08:39:36 2021 +0000

    1st commit

.git/HEAD part II

Question: How does git know which commit is the current commit?

Answer: HEAD points to the current branch. The branch resolves to the current commit.
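
The two steps of this lookup can be followed by hand. This is a minimal sketch, assuming git is installed and that the branch is stored as a plain file below .git/refs/heads/ (true for a freshly created repository with git's default file-based ref storage; refs can also be packed). The identity and the variable names are hypothetical.

```shell
# Scratch repository in a temporary directory (hypothetical location).
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Identity for the example commit (hypothetical values).
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"
git commit -q --allow-empty -m "1st commit"

# Step 1: HEAD names the current branch.
head_content="$(cat .git/HEAD)"
branch_ref="$(git symbolic-ref HEAD)"

# Step 2: the branch is a file that contains the hash of the current commit.
branch_file_content="$(cat ".git/${branch_ref}")"
current_commit="$(git rev-parse HEAD)"

echo "HEAD:   $head_content"
echo "branch: $branch_file_content"
```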

Commit

Working on one branch

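git commit -m'2nd commit' --allow-empty
# The later logs also show a "3rd commit - only master"; it is created here
git commit -m'3rd commit - only master' --allow-empty >/dev/null # Ignore output
[master ded185a] 2nd commit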

Multiple Branches: Theory

Multiple Branches: practical I/II

git checkout -b devel HEAD^1 # Start "devel" from "2nd commit"
git commit -m'Commit 1 on devel' --allow-empty >/dev/null # Ignore output
git commit -m'Commit 2 on devel' --allow-empty >/dev/null
git log --oneline  # since we have "devel" checked out, this shows the "devel" branch
84f83f3 Commit 2 on devel
4983875 Commit 1 on devel
ded185a 2nd commit
50ad332 1st commit

Multiple Branches: practical II/II

#  --topo-order: Sort by graph layout, not date.
#  --decorate: Print out the ref names of any commits that are shown.
git log devel --oneline --decorate --topo-order
84f83f3 (HEAD -> devel) Commit 2 on devel
4983875 Commit 1 on devel
ded185a 2nd commit
50ad332 1st commit

Merge

A merge commit has more than one parent and includes the commits of multiple branches.

Merge: theory

Merge: how it works

git checkout master

# GIT_MERGE_AUTOEDIT=no accepts the automatically created commit message without opening an editor
GIT_MERGE_AUTOEDIT=no git merge devel
Already up to date!
Merge made by the 'recursive' strategy.

Merge: all commits belong to the graph

git log --oneline --decorate --topo-order
6904d70 (HEAD -> master) Merge branch 'devel'
84f83f3 (devel) Commit 2 on devel
4983875 Commit 1 on devel
471953e 3rd commit - only master
ded185a 2nd commit
50ad332 1st commit
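
That a merge commit has more than one parent can be checked on the commit object itself. The sketch below rebuilds a small history in a scratch repository and counts the parent lines of the merge commit. It assumes git is installed; the identity, commit messages and variable names are hypothetical, and the name of the initial branch is read from the repository because it depends on the local configuration.

```shell
# Scratch repository in a temporary directory (hypothetical location).
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Identity for the example commits (hypothetical values).
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

# "master" or "main", depending on the local configuration.
main_branch="$(git symbolic-ref --short HEAD)"

git commit -q --allow-empty -m "1st commit"
git checkout -q -b devel
git commit -q --allow-empty -m "Commit 1 on devel"
git checkout -q "$main_branch"
git commit -q --allow-empty -m "2nd commit - only on the initial branch"

GIT_MERGE_AUTOEDIT=no git merge -q devel

# The merge commit lists one "parent" line per merged history.
parent_count="$(git cat-file -p HEAD | grep -c '^parent ')"
echo "parents of the merge commit: $parent_count"
```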

Merge: a branch can be merged more than once

git checkout devel
git commit --allow-empty -m"Hotfix on devel"
git checkout master

GIT_MERGE_AUTOEDIT=no git merge devel
[devel f0dd7ab] Hotfix on devel
Already up to date!
Merge made by the 'recursive' strategy.

Merge Summary

  • PRO
    • Explicit, merge stays a part of the graph
  • CON
    • Graph gets complex
    • Not optimal for work in progress

Rebase

Rebasing "transplants" commits and can be a better way to merge.

Rebase: theory

Rebase: how it works

git checkout devel

git rebase master
First, rewinding head to replay your work on top of it...
Applying: devel: 1st commit
Applying: devel: 2nd commit

Rebase: All rebased commits are changed

git log devel --oneline --decorate
10a0e31 (HEAD -> devel) devel: 2nd commit
b98f451 devel: 1st commit
20f6bee (master) master: 3rd commit
12ed95b master: 2nd commit
95f0cef master: 1st commit

Rebase: Merging gets easy

git checkout master

GIT_MERGE_AUTOEDIT=no git merge devel
Updating 20f6bee..10a0e31
Fast-forward
 change_devel | 2 ++
 1 file changed, 2 insertions(+)
 create mode 100644 change_devel
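
Both effects, the changed commit IDs and the fast-forward merge, can be reproduced in a scratch repository. This is a sketch under assumptions: git is installed, and the identity, file contents, commit messages and variable names are made up for this illustration.

```shell
# Scratch repository in a temporary directory (hypothetical location).
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Identity for the example commits (hypothetical values).
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

# "master" or "main", depending on the local configuration.
main_branch="$(git symbolic-ref --short HEAD)"

echo "Please read me" >README.txt
git add README.txt
git commit -q -m "1st commit"

git checkout -q -b devel
echo "work in progress" >change_devel
git add change_devel
git commit -q -m "devel: 1st commit"
devel_before="$(git rev-parse devel)"

git checkout -q "$main_branch"
echo "More to read" >>README.txt
git commit -q -a -m "2nd commit"

# Transplant the devel commit onto the tip of the initial branch.
git checkout -q devel
git rebase -q "$main_branch"
devel_after="$(git rev-parse devel)"

# The parent changed, so the rebased commit has a new ID.
echo "devel: $devel_before -> $devel_after"

# The initial branch is now an ancestor of devel: the merge is a fast-forward.
git checkout -q "$main_branch"
git merge -q --ff-only devel
```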

Rebase Summary

  • PRO
    • Simplified graph
    • Suitable for work in progress
  • CON
    • Merge no longer explicit
    • Caution: a branch that was already pushed needs push -f after a rebase

Remote

wip

Licensing