\section{The Git Version Control System}
\label{sec:git}
%Global Information Tracking with Git}
%Tracking Content with git}
% ... originated
Git originated when BitKeeper, a proprietary decentralized VCS then
used by many Linux kernel developers, ceased to be available free of
charge. When kernel developers, led by Linus Torvalds, first started
to use BitKeeper, it created a lot of controversy over fears of data
lock-in. However, BitKeeper showed that a decentralized VCS could help
Linus and the kernel subsystem maintainers accommodate the
ever-growing development activity of the Linux kernel project. It was
therefore clear that there was no going back to simply managing
patches; a new VCS would have to be found. After scouring the existing
open source version control systems, Linus concluded that none would
scale to the task of replacing BitKeeper as the main VCS for the Linux
kernel, so he set out to create a new VCS that specifically addressed
the requirements of the Linux kernel development process.
Among the requirements the new VCS would have to meet was the desire
to speed up handling of large directory trees, since the Linux kernel
project consists of many thousands of files. Furthermore, the rapid
pace at which the Linux kernel project advances meant that the VCS
would have to handle deeply nested revision graphs efficiently.
Lastly, as already mentioned, the experience with BitKeeper had made
it clear that only a decentralized VCS would do; the highly
distributed nature of the Linux kernel development process would
simply not fit into a centralized model.
% 00:54 < pasky> fonseca: really, git is at core just a cleaned up and less paranoid monotone ;)
The result was the git version control system, which, since its
inception in April 2005, has grown at an unimaginable rate to become a
feature-rich set of tools built around a very simple core. Compared to
other version control systems, git has shed many firmly established
practices, which has given git several unique characteristics, some of
which have excited a bit of controversy. As an example, git does not
record explicit file renames; instead, rename tracking is a feature
that must be enabled and works implicitly by comparing the similarity
of the files touched by a given revision.
The remainder of this section will introduce git by investigating more
of these unique characteristics, along with how some of the
requirements listed above are addressed. It will describe how git
stores and tracks content and then move on to explain how each
repository is organized, including the structure of its meta-data. The
main focus will be on the concepts related to the synchronization of
repositories, with the important points summarized for each topic.
\subsection{Separation of Plumbing and Porcelain}
% The Git Core}
From the very beginning, there has been a great incentive to separate
the version control system into two layers: the core, sometimes called
the ``plumbing'', and the user interface, called the ``porcelain''.
Much of this stems from the fact that in its early days, git
development concentrated mainly on expanding the core part of the
system. However, this separation also follows the tradition of UNIX,
where there are a number of small and simple core tools that focus on
doing just one thing, but which can be combined via scripting
languages to accomplish complex tasks.
A driving motivation behind the separation is the idea that the
porcelain layer is simply a set of policies on how a user may interact
with the VCS; a front-end that models a particular view on top of the
content-based storage model, for example how remote branches are
mapped to local ones or how merging and merge conflicts are handled.
As a result, different front-ends, apart from the one distributed with
git itself, have emerged, each focusing on different aspects of
version control. To name a few, there is Cogito\cite{7}, focusing
on user friendliness, and stgit\cite{8}\footnote{An abbreviation of ``stacked
git''.}, a patch queue manager with the goal of making it easy to work
with upstream branches.
The separation has allowed the core to remain purely content-oriented
and dedicated to the low-level details of tracking its data, without
compromising data integrity for the sake of features required in the
porcelain interfaces. This has helped to keep the core clean, with
consistent, well-defined behaviors, leaving all the ``hacks'' to the
porcelain layer. It has also been an important factor in keeping the
VCS stable during the rapid development process.
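
To make the division concrete, the following sketch creates a commit
using only plumbing commands; a porcelain command such as
\texttt{git-commit} essentially wraps these steps. The file name and
commit message are, of course, arbitrary.
\begin{verbatim}
# Stage a file in the index (plumbing).
git-update-index --add hello.txt
# Write the index out as a tree object; prints the tree's SHA1.
tree=$(git-write-tree)
# Create a commit object referencing that tree; prints its SHA1.
commit=$(echo "Initial revision" | git-commit-tree $tree)
# Point the branch head at the new commit.
git-update-ref refs/heads/master $commit
\end{verbatim}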
\subsubsection*{Summary}
The clear separation and the resulting clean and transparent interface
of the core are what make it easy to plug in different storage or
synchronization back-ends. Many tools exist for working at the git
core level, which can be useful when implementing a new back-end.
Only the core part of git is important for synchronization, so only
the git core will be described in the following.
\subsection{The Git Philosophy}
%In the
%following, the concept and motivation behind it will be examined in
%greater depth, since it is often touted as fundamental to
%understanding how git works.
% A main design idea of git is
The founding philosophy behind git is \emph{pure content tracking}.
More specifically, git only cares about content, where content refers to
data contained in files. However, where version control systems
traditionally have been very file-oriented, focusing on the individual
files and how they evolve independently of each other, git tries to take
on a broader view and look beyond the individual files.
First, some characteristics of pure content tracking will be
introduced:
\begin{description}
\item[Content is atomic]
%
Content must be considered in its entirety and tracked through
full copy snapshots. That is, if a file is edited and
committed, a full copy of the file's new content must be
recorded in the repository, even if only one byte in the file
was modified.
\item[Content has no history or structure]
%
This stems from the fact that two versions of a file can be
identical, if for example a change was applied in one version
and reverted in the following version. The two identical
versions must therefore by definition share the same content,
which means that the content itself cannot have any history.
Similarly for the absence of structure: two files situated in
different parts of a directory tree can contain the same
content, so the content itself cannot have any structure
either.
% 00:56 < pasky> I believe that the claim that pure directory is not content is subjective and not universal
\item[Implicit directory tracking]
%
From git's point of view, directories only describe structure.
They do not themselves have any content and are therefore not
tracked explicitly. This means that it is not possible to
track an empty directory. However, when a file in a directory
is added to the list of tracked files, the following commit
will automatically add the enclosing directory.
\item[A snapshot is purely defined by its content]
%
Objects for identical directory trees are always identical
regardless of whether they were derived by pulling the tree
objects from a git repository or by converting an archive, such as
a tarball, to a git repository.
\end{description}
% 00:59 < pasky> x0ample, git will gracefully handle...t0hat sentence sounds just strange in that context
% 00:59 < pasky> first, it should rather read s0o, ...d0 second, it's not true :)
% 01:00 < pasky> Not Even so... it just has nothing to do with the fact that the decision is made heuristically
% 01:01 < fonseca> I admit that I know nothing about how git does the lowlevel merging, so it's probably better not to make any such claim.
% 01:01 < pasky> I don't see how pure content tracking allows me to track code snippet history in any way u0ren0tent tracking wouldn't
As can be seen from the above, pure content tracking is in essence a
model for version control built on the concept of snapshots.
The departure from the file-oriented approach is an acknowledgment of
the fact that copying and moving content between files is much more
common than moving whole files, which means that, in the end, it is
much more powerful to track how content moves around in the tree. As
a result, git has no explicit rename tracking recorded at commit
time; instead, it allows renames to be detected on request, based on
similarity. This dynamic rename detection is more expensive, but
allows more advanced heuristics to be developed and applied if deemed
necessary.
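
As a sketch of how this looks in practice, rename detection can be
requested with the \texttt{-M} option to the diff family of commands,
which pairs up added and deleted files by content similarity; the file
names in the sample output are hypothetical.
\begin{verbatim}
# Show the changes introduced by the HEAD commit, asking git to
# detect renames (-M) by comparing file similarity.
git-diff-tree -r -M --name-status HEAD
# A detected rename is reported with a similarity score, e.g.:
# R095    doc/old-name.txt        doc/new-name.txt
\end{verbatim}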
% The idea of focusing how content is copied or moved
% between different files enables the exact history of a code snippet to
% be derived, which in many ways is far superior to the file-limited
% blame annotation normally used for such a task. Besides searching
% across multiple files, one reason for its superiority is that this
% feature, contrary to file annotation, will automatically search beyond
% the last time lines in the code snippet of interest was touched by a
% revision.
\subsubsection*{Summary}
While the topics in this section mostly relate to how git presents
itself to the user, they also touch on important characteristics
of git that facilitate distributing the content tracked by git in a
peer-to-peer network. The content-based approach ensures that
identical snapshots across different end points will be uniquely
identifiable. How this is done is the topic of the following sections.
% \subsection{Hashing a Global and Decentralized Model}
\subsection{Global Information Tracking}
In the git project's README file it is suggested that ``git'' is an
acronym for \emph{global information tracker}. While \emph{information
tracker} obviously refers to git being a version control system, the
meaning of \emph{global} is much more than just decentralized when
looking at the basic architecture of git. In his search for an open
source VCS, Linus had considered monotone, a decentralized version
control system based on secure hashes and cryptographic signatures,
and decided to adopt secure hashes for git as the primary method for
identifying various parts of the repository, such as revisions and
directory trees. As a consequence, git is global not only in the sense
that it allows projects to be version controlled in a decentralized
manner, it also enables completely separate projects to be either
contained in the same repository or merged together into one ``super''
project.
Secure hashes have been used in a variety of distributed systems. For
example, the BitTorrent protocol employs secure hashes as a means for
peers to validate downloaded data and as a method of referring to a
specific torrent\footnote{The term for the file (single file torrent)
or the group of files (multi file torrent) the peer is downloading.}.
This shows how secure hashes can be used to uniquely identify
objects that are shared across end points and to verify the integrity
of data being exchanged between end points. Git combines both these
properties by naming objects by the hash values of their content.
Objects are then stored in the repository using the object names as
part of the object file name. By organizing the repository around the
content rather than the files being tracked, the repository ends up
being a database with a \emph{content addressable file system} as its
backing store.
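
As a small illustration of the naming scheme, the name of any piece of
content can be computed without storing anything; identical content
yields the identical name on every end point.
\begin{verbatim}
# Compute the object name of the 6-byte content "hello\n".
echo 'hello' | git-hash-object --stdin
# ce013625030ba8dba906f756967f9e9ca394464a
\end{verbatim}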
% 01:04 < pasky> s/stronger hash algorithm&& yet (as of Oct 2006)/
One caveat is the assumption that the secure hash function is able to
provide unique object identifiers. This is vital to avoid accidental
object collisions, but also to ensure that a malicious party is not
able to corrupt the object store with false data. As a result, the
secure hash must be strong enough to avoid both collisions and
attacks. The hash function used by git is currently SHA1. Attacks have
been found for SHA1, and it may therefore be necessary to move to a
different hash function in the future. However, given the nature of
the proven attacks and how cryptographic hash functions are used by
git, this has not yet been deemed serious enough to warrant a switch
to a stronger hash function.
Using secure hashes as the principal idea of the decentralized model
leads to some interesting characteristics:
\begin{description}
\item[Fast comparison]
%
If two objects share the same hash value, they are by
definition the same object, under the above assumption.
\item[Fast access]
%
Makes it possible to use a content addressable file system,
where objects can be accessed directly given the object name
without any lookup or indexing required.
% The catch is that it requires more storage space. This can be
% a huge win compared to delta-oriented formats, such as RCS,
% where long delta chains may have to be rewound.
\item[Universal naming scheme]
%
Using a content-based naming scheme implies that object names
are stable across different end points. Being able to
universally identify an object is a prerequisite for
supporting a decentralized model.
\item[Type and structuring transparency]
%
The low-level storage layer does not need to differentiate
between the types of objects it is storing, nor to know the
object hierarchy. Consequently, certain aspects of repository
synchronization are simplified.
\end{description}
The universal naming scheme allows a repository to host multiple
distinct projects, enabling them to share objects, e.g. file objects
containing the GPL license. The ability for projects to co-exist also
makes it possible to clone a local repository by simply linking the
resulting repository object store with that of the cloned repository.
% in-project branches
As can be seen, git is so fundamentally decentralized that, even
within a single repository, each branch is totally independent and
``self-hosting''. As a consequence, merging branches from projects
which share no development history is no different from merging
branches belonging to the same project. The ability to perform such a
merge and still maintain the complete revision history of each branch
is a feature which only few version control systems have.
\subsubsection*{Summary}
The decentralized model used by git means that branches hosted in the
same repository are completely independent and self-hosting. The use
of secure hashes enables peers to verify the integrity of data
downloaded over the wire. The simple addressing scheme ensures that
objects are globally identifiable.
\subsection{The Repository}
The repository is where git stores all its information. It is usually
situated in the top-level directory of the checked out working tree in
a directory named \texttt{.git}. If no working tree is required, it
may also exist as a bare repository where the convention is to use
\texttt{.git} as the extension of the directory name. In the
following, the organization of the repository will be described.
% FIXME: Mention objects/pack??
The two most important parts of the repository are the object store and
the part devoted to tracking branches and tags. Branches and tags are
located under the \texttt{refs} subdirectory, with branch heads living
in the \texttt{refs/heads} directory and tags in \texttt{refs/tags}.
They are examined in Section \ref{sec:branch-tag} below. The object
store can be found under the subdirectory named \texttt{objects} and
is described in more detail in Section \ref{sec:object-store}.
Git supports a variety of protocols for synchronizing repositories,
each of which must be configured to map remote branches to local
branches. Two methods for configuring this mapping exist: the original
one contained in \texttt{branches} and the more flexible one contained
in \texttt{remotes}. The original method is the simpler one and
configures only one branch per entry. However, since repositories can
contain multiple branches, the more flexible concept of \emph{remotes}
was added to allow a pull to update multiple local branches at once,
reducing the overhead of a series of pulls. The topic of repository
synchronization is handled further in Section \ref{sec:repo-sync}.
The repository also contains a \texttt{hooks} directory where it is
possible to define scripts scheduled to run when certain events occur.
As an example, there is an \emph{update} hook which can be used to
block a repository update or send a notification message whenever new
changes are pushed into the repository. Some of the hooks take
arguments, such as revision IDs, and can use their exit code to
signal whether a specific operation should proceed or be canceled.
Included in git is a collection of sample hook scripts, which are all
installed, but disabled by default. They can be enabled by simply
making them executable.
\begin{comment}
\subsubsection{The Index File}
Staging area during merges
A cache of file system (fstat()) info
-> a fast method for checking for modifications
Allows certain operations to be performed directly in the index (file)
instead of the working tree.
This was solved by introducing a directory cache
in the form of an index file listing all files tracked by the VCS
together with file system status information, that allows to quickly
determine whether a file has changed on disk.
\end{comment}
\subsubsection*{Summary}
Repositories have a well-defined layout and are completely separate
and self-contained. They can exist as bare repositories, which is
suitable for a storage back-end that does not care about working
trees. There already exist various data formats for specifying
repository-related information, which can be of help for new storage
back-ends.
\subsection{The Object Store} \label{sec:object-store}
The bulk of the repository is the object store where all revision
snapshots are stored in the form of objects having a hierarchical
structure. By default, the object store lives in a subdirectory of the
repository; however, via a mechanism called \emph{alternatives},
objects may be distributed among numerous directories, creating an
overlay of object stores. This can be used to speed up cloning of
subprojects with intersecting object stores, such as the repositories
published by the Linux kernel subsystem maintainers.
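
A sketch of how such an overlay can be arranged by hand (the path is
hypothetical): the alternates file simply lists, one per line, other
object stores in which objects may be found, so that they need not be
duplicated locally. A local \texttt{git-clone -s} sets this up
automatically.
\begin{verbatim}
# Let this repository borrow objects from another object store.
echo /pub/scm/project.git/objects \
    > .git/objects/info/alternates
\end{verbatim}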
The object store uses a content addressable file system structure,
where each object is stored in a separate file containing all
information regarding the object, and where all objects are organized
based on their hash value in hexadecimal form. To improve file system
performance when the object store accumulates many objects, object
files are distributed over 256 subdirectories based on the first two
hexadecimal digits of the object name. For example, an object having
the hash \texttt{27165a1ad2617898b38d0fc16e530c5b12fa9261} will be
stored under the subdirectory \texttt{27} as
\texttt{27/165a1ad2617898b38d0fc16e530c5b12fa9261}.
\begin{figure}
\begin{center}
\newlength{\ObjWidth}
\settowidth{\ObjWidth}{\scriptsize{\bf Commit C}}
\newcommand{\Obj}[2]{[name=#1]\psframebox[cornersize=absolute,linearc=3pt]{\makebox[\ObjWidth]{\scriptsize{\bf #2}}}}
\newcommand{\ObjLine}[2]{\ncline{#1}{#2}\naput{\scriptsize{\emph{}}}}
\begin{psmatrix}[colsep=0.5,rowsep=1,mnode=r]
& \Obj{A}{Tag A} \\
\TextNode{Z}{...} & \Obj{B}{Commit A} & & \Obj{D}{Commit B} & & \Obj{F}{Commit C} & & \TextNode{V}{...} \\
& \Obj{C}{Tree A} & \Obj{K}{Blob A1} & \Obj{E}{Tree B} & \Obj{H}{Blob B1} & \Obj{G}{Tree C} & \Obj{M}{Blob C1} \\
& & \Obj{L}{Blob A2} & & \Obj{I}{Blob B2} & \Obj{O}{Tree C3} & \Obj{N}{Blob C2} \\
& & & & \Obj{J}{Blob B3} & & \Obj{P}{Blob C31}
% Node connections
\psset{nodesep=1pt,arrows=->}
\ObjLine{A}{B}{}
\ObjLine{B}{C}{}
\ObjLine{D}{E}{}
\ObjLine{F}{G}{}
\ObjLine{V}{F}{}
\ObjLine{F}{D}{}
\ObjLine{D}{B}{}
\ObjLine{B}{Z}{}
\ObjLine{C}{X}{}
\ObjLine{G}{Y}{}
\ObjLine{C}{K}{}
\ObjLine{C}{L}{}
\ObjLine{E}{H}{}
\ObjLine{E}{I}{}
\ObjLine{E}{J}{}
\ObjLine{G}{M}{}
\ObjLine{G}{N}{}
\ObjLine{G}{O}{}
\ObjLine{O}{P}{}
\end{psmatrix}
\caption{The git object hierarchy.}
\label{fig:object-hierarchy}
\end{center}
\end{figure}
% It uses a self verifying
% on-disk format to ensure data integrity and
%
% Content - hierarchy - state (history) - trust
There exist four basic types of objects. In the following, they are
introduced bottom-up with respect to the object hierarchy depicted in
Figure \ref{fig:object-hierarchy}:
\begin{description}
\item[Blob]
%
Blob objects are, as the name suggests, pure storage objects
that contain the data associated with a particular file
version. Blob objects themselves do not refer to other
objects.
\item[Tree]
%
A tree object holds information about a single directory in
the broader directory tree of a snapshot. Files in the directory are
referenced as blob objects and nested directories are
referenced as other tree objects. Apart from containing
hierarchical information about the directory tree, it also
records information such as permission flags.
\item[Commit]
%
Each revision will create a commit object that records the
state of the working tree by referencing a tree object.
Optionally, the commit object may also record one or more
parent commit objects, depending on whether the commit is a
merge or not. The initial revision in a project will have no
parents; the same applies to pristine branches. Additionally,
the object records the identities of the author and committer,
along with a time stamp for each, as a means of attributing
responsibility for the commit, plus free-form text holding the
commit message.
\item[Tag]
%
A tag object may point to any type of object. It holds
information about the tag name and the identity of the tagger,
along with a time stamp. Finally, it may also contain free-form
text that can embed a PGP signature of the tag information.
\end{description}
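
Each of the object types can be inspected with the
\texttt{git-cat-file} plumbing command; a small sketch:
\begin{verbatim}
# Print the type of the object a revision refers to ...
git-cat-file -t HEAD        # -> commit
# ... and its body: the tree and parent references, the author
# and committer lines, a blank line, and the commit message.
git-cat-file commit HEAD
\end{verbatim}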
Common to all objects is that they each contain an object header
holding meta-information, such as the object type. Whereas the object
body may use a binary format to represent its data, the object header
is a short ASCII string giving the object type and the size of the
body.
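
The object name is simply the SHA1 of the header followed by the body,
which can be checked with standard tools; a minimal sketch, assuming a
shell where \texttt{printf} emits a NUL byte for
\texttt{\textbackslash 0} and \texttt{sha1sum} is available:
\begin{verbatim}
# The header has the form "<type> <body size>" followed by a
# NUL byte; hashing header plus body yields the object name.
printf 'blob 6\0hello\n' | sha1sum
echo 'hello' | git-hash-object --stdin
# Both print ce013625030ba8dba906f756967f9e9ca394464a.
\end{verbatim}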
% FIXME: Zlib reference??
To reduce the size of objects, they are compressed using zlib. The
zlib library uses dictionary-based data compression and will, as a
result, perform poorly on certain types of objects. For example, tree
objects, which by definition contain many effectively random SHA1
strings, may not compress very well, and the same holds for blob
objects containing binary data, which may already be compressed using
zlib or another compression format. The zlib library ensures that
compression never increases the size of the data compared to the
original, by simply falling back to storing the data uncompressed.
\subsubsection{Self-verifying On-disk Format}
As already established, securing data integrity is among the most
notable goals of a VCS. In a decentralized VCS this is even more
important since the integrity of data may be challenged not only by
hard disk failures or faults in the operating system, but also by
rogue end points. Having a self-verifying on-disk
format is therefore desirable, since it allows incoming data to be
verified and the repository to be regularly checked against
corruption. This increases the chance that corruption is caught as
early as possible and helps to prevent it from spreading to other end
points and disrupting the system.
One consequence imposed by the desire for a self-verifying on-disk
format is stricter object relations. For example, a commit object not
only encodes the state of the commit and what the associated directory
tree looked like, but also the history that led up to that state. In
other words, the revision history becomes immutable and append-only
since this strict inter-object relationship ensures that it is
impossible to tamper with the revision history without it being
immediately obvious.
Nevertheless, there may exist valid cases where the need to edit a
previous revision arises, e.g. amending a commit message. In these
cases all revisions following the amended revision must also be
edited, an operation called \emph{rebasing}, in order to reestablish
the inter-object relationship. Rebasing can also be used to update the
branch point of a development branch so that it fast-forwards on top
of another branch; a method often used by downstream developers to
synchronize with the upstream branch they are tracking. Caution must,
however, be taken when rewriting revisions in published branches,
since they will no longer fast-forward when they are pulled.
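
A sketch of the typical downstream use, assuming a topic branch that
was started from an upstream \texttt{master} branch:
\begin{verbatim}
# Replay the commits of branch "topic" on top of the current
# tip of "master", rewriting each replayed revision.
git-rebase master topic
\end{verbatim}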
Rebasing may leave behind unused objects in the repository database.
Again, the self-verifying on-disk format provides a solution; however,
in order to prune such objects, the complete database must be
traversed to identify the unused objects. The same method can be used
to resurrect unreferenced revisions that have been lost by accident,
again ensuring data integrity in the repository database.
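
A sketch of how this traversal can be performed with the plumbing
tools:
\begin{verbatim}
# List objects that are not reachable from any reference ...
git-fsck-objects --unreachable
# ... and remove unreachable objects from the object store.
git-prune
\end{verbatim}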
Tags are an instance where the importance of the self-verifying
on-disk format is further underlined. The inter-object relationships
ensure that by tagging a specific revision object, the whole
development history leading up to the revision is also tagged. By
combining this with cryptographically signed tags, it is possible to
further introduce a human-derived degree of trust that a specific
revision has been identified as the only valid one.
\subsubsection{Grafting Revision Graphs}
\label{sec:grafting}
In some situations, the strict inter-dependencies of revisions become
a restriction rather than an appreciated feature that helps to ensure
data integrity. The problem mainly arises when a project that has
long been using another version control system decides to switch to
git. Although tools exist for importing project history from the most
popular version control systems, it may in some circumstances be
undesirable to require users to download a lot of historic data just
to try out the most recent version. Additionally, the project may
later decide to convert its old history and make it available as a
git repository, in which case all of the new history would have to be
rebased on top of the old history. This would require the repository
to be cloned again, since all branch heads would have been updated.
To solve both of these issues, git allows revision graphs to be
\emph{grafted} on top of each other. Grafting works by simply allowing
``fake'' parent revisions to be inserted into the revision graph so
that it appears as if one revision logically precedes another.
The result is that projects can start from scratch without
worrying about the old history and then later convert it and offer it
as an optional extension to the main repository.
Obviously, this has the potential to undermine the self-verifying
nature of the object store by allowing any revision to precede another
with no regard to whether it actually is a real parent of the
revision. For this reason, grafting requires the user to actively
create the graft file and insert the grafted revisions. In other
words, a repository that the user decides to pull changes from
cannot insert fake revision information and thereby corrupt the local
repository.
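
The graft file itself is a plain list in \texttt{.git/info/grafts},
one graft per line: the SHA1 of a commit followed by the SHA1s of the
parents it should appear to have. A hypothetical entry joining new
history onto converted old history could look as follows:
\begin{verbatim}
# .git/info/grafts: <commit sha1> <fake parent sha1> ...
<first commit of new history> <last commit of old history>
\end{verbatim}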
\subsubsection*{Summary}
The structure of the object store allows the integrity of its data to
be verified, both for the individual objects, using their SHA1 names,
and for the object hierarchy of which they are part. As with all
storage back-ends, it should always be ensured that data is valid
before making it part of the repository structure.

Rebasing of certain branches might leave unused objects in the object
store, so it may be necessary to prune it periodically to reduce
disk-space utilization. The space needed for storing objects can be
reduced further by using the alternatives mechanism to cascade the
object stores of repositories that have intersecting objects. For such
repositories, sharing object stores can also reduce the number of
objects that need to be downloaded.
% Similar to when the alternatives method is used for local repositories
% this raises the issue of ensuring that data is written to disk in a
% safe manner. This is even more important
%
% All download into the same repository
\subsection{Branches and Tags}
\label{sec:branch-tag}
Since all information related to branch revisions and tags is
maintained as objects in the object store, branches and tags end up
being simply symbolic references to objects: they map a name to the
SHA1 of an object in the object store. Originally, each reference has
been stored in a 41-byte file, 40 bytes for the SHA1 value followed by
a newline; however, the use of such small files does not scale well
when the number of branches and tags grows. Consequently, git is
currently moving to using a single flat file containing all symbolic
references.
Git already defines a notation for expressing the list of references
in a repository, which is used when synchronizing repositories. The
\texttt{git-ls-remote} program can be used for listing the references
in a repository, local as well as remote. Its output looks as follows:
\begin{verbatim}
0248f303a64c0db45b78cf987db5bc1ddf00d15b HEAD
0248f303a64c0db45b78cf987db5bc1ddf00d15b refs/heads/master
0248f303a64c0db45b78cf987db5bc1ddf00d15b refs/heads/origin
7113a59591aab60f979c6993bffa4cdd66236fdf refs/tags/initial
8dec84bb6b5eb7eb8831582dfd7ebbadb0403474 refs/tags/initial^{}
c732f546d2924665a197fede48721870f96f53d7 refs/tags/ref-0.1
bd562ca1ef05b91e6dbcddf3ace0c897ef76567e refs/tags/ref-0.1^{}
\end{verbatim}
Each line lists one reference by its hexadecimal-encoded SHA1 and
name, separated by a tab. The name is the path under the git
repository and the special \texttt{HEAD} reference refers to the
trunk. For each tag, the object it points to is also listed using a
special object ``dereference'' notation where \texttt{\^}\texttt{\{\}}
is appended to the tag name.
While the append-only semantics of the object store reduces the
chances of race conditions when adding new objects, ensuring atomic
updates of the symbolic references is more critical. Before a symbolic
reference can be updated it needs to be locked, which is done by first
creating a lock file associated with the branch head. If this
succeeds, the lock file is then renamed to the branch head file.
Although switching to a single flat-file for storing references
results in coarser lock granularity, lock contention is only an issue
for shared repositories and will likely not become a problem.
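
The locking protocol is encapsulated in the \texttt{git-update-ref}
plumbing command, which can additionally verify the expected old
value, giving a compare-and-swap style update; a sketch:
\begin{verbatim}
# Atomically move the branch head to <new sha1>, but only if
# it still points at <old sha1>; otherwise refuse the update.
git-update-ref refs/heads/master <new sha1> <old sha1>
\end{verbatim}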
Contrary to the object store, there is no way to verify a symbolic
reference. Usually, references are updated either when committing
locally or when pulling changes from a remote repository; in both
cases, the source of the new references is assumed to be trusted.
\subsubsection*{Summary}
Branches and tags are simply symbolic references mapping a name to an
object SHA1 and can be expressed using a dense format. An open
question is how to ensure the validity of the reference list when
synchronizing it between end points, since git does not itself provide
a method for this. In some repositories many tags and branches can
exist, so the reference list may become very long.
\subsection{Delta-based Storage and Packing}
% Concern
The storage model described in Section \ref{sec:object-store}
maintains a complete copy of each object in order to allow direct
addressing. If only a single byte is changed in a file, a full copy of
the new blob object will enter the object store. Even though each
object is compressed, this can lead to repositories quickly requiring
a lot of backing store. Initially, the argument was that disk space is
cheap and that this would not pose a problem, at least not right away.
% Stats about object sizes in the Linux kernel?
Several issues with this very simple storage model gradually became
clear. First, many of the objects created are small and, as a result,
lead to a large amount of storage overhead when they are stored in
block-based file systems. Second, when working with many objects, they
all have to be loaded into memory, which can require a considerable
amount of memory and may exhibit poor caching behavior because of bad
access patterns. Finally, git initially used rsync to synchronize
repositories across different end points, making cloning of big
repositories painful by requiring a lot of data to be transferred. It
was evident that a more suitable storage model addressing these issues
was needed.
\subsubsection{Introducing Deltas}
% Stats about ``locality''
The solution employed by many existing version control systems is to
exploit the locality of the file data being tracked. Often, two
successive versions of a file will share most of their content and
only differ slightly. By representing the difference in the form of a
delta object, it is possible to derive one version from the other by
simply applying the delta to the version that is available. This can
be repeated endlessly, so that only one complete version needs to be
stored, while the remaining versions are all represented as a chain
of deltas.
Traditionally, single-file delta formats such as those of RCS and SCCS
have been used to encode the delta compression of a file. RCS, for
example, keeps the most recent version of the file readily available
and stores all other versions as deltas. This is often a good
heuristic, since the most recent versions are usually the ones being
accessed most frequently. However, if many versions exist, the delta
chain can become very deep, making access to an old version very
expensive.
% (affects how many steps there
% must be taken in order to assemble the compressed object.
\subsubsection{Packing Heuristics}
Since git is not tied to the file-based model of other version control
systems, it has been able to pursue other compression heuristics and
consequently also alleviate the problem of deep delta chains. Coming
up with a good heuristic that works for many different use cases is a
subject that has been discussed continuously in the community.
Compression should be able to efficiently handle binary files, files
of very different sizes, and randomly created file names as well as
logically structured names, and it should try to take the locality of
data into account. In the following, some of the heuristics regarding
packing will be introduced.
The general idea is to first sort objects by type, size, and
optionally where they reside in the directory tree. Sorting by type
ensures that objects with similar characteristics will be compressed
against each other. Using size as a sorting heuristic takes into
account that files of similar size are more likely to have a lot in
common. By optionally sorting blob objects with respect to their
corresponding file path, locality is prioritized, which increases the
chances that deltas will be smaller. Coupled with the size sorting,
this also considers that files often increase in size over a set of
versions, and thus further improves locality with respect to time.
% Default values?
One approach to coping with the many different use cases has been to
make key aspects of the compression algorithm configurable, such as
the window of objects being compressed against and the depth of the
delta chain. By limiting the depth of the delta chain, certain
guarantees regarding the cost of accessing an object can be made.
Similar to how video formats often encode several base frames
followed by delta frames, the delta chain can be ``restarted'' by
simply inserting a complete object once the maximum depth has been
reached. Using a window to limit which objects to delta against
further enables the compression algorithm to use delta compression
only when it saves space, because objects will only be compared
against ``local'' objects with respect to the sorting order.
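
Both parameters can be supplied when packing; a sketch, assuming a
\texttt{git-repack} that forwards these options to
\texttt{git-pack-objects}:
\begin{verbatim}
# Repack everything into one pack (-a), drop the now redundant
# loose objects and old packs (-d), and trade packing time for
# density via a larger delta window and deeper delta chains.
git-repack -a -d --window=20 --depth=20
\end{verbatim}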
Another thing to consider when packing is the properties of the
different object types. Blob objects provide the best basis for delta
compression, and since they are the bulk of the object store,
compression heuristics specifically oriented towards blob objects have
a significant impact on the packing. The commit and tag objects both
contain free-form text, and therefore rarely share data with their
successors apart from author information. Although tree objects often
share data across different versions, they are by definition more
``fragile'', since a change in a deeply nested directory will cause
all parent tree objects to be rewritten, whereby the attempt to
delta-compress tree objects is overshadowed by the volume and rate at
which they are created.
Finally, there is the problem of coming up with a good heuristic for
when and what to pack. Whereas locally everything can just be packed
into one monolithic pack file, repositories used as distribution
points must also consider using incremental packing to minimize the
number of packs and objects that need to be fetched. To date, this
still remains an open problem.
\subsubsection{Using Pack Files}
Using delta-based compression, git is able to pack the object store
into a dense representation of its objects. The pack files can then be
used both for efficiently transferring objects between repositories
and for local storage. This eliminates the problem of many small
files and, to some extent, provides better caching when objects are in
memory, because pack files can be memory mapped in their entirety.
Pack files are, however, expensive to generate, since potentially many
objects must be examined. This downside affects the performance of
the git server daemon used by many projects for providing access to
repositories. The solution is to reuse portions of existing pack files
when repacking a repository, so that only new objects need to be
examined while already compressed objects can be kept untouched. Since
this incremental form of compression does not take the whole data set
into consideration, it does not lead to optimal packing. This means
it may still be necessary to periodically repack a repository fully,
without reusing existing pack files.
Currently, there is a limit on the size of pack files, both in terms
of how packs are loaded by the git core and in terms of the data
format used; however, it can be expected that pack files will in the
future be allowed to become arbitrarily big. As a consequence, it is
not always viable to load an object pack, or fetch it from a remote
end point, just to see if it contains objects that are of use.
Additionally, for big packs it can be expensive to locate an object in
a pack. To solve this, a pack can be accompanied by an index file
which lists the contents of the pack together with offsets, so that it
is possible to see both what a pack contains and where in the pack the
objects are located.
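
The index accompanying a pack can be examined with
\texttt{git-verify-pack}; in verbose mode it lists, for each object,
its name, type, size, and offset within the pack. The pack file name
below is a placeholder.
\begin{verbatim}
# Verify a pack against its index and list its contents.
git-verify-pack -v .git/objects/pack/pack-<sha1>.idx
\end{verbatim}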
% For this reason git uses explicit packing, meaning that
% packing is something that must be performed explicitly by the user.
% : feature or bug?
% As a result pack
% files are not ``stable'' in the sense that different configurations
% may lead to different pack files that all represent the same
% collection of objects. Flexibility has
Using packs to some extent affects the idea of the content addressable
file system. Since objects may now be represented as a series of
deltas, it is no longer possible to access them quickly and directly.
While this is a regression in terms of speed for this particular case,
packs provide so many other benefits that they end up improving the
overall performance considerably.
\subsubsection*{Summary}
Using object packing, it is possible to represent the objects in the
object store densely. This is usable both for local storage and for
transferring data over the wire when synchronizing repositories. Index
files provide a way to summarize the contents of a pack file, making
it possible to identify which pack files contain objects of interest.
Deltas from old packs can be reused when creating new packs to reduce
the cost of packing objects.
An open question is how to best pack the object store. Different
heuristics exist, and the packing format is designed to be open to new
heuristics. As a consequence, it is not possible to rely on packs
being the same across different instances of a repository.
\subsection{Repository Synchronization}
\label{sec:repo-sync}
There exists a variety of methods, or protocols, for synchronizing
repositories, each providing different options regarding the
requirements imposed on the end points involved in the
synchronization. The many choices have originated to provide better
support for the many different work flows in which git is being used,
for example being able to push or pull changes across a company
firewall. Common to all existing transport protocols is that they are
strictly client-server.
The protocols can be divided into two subcategories: \emph{static} and
\emph{dynamic}, depending on whether they require a
specialized server:
\begin{description}
\item[Static protocols]
%
The static protocols are sometimes referred to as \emph{dumb}
protocols, because they do not require anything from the server
other than the ability to serve static data in the form of the
repository directory tree. The group includes protocols such as
HTTP, FTP, rsync, and to some extent local file copying,
although local cloning can also use a dynamic protocol.

Rsync and local file copying are discouraged, since they do
not ensure that the resulting repository is valid, because of
the sequential order in which they copy the files. For
example, a branch head can end up pointing to a commit that
was never copied, if the repository being synchronized was
updated in the process. Consequently, neither rsync nor
simple copying is a safe method to synchronize or back up a
git repository.

Compared to rsync, the HTTP and FTP protocols are a little
smarter, since they will verify that the downloaded data is
valid. To optimize examination of a remote repository, the
HTTP and FTP protocols rely on a set of ``meta'' files being
present. These files, which can be generated automatically by
calling the \texttt{git-update-server-info} program from a
hook (a sketch of such a hook follows this list), list
information about branch and tag references as well as the
pack files available in the repository.

The only static protocol that supports pushing is HTTP,
which can be configured to use WebDAV to provide the necessary
authentication.
\item[Dynamic protocols]
%
The dynamic protocols, known as \emph{smart} protocols,
are recognizable by the fact that they all use pack files to
transfer updates, whereby they minimize bandwidth usage. Among
the dynamic protocols is the native git protocol, which has
its own dedicated daemon and provides read-only access to
repositories. If write access is required, options exist for
using SSH to authenticate users and then provide access to
repositories. If users are not allowed to log into the system
in question, a special \texttt{git-shell} program has been
created, which can be set as the default shell for the user
and thereby restrict access to git-related activities only.
\end{description}
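
For the static protocols, the ``meta'' files mentioned above are
typically kept up to date by a hook along the following lines:
\begin{verbatim}
#!/bin/sh
# .git/hooks/post-update: runs after refs have been pushed.
# Regenerate info/refs and objects/info/packs so that static
# (dumb) transports can discover references and pack files.
exec git-update-server-info
\end{verbatim}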
As can be seen from the above, git does not itself implement any
authentication methods regarding write access; instead it relies
completely on those provided by third-party systems, mainly SSH and
WebDAV. The primary argument is that authentication is hard to get
right, and by using existing authentication systems, existing
configurations, for example SSH accounts, can also be reused.
In order to ensure that the resulting repository is valid, the
following steps are taken when synchronizing a repository: first, the
branch heads in the two repositories are compared to see if any
synchronization is required; then, for each branch that contains
updates, the relevant objects are synchronized between the two object
stores; finally, upon successful completion, the branch heads are
updated atomically to reflect the new changes. This operation sequence
guarantees that the synchronization is idempotent and that it is safe
to cancel the synchronization in any of the steps. The only result
will be that unreferenced objects can exist in the object store,
objects that can later be pruned if deemed necessary.
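
At the plumbing level, the same steps can be sketched as follows; the
repository URL is hypothetical:
\begin{verbatim}
# 1. Compare branch heads to see whether an update is needed.
git-ls-remote git://example.com/project.git refs/heads/master
# 2. Fetch the missing objects for the updated branch as a pack.
git-fetch-pack git://example.com/project.git refs/heads/master
# 3. Atomically point the local branch at the new head.
git-update-ref refs/heads/master <fetched sha1>
\end{verbatim}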
% As a special dynamic protocol there also exists
% Different protocol: ssh, HTTP, ``native'' git. git cvs server
\subsubsection*{Summary}
Different protocols exist for synchronizing repositories, some of
which are better than others at capitalizing on the use of object
packs when transferring changes. They are all client-server based and
have each solved different issues related to synchronization, some of
which may also be relevant to new networked storage back-ends.