	GIT - the stupid content tracker

"git" can mean anything, depending on your mood.

 - random three-letter combination that is pronounceable, and not
   actually used by any common UNIX command.  The fact that it is a
   mispronunciation of "get" may or may not be relevant.
 - stupid. contemptible and despicable. simple. Take your pick from the
   dictionary of slang.
 - "global information tracker": you're in a good mood, and it actually
   works for you. Angels sing, and a light suddenly fills the room.
 - "goddamn idiotic truckload of sh*t": when it breaks

This is a stupid (but extremely fast) directory content manager.  It
doesn't do a whole lot, but what it _does_ do is track directory
contents efficiently.

There are two object abstractions: the "object database", and the
"current directory cache".

	The Object Database (SHA1_FILE_DIRECTORY)

The object database is literally just a content-addressable collection
of objects.  All objects are named by their content, which is
approximated by the SHA1 hash of the object itself.  Objects may refer
to other objects (by referencing their SHA1 hash), and so you can build
up a hierarchy of objects.

There are several kinds of objects in the content-addressable collection
database.  They are all deflated with zlib, and start off with a tag
of their type, and size information about the data.  The SHA1 hash is
always the hash of the _compressed_ object, not the original one.

In particular, the consistency of an object can always be tested
independently of the contents or the type of the object: all objects can
be validated by verifying that (a) their hashes match the content of the
file and (b) the object successfully inflates to a stream of bytes that
forms a sequence of <ascii tag without space> + <space> + <ascii decimal
size> + <byte\0> + <binary object data>.
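
The layout and the two consistency checks above can be sketched in a
few lines of Python.  This is only an illustration under assumptions,
not the actual implementation: the function names "pack_object" and
"check_object" are made up for this example, and the hash is taken over
the compressed bytes exactly as the text above describes.

```python
import hashlib
import zlib
from typing import Tuple


def pack_object(tag: str, data: bytes) -> Tuple[str, bytes]:
    """Build <tag> + <space> + <decimal size> + <\\0> + <data>, deflate it,
    and name it by the SHA1 of the compressed bytes (hypothetical helper)."""
    if not tag or " " in tag or "\0" in tag:
        raise ValueError("tag must be non-empty, without space or NUL")
    raw = tag.encode("ascii") + b" " + str(len(data)).encode("ascii") + b"\0" + data
    compressed = zlib.compress(raw)
    name = hashlib.sha1(compressed).hexdigest()
    return name, compressed


def check_object(name: str, compressed: bytes) -> Tuple[str, bytes]:
    """Validate an object without caring about its type or contents.
    Returns (tag, data) or raises ValueError (hypothetical helper)."""
    # (a) the hash must match the content of the file
    if hashlib.sha1(compressed).hexdigest() != name:
        raise ValueError("hash does not match file content")
    # (b) the object must inflate to tag + space + size + NUL + data
    try:
        raw = zlib.decompress(compressed)
    except zlib.error as exc:
        raise ValueError("object does not inflate") from exc
    header, nul, data = raw.partition(b"\0")
    tag, space, size = header.partition(b" ")
    if not nul or not space or not tag or not size.isdigit():
        raise ValueError("malformed object header")
    if int(size) != len(data):
        raise ValueError("size field does not match data length")
    return tag.decode("ascii"), data
```

Packing a blob and checking it again returns the same tag and data;
changing a single byte of the compressed file makes check (a) fail
before the object is even inflated.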

BLOB: A "blob" object is nothing but a binary blob of data, and doesn't
refer to anything else.  There is no signature or any other verification
of the data, so while the object is consistent (it _is_ indexed by its
sha1 hash, so the data itself is certainly correct), it has absolutely
no other attributes.  No name associations, no permissions.  It is
purely a blob of data (ie normally "file contents").

TREE: The next hierarchical object type is the "tree" object.  A tree
object is a list of permission/name/blob data, sorted by name.  In other
words the tree object is uniquely determined by the set contents, and so
two separate but identical trees will always share the exact same
object.

Again, a "tree" object is just a pure data abstraction: it has no
history, no signatures, no verification of validity, except that the
contents are again protected by the hash itself.  So you can trust the
contents of a tree, the same way you can trust the contents of a blob,
but you don't know where those contents _came_ from.

Side note on trees: since a "tree" object is a sorted list of
"filename+content", you can create a diff between two trees without
actually having to unpack two trees.  Just ignore all common parts, and
your diff will look right.  In other words, you can effectively (and
efficiently) tell the difference between any two random trees in O(n)
where "n" is the size of the difference, rather than the size of the
tree.

Side note 2 on trees: since the name of a "blob" depends entirely and
exclusively on its contents (ie there are no names or permissions
involved), you can see trivial renames or permission changes by noticing
that the blob stayed the same.  However, renames with data changes need
a smarter "diff" implementation.

CHANGESET: The "changeset" object is an object that introduces the
notion of history into the picture.
In contrast to the other objects, it doesn't just describe the physical
state of a tree, it describes how we got there, and why.

A "changeset" is defined by the tree-object that it results in, the
parent changesets (zero, one or more) that led up to that point, and a
comment on what happened.  Again, a changeset is not trusted per se:
the contents are well-defined and "safe" due to the cryptographically
strong signatures at all levels, but there is no reason to believe that
the tree is "good" or that the merge information makes sense.  The
parents do not have to actually have any relationship with the result,
for example.

Note on changesets: unlike real SCMs, changesets do not contain rename
information or file mode change information.  All of that is implicit in
the trees involved (the result tree, and the result trees of the
parents), and describing that makes no sense in this idiotic file
manager.

TRUST: The notion of "trust" is really outside the scope of "git", but
it's worth noting a few things.  First off, since everything is hashed
with SHA1, you _can_ trust that an object is intact and has not been
messed with by external sources.  So the name of an object uniquely
identifies a known state - just not a state that you may want to trust.

Furthermore, since the SHA1 signature of a changeset refers to the
SHA1 signatures of the tree it is associated with and the signatures
of the parents, a single named changeset specifies uniquely a whole
set of history, with full contents.  You can't later fake any step of
the way once you have the name of a changeset.

So to introduce some real trust in the system, the only thing you need
to do is to digitally sign just _one_ special note, which includes the
name of a top-level changeset.
Your digital signature shows others that you trust that changeset, and
the immutability of the history of changesets tells others that they
can trust the whole history.

In other words, you can easily validate a whole archive by just sending
out a single email that tells people the name (SHA1 hash) of the top
changeset, and digitally sign that email using something like GPG/PGP.

In particular, you can also have a separate archive of "trust points" or
tags, which document your (and other people's) trust.  You may, of
course, archive these "certificates of trust" using "git" itself, but
it's not something "git" does for you.

Another way of saying the same thing: "git" itself only handles content
integrity, the trust has to come from outside.

	Current Directory Cache (".git/index")

The "current directory cache" is a simple binary file, which contains an
efficient representation of a virtual directory content at some random
time.  It does so by a simple array that associates a set of names,
dates, permissions and content (aka "blob") objects together.  The cache
is always kept ordered by name, and names are unique at any point in
time, but the cache has no long-term meaning, and can be partially
updated at any time.

In particular, the "current directory cache" certainly does not need to
be consistent with the current directory contents, but it has two very
important attributes:

 (a) it can re-generate the full state it caches (not just the directory
     structure: through the "blob" object it can regenerate the data too)

     As a special case, there is a clear and unambiguous one-way mapping
     from a current directory cache to a "tree object", which can be
     efficiently created from just the current directory cache without
     actually looking at any other data.
     So a directory cache at any one time uniquely specifies one and
     only one "tree" object (but has additional data to make it easy to
     match up that tree object with what has happened in the directory)

and

 (b) it has efficient methods for finding inconsistencies between that
     cached state ("tree object waiting to be instantiated") and the
     current state.

Those are the two ONLY things that the directory cache does.  It's a
cache, and the normal operation is to re-generate it completely from a
known tree object, or update/compare it with a live tree that is being
developed.  If you blow the directory cache away entirely, you haven't
lost any information as long as you have the name of the tree that it
described.

(But directory caches can also have real information in them: in
particular, they can have the representation of an intermediate tree
that has not yet been instantiated.  So they do have meaning and usage
outside of caching - in one sense you can think of the current directory
cache as being the "work in progress" towards a tree commit).
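
The two attributes (a) and (b) can be illustrated with a small
in-memory model in Python.  Everything here is an assumption made for
the sake of the example: the real cache is a binary file whose layout
is not described above, and "CacheEntry", "make_cache", "tree_entries"
and "find_changes" are hypothetical names.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class CacheEntry(NamedTuple):
    """One slot of the cache: a name tied to date, permission and blob."""
    name: str
    mode: int    # permissions
    mtime: int   # date, used only to spot changes cheaply
    blob: str    # name (hash) of the blob holding the content


def make_cache(entries: Iterable[CacheEntry]) -> List[CacheEntry]:
    """Keep the cache ordered by name and reject duplicate names."""
    cache = sorted(entries, key=lambda e: e.name)
    names = [e.name for e in cache]
    if len(set(names)) != len(names):
        raise ValueError("names must be unique in the cache")
    return cache


def tree_entries(cache: List[CacheEntry]) -> List[Tuple[int, str, str]]:
    """(a) One-way mapping to tree data: permission/name/blob, sorted by
    name, built from the cache alone.  The dates are simply dropped."""
    return [(e.mode, e.name, e.blob) for e in cache]


def find_changes(cache: List[CacheEntry],
                 current: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """(b) Compare cached state with the current state, given here as a
    mapping name -> (mode, mtime).  No file content is read."""
    changes = {}
    for entry in cache:
        state = current.get(entry.name)
        if state is None:
            changes[entry.name] = "missing"
        elif state != (entry.mode, entry.mtime):
            changes[entry.name] = "changed"
    return changes
```

Two caches that differ only in their dates produce the same tree data,
which is the sense in which the mapping is one-way: the tree can be
derived from the cache, but the cache's extra bookkeeping cannot be
recovered from the tree.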