Partial Clone Design Notes
==========================

The "Partial Clone" feature is a performance optimization for Git that
allows Git to function without having a complete copy of the repository.
The goal of this work is to allow Git to better handle extremely large
repositories.

During clone and fetch operations, Git downloads the complete contents
and history of the repository.  This includes all commits, trees, and
blobs for the complete life of the repository.  For extremely large
repositories, clones can take hours (or days) and consume 100+GiB of disk
space.

Often in these repositories there are many blobs and trees that the user
does not need, such as:

  1. files outside of the user's work area in the tree.  For example, in
     a repository with 500K directories and 3.5M files in every commit,
     we can avoid downloading many objects if the user only needs a
     narrow "cone" of the source tree.

  2. large binary assets.  For example, in a repository where large build
     artifacts are checked into the tree, we can avoid downloading all
     previous versions of these non-mergeable binary assets and only
     download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects *in
advance* during clone and fetch operations and thereby reduce download
times and disk usage.  Missing objects can later be "demand fetched"
if/when needed.

A remote that can later provide the missing objects is called a
promisor remote, as it promises to send the objects when
requested. Initially Git supported only one promisor remote, the origin
remote from which the user cloned and that was configured in the
"extensions.partialClone" config option. Support for more than one
promisor remote was implemented later.

Use of partial clone requires that the user be online and the origin
remote or other promisor remotes be available for on-demand fetching
of missing objects.  This may or may not be problematic for the user.
For example, if the user can stay within the pre-selected subset of
the source tree, they may not encounter any missing objects.
Alternatively, the user could try to pre-fetch various objects if they
know that they are going offline.


Non-Goals
---------

Partial clone is a mechanism to limit the number of blobs and trees downloaded
*within* a given range of commits -- and is therefore independent of and not
intended to conflict with existing DAG-level mechanisms to limit the set of
requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').


Design Overview
---------------

Partial clone logically consists of the following parts:

- A mechanism for the client to describe unneeded or unwanted objects to
  the server.

- A mechanism for the server to omit such unwanted objects from packfiles
  sent to the client.

- A mechanism for the client to gracefully handle missing objects (that
  were previously omitted by the server).

- A mechanism for the client to backfill missing objects as needed.


Design Details
--------------

- A new pack-protocol capability "filter" is added to the fetch-pack and
  upload-pack negotiation.
+
This uses the existing capability discovery mechanism.
See "filter" in Documentation/technical/pack-protocol.txt.

- Clients pass a "filter-spec" to clone and fetch which is passed to the
  server to request filtering during packfile construction.
+
There are various filters available to accommodate different situations.
See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.

- On the server, pack-objects applies the requested filter-spec as it
  creates "filtered" packfiles for the client.
+
These filtered packfiles are *incomplete* in the traditional sense because
they may contain objects that reference objects not contained in the
packfile and that the client doesn't already have.  For example, the
filtered packfile may contain trees or tags that reference missing blobs
or commits that reference missing trees.

- On the client, these incomplete packfiles are marked as "promisor packfiles"
  and treated differently by various commands.

- On the client, a repository extension is added to the local config to
  prevent older versions of Git from failing mid-operation because of
  missing objects that they cannot handle.
  See "extensions.partialClone" in Documentation/technical/repository-version.txt.

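As a rough illustration of why a filtered packfile is "incomplete", the
following Python sketch applies a blob filter to a toy object graph and
lists the references that end up pointing outside the pack.  It is a
model, not Git's implementation: the `Obj` type and the helper names are
hypothetical, only the "blob:none" and "blob:limit=<n>[kmg]" filter-specs
are handled, and the sketch assumes that "blob:limit" omits blobs of at
least the given size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical stand-in for a Git object."""
    oid: str
    kind: str          # "commit", "tree", or "blob"
    size: int = 0      # only meaningful for blobs in this toy model
    refs: tuple = ()   # OIDs this object points to


def parse_blob_filter(spec):
    """Return a predicate telling whether an object is omitted by `spec`."""
    if spec == "blob:none":
        return lambda obj: obj.kind == "blob"
    if spec.startswith("blob:limit="):
        text = spec[len("blob:limit="):].lower()
        units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
        suffix = text[-1:]
        scale = units.get(suffix, 1)
        digits = text[:-1] if suffix in units else text
        limit = int(digits) * scale
        # Assumption: blobs whose size is at least `limit` are omitted.
        return lambda obj: obj.kind == "blob" and obj.size >= limit
    raise ValueError("unsupported filter-spec in this sketch: " + spec)


def build_filtered_pack(objects, spec):
    """Model of the server side: keep every object the filter does not omit."""
    omit = parse_blob_filter(spec)
    return {obj.oid: obj for obj in objects if not omit(obj)}


def dangling_refs(pack):
    """OIDs referenced from inside the pack but not contained in it."""
    return {ref for obj in pack.values() for ref in obj.refs if ref not in pack}


objects = [
    Obj("c1", "commit", refs=("t1",)),
    Obj("t1", "tree", refs=("b1", "b2")),
    Obj("b1", "blob", size=10),
    Obj("b2", "blob", size=5 * 1024 * 1024),
]

small_only = build_filtered_pack(objects, "blob:limit=1m")
no_blobs = build_filtered_pack(objects, "blob:none")
print(sorted(dangling_refs(small_only)))  # the large blob is referenced but absent
print(sorted(dangling_refs(no_blobs)))    # both blobs are referenced but absent
```

The dangling references are exactly the objects a client would later have
to obtain from a promisor remote.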

Handling Missing Objects
------------------------

- An object may be missing due to a partial clone or fetch, or missing
  due to repository corruption.  To differentiate these cases, the
  local repository specially indicates such filtered packfiles
  obtained from promisor remotes as "promisor packfiles".
+
These promisor packfiles consist of a "<name>.promisor" file with
arbitrary contents (like the "<name>.keep" files), in addition to
their "<name>.pack" and "<name>.idx" files.

- The local repository considers a "promisor object" to be an object that
  it knows (to the best of its ability) that promisor remotes have promised
  that they have, either because the local repository has that object in one of
  its promisor packfiles, or because another promisor object refers to it.
+
When Git encounters a missing object, Git can see if it is a promisor object
and handle it appropriately.  If not, Git can report a corruption.
+
This means that there is no need for the client to explicitly maintain an
expensive-to-modify list of missing objects.[a]

- Since almost all Git code currently expects any referenced object to be
  present locally and because we do not want to force every command to do
  a dry-run first, a fallback mechanism is added to allow Git to attempt
  to dynamically fetch missing objects from promisor remotes.
+
When the normal object lookup fails to find an object, Git invokes
promisor_remote_get_direct() to try to get the object from a promisor
remote and then retry the object lookup.  This allows objects to be
"faulted in" without complicated prediction algorithms.
+
For efficiency reasons, no check as to whether the missing object is
actually a promisor object is performed.
+
Dynamic object fetching tends to be slow as objects are fetched one at
a time.

- `checkout` (and any other command using `unpack-trees`) has been taught
  to bulk pre-fetch all required missing blobs in a single batch.

- `rev-list` has been taught to print missing objects.
+
This can be used by other commands to bulk prefetch objects.
For example, a "git log -p A..B" may internally want to first do
something like "git rev-list --objects --quiet --missing=print A..B"
and prefetch those objects in bulk.

- `fsck` has been updated to be fully aware of promisor objects.

- `repack` in GC has been updated to not touch promisor packfiles at all,
  and to only repack other objects.

- The global variable "fetch_if_missing" is used to control whether an
  object lookup will attempt to dynamically fetch a missing object or
  report an error.
+
We are not happy with this global variable and would like to remove it,
but that requires significant refactoring of the object code to pass an
additional flag.

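The promisor-object rule and the lookup fallback described above can be
modelled in a few lines of Python.  This is a sketch under simplifying
assumptions, not Git's code: `ToyRepo` and its methods are hypothetical,
a "packfile" is just a dict from an OID to the OIDs it references, and
the promisor remote is a dict standing in for a real fetch.  As in the
text, the lookup does not check whether an object is a promisor object
before trying to fetch it.

```python
class ToyRepo:
    """Toy object store; each 'pack' maps an OID to the OIDs it references."""

    def __init__(self, promisor_packs, other_objects, remote):
        self.promisor_packs = promisor_packs  # list of {oid: (ref, ...)}
        self.other_objects = other_objects    # {oid: (ref, ...)}, created locally
        self.remote = remote                  # stand-in promisor remote
        self.fetch_if_missing = True          # models the flag described above

    def _find(self, oid):
        """Normal object lookup: return the object's refs, or None if absent."""
        for pack in self.promisor_packs:
            if oid in pack:
                return pack[oid]
        return self.other_objects.get(oid)

    def promisor_objects(self):
        """Objects in promisor packs plus everything those objects refer to."""
        known = set()
        for pack in self.promisor_packs:
            for oid, refs in pack.items():
                known.add(oid)
                known.update(refs)
        return known

    def classify(self, oid):
        """Distinguish a promised-but-missing object from corruption."""
        if self._find(oid) is not None:
            return "present"
        if oid in self.promisor_objects():
            return "promised"
        return "corrupt"

    def lookup(self, oid):
        """Look up an object, faulting it in from the remote when allowed."""
        refs = self._find(oid)
        if refs is None and self.fetch_if_missing:
            if oid in self.remote:
                # The fetched object arrives in a new promisor "pack".
                self.promisor_packs.append({oid: self.remote[oid]})
            refs = self._find(oid)  # retry the normal lookup
        if refs is None:
            raise KeyError(oid)
        return refs


repo = ToyRepo(
    promisor_packs=[{"c1": ("t1",), "t1": ("b1",)}],
    other_objects={"c2": ("t1",)},
    remote={"b1": ()},
)
before = repo.classify("b1")     # missing, but referenced by a promisor object
unknown = repo.classify("ffff")  # missing and referenced by nothing known
repo.lookup("b1")                # faults the blob in
after = repo.classify("b1")
print(before, unknown, after)
```

Because the "promised" set is derived from the promisor packs themselves,
nothing has to be rewritten when an object is faulted in; the new pack
simply changes the answer.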

Fetching Missing Objects
------------------------

- Fetching of objects is done using the existing transport mechanism using
  transport_fetch_refs(), setting a new transport option
  TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
  desired, not any object that they refer to.
+
Because some transports invoke fetch_pack() in the same process, fetch_pack()
has been updated to not use any object flags when the corresponding argument
(no_dependents) is set.

- The local repository sends a request with the hashes of all requested
  objects as "want" lines, and does not perform any packfile negotiation.
  It then receives a packfile.

- Because we are reusing the existing fetch-pack mechanism, fetching
  currently fetches all objects referred to by the requested objects, even
  though they are not necessary.

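To make the "want lines without negotiation" request concrete, here is a
simplified Python sketch that frames such a request in pkt-line format.
The helper names are hypothetical, and capability handling is reduced to
an optional list appended to the first "want" line; a real client also
has to read the server's ref advertisement first, which is omitted here.

```python
FLUSH_PKT = b"0000"
MAX_PKT_LEN = 65520  # total pkt-line length, including the 4-byte header


def pkt_line(payload: bytes) -> bytes:
    """Prefix `payload` with its length as four hex digits (header included)."""
    length = len(payload) + 4
    if length > MAX_PKT_LEN:
        raise ValueError("payload too long for a single pkt-line")
    return b"%04x" % length + payload


def build_want_request(oids, capabilities=()):
    """Frame a fetch request that lists wants and sends no "have" lines."""
    if not oids:
        raise ValueError("at least one object must be requested")
    lines = []
    for index, oid in enumerate(oids):
        line = "want " + oid
        if index == 0 and capabilities:
            # Capabilities ride on the first want line only.
            line += " " + " ".join(capabilities)
        lines.append(pkt_line((line + "\n").encode("ascii")))
    # No negotiation: flush the want list, then say "done" right away.
    return b"".join(lines) + FLUSH_PKT + pkt_line(b"done\n")


request = build_want_request(["a" * 40, "b" * 40])
print(request.decode("ascii"))
```

Skipping the "have" lines is what makes this a direct request for named
objects rather than a negotiation about common history.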

Using many promisor remotes
---------------------------

Many promisor remotes can be configured and used.

This allows, for example, a user to have multiple geographically-close
cache servers for fetching missing blobs while continuing to do
filtered `git-fetch` commands from the central server.

When fetching objects, promisor remotes are tried one after the other
until all the objects have been fetched.

Remotes that are considered "promisor" remotes are those specified by
the following configuration variables:

- `extensions.partialClone = <name>`

- `remote.<name>.promisor = true`

- `remote.<name>.partialCloneFilter = ...`

Only one promisor remote can be configured using the
`extensions.partialClone` config variable. This promisor remote will
be the last one tried when fetching objects.

We decided to make it the last one we try, because it is likely that
someone using many promisor remotes is doing so because the other
promisor remotes are better for some reason (maybe they are closer or
faster for some kind of objects) than the origin, and the origin is
likely to be the remote specified by extensions.partialClone.

This justification is not very strong, but one choice had to be made,
and anyway the long term plan should be to make the order somehow
fully configurable.

For now, though, the other promisor remotes will be tried in the order
they appear in the config file.

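The ordering rule above can be sketched as follows.  This is a model in
Python, not Git's config machinery: the function names, the remote names,
and the representation of the config as a list of `(name, settings)`
pairs in file order are all assumptions made for illustration.

```python
def promisor_remote_order(remote_configs, extensions_partial_clone=None):
    """Return promisor remote names in the order they would be tried.

    `remote_configs` lists (name, settings) pairs in config-file order.
    """
    ordered = []
    for name, settings in remote_configs:
        is_promisor = (
            settings.get("promisor") is True
            or "partialCloneFilter" in settings
            or name == extensions_partial_clone
        )
        if is_promisor and name not in ordered:
            ordered.append(name)
    if extensions_partial_clone is not None:
        # The remote named by extensions.partialClone is always tried last.
        if extensions_partial_clone in ordered:
            ordered.remove(extensions_partial_clone)
        ordered.append(extensions_partial_clone)
    return ordered


def fetch_from_promisors(oids, order, servers):
    """Try each remote in turn; return the OIDs that nobody could provide."""
    remaining = set(oids)
    for name in order:
        if not remaining:
            break
        remaining -= servers.get(name, set())
    return remaining


configs = [
    ("origin", {"partialCloneFilter": "blob:none"}),
    ("cache-eu", {"promisor": True}),
    ("mirror", {}),  # an ordinary remote, not a promisor
    ("cache-us", {"promisor": True}),
]
order = promisor_remote_order(configs, extensions_partial_clone="origin")
print(order)

servers = {"cache-eu": {"b1"}, "cache-us": {"b2"}, "origin": {"b1", "b2", "b3"}}
print(fetch_from_promisors({"b1", "b2", "b3"}, order, servers))
```

In this example the two cache remotes serve what they can, and only the
object neither of them has is left for the origin.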

Current Limitations
-------------------

- It is not possible to specify the order in which the promisor
  remotes are tried in other ways than the order in which they appear
  in the config file.
+
It is also not possible to specify an order to be used when fetching
from one remote and a different order when fetching from another
remote.

- It is not possible to push only specific objects to a promisor
  remote.
+
It is not possible to push at the same time to multiple promisor
remotes in a specific order.

- Dynamic object fetching will only ask promisor remotes for missing
  objects.  We assume that promisor remotes have a complete view of the
  repository and can satisfy all such requests.

- Repack essentially treats promisor and non-promisor packfiles as two
  distinct partitions and does not mix them.  Repack currently only works
  on non-promisor packfiles and loose objects.

- Dynamic object fetching invokes fetch-pack once *for each item*
  because most algorithms stumble upon a missing object and need to have
  it resolved before continuing their work.  This may incur significant
  overhead -- and multiple authentication requests -- if many objects are
  needed.

- Dynamic object fetching currently uses the existing pack protocol V0
  which means that each object is requested via fetch-pack.  The server
  will send a full set of info/refs when the connection is established.
  If there is a large number of refs, this may incur significant overhead.

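The cost of fetching "once for each item" is easy to see in a toy model
that counts requests.  The Python sketch below is illustrative only:
`CountingRemote` and both fetch strategies are hypothetical, and one
"request" stands for one fetch-pack invocation with its connection setup,
authentication, and ref advertisement.

```python
class CountingRemote:
    """Stand-in promisor remote that counts how often it is contacted."""

    def __init__(self, objects):
        self.objects = objects
        self.requests = 0

    def fetch(self, oids):
        self.requests += 1
        return {oid: self.objects[oid] for oid in oids}


def fault_in_one_by_one(remote, local, needed):
    """Resolve each missing object at the moment it is stumbled upon."""
    for oid in needed:
        if oid not in local:
            local.update(remote.fetch([oid]))


def prefetch_in_bulk(remote, local, needed):
    """Enumerate the missing objects first, then fetch them in one request."""
    missing = [oid for oid in needed if oid not in local]
    if missing:
        local.update(remote.fetch(missing))


contents = {"b%d" % i: b"data" for i in range(5)}
needed = sorted(contents)

slow_remote = CountingRemote(contents)
slow_local = {"b0": b"data"}
fault_in_one_by_one(slow_remote, slow_local, needed)

fast_remote = CountingRemote(contents)
fast_local = {"b0": b"data"}
prefetch_in_bulk(fast_remote, fast_local, needed)

print(slow_remote.requests, fast_remote.requests)
```

Both strategies end with the same local objects; the bulk variant is the
pattern that the batch pre-fetch in `checkout` follows.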

Future Work
-----------

- Improve the way to specify the order in which promisor remotes are
  tried.
+
For example, this could allow explicitly specifying something like:
"When fetching from this remote, I want to use these promisor remotes
in this order, though, when pushing or fetching to that remote, I want
to use those promisor remotes in that order."

- Allow pushing to promisor remotes.
+
The user might want to work in a triangular workflow with multiple
promisor remotes that each have an incomplete view of the repository.

- Allow repack to work on promisor packfiles (while keeping them distinct
  from non-promisor packfiles).

- Allow non-pathname-based filters to make use of packfile bitmaps (when
  present).  This was just an omission during the initial implementation.

- Investigate use of a long-running process to dynamically fetch a series
  of objects, such as proposed in [5,6] to reduce process startup and
  overhead costs.
+
It would be nice if pack protocol V2 could allow that long-running
process to make a series of requests over a single long-running
connection.

- Investigate pack protocol V2 to avoid the info/refs broadcast on
  each connection with the server to dynamically fetch missing objects.

- Investigate the need to handle loose promisor objects.
+
Objects in promisor packfiles are allowed to reference missing objects
that can be dynamically fetched from the server.  An assumption was
made that loose objects are only created locally and therefore should
not reference a missing object.  We may need to revisit that assumption
if, for example, we dynamically fetch a missing tree and store it as a
loose object rather than a single object packfile.
+
This does not necessarily mean we need to mark loose objects as promisor;
it may be sufficient to relax the object lookup or is-promisor functions.


Non-Tasks
---------

- Every time the subject of "demand loading blobs" comes up it seems
  that someone suggests that the server be allowed to "guess" and send
  additional objects that may be related to the requested objects.
+
No work has gone into actually doing that; we're just documenting that
it is a common suggestion.  We're not sure how it would work and have
no plans to work on it.
+
It is valid for the server to send more objects than requested (even
for a dynamic object fetch), but we are not building on that.


Footnotes
---------

[a] expensive-to-modify list of missing objects:  Earlier in the design of
    partial clone we discussed the need for a single list of missing objects.
    This would essentially be a sorted linear list of OIDs that were
    omitted by the server during a clone or subsequent fetches.

This file would need to be loaded into memory on every object lookup.
It would need to be read, updated, and re-written (like the .git/index)
on every explicit "git fetch" command *and* on any dynamic object fetch.

The cost to read, update, and write this file could add significant
overhead to every command if there are many missing objects.  For example,
if there are 100M missing blobs, this file would be at least 2GB on disk
(100M OIDs at 20 bytes each).

With the "promisor" concept, we *infer* a missing object based upon the
type of packfile that references it.

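The size estimate in the footnote is simple arithmetic, shown here as a
short Python check.  It assumes binary 20-byte SHA-1 OIDs with no
per-entry overhead, which is the most compact encoding such a list could
use; the function name is hypothetical.

```python
SHA1_OID_BYTES = 20  # assumption: raw SHA-1 object IDs, no framing


def missing_list_size_bytes(missing_objects, oid_bytes=SHA1_OID_BYTES):
    """Lower bound on the size of a flat list of missing OIDs."""
    return missing_objects * oid_bytes


size = missing_list_size_bytes(100_000_000)
print(size)                        # bytes
print(round(size / 10 ** 9, 2))    # decimal gigabytes
print(round(size / 2 ** 30, 2))    # binary gibibytes
```

Storing the OIDs as 40-character hex strings instead would roughly double
this figure.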

Related Links
-------------

[0] https://crbug.com/git/2
    Bug#2: Partial Clone

[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/ +
    Subject: [RFC] Add support for downloading blobs on demand +
    Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/ +
    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) +
    Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/ +
    Subject: Proposal for missing blob support in Git repos +
    Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ +
    Subject: [PATCH 00/10] RFC Partial Clone and Fetch +
    Date: Wed,  8 Mar 2017 18:50:29 +0000

[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/ +
    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module +
    Date: Fri,  5 May 2017 11:27:52 -0400

[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
    Date: Fri, 14 Jul 2017 09:26:50 -0400
 375    Date: Fri, 14 Jul 2017 09:26:50 -0400