We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().
Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.
In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.
Move the config load to be between the initialization of 'branch' and
the commit lookup.
Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.
Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
commit: use generation number in remove_redundant()
The static remove_redundant() method is used to filter a list
of commits by removing those that are reachable from another
commit in the list. This is used to remove all possible merge-
bases except a maximal, mutually independent set.
To determine these commits are independent, we use a number of
paint_down_to_common() walks and use the PARENT1, PARENT2 flags
to determine reachability. Since we only care about reachability
and not the full set of merge-bases between 'one' and 'twos', we
can use the 'min_generation' parameter to short-circuit the walk.
When no commit-graph exists, there is no change in behavior.
For a copy of the Linux repository, we measured the following
performance improvements:
git merge-base v3.3 v4.5
Before: 234 ms
After: 208 ms
Rel %: -11%
git merge-base v4.3 v4.5
Before: 102 ms
After: 83 ms
Rel %: -19%
The experiments above were chosen to demonstrate that we are
improving the filtering of the merge-base set. In the first
example, more time is spent walking the history to find the
set of merge bases before the remove_redundant() call. The
starting commits are closer together in the second example,
therefore more time is spent in remove_redundant(). The relative
change in performance differs as expected.
Reported-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
commit: add short-circuit to paint_down_to_common()
When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().
Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.
For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.
For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:
Before: 0.21s
After: 0.13s
Rel %: -38%
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
commit: use generation numbers for in_merge_bases()
The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.
When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the initial commits.
Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:
Before: 60.0s
After: 0.4s
Rel %: -99.3%
Reported-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows significantly short-circuiting commit-graph walks.
Use generation number for '--contains' type queries.
On a copy of the Linux repository where HEAD is contained in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:
Before: 0.81s
After: 0.04s
Rel %: -95%
Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
commit-graph: always load commit-graph information
Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().
With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.
Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).
Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.
This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.
The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.
If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The generation number of a commit is defined recursively as follows:
* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
more than the maximum generation number among the parents of A.
Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:
The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
ref-filter: fix outdated comment on in_commit_list
The in_commit_list() method does not check the parents of
the candidate for containment in the list. Fix the comment
that incorrectly states that it does.
Reported-by: Jakub Narebski <jnareb@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
coccinelle: avoid wrong transformation suggestions from commit.cocci
The semantic patch 'contrib/coccinelle/commit.cocci' added in 2e27bd7731 (treewide: replace maybe_tree with accessor methods,
2018-04-06) is supposed to "ensure that all references to the
'maybe_tree' member of struct commit are either mutations or accesses
through get_commit_tree()". So get_commit_tree() clearly must be able
to directly access the 'maybe_tree' member, and 'commit.cocci' has a
bit of a roundabout workaround to ensure that get_commit_tree()'s
direct access in its return statement is not transformed: after all
references to 'maybe_tree' have been transformed to a call to
get_commit_tree(), including the reference in get_commit_tree()
itself, the last rule transforms back a 'return get_commit_tree()'
statement, back then found only in get_commit_tree() itself, to a
direct access.
Unfortunately, already the very next commit shows that this workaround
is insufficient: 7b8a21dba1 (commit-graph: lazy-load trees for
commits, 2018-04-06) extends get_commit_tree() with a condition
directly accessing the 'maybe_tree' member, and Coccinelle with
'commit.cocci' promptly detects it and suggests a transformation to
avoid it. This transformation is clearly wrong, because calling
get_commit_tree() to access 'maybe_tree' _in_ get_commit_tree() would
obviously lead to recursion. Furthermore, the same commit added
another, more specialized getter function get_commit_tree_in_graph(),
whose legitimate direct access to 'maybe_tree' triggers a similar
wrong transformation suggestion.
Exclude both of these getter functions from the general rule in
'commit.cocci' that matches their direct accesses to 'maybe_tree'.
Also exclude load_tree_for_commit(), which, as static helper funcion
of get_commit_tree_in_graph(), has legitimate direct access to
'maybe_tree' as well.
The last rule transforming back 'return get_commit_tree()' statements
to direct accesses thus became unnecessary, remove it.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit-graph file provides quick access to commit data, including
the OID of the root tree for each commit in the graph. When performing
a deep commit-graph walk, we may not need to load most of the trees
for these commits.
Delay loading the tree object for a commit loaded from the graph
until requested via get_commit_tree(). Do not lazy-load trees for
commits not in the graph, since that requires duplicate parsing
and the relative peformance improvement when trees are not needed
is small.
On the Linux repository, performance tests were run for the following
command:
git log --graph --oneline -1000
Before: 0.92s
After: 0.66s
Rel %: -28.3%
Adding '-- kernel/' to the command requires loading the root tree
for every commit that is walked. There was no measureable performance
change as a result of this patch.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
treewide: replace maybe_tree with accessor methods
In anticipation of making trees load lazily, create a Coccinelle
script (contrib/coccinelle/commit.cocci) to ensure that all
references to the 'maybe_tree' member of struct commit are either
mutations or accesses through get_commit_tree() or
get_commit_tree_oid().
Apply the Coccinelle script to create the rest of the patch.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
While walking the commit graph, we load struct commit objects into
the object cache. During this process, we also load struct tree
objects for the root tree of each of these commits. We load these
objects even if we are only computing commit reachability information,
such as a merge base or ahead/behind information.
Create get_commit_tree() as a first step to removing direct
references to the 'maybe_tree' member of struct commit.
Create get_commit_tree_oid() as a shortcut for several references
to "&commit->maybe_tree->object.oid" in the codebase.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Using the commit-graph file to walk commit history removes the large
cost of parsing commits during the walk. This exposes a performance
issue: lookup_tree() takes a large portion of the computation time,
even when Git never uses those trees.
In anticipation of lazy-loading these trees, rename the 'tree' member
of struct commit to 'maybe_tree'. This serves two purposes: it hints
at the future role of possibly being NULL even if the commit has a
valid tree, and it allows for unambiguous transformation from simple
member access (i.e. commit->maybe_tree) to method access.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teach git-commit-graph to add all commits from the existing
commit-graph file to the file about to be written. This should be
used when adding new commits without performing garbage collection.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teach git-commit-graph to read commits from stdin when the
--stdin-commits flag is specified. Commits reachable from these
commits are added to the graph. This is a much faster way to construct
the graph than inspecting all packed objects, but is restricted to
known tips.
For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
commit-graph: read only from specific pack-indexes
Teach git-commit-graph to inspect the objects only in a certain list
of pack-indexes within the given pack directory. This allows updating
the commit graph iteratively.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
commit: integrate commit graph with commit parsing
Teach Git to inspect a commit graph file to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date.
If core.commitGraph is false, then do not check graph files.
In test script t5318-commit-graph.sh, add output-matching conditions on
read-only graph operations.
By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 678,653
reachable commits and is behind 'origin/master' by 59,929 commits.
Teach write_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps given by loose objects or
previously-missed packfiles.
Also automatically add commits from the existing graph file, if it
exists.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teach git the 'commit-graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for an '--object-dir' option.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add document specifying the binary format for commit graphs. This
format allows for:
* New versions.
* New hash functions and hash lengths.
* Optional extensions.
Basic header information is followed by a binary table of contents
into "chunks" that include:
* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.
The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
If we want to use a hashfile on the temporary file for a lockfile, then
we need finalize_hashfile() to fully write the trailing hash but also keep
the file descriptor open.
Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional
change that checks this flag before writing the checksum to the stream.
This differs from previous behavior since it would be written if either
CSUM_CLOSE or CSUM_FSYNC is provided.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This was an undocumented debugging aid that does not seem to
have come in handy in the past decade, judging from its lack
of mentions on the mailing list.
Let's drop it in the name of simplicity. This is morally a
revert of 3131b71301 (Add "--show-all" revision walker flag
for debugging, 2008-02-09), but note that I did leave in the
mapping of UNINTERESTING to "^" in get_revision_mark(). I
don't think this would be possible to trigger with the
current code, but it's the only sensible marker.
We'll skip the usual deprecation period because this was
explicitly a debugging aid that was never documented.
Signed-off-by: Jeff King <peff@peff.net> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "--show-all" revision option shows UNINTERESTING
commits. Some of these commits may be unparsed when we try
to show them (since we may or may not need to walk their
parents to fulfill the request).
Commit 3131b71301 (Add "--show-all" revision walker flag for
debugging, 2008-02-09) resolved this by just skipping
pretty-printing for commits without their object contents
cached, saying:
Because we now end up listing commits we may not even have been parsed
at all "show_log" and "show_commit" need to protect against commits
that don't have a commit buffer entry.
That was the easy fix to avoid the pretty-printer segfaulting,
but:
1. It doesn't work for all formats. E.g., --oneline
prints the oid for each such commit but not a trailing
newline, leading to jumbled output.
2. It only affects some commits, depending on whether we
happened to parse them or not (so if they were at the
tip of an UNINTERESTING starting point, or if we
happened to traverse over them, you'd see more data).
3. It unncessarily ties the decision to show the verbose
header to whether the commit buffer was cached. That
makes it harder to change the logic around caching
(e.g., if we could traverse without actually loading
the full commit objects).
These days it's safe to feed such a commit to the
pretty-print code. Since be5c9fb904 (logmsg_reencode: lazily
load missing commit buffers, 2013-01-26), we'll load it on
demand in such a case. So let's just always show the verbose
headers.
This does change the behavior of plumbing, but:
a. The --show-all option was explicitly introduced as a
debugging aid, and was never documented (and has rarely
even been mentioned on the list by git devs).
b. Avoiding the commits was already not deterministic due
to (2) above. So the caller might have seen full
headers for these commits anyway, and would need to be
prepared for it.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Merge branch 'jk/abort-clone-with-existing-dest' into maint
"git clone $there $here" is allowed even when here directory exists
as long as it is an empty directory, but the command incorrectly
removed it upon a failure of the operation.
* jk/abort-clone-with-existing-dest:
clone: do not clean up directories we didn't create
clone: factor out dir_exists() helper
t5600: modernize style
t5600: fix outdated comment about unborn HEAD
* rs/lose-leak-pending:
commit: remove unused function clear_commit_marks_for_object_array()
revision: remove the unused flag leak_pending
checkout: avoid using the rev_info flag leak_pending
bundle: avoid using the rev_info flag leak_pending
bisect: avoid using the rev_info flag leak_pending
object: add clear_commit_marks_all()
ref-filter: use clear_commit_marks_many() in do_merge_filter()
commit: use clear_commit_marks_many() in remove_redundant()
commit: avoid allocation in clear_commit_marks_many()
Merge branch 'jm/svn-pushmergeinfo-fix' into maint
"git svn dcommit" did not take into account the fact that a
svn+ssh:// URL with a username@ (typically used for pushing) refers
to the same SVN repository without the username@ and failed when
svn.pushmergeinfo option is set.
* jm/svn-pushmergeinfo-fix:
git-svn: fix svn.pushmergeinfo handling of svn+ssh usernames.
More abstraction of hash function from the codepath.
* bc/hash-algo:
hash: update obsolete reference to SHA1_HEADER
bulk-checkin: abstract SHA-1 usage
csum-file: abstract uses of SHA-1
csum-file: rename sha1file to hashfile
read-cache: abstract away uses of SHA-1
pack-write: switch various SHA-1 values to abstract forms
pack-check: convert various uses of SHA-1 to abstract forms
fast-import: switch various uses of SHA-1 to the_hash_algo
sha1_file: switch uses of SHA-1 to the_hash_algo
builtin/unpack-objects: switch uses of SHA-1 to the_hash_algo
builtin/index-pack: improve hash function abstraction
hash: create union for hash context allocation
hash: move SHA-1 macros to hash.h
The way "git reset --hard" reports the commit the updated HEAD
points at is made consistent with the way how the commit title is
generated by the other parts of the system. This matters when the
title is spread across physically multiple lines.
* tg/reset-hard-show-head-with-pretty:
reset --hard: make use of the pretty machinery
* ab/wildmatch-tests:
wildmatch test: mark test as EXPENSIVE_ON_WINDOWS
test-lib: add an EXPENSIVE_ON_WINDOWS prerequisite
wildmatch test: create & test files on disk in addition to in-memory
wildmatch test: perform all tests under all wildmatch() modes
wildmatch test: use test_must_fail, not ! for test-wildmatch
wildmatch test: remove dead fnmatch() test code
wildmatch test: use a paranoia pattern from nul_match()
wildmatch test: don't try to vertically align our output
wildmatch test: use more standard shell style
wildmatch test: indent with tabs, not spaces
Avoid mmapping small files while using packed refs (especially ones
with zero size, which would cause later munmap() to fail).
* kg/packed-ref-cache-fix:
packed_ref_cache: don't use mmap() for small files
load_contents(): don't try to mmap an empty file
packed_ref_iterator_begin(): make optimization more general
find_reference_location(): make function safe for empty snapshots
create_snapshot(): use `xmemdupz()` rather than a strbuf
struct snapshot: store `start` rather than `header_len`
* en/merge-recursive-fixes:
merge-recursive: add explanation for src_entry and dst_entry
merge-recursive: fix logic ordering issue
Tighten and correct a few testcases for merging and cherry-picking
Push the submodule version of collision-detecting SHA-1 hash
implementation a bit harder on builders.
* ab/sha1dc-build:
sha1dc_git.h: re-arrange an ifdef chain for a subsequent change
Makefile: under "make dist", include the sha1collisiondetection submodule
Makefile: don't error out under DC_SHA1_EXTERNAL if DC_SHA1_SUBMODULE=auto
Subsequent patches will introduce file formats that make use of a fanout
array and a sorted table containing hashes, just like packfiles.
Refactor the hash search in packfile.c into its own function, so that
those patches can make use of it as well.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In commit 628522ec1439 ("sha1-lookup: more memory efficient search in
sorted list of SHA-1", 2008-04-09), a different algorithm for searching
a sorted list was introduced, together with a set of log statements
guarded by GIT_DEBUG_LOOKUP that are invoked both when using that
algorithm and when using the existing binary search. Those log
statements was meant for experiments and debugging, but with the removal
of the aforementioned different algorithm in commit f1068efefe6d
("sha1_file: drop experimental GIT_USE_LOOKUP search", 2017-08-09),
those log statements are probably no longer necessary.
Remove those statements.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>