fsck: lazily load types under --connectivity-only
authorJeff King <peff@peff.net>
Thu, 26 Jan 2017 04:12:07 +0000 (23:12 -0500)
committerJunio C Hamano <gitster@pobox.com>
Thu, 26 Jan 2017 18:51:09 +0000 (10:51 -0800)
The recent fixes to "fsck --connectivity-only" load all of
the objects with their correct types. This keeps the
connectivity-only code path close to the regular one, but it
also introduces some unnecessary inefficiency. While getting
the type of an object is cheap compared to actually opening
and parsing the object (as the non-connectivity-only case
would do), it's still not free.

For reachable non-blob objects, we end up having to parse
them later anyway (to see what they point to), making our
type lookup here redundant.

For unreachable objects, we might never hit them at all in
the reachability traversal, making the lookup completely
wasted. And in some cases, we might have quite a few
unreachable objects (e.g., when alternates are used for
shared object storage between repositories, it's normal for
there to be objects reachable from other repositories but
not the one running fsck).

The comment in mark_object_for_connectivity() claims two
benefits to getting the type up front:

1. We need to know the types during fsck_walk(). (And not
explicitly mentioned, but we also need them when
printing the types of broken or dangling commits).

We can address this by lazy-loading the types as
necessary. Most objects never need this lazy-load at
all, because they fall into one of these categories:

a. Reachable from our tips, and are coerced into the
correct type as we traverse (e.g., a parent link
will call lookup_commit(), which converts OBJ_NONE
to OBJ_COMMIT).

b. Unreachable, but not at the tip of a chunk of
unreachable history. We only mention the tips as
"dangling", so an unreachable commit which links
to hundreds of other objects needs only report the
type of the tip commit.

2. It serves as a cross-check that the coercion in (1a) is
correct (i.e., we'll complain about a parent link that
points to a blob). But we get most of this for free
already, because right after coercing, we'll parse any
non-blob objects. So we'd notice then if we expected a
commit and got a blob.

The one exception is when we expect a blob, in which
case we never actually read the object contents.

So this is a slight weakening, but given that the whole
point of --connectivity-only is to sacrifice some data
integrity checks for speed, this seems like an
acceptable tradeoff.

Here are before and after timings for an extreme case with
~5M reachable objects and another ~12M unreachable (it's the
torvalds/linux repository on GitHub, connected to shared
storage for all of the other kernel forks):

[before]
$ time git fsck --no-dangling --connectivity-only
real 3m4.323s
user 1m25.121s
sys 1m38.710s

[after]
$ time git fsck --no-dangling --connectivity-only
real 0m51.497s
user 0m49.575s
sys 0m1.776s

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
builtin/fsck.c
fsck.c
index 3d5ced2d3ac2e9d2fdfacd498a7592902d96ebdb..140357b6fc9deba63e9aaf64eeae935f942322be 100644 (file)
@@ -60,6 +60,12 @@ static const char *printable_type(struct object *obj)
 {
        const char *ret;
 
+       if (obj->type == OBJ_NONE) {
+               enum object_type type = sha1_object_info(obj->oid.hash, NULL);
+               if (type > 0)
+                       object_as_type(obj, type, 0);
+       }
+
        ret = typename(obj->type);
        if (!ret)
                ret = "unknown";
@@ -595,57 +601,7 @@ static int fsck_cache_tree(struct cache_tree *it)
 
 static void mark_object_for_connectivity(const unsigned char *sha1)
 {
-       struct object *obj = lookup_object(sha1);
-
-       /*
-        * Setting the object type here isn't strictly necessary for a
-        * connectivity check. In most cases, our walk will expect a certain
-        * type (e.g., a tree referencing a blob) and will use lookup_blob() to
-        * assign the type. But doing it here has two advantages:
-        *
-        *   1. When the fsck_walk code looks at objects that _don't_ come from
-        *      links (e.g., the tip of a ref), it may complain about the
-        *      "unknown object type".
-        *
-        *   2. This serves as a nice cross-check that the graph links are
-        *      sane. So --connectivity-only does not check that the bits of
-        *      blobs are not corrupted, but it _does_ check that 100644 tree
-        *      entries point to blobs, and so forth.
-        *
-        * Unfortunately we can't just use parse_object() here, because the
-        * whole point of --connectivity-only is to avoid reading the object
-        * data more than necessary.
-        */
-       if (!obj || obj->type == OBJ_NONE) {
-               enum object_type type = sha1_object_info(sha1, NULL);
-               switch (type) {
-               case OBJ_BAD:
-                       error("%s: unable to read object type",
-                             sha1_to_hex(sha1));
-                       break;
-               case OBJ_COMMIT:
-                       obj = (struct object *)lookup_commit(sha1);
-                       break;
-               case OBJ_TREE:
-                       obj = (struct object *)lookup_tree(sha1);
-                       break;
-               case OBJ_BLOB:
-                       obj = (struct object *)lookup_blob(sha1);
-                       break;
-               case OBJ_TAG:
-                       obj = (struct object *)lookup_tag(sha1);
-                       break;
-               default:
-                       error("%s: unknown object type %d",
-                             sha1_to_hex(sha1), type);
-               }
-
-               if (!obj || obj->type == OBJ_NONE) {
-                       errors_found |= ERROR_OBJECT;
-                       return;
-               }
-       }
-
+       struct object *obj = lookup_unknown_object(sha1);
        obj->flags |= HAS_OBJ;
 }
 
diff --git a/fsck.c b/fsck.c
index 4a3069e204ea555bfb6d1ce05a0163860f852437..939792752bf39c72cc536f00c21e2af8ed854c3e 100644 (file)
--- a/fsck.c
+++ b/fsck.c
@@ -458,6 +458,10 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
 {
        if (!obj)
                return -1;
+
+       if (obj->type == OBJ_NONE)
+               parse_object(obj->oid.hash);
+
        switch (obj->type) {
        case OBJ_BLOB:
                return 0;