Commit e6b971f

Merge branch 'tb/reverse-midx'
An on-disk reverse-index to map the in-pack location of an object
back to its object name across multiple packfiles is introduced.

* tb/reverse-midx:
  midx.c: improve cache locality in midx_pack_order_cmp()
  pack-revindex: write multi-pack reverse indexes
  pack-write.c: extract 'write_rev_file_order'
  pack-revindex: read multi-pack reverse indexes
  Documentation/technical: describe multi-pack reverse indexes
  midx: make some functions non-static
  midx: keep track of the checksum
  midx: don't free midx_name early
  midx: allow marking a pack as preferred
  t/helper/test-read-midx.c: add '--show-objects'
  builtin/multi-pack-index.c: display usage on unrecognized command
  builtin/multi-pack-index.c: don't enter bogus cmd_mode
  builtin/multi-pack-index.c: split sub-commands
  builtin/multi-pack-index.c: define common usage with a macro
  builtin/multi-pack-index.c: don't handle 'progress' separately
  builtin/multi-pack-index.c: inline 'flags' with options

2 parents a0dda60 + 3007752 commit e6b971f

14 files changed: +733 -68 lines changed

Documentation/git-multi-pack-index.txt

Lines changed: 12 additions & 2 deletions
@@ -9,7 +9,8 @@ git-multi-pack-index - Write and verify multi-pack-indexes
 SYNOPSIS
 --------
 [verse]
-'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress] <subcommand>
+'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress]
+	[--preferred-pack=<pack>] <subcommand>
 
 DESCRIPTION
 -----------
@@ -30,7 +31,16 @@ OPTIONS
 The following subcommands are available:
 
 write::
-	Write a new MIDX file.
+	Write a new MIDX file. The following options are available for
+	the `write` sub-command:
++
+--
+	--preferred-pack=<pack>::
+		Optionally specify the tie-breaking pack used when
+		multiple packs contain the same object. If not given,
+		ties are broken in favor of the pack with the lowest
+		mtime.
+--
 
 verify::
 	Verify the contents of the MIDX file.

Documentation/technical/multi-pack-index.txt

Lines changed: 3 additions & 2 deletions
@@ -43,8 +43,9 @@ Design Details
   a change in format.
 
 - The MIDX keeps only one record per object ID. If an object appears
-  in multiple packfiles, then the MIDX selects the copy in the most-
-  recently modified packfile.
+  in multiple packfiles, then the MIDX selects the copy in the
+  preferred packfile, otherwise selecting from the most-recently
+  modified packfile.
 
 - If there exist packfiles in the pack directory not registered in
   the MIDX, then those packfiles are loaded into the `packed_git`
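
The duplicate-selection rule described in this hunk can be sketched as a small comparison over the candidate copies of one object. This is an illustration under stated assumptions, not the code in midx.c: `struct candidate`, `prefer_copy()`, `select_copy()` and their fields are hypothetical names, and the final tie-break on pack ID is assumed only to make the result deterministic. The sketch follows this file's wording (the most-recently modified pack wins when no copy is in the preferred pack); the git-multi-pack-index.txt hunk earlier words the fallback as "lowest mtime".

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical record for one copy of an object in one pack. */
struct candidate {
	unsigned pack_id;   /* position of the pack in the MIDX pack list */
	int preferred;      /* non-zero if the pack is the preferred pack */
	time_t pack_mtime;  /* modification time of the pack */
};

/* Return non-zero if copy `a` should be kept in favor of copy `b`. */
static int prefer_copy(const struct candidate *a, const struct candidate *b)
{
	int a_pref = !!a->preferred, b_pref = !!b->preferred;

	/* A copy in the preferred pack always wins. */
	if (a_pref != b_pref)
		return a_pref;
	/* Otherwise the copy in the more recently modified pack wins. */
	if (a->pack_mtime != b->pack_mtime)
		return a->pack_mtime > b->pack_mtime;
	/* Assumed final tie-break, so the choice is deterministic. */
	return a->pack_id < b->pack_id;
}

/* Pick the copy to record from `n` candidates (n must be at least 1). */
static const struct candidate *select_copy(const struct candidate *c, size_t n)
{
	const struct candidate *best = c;
	size_t i;

	assert(n > 0);
	for (i = 1; i < n; i++)
		if (prefer_copy(&c[i], best))
			best = &c[i];
	return best;
}
```

With three copies where only the last one lives in the preferred pack, `select_copy()` returns that copy even though another pack is newer.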

Documentation/technical/pack-format.txt

Lines changed: 83 additions & 0 deletions
@@ -379,3 +379,86 @@ CHUNK DATA:
 TRAILER:
 
 	Index checksum of the above contents.
+
+== multi-pack-index reverse indexes
+
+Similar to the pack-based reverse index, the multi-pack index can also
+be used to generate a reverse index.
+
+Instead of mapping between offset, pack-, and index position, this
+reverse index maps between an object's position within the MIDX, and
+that object's position within a pseudo-pack that the MIDX describes
+(i.e., the ith entry of the multi-pack reverse index holds the MIDX
+position of ith object in pseudo-pack order).
+
+To clarify the difference between these orderings, consider a multi-pack
+reachability bitmap (which does not yet exist, but is what we are
+building towards here). Each bit needs to correspond to an object in the
+MIDX, and so we need an efficient mapping from bit position to MIDX
+position.
+
+One solution is to let bits occupy the same position in the oid-sorted
+index stored by the MIDX. But because oids are effectively random, their
+resulting reachability bitmaps would have no locality, and thus compress
+poorly. (This is the reason that single-pack bitmaps use the pack
+ordering, and not the .idx ordering, for the same purpose.)
+
+So we'd like to define an ordering for the whole MIDX based around
+pack ordering, which has far better locality (and thus compresses more
+efficiently). We can think of a pseudo-pack created by the concatenation
+of all of the packs in the MIDX. E.g., if we had a MIDX with three packs
+(a, b, c), with 10, 15, and 20 objects respectively, we can imagine an
+ordering of the objects like:
+
+    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
+
+where the ordering of the packs is defined by the MIDX's pack list,
+and then the ordering of objects within each pack is the same as the
+order in the actual packfile.
+
+Given the list of packs and their counts of objects, you can
+naïvely reconstruct that pseudo-pack ordering (e.g., the object at
+position 27 must be (c,1) because packs "a" and "b" consumed 25 of the
+slots). But there's a catch. Objects may be duplicated between packs, in
+which case the MIDX only stores one pointer to the object (and thus we'd
+want only one slot in the bitmap).
+
+Callers could handle duplicates themselves by reading objects in order
+of their bit-position, but that's linear in the number of objects, and
+much too expensive for ordinary bitmap lookups. Building a reverse index
+solves this, since it is the logical inverse of the index, and that
+index has already removed duplicates. But, building a reverse index on
+the fly can be expensive. Since we already have an on-disk format for
+pack-based reverse indexes, let's reuse it for the MIDX's pseudo-pack,
+too.
+
+Objects from the MIDX are ordered as follows to string together the
+pseudo-pack. Let `pack(o)` return the pack from which `o` was selected
+by the MIDX, and define an ordering of packs based on their numeric ID
+(as stored by the MIDX). Let `offset(o)` return the object offset of `o`
+within `pack(o)`. Then, compare `o1` and `o2` as follows:
+
+  - If one of `pack(o1)` and `pack(o2)` is preferred and the other
+    is not, then the preferred one sorts first.
++
+(This is a detail that allows the MIDX bitmap to determine which
+pack should be used by the pack-reuse mechanism, since it can ask
+the MIDX for the pack containing the object at bit position 0).
+
+  - If `pack(o1) ≠ pack(o2)`, then sort the two objects in descending
+    order based on the pack ID.
+
+  - Otherwise, `pack(o1) = pack(o2)`, and the objects are sorted in
+    pack-order (i.e., `o1` sorts ahead of `o2` exactly when `offset(o1)
+    < offset(o2)`).
+
+In short, a MIDX's pseudo-pack is the de-duplicated concatenation of
+objects in packs stored by the MIDX, laid out in pack order, and the
+packs arranged in MIDX order (with the preferred pack coming first).
+
+Finally, note that the MIDX's reverse index is not stored as a chunk in
+the multi-pack-index itself. This is done because the reverse index
+includes the checksum of the pack or MIDX to which it belongs, which
+makes it impossible to write in the MIDX. To avoid races when rewriting
+the MIDX, a MIDX reverse index includes the MIDX's checksum in its
+filename (e.g., `multi-pack-index-xyz.rev`).
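
The three comparison rules in the added documentation, and the definition of the reverse index as "the ith entry holds the MIDX position of the ith object in pseudo-pack order", can be sketched in C as follows. This is a minimal illustration, not the implementation in midx.c or pack-revindex.c: `struct midx_obj`, `pseudo_pack_cmp()` and `build_reverse_index()` are hypothetical names. One assumption deserves attention: the second rule says "descending order based on the pack ID", while the worked example (a, b, c) and the closing summary lay packs out in MIDX order, so this sketch sorts by ascending pack ID. Flip that comparison if the literal reading is intended.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: one entry per object kept by the MIDX. */
struct midx_obj {
	uint32_t midx_pos;  /* position in the MIDX's oid-sorted index */
	uint32_t pack_id;   /* numeric ID of pack(o) as stored by the MIDX */
	uint64_t offset;    /* offset(o) within pack(o) */
	int preferred;      /* non-zero if pack(o) is the preferred pack */
};

static int pseudo_pack_cmp(const void *va, const void *vb)
{
	const struct midx_obj *a = va, *b = vb;
	int a_pref = !!a->preferred, b_pref = !!b->preferred;

	/* Rule 1: objects from the preferred pack sort first. */
	if (a_pref != b_pref)
		return b_pref - a_pref;
	/* Rule 2: different packs sort by pack ID (ascending is assumed). */
	if (a->pack_id != b->pack_id)
		return a->pack_id < b->pack_id ? -1 : 1;
	/* Rule 3: same pack, so sort by offset within that pack. */
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Sort `objs` into pseudo-pack order and fill `rev` so that rev[i] is
 * the MIDX position of the i-th object in that order.
 */
static void build_reverse_index(struct midx_obj *objs, size_t n, uint32_t *rev)
{
	size_t i;

	assert(objs && rev);
	qsort(objs, n, sizeof(*objs), pseudo_pack_cmp);
	for (i = 0; i < n; i++)
		rev[i] = objs[i].midx_pos;
}
```

Because the input already holds one entry per object, duplicates never reach the sort, which is the property the text relies on when it calls the reverse index the logical inverse of the de-duplicated index.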

builtin/multi-pack-index.c

Lines changed: 148 additions & 34 deletions
@@ -4,67 +4,181 @@
 #include "parse-options.h"
 #include "midx.h"
 #include "trace2.h"
+#include "object-store.h"
 
+#define BUILTIN_MIDX_WRITE_USAGE \
+	N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]")
+
+#define BUILTIN_MIDX_VERIFY_USAGE \
+	N_("git multi-pack-index [<options>] verify")
+
+#define BUILTIN_MIDX_EXPIRE_USAGE \
+	N_("git multi-pack-index [<options>] expire")
+
+#define BUILTIN_MIDX_REPACK_USAGE \
+	N_("git multi-pack-index [<options>] repack [--batch-size=<size>]")
+
+static char const * const builtin_multi_pack_index_write_usage[] = {
+	BUILTIN_MIDX_WRITE_USAGE,
+	NULL
+};
+static char const * const builtin_multi_pack_index_verify_usage[] = {
+	BUILTIN_MIDX_VERIFY_USAGE,
+	NULL
+};
+static char const * const builtin_multi_pack_index_expire_usage[] = {
+	BUILTIN_MIDX_EXPIRE_USAGE,
+	NULL
+};
+static char const * const builtin_multi_pack_index_repack_usage[] = {
+	BUILTIN_MIDX_REPACK_USAGE,
+	NULL
+};
 static char const * const builtin_multi_pack_index_usage[] = {
-	N_("git multi-pack-index [<options>] (write|verify|expire|repack --batch-size=<size>)"),
+	BUILTIN_MIDX_WRITE_USAGE,
+	BUILTIN_MIDX_VERIFY_USAGE,
+	BUILTIN_MIDX_EXPIRE_USAGE,
+	BUILTIN_MIDX_REPACK_USAGE,
 	NULL
 };
 
 static struct opts_multi_pack_index {
 	const char *object_dir;
+	const char *preferred_pack;
 	unsigned long batch_size;
-	int progress;
+	unsigned flags;
 } opts;
 
-int cmd_multi_pack_index(int argc, const char **argv,
-			 const char *prefix)
+static struct option common_opts[] = {
+	OPT_FILENAME(0, "object-dir", &opts.object_dir,
+	  N_("object directory containing set of packfile and pack-index pairs")),
+	OPT_BIT(0, "progress", &opts.flags, N_("force progress reporting"), MIDX_PROGRESS),
+	OPT_END(),
+};
+
+static struct option *add_common_options(struct option *prev)
 {
-	unsigned flags = 0;
+	return parse_options_concat(common_opts, prev);
+}
+
+static int cmd_multi_pack_index_write(int argc, const char **argv)
+{
+	struct option *options;
+	static struct option builtin_multi_pack_index_write_options[] = {
+		OPT_STRING(0, "preferred-pack", &opts.preferred_pack,
+			   N_("preferred-pack"),
+			   N_("pack for reuse when computing a multi-pack bitmap")),
+		OPT_END(),
+	};
+
+	options = add_common_options(builtin_multi_pack_index_write_options);
+
+	trace2_cmd_mode(argv[0]);
 
-	static struct option builtin_multi_pack_index_options[] = {
-		OPT_FILENAME(0, "object-dir", &opts.object_dir,
-		  N_("object directory containing set of packfile and pack-index pairs")),
-		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
+	argc = parse_options(argc, argv, NULL,
+			     options, builtin_multi_pack_index_write_usage,
+			     PARSE_OPT_KEEP_UNKNOWN);
+	if (argc)
+		usage_with_options(builtin_multi_pack_index_write_usage,
+				   options);
+
+	FREE_AND_NULL(options);
+
+	return write_midx_file(opts.object_dir, opts.preferred_pack,
+			       opts.flags);
+}
+
+static int cmd_multi_pack_index_verify(int argc, const char **argv)
+{
+	struct option *options = common_opts;
+
+	trace2_cmd_mode(argv[0]);
+
+	argc = parse_options(argc, argv, NULL,
+			     options, builtin_multi_pack_index_verify_usage,
+			     PARSE_OPT_KEEP_UNKNOWN);
+	if (argc)
+		usage_with_options(builtin_multi_pack_index_verify_usage,
+				   options);
+
+	return verify_midx_file(the_repository, opts.object_dir, opts.flags);
+}
+
+static int cmd_multi_pack_index_expire(int argc, const char **argv)
+{
+	struct option *options = common_opts;
+
+	trace2_cmd_mode(argv[0]);
+
+	argc = parse_options(argc, argv, NULL,
+			     options, builtin_multi_pack_index_expire_usage,
+			     PARSE_OPT_KEEP_UNKNOWN);
+	if (argc)
+		usage_with_options(builtin_multi_pack_index_expire_usage,
+				   options);
+
+	return expire_midx_packs(the_repository, opts.object_dir, opts.flags);
+}
+
+static int cmd_multi_pack_index_repack(int argc, const char **argv)
+{
+	struct option *options;
+	static struct option builtin_multi_pack_index_repack_options[] = {
 		OPT_MAGNITUDE(0, "batch-size", &opts.batch_size,
 		  N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")),
 		OPT_END(),
 	};
 
+	options = add_common_options(builtin_multi_pack_index_repack_options);
+
+	trace2_cmd_mode(argv[0]);
+
+	argc = parse_options(argc, argv, NULL,
+			     options,
+			     builtin_multi_pack_index_repack_usage,
+			     PARSE_OPT_KEEP_UNKNOWN);
+	if (argc)
+		usage_with_options(builtin_multi_pack_index_repack_usage,
+				   options);
+
+	FREE_AND_NULL(options);
+
+	return midx_repack(the_repository, opts.object_dir,
+			   (size_t)opts.batch_size, opts.flags);
+}
+
+int cmd_multi_pack_index(int argc, const char **argv,
+			 const char *prefix)
+{
+	struct option *builtin_multi_pack_index_options = common_opts;
+
 	git_config(git_default_config, NULL);
 
-	opts.progress = isatty(2);
+	if (isatty(2))
+		opts.flags |= MIDX_PROGRESS;
 	argc = parse_options(argc, argv, prefix,
 			     builtin_multi_pack_index_options,
-			     builtin_multi_pack_index_usage, 0);
+			     builtin_multi_pack_index_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
 
 	if (!opts.object_dir)
 		opts.object_dir = get_object_directory();
-	if (opts.progress)
-		flags |= MIDX_PROGRESS;
 
 	if (argc == 0)
+		goto usage;
+
+	if (!strcmp(argv[0], "repack"))
+		return cmd_multi_pack_index_repack(argc, argv);
+	else if (!strcmp(argv[0], "write"))
+		return cmd_multi_pack_index_write(argc, argv);
+	else if (!strcmp(argv[0], "verify"))
+		return cmd_multi_pack_index_verify(argc, argv);
+	else if (!strcmp(argv[0], "expire"))
+		return cmd_multi_pack_index_expire(argc, argv);
+	else {
+usage:
+		error(_("unrecognized subcommand: %s"), argv[0]);
 		usage_with_options(builtin_multi_pack_index_usage,
 				   builtin_multi_pack_index_options);
-
-	if (argc > 1) {
-		die(_("too many arguments"));
-		return 1;
 	}
-
-	trace2_cmd_mode(argv[0]);
-
-	if (!strcmp(argv[0], "repack"))
-		return midx_repack(the_repository, opts.object_dir,
-			(size_t)opts.batch_size, flags);
-	if (opts.batch_size)
-		die(_("--batch-size option is only for 'repack' subcommand"));
-
-	if (!strcmp(argv[0], "write"))
-		return write_midx_file(opts.object_dir, flags);
-	if (!strcmp(argv[0], "verify"))
-		return verify_midx_file(the_repository, opts.object_dir, flags);
-	if (!strcmp(argv[0], "expire"))
-		return expire_midx_packs(the_repository, opts.object_dir, flags);
-
-	die(_("unrecognized subcommand: %s"), argv[0]);
 }

builtin/repack.c

Lines changed: 1 addition & 1 deletion
@@ -721,7 +721,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	remove_temporary_files();
 
 	if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0))
-		write_midx_file(get_object_directory(), 0);
+		write_midx_file(get_object_directory(), NULL, 0);
 
 	string_list_clear(&names, 0);
 	string_list_clear(&rollback, 0);
