Skip to content

Commit d83bc70

Browse files
committed
Merge branch 'ds/name-hash-tweaks' into next
"git pack-objects" and its wrapper "git repack" learned an option to use an alternative path-hash function to improve delta-base selection to produce a packfile with deeper history than window size. * ds/name-hash-tweaks: pack-objects: prevent name hash version change test-tool: add helper for name-hash values p5313: add size comparison test pack-objects: add GIT_TEST_NAME_HASH_VERSION repack: add --name-hash-version option pack-objects: add --name-hash-version option pack-objects: create new name-hash function version
2 parents c4cfc42 + b4cf684 commit d83bc70

22 files changed

+389
-16
lines changed

Documentation/git-pack-objects.adoc

Lines changed: 31 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,8 @@ SYNOPSIS
1515
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
1616
[--cruft] [--cruft-expiration=<time>]
1717
[--stdout [--filter=<filter-spec>] | <base-name>]
18-
[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
18+
[--shallow] [--keep-true-parents] [--[no-]sparse]
19+
[--name-hash-version=<n>] < <object-list>
1920

2021

2122
DESCRIPTION
@@ -345,6 +346,35 @@ raise an error.
345346
Restrict delta matches based on "islands". See DELTA ISLANDS
346347
below.
347348
349+
--name-hash-version=<n>::
350+
While performing delta compression, Git groups objects that may be
351+
similar based on heuristics using the path to that object. While
352+
grouping objects by an exact path match is good for paths with
353+
many versions, there are benefits for finding delta pairs across
354+
different full paths. Git collects objects by type and then by a
355+
"name hash" of the path and then by size, hoping to group objects
356+
that will compress well together.
357+
+
358+
The default name hash version is `1`, which prioritizes hash locality by
359+
considering the final bytes of the path as providing the maximum magnitude
360+
to the hash function. This version excels at distinguishing short paths
361+
and finding renames across directories. However, the hash function depends
362+
primarily on the final 16 bytes of the path. If there are many paths in
363+
the repo that have the same final 16 bytes and differ only by parent
364+
directory, then this name-hash may lead to too many collisions and cause
365+
poor results. At the moment, this version is required when writing
366+
reachability bitmap files with `--write-bitmap-index`.
367+
+
368+
The name hash version `2` has similar locality features as version `1`,
369+
except it considers each path component separately and overlays the hashes
370+
with a shift. This still prioritizes the final bytes of the path, but also
371+
"salts" the lower bits of the hash using the parent directory names. This
372+
method allows for some of the locality benefits of version `1` while
373+
breaking most of the collisions from a similarly-named file appearing in
374+
many different directories. At the moment, this version is not allowed
375+
when writing reachability bitmap files with `--write-bitmap-index` and it
376+
will be automatically changed to version `1`.
377+
348378
349379
DELTA ISLANDS
350380
-------------

Documentation/git-repack.adoc

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository
99
SYNOPSIS
1010
--------
1111
[verse]
12-
'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
12+
'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
13+
[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
14+
[--write-midx] [--name-hash-version=<n>]
1315

1416
DESCRIPTION
1517
-----------
@@ -249,6 +251,11 @@ linkgit:git-multi-pack-index[1]).
249251
Write a multi-pack index (see linkgit:git-multi-pack-index[1])
250252
containing the non-redundant packs.
251253

254+
--name-hash-version=<n>::
255+
Provide this argument to the underlying `git pack-objects` process.
256+
See linkgit:git-pack-objects[1] for full details.
257+
258+
252259
CONFIGURATION
253260
-------------
254261

Makefile

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -813,6 +813,7 @@ TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o
813813
TEST_BUILTINS_OBJS += test-match-trees.o
814814
TEST_BUILTINS_OBJS += test-mergesort.o
815815
TEST_BUILTINS_OBJS += test-mktemp.o
816+
TEST_BUILTINS_OBJS += test-name-hash.o
816817
TEST_BUILTINS_OBJS += test-online-cpus.o
817818
TEST_BUILTINS_OBJS += test-pack-mtimes.o
818819
TEST_BUILTINS_OBJS += test-parse-options.o

builtin/pack-objects.c

Lines changed: 58 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -269,6 +269,43 @@ struct configured_exclusion {
269269
static struct oidmap configured_exclusions;
270270

271271
static struct oidset excluded_by_config;
272+
static int name_hash_version = -1;
273+
274+
/**
275+
* Check whether the name_hash_version chosen by user input is appropriate,
276+
* and also validate whether it is compatible with other features.
277+
*/
278+
static void validate_name_hash_version(void)
279+
{
280+
if (name_hash_version < 1 || name_hash_version > 2)
281+
die(_("invalid --name-hash-version option: %d"), name_hash_version);
282+
if (write_bitmap_index && name_hash_version != 1) {
283+
warning(_("currently, --write-bitmap-index requires --name-hash-version=1"));
284+
name_hash_version = 1;
285+
}
286+
}
287+
288+
static inline uint32_t pack_name_hash_fn(const char *name)
289+
{
290+
static int seen_version = -1;
291+
292+
if (seen_version < 0)
293+
seen_version = name_hash_version;
294+
else if (seen_version != name_hash_version)
295+
BUG("name hash version changed from %d to %d mid-process",
296+
seen_version, name_hash_version);
297+
298+
switch (name_hash_version) {
299+
case 1:
300+
return pack_name_hash(name);
301+
302+
case 2:
303+
return pack_name_hash_v2((const unsigned char *)name);
304+
305+
default:
306+
BUG("invalid name-hash version: %d", name_hash_version);
307+
}
308+
}
272309

273310
/*
274311
* stats
@@ -1689,7 +1726,7 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
16891726
return 0;
16901727
}
16911728

1692-
create_object_entry(oid, type, pack_name_hash(name),
1729+
create_object_entry(oid, type, pack_name_hash_fn(name),
16931730
exclude, name && no_try_delta(name),
16941731
found_pack, found_offset);
16951732
return 1;
@@ -1903,7 +1940,7 @@ static void add_preferred_base_object(const char *name)
19031940
{
19041941
struct pbase_tree *it;
19051942
size_t cmplen;
1906-
unsigned hash = pack_name_hash(name);
1943+
unsigned hash = pack_name_hash_fn(name);
19071944

19081945
if (!num_preferred_base || check_pbase_path(hash))
19091946
return;
@@ -3415,7 +3452,7 @@ static void show_object_pack_hint(struct object *object, const char *name,
34153452
* here using a now in order to perhaps improve the delta selection
34163453
* process.
34173454
*/
3418-
oe->hash = pack_name_hash(name);
3455+
oe->hash = pack_name_hash_fn(name);
34193456
oe->no_try_delta = name && no_try_delta(name);
34203457

34213458
stdin_packs_hints_nr++;
@@ -3565,7 +3602,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
35653602
entry = packlist_find(&to_pack, oid);
35663603
if (entry) {
35673604
if (name) {
3568-
entry->hash = pack_name_hash(name);
3605+
entry->hash = pack_name_hash_fn(name);
35693606
entry->no_try_delta = no_try_delta(name);
35703607
}
35713608
} else {
@@ -3588,7 +3625,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
35883625
return;
35893626
}
35903627

3591-
entry = create_object_entry(oid, type, pack_name_hash(name),
3628+
entry = create_object_entry(oid, type, pack_name_hash_fn(name),
35923629
0, name && no_try_delta(name),
35933630
pack, offset);
35943631
}
@@ -4068,6 +4105,15 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
40684105
if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
40694106
return -1;
40704107

4108+
/*
4109+
* For now, force the name-hash version to be 1 since that
4110+
* is the version implied by the bitmap format. Later, the
4111+
* format can include this version explicitly in its format,
4112+
* allowing readers to know the version that was used during
4113+
* the bitmap write.
4114+
*/
4115+
name_hash_version = 1;
4116+
40714117
if (pack_options_allow_reuse())
40724118
reuse_partial_packfile_from_bitmap(bitmap_git,
40734119
&reuse_packfiles,
@@ -4443,6 +4489,8 @@ int cmd_pack_objects(int argc,
44434489
OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
44444490
N_("protocol"),
44454491
N_("exclude any configured uploadpack.blobpackfileuri with this protocol")),
4492+
OPT_INTEGER(0, "name-hash-version", &name_hash_version,
4493+
N_("use the specified name-hash function to group similar objects")),
44464494
OPT_END(),
44474495
};
44484496

@@ -4598,6 +4646,11 @@ int cmd_pack_objects(int argc,
45984646
if (pack_to_stdout || !rev_list_all)
45994647
write_bitmap_index = 0;
46004648

4649+
if (name_hash_version < 0)
4650+
name_hash_version = (int)git_env_ulong("GIT_TEST_NAME_HASH_VERSION", 1);
4651+
4652+
validate_name_hash_version();
4653+
46014654
if (use_delta_islands)
46024655
strvec_push(&rp, "--topo-order");
46034656

builtin/repack.c

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,9 @@ static int run_update_server_info = 1;
4141
static char *packdir, *packtmp_name, *packtmp;
4242

4343
static const char *const git_repack_usage[] = {
44-
N_("git repack [<options>]"),
44+
N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
45+
"[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
46+
"[--write-midx] [--name-hash-version=<n>]"),
4547
NULL
4648
};
4749

@@ -60,6 +62,7 @@ struct pack_objects_args {
6062
int no_reuse_object;
6163
int quiet;
6264
int local;
65+
int name_hash_version;
6366
struct list_objects_filter_options filter_options;
6467
};
6568

@@ -308,6 +311,8 @@ static void prepare_pack_objects(struct child_process *cmd,
308311
strvec_pushf(&cmd->args, "--no-reuse-delta");
309312
if (args->no_reuse_object)
310313
strvec_pushf(&cmd->args, "--no-reuse-object");
314+
if (args->name_hash_version)
315+
strvec_pushf(&cmd->args, "--name-hash-version=%d", args->name_hash_version);
311316
if (args->local)
312317
strvec_push(&cmd->args, "--local");
313318
if (args->quiet)
@@ -1205,6 +1210,8 @@ int cmd_repack(int argc,
12051210
N_("pass --no-reuse-delta to git-pack-objects")),
12061211
OPT_BOOL('F', NULL, &po_args.no_reuse_object,
12071212
N_("pass --no-reuse-object to git-pack-objects")),
1213+
OPT_INTEGER(0, "name-hash-version", &po_args.name_hash_version,
1214+
N_("specify the name hash version to use for grouping similar objects by path")),
12081215
OPT_NEGBIT('n', NULL, &run_update_server_info,
12091216
N_("do not run git-update-server-info"), 1),
12101217
OPT__QUIET(&po_args.quiet, N_("be quiet")),

pack-objects.h

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -208,6 +208,34 @@ static inline uint32_t pack_name_hash(const char *name)
208208
return hash;
209209
}
210210

211+
static inline uint32_t pack_name_hash_v2(const unsigned char *name)
212+
{
213+
uint32_t hash = 0, base = 0, c;
214+
215+
if (!name)
216+
return 0;
217+
218+
while ((c = *name++)) {
219+
if (isspace(c))
220+
continue;
221+
if (c == '/') {
222+
base = (base >> 6) ^ hash;
223+
hash = 0;
224+
} else {
225+
/*
226+
* 'c' is only a single byte. Reverse it and move
227+
* it to the top of the hash, moving the rest to
228+
* less-significant bits.
229+
*/
230+
c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
231+
c = (c & 0xCC) >> 2 | (c & 0x33) << 2;
232+
c = (c & 0xAA) >> 1 | (c & 0x55) << 1;
233+
hash = (hash >> 2) + (c << 24);
234+
}
235+
}
236+
return (base >> 6) ^ hash;
237+
}
238+
211239
static inline enum object_type oe_type(const struct object_entry *e)
212240
{
213241
return e->type_valid ? e->type_ : OBJ_BAD;

t/README

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -471,6 +471,10 @@ a test and then fails then the whole test run will abort. This can help to make
471471
sure the expected tests are executed and not silently skipped when their
472472
dependency breaks or is simply not present in a new environment.
473473

474+
GIT_TEST_NAME_HASH_VERSION=<int>, when set, causes 'git pack-objects' to
475+
assume '--name-hash-version=<n>'.
476+
477+
474478
Naming Tests
475479
------------
476480

t/helper/meson.build

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,7 @@ test_tool_sources = [
3434
'test-match-trees.c',
3535
'test-mergesort.c',
3636
'test-mktemp.c',
37+
'test-name-hash.c',
3738
'test-online-cpus.c',
3839
'test-pack-mtimes.c',
3940
'test-parse-options.c',

t/helper/test-name-hash.c

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
/*
2+
* test-name-hash.c: Read a list of paths over stdin and report on their
3+
* name-hash and full name-hash.
4+
*/
5+
6+
#include "test-tool.h"
7+
#include "git-compat-util.h"
8+
#include "pack-objects.h"
9+
#include "strbuf.h"
10+
11+
int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
12+
{
13+
struct strbuf line = STRBUF_INIT;
14+
15+
while (!strbuf_getline(&line, stdin)) {
16+
printf("%10u ", pack_name_hash(line.buf));
17+
printf("%10u ", pack_name_hash_v2((unsigned const char *)line.buf));
18+
printf("%s\n", line.buf);
19+
}
20+
21+
strbuf_release(&line);
22+
return 0;
23+
}

t/helper/test-tool.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,7 @@ static struct test_cmd cmds[] = {
4444
{ "match-trees", cmd__match_trees },
4545
{ "mergesort", cmd__mergesort },
4646
{ "mktemp", cmd__mktemp },
47+
{ "name-hash", cmd__name_hash },
4748
{ "online-cpus", cmd__online_cpus },
4849
{ "pack-mtimes", cmd__pack_mtimes },
4950
{ "parse-options", cmd__parse_options },

t/helper/test-tool.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,7 @@ int cmd__lazy_init_name_hash(int argc, const char **argv);
3737
int cmd__match_trees(int argc, const char **argv);
3838
int cmd__mergesort(int argc, const char **argv);
3939
int cmd__mktemp(int argc, const char **argv);
40+
int cmd__name_hash(int argc, const char **argv);
4041
int cmd__online_cpus(int argc, const char **argv);
4142
int cmd__pack_mtimes(int argc, const char **argv);
4243
int cmd__parse_options(int argc, const char **argv);

0 commit comments

Comments
 (0)