Skip to content
This repository was archived by the owner on Sep 11, 2020. It is now read-only.

dotgit: rewrite the way references are looked up #563

Merged
merged 1 commit into from
Aug 27, 2017

Conversation

erizocosmico
Copy link
Contributor

@erizocosmico erizocosmico commented Aug 25, 2017

This change introduces a cache of already visited references for future visits.
Right now, using Ref method of DotGit fetches all the references and then
returns the correct one, this is a very expensive operation if executed a lot
of times. Adding the cache, makes this orders of mangnitude faster on successive
calls at the expense of a little bit more of memory consumption.

Results of the added benchmark before:

➜ go test ./storage/filesystem/internal/dotgit/... -bench=.
OK: 30 passed
BenchmarkRefMultipleTimes-4   	    5000	    264626 ns/op
PASS
ok  	gopkg.in/src-d/go-git.v4/storage/filesystem/internal/dotgit	3.418s

Results of the benchmark now:

➜ go test ./storage/filesystem/internal/dotgit/... -bench=.
OK: 30 passed
BenchmarkRefMultipleTimes-4   	  200000	      7264 ns/op
PASS
ok  	gopkg.in/src-d/go-git.v4/storage/filesystem/internal/dotgit	2.520s

Caveats:

  • Refs is not cached. Can't make it pass RemoteSuite.TestPushNewReference and RemoteSuite.TestPushRejectNonFastForward even checking the modification time of .git. The cached results are not the same as the ones obtained with the usual process, even if the modtime of the .git is the same as it was.
  • In Ref the first step is try to look directly for the file named after the ref. Same reason as above, only for other tests.

I've been trying another few implementations of this to try to make it work with References and not require that I/O operation at the beginning of Ref, but it's been proving really hard to cache at this level if we cannot rely on the ModTime.

This change makes a push that took ~30 mins with 0 changes take just a few seconds.

@codecov
Copy link

codecov bot commented Aug 25, 2017

Codecov Report

Merging #563 into master will decrease coverage by 0.64%.
The diff coverage is 55.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #563      +/-   ##
==========================================
- Coverage    78.1%   77.45%   -0.65%     
==========================================
  Files         129      129              
  Lines        9807     9822      +15     
==========================================
- Hits         7660     7608      -52     
- Misses       1316     1393      +77     
+ Partials      831      821      -10
Impacted Files Coverage Δ
storage/filesystem/internal/dotgit/dotgit.go 71.59% <55.88%> (-1.55%) ⬇️
plumbing/transport/ssh/common.go 20.54% <0%> (-45.21%) ⬇️
plumbing/transport/ssh/auth_method.go 31.57% <0%> (-22.81%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f58c82...5a7b7af. Read the comment docs.

@erizocosmico erizocosmico changed the title dotgit: cache already visited references for future visits [WIP] dotgit: cache already visited references for future visits Aug 25, 2017
@orirawlings
Copy link
Contributor

orirawlings commented Aug 26, 2017

If the latency is because Ref is implemented by scanning through all Refs, why can't we just change Ref to lookup only the name in question without scanning all the files? I thought the various reference storage formats were designed to support fast random access. Is that not the case? Is there some reason the full scan is necessary? I haven't looked too closely, but maybe the current implementation was motivated more by coding convenience than necessity?

It seems to me that putting a cache around one of the few data types in git that is mutable is a little tricky to get right. It is certainly more difficult to reason about during troubleshooting. Perhaps the troubles with getting the caching working for Refs in this PR is suggestive of that? Caching seems like it should be more of a last resort than first step.

Rather than trading off memory to save CPU we should make sure our current CPU usage is not unnecessarily inefficient.

@erizocosmico
Copy link
Contributor Author

@orirawlings nope, that does not cut it, sadly :(. All the refs are not in all the different formats. All the packed-refs are not in under the refs/ folder. We could do the following instead, though:

Cache the packed refs (we can check that modtime and have them cached), just check for name in refs/ folder and then check in packed refs if not exists.

There is another option I've been working on and it's caching everything without checking any modtime (which makes the operation take ~50ns) and just updating the cache when calling SetRef and RemoveRef. But that's dangerous because it's instance-dependant and if something outside of the given instance modifies the refs we'd never know.

Now there's only two ways of getting a reference, by checking under refs/ directory or in packed-refs. refs/ directory is checked using a direct read by reference name and packed refs are cached until they have been changed.

Signed-off-by: Miguel Molina <[email protected]>
@erizocosmico erizocosmico force-pushed the perf/dotgit-ref-cache branch from 00c51f1 to 5a7b7af Compare August 26, 2017 16:54
@erizocosmico
Copy link
Contributor Author

@orirawlings I've just rewritten the Ref method. Now checks directly by reference name in the filesystem and resorts to packed-refs if it's not found. Packed-refs are cached checking if they actually have changed before using the cache.

@erizocosmico erizocosmico changed the title [WIP] dotgit: cache already visited references for future visits dotgit: rewrite the way references are looked up Aug 26, 2017
@orirawlings
Copy link
Contributor

Yeah, I think just caching everything and only updating on SetRef and RemoveRef is undesirable because it prevents concurrent access to the .git by other processes or git implementations.

This updated patch looks really good to me. It also changes the implementation to be much more consistent with the documentation on the standard git tools. Here is a snippet from the man page for git pack-refs:

This command is used to solve the storage and performance problem by storing the refs in a single file, $GIT_DIR/packed-refs. When a ref is missing from the traditional $GIT_DIR/refs directory hierarchy, it is looked up in this file and used if found.

I think the caching based on the mod time should be sufficient too, and not introduce any data consistency issues.

@mcuadros mcuadros merged commit 334923a into src-d:master Aug 27, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants