feat: code insights language stats performance improvements #62011
Conversation
This looks like a great approach! And the numbers quoted are promising.
…tserver semaphores
… exclusion filter for getInventory
Co-authored-by: Camden Cheek <[email protected]>
lineCount, byteCount, err := countLines(rc, buf)
trc.End()
This is quite a few additional spans. Might I suggest making the `ReadFull` and `CountLines` logged events on the `getLang` span?
Since our spans now properly surround the file reader, we can drop the ones here altogether. I will instead add a log event to the trace that tells us which path the evaluation is taking for which file. Since the path with `countLines` could be optimized with a gitserver-side LOC count, that should help us get a better understanding of the potential performance gain there. The same goes for files where enry may be unsure, but where we can make a confident call that a certain file type should be treated as a particular language.
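A minimal sketch of what such a log event could look like if we attached it directly with the OpenTelemetry API; the event name, attribute keys, and the `recordLangDetectionPath` helper are illustrative assumptions, not the actual instrumentation in this PR:

```go
package inventory

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordLangDetectionPath attaches an event to the span already in the
// context (e.g. the getLang span), recording which detection path was taken
// for a given file. Names and attribute keys are hypothetical.
func recordLangDetectionPath(ctx context.Context, filePath, detectionPath string) {
	span := trace.SpanFromContext(ctx)
	span.AddEvent("language-detection-path",
		trace.WithAttributes(
			attribute.String("file.path", filePath),
			attribute.String("detection.path", detectionPath), // e.g. "countLines", "enry.content", "filename-only"
		))
}
```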
cmd/frontend/backend/inventory.go
Outdated
if err != nil {
	return nil, err
}
trc.End()
Should we end this span when the reader finishes rather than when we return a reader? E.g. in `releaseSemaphore`?
This particular trace was just for info about acquiring the semaphore, with the intention to inform us about an opportunity to increase the semaphore size.
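If we did want a span that covers the full read instead, one option is to wrap the returned reader so the span ends on `Close`. A minimal sketch assuming a plain OpenTelemetry span; the real code would hook this into the same place that releases the semaphore:

```go
package inventory

import (
	"io"

	"go.opentelemetry.io/otel/trace"
)

// spanClosingReader ends the given span when the underlying reader is closed,
// so the span covers the whole lifetime of the stream rather than just the
// call that returned it.
type spanClosingReader struct {
	io.ReadCloser
	span trace.Span
}

func (r *spanClosingReader) Close() error {
	defer r.span.End()
	return r.ReadCloser.Close()
}
```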
cmd/frontend/backend/inventory.go
Outdated
}
defer gitServerSemaphore.Release(1)
// Using recurse=true is not possible, because we're likely to run into the following error:
// "Maximum call stack size exceeded"
That's pretty surprising to me -- why is the stack size so large? Does recurse also return the target so we end up in an infinite recursion?
I haven't gone much deeper than trying this out and seeing the error. My first assumption was "it's just going through a lot of folders, and that's breaking it", but now I suspect that this tree call plus all the other tree calls I was still making at the time are what leads to the call stack error. I can try again with just one ReadDir request later.
I took another stab at this, and it was indeed the combined recursion (the recursive tree call on top of the inventory's own recursion) that was breaking things.
The numbers aren't too promising, though. When running with a single recursive call (and skipping the recursion in the inventory), we get between 0.8% and 9.5% performance improvement, without a conclusive pointer to what kinds of repos actually benefit from this. I'll update the comment with a reference to this, but wouldn't attempt a further improvement here until we hear back from customers.
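For reference, this is roughly the shape of the single-recursive-call variant measured above; `readDir` is a placeholder for the real gitserver call, and the byte-per-extension tally stands in for the real language detection:

```go
package inventory

import (
	"context"
	"io/fs"
	"path/filepath"
)

// inventoryFromRecursiveListing issues one recursive listing request and then
// walks the flat result, instead of recursing directory by directory.
// readDir is a placeholder for the real gitserver client call.
func inventoryFromRecursiveListing(
	ctx context.Context,
	readDir func(ctx context.Context, path string, recurse bool) ([]fs.FileInfo, error),
) (map[string]int64, error) {
	entries, err := readDir(ctx, "", true) // one request for the whole tree
	if err != nil {
		return nil, err
	}
	bytesByLanguage := map[string]int64{}
	for _, e := range entries {
		if e.IsDir() {
			continue // the flat listing already covers every subdirectory
		}
		// Stand-in for the real per-file language detection (enry + content checks).
		bytesByLanguage[filepath.Ext(e.Name())] += e.Size()
	}
	return bytesByLanguage, nil
}
```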
"entries_test.go", | ||
], | ||
embed = [":inventory"], | ||
race = "off", |
I updated entries.go to use concurrency via conc's `iter.MapErr`, which means we're now running various tasks concurrently. Our Bazel tests run with the race detector by default, and it doesn't like this. Unless we want to do a major refactor of the code, we need to turn the detector off here. From what I can see in the code, this should be safe.
Oh, that's a problem. If the race detector triggers, that almost always indicates a bug. It has a very low false positive rate. We should fix the race conditions.
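For context, a minimal sketch of the `iter.MapErr` pattern in question, with illustrative helper names; because the mapper returns values instead of writing to shared state, this shape by itself does not trip the race detector:

```go
package inventory

import (
	"github.com/sourcegraph/conc/iter"
)

// fileInventory is an illustrative per-file result.
type fileInventory struct {
	Language string
	Bytes    int64
}

// inventoryFiles processes files concurrently. iter.MapErr fans the mapper
// out over a bounded number of goroutines (GOMAXPROCS by default) and
// collects the results and any errors, so no goroutine mutates shared state.
func inventoryFiles(paths []string, inventoryOf func(path string) (fileInventory, error)) ([]fileInventory, error) {
	return iter.MapErr(paths, func(p *string) (fileInventory, error) {
		return inventoryOf(*p)
	})
}
```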
This PR updates the documentation as a follow-up to https://github.com/sourcegraph/sourcegraph/pull/62011 where we introduced new environment variables and improved the performance of language stats insights.
…ve loading (#62946)

We previously improved the performance of Language Stats Insights by introducing parallel requests to gitserver: https://github.com/sourcegraph/sourcegraph/pull/62011

This PR replaces the previous approach, where we would iterate through and request each file from gitserver, with an approach where we request just one archive. This eliminates a lot of network traffic and gives us an additional(!) performance improvement of 70-90%. Even repositories like chromium (42GB) can now be processed (on my machine in just one minute).

---

Caching: We dropped most of the caching and kept only the top-level caching (repo@commit). This means that we only need to compute the language stats once per commit, and subsequent users/requests can see the cached data. We dropped the file/directory level caching, because (1) the code to do that got very complex and (2) we can assume that most repositories are able to compute within the 5 minute timeout (which can be increased via the environment variable `GET_INVENTORY_TIMEOUT`). The timeout is no longer bound to the user's request. Before, the frontend would request the stats up to three times to let the computation move forward and pick up where the last request aborted. While we still have this frontend retry mechanism, we don't have to worry about an abort-and-continue mechanism in the backend.

---

Credits for the code to @eseliger: https://github.com/sourcegraph/sourcegraph/issues/62019#issuecomment-2119278481
I've taken the diff and updated the caching methods to allow for more advanced use cases, should we decide to introduce more caching. We can take that out again if the current caching is sufficient.

Todos:
- [x] Check if CI passes, manual testing seems to be fine
- [x] Verify that insights are cached at the top level

---

Test data:
- sourcegraph/sourcegraph: 9.07s (main) -> 1.44s (current): 74% better
- facebook/react: 17.52s (main) -> 0.87s (current): 95% better
- godotengine/godot: 28.92s (main) -> 1.98s (current): 93% better
- chromium/chromium: ~1 minute: 100% better, because it didn't compute before

## Changelog

- Language stats queries now request one archive from gitserver instead of individual file requests. This leads to a huge performance improvement. Even extra large repositories like chromium are now able to compute within one minute. Previously they timed out.

## Test plan

- New unit tests
- Plenty of manual testing
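A minimal sketch of the archive-based shape described above, streaming a tar archive and attributing bytes per language with go-enry; `fetchTarArchive` is a placeholder rather than the real gitserver API, and the real code applies caching, exclusions, and read limits on top of this:

```go
package inventory

import (
	"archive/tar"
	"context"
	"io"

	"github.com/go-enry/go-enry/v2"
)

// languageBytesFromArchive streams one tar archive for the whole repository
// and sums bytes per detected language, instead of fetching files one by one.
// fetchTarArchive stands in for the real gitserver archive request.
func languageBytesFromArchive(ctx context.Context, fetchTarArchive func(ctx context.Context) (io.ReadCloser, error)) (map[string]int64, error) {
	rc, err := fetchTarArchive(ctx)
	if err != nil {
		return nil, err
	}
	defer rc.Close()

	byLanguage := map[string]int64{}
	tr := tar.NewReader(rc)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return byLanguage, nil
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories, symlinks, etc.
		}
		content, err := io.ReadAll(tr) // the real code can cap how much it reads per file
		if err != nil {
			return nil, err
		}
		if lang := enry.GetLanguage(hdr.Name, content); lang != "" {
			byLanguage[lang] += hdr.Size
		}
	}
}
```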
Closes https://github.com/sourcegraph/customer/issues/2962
This PR improves the performance of language stats insights by 50-70%.
We achieved this by
- Forcing a depth-first search to better utilize the cache that each directory step uses (see below)*
- Excluding `*.lock` files from the language stats insights

Furthermore, we're now sending the getInventory requests to gitserver concurrently, controlled by the new environment variables described below.
New environment variables
- `GET_INVENTORY_TIMEOUT`: number, defines the number of minutes that the getInventory call has before timing out
- `GET_INVENTORY_GIT_SERVER_CONCURRENCY`: number, defines the maximum number of concurrent requests from the getInventory operation to the gitserver

Test results
DFS = Depth First Search
Why only 4 concurrent requests to gitserver
As you can see in the test results above, we start to get diminishing returns after 4 concurrent requests. This is probably a good default for customers: they get a performance improvement out of the box and can tweak the value further if they run very powerful gitserver instances.
When we trace the language stats requests, we can see that increasing the concurrency further leads to longer response times from gitserver, which then offsets the improvement we get from the concurrent requests.
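For reference, bounding the gitserver concurrency with a weighted semaphore looks roughly like this; a minimal sketch using `golang.org/x/sync/semaphore` (consistent with the `gitServerSemaphore.Release(1)` call in the diff above), with `fetch` and the env-parsing helper as illustrative stand-ins:

```go
package inventory

import (
	"context"
	"io"
	"os"
	"strconv"

	"golang.org/x/sync/semaphore"
)

// concurrencyFromEnv reads GET_INVENTORY_GIT_SERVER_CONCURRENCY, defaulting to 4.
func concurrencyFromEnv() int64 {
	if v, err := strconv.ParseInt(os.Getenv("GET_INVENTORY_GIT_SERVER_CONCURRENCY"), 10, 64); err == nil && v > 0 {
		return v
	}
	return 4
}

var gitServerSemaphore = semaphore.NewWeighted(concurrencyFromEnv())

// fetchWithLimit bounds the number of concurrent gitserver requests.
// fetch is a placeholder for the real gitserver file request.
func fetchWithLimit(ctx context.Context, fetch func(ctx context.Context) (io.ReadCloser, error)) (io.ReadCloser, error) {
	if err := gitServerSemaphore.Acquire(ctx, 1); err != nil {
		return nil, err
	}
	defer gitServerSemaphore.Release(1)
	return fetch(ctx)
}
```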
No depth first search
It turns out that the DFS used in https://github.com/sourcegraph/sourcegraph/pull/62011/commits/e3ef818aac365c6ed380bc32cb5a06f1d7418bf5 is 5-10% slower than the original approach. When the request is cancelled, only the files in the currently checked directory (not the files within subdirectories) need to be re-checked, which is an acceptable trade-off.
Excluding .lock files
From my understanding, lock files are not valuable when it comes to understanding the languages of a repository. To keep compatibility with the linguist library, go-enry doesn't flag all generated files as such.
In a future change we can enhance the pattern matching and make it configurable so that customers can tweak which files should be checked for their large repositories. Adding that right now may be premature optimization.
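A minimal sketch of such an exclusion filter; the helper name and the idea of a configurable pattern list are illustrative, and `*.lock` is the only pattern this PR actually excludes:

```go
package inventory

import "path/filepath"

// excludedFromInventory reports whether a file should be skipped when
// computing language stats. Today the only exclusion is *.lock files; the
// pattern list could later be made configurable per customer.
func excludedFromInventory(path string) bool {
	patterns := []string{"*.lock"}
	base := filepath.Base(path)
	for _, p := range patterns {
		if ok, _ := filepath.Match(p, base); ok {
			return true
		}
	}
	return false
}
```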
Test plan
The affected code paths are covered by existing tests. I've added some unit tests for a newly added function. Furthermore I've done extensive manual testing.