feat: code insights language stats performance improvements #62011
Conversation
This looks like a great approach! And the numbers quoted are promising.
…tserver semaphores
… exclusion filter for getInventory
Co-authored-by: Camden Cheek <[email protected]>
lineCount, byteCount, err := countLines(rc, buf)
trc.End()
This is quite a few additional spans. Might I suggest making the `ReadFull` and `CountLines` logged events on the `getLang` span?
Since our spans now properly surround the file reader, we can drop the ones here altogether. I will instead add a log event to the trace that tells us which path the evaluation is taking for which file. Since the path with `countLines` could be optimized with a gitserver-side LOC count, that should help us get a better understanding of the potential performance gain there. The same goes for files where enry may be unsure, but where we can make a confident call that a certain file type should be treated as a particular language.
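A minimal sketch of what such a log event could look like if we attached it directly with the OpenTelemetry API; the event name, attribute keys, and the `recordLangDetectionPath` helper are illustrative assumptions, not the actual instrumentation in this PR:

```go
package inventory

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordLangDetectionPath attaches an event to the span already in the
// context (e.g. the getLang span), recording which detection path was taken
// for a given file. Names and attribute keys are hypothetical.
func recordLangDetectionPath(ctx context.Context, filePath, detectionPath string) {
	span := trace.SpanFromContext(ctx)
	span.AddEvent("language-detection-path",
		trace.WithAttributes(
			attribute.String("file.path", filePath),
			attribute.String("detection.path", detectionPath), // e.g. "countLines", "enry.content", "filename-only"
		))
}
```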
cmd/frontend/backend/inventory.go
Outdated
if err != nil {
	return nil, err
}
trc.End()
Should we end this span when the reader finishes rather than when we return a reader? E.g. in `releaseSemaphore`?
This particular trace was just for info about acquiring the semaphore, with the intention to inform us about an opportunity to increase the semaphore size.
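If we did want a span that covers the full read instead, one option is to wrap the returned reader so the span ends on `Close`. A minimal sketch assuming a plain OpenTelemetry span; the real code would hook this into the same place that releases the semaphore:

```go
package inventory

import (
	"io"

	"go.opentelemetry.io/otel/trace"
)

// spanClosingReader ends the given span when the underlying reader is closed,
// so the span covers the whole lifetime of the stream rather than just the
// call that returned it.
type spanClosingReader struct {
	io.ReadCloser
	span trace.Span
}

func (r *spanClosingReader) Close() error {
	defer r.span.End()
	return r.ReadCloser.Close()
}
```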
cmd/frontend/backend/inventory.go
Outdated
}
defer gitServerSemaphore.Release(1)
// Using recurse=true is not possible, because we're likely to run into the following error:
// "Maximum call stack size exceeded"
That's pretty surprising to me -- why is the stack size so large? Does recurse also return the target so we end up in an infinite recursion?
I haven't gone much deeper than trying this out and seeing the error. My first assumption was "it's just going through a lot of folders, and that's breaking it", but now I suspect that this tree call plus all the other tree calls I was still making at the time are what leads to the call stack error. I can try again with just one ReadDir request later.
I took another stab at this, and it was indeed the combined recursion (the recursive tree call on top of the inventory's own recursion) that was breaking things.
The numbers aren't too promising, though. When running with a single recursive call (and skipping the recursion in the inventory), we get between 0.8% and 9.5% performance improvement, without a conclusive pointer to what kinds of repos actually benefit from this. I'll update the comment with a reference to this, but wouldn't attempt a further improvement here until we hear back from customers.
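For reference, this is roughly the shape of the single-recursive-call variant measured above; `readDir` is a placeholder for the real gitserver call, and the byte-per-extension tally stands in for the real language detection:

```go
package inventory

import (
	"context"
	"io/fs"
	"path/filepath"
)

// inventoryFromRecursiveListing issues one recursive listing request and then
// walks the flat result, instead of recursing directory by directory.
// readDir is a placeholder for the real gitserver client call.
func inventoryFromRecursiveListing(
	ctx context.Context,
	readDir func(ctx context.Context, path string, recurse bool) ([]fs.FileInfo, error),
) (map[string]int64, error) {
	entries, err := readDir(ctx, "", true) // one request for the whole tree
	if err != nil {
		return nil, err
	}
	bytesByLanguage := map[string]int64{}
	for _, e := range entries {
		if e.IsDir() {
			continue // the flat listing already covers every subdirectory
		}
		// Stand-in for the real per-file language detection (enry + content checks).
		bytesByLanguage[filepath.Ext(e.Name())] += e.Size()
	}
	return bytesByLanguage, nil
}
```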
"entries_test.go", | ||
], | ||
embed = [":inventory"], | ||
race = "off", |
I updated entries.go to use concurrency via conc's `iter.MapErr`, which means we're now running various tasks concurrently. Our Bazel tests run with the race detector by default, and it doesn't like this. Unless we want to do a major refactor of the code, we need to turn the detector off here. From what I can see in the code, this should be safe.
Oh, that's a problem. If the race detector triggers, that almost always indicates a bug. It has a very low false positive rate. We should fix the race conditions.
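For context, a minimal sketch of the `iter.MapErr` pattern in question, with illustrative helper names; because the mapper returns values instead of writing to shared state, this shape by itself does not trip the race detector:

```go
package inventory

import (
	"github.com/sourcegraph/conc/iter"
)

// fileInventory is an illustrative per-file result.
type fileInventory struct {
	Language string
	Bytes    int64
}

// inventoryFiles processes files concurrently. iter.MapErr fans the mapper
// out over a bounded number of goroutines (GOMAXPROCS by default) and
// collects the results and any errors, so no goroutine mutates shared state.
func inventoryFiles(paths []string, inventoryOf func(path string) (fileInventory, error)) ([]fileInventory, error) {
	return iter.MapErr(paths, func(p *string) (fileInventory, error) {
		return inventoryOf(*p)
	})
}
```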
This PR updates the documentation as a follow-up to https://github.com/sourcegraph/sourcegraph/pull/62011 where we introduced new environment variables and improved the performance of language stats insights.
…ve loading (#62946)

We previously improved the performance of Language Stats Insights by introducing parallel requests to gitserver: https://github.com/sourcegraph/sourcegraph/pull/62011

This PR replaces the previous approach, where we would iterate through and request each file from gitserver, with an approach where we request just one archive. This eliminates a lot of network traffic and gives us an additional(!) performance improvement of 70-90%. Even repositories like chromium (42GB) can now be processed (on my machine in just one minute).

---

Caching: We dropped most of the caching and kept only the top-level caching (repo@commit). This means that we only need to compute the language stats once per commit, and subsequent users/requests can see the cached data. We dropped the file/directory level caching, because (1) the code to do that got very complex and (2) we can assume that most repositories are able to compute within the 5 minute timeout (which can be increased via the environment variable `GET_INVENTORY_TIMEOUT`). The timeout is no longer bound to the user's request. Before, the frontend would request the stats up to three times to let the computation move forward and pick up where the last request aborted. While we still have this frontend retry mechanism, we don't have to worry about an abort-and-continue mechanism in the backend.

---

Credits for the code to @eseliger: https://github.com/sourcegraph/sourcegraph/issues/62019#issuecomment-2119278481
I've taken the diff and updated the caching methods to allow for more advanced use cases, should we decide to introduce more caching. We can take that out again if the current caching is sufficient.

Todos:
- [x] Check if CI passes, manual testing seems to be fine
- [x] Verify that insights are cached at the top level

---

Test data:
- sourcegraph/sourcegraph: 9.07s (main) -> 1.44s (current): 74% better
- facebook/react: 17.52s (main) -> 0.87s (current): 95% better
- godotengine/godot: 28.92s (main) -> 1.98s (current): 93% better
- chromium/chromium: ~1 minute: 100% better, because it didn't compute before

## Changelog

- Language stats queries now request one archive from gitserver instead of individual file requests. This leads to a huge performance improvement. Even extra large repositories like chromium are now able to compute within one minute. Previously they timed out.

## Test plan

- New unit tests
- Plenty of manual testing
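A minimal sketch of the archive-based shape described above, streaming a tar archive and attributing bytes per language with go-enry; `fetchTarArchive` is a placeholder rather than the real gitserver API, and the real code applies caching, exclusions, and read limits on top of this:

```go
package inventory

import (
	"archive/tar"
	"context"
	"io"

	"github.com/go-enry/go-enry/v2"
)

// languageBytesFromArchive streams one tar archive for the whole repository
// and sums bytes per detected language, instead of fetching files one by one.
// fetchTarArchive stands in for the real gitserver archive request.
func languageBytesFromArchive(ctx context.Context, fetchTarArchive func(ctx context.Context) (io.ReadCloser, error)) (map[string]int64, error) {
	rc, err := fetchTarArchive(ctx)
	if err != nil {
		return nil, err
	}
	defer rc.Close()

	byLanguage := map[string]int64{}
	tr := tar.NewReader(rc)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return byLanguage, nil
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories, symlinks, etc.
		}
		content, err := io.ReadAll(tr) // the real code can cap how much it reads per file
		if err != nil {
			return nil, err
		}
		if lang := enry.GetLanguage(hdr.Name, content); lang != "" {
			byLanguage[lang] += hdr.Size
		}
	}
}
```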
Closes https://github.com/sourcegraph/customer/issues/2962
This PR improves the performance of language stats insights by 50-70%.
We achieved this by
- Forcing a depth-first search to better utilize the cache that each directory step uses (see below)*
- Excluding `*.lock` files from the language stats insights

Furthermore, we're now sending the getInventory requests to gitserver concurrently, controlled by the new environment variables described below.
New environment variables
- `GET_INVENTORY_TIMEOUT`: number, defines the number of minutes that the getInventory call has before timing out
- `GET_INVENTORY_GIT_SERVER_CONCURRENCY`: number, defines the maximum number of concurrent requests from the getInventory operation to the gitserver

Test results
DFS = Depth First Search
Why only 4 concurrent requests to gitserver
As you can see in the test results above, we start to get diminishing returns after 4 concurrent requests. This is probably a good default for customers: they get a performance improvement out of the box and can tweak the value further if they run very powerful gitserver instances.
When we trace the language stats requests, we can see that increasing the concurrency further leads to longer response times from gitserver, which then offsets the improvement we get from the concurrent requests.
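For reference, bounding the gitserver concurrency with a weighted semaphore looks roughly like this; a minimal sketch using `golang.org/x/sync/semaphore` (consistent with the `gitServerSemaphore.Release(1)` call in the diff above), with `fetch` and the env-parsing helper as illustrative stand-ins:

```go
package inventory

import (
	"context"
	"io"
	"os"
	"strconv"

	"golang.org/x/sync/semaphore"
)

// concurrencyFromEnv reads GET_INVENTORY_GIT_SERVER_CONCURRENCY, defaulting to 4.
func concurrencyFromEnv() int64 {
	if v, err := strconv.ParseInt(os.Getenv("GET_INVENTORY_GIT_SERVER_CONCURRENCY"), 10, 64); err == nil && v > 0 {
		return v
	}
	return 4
}

var gitServerSemaphore = semaphore.NewWeighted(concurrencyFromEnv())

// fetchWithLimit bounds the number of concurrent gitserver requests.
// fetch is a placeholder for the real gitserver file request.
func fetchWithLimit(ctx context.Context, fetch func(ctx context.Context) (io.ReadCloser, error)) (io.ReadCloser, error) {
	if err := gitServerSemaphore.Acquire(ctx, 1); err != nil {
		return nil, err
	}
	defer gitServerSemaphore.Release(1)
	return fetch(ctx)
}
```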
No depth first search
It turns out that the DFS used in https://github.com/sourcegraph/sourcegraph/pull/62011/commits/e3ef818aac365c6ed380bc32cb5a06f1d7418bf5 is 5-10% slower than the original approach. When the request is cancelled, only the files in the currently checked directory (not the files within subdirectories) need to be re-checked, which is an acceptable trade-off.
Excluding .lock files
From my understanding, lock files are not valuable when it comes to understanding the languages of a repository. To keep compatibility with the linguist library, go-enry doesn't flag all generated files as such.
In a future change we can enhance the pattern matching and make it configurable so that customers can tweak which files should be checked for their large repositories. Adding that right now may be premature optimization.
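A minimal sketch of such an exclusion filter; the helper name and the idea of a configurable pattern list are illustrative, and `*.lock` is the only pattern this PR actually excludes:

```go
package inventory

import "path/filepath"

// excludedFromInventory reports whether a file should be skipped when
// computing language stats. Today the only exclusion is *.lock files; the
// pattern list could later be made configurable per customer.
func excludedFromInventory(path string) bool {
	patterns := []string{"*.lock"}
	base := filepath.Base(path)
	for _, p := range patterns {
		if ok, _ := filepath.Match(p, base); ok {
			return true
		}
	}
	return false
}
```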
Test plan
The affected code paths are covered by existing tests. I've added some unit tests for a newly added function. Furthermore I've done extensive manual testing.