
feat: code insights language stats performance improvements #62011

Merged: 28 commits from bahrmichael/2962 into main, Apr 24, 2024

Conversation

@bahrmichael (Contributor) commented Apr 18, 2024

Closes https://github.com/sourcegraph/customer/issues/2962

This PR improves the performance of language stats insights by 50-70%.

We achieved this by

  • Traversing the file tree of a repository with a concurrency of up to 4 (default) against gitserver
  • Forcing a depth-first search to better utilize the cache that each directory step uses (see below; we later walked part of this back, see "No depth first search")
  • Excluding *.lock files from the language stats insights

Furthermore, we're

  • Raising the timeout for language stats insights from 3 to 5 minutes
  • Making the concurrency and timeout configurable

New environment variables

  • GET_INVENTORY_TIMEOUT: number; the timeout in minutes for the getInventory call
  • GET_INVENTORY_GIT_SERVER_CONCURRENCY: number; the maximum number of concurrent requests from the getInventory operation to gitserver
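
For illustration, a minimal sketch of how these knobs could be wired up. The variable name gitServerSemaphore appears in the diff below; the envInt helper, package layout, and exact defaults shown here are assumptions on my part:

```go
package inventory

import (
	"os"
	"strconv"
	"time"

	"golang.org/x/sync/semaphore"
)

// envInt reads an integer environment variable, falling back to a default
// when it is unset or malformed. (Illustrative helper, not the PR's code.)
func envInt(name string, fallback int) int {
	if v, err := strconv.Atoi(os.Getenv(name)); err == nil {
		return v
	}
	return fallback
}

var (
	// Overall budget for a getInventory call; the PR raises the default
	// from 3 to 5 minutes.
	getInventoryTimeout = time.Duration(envInt("GET_INVENTORY_TIMEOUT", 5)) * time.Minute
	// Caps concurrent requests to gitserver; 4 was the sweet spot in testing.
	gitServerSemaphore = semaphore.NewWeighted(int64(envInt("GET_INVENTORY_GIT_SERVER_CONCURRENCY", 4)))
)
```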

Test results

[Screenshot: test results, 2024-04-23 16:15]

DFS = Depth First Search

Why only 4 concurrent requests to gitserver

As you can see in the test results above, we start to get diminishing returns after 4 concurrent requests. Four is probably a good default for customers: they get a performance improvement out of the box, and can raise the value if they run very powerful gitserver instances.

When we trace the language stats requests, we can see that increasing the concurrency further leads to longer response times from gitserver, which offsets the improvement we get from concurrent requests.

No depth first search

It turns out that the DFS used in https://github.com/sourcegraph/sourcegraph/pull/62011/commits/e3ef818aac365c6ed380bc32cb5a06f1d7418bf5 is 5-10% slower than the original approach, so we kept the original traversal. When a request is cancelled, only the files in the directory currently being checked (not the files within subdirectories) need to be re-checked, which is an acceptable trade-off.

Excluding .lock files

From my understanding, lock files are not valuable when it comes to determining the languages of a repository. To stay compatible with the linguist library, go-enry doesn't flag all generated files as such, so we exclude *.lock files ourselves.
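
As a sketch, the exclusion can be as simple as a suffix check before a file is handed to enry (the function name is illustrative, not the PR's exact code):

```go
package inventory

import "strings"

// shouldSkipFile reports whether a file should be excluded from the
// language stats inventory; currently only *.lock files are skipped.
func shouldSkipFile(path string) bool {
	return strings.HasSuffix(path, ".lock")
}
```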

In a future change we can enhance the pattern matching and make it configurable so that customers can tweak which files should be checked for their large repositories. Adding that right now may be premature optimization.

Test plan

The affected code paths are covered by existing tests. I've added some unit tests for a newly added function. Furthermore I've done extensive manual testing.

@camdencheek (Member) left a comment


This looks like a great approach! And the numbers quoted are promising.

@bahrmichael bahrmichael marked this pull request as ready for review April 23, 2024 14:42
```go
lineCount, byteCount, err := countLines(rc, buf)
trc.End()
```
@camdencheek (Member):

This is quite a few additional spans. Might I suggest making the ReadFull and CountLines logged events on the getLang span?

@bahrmichael (Contributor, Author):

Since our spans now properly surround the file reader, we can drop the ones here altogether. I will instead add a log event to the trace that tells us which path the evaluation takes for which file. Since the path through countLines could be optimized with a gitserver-side LOC count, that should help us better understand the potential performance gain there. The same goes for files where enry may be unsure but we can confidently say that a certain file type should be treated as a particular language.
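
A rough sketch of what recording such events on an enclosing span could look like, assuming the OpenTelemetry API (the event name comes from this thread; the actual tracing wrapper in the codebase may differ):

```go
package inventory

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordCountLines attaches a "CountLines" event to the current span
// instead of opening a child span per file.
func recordCountLines(ctx context.Context, path string, lineCount int) {
	span := trace.SpanFromContext(ctx)
	span.AddEvent("CountLines", trace.WithAttributes(
		attribute.String("path", path),
		attribute.Int("lines", lineCount),
	))
}
```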

```go
if err != nil {
	return nil, err
}
trc.End()
```
@camdencheek (Member):

Should we end this span when the reader finishes rather than when we return a reader? e.g. in releaseSemaphore?

@bahrmichael (Contributor, Author) commented Apr 24, 2024:

This particular trace was just for info about acquiring the semaphore, with the intention of informing us about an opportunity to increase the semaphore size.
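
For reference, a sketch of the reviewer's suggestion, i.e. keeping the span (and the semaphore slot) alive until the caller finishes reading; all names here are illustrative:

```go
package inventory

import "io"

// tracedReadCloser defers cleanup (ending the span, releasing the
// semaphore) until the consumer closes the reader.
type tracedReadCloser struct {
	io.ReadCloser
	end func() // e.g. func() { trc.End(); gitServerSemaphore.Release(1) }
}

func (r *tracedReadCloser) Close() error {
	err := r.ReadCloser.Close()
	r.end()
	return err
}
```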

```go
}
defer gitServerSemaphore.Release(1)
// Using recurse=true is not possible, because we're likely to run into the following error:
// "Maximum call stack size exceeded"
```
@camdencheek (Member):

That's pretty surprising to me -- why is the stack size so large? Does recurse also return the target so we end up in an infinite recursion?

@bahrmichael (Contributor, Author):

I haven't gone much deeper than trying this out and seeing the error. My first assumption was "it's just going through a lot of folders, and that's breaking it", but now I suspect that this tree call PLUS all the other tree calls I was still making at that time led to the call stack error. I can try again with just one ReadDir request later.

@bahrmichael (Contributor, Author):

I took another stab at this, and it was indeed the recursion breaking things.

The numbers aren't too promising. When running with a recursive call (and skipping the recursion in the inventory), we get between 0.8% and 9.5% performance improvement, without a conclusive pointer to which kinds of repos actually benefit. I'll update the comment with a reference to this, but wouldn't attempt further improvements here until we hear back from customers.

[Screenshot: test results, 2024-04-24 10:34]

"entries_test.go",
],
embed = [":inventory"],
race = "off",
@bahrmichael (Contributor, Author):

I updated entries.go to use concurrency via conc's iter.MapErr. This means that multiple tasks now run concurrently. Our Bazel tests enable the race detector by default, and it trips on this code. Unless we want to do a major refactor of the code, we need to turn it off here (race = "off"). From what I can see in the code this should be safe.
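
For context, a minimal sketch of the iter.MapErr pattern described above (the paths slice and the per-element work are stand-ins, not the PR's actual code):

```go
package main

import (
	"fmt"

	"github.com/sourcegraph/conc/iter"
)

func main() {
	paths := []string{"main.go", "go.mod", "README.md"}
	// MapErr runs the callback for each element on its own goroutine and
	// collects the results and any errors.
	sizes, err := iter.MapErr(paths, func(p *string) (int, error) {
		return len(*p), nil // stand-in for fetching a file from gitserver
	})
	fmt.Println(sizes, err)
}
```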

@camdencheek (Member):

Oh, that's a problem. If the race detector triggers, that almost always indicates a bug. It has a very low false positive rate. We should fix the race conditions.

@bahrmichael bahrmichael merged commit a47018c into main Apr 24, 2024
@bahrmichael bahrmichael deleted the bahrmichael/2962 branch April 24, 2024 11:14
bahrmichael referenced this pull request in sourcegraph/docs Apr 26, 2024
This PR updates the documentation as a follow-up to
https://github.com/sourcegraph/sourcegraph/pull/62011 where we
introduced new environment variables and improved the performance of
language stats insights.
bahrmichael referenced this pull request Jul 18, 2024
…ve loading (#62946)

We previously improved the performance of Language Stats Insights by
introducing parallel requests to gitserver:
https://github.com/sourcegraph/sourcegraph/pull/62011

This PR replaces the previous approach where we would iterate through
and request each file from gitserver with an approach where we request
just one archive. This eliminates a lot of network traffic, and gives us
an additional(!) performance improvement of 70-90%.

Even repositories like chromium (42GB) can now be processed (on my
machine in just one minute).
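
For illustration, a sketch of what consuming a single archive stream might look like, assuming gitserver hands back a tar stream (the inventoryArchive function is hypothetical, not this PR's actual code):

```go
package inventory

import (
	"archive/tar"
	"io"
)

// inventoryArchive walks one tar stream and records per-file sizes;
// a real implementation would feed each name and body into enry.
func inventoryArchive(r io.Reader) (map[string]int64, error) {
	sizes := map[string]int64{}
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag == tar.TypeReg {
			sizes[hdr.Name] = hdr.Size
		}
	}
	return sizes, nil
}
```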

---

Caching: We dropped most of the caching, and kept only the top-level
caching (repo@commit). This means that we only need to compute the
language stats once per commit, and subsequent users/requests can see
the cached data. We dropped the file/directory level caching, because
(1) the code to do that got very complex and (2) we can assume that most
repositories are able to compute within the 5 minutes timeout (which can
be increased via the environment variable `GET_INVENTORY_TIMEOUT`). The
timeout is not bound to the user's request anymore. Before, the frontend
would request the stats up to three times to let the computation move
forward and pick up where the last request aborted. While we still have
this frontend retry mechanism, we don't have to worry about an
abort-and-continue mechanism in the backend.
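
Illustratively, the top-level caching boils down to keying on repo@commit, so a given commit's stats are computed once and reused (names here are hypothetical):

```go
package inventory

import "fmt"

// cacheKey yields one cache entry per repo@commit; subsequent requests
// for the same commit hit the cache instead of recomputing.
func cacheKey(repoName, commitID string) string {
	return fmt.Sprintf("%s@%s", repoName, commitID)
}
```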

---

Credits for the code to @eseliger:
https://github.com/sourcegraph/sourcegraph/issues/62019#issuecomment-2119278481

I've taken the diff, and updated the caching methods to allow for more
advanced use cases should we decide to introduce more caching. We can
take that out again if the current caching is sufficient.

Todos:

- [x] Check if CI passes, manual testing seems to be fine
- [x] Verify that insights are cached at the top level

---

Test data:

- sourcegraph/sourcegraph: 9.07s (main) -> 1.44s (current): 74% better
- facebook/react: 17.52s (main) -> 0.87s (current): 95% better
- godotengine/godot: 28.92s (main) -> 1.98s (current): 93% better
- chromium/chromium: ~1 minute: 100% better, because it didn't compute
before

## Changelog

- Language stats queries now request one archive from gitserver instead
of individual file requests. This leads to a huge performance
improvement. Even extra large repositories like chromium are now able to
compute within one minute. Previously they timed out.

## Test plan

- New unit tests
- Plenty of manual testing