Make recent downloads fast #1363
Merged: bors-voyager merged 1 commit into rust-lang:master from sgrif:sg-make-recent-downloads-fast on Apr 25, 2018
Conversation
Every aspect of crates.io is extremely fast except one. Calculating recent downloads is a moderately expensive operation that we are doing too often. It's not "slow" per se; the query takes 500ms to complete on average, which is not unreasonable.

However, this query does put a good bit of load on the DB server. The root cause of rust-lang#1304 was an irresponsible bot hitting /crates (which includes recent downloads) often enough that it caused our DB server's CPU load to reach 100%. We're on a very low-tier database server right now, so we could fix this by just upgrading to the next tier if we wanted to. However, as the number of crates grows, this problem will just get worse.

Recent downloads appear in three places: summary, search, and crate details. Summary and search are the two endpoints that crawlers want to hit, so we can't have slow queries on those pages.

There are other solutions we could take here. We could not include recent downloads on those pages, and include them only if a flag is set. We could also move recent downloads to a separate endpoint which gets hit on its own. Both of these require an annoying amount of code, though, and we would need a crawler policy that says "don't hit these endpoints", which in turn requires work for us to monitor those endpoints, etc.

Ultimately we don't need this data to be real time (it is already slightly delayed anyway). We can just store this particular piece of data in a materialized view, which we refresh periodically. Right now I'm refreshing it whenever update-downloads runs, but I think we can go even further and update it daily if this becomes a scaling problem in the future.

Even though repopulating the view currently only takes ~500ms (the same amount of time as the query), I've opted to do it concurrently to avoid locking the table for long periods of time as we continue to scale.

Crate details is still using the `crate_downloads` table, so it's a bit closer to real time. When we're selecting this for a single crate, the query is already fast enough. We may want to change this in the future. We could also consider getting rid of `crate_downloads` entirely and populating the view from `version_downloads`. This would cause more CPU load when we refresh the view, but it probably doesn't matter.
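The materialized-view approach described above can be sketched in Postgres DDL. This is an illustrative sketch, not the actual crates.io migration: the view name, the 90-day window, and the assumed `crate_downloads (crate_id, date, downloads)` shape are all assumptions made for the example.

```sql
-- Hypothetical sketch of the approach: precompute recent downloads in a
-- materialized view instead of aggregating on every request.
-- Assumes crate_downloads has columns (crate_id, date, downloads).
CREATE MATERIALIZED VIEW recent_crate_downloads AS
  SELECT crate_id, SUM(downloads) AS downloads
    FROM crate_downloads
   WHERE date > CURRENT_DATE - INTERVAL '90 days'
   GROUP BY crate_id;

-- REFRESH ... CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX ON recent_crate_downloads (crate_id);

-- Run periodically (e.g. whenever update-downloads runs). CONCURRENTLY
-- repopulates the view without taking a lock that blocks readers,
-- trading some extra work on the database side for availability.
REFRESH MATERIALIZED VIEW CONCURRENTLY recent_crate_downloads;
```

Queries on the summary and search pages can then join against `recent_crate_downloads` as if it were an ordinary (slightly stale) table.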
Force-pushed from 4b6ed40 to c6101b5
ashleygwilliams approved these changes on Apr 25, 2018
ok, i think this is good. the only thing this might need is a way to share with people that the download data is no longer realtime. i'm not sure if such a place exists, but that's my only concern.
bors: r+
bors-voyager bot added a commit that referenced this pull request on Apr 25, 2018
1363: Make recent downloads fast r=ashleygwilliams
Build succeeded
I can confirm it is indeed now fast.