
Make recent downloads fast #1363


Merged: 1 commit merged into rust-lang:master from sg-make-recent-downloads-fast on Apr 25, 2018

Conversation

Contributor

@sgrif sgrif commented Apr 24, 2018

Every aspect of crates.io is extremely fast except one. Calculating
recent downloads is a moderately expensive operation that we are doing
too often. It's not "slow" per se. The query takes 500ms to complete on
average, which is not unreasonable.

However, this query does put a good bit of load on the DB server. The
root cause of #1304 was an irresponsible bot hitting /crates (which
includes recent downloads) enough that it caused our DB server's CPU
load to reach 100%. We're on a very low tier database server right now,
so we could fix this by just upgrading to the next tier if we wanted to.
However, as the number of crates grows, this problem will just get
worse.

Recent downloads appear in 3 places: summary, search, and crate details.
Summary and search are the two endpoints that crawlers want to hit, so
we can't have slow queries on those pages.

There are other solutions we could take here. We could just not include
recent downloads on those pages, and include them only if a flag is set.
We could also move recent downloads to another endpoint which gets hit
separately. Both of these require an annoying amount of code though, and
we need a crawler policy that says "don't hit these endpoints", which
requires work for us to monitor those endpoints, etc.

Ultimately we don't need this data to be real time (it already is
slightly delayed anyway). We can just store this particular piece of
data in a materialized view, which we refresh periodically. Right now
I'm refreshing it whenever update-downloads runs, but I think we can go
even further and update it daily if this becomes a scaling problem in
the future.
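
As a rough illustration of the trade-off described above, here is a minimal in-memory sketch in Rust. It is not the code from this PR (the actual change uses a database materialized view); `DownloadRow`, `RecentDownloadsView`, and the 90-day window are hypothetical names and values chosen for the example. The point is only that reads hit a stored result, and that result stays stale until the next refresh.

```rust
use std::collections::HashMap;

/// Hypothetical row of a per-crate, per-day download log.
struct DownloadRow {
    crate_id: u32,
    day: u32, // days since an arbitrary epoch
    downloads: u64,
}

/// Stand-in for the materialized view: the stored result of the
/// expensive aggregate, recomputed only when `refresh` is called.
struct RecentDownloadsView {
    totals: HashMap<u32, u64>,
}

impl RecentDownloadsView {
    fn new() -> Self {
        RecentDownloadsView { totals: HashMap::new() }
    }

    /// The expensive part: sum downloads per crate over the recent window.
    fn refresh(&mut self, rows: &[DownloadRow], today: u32, window_days: u32) {
        let cutoff = today.saturating_sub(window_days);
        let mut totals = HashMap::new();
        for row in rows.iter().filter(|r| r.day > cutoff && r.day <= today) {
            *totals.entry(row.crate_id).or_insert(0) += row.downloads;
        }
        self.totals = totals;
    }

    /// The cheap part: a lookup against the stored result.
    fn get(&self, crate_id: u32) -> u64 {
        self.totals.get(&crate_id).copied().unwrap_or(0)
    }
}

fn main() {
    let mut rows = vec![
        DownloadRow { crate_id: 1, day: 100, downloads: 10 },
        DownloadRow { crate_id: 1, day: 5, downloads: 999 }, // outside the window
        DownloadRow { crate_id: 2, day: 99, downloads: 3 },
    ];

    let mut view = RecentDownloadsView::new();
    view.refresh(&rows, 100, 90); // assumed 90-day window
    println!("after first refresh: {}", view.get(1));

    // New downloads arrive, but readers keep seeing the stored value...
    rows.push(DownloadRow { crate_id: 1, day: 100, downloads: 5 });
    println!("before next refresh: {}", view.get(1));

    // ...until the periodic refresh runs again.
    view.refresh(&rows, 100, 90);
    println!("after next refresh: {}", view.get(1));
}
```

How often `refresh` runs is the only knob: tying it to an existing periodic job keeps the data slightly delayed, and running it less often trades freshness for less load.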

Even though repopulating the view currently only takes ~500ms (same
amount of time as the query), I've opted to do it concurrently to avoid
locking the table for long periods of time as we continue to scale.
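
In PostgreSQL, refreshing without blocking readers is done with `REFRESH MATERIALIZED VIEW CONCURRENTLY`, which requires a unique index on the view. The sketch below is only an in-memory analogy in Rust, not the code from this PR: `SharedView` and `refresh_concurrently` are hypothetical names. It shows the same idea of doing the slow recomputation without holding a lock and then swapping the result in quickly, so readers are never blocked for the duration of the recompute.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// An immutable snapshot of per-crate totals, shared cheaply via `Arc`.
type Snapshot = Arc<HashMap<u32, u64>>;

/// Hypothetical stand-in for a view that can be refreshed while being read.
struct SharedView {
    current: RwLock<Snapshot>,
}

impl SharedView {
    fn new() -> Self {
        SharedView { current: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// Readers hold the lock only long enough to clone the `Arc`.
    fn snapshot(&self) -> Snapshot {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Run the slow computation with no lock held, then swap the result
    /// in under a short write lock.
    fn refresh_concurrently<F>(&self, compute: F)
    where
        F: FnOnce() -> HashMap<u32, u64>,
    {
        let fresh = Arc::new(compute()); // slow part, readers unaffected
        *self.current.write().unwrap() = fresh; // brief swap
    }
}

fn main() {
    let view = Arc::new(SharedView::new());
    view.refresh_concurrently(|| HashMap::from([(1, 10)]));

    // A reader that grabbed a snapshot before the refresh keeps a
    // consistent (if stale) picture.
    let before = view.snapshot();

    let writer = {
        let view = Arc::clone(&view);
        thread::spawn(move || view.refresh_concurrently(|| HashMap::from([(1, 15)])))
    };
    writer.join().unwrap();

    println!("old snapshot: {:?}", before.get(&1));
    println!("new snapshot: {:?}", view.snapshot().get(&1));
}
```

The cost of this approach is that the old and new results briefly exist side by side, which is the same kind of trade-off a concurrent refresh makes in the database.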

Crate details is still using the `crate_downloads` table, so it's a bit
closer to real time. When we're selecting this for a single crate, the
query is already fast enough. We may want to change this in the future.
We could also consider getting rid of `crate_downloads` entirely, and
populate the view from `version_downloads`. This will cause more CPU
load when we refresh the view, but it probably doesn't matter.

@sgrif sgrif force-pushed the sg-make-recent-downloads-fast branch from 4b6ed40 to c6101b5 Compare April 25, 2018 19:35
Member

@ashleygwilliams ashleygwilliams left a comment

ok, i think this is good. the only thing this might need is a way to share with people that the download data is no longer realtime. i'm not sure if such a place exists, but that's my only concern.

@ashleygwilliams
Member

bors: r+

bors-voyager bot added a commit that referenced this pull request Apr 25, 2018
1363: Make recent downloads fast r=ashleygwilliams

@bors-voyager
Contributor

bors-voyager bot commented Apr 25, 2018

Build succeeded

@bors-voyager bors-voyager bot merged commit c6101b5 into rust-lang:master Apr 25, 2018
@sgrif
Contributor Author

sgrif commented Apr 25, 2018

I can confirm it is indeed now fast.

@sgrif sgrif deleted the sg-make-recent-downloads-fast branch April 26, 2018 23:25
2 participants