Switch most recently downloaded summary query to be faster #1312
Conversation
Related to #1304. Because we only want to show the top 10 for /api/v1/summary, move the LIMIT inside the query that computes the top downloads, rather than joining on everything and then limiting. This should help alleviate at least some of the slowdowns we were seeing today with the /api/v1/summary call; there are other queries contributing load that will need other mitigation strategies, though.
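In condensed form, the shape of the change (the full queries and EXPLAIN ANALYZE output appear in the merge commit below):

```
-- Before: join crates against every recent crate_downloads row,
-- aggregate all ~15k groups, and only then keep the top 10.
SELECT crates.*
FROM crates
LEFT OUTER JOIN crate_downloads
    ON crates.id = crate_downloads.crate_id
    AND crate_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
GROUP BY crates.id
ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST
LIMIT 10;

-- After: aggregate and LIMIT inside a subquery first, so the join
-- only ever touches the ten winning crate ids.
SELECT crates.*
FROM crates
JOIN (
    SELECT crate_downloads.crate_id, SUM(crate_downloads.downloads)
    FROM crate_downloads
    WHERE crate_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
    GROUP BY crate_downloads.crate_id
    ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST
    LIMIT 10
) cd ON crates.id = cd.crate_id
ORDER BY cd.sum DESC NULLS LAST;
```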
force-pushed from bb96278 to 86cd1d8
To my eye, these queries are equivalent and the performance improvements look encouraging.
bors: r+
```
ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST \
LIMIT 10 \
) cd ON crates.id = cd.crate_id \
ORDER BY cd.sum DESC NULLS LAST",
```
I believe `NULLS LAST` can be dropped because this is an `INNER JOIN`, which should not include any null sums. It doesn't hurt to be explicit, though.
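A toy illustration of the point, with a throwaway table (not from the codebase):

```
-- downloads is NOT NULL, and SUM() over a non-empty group of NOT NULL
-- values is never NULL, so every sum produced by the inner GROUP BY is
-- non-NULL and "NULLS LAST" has nothing to reorder after the INNER JOIN.
CREATE TEMP TABLE t (crate_id int, downloads int NOT NULL);
INSERT INTO t VALUES (1, 5), (1, 7), (2, 3);
SELECT crate_id, SUM(downloads) AS sum
FROM t
GROUP BY crate_id
ORDER BY sum DESC;   -- equivalent to "ORDER BY sum DESC NULLS LAST" here
```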
1312: Switch most recently downloaded summary query to be faster r=jtgeibel

Related to #1304.

Because we only want to show the top 10 for /api/v1/summary, move the LIMIT within the query for the top downloads rather than joining on everything and then limiting.

This should help alleviate at least some of the slowdowns we were seeing today with the /api/v1/summary call; there are other queries that are taking up some load that will need other mitigation strategies though :(

Here's an explain analyze from the old query on a production dump from today:

```
explain analyze
SELECT "crates"."id", "crates"."name", "crates"."updated_at", "crates"."created_at",
       "crates"."downloads", "crates"."description", "crates"."homepage",
       "crates"."documentation", "crates"."readme", "crates"."readme_file",
       "crates"."license", "crates"."repository", "crates"."max_upload_size"
FROM ("crates" LEFT OUTER JOIN "crate_downloads"
      ON "crates"."id" = "crate_downloads"."crate_id"
      AND "crate_downloads"."date" > date(CURRENT_TIMESTAMP - INTERVAL '90 DAYS'))
GROUP BY "crates"."id"
ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST
LIMIT 10;

                                         QUERY PLAN
------------------------------------------------------------------------------------------------
 Limit  (cost=81546.14..81546.17 rows=10 width=858) (actual time=546.007..546.009 rows=10 loops=1)
   ->  Sort  (cost=81546.14..81583.65 rows=15004 width=858) (actual time=546.006..546.007 rows=10 loops=1)
         Sort Key: (sum(crate_downloads.downloads)) DESC NULLS LAST
         Sort Method: top-N heapsort  Memory: 37kB
         ->  GroupAggregate  (cost=65223.31..81221.91 rows=15004 width=858) (actual time=273.398..539.672 rows=15004 loops=1)
               Group Key: crates.id
               ->  Merge Left Join  (cost=65223.31..78880.53 rows=438269 width=854) (actual time=273.331..485.878 rows=435230 loops=1)
                     Merge Cond: (crates.id = crate_downloads.crate_id)
                     ->  Index Scan using packages_pkey on crates  (cost=0.29..5950.28 rows=15004 width=850) (actual time=0.009..9.989 rows=15004 loops=1)
                     ->  Materialize  (cost=65223.03..67414.37 rows=438269 width=8) (actual time=273.317..386.191 rows=435229 loops=1)
                           ->  Sort  (cost=65223.03..66318.70 rows=438269 width=8) (actual time=273.314..332.477 rows=435229 loops=1)
                                 Sort Key: crate_downloads.crate_id
                                 Sort Method: external merge  Disk: 7712kB
                                 ->  Index Scan using index_crate_downloads_date on crate_downloads  (cost=0.44..18162.03 rows=438269 width=8) (actual time=0.017..112.539 rows=435229 loops=1)
                                       Index Cond: (date > date((CURRENT_TIMESTAMP - '90 days'::interval)))
 Planning time: 0.502 ms
 Execution time: 605.679 ms
(17 rows)
```

And the new query:

```
explain analyze
SELECT crates.*
FROM crates
JOIN (
    SELECT crate_downloads.crate_id, SUM(crate_downloads.downloads)
    FROM crate_downloads
    WHERE crate_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
    GROUP BY crate_downloads.crate_id
    ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST
    LIMIT 10
) cd ON crates.id = cd.crate_id
ORDER BY cd.sum DESC NULLS LAST;

                                         QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=20323.24..20402.11 rows=10 width=1057) (actual time=180.585..180.682 rows=10 loops=1)
   ->  Limit  (cost=20322.96..20322.98 rows=10 width=12) (actual time=180.537..180.539 rows=10 loops=1)
         ->  Sort  (cost=20322.96..20350.82 rows=11146 width=12) (actual time=180.536..180.537 rows=10 loops=1)
               Sort Key: (sum(crate_downloads.downloads)) DESC NULLS LAST
               Sort Method: top-N heapsort  Memory: 25kB
               ->  Finalize HashAggregate  (cost=19970.64..20082.10 rows=11146 width=12) (actual time=176.709..178.856 rows=15003 loops=1)
                     Group Key: crate_downloads.crate_id
                     ->  Gather  (cost=17518.52..19859.18 rows=22292 width=12) (actual time=150.947..163.257 rows=44654 loops=1)
                           Workers Planned: 2
                           Workers Launched: 2
                           ->  Partial HashAggregate  (cost=16518.52..16629.98 rows=11146 width=12) (actual time=146.173..149.575 rows=14885 loops=3)
                                 Group Key: crate_downloads.crate_id
                                 ->  Parallel Index Scan using index_crate_downloads_date on crate_downloads  (cost=0.44..15605.46 rows=182612 width=8) (actual time=0.045..86.019 rows=145076 loops=3)
                                       Index Cond: (date > date((CURRENT_TIMESTAMP - '90 days'::interval)))
   ->  Index Scan using packages_pkey on crates  (cost=0.29..7.90 rows=1 width=1049) (actual time=0.012..0.012 rows=1 loops=10)
         Index Cond: (id = crate_downloads.crate_id)
 Planning time: 1.253 ms
 Execution time: 181.550 ms
(18 rows)
```
Build succeeded
While rust-lang#1312 improved that query a lot, we can shave off even more time by giving it an index containing only the data it's going to need. This index would need to be refreshed daily by a Heroku scheduler task at 00:01 UTC.

I don't have a large enough dataset locally to get a great perf comparison, so it'd be great to have a test comparison on prod before we merge this.

The funkiness in how we go about creating/dropping the index comes from the fact that we can't use bind parameters to create indices, nor can we use `CURRENT_DATE` directly. If we want to create the index concurrently (I don't think we need to), then we will have to do string concatenation in Rust.
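As a sketch of the workaround being described (the index name and date literal here are illustrative, not the exact migration):

```
-- CREATE INDEX takes no bind parameters, and a partial index predicate
-- may only use immutable expressions, so CURRENT_DATE can't appear here.
-- The cutoff below would instead be rendered into the SQL string from
-- Rust (e.g. with format!) each day before the statement is executed.
DROP INDEX IF EXISTS recent_crate_downloads;   -- hypothetical name
CREATE INDEX recent_crate_downloads
    ON crate_downloads (crate_id, downloads)
    WHERE date > '2018-02-22';                 -- stands in for "90 days ago"
```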
Prior to this commit, `bin/update-downloads` would hold a lock on every row it updates for the entire batch of 1000. Each row gets locked when it is updated, but the lock isn't released until the whole transaction is committed. This means that the first row in the batch is locked for the whole transaction, while the last row in the batch isn't locked for longer than it should be.

Processing 1000 rows routinely takes more than 30 seconds to complete. Whenever it does, this causes requests to time out. Even when it doesn't, this locking behavior is the reason our 99th percentile response time has gotten so bad. The endpoint affected by this is `/crates/{name}/{version}/download`. In other words, this bug is causing `cargo install` to fail, and even when it doesn't, it is causing downloads to take seconds when they should take milliseconds.

By establishing a transaction per row instead of per batch, we retain atomicity without holding a lock for the whole batch. To keep everything atomic, we need to update the global counter per row instead of per batch. This relies on the invariant that no rows in `version_downloads` get touched once `date != CURRENT_DATE`, other than by this script.

I believe this is also the root cause of rust-lang#1304, and that we can maybe revert rust-lang#1312.
1345: Don't lock more than 1 row at a time in `update-downloads` r=sgrif
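To make the locking change concrete, here is a schematic of the two transaction shapes. This is a sketch, not the actual script: the table and column names (`versions.downloads`, `metadata.total_downloads`) and the literal ids and amounts are stand-ins for whatever `bin/update-downloads` really touches.

```
-- Before: one transaction per 1000-row batch. The first row updated
-- stays locked until the final COMMIT, routinely 30+ seconds later.
BEGIN;
UPDATE versions SET downloads = downloads + 42 WHERE id = 1;   -- row 1 locked here...
-- ... up to 999 more UPDATEs, each acquiring another row lock ...
UPDATE versions SET downloads = downloads + 7 WHERE id = 1000;
COMMIT;                                                        -- ...released only here

-- After: one transaction per row. Each lock is held only as long as that
-- row's own work takes, and the global counter is bumped in the same
-- transaction so every row's update remains atomic on its own.
BEGIN;
UPDATE versions SET downloads = downloads + 42 WHERE id = 1;
UPDATE metadata SET total_downloads = total_downloads + 42;
COMMIT;
```

The trade-off is more transactions overall, but row-level lock hold times drop from the length of the whole batch to the length of a single update.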