Switch most recently downloaded summary query to be faster #1312
Conversation
Related to #1304. Because we only want to show the top 10 for /api/v1/summary, move the LIMIT inside the query that computes the top downloads, rather than joining on everything and then limiting. This should help alleviate at least some of the slowdowns we were seeing today with the /api/v1/summary call; there are other queries contributing load that will need other mitigation strategies, though.
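In condensed form, the shape of the change (the full queries and EXPLAIN ANALYZE output appear in the merge commit below):

```
-- Before: join crates against every recent crate_downloads row,
-- aggregate all ~15k groups, and only then keep the top 10.
SELECT crates.*
FROM crates
LEFT OUTER JOIN crate_downloads
    ON crates.id = crate_downloads.crate_id
    AND crate_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
GROUP BY crates.id
ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST
LIMIT 10;

-- After: aggregate and LIMIT inside a subquery first, so the join
-- only ever touches the ten winning crate ids.
SELECT crates.*
FROM crates
JOIN (
    SELECT crate_downloads.crate_id, SUM(crate_downloads.downloads)
    FROM crate_downloads
    WHERE crate_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
    GROUP BY crate_downloads.crate_id
    ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST
    LIMIT 10
) cd ON crates.id = cd.crate_id
ORDER BY cd.sum DESC NULLS LAST;
```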
force-pushed from bb96278 to 86cd1d8
To my eye, these queries are equivalent and the performance improvements look encouraging.
bors: r+
```
ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST \
LIMIT 10 \
) cd ON crates.id = cd.crate_id \
ORDER BY cd.sum DESC NULLS LAST",
```
I believe `NULLS LAST` can be dropped because this is an `INNER JOIN`, which should not include any null sums. It doesn't hurt to be explicit, though.
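A toy illustration of the point, with a throwaway table (not from the codebase):

```
-- downloads is NOT NULL, and SUM() over a non-empty group of NOT NULL
-- values is never NULL, so every sum produced by the inner GROUP BY is
-- non-NULL and "NULLS LAST" has nothing to reorder after the INNER JOIN.
CREATE TEMP TABLE t (crate_id int, downloads int NOT NULL);
INSERT INTO t VALUES (1, 5), (1, 7), (2, 3);
SELECT crate_id, SUM(downloads) AS sum
FROM t
GROUP BY crate_id
ORDER BY sum DESC;   -- equivalent to "ORDER BY sum DESC NULLS LAST" here
```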
1312: Switch most recently downloaded summary query to be faster r=jtgeibel

Related to #1304.

Because we only want to show the top 10 for /api/v1/summary, move the LIMIT within the query for the top downloads rather than joining on everything and then limiting.

This should help alleviate at least some of the slowdowns we were seeing today with the /api/v1/summary call; there are other queries that are taking up some load that will need other mitigation strategies though :(

Here's an explain analyze from the old query on a production dump from today:

```
explain analyze
SELECT "crates"."id", "crates"."name", "crates"."updated_at", "crates"."created_at",
       "crates"."downloads", "crates"."description", "crates"."homepage",
       "crates"."documentation", "crates"."readme", "crates"."readme_file",
       "crates"."license", "crates"."repository", "crates"."max_upload_size"
FROM ("crates" LEFT OUTER JOIN "crate_downloads"
      ON "crates"."id" = "crate_downloads"."crate_id"
      AND "crate_downloads"."date" > date(CURRENT_TIMESTAMP - INTERVAL '90 DAYS'))
GROUP BY "crates"."id"
ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST
LIMIT 10;

                                         QUERY PLAN
------------------------------------------------------------------------------------------------
 Limit  (cost=81546.14..81546.17 rows=10 width=858) (actual time=546.007..546.009 rows=10 loops=1)
   ->  Sort  (cost=81546.14..81583.65 rows=15004 width=858) (actual time=546.006..546.007 rows=10 loops=1)
         Sort Key: (sum(crate_downloads.downloads)) DESC NULLS LAST
         Sort Method: top-N heapsort  Memory: 37kB
         ->  GroupAggregate  (cost=65223.31..81221.91 rows=15004 width=858) (actual time=273.398..539.672 rows=15004 loops=1)
               Group Key: crates.id
               ->  Merge Left Join  (cost=65223.31..78880.53 rows=438269 width=854) (actual time=273.331..485.878 rows=435230 loops=1)
                     Merge Cond: (crates.id = crate_downloads.crate_id)
                     ->  Index Scan using packages_pkey on crates  (cost=0.29..5950.28 rows=15004 width=850) (actual time=0.009..9.989 rows=15004 loops=1)
                     ->  Materialize  (cost=65223.03..67414.37 rows=438269 width=8) (actual time=273.317..386.191 rows=435229 loops=1)
                           ->  Sort  (cost=65223.03..66318.70 rows=438269 width=8) (actual time=273.314..332.477 rows=435229 loops=1)
                                 Sort Key: crate_downloads.crate_id
                                 Sort Method: external merge  Disk: 7712kB
                                 ->  Index Scan using index_crate_downloads_date on crate_downloads  (cost=0.44..18162.03 rows=438269 width=8) (actual time=0.017..112.539 rows=435229 loops=1)
                                       Index Cond: (date > date((CURRENT_TIMESTAMP - '90 days'::interval)))
 Planning time: 0.502 ms
 Execution time: 605.679 ms
(17 rows)
```

And the new query:

```
explain analyze
SELECT crates.*
FROM crates
JOIN (
    SELECT crate_downloads.crate_id, SUM(crate_downloads.downloads)
    FROM crate_downloads
    WHERE crate_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
    GROUP BY crate_downloads.crate_id
    ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST
    LIMIT 10
) cd ON crates.id = cd.crate_id
ORDER BY cd.sum DESC NULLS LAST;

                                         QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop  (cost=20323.24..20402.11 rows=10 width=1057) (actual time=180.585..180.682 rows=10 loops=1)
   ->  Limit  (cost=20322.96..20322.98 rows=10 width=12) (actual time=180.537..180.539 rows=10 loops=1)
         ->  Sort  (cost=20322.96..20350.82 rows=11146 width=12) (actual time=180.536..180.537 rows=10 loops=1)
               Sort Key: (sum(crate_downloads.downloads)) DESC NULLS LAST
               Sort Method: top-N heapsort  Memory: 25kB
               ->  Finalize HashAggregate  (cost=19970.64..20082.10 rows=11146 width=12) (actual time=176.709..178.856 rows=15003 loops=1)
                     Group Key: crate_downloads.crate_id
                     ->  Gather  (cost=17518.52..19859.18 rows=22292 width=12) (actual time=150.947..163.257 rows=44654 loops=1)
                           Workers Planned: 2
                           Workers Launched: 2
                           ->  Partial HashAggregate  (cost=16518.52..16629.98 rows=11146 width=12) (actual time=146.173..149.575 rows=14885 loops=3)
                                 Group Key: crate_downloads.crate_id
                                 ->  Parallel Index Scan using index_crate_downloads_date on crate_downloads  (cost=0.44..15605.46 rows=182612 width=8) (actual time=0.045..86.019 rows=145076 loops=3)
                                       Index Cond: (date > date((CURRENT_TIMESTAMP - '90 days'::interval)))
   ->  Index Scan using packages_pkey on crates  (cost=0.29..7.90 rows=1 width=1049) (actual time=0.012..0.012 rows=1 loops=10)
         Index Cond: (id = crate_downloads.crate_id)
 Planning time: 1.253 ms
 Execution time: 181.550 ms
(18 rows)
```
Build succeeded
While rust-lang#1312 improved that query a lot, we can shave off even more time by giving it an index containing only the data it's going to need. This index would need to be refreshed daily by a Heroku scheduler task at 00:01 UTC.

I don't have a large enough dataset locally to get a great perf comparison, so it'd be great to have a test comparison on prod before we merge this.

The funkiness in how we go about creating/dropping the index comes from the fact that we can't use bind parameters to create indices, nor can we use `CURRENT_DATE` directly. If we want to create the index concurrently (I don't think we need to), then we will have to do string concatenation in Rust.
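As a sketch of the workaround being described (the index name and date literal here are illustrative, not the exact migration):

```
-- CREATE INDEX takes no bind parameters, and a partial index predicate
-- may only use immutable expressions, so CURRENT_DATE can't appear here.
-- The cutoff below would instead be rendered into the SQL string from
-- Rust (e.g. with format!) each day before the statement is executed.
DROP INDEX IF EXISTS recent_crate_downloads;   -- hypothetical name
CREATE INDEX recent_crate_downloads
    ON crate_downloads (crate_id, downloads)
    WHERE date > '2018-02-22';                 -- stands in for "90 days ago"
```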
Prior to this commit, `bin/update-downloads` would hold a lock on every row it updates for the entire batch of 1000. Each row gets locked when it is updated, but the lock isn't released until the whole transaction is committed. This means that the first row in the batch is locked for the whole transaction, while the last row in the batch isn't locked for longer than it should be.

Processing 1000 rows routinely takes more than 30 seconds to complete. Whenever it does, this causes requests to time out. Even when it doesn't, this locking behavior is the reason our 99th percentile response time has gotten so bad. The endpoint affected by this is `/crates/{name}/{version}/download`. In other words, this bug is causing `cargo install` to fail, and even when it doesn't, it is causing downloads to take seconds when they should take milliseconds.

By establishing a transaction per row instead of per batch, we retain atomicity without holding a lock for the whole batch. To keep everything atomic, we need to update the global counter per row instead of per batch. This relies on the invariant that no rows in `version_downloads` get touched once `date != CURRENT_DATE`, other than by this script.

I believe this is also the root cause of rust-lang#1304, and that we can maybe revert rust-lang#1312.
1345: Don't lock more than 1 row at a time in `update-downloads` r=sgrif
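To make the locking change concrete, here is a schematic of the two transaction shapes. This is a sketch, not the actual script: the table and column names (`versions.downloads`, `metadata.total_downloads`) and the literal ids and amounts are stand-ins for whatever `bin/update-downloads` really touches.

```
-- Before: one transaction per 1000-row batch. The first row updated
-- stays locked until the final COMMIT, routinely 30+ seconds later.
BEGIN;
UPDATE versions SET downloads = downloads + 42 WHERE id = 1;   -- row 1 locked here...
-- ... up to 999 more UPDATEs, each acquiring another row lock ...
UPDATE versions SET downloads = downloads + 7 WHERE id = 1000;
COMMIT;                                                        -- ...released only here

-- After: one transaction per row. Each lock is held only as long as that
-- row's own work takes, and the global counter is bumped in the same
-- transaction so every row's update remains atomic on its own.
BEGIN;
UPDATE versions SET downloads = downloads + 42 WHERE id = 1;
UPDATE metadata SET total_downloads = total_downloads + 42;
COMMIT;
```

The trade-off is more transactions overall, but row-level lock hold times drop from the length of the whole batch to the length of a single update.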