Switch most recently downloaded summary query to be faster #1312


Conversation

@carols10cents (Member) commented Apr 4, 2018

Related to #1304.

Because we only want to show the top 10 for /api/v1/summary, move the
LIMIT within the query for the top downloads rather than joining on
everything and then limiting.

This should help alleviate at least some of the slowdowns we were seeing today with the /api/v1/summary call; there are other queries taking up load that will need other mitigation strategies, though :(
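
Schematically, the change pushes the aggregation and the LIMIT into a subquery so the join only ever touches ten rows (simplified from the full queries below):

```sql
-- Before: join every crate against 90 days of downloads, then group, sort, and limit
SELECT crates.*
FROM crates
LEFT OUTER JOIN crate_downloads
  ON crates.id = crate_downloads.crate_id
 AND crate_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
GROUP BY crates.id
ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST
LIMIT 10;

-- After: compute the top 10 crate ids first, then join just those rows
SELECT crates.*
FROM crates
JOIN (
  SELECT crate_id, SUM(downloads)
  FROM crate_downloads
  WHERE date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
  GROUP BY crate_id
  ORDER BY SUM(downloads) DESC NULLS LAST
  LIMIT 10
) cd ON crates.id = cd.crate_id
ORDER BY cd.sum DESC NULLS LAST;
```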

Here's an explain analyze from the old query on a production dump from today:

explain analyze SELECT "crates"."id", "crates"."name", "crates"."updated_at", "crates"."created_at", "crates"."downloads", "crates"."description", "crates"."homepage", "crates"."documentation", "crates"."readme", "crates"."readme_file", "crates"."license", "crates"."repository", "crates"."max_upload_size" FROM (
 "crates" LEFT OUTER JOIN "crate_downloads" ON "crates"."id" = "crate_downloads"."crate_id" AND "crate_downloads"."date" > date(CURRENT_TIMESTAMP - INTERVAL '90 DAYS')
 ) GROUP BY "crates"."id" ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST LIMIT 10;
                                                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=81546.14..81546.17 rows=10 width=858) (actual time=546.007..546.009 rows=10 loops=1)
   ->  Sort  (cost=81546.14..81583.65 rows=15004 width=858) (actual time=546.006..546.007 rows=10 loops=1)
         Sort Key: (sum(crate_downloads.downloads)) DESC NULLS LAST
         Sort Method: top-N heapsort  Memory: 37kB
         ->  GroupAggregate  (cost=65223.31..81221.91 rows=15004 width=858) (actual time=273.398..539.672 rows=15004 loops=1)
               Group Key: crates.id
               ->  Merge Left Join  (cost=65223.31..78880.53 rows=438269 width=854) (actual time=273.331..485.878 rows=435230 loops=1)
                     Merge Cond: (crates.id = crate_downloads.crate_id)
                     ->  Index Scan using packages_pkey on crates  (cost=0.29..5950.28 rows=15004 width=850) (actual time=0.009..9.989 rows=15004 loops=1)
                     ->  Materialize  (cost=65223.03..67414.37 rows=438269 width=8) (actual time=273.317..386.191 rows=435229 loops=1)
                           ->  Sort  (cost=65223.03..66318.70 rows=438269 width=8) (actual time=273.314..332.477 rows=435229 loops=1)
                                 Sort Key: crate_downloads.crate_id
                                 Sort Method: external merge  Disk: 7712kB
                                 ->  Index Scan using index_crate_downloads_date on crate_downloads  (cost=0.44..18162.03 rows=438269 width=8) (actual time=0.017..112.539 rows=435229 loops=1)
                                       Index Cond: (date > date((CURRENT_TIMESTAMP - '90 days'::interval)))
 Planning time: 0.502 ms
 Execution time: 605.679 ms
(17 rows)
```

And the new query:

```
explain analyze SELECT crates.* FROM crates JOIN ( SELECT crate_downloads.crate_id, SUM(crate_downloads.downloads) FROM crate_downloads WHERE crate_downloads.date > date(CURRENT_TIMESTAMP - INTERVAL '90 days') GROUP BY crate_downloads.crate_id ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST LIMIT 10 ) cd ON crates.id = cd.crate_id
ORDER BY cd.sum DESC NULLS LAST;
                                                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=20323.24..20402.11 rows=10 width=1057) (actual time=180.585..180.682 rows=10 loops=1)
   ->  Limit  (cost=20322.96..20322.98 rows=10 width=12) (actual time=180.537..180.539 rows=10 loops=1)
         ->  Sort  (cost=20322.96..20350.82 rows=11146 width=12) (actual time=180.536..180.537 rows=10 loops=1)
               Sort Key: (sum(crate_downloads.downloads)) DESC NULLS LAST
               Sort Method: top-N heapsort  Memory: 25kB
               ->  Finalize HashAggregate  (cost=19970.64..20082.10 rows=11146 width=12) (actual time=176.709..178.856 rows=15003 loops=1)
                     Group Key: crate_downloads.crate_id
                     ->  Gather  (cost=17518.52..19859.18 rows=22292 width=12) (actual time=150.947..163.257 rows=44654 loops=1)
                           Workers Planned: 2
                           Workers Launched: 2
                           ->  Partial HashAggregate  (cost=16518.52..16629.98 rows=11146 width=12) (actual time=146.173..149.575 rows=14885 loops=3)
                                 Group Key: crate_downloads.crate_id
                                 ->  Parallel Index Scan using index_crate_downloads_date on crate_downloads  (cost=0.44..15605.46 rows=182612 width=8) (actual time=0.045..86.019 rows=145076 loops=3)
                                       Index Cond: (date > date((CURRENT_TIMESTAMP - '90 days'::interval)))
   ->  Index Scan using packages_pkey on crates  (cost=0.29..7.90 rows=1 width=1049) (actual time=0.012..0.012 rows=1 loops=10)
         Index Cond: (id = crate_downloads.crate_id)
 Planning time: 1.253 ms
 Execution time: 181.550 ms
(18 rows)
```

@carols10cents requested a review from sgrif on April 4, 2018 at 00:18
@carols10cents force-pushed the rearrange-most-recent-downloads-query branch from bb96278 to 86cd1d8 on April 4, 2018 at 00:52
@jtgeibel (Member) left a comment

To my eye, these queries are equivalent and the performance improvements look encouraging.

bors: r+

On the new query string in the diff:

```
ORDER BY SUM(crate_downloads.downloads) DESC NULLS LAST \
LIMIT 10 \
) cd ON crates.id = cd.crate_id \
ORDER BY cd.sum DESC NULLS LAST",
```

I believe NULLS LAST can be dropped because this is an INNER JOIN which should not include any null sums. Doesn't hurt to be explicit though.
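
As an illustration of why (this example is not from the PR): every group the subquery emits exists because at least one crate_downloads row matched the date filter, so assuming downloads is a NOT NULL column, each sum is non-NULL:

```sql
-- Each group here is non-empty by construction, so SUM can return NULL
-- only if downloads itself is nullable; assuming a NOT NULL column,
-- ORDER BY cd.sum DESC is equivalent to ORDER BY cd.sum DESC NULLS LAST.
SELECT crate_id, SUM(downloads) AS sum
FROM crate_downloads
WHERE date > date(CURRENT_TIMESTAMP - INTERVAL '90 days')
GROUP BY crate_id;
```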

bors-voyager bot added a commit that referenced this pull request Apr 5, 2018
1312: Switch most recently downloaded summary query to be faster r=jtgeibel

bors-voyager bot (Contributor) commented Apr 5, 2018

Build succeeded

bors-voyager bot merged commit 86cd1d8 into rust-lang:master on Apr 5, 2018
sgrif added a commit to sgrif/crates.io that referenced this pull request Apr 5, 2018
While rust-lang#1312 improved that query a lot, we can shave off even more time
by giving it an index containing only the data it's going to need. This
index would need to be refreshed daily by a Heroku Scheduler task at
00:01 UTC.

I don't have a large enough dataset locally to get a great perf
comparison, so it'd be great to have a test comparison on prod before we
merge this.

The funkiness in how we go about creating/dropping the index is due to
the fact that we can't use bind parameters to create indices, nor can we
use `CURRENT_DATE` directly. If we want to create the index concurrently
(I don't think we need to), then we will have to do string concatenation
in Rust.
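
A minimal sketch of what such a migration might run, assuming a hypothetical index name (nothing below is taken verbatim from the commit). `CREATE INDEX` accepts no bind parameters, and a partial-index predicate must be immutable, so `CURRENT_DATE` can't appear in it; the date literal would be interpolated each day by the scheduled task:

```sql
-- Hypothetical names; the literal date stands in for CURRENT_DATE,
-- which Postgres rejects inside a partial-index predicate.
DROP INDEX IF EXISTS recent_crate_downloads;
CREATE INDEX recent_crate_downloads
  ON crate_downloads (crate_id, downloads)
  WHERE date > '2018-01-04';
```
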
sgrif added a commit to sgrif/crates.io that referenced this pull request Apr 13, 2018
Prior to this commit, `bin/update-downloads` would hold a lock on every
row it updates for the entire batch of 1000. Each row gets locked when
it is updated, but the lock isn't released until the whole transaction
is committed. This means that the first row in the batch is locked for
the whole transaction, while the last row in the batch isn't locked
longer than it should be.

Processing 1000 rows is routinely taking more than 30 seconds to
complete. Whenever it does, this causes requests to time out. Even when
it doesn't, this locking behavior is the reason that our 99th percentile
response time is getting to be so bad.

The endpoint affected by this is `/crates/{name}/{version}/download`. In
other words, this bug is causing `cargo install` to fail. Even when it
doesn't, this bug is causing it to take seconds when it should take
milliseconds.

By establishing a transaction per row instead of per batch, we retain
atomicity, without holding a lock for the whole batch. To keep
everything atomic, we need to update the global counter per row instead
of per batch.

This relies on the invariant that no rows in `version_downloads` get
touched once `date != CURRENT_DATE` other than by this script.

I believe this is also the root cause of rust-lang#1304, and that we can
maybe revert rust-lang#1312.
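
A schematic of the locking change, assuming hypothetical column and table names for the per-row counts and the global counter (the real script's SQL may differ):

```sql
-- Before (schematic): one transaction per batch of 1000 rows.
-- Row 1's lock is acquired at its UPDATE but held until COMMIT,
-- blocking concurrent /download requests that touch the same row.
BEGIN;
UPDATE version_downloads SET counted = downloads WHERE id = 1;
UPDATE version_downloads SET counted = downloads WHERE id = 2;
-- ... 998 more rows, each lock held until the final COMMIT ...
COMMIT;

-- After (schematic): one transaction per row; the global counter is
-- bumped in the same transaction, so each step stays atomic and no
-- lock outlives its own row's update.
BEGIN;
UPDATE version_downloads SET counted = downloads WHERE id = 1;
UPDATE metadata SET total_downloads = total_downloads + 7; -- hypothetical table and delta
COMMIT;
```
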
bors-voyager bot added a commit that referenced this pull request Apr 13, 2018
1345: Don't lock more than 1 row at a time in `update-downloads` r=sgrif
