Use the release phase to run migrations #1703

Merged
merged 1 commit into rust-lang:master on May 1, 2019

Conversation

sgrif
Contributor

@sgrif sgrif commented Apr 1, 2019

When we deploy today, the current process is as follows:

  • Build new code
    • If the build fails, the release is aborted and the failed build
      does not affect the cache
  • Shut down all dynos
    • At this point requests are being held by the router until the new
      dyno starts up. The 30 second timeout still applies, so we have to
      boot and process the queued requests in under 30 seconds or they
      will get a 503
  • Wait for shutdown to complete
  • Run database migrations
    • If this fails, our app crashes and all requests return 500 until
      we manually intervene
  • Boot app
    • If this fails, our app crashes and all requests return 500 until
      we manually intervene
  • Requests are processed

With this change, the flow will now be:

  • Build new code
  • Run database migrations
    • If this fails, the release will not be deployed. The build cache
      will still contain the results of this build, since the build itself
      succeeded.
  • Shut down all dynos
  • Wait for shutdown to complete
  • Boot app
  • Requests are processed
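On Heroku, this new flow hinges on the release phase, which is configured by adding a `release:` entry to the Procfile; Heroku runs that command after a successful build and before promoting the release. A minimal sketch (the binary names here are illustrative placeholders, not necessarily what crates.io ships):

```
release: ./target/release/migrate
web: ./target/release/server
```

If the `release:` command exits non-zero, the release is never promoted and the currently running dynos keep serving traffic.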

This has a few notable benefits:

  • If a migration fails in production (this has happened a few times), we
    will just not deploy the release instead of completely taking the app
    down
  • Since migrations now run while the old code is still serving
    traffic, we can make changes that take a long time to run, as long
    as they don't lock the table they operate on
  • Our app will boot more quickly, leading to shorter delays when we
    deploy

This doesn't come for free, of course. Since the old code will still be
serving requests while the migrations run, we'll have to be careful
about what we put in a single deploy.

  • Any destructive migrations will need to be deployed in at least two
    phases. First we deploy code that works with both the old and new
    schema, then we deploy the migration (along with code that only works
    on the new schema) separately.
    • We already follow this practice today in order to make sure that
      each individual deploy can be rolled back quickly
  • Migrations which perform long-running changes will need to ensure
    that those changes do not lock the table they're operating on
    • We mostly already practice this today, since we don't want to delay
      booting the app with a long migration
  • Migrations need to run in a transaction to avoid leaving things in
    a half-migrated state if something fails
    • Diesel already does this

Overall I think this tradeoff is worth it. Removing the risk of a failed
migration taking the app down is a huge win, and we already follow the
practices required to do this safely.

With this change, we may also want to consider enabling
[preboot](https://devcenter.heroku.com/articles/preboot), which ensures
that new dynos are running and accepting connections before shutting
down the old servers -- at the cost of making deploys take several
minutes longer.
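Preboot is a per-app Heroku feature flag; if we go that route, enabling it is a one-line CLI change (the app name below is a placeholder):

```
# Enable preboot: new dynos must be up and accepting connections
# before old ones are shut down (deploys take several minutes longer).
heroku features:enable preboot --app our-app-name
```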

@kzys
Contributor

kzys commented Apr 10, 2019

Looks good to me.

@jtgeibel
Member

jtgeibel commented May 1, 2019

LGTM!

@bors r+

@bors
Contributor

bors commented May 1, 2019

📌 Commit e46a1a5 has been approved by jtgeibel

@bors
Contributor

bors commented May 1, 2019

⌛ Testing commit e46a1a5 with merge 101a935...

bors added a commit that referenced this pull request May 1, 2019
Use the release phase to run migrations

@bors
Contributor

bors commented May 1, 2019

☀️ Test successful - checks-travis
Approved by: jtgeibel
Pushing 101a935 to master...

@bors bors merged commit e46a1a5 into rust-lang:master May 1, 2019
@sgrif sgrif deleted the sg-heroku-release-phase branch May 14, 2019 16:22