Use the release phase to run migrations #1703

Merged
merged 1 commit into rust-lang:master on May 1, 2019

Conversation

sgrif
Contributor

@sgrif sgrif commented Apr 1, 2019

When we deploy today, the current process is as follows:

  • Build new code
    • If the build fails, the release is aborted and the failed build
      does not affect the cache
  • Shut down all dynos
    • At this point requests are being held by the router until the new
      dyno starts up. The 30 second timeout still applies, so we have to
      boot and process the queued requests in under 30 seconds or they
      will get a 503
  • Wait for shutdown to complete
  • Run database migrations
    • If this fails, our app crashes and all requests return 500 until
      we manually intervene
  • Boot app
    • If this fails, our app crashes and all requests return 500 until
      we manually intervene
  • Requests are processed

With this change, the flow will now be:

  • Build new code
  • Run database migrations
    • If this fails, the release will not be deployed. The build cache
      will still contain the results of this build, since the build itself
      succeeded.
  • Shut down all dynos
  • Wait for shutdown to complete
  • Boot app
  • Requests are processed
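On Heroku, this new flow hinges on the release phase, which is configured by adding a `release:` entry to the Procfile; Heroku runs that command after a successful build and before promoting the release. A minimal sketch (the binary names here are illustrative placeholders, not necessarily what crates.io ships):

```
release: ./target/release/migrate
web: ./target/release/server
```

If the `release:` command exits non-zero, the release is never promoted and the currently running dynos keep serving traffic.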

This has a few notable benefits:

  • If a migration fails in production (this has happened a few times), we
    will just not deploy the release instead of completely taking the app
    down
  • Since migrations now run while the old code is still serving
    traffic, we can make changes that take a long time to run, as long
    as they don't lock the table they operate on
  • Our app will boot more quickly, leading to shorter delays when we
    deploy

This doesn't come for free, of course. Since the old code will still be
serving requests while the migrations run, we'll have to be careful
about what we put in a single deploy.

  • Any destructive migrations will need to be deployed in at least two
    phases. First we deploy code that works with both the old and new
    schema, then we deploy the migration (along with code that only works
    on the new schema) separately.
    • We already follow this practice today in order to make sure that
      each individual deploy can be rolled back quickly
  • Migrations which perform long-running changes will need to ensure
    that those changes do not lock the table they're operating on
    • We mostly already practice this today, since we don't want to delay
      booting the app with a long migration
  • Migrations need to run in a transaction to avoid leaving things in
    a half-migrated state if something fails
    • Diesel already does this

Overall I think this tradeoff is worth it. Removing the risk of a failed
migration taking the app down is a huge win, and we already follow the
practices required to do this safely.

With this change, we may also want to consider enabling
[preboot](https://devcenter.heroku.com/articles/preboot), which ensures
that new dynos are running and accepting connections before shutting
down the old servers -- at the cost of making deploys take several
minutes longer.
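Preboot is a per-app Heroku feature flag; if we go that route, enabling it is a one-line CLI change (the app name below is a placeholder):

```
# Enable preboot: new dynos must be up and accepting connections
# before old ones are shut down (deploys take several minutes longer).
heroku features:enable preboot --app our-app-name
```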

@kzys
Contributor

kzys commented Apr 10, 2019

Looks good to me.

@jtgeibel
Member

jtgeibel commented May 1, 2019

LGTM!

@bors r+

@bors
Contributor

bors commented May 1, 2019

📌 Commit e46a1a5 has been approved by jtgeibel

@bors
Contributor

bors commented May 1, 2019

⌛ Testing commit e46a1a5 with merge 101a935...

bors added a commit that referenced this pull request May 1, 2019
Use the release phase to run migrations

@bors
Contributor

bors commented May 1, 2019

☀️ Test successful - checks-travis
Approved by: jtgeibel
Pushing 101a935 to master...

@bors bors merged commit e46a1a5 into rust-lang:master May 1, 2019
@sgrif sgrif deleted the sg-heroku-release-phase branch May 14, 2019 16:22