Skip to content

Add infrastructure for pagerduty integration #1498

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 24, 2018

Conversation

sgrif
Copy link
Contributor

@sgrif sgrif commented Sep 24, 2018

Right our pagerduty notifications come from heroku email alerts and a
pingdom integration. We want to start paging on more specific metrics.
So far we've discussed:

  • When no crate is uploaded for N period of time (likely 4 hours)
  • When a background job has failed N times (likely 5)
  • When the background queue has N jobs (likely 5)

In order to do this we need to have some code in place that actually
starts an incident and pages whoever is on call. We'd also like to test
this code before we start relying on it, so I've added a binary we can
use to make sure everything is configured correctly.

@sgrif sgrif mentioned this pull request Sep 24, 2018
@ishitatsuyuki
Copy link
Contributor

I'm against hardcoding any kind of proprietary SaaS into the main product. The cloud-native way (that I'm familiar with) is to setup a Prometheus compatible metrics endpoint, and let Prometheus detect all the anomalies.

@sgrif
Copy link
Contributor Author

sgrif commented Sep 25, 2018

Setting up another system we have to manage isn't really in the cards right now. Pagerduty is the service we're using, and that is unlikely to change in the near future.

@carols10cents
Copy link
Member

It's a good point though that when we actually add the code that calls this code to send events to pagerduty, the code shouldn't do anything if pagerduty isn't configured so that this is completely optional for anyone else running this codebase (and also optional during development).

Taking a look now.

Copy link
Member

@carols10cents carols10cents left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No-op if not configured plz

src/on_call.rs Outdated
/// (potentially waking them up at 3 AM).
pub fn send(self) -> CargoResult<()> {
let api_token = env::var("PAGERDUTY_API_TOKEN")?;
let service_key = env::var("PAGERDUTY_INTEGRATION_KEY")?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, I didn't see this in my first skim. I'd like to see this code be a no-op if these variables aren't set, rather than returning an error, so that pagerduty usage is optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what you mean. This code is only ever going to get called from separate processes used in production to monitor metrics. This code compiles regardless of whether it's configured or not, and would never be called from code paths used outside of our monitors. If that process is run and not configured, we certainly want to be informed immediately rather than have our monitoring silently broken.

@sgrif
Copy link
Contributor Author

sgrif commented Sep 25, 2018

I think there's some confusion on when or how this code will be used. It's not intended to be called from the main app, it's intended to be called by separate processes which periodically check the invariants we want to page on (such as those described above). It's included in the main library instead of those binaries directly because I expect to be calling it from multiple binaries (and this was the most logical place to put it). Its usage is already optional by not running those binaries.

@ishitatsuyuki
Copy link
Contributor

So I think you're confusing PagerDuty as a monitoring tool. It's not, you're missing a layer.

You need a server handling and visualizing all the metrics, or you won't be able to see what's going on even if you get paged. I know you hate managing servers, so if then try Datadog. It's a SaaS, and they sponsors open source development.

Also, in this kind of implementation I guess you will need to code the triggers into application source. That's like a burden for changing them; I know that crates.io is deployed infrequently.

You also mentioned failing background jobs. Sentry will handle them in a brilliant way. You don't have to roll your own monitoring system.

Prometheus format is the industry standard for metrics endpoint. It's far easier to expose, as it's just a HTTP route.

@sgrif
Copy link
Contributor Author

sgrif commented Sep 25, 2018

@ishitatsuyuki This does not include the monitoring layer. The majority of the things we want to monitor are not yet implemented, as I've pointed out in the PR description. This is just the test code for what we want the end result to be. We can only trigger an alert on crate upload timings for now, and it has to have such a high threshold to avoid false positives that it's barely useful. However, we cannot move index updates to a background jobs without a mechanism in place to insure we are paged when that fails, and I'd like to make sure that works before we deploy it.

I take some offense at you coming into this PR and saying that I don't understand what monitoring is or what this service does. I'd appreciate it if you assume that anybody who opens a PR here knows what they're doing in the future.

@carols10cents
Copy link
Member

I think there's some confusion on when or how this code will be used.

Indeed!

It's not intended to be called from the main app,

There's not really much preventing it from being called from the main app, though-- could we at least add some comments about where it is and is not appropriate to call this code? I'd really prefer to encode this restriction by making this code not error if pagerduty isn't configured; is there a reason you're resisting that?

we cannot move index updates to a background jobs without a mechanism in place to insure we are paged when that fails

When I read this, I was imagining logic like this:

// in crate publish
if publish succeeds
    enqueue background job to update index
end

---

// In background job
if index update fails
    open pagerduty incident
end

The index updating will still need to be done by anyone running this app in development or anyone running this app without pagerduty configured; I'd like to make sure that we don't disrupt their usage of the app with this code.

I see your comments along these lines:

This code is only ever going to get called from separate processes used in production to monitor metrics.

and understand now that the logic I was imagining is not the logic you are imagining, but I had trouble getting from one to the other with the initial PR description :)

It's included in the main library instead of those binaries directly because I expect to be calling it from multiple binaries (and this was the most logical place to put it). Its usage is already optional by not running those binaries.

What about making a separate crate entirely for these binaries, and then making that crate an optional dependency? Just trying to brainstorm other ways to keep this code from sneaking into the main app.

@sgrif
Copy link
Contributor Author

sgrif commented Sep 26, 2018

is there a reason you're resisting that?

Because when we expect it to be configured, I'd like to get an error if we forgot to configure it, rather than silently not getting paged.

The index updating will still need to be done by anyone running this app in development or anyone running this app without pagerduty configured;

That's not how the background job works though. When a job fails it will be retried with exponential backoff to let it succeed eventually without getting incidents from network blips or minor github downtime. The job might also not be getting run because the runner itself is broken. This is why we have a separate monitor process, whose only purpose is to monitor everything else and page when bad things happen.

What about making a separate crate entirely for these binaries, and then making that crate an optional dependency?

I'm not really sure how that helps here, the app would just fail to compile if you didn't pass that feature flag. I'm really not following your concern that this will somehow sneak into the main app. This isn't something that will just randomly start getting called from all over the place.

If it's really that big of a concern, I can just copy-paste the code in the binaries that need it :\

@jtgeibel
Copy link
Member

One thought I had that might make things a bit more clear, would be to move the library level code into src/bin/on_call/mod.rs, and then mod on_call; from each binary that needs it. It would still be possible to call this code from server.rs, but everything in there is part of the booting the server (where we can already panic if some env vars are missing) and it would be much more difficult to include this module in any of the request handling logic in the lib crate.

@carols10cents
Copy link
Member

I like @jtgeibel's suggestion.

Because when we expect it to be configured, I'd like to get an error if we forgot to configure it, rather than silently not getting paged.

Isn't that what the test-pagerduty bin is for though? To test that pagerduty is configured correctly?

I'm really not following your concern that this will somehow sneak into the main app. This isn't something that will just randomly start getting called from all over the place.

Ok then, would you humor me with some comments in on_call to the effect that this code shouldn't be called from any code involved with the main functionality of the app?

sgrif added 2 commits October 23, 2018 12:27
Right our pagerduty notifications come from heroku email alerts and a
pingdom integration. We want to start paging on more specific metrics.
So far we've discussed:

- When no crate is uploaded for N period of time (likely 4 hours)
- When a background job has failed N times (likely 5)
- When the background queue has N jobs (likely 5)

In order to do this we need to have some code in place that actually
starts an incident and pages whoever is on call. We'd also like to test
this code before we start relying on it, so I've added a binary we can
use to make sure everything is configured correctly.
We don't want to require pagerduty for external registries. But making
this code no-op if the environment variables are missing would mean we
don't get informed if things are mis-configured or un-configured.

We aren't expecting this to be called from most code, only a handful of
binaries that are specifically intended to monitor the service. In order
to prevent this code from accidentally getting used throughout the app
(though hopefully nobody is calling a function documented as "wakes up
whoever is on-call" willy nilly), we can move this into the `bin`
directory.

We're specifically using `on_call/mod.rs` instead of `on_call.rs` so
Cargo doesn't think that this file is supposed to be its own binary.
@sgrif sgrif force-pushed the sg-test-pagerduty branch from 4b6f676 to 2267c99 Compare October 23, 2018 18:36
@sgrif
Copy link
Contributor Author

sgrif commented Oct 23, 2018

@carols10cents I've moved the code so it cannot be called from the main app. Does that resolve your concerns? This is blocking the async git stuff so I'd like to move this forward

@carols10cents
Copy link
Member

Thank you!

bors: r+

@bors-voyager
Copy link
Contributor

bors-voyager bot commented Oct 24, 2018

Canceled

@carols10cents
Copy link
Member

oops guess i have to wait for travis to be done.

@carols10cents
Copy link
Member

bors: r+

bors-voyager bot added a commit that referenced this pull request Oct 24, 2018
1498: Add infrastructure for pagerduty integration r=carols10cents a=sgrif

Right our pagerduty notifications come from heroku email alerts and a
pingdom integration. We want to start paging on more specific metrics.
So far we've discussed:

- When no crate is uploaded for N period of time (likely 4 hours)
- When a background job has failed N times (likely 5)
- When the background queue has N jobs (likely 5)

In order to do this we need to have some code in place that actually
starts an incident and pages whoever is on call. We'd also like to test
this code before we start relying on it, so I've added a binary we can
use to make sure everything is configured correctly.

Co-authored-by: Sean Griffin <[email protected]>
Co-authored-by: Carol (Nichols || Goulding) <[email protected]>
@bors-voyager
Copy link
Contributor

bors-voyager bot commented Oct 24, 2018

Build succeeded

@bors-voyager bors-voyager bot merged commit 0fcd6d1 into rust-lang:master Oct 24, 2018
@sgrif sgrif deleted the sg-test-pagerduty branch March 9, 2019 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants