Add infrastructure for pagerduty integration #1498

sgrif · 2018-09-24T18:14:02Z

Right our pagerduty notifications come from heroku email alerts and a
pingdom integration. We want to start paging on more specific metrics.
So far we've discussed:

When no crate is uploaded for N period of time (likely 4 hours)
When a background job has failed N times (likely 5)
When the background queue has N jobs (likely 5)

In order to do this we need to have some code in place that actually
starts an incident and pages whoever is on call. We'd also like to test
this code before we start relying on it, so I've added a binary we can
use to make sure everything is configured correctly.

ishitatsuyuki · 2018-09-25T02:45:53Z

I'm against hardcoding any kind of proprietary SaaS into the main product. The cloud-native way (that I'm familiar with) is to setup a Prometheus compatible metrics endpoint, and let Prometheus detect all the anomalies.

sgrif · 2018-09-25T15:19:39Z

Setting up another system we have to manage isn't really in the cards right now. Pagerduty is the service we're using, and that is unlikely to change in the near future.

carols10cents · 2018-09-25T16:56:02Z

It's a good point though that when we actually add the code that calls this code to send events to pagerduty, the code shouldn't do anything if pagerduty isn't configured so that this is completely optional for anyone else running this codebase (and also optional during development).

Taking a look now.

carols10cents

No-op if not configured plz

carols10cents · 2018-09-25T17:00:01Z

src/on_call.rs

+    /// (potentially waking them up at 3 AM).
+    pub fn send(self) -> CargoResult<()> {
+        let api_token = env::var("PAGERDUTY_API_TOKEN")?;
+        let service_key = env::var("PAGERDUTY_INTEGRATION_KEY")?;


Ah yeah, I didn't see this in my first skim. I'd like to see this code be a no-op if these variables aren't set, rather than returning an error, so that pagerduty usage is optional.

I'm not sure I understand what you mean. This code is only ever going to get called from separate processes used in production to monitor metrics. This code compiles regardless of whether it's configured or not, and would never be called from code paths used outside of our monitors. If that process is run and not configured, we certainly want to be informed immediately rather than have our monitoring silently broken.

sgrif · 2018-09-25T21:01:57Z

I think there's some confusion on when or how this code will be used. It's not intended to be called from the main app, it's intended to be called by separate processes which periodically check the invariants we want to page on (such as those described above). It's included in the main library instead of those binaries directly because I expect to be calling it from multiple binaries (and this was the most logical place to put it). Its usage is already optional by not running those binaries.

ishitatsuyuki · 2018-09-25T22:42:22Z

So I think you're confusing PagerDuty as a monitoring tool. It's not, you're missing a layer.

You need a server handling and visualizing all the metrics, or you won't be able to see what's going on even if you get paged. I know you hate managing servers, so if then try Datadog. It's a SaaS, and they sponsors open source development.

Also, in this kind of implementation I guess you will need to code the triggers into application source. That's like a burden for changing them; I know that crates.io is deployed infrequently.

You also mentioned failing background jobs. Sentry will handle them in a brilliant way. You don't have to roll your own monitoring system.

Prometheus format is the industry standard for metrics endpoint. It's far easier to expose, as it's just a HTTP route.

sgrif · 2018-09-25T22:46:30Z

@ishitatsuyuki This does not include the monitoring layer. The majority of the things we want to monitor are not yet implemented, as I've pointed out in the PR description. This is just the test code for what we want the end result to be. We can only trigger an alert on crate upload timings for now, and it has to have such a high threshold to avoid false positives that it's barely useful. However, we cannot move index updates to a background jobs without a mechanism in place to insure we are paged when that fails, and I'd like to make sure that works before we deploy it.

I take some offense at you coming into this PR and saying that I don't understand what monitoring is or what this service does. I'd appreciate it if you assume that anybody who opens a PR here knows what they're doing in the future.

carols10cents · 2018-09-26T19:43:35Z

I think there's some confusion on when or how this code will be used.

Indeed!

It's not intended to be called from the main app,

There's not really much preventing it from being called from the main app, though-- could we at least add some comments about where it is and is not appropriate to call this code? I'd really prefer to encode this restriction by making this code not error if pagerduty isn't configured; is there a reason you're resisting that?

we cannot move index updates to a background jobs without a mechanism in place to insure we are paged when that fails

When I read this, I was imagining logic like this:

// in crate publish
if publish succeeds
    enqueue background job to update index
end

---

// In background job
if index update fails
    open pagerduty incident
end

The index updating will still need to be done by anyone running this app in development or anyone running this app without pagerduty configured; I'd like to make sure that we don't disrupt their usage of the app with this code.

I see your comments along these lines:

This code is only ever going to get called from separate processes used in production to monitor metrics.

and understand now that the logic I was imagining is not the logic you are imagining, but I had trouble getting from one to the other with the initial PR description :)

It's included in the main library instead of those binaries directly because I expect to be calling it from multiple binaries (and this was the most logical place to put it). Its usage is already optional by not running those binaries.

What about making a separate crate entirely for these binaries, and then making that crate an optional dependency? Just trying to brainstorm other ways to keep this code from sneaking into the main app.

sgrif · 2018-09-26T21:38:49Z

is there a reason you're resisting that?

Because when we expect it to be configured, I'd like to get an error if we forgot to configure it, rather than silently not getting paged.

The index updating will still need to be done by anyone running this app in development or anyone running this app without pagerduty configured;

That's not how the background job works though. When a job fails it will be retried with exponential backoff to let it succeed eventually without getting incidents from network blips or minor github downtime. The job might also not be getting run because the runner itself is broken. This is why we have a separate monitor process, whose only purpose is to monitor everything else and page when bad things happen.

What about making a separate crate entirely for these binaries, and then making that crate an optional dependency?

I'm not really sure how that helps here, the app would just fail to compile if you didn't pass that feature flag. I'm really not following your concern that this will somehow sneak into the main app. This isn't something that will just randomly start getting called from all over the place.

If it's really that big of a concern, I can just copy-paste the code in the binaries that need it :\

jtgeibel · 2018-09-26T23:02:59Z

One thought I had that might make things a bit more clear, would be to move the library level code into src/bin/on_call/mod.rs, and then mod on_call; from each binary that needs it. It would still be possible to call this code from server.rs, but everything in there is part of the booting the server (where we can already panic if some env vars are missing) and it would be much more difficult to include this module in any of the request handling logic in the lib crate.

carols10cents · 2018-09-28T13:56:37Z

I like @jtgeibel's suggestion.

Because when we expect it to be configured, I'd like to get an error if we forgot to configure it, rather than silently not getting paged.

Isn't that what the test-pagerduty bin is for though? To test that pagerduty is configured correctly?

I'm really not following your concern that this will somehow sneak into the main app. This isn't something that will just randomly start getting called from all over the place.

Ok then, would you humor me with some comments in on_call to the effect that this code shouldn't be called from any code involved with the main functionality of the app?

Right our pagerduty notifications come from heroku email alerts and a pingdom integration. We want to start paging on more specific metrics. So far we've discussed: - When no crate is uploaded for N period of time (likely 4 hours) - When a background job has failed N times (likely 5) - When the background queue has N jobs (likely 5) In order to do this we need to have some code in place that actually starts an incident and pages whoever is on call. We'd also like to test this code before we start relying on it, so I've added a binary we can use to make sure everything is configured correctly.

We don't want to require pagerduty for external registries. But making this code no-op if the environment variables are missing would mean we don't get informed if things are mis-configured or un-configured. We aren't expecting this to be called from most code, only a handful of binaries that are specifically intended to monitor the service. In order to prevent this code from accidentally getting used throughout the app (though hopefully nobody is calling a function documented as "wakes up whoever is on-call" willy nilly), we can move this into the `bin` directory. We're specifically using `on_call/mod.rs` instead of `on_call.rs` so Cargo doesn't think that this file is supposed to be its own binary.

sgrif · 2018-10-23T18:38:05Z

@carols10cents I've moved the code so it cannot be called from the main app. Does that resolve your concerns? This is blocking the async git stuff so I'd like to move this forward

carols10cents · 2018-10-24T17:58:51Z

Thank you!

bors: r+

bors-voyager · 2018-10-24T17:59:02Z

Canceled

carols10cents · 2018-10-24T18:01:39Z

oops guess i have to wait for travis to be done.

carols10cents · 2018-10-24T18:36:11Z

bors: r+

1498: Add infrastructure for pagerduty integration r=carols10cents a=sgrif Right our pagerduty notifications come from heroku email alerts and a pingdom integration. We want to start paging on more specific metrics. So far we've discussed: - When no crate is uploaded for N period of time (likely 4 hours) - When a background job has failed N times (likely 5) - When the background queue has N jobs (likely 5) In order to do this we need to have some code in place that actually starts an incident and pages whoever is on call. We'd also like to test this code before we start relying on it, so I've added a binary we can use to make sure everything is configured correctly. Co-authored-by: Sean Griffin <[email protected]> Co-authored-by: Carol (Nichols || Goulding) <[email protected]>

bors-voyager · 2018-10-24T19:06:21Z

Build succeeded

continuous-integration/travis-ci/push

sgrif mentioned this pull request Sep 24, 2018

Rollup of #1498 and #1499 #1500

Closed

carols10cents requested changes Sep 25, 2018

View reviewed changes

sgrif added 2 commits October 23, 2018 12:27

sgrif force-pushed the sg-test-pagerduty branch from 4b6f676 to 2267c99 Compare October 23, 2018 18:36

carols10cents approved these changes Oct 24, 2018

View reviewed changes

Merge remote-tracking branch 'upstream/master' into sg-test-pagerduty

0fcd6d1

bors-voyager bot merged commit 0fcd6d1 into rust-lang:master Oct 24, 2018

sgrif deleted the sg-test-pagerduty branch March 9, 2019 01:34

Add infrastructure for pagerduty integration #1498

Add infrastructure for pagerduty integration #1498

Uh oh!

Conversation

sgrif commented Sep 24, 2018

Uh oh!

ishitatsuyuki commented Sep 25, 2018

Uh oh!

sgrif commented Sep 25, 2018

Uh oh!

carols10cents commented Sep 25, 2018

Uh oh!

carols10cents left a comment

Choose a reason for hiding this comment

Uh oh!

carols10cents Sep 25, 2018

Choose a reason for hiding this comment

Uh oh!

sgrif Sep 25, 2018

Choose a reason for hiding this comment

Uh oh!

sgrif commented Sep 25, 2018

Uh oh!

ishitatsuyuki commented Sep 25, 2018

Uh oh!

sgrif commented Sep 25, 2018

Uh oh!

carols10cents commented Sep 26, 2018

Uh oh!

sgrif commented Sep 26, 2018

Uh oh!

jtgeibel commented Sep 26, 2018

Uh oh!

carols10cents commented Sep 28, 2018

Uh oh!

sgrif commented Oct 23, 2018

Uh oh!

carols10cents commented Oct 24, 2018

Uh oh!

bors-voyager bot commented Oct 24, 2018

Canceled

Uh oh!

carols10cents commented Oct 24, 2018

Uh oh!

carols10cents commented Oct 24, 2018

Uh oh!

bors-voyager bot commented Oct 24, 2018

Build succeeded

Uh oh!

Uh oh!