-
Notifications
You must be signed in to change notification settings - Fork 648
Add infrastructure for pagerduty integration #1498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm against hardcoding any kind of proprietary SaaS into the main product. The cloud-native way (that I'm familiar with) is to setup a Prometheus compatible metrics endpoint, and let Prometheus detect all the anomalies. |
Setting up another system we have to manage isn't really in the cards right now. Pagerduty is the service we're using, and that is unlikely to change in the near future. |
It's a good point though that when we actually add the code that calls this code to send events to pagerduty, the code shouldn't do anything if pagerduty isn't configured so that this is completely optional for anyone else running this codebase (and also optional during development). Taking a look now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No-op if not configured plz
src/on_call.rs
Outdated
/// (potentially waking them up at 3 AM). | ||
pub fn send(self) -> CargoResult<()> { | ||
let api_token = env::var("PAGERDUTY_API_TOKEN")?; | ||
let service_key = env::var("PAGERDUTY_INTEGRATION_KEY")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, I didn't see this in my first skim. I'd like to see this code be a no-op if these variables aren't set, rather than returning an error, so that pagerduty usage is optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what you mean. This code is only ever going to get called from separate processes used in production to monitor metrics. This code compiles regardless of whether it's configured or not, and would never be called from code paths used outside of our monitors. If that process is run and not configured, we certainly want to be informed immediately rather than have our monitoring silently broken.
I think there's some confusion on when or how this code will be used. It's not intended to be called from the main app, it's intended to be called by separate processes which periodically check the invariants we want to page on (such as those described above). It's included in the main library instead of those binaries directly because I expect to be calling it from multiple binaries (and this was the most logical place to put it). Its usage is already optional by not running those binaries. |
So I think you're confusing PagerDuty as a monitoring tool. It's not, you're missing a layer. You need a server handling and visualizing all the metrics, or you won't be able to see what's going on even if you get paged. I know you hate managing servers, so if then try Datadog. It's a SaaS, and they sponsors open source development. Also, in this kind of implementation I guess you will need to code the triggers into application source. That's like a burden for changing them; I know that crates.io is deployed infrequently. You also mentioned failing background jobs. Sentry will handle them in a brilliant way. You don't have to roll your own monitoring system. Prometheus format is the industry standard for metrics endpoint. It's far easier to expose, as it's just a HTTP route. |
@ishitatsuyuki This does not include the monitoring layer. The majority of the things we want to monitor are not yet implemented, as I've pointed out in the PR description. This is just the test code for what we want the end result to be. We can only trigger an alert on crate upload timings for now, and it has to have such a high threshold to avoid false positives that it's barely useful. However, we cannot move index updates to a background jobs without a mechanism in place to insure we are paged when that fails, and I'd like to make sure that works before we deploy it. I take some offense at you coming into this PR and saying that I don't understand what monitoring is or what this service does. I'd appreciate it if you assume that anybody who opens a PR here knows what they're doing in the future. |
Indeed!
There's not really much preventing it from being called from the main app, though-- could we at least add some comments about where it is and is not appropriate to call this code? I'd really prefer to encode this restriction by making this code not error if pagerduty isn't configured; is there a reason you're resisting that?
When I read this, I was imagining logic like this:
The index updating will still need to be done by anyone running this app in development or anyone running this app without pagerduty configured; I'd like to make sure that we don't disrupt their usage of the app with this code. I see your comments along these lines:
and understand now that the logic I was imagining is not the logic you are imagining, but I had trouble getting from one to the other with the initial PR description :)
What about making a separate crate entirely for these binaries, and then making that crate an optional dependency? Just trying to brainstorm other ways to keep this code from sneaking into the main app. |
Because when we expect it to be configured, I'd like to get an error if we forgot to configure it, rather than silently not getting paged.
That's not how the background job works though. When a job fails it will be retried with exponential backoff to let it succeed eventually without getting incidents from network blips or minor github downtime. The job might also not be getting run because the runner itself is broken. This is why we have a separate monitor process, whose only purpose is to monitor everything else and page when bad things happen.
I'm not really sure how that helps here, the app would just fail to compile if you didn't pass that feature flag. I'm really not following your concern that this will somehow sneak into the main app. This isn't something that will just randomly start getting called from all over the place. If it's really that big of a concern, I can just copy-paste the code in the binaries that need it :\ |
One thought I had that might make things a bit more clear, would be to move the library level code into |
I like @jtgeibel's suggestion.
Isn't that what the
Ok then, would you humor me with some comments in |
Right our pagerduty notifications come from heroku email alerts and a pingdom integration. We want to start paging on more specific metrics. So far we've discussed: - When no crate is uploaded for N period of time (likely 4 hours) - When a background job has failed N times (likely 5) - When the background queue has N jobs (likely 5) In order to do this we need to have some code in place that actually starts an incident and pages whoever is on call. We'd also like to test this code before we start relying on it, so I've added a binary we can use to make sure everything is configured correctly.
We don't want to require pagerduty for external registries. But making this code no-op if the environment variables are missing would mean we don't get informed if things are mis-configured or un-configured. We aren't expecting this to be called from most code, only a handful of binaries that are specifically intended to monitor the service. In order to prevent this code from accidentally getting used throughout the app (though hopefully nobody is calling a function documented as "wakes up whoever is on-call" willy nilly), we can move this into the `bin` directory. We're specifically using `on_call/mod.rs` instead of `on_call.rs` so Cargo doesn't think that this file is supposed to be its own binary.
4b6f676
to
2267c99
Compare
@carols10cents I've moved the code so it cannot be called from the main app. Does that resolve your concerns? This is blocking the async git stuff so I'd like to move this forward |
Thank you! bors: r+ |
Canceled |
oops guess i have to wait for travis to be done. |
bors: r+ |
1498: Add infrastructure for pagerduty integration r=carols10cents a=sgrif Right our pagerduty notifications come from heroku email alerts and a pingdom integration. We want to start paging on more specific metrics. So far we've discussed: - When no crate is uploaded for N period of time (likely 4 hours) - When a background job has failed N times (likely 5) - When the background queue has N jobs (likely 5) In order to do this we need to have some code in place that actually starts an incident and pages whoever is on call. We'd also like to test this code before we start relying on it, so I've added a binary we can use to make sure everything is configured correctly. Co-authored-by: Sean Griffin <[email protected]> Co-authored-by: Carol (Nichols || Goulding) <[email protected]>
Build succeeded |
Right our pagerduty notifications come from heroku email alerts and a
pingdom integration. We want to start paging on more specific metrics.
So far we've discussed:
In order to do this we need to have some code in place that actually
starts an incident and pages whoever is on call. We'd also like to test
this code before we start relying on it, so I've added a binary we can
use to make sure everything is configured correctly.