Limit gRPC's built-in exponential backoff #1390
Conversation
// gRPC will automatically perform exponential backoff if there is already a connection
// to this host. In production this is not an issue because "firestore.googleapis.com"
// resolves to many different IPs, so we almost always get a fresh connection. For the
// Firestore emulator, which often runs on localhost, this backoff is undesirable.
Fundamentally, I think we don't want gRPC backoff ever, because we're doing our own. So I don't think we need to mention production vs. emulator, multiple IPs, etc. I'd simplify to something along the lines of:
// We do our own connection backoff (that for example is aware of whether or
// not a write stream error is permanent or not) so we don't want gRPC to do
// backoff on top of that.
I think I'd also be okay setting this to 0ms if that's an option. This is essentially what we get with our webchannel_connection.ts, since webchannel has no built-in backoff, so I'm comfortable keeping gRPC in parity.
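For concreteness, here's a minimal sketch of bounding the reconnect backoff through channel args. This is not this PR's actual diff: the target and insecure credentials are placeholders, and it uses @grpc/grpc-js where the SDK uses the native grpc package, but the two 'grpc.*' option names are real gRPC channel args.

// A minimal sketch (not this PR's diff) of capping gRPC's built-in
// reconnect backoff via channel args.
import * as grpc from '@grpc/grpc-js';

const client = new grpc.Client(
  'localhost:8080', // placeholder target
  grpc.credentials.createInsecure(), // placeholder credentials
  {
    // Clamp both the first and the maximum reconnect delay. Note that
    // gRPC's C core enforces a 100 ms floor on these values, which is
    // why 0 gets rejected (see the error below).
    'grpc.initial_reconnect_backoff_ms': 100,
    'grpc.max_reconnect_backoff_ms': 100,
  }
);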
Bleh, it's being picky about numbers:
E1119 17:56:21.535736171 48711 channel_args.cc:345] grpc.max_reconnect_backoff_ms ignored: it must be >= 100
Yeah, there are hardcoded lower bounds for these values: https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/subchannel.cc#L511
100ms is not terrible; I'll file an issue in the grpc repo to see if there's any objection to allowing clients to totally disable backoff.
Filed grpc/grpc#17250
Awesome, thanks! 100ms is definitely fine for now.
Given the description of this change I'm still at a loss for what this actually fixes. Even in production we only ever talk to a single host.

I'm also not certain that we want to completely disable gRPC's backoff: we don't necessarily see a failure while gRPC is retrying the underlying transport-level connection, which doesn't give us the opportunity to implement our own backoff. This change may have sped up gRPC's underlying connection retry under the covers in a way that's harmful.

If there is a behavioral difference between production and the emulator, and the behavior seems to come from gRPC's connection-level backoff, could it be that the emulator is terminating the channel unnecessarily when all it really needs to do is terminate an RPC stream?
Slightly more details here: grpc/grpc#17282
OK, so this seems more like a temporary workaround that we can undo once gRPC releases a fix?
I'm not sure whether we want to have gRPC handle backoff independently or fold it into our existing backoff strategy. Either way, once we fix the issue causing undesired backoff we can choose one or the other. Right now this is resolving a performance regression and I don't think we should revert it yet.
When we upgraded to gRPC 1.16.0 (#1353) we accidentally introduced a performance issue in the Firestore emulator. Somewhere between 1.13.1 and 1.16.0, gRPC implemented exponential backoff on a per-connection basis.
We already implement "smart" backoffs that have knowledge of which error codes are safe to retry immediately and which are not, and the gRPC backoffs are happening underneath them. In production use cases we're dodging this behavior because we're hitting multiple machines. For the emulator, this introduces exponential delay for every failed write.
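For illustration, here's a simplified sketch of what "smart" backoff means here. This is not the SDK's actual retry table or implementation; the status-code classification and constants are made up for the example.

// Illustrative only: an error-code-aware retry policy. Permanent errors
// fail fast; transient ones back off exponentially with jitter.
import { status } from '@grpc/grpc-js';

const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 60000;

// Returns the delay before the next retry, or null for permanent errors.
function nextBackoffMs(code: status, attempt: number): number | null {
  switch (code) {
    case status.INVALID_ARGUMENT:
    case status.PERMISSION_DENIED:
    case status.FAILED_PRECONDITION:
      return null; // permanent: surface the error instead of retrying
    default: {
      // Transient: exponential backoff with jitter, capped at MAX_DELAY_MS.
      const delay = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
      return delay / 2 + Math.random() * (delay / 2);
    }
  }
}

gRPC's connection-level backoff runs underneath a policy like this, so it delays even the retries the policy would allow immediately.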
EDIT: After further investigation, this actually comes down to IPv6 handling in the emulator. The emulator rejects IPv6 traffic (unless explicitly configured otherwise), which triggers unwanted reconnection backoff behavior. Prod Firestore does not have this issue. I've filed grpc/grpc#17282 to track this issue in gRPC. Once it's resolved, we can revert this change.
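Until the gRPC fix lands, one possible workaround for local testing (an assumption of mine, not something this PR does) is to dial the emulator by an explicit IPv4 literal so the resolver never offers ::1:

// Hypothetical emulator port. Using 127.0.0.1 instead of "localhost" keeps
// the resolver from returning ::1, the address the emulator rejects.
import * as grpc from '@grpc/grpc-js';

const emulatorClient = new grpc.Client(
  '127.0.0.1:8080',
  grpc.credentials.createInsecure()
);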