
Limit gRPC's built-in exponential backoff #1390


Merged
merged 5 commits into master from rpb/bypass-grpc-backoff
Nov 20, 2018

Conversation


@ryanpbrewster ryanpbrewster commented Nov 20, 2018

When we upgraded to gRPC 1.16.0 (#1353) we accidentally introduced a performance issue in the Firestore emulator. Somewhere between 1.13.1 and 1.16.0, gRPC implemented exponential backoff on a per-connection basis.

We already implement "smart" backoffs that know which error codes are safe to retry immediately and which are not, and the gRPC backoffs happen underneath them. In production use cases we dodge this behavior because we're hitting multiple machines; for the emulator, this introduces exponential delay for every failed write.
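A minimal sketch of what such an error-code-aware backoff might look like (names, codes, and thresholds here are illustrative, not the SDK's actual implementation): retry immediately on codes known to be transient, and back off exponentially otherwise.

```typescript
// Hypothetical sketch of an error-code-aware backoff policy.
// Codes in this set are treated as safe to retry immediately.
const IMMEDIATELY_RETRYABLE = new Set(["UNAVAILABLE"]);

function nextDelayMs(code: string, attempt: number): number {
  if (IMMEDIATELY_RETRYABLE.has(code)) return 0;
  // Exponential backoff for everything else, capped at 30 s.
  return Math.min(30_000, 1_000 * 2 ** attempt);
}

console.log(nextDelayMs("UNAVAILABLE", 3)); // 0
console.log(nextDelayMs("DEADLINE_EXCEEDED", 3)); // 8000
```

The point of the PR is that gRPC's own per-connection backoff was stacking underneath a policy like this, so even "retry immediately" errors paid a reconnection delay.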

EDIT: After further investigation, this is actually due to IPv6 handling in the emulator. The emulator rejects IPv6 traffic (unless explicitly configured otherwise), which triggers unwanted reconnection backoff behavior. Prod Firestore does not have this issue. I've filed grpc/grpc#17282 to track this in gRPC. Once it's resolved, we can revert this change.

// gRPC will automatically perform exponential backoff if there is already a connection
// to this host. In production this is not an issue because "firestore.googleapis.com"
// resolves to many different IPs, so we almost always get a fresh connection. For the
// Firestore emulator, which often runs on localhost, this backoff is undesirable.
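In practice, limiting the reconnection backoff means passing channel arguments when the client is constructed. A minimal sketch (the option names are standard gRPC channel arguments; the client constructor call is shown only as a comment and is hypothetical):

```typescript
// Channel options that clamp gRPC's built-in reconnection backoff.
// 100 ms is the hard-coded minimum gRPC accepts, as noted later in
// this thread; smaller values are ignored with a warning.
const backoffChannelOptions: Record<string, number> = {
  "grpc.initial_reconnect_backoff_ms": 100,
  "grpc.max_reconnect_backoff_ms": 100,
};

// These would typically be spread into the options passed when
// constructing the gRPC client, e.g. (hypothetical constructor):
// new FirestoreClient(address, credentials, backoffChannelOptions);
console.log(backoffChannelOptions["grpc.max_reconnect_backoff_ms"]); // 100
```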
Contributor

Fundamentally, I think we don't want gRPC backoff ever, because we're doing our own. So I don't think we need to mention production vs. emulator, multiple IPs, etc. I'd simplify to something along the lines of:

// We do our own connection backoff (which, for example, is aware of whether a
// write stream error is permanent) so we don't want gRPC to do backoff on top
// of that.

I think I'd also be okay setting this to 0 ms if that's an option. That's essentially what we get with webchannel_connection.ts, since WebChannel has no built-in backoff, so I'm comfortable keeping gRPC in parity.

Contributor Author

Bleh, it's being picky about numbers:

E1119 17:56:21.535736171   48711 channel_args.cc:345]        grpc.max_reconnect_backoff_ms ignored: it must be >= 100

Contributor Author

Yeah, there are hardcoded lower bounds for these values: https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/subchannel.cc#L511

100 ms is not terrible; I'll file an issue in the grpc repo to see if there's any objection to letting clients disable backoff entirely.

Contributor

Awesome, thanks! 100ms is definitely fine for now.

@ryanpbrewster ryanpbrewster merged commit b3063e4 into master Nov 20, 2018
@ryanpbrewster ryanpbrewster deleted the rpb/bypass-grpc-backoff branch November 20, 2018 03:43
@wilhuff

wilhuff commented Nov 26, 2018

Given the description of this change I'm still at a loss for what this actually fixes. Even in production we only ever talk to a single host: firestore.googleapis.com:443. I don't see how we'd have production behavior that's different if that host were localhost:$EMULATOR_PORT.

I'm also not certain that we want to completely disable gRPC's backoff: we don't necessarily see a failure while gRPC is retrying the underlying transport-level connection, which doesn't give us the opportunity to implement our own backoff. This change may have sped up gRPC's connection retries under the covers in a way that's harmful.

If there is a behavioral difference between production and the emulator and the behavior seems to come from grpc's connection-level backoff could it be that the emulator is terminating the channel unnecessarily when all it really needs to do is terminate an RPC stream?

@ryanpbrewster

Slightly more details here: grpc/grpc#17282

@wilhuff

wilhuff commented Nov 26, 2018

OK, so this seems more like a temporary workaround that we can undo once gRPC releases a fix?

@ryanpbrewster

I'm not sure whether we want to have gRPC handle backoff independently or fold it into our existing backoff strategy. Either way, once we fix the issue causing undesired backoff we can choose one or the other. Right now this is resolving a performance regression and I don't think we should revert it yet.

@firebase firebase locked and limited conversation to collaborators Oct 14, 2019