Limit gRPC's built-in exponential backoff #1390
Conversation
// gRPC will automatically perform exponential backoff if there is already a connection
// to this host. In production this is not an issue because "firestore.googleapis.com"
// resolves to many different IPs, so we almost always get a fresh connection. For the
// Firestore emulator, which often runs on localhost, this backoff is undesirable.
Fundamentally, I think we don't want gRPC backoff ever, because we're doing our own. So I don't think we need to mention production vs. emulator, multiple IPs, etc. I'd simplify to something along the lines of:
// We do our own connection backoff (that for example is aware of whether or
// not a write stream error is permanent or not) so we don't want gRPC to do
// backoff on top of that.
I think I'd also be okay setting this to 0ms if that's an option. This is essentially what we get with our webchannel_connection.ts, since webchannel has no built-in backoff, so I'm comfortable keeping gRPC in parity.
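For concreteness, here's a minimal sketch of bounding the reconnect backoff through channel args. This is not this PR's actual diff: the target and insecure credentials are placeholders, and it uses @grpc/grpc-js where the SDK uses the native grpc package, but the two 'grpc.*' option names are real gRPC channel args.

// A minimal sketch (not this PR's diff) of capping gRPC's built-in
// reconnect backoff via channel args.
import * as grpc from '@grpc/grpc-js';

const client = new grpc.Client(
  'localhost:8080', // placeholder target
  grpc.credentials.createInsecure(), // placeholder credentials
  {
    // Clamp both the first and the maximum reconnect delay. Note that
    // gRPC's C core enforces a 100 ms floor on these values, which is
    // why 0 gets rejected (see the error below).
    'grpc.initial_reconnect_backoff_ms': 100,
    'grpc.max_reconnect_backoff_ms': 100,
  }
);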
Bleh, it's being picky about numbers:
E1119 17:56:21.535736171 48711 channel_args.cc:345] grpc.max_reconnect_backoff_ms ignored: it must be >= 100
Yeah, there are hardcoded lower bounds for these values: https://github.com/grpc/grpc/blob/master/src/core/ext/filters/client_channel/subchannel.cc#L511
100ms is not terrible; I'll file an issue in the grpc repo to see if there's any objection to allowing clients to totally disable backoff.
Filed grpc/grpc#17250
Awesome, thanks! 100ms is definitely fine for now.
Given the description of this change I'm still at a loss for what this actually fixes. Even in production we only ever talk to a single host.

I'm also not certain that we want to completely disable gRPC's backoff: we don't necessarily see a failure while gRPC is retrying the underlying transport-level connection, which doesn't give us the opportunity to implement our own backoff. This change may have sped up gRPC's underlying connection retry under the covers in a way that's harmful.

If there is a behavioral difference between production and the emulator, and the behavior seems to come from gRPC's connection-level backoff, could it be that the emulator is terminating the channel unnecessarily when all it really needs to do is terminate an RPC stream?
Slightly more details here: grpc/grpc#17282
OK, so this seems more like a temporary workaround that we can undo once gRPC releases a fix?
I'm not sure whether we want to have gRPC handle backoff independently or fold it into our existing backoff strategy. Either way, once we fix the issue causing undesired backoff we can choose one or the other. Right now this is resolving a performance regression and I don't think we should revert it yet.
When we upgraded to gRPC 1.16.0 (#1353) we accidentally introduced a performance issue in the Firestore emulator. Somewhere between 1.13.1 and 1.16.0, gRPC implemented exponential backoff on a per-connection basis.
We already implement "smart" backoffs that have knowledge of which error codes are safe to retry immediately and which are not, and the gRPC backoffs are happening underneath them. In production use cases we're dodging this behavior because we're hitting multiple machines. For the emulator, this introduces exponential delay for every failed write.
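For illustration, here's a simplified sketch of what "smart" backoff means here. This is not the SDK's actual retry table or implementation; the status-code classification and constants are made up for the example.

// Illustrative only: an error-code-aware retry policy. Permanent errors
// fail fast; transient ones back off exponentially with jitter.
import { status } from '@grpc/grpc-js';

const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 60000;

// Returns the delay before the next retry, or null for permanent errors.
function nextBackoffMs(code: status, attempt: number): number | null {
  switch (code) {
    case status.INVALID_ARGUMENT:
    case status.PERMISSION_DENIED:
    case status.FAILED_PRECONDITION:
      return null; // permanent: surface the error instead of retrying
    default: {
      // Transient: exponential backoff with jitter, capped at MAX_DELAY_MS.
      const delay = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
      return delay / 2 + Math.random() * (delay / 2);
    }
  }
}

gRPC's connection-level backoff runs underneath a policy like this, so it delays even the retries the policy would allow immediately.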
EDIT: After further investigation, this actually comes down to IPv6 handling in the emulator. The emulator rejects IPv6 traffic (unless explicitly configured otherwise), which triggers unwanted reconnection backoff behavior. Prod Firestore does not have this issue. I've filed grpc/grpc#17282 to track this issue in gRPC. Once it's resolved, we can revert this change.
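Until the gRPC fix lands, one possible workaround for local testing (an assumption of mine, not something this PR does) is to dial the emulator by an explicit IPv4 literal so the resolver never offers ::1:

// Hypothetical emulator port. Using 127.0.0.1 instead of "localhost" keeps
// the resolver from returning ::1, the address the emulator rejects.
import * as grpc from '@grpc/grpc-js';

const emulatorClient = new grpc.Client(
  '127.0.0.1:8080',
  grpc.credentials.createInsecure()
);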