CDRIVER-4192 Support retryable handshake network errors #1141

eramongodb · 2022-11-04T16:19:17Z

Description

This PR resolves CDRIVER-4192. Changes are verified by this patch.

Spec Tests

The old retryable reads and retryable writes tests were relocated into a new legacy subdirectory to make room for the new unified spec test files.

The handshakeError.json files are pre-emptively modified as part of CDRIVER-4517, whose parent ticket DRIVERS-2489 is still undergoing review. Once DRIVERS-2489 has been merged into the spec, these spec test files may be updated in a follow-up if necessary.

Retryable Reads and Retryable Writes

The function identified as the most appropriate place to add support for retryable reads and retryable writes according to the specification is _mongoc_cluster_stream_for_optype. It implements the mongoc_cluster_stream_for_reads and mongoc_cluster_stream_for_writes functions, which are invoked when a read or write operation is executed.

This implementation assumes:

* Any given read or write operation invokes a mongoc_cluster_stream_for_* function to obtain the stream with which it executes.
* Even if an operation is composed of multiple sub-operations (e.g. bulkWrite), the same stream obtained via mongoc_cluster_stream_for_* is used for all sub-operations.
* A single retryable handshake error (recorded as retry_attempted = true) makes any following operations ineligible for retry (as later detected by mongoc_cmd_parts_assemble).

These assumptions are not applicable to change streams, which may create multiple streams during a single operation (e.g. createChangeStream). Pending further clarification of the relationship between retryable handshake errors and change streams, I have elected to ignore this situation.
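The assumptions listed earlier amount to a retry-at-most-once policy applied while obtaining a stream. The following is a minimal sketch of that policy, not the libmongoc implementation: connect_result_t, connect_fn, stream_for_op, flaky_ctx_t, and flaky_connect are hypothetical names introduced for illustration, and only the retry_attempted flag is taken from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one server selection + handshake attempt. */
typedef enum {
   CONNECT_OK,
   CONNECT_NETWORK_ERROR,
   CONNECT_OTHER_ERROR
} connect_result_t;

/* Hypothetical callback standing in for selecting a server and
 * completing the connection handshake. */
typedef connect_result_t (*connect_fn) (void *ctx);

/* Try to obtain a stream, retrying at most once after a handshake network
 * error. *retry_attempted is set to true when the single retry was used,
 * so later stages can treat the operation as no longer retryable. */
static connect_result_t
stream_for_op (connect_fn try_connect,
               void *ctx,
               bool retryable,
               bool *retry_attempted)
{
   connect_result_t res = try_connect (ctx);
   *retry_attempted = false;

   if (res == CONNECT_NETWORK_ERROR && retryable) {
      *retry_attempted = true;
      res = try_connect (ctx);
   }

   return res;
}

/* Example callback: reports a network error for the first `failures`
 * calls, then succeeds. `calls` counts every invocation. */
typedef struct {
   int failures;
   int calls;
} flaky_ctx_t;

static connect_result_t
flaky_connect (void *ctx)
{
   flaky_ctx_t *f = ctx;
   f->calls++;
   if (f->failures > 0) {
      f->failures--;
      return CONNECT_NETWORK_ERROR;
   }
   return CONNECT_OK;
}
```

With one simulated failure the retry succeeds and retry_attempted is set; with two failures the second error is returned to the caller without a third attempt; with retryable set to false no retry occurs at all.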

To make the interface and implementation uniform in its handling of reads and writes, the MONGOC_SS_AGGREGATE_WITH_WRITE case was split off from mongoc_cluster_stream_for_reads into a new, dedicated mongoc_cluster_stream_for_aggr_with_write function for symmetry.

Note, mongoc_cluster_stream_for_server, which is used for other miscellaneous connections to the server, is deliberately not involved in these changes. This function may eventually require similar changes according to DRIVERS-2063.

Note, the retryableWriteError error label is only applied to failed retryable writes, not retryable reads. Retryable reads may eventually use a retryableReadError error label pending DRIVERS-1401.

Several mock server tests that utilize deliberate hang ups to validate error handling behavior needed to be updated to account for new retry behavior. These tests were modified to have retryReads=false in their URI to opt out of retryable handshake errors so that their prior behavior is preserved.

Miscellaneous Drive-By Improvements and Fixes

Always True handshake_complete

The handshake_complete parameter for the static _handle_network_error function was identified as true in all invocations of the function. The parameter was therefore removed.

Empty Arguments for operation_list_collections

Testing revealed a missing condition to allow for zero arguments to the listCollections operation. This was fixed accordingly.

Timeouts

Several tasks were timing out after exceeding the default 40-minute exec_timeout_secs setting. This appears to be due to the growing size of the test suite rather than any single test running for an excessively long time. Therefore, I elected to bump exec_timeout_secs from 40 minutes to 60 minutes.

Thread Sanitizer

The TSAN tasks were failing due to a TSAN warning emitted by test_add_and_scan_failure. It is unclear to me why the warning is being caught now rather than earlier, but I elected to resolve it immediately.

Unused Const Variable

The gHexCharPairs variable was causing some noisy compiler warnings due to conditional usage. I added a BSON_MAYBE_UNUSED to its declaration appropriately.
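As a rough illustration of why a conditionally used constant triggers the warning and how an "unused" attribute silences it, consider the sketch below. MY_MAYBE_UNUSED, HEX_DEBUG, hex_digits, and low_nibble_hex are hypothetical names; the macro is only assumed to behave similarly to BSON_MAYBE_UNUSED and is not its actual definition.

```c
#include <stddef.h>

/* Hypothetical stand-in for BSON_MAYBE_UNUSED: expands to the GCC/Clang
 * "unused" attribute where available and to nothing elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define MY_MAYBE_UNUSED __attribute__ ((unused))
#else
#define MY_MAYBE_UNUSED
#endif

/* Referenced only when HEX_DEBUG is defined. Without the attribute,
 * -Wunused-const-variable may fire in builds that leave HEX_DEBUG
 * undefined. */
static const char hex_digits[] MY_MAYBE_UNUSED = "0123456789abcdef";

/* Returns the lowercase hex digit for the low nibble of `value`. */
static char
low_nibble_hex (unsigned value)
{
   unsigned nibble = value & 0xFu;
#ifdef HEX_DEBUG
   /* Table lookup path: the only user of hex_digits. */
   return hex_digits[nibble];
#else
   /* Arithmetic path: hex_digits is unused in this configuration. */
   return (char) (nibble < 10u ? '0' + nibble : 'a' + (nibble - 10u));
#endif
}
```

Both configurations return the same digit, so the attribute changes only the diagnostics, not the behavior.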

OP_MSG request is null

The mock server functions assert that the request parameter is not null. These functions as a set can use some improvement in their error handling and reporting, but for now, I improved the message for a single case that I encountered during testing.

Monotonic Clock Time Comparison

A task was observed violating a monotonic clock invariant. The cause has not yet been identified, but the assertion was modified to report the values involved to assist with diagnosing the issue should it happen again.
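One way such an assertion can carry more detail is to format both readings into the failure message. This is a sketch under assumptions: check_monotonic and its microsecond parameters are hypothetical and do not reflect the actual assertion in the test suite.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: verifies that two monotonic clock readings are
 * ordered. On violation it writes both values and how far the clock went
 * backwards into `msg`, so a failure report is diagnosable on its own. */
static bool
check_monotonic (int64_t earlier_us,
                 int64_t later_us,
                 char *msg,
                 size_t msg_len)
{
   if (later_us >= earlier_us) {
      return true;
   }

   snprintf (msg,
             msg_len,
             "monotonic clock went backwards: earlier=%" PRId64
             "us later=%" PRId64 "us (delta=%" PRId64 "us)",
             earlier_us,
             later_us,
             earlier_us - later_us);
   return false;
}
```

A caller would print msg and abort when the function returns false, rather than failing with a bare boolean assertion.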

scan-build Null Pointer Warnings

Recent additions to the test-bson.c test suite exposed a flaw in the ASSERT_CMPSTR assertion macro: it allowed null pointers to be passed as arguments to strcmp, which is undefined behavior. The condition was adjusted to ensure null pointers are never passed to strcmp.
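The null-safe comparison can be shown in isolation. The sketch below restates the condition from the TestSuite.h review diff in this thread as plain functions; cstr_differs and cstr_or_null_label are hypothetical names, not part of the test suite.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when the two strings differ. NULL is treated as a distinct
 * value that equals only another NULL. strcmp() is reached only when both
 * pointers are non-null. */
static bool
cstr_differs (const char *a, const char *b)
{
   return (a != b) && (!a || !b || strcmp (a, b) != 0);
}

/* Substitutes a printable placeholder so that "%s" never receives NULL. */
static const char *
cstr_or_null_label (const char *s)
{
   return s ? s : "(null)";
}
```

The pointer comparison comes first so that two null pointers (or the same pointer twice) compare equal without dereferencing anything.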

kevinAlbs

The miscellaneous improvements are much appreciated.

src/libmongoc/tests/test-mongoc-cluster.c

src/libmongoc/src/mongoc/mongoc-cluster.c

src/libmongoc/src/mongoc/mongoc-util.c

src/libmongoc/src/mongoc/mongoc-cluster.c

eramongodb · 2022-11-09T16:02:42Z

Added spec test updates from mongodb/specifications#1336 which necessitated the support for listDatabaseNames in the unified test runner.

galon1 · 2022-11-09T16:23:28Z

src/libmongoc/src/mongoc/mongoc-cluster.c

-      _mongoc_bson_init_with_transient_txn_error (cs, reply);
+      if (reply) {
+         bson_init (reply);
+         _mongoc_add_transient_txn_error (cs, reply);


Why would a transient_txn_error be added here? I am confused about what these labels are.

According to the Transactions Spec:

Any command error that includes the "TransientTransactionError" error label in the "errorLabels" field. Any network error encountered running any command other than commitTransaction in a transaction. If a network error occurs while running the commitTransaction command then it is not known whether the transaction committed or not, and thus the "TransientTransactionError" label MUST NOT be added.

The use of _mongoc_add_transient_txn_error would correspond to the condition, "Any network error encountered running any command other than commitTransaction in a transaction", according to the _mongoc_client_session_in_txn check used in its implementation.
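The quoted rule can be restated as a small predicate. This is a sketch of the decision only, assuming a hypothetical failed_cmd_t summary of the failure; it is not how _mongoc_add_transient_txn_error is implemented.

```c
#include <stdbool.h>

/* Hypothetical summary of a failed command, for illustration only. */
typedef struct {
   bool in_transaction;           /* session is in an active transaction */
   bool is_network_error;         /* failure was a network error */
   bool is_commit_transaction;    /* command was commitTransaction */
   bool server_labeled_transient; /* reply already carries the label */
} failed_cmd_t;

/* Returns true when the failure counts as a TransientTransactionError
 * under the rule quoted above. */
static bool
is_transient_txn_error (const failed_cmd_t *f)
{
   /* A command error that already includes the label qualifies. */
   if (f->server_labeled_transient) {
      return true;
   }

   /* A network error inside a transaction qualifies, except for
    * commitTransaction, where the outcome of the commit is unknown. */
   return f->in_transaction && f->is_network_error &&
          !f->is_commit_transaction;
}
```

The commitTransaction exclusion is the important case: labeling it as transient could prompt a caller to retry a transaction that may already have committed.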

galon1

LGTM. I just have one question about transient_txn_error that I posted as a comment.

eramongodb · 2022-11-10T17:47:30Z

Latest changes verified by this patch. The two task regressions appear to be flaky/unrelated to changes in this PR.

vector-of-bool

Sorry for the delayed review. The misleading large file/change count made me continually procrastinate it. The refactor of looking up streams for aggr-with-write is helpful, as I wasn't particularly happy with the proliferation of the magic bool parameter. LGTM, aside from one minor comment about bson_steal.

vector-of-bool · 2022-11-11T01:35:28Z

src/libmongoc/src/mongoc/mongoc-cluster.c

+   // original retryable error.
+   {
+      if (reply) {
+         bson_steal (reply, &first_reply);


Beware: bson_steal seems subtly broken, and I no longer trust it. It marks the dst as read-only and makes bson_destroy(dst) never free dst itself (both of these because it sets BSON_FLAG_STATIC on dst). I have no idea why it does this, the function is under-documented, there's no commentary, and the git-blame is useless. I think it would be good to investigate, but not high-priority yet. It might "just work" in this case, but it has bitten me in the past (repeatedly).

It may be the case that it has not caused any trouble so far in this PR due to all the bson_t in question being circumstantially stack-allocated, thus not requiring a bson_free. You reminded me that I had a branch a while back with a new bson_move_to to avoid some of the idiosyncrasies I had encountered with bson_steal. I may need to recover and propose a PR for that later. I will make an attempt to avoid bson_steal for now as suggested.

vector-of-bool · 2022-11-11T01:40:35Z

src/libmongoc/tests/TestSuite.h

+      if ((_a != _b) && (!_a || !_b || (strcmp (_a, _b) != 0))) {              \
         fprintf (stderr,                                                      \
                  "FAIL\n\nAssert Failure:\n  \"%s\"\n  !=\n  \"%s\"\n %s:%d " \
                  " %s()\n",                                                   \
-                  _a,                                                          \
-                  _b,                                                          \
+                  _a ? _a : "(null)",                                          \
+                  _b ? _b : "(null)",                                          \


Very nice. I've wanted this to work for a long while.

eramongodb added 11 commits November 4, 2022 10:17

Remove README.rst

bc0cee0

Move legacy retryable reads test files into legacy subdirectory

1a51281

Move legacy retryable writes test files into legacy subdirectory

03352d1

Sync unified retryable reads test files with 08230607

d470f48

Sync unified retryable writes test files with 08230607

039c9ce

Sync unified transactions test files with 08230607

30898bf

Bump default task timeout from 40 minutes to 1 hour

092c741

Address TSAN warnings in test_add_and_scan_failure

1b3e3cb

Address -Wunused-const-variable for gHexCharPairs

b0bc187

Improve error message when expected OP_MSG request is not received

7f72936

Improve assertion message for monotonic clock time comparison

fab25ab

eramongodb requested review from kevinAlbs and vector-of-bool November 4, 2022 16:19

eramongodb self-assigned this Nov 4, 2022

kevinAlbs requested a review from galon1 November 4, 2022 16:32

eramongodb force-pushed the cdriver-4192 branch from 9657a3e to efaa178 Compare November 4, 2022 16:32

eramongodb added 13 commits November 4, 2022 11:32

Address null pointer warnings by scan-build

6592c6f

CDRIVER-4517 Update retryable writes handshake error spec tests

9ee659d

CDRIVER-4517 Update retryable reads handshake error spec tests

8916427

Declare _mongoc_cluster_stream_for_server as static

f3b0c69

Refactor _mongoc_bson_init_with_transient_txn_error -> _mongoc_add_tr…

9b50167

…ansient_txn_error

Separate aggregates with writes from mongoc_cluster_stream_for_reads

625b139

Add _mongoc_error_is_auth

489beff

Declare _mongoc_write_error_append_retryable_label in mongoc-error-pr…

32346e0

…ivate.h

Remove always-true handshake_complete parameter from _handle_network_…

040399d

…error

Assert preconditions for mongoc_cluster_stream_for_server

9186c66

Permit empty arguments field for operation_list_collections

51a7c82

Retry when encountering a network error establishing an initial conne…

f3e44d5

…ction to a server

Update tests to account for retryable handshake network failures

5c9a0a0

eramongodb force-pushed the cdriver-4192 branch from efaa178 to 5c9a0a0 Compare November 4, 2022 16:35

kevinAlbs reviewed Nov 7, 2022

eramongodb added 5 commits November 7, 2022 13:32

Revert changes to test_cluster_command_error

403e4dc

Fix documentation for _mongoc_add_transient_txn_error

36d4c9a

Ensure clean errorLabels field when appending new label

37b4568

Add unified test runner support for listDatabaseNames

52d8fc6

CDRIVER-4517 Update handshakeError.json unified spec tests

a67d380

galon1 reviewed Nov 9, 2022

galon1 approved these changes Nov 9, 2022

eramongodb added 3 commits November 9, 2022 11:27

CDRIVER-4517 Update handshakeError.json unified spec tests

8c4b265

Format runner.c

687ad06

Skip retryable reads tests that require unsupported optional helpers

70808ae

eramongodb requested a review from kevinAlbs November 10, 2022 17:49

kevinAlbs approved these changes Nov 10, 2022

vector-of-bool approved these changes Nov 11, 2022

Replace bson_steal with bson_copy_to

6336637

eramongodb merged commit 6072b69 into mongodb:master Nov 11, 2022

eramongodb deleted the cdriver-4192 branch November 11, 2022 16:09

eramongodb mentioned this pull request Nov 16, 2022

CDRIVER-4517 Sync unified retryable read and write test files with 35b17b70 #1150

Merged

jmikola mentioned this pull request Dec 5, 2022

[BLOCKED] PHPLIB-1033 and PHPLIB-1042: Sync spec tests for retryable handshake errors mongodb/mongo-php-library#1011

Closed

eramongodb mentioned this pull request Mar 9, 2023

Disable retryable handshakes by default for mock server tests #1214

Merged
