Skip to content

CDRIVER-4192 Support retryable handshake network errors #1141

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Nov 11, 2022

Conversation

eramongodb
Copy link
Contributor

@eramongodb eramongodb commented Nov 4, 2022

Description

This PR resolves CDRIVER-4192. Changes are verified by this patch.

Spec Tests

The old retryable reads and retryable writes tests were relocated into a new legacy subdirectory to make room for the new unified spec test files.

The handshakeError.json files are pre-emptively modified as part of CDRIVER-4517, whose parent ticket DRIVERS-2489 is still undergoing review. Once DRIVERS-2489 has been merged into the spec, these spec test files may be updated as a followup accordingly if necessary.

Retryable Reads and Retryable Writes

The essential function identified as being the most appropriate target for augmentation to support retryable reads and retryable writes according to the specification was _mongoc_cluster_stream_for_optype, which implements the mongoc_cluster_stream_for_reads and mongoc_cluster_stream_for_writes functions invoked when a read or write operation is being executed.

This implementation assumes:

  1. any given read or write operation will invoke a mongoc_cluster_stream_for_* with which it will execute said operation.
  2. even if composed of multiple sub-operations (i.e. bulkWrite), the same stream obtained via mongoc_cluster_stream_for_* will be used for all sub-operations.
  3. a single retryable handshake error (recorded as retry_attempted = true) makes any following operations ineligible for retryability (as later detected by mongoc_cmd_parts_assemble).

These assumptions are not applicable to change streams, which may create multiple streams during a single operation (i.e. createChangeStream). Pending further clarification to the relationship between retryable handshake errors and change streams, I have elected to ignore this situation.

To make the interface and implementation uniform in its handling of reads and writes, the MONGOC_SS_AGGREGATE_WITH_WRITE case was split off from mongoc_cluster_stream_for_reads into a new, dedicated mongoc_cluster_stream_for_aggr_with_write function for symmetry.

Note, mongoc_cluster_stream_for_server, which is used for other miscellaneous connections to the server, is deliberately not involved in these changes. This function may eventually require similar changes according to DRIVERS-2063.

Note, the retryableWriteError error label is only applied to failed retryable writes, not retryable reads. Retryable reads may eventually use a retryableReadError error label pending DRIVERS-1401.

Several mock server tests that utilize deliberate hang ups to validate error handling behavior needed to be updated to account for new retry behavior. These tests were modified to have retryReads=false in their URI to opt out of retryable handshake errors so that their prior behavior is preserved.

Miscellaneous Drive-By Improvements and Fixes

Always True handshake_complete

The handshake_complete parameter for the static _handle_network_error function was identified as true in all invocations of the function. The parameter was therefore removed.

Empty Arguments for operation_list_collections

Testing revealed a missing condition to allow for zero arguments to the listCollections operation. This was fixed accordingly.

Timeouts

Several tasks were being timed out due to exceeding the default 40 minute exec_timeout_secs setting. This appears to be due to the increasing size of the test suite rather than any single particular test running for an excessively long time. Therefore, I elected to bump exec_timeout_secs up from 40 minutes to 60 minutes.

Thread Sanitizer

The TSAN tasks were failing due to a TSAN warning emitted by test_add_and_scan_failure. It is unclear to me why they were being caught now rather than earlier, but I elected to resolve this immediately.

Unused Const Variable

The gHexCharPairs variable was causing some noisy compiler warnings due to conditional usage. I added a BSON_MAYBE_UNUSED to its declaration appropriately.

OP_MSG request is null

The mock server functions assert that the request parameter is not null. These functions as a set can use some improvement in their error handling and reporting, but for now, I improved the message for a single case that I encountered during testing.

Monotonic Clock Time Comparison

A task was observed violating a monotonic clock invariant. The cause was not yet identified, but the assertion was modified to provide further detail regarding the values to assist with diagnosing the issue should it happen again.

scan-build Null Pointer Warnings

Recent additions to the test-bson.c test suite exposed a flaw with the ASSERT_CMPSTR assertion macro which allowed for null pointers to be passed as an argument to strcmp which is UB. Adjusted the condition to ensure null pointers are not passed to strcmp.

Copy link
Collaborator

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The miscellaneous improvements are much appreciated.

@eramongodb
Copy link
Contributor Author

Added spec test updates from mongodb/specifications#1336 which necessitated the support for listDatabaseNames in the unified test runner.

_mongoc_bson_init_with_transient_txn_error (cs, reply);
if (reply) {
bson_init (reply);
_mongoc_add_transient_txn_error (cs, reply);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would a transient_txn_error be added here? I am confused on what these labels are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the Transactions Spec:

Any command error that includes the "TransientTransactionError" error label in the "errorLabels" field. Any network error encountered running any command other than commitTransaction in a transaction. If a network error occurs while running the commitTransaction command then it is not known whether the transaction committed or not, and thus the "TransientTransactionError" label MUST NOT be added.

The use of _mongoc_add_transient_txn_error would correspond to the condition, "Any network error encountered running any command other than commitTransaction in a transaction", according to the _mongoc_client_session_in_txn check used in its implementation.

Copy link
Contributor

@galon1 galon1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I just have one question about transient_tx_error that I posted as a comment.

@eramongodb
Copy link
Contributor Author

Latest changes verified by this patch. The two task regressions appear to be flaky/unrelated to changes in this PR.

@eramongodb eramongodb requested a review from kevinAlbs November 10, 2022 17:49
Copy link
Contributor

@vector-of-bool vector-of-bool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed review. The misleading large file/change count made me continually procrastinate it. The refactor of looking up streams for aggr-with-write is helpful, as I wasn't particularly happy with the proliferation of the magic bool parameter. LGTM, aside from one minor comment about bson_steal.

// original retryable error.
{
if (reply) {
bson_steal (reply, &first_reply);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware: bson_steal seems subtly broken, and I no longer trust it. It marks the dst as read-only and makes bson_destroy(dst) never free dst itself (both of these because it sets BSON_FLAG_STATIC on dst). I have no idea why it does this, the function is under-documented, there's no commentary, and the git-blame is useless. I think it would be good to investigate, but not high-priority yet. It might "just work" in this case, but it has bitten me in the past (repeatedly).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be the case that it has not caused any trouble so far in this PR due to all the bson_t in question being circumstantially stack-allocated, thus not requiring a bson_free. You reminded me that I had a branch a while back with a new bson_move_to to avoid some of the idiosyncracies I had encountered with bson_steal. I may need to recover and propose a PR for that later. I will make an attempt to avoid bson_steal for now as suggested.🤔

Comment on lines +346 to +351
if ((_a != _b) && (!_a || !_b || (strcmp (_a, _b) != 0))) { \
fprintf (stderr, \
"FAIL\n\nAssert Failure:\n \"%s\"\n !=\n \"%s\"\n %s:%d " \
" %s()\n", \
_a, \
_b, \
_a ? _a : "(null)", \
_b ? _b : "(null)", \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. I've wanted this to work for a long while.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants