-
Notifications
You must be signed in to change notification settings - Fork 455
CDRIVER-4192 Support retryable handshake network errors #1141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9657a3e
to
efaa178
Compare
…ansient_txn_error
…ction to a server
efaa178
to
5c9a0a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The miscellaneous improvements are much appreciated.
Added spec test updates from mongodb/specifications#1336 which necessitated the support for |
_mongoc_bson_init_with_transient_txn_error (cs, reply); | ||
if (reply) { | ||
bson_init (reply); | ||
_mongoc_add_transient_txn_error (cs, reply); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would a transient_txn_error
be added here? I am confused on what these labels are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the Transactions Spec:
Any command error that includes the "TransientTransactionError" error label in the "errorLabels" field. Any network error encountered running any command other than commitTransaction in a transaction. If a network error occurs while running the commitTransaction command then it is not known whether the transaction committed or not, and thus the "TransientTransactionError" label MUST NOT be added.
The use of _mongoc_add_transient_txn_error
would correspond to the condition, "Any network error encountered running any command other than commitTransaction in a transaction", according to the _mongoc_client_session_in_txn
check used in its implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I just have one question about transient_tx_error
that I posted as a comment.
Latest changes verified by this patch. The two task regressions appear to be flaky/unrelated to changes in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delayed review. The misleading large file/change count made me continually procrastinate it. The refactor of looking up streams for aggr-with-write is helpful, as I wasn't particularly happy with the proliferation of the magic bool
parameter. LGTM, aside from one minor comment about bson_steal
.
// original retryable error. | ||
{ | ||
if (reply) { | ||
bson_steal (reply, &first_reply); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beware: bson_steal
seems subtly broken, and I no longer trust it. It marks the dst
as read-only and makes bson_destroy(dst)
never free dst
itself (both of these because it sets BSON_FLAG_STATIC
on dst
). I have no idea why it does this, the function is under-documented, there's no commentary, and the git-blame is useless. I think it would be good to investigate, but not high-priority yet. It might "just work" in this case, but it has bitten me in the past (repeatedly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be the case that it has not caused any trouble so far in this PR due to all the bson_t
in question being circumstantially stack-allocated, thus not requiring a bson_free
. You reminded me that I had a branch a while back with a new bson_move_to
to avoid some of the idiosyncracies I had encountered with bson_steal
. I may need to recover and propose a PR for that later. I will make an attempt to avoid bson_steal
for now as suggested.🤔
if ((_a != _b) && (!_a || !_b || (strcmp (_a, _b) != 0))) { \ | ||
fprintf (stderr, \ | ||
"FAIL\n\nAssert Failure:\n \"%s\"\n !=\n \"%s\"\n %s:%d " \ | ||
" %s()\n", \ | ||
_a, \ | ||
_b, \ | ||
_a ? _a : "(null)", \ | ||
_b ? _b : "(null)", \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice. I've wanted this to work for a long while.
Description
This PR resolves CDRIVER-4192. Changes are verified by this patch.
Spec Tests
The old retryable reads and retryable writes tests were relocated into a new
legacy
subdirectory to make room for the new unified spec test files.The
handshakeError.json
files are pre-emptively modified as part of CDRIVER-4517, whose parent ticket DRIVERS-2489 is still undergoing review. Once DRIVERS-2489 has been merged into the spec, these spec test files may be updated as a followup accordingly if necessary.Retryable Reads and Retryable Writes
The essential function identified as being the most appropriate target for augmentation to support retryable reads and retryable writes according to the specification was
_mongoc_cluster_stream_for_optype
, which implements themongoc_cluster_stream_for_reads
andmongoc_cluster_stream_for_writes
functions invoked when a read or write operation is being executed.This implementation assumes:
mongoc_cluster_stream_for_*
with which it will execute said operation.bulkWrite
), the same stream obtained viamongoc_cluster_stream_for_*
will be used for all sub-operations.retry_attempted = true
) makes any following operations ineligible for retryability (as later detected bymongoc_cmd_parts_assemble
).These assumptions are not applicable to change streams, which may create multiple streams during a single operation (i.e.
createChangeStream
). Pending further clarification to the relationship between retryable handshake errors and change streams, I have elected to ignore this situation.To make the interface and implementation uniform in its handling of reads and writes, the
MONGOC_SS_AGGREGATE_WITH_WRITE
case was split off frommongoc_cluster_stream_for_reads
into a new, dedicatedmongoc_cluster_stream_for_aggr_with_write
function for symmetry.Note,
mongoc_cluster_stream_for_server
, which is used for other miscellaneous connections to the server, is deliberately not involved in these changes. This function may eventually require similar changes according to DRIVERS-2063.Note, the
retryableWriteError
error label is only applied to failed retryable writes, not retryable reads. Retryable reads may eventually use aretryableReadError
error label pending DRIVERS-1401.Several mock server tests that utilize deliberate hang ups to validate error handling behavior needed to be updated to account for new retry behavior. These tests were modified to have
retryReads=false
in their URI to opt out of retryable handshake errors so that their prior behavior is preserved.Miscellaneous Drive-By Improvements and Fixes
Always True handshake_complete
The
handshake_complete
parameter for the static_handle_network_error
function was identified as true in all invocations of the function. The parameter was therefore removed.Empty Arguments for operation_list_collections
Testing revealed a missing condition to allow for zero arguments to the listCollections operation. This was fixed accordingly.
Timeouts
Several tasks were being timed out due to exceeding the default 40 minute
exec_timeout_secs
setting. This appears to be due to the increasing size of the test suite rather than any single particular test running for an excessively long time. Therefore, I elected to bumpexec_timeout_secs
up from 40 minutes to 60 minutes.Thread Sanitizer
The TSAN tasks were failing due to a TSAN warning emitted by
test_add_and_scan_failure
. It is unclear to me why they were being caught now rather than earlier, but I elected to resolve this immediately.Unused Const Variable
The
gHexCharPairs
variable was causing some noisy compiler warnings due to conditional usage. I added aBSON_MAYBE_UNUSED
to its declaration appropriately.OP_MSG request is null
The mock server functions assert that the request parameter is not null. These functions as a set can use some improvement in their error handling and reporting, but for now, I improved the message for a single case that I encountered during testing.
Monotonic Clock Time Comparison
A task was observed violating a monotonic clock invariant. The cause was not yet identified, but the assertion was modified to provide further detail regarding the values to assist with diagnosing the issue should it happen again.
scan-build Null Pointer Warnings
Recent additions to the
test-bson.c
test suite exposed a flaw with theASSERT_CMPSTR
assertion macro which allowed for null pointers to be passed as an argument tostrcmp
which is UB. Adjusted the condition to ensure null pointers are not passed tostrcmp
.