Skip to content

CDRIVER-4193 Ensure OP_MSG for handshakes and fix RPC op_egress counters #1256

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
May 3, 2023

Conversation

eramongodb
Copy link
Contributor

@eramongodb eramongodb commented May 2, 2023

Description

This PR resolves CDRIVER-4193, CDRIVER-4622, and CDRIVER-4265. Verified by this patch.

OP_MSG for Handshakes (CDRIVER-4193 and CDRIVER-4265)

Drivers are required to use the "hello" command with the OP_MSG protocol for initial connection handshakes when either a Stable API Version (serverApi=<value>) or Load Balancing mode (loadBalanced=true) is requested by the client. They should not be using the "legacy hello" command with the OP_QUERY protocol if either of these options are set.

Following CDRIVER-4629 (#1248 and #1251), initial connection handshakes performed by clients were verified as correctly selecting OP_MSG vs. OP_QUERY depending on serverApi (specifically in mongoc_cluster_run_command_private).

When using client pools, a server monitor thread is spawned, and if the server has a maxWireVersion equivalent to server version 4.4 or newer, an additional RTT thread is spawned. Although the monitor thread was correctly checking for serverApi during initial connection handshakes, the RTT thread was unconditionally using OP_QUERY for its initial connection handshakes, and both the monitor and RTT threads were unconditionally using OP_QUERY for their polling hellos. The new _server_monitor_send_and_recv function is now used by both threads to perform correct OP_MSG/OP_QUERY selection.

Subsequently, all sites that check for serverApi via mongoc_<object>_uses_server_api were updated to also check for loadbalanced=true via the new mongoc_<object>_uses_loadbalanced function.

CDRIVER-4622 RPC op_egress Counters

The _mongoc_rpc_gather function served the dual-purpose of translating RPC messages into an array of iovec structures that are used to write the messages to the streams, as well as incrementing the op_egress performance counter corresponding to the opcode of the RPC message to be written. This dual purpose led to overcounting within the topology scanner when multiple TCP connections are scheduled for each DNS result, but only the first successful TCP(+TLS) connection is used to actually write the RPC message.

This PR redefines _mongoc_rpc_gather to perform RPC message translation only. The increment of RPC op_egress counters is now done via the new dedicated _mongoc_rpc_op_egress_inc function. A temporary _mongoc_rpc_gather_no_inc function was used to track this separation during refactors (see the individual "RPC egress increment" commits for details). Once all references to the old _mongoc_rpc_gather were replaced with _mongoc_rpc_gather_no_inc and calls to _mongoc_rpc_op_egress_inc were placed where required, the _mongoc_rpc_gather_no_inc was renamed to _mongoc_rpc_gather, effectively substituting the old definition with the new non-incrementing definition.

/counters/rpc/op_egress Tests

The bulk of this PR (besides the addition of mongoc_<object>_uses_loadbalanced) is the addition of the /counters/rpc/op_egress tests to validate correct behavior of the RPC op_egress counters and OP_MSG/OP_QUERY selection during refactors. With the exception of the fix of the RTT thread and the mock server, all refactors of old calls to _mongoc_rpc_gather were synchronized with updates to the stack traces in the tests that document the connect/command -> RPC op_egress counter. These comments should hopefully also serve as a useful reference for developers to understand the C Driver's connection handling process.

These tests should cover all code paths that lead to RPC op_egress counter increments. With the exception of the /counters/rpc/op_egress/auth tests, the RPC op_egress tests use the mock server to deterministically time and test counter values incremented during initial connection handshakes. The auth tests run against a live server, with a 4.4+ requirement for /op_msg subtests that use serverApi to force OP_MSG for initial connection handshakes.

Documentation

The documentation for Performance Counters was updated with a note clarifying that op_egress counters do not account for connection or handshake/command failures.

@eramongodb eramongodb self-assigned this May 2, 2023
Copy link
Contributor

@vector-of-bool vector-of-bool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. The comments in the /counters/rpc/op_egress tests are much appreciated as a reference. LGTM with suggested changes to tests.

Copy link
Contributor Author

@eramongodb eramongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes verified by this patch.

@eramongodb eramongodb requested a review from kevinAlbs May 3, 2023 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants