
CDRIVER-3645 Close operation connections to removed servers #1618


Merged: 10 commits into mongodb:master on May 30, 2024

Conversation

@kevinAlbs (Collaborator) commented May 28, 2024

Summary

Close operation connections to removed servers.

Verified with this patch build: https://spruce.mongodb.com/version/665489593f686b000739adf0

Background & Motivation

Motivated by an issue reported in HELP-59749. Tests observed an unexpectedly increasing connection count. This may have been triggered by (possibly misconfigured) DNS records that caused servers to be frequently added to and removed from the topology during SRV polling.

Rejected alternative: callback

I considered adding a callback to _mongoc_topology_description_remove_server to signal more immediately that connections to the removed server should be closed. This was rejected to reduce the risk of an AB-BA deadlock.

_mongoc_topology_description_remove_server expects the caller to hold mongoc_topology_t::tpld_modification_mtx. Modifying a pooled client requires holding mongoc_client_pool_t::mutex.

There is an existing pattern of locking mongoc_client_pool_t::mutex and then mongoc_topology_t::tpld_modification_mtx (via mongoc_client_pool_try_pop => _start_scanner_if_needed => _mongoc_topology_background_monitoring_start => mc_tpld_modify_begin). A callback invoked while tpld_modification_mtx is held that then locks the pool mutex would acquire the same two locks in the opposite order.
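
For illustration, a minimal sketch (not driver code; the mutexes below are placeholders standing in for the fields named above) of why taking the same two locks in opposite orders risks an AB-BA deadlock:

```c
#include <pthread.h>

// Placeholder locks: pool_mutex stands in for mongoc_client_pool_t::mutex,
// tpld_mod_mutex for mongoc_topology_t::tpld_modification_mtx.
static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tpld_mod_mutex = PTHREAD_MUTEX_INITIALIZER;

// Existing pattern (e.g. the mongoc_client_pool_try_pop path): pool mutex first.
static void *
existing_pattern (void *arg)
{
   (void) arg;
   pthread_mutex_lock (&pool_mutex);     // A
   pthread_mutex_lock (&tpld_mod_mutex); // then B
   /* ... start background monitoring ... */
   pthread_mutex_unlock (&tpld_mod_mutex);
   pthread_mutex_unlock (&pool_mutex);
   return NULL;
}

// Hypothetical remove-server callback: tpld_modification_mtx is already held by
// the caller, then the pool mutex would be needed to close pooled connections.
static void *
rejected_callback (void *arg)
{
   (void) arg;
   pthread_mutex_lock (&tpld_mod_mutex); // B
   pthread_mutex_lock (&pool_mutex);     // then A: opposite order => possible deadlock
   /* ... close pooled clients' connections to the removed server ... */
   pthread_mutex_unlock (&pool_mutex);
   pthread_mutex_unlock (&tpld_mod_mutex);
   return NULL;
}
```

If one thread runs the existing pattern while another runs the callback, each can end up holding one lock and waiting forever on the other.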

Instead, this PR proposes lazily pruning stale connections on mongoc_client_pool_push. I expect multi-threaded applications to pop and push clients frequently to encourage reuse of clients among threads. A note is added to the documentation to make this recommendation explicit (a usage sketch follows).
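
For reference, a minimal sketch of the recommended pop/push pattern (pool setup, teardown, and full error handling elided):

```c
#include <mongoc/mongoc.h>
#include <stdio.h>

// Pop a client from the pool, run the operation, and push the client back
// promptly so it can be reused by other threads. With this change, pushing
// is also the point where connections to removed servers may be pruned.
static void
run_ping (mongoc_client_pool_t *pool)
{
   mongoc_client_t *client = mongoc_client_pool_pop (pool);
   bson_t *ping = BCON_NEW ("ping", BCON_INT32 (1));
   bson_error_t error;

   if (!mongoc_client_command_simple (client, "admin", ping, NULL /* read prefs */, NULL /* reply */, &error)) {
      fprintf (stderr, "ping failed: %s\n", error.message);
   }

   bson_destroy (ping);
   mongoc_client_pool_push (pool, client);
}
```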

Performance Impact

A benchmark was run to compare the throughput of repeated pool use (pop + command + push) with pruning enabled:

| Test | Baseline (MB/sec) | With simple pruning | With conditional pruning |
| --- | --- | --- | --- |
| Parallel/Pool/ShareClients/Threads:1 | 251995 | 236063 (-6.32%) | 248131 (-1.53%) |
| Parallel/Pool/ShareClients/Threads:10 | 1364224 | 873991 (-35.93%) | 1337326 (-1.97%) |
| Parallel/Pool/ShareClients/Threads:100 | 2017530 | 1050534 (-47.93%) | 2020829 (0.16%) |
| Parallel/Pool/ShareClients/Threads:150 | 1540427 | 1571373 (2.01%) | 1552121 (0.76%) |

An initial solution ("With simple pruning") iterated over all pooled clients to prune on each call to mongoc_client_pool_push. Due to the performance impact, checks were added to prune only when a topology description change was detected ("With conditional pruning"). This motivated storing the last_known_serverids in the mongoc_client_pool_t to quickly determine whether the set of server IDs has changed. I expect a change to the set of server IDs to be a rare event (e.g. HELP-59749 notes an extreme case where SRV records unexpectedly change; even then, I expect the set of server IDs to change only about once per minute).
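
A minimal sketch of that cheap check (illustrative only; server_ids_changed is not a function in the driver, and the merged logic appears in the diff excerpts below). Both ID lists are assumed sorted:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Compare the pool's cached (last known) sorted server IDs against the current
// topology description's sorted server IDs. Only when they differ does the
// expensive walk over pooled clients need to run.
static bool
server_ids_changed (const uint32_t *last_known, size_t last_known_len, const uint32_t *current, size_t current_len)
{
   if (last_known_len != current_len) {
      return true;
   }
   for (size_t i = 0; i < last_known_len; i++) {
      if (last_known[i] != current[i]) {
         return true; // Set changed: prune connections to removed servers.
      }
   }
   return false; // Set unchanged: skip pruning.
}
```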

kevinAlbs added 4 commits May 28, 2024 11:50
Track last known server IDs in the client pool. Before doing an expensive prune, do a cheaper check to see if server IDs changed.
Servers may be removed from the topology in other scenarios aside from SRV polling (e.g. during SDAM)
@kevinAlbs marked this pull request as ready for review May 28, 2024 12:16
kevinAlbs and others added 3 commits May 28, 2024 15:43
@kevinAlbs requested a review from eramongodb May 28, 2024 20:18
@kevinAlbs requested a review from kkloberdanz May 29, 2024 19:28
@vector-of-bool (Contributor) left a comment

LGTM with minor remarks

Comment on lines 384 to 395
static bool
maybe_prune (void *item, void *ctx_)
{
   mongoc_cluster_node_t *cn = (mongoc_cluster_node_t *) item;
   prune_ctx *ctx = (prune_ctx *) ctx_;
   uint32_t server_id = cn->handshake_sd->id;

   if (!bsearch (&server_id, ctx->server_ids->data, ctx->server_ids->len, sizeof (uint32_t), server_id_cmp)) {
      mongoc_cluster_disconnect_node (ctx->cluster, server_id);
   }
   return true;
}
Contributor:

Recommend adding commentary on what this function does and why.

Comment on lines 410 to 430
   for (; idx1 < cluster->nodes->items_len; idx1++) {
      // Compare both sorted lists in order. `cluster->nodes` may be a smaller list if not all servers were used.
      mongoc_set_item_t *cn = &cluster->nodes->items[idx1];
      bool found = false;
      for (; idx2 < known_server_ids->len; idx2++) {
         uint32_t last_known = _mongoc_array_index (known_server_ids, uint32_t, idx2);
         if (cn->id == last_known) {
            found = true;
            break;
         }
      }
      if (!found) {
         // A server in the cluster is not in the last known server ids. Prune it.
         needs_prune = true;
         break;
      }
   }

   if (!needs_prune) {
      return;
   }
Contributor:

Is it likely that this early-return saves a significant amount of time for the additional complexity that it adds?

kevinAlbs (Collaborator, PR author):

Good question. Another benchmark does not show significant change with this early-return removed:

| Test | Baseline (all conditional pruning) | Without conditional client pruning | Without conditional pruning (simple prune) |
| --- | --- | --- | --- |
| Parallel/Pool/ShareClients/Threads:1 | 249159 | 246554 (-1.05%) | 235143 (-5.63%) |
| Parallel/Pool/ShareClients/Threads:10 | 1312571 | 1313206 (0.05%) | 866941 (-33.95%) |
| Parallel/Pool/ShareClients/Threads:100 | 1932369 | 1917569 (-0.77%) | 1194404 (-38.19%) |
| Parallel/Pool/ShareClients/Threads:150 | 1555300 | 1564356 (0.58%) | 1564161 (0.57%) |

The "Baseline" is the current changeset. The "Without conditional client pruning" removes this early-return.

Updated to remove the early-return.

kevinAlbs added 2 commits May 30, 2024 12:46
And rename `server_ids` to `known_server_ids` for consistency
@kevinAlbs merged commit e80ae7f into mongodb:master on May 30, 2024 (40 of 42 checks passed)
kevinAlbs added a commit to kevinAlbs/mongo-c-driver that referenced this pull request May 30, 2024
…1618)

* add regression test: check connection counts of servers removed by SRV polling

* prune clients on push

Track last known server IDs in the client pool. Before doing an expensive prune, do a cheaper check to see if server IDs changed.

* test direct topology changes

Servers may be removed from the topology in other scenarios aside from SRV polling (e.g. during SDAM)

* document expected use of pop/push

---------

Co-authored-by: Ezra Chung <[email protected]>