
CDRIVER-3645 Close operation connections to removed servers #1618


Merged: 10 commits into mongodb:master on May 30, 2024

Conversation

@kevinAlbs (Collaborator) commented May 28, 2024

Summary

Close operation connections to removed servers.

Verified with this patch build: https://spruce.mongodb.com/version/665489593f686b000739adf0

Background & Motivation

Motivated by an issue reported in HELP-59749. Tests observed an unexpectedly increasing connection count. This may have been triggered by (possibly misconfigured) DNS records that caused servers to be frequently added to and removed from the topology during SRV polling.

Rejected alternative: callback

I considered adding a callback to _mongoc_topology_description_remove_server to signal more immediately that connections to the removed server should be closed. This was rejected to reduce the risk of an AB-BA deadlock.

_mongoc_topology_description_remove_server expects the caller to hold mongoc_topology_t::tpld_modification_mtx. Modifying a pooled client requires holding mongoc_client_pool_t::mutex.

There is an existing pattern of locking mongoc_client_pool_t::mutex and then mongoc_topology_t::tpld_modification_mtx (via mongoc_client_pool_try_pop => _start_scanner_if_needed => _mongoc_topology_background_monitoring_start => mc_tpld_modify_begin). A callback invoked while tpld_modification_mtx is held that then locks the pool mutex would acquire the same two locks in the opposite order.
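
For illustration, a minimal sketch (not driver code; the mutexes below are placeholders standing in for the fields named above) of why taking the same two locks in opposite orders risks an AB-BA deadlock:

```c
#include <pthread.h>

// Placeholder locks: pool_mutex stands in for mongoc_client_pool_t::mutex,
// tpld_mod_mutex for mongoc_topology_t::tpld_modification_mtx.
static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tpld_mod_mutex = PTHREAD_MUTEX_INITIALIZER;

// Existing pattern (e.g. the mongoc_client_pool_try_pop path): pool mutex first.
static void *
existing_pattern (void *arg)
{
   (void) arg;
   pthread_mutex_lock (&pool_mutex);     // A
   pthread_mutex_lock (&tpld_mod_mutex); // then B
   /* ... start background monitoring ... */
   pthread_mutex_unlock (&tpld_mod_mutex);
   pthread_mutex_unlock (&pool_mutex);
   return NULL;
}

// Hypothetical remove-server callback: tpld_modification_mtx is already held by
// the caller, then the pool mutex would be needed to close pooled connections.
static void *
rejected_callback (void *arg)
{
   (void) arg;
   pthread_mutex_lock (&tpld_mod_mutex); // B
   pthread_mutex_lock (&pool_mutex);     // then A: opposite order => possible deadlock
   /* ... close pooled clients' connections to the removed server ... */
   pthread_mutex_unlock (&pool_mutex);
   pthread_mutex_unlock (&tpld_mod_mutex);
   return NULL;
}
```

If one thread runs the existing pattern while another runs the callback, each can end up holding one lock and waiting forever on the other.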

Instead, this PR proposes lazily pruning stale connections on mongoc_client_pool_push. I expect multi-threaded applications to pop and push clients frequently to encourage reuse of clients among threads. A note is added to the documentation to make this recommendation explicit (a usage sketch follows).
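
For reference, a minimal sketch of the recommended pop/push pattern (pool setup, teardown, and full error handling elided):

```c
#include <mongoc/mongoc.h>
#include <stdio.h>

// Pop a client from the pool, run the operation, and push the client back
// promptly so it can be reused by other threads. With this change, pushing
// is also the point where connections to removed servers may be pruned.
static void
run_ping (mongoc_client_pool_t *pool)
{
   mongoc_client_t *client = mongoc_client_pool_pop (pool);
   bson_t *ping = BCON_NEW ("ping", BCON_INT32 (1));
   bson_error_t error;

   if (!mongoc_client_command_simple (client, "admin", ping, NULL /* read prefs */, NULL /* reply */, &error)) {
      fprintf (stderr, "ping failed: %s\n", error.message);
   }

   bson_destroy (ping);
   mongoc_client_pool_push (pool, client);
}
```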

Performance Impact

A benchmark was run to compare the throughput of repeated pool use (pop + command + push) with pruning enabled:

| Test | Baseline (MB/sec) | With simple pruning | With conditional pruning |
| --- | --- | --- | --- |
| Parallel/Pool/ShareClients/Threads:1 | 251995 | 236063 (-6.32%) | 248131 (-1.53%) |
| Parallel/Pool/ShareClients/Threads:10 | 1364224 | 873991 (-35.93%) | 1337326 (-1.97%) |
| Parallel/Pool/ShareClients/Threads:100 | 2017530 | 1050534 (-47.93%) | 2020829 (0.16%) |
| Parallel/Pool/ShareClients/Threads:150 | 1540427 | 1571373 (2.01%) | 1552121 (0.76%) |

An initial solution ("With simple pruning") iterated over all pooled clients to prune on each call to mongoc_client_pool_push. Due to the performance impact, checks were added to prune only when a topology description change was detected ("With conditional pruning"). This motivated storing the last_known_serverids in the mongoc_client_pool_t to quickly determine whether the set of server IDs has changed. I expect a change to the set of server IDs to be a rare event (e.g. HELP-59749 notes an extreme case where SRV records unexpectedly change; even then, I expect the set of server IDs to change only about once per minute).
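
A minimal sketch of that cheap check (illustrative only; server_ids_changed is not a function in the driver, and the merged logic appears in the diff excerpts below). Both ID lists are assumed sorted:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Compare the pool's cached (last known) sorted server IDs against the current
// topology description's sorted server IDs. Only when they differ does the
// expensive walk over pooled clients need to run.
static bool
server_ids_changed (const uint32_t *last_known, size_t last_known_len, const uint32_t *current, size_t current_len)
{
   if (last_known_len != current_len) {
      return true;
   }
   for (size_t i = 0; i < last_known_len; i++) {
      if (last_known[i] != current[i]) {
         return true; // Set changed: prune connections to removed servers.
      }
   }
   return false; // Set unchanged: skip pruning.
}
```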

kevinAlbs added 4 commits May 28, 2024 11:50
Track last known server IDs in the client pool. Before doing an expensive prune, do a cheaper check to see if server IDs changed.
Servers may be removed from the topology in other scenarios aside from SRV polling (e.g. during SDAM)
@kevinAlbs marked this pull request as ready for review May 28, 2024 12:16
kevinAlbs and others added 3 commits May 28, 2024 15:43
@kevinAlbs requested a review from eramongodb May 28, 2024 20:18
@kevinAlbs requested a review from kkloberdanz May 29, 2024 19:28
@vector-of-bool (Contributor) left a comment

LGTM with minor remarks

Comment on lines 384 to 395
static bool
maybe_prune (void *item, void *ctx_)
{
   mongoc_cluster_node_t *cn = (mongoc_cluster_node_t *) item;
   prune_ctx *ctx = (prune_ctx *) ctx_;
   uint32_t server_id = cn->handshake_sd->id;

   if (!bsearch (&server_id, ctx->server_ids->data, ctx->server_ids->len, sizeof (uint32_t), server_id_cmp)) {
      mongoc_cluster_disconnect_node (ctx->cluster, server_id);
   }
   return true;
}
Contributor:

Recommend adding commentary on what this function does and why.

Comment on lines 410 to 430
   for (; idx1 < cluster->nodes->items_len; idx1++) {
      // Compare both sorted lists in order. `cluster->nodes` may be a smaller list if not all servers were used.
      mongoc_set_item_t *cn = &cluster->nodes->items[idx1];
      bool found = false;
      for (; idx2 < known_server_ids->len; idx2++) {
         uint32_t last_known = _mongoc_array_index (known_server_ids, uint32_t, idx2);
         if (cn->id == last_known) {
            found = true;
            break;
         }
      }
      if (!found) {
         // A server in the cluster is not in the last known server ids. Prune it.
         needs_prune = true;
         break;
      }
   }

   if (!needs_prune) {
      return;
   }
Contributor:

Is it likely that this early-return saves a significant amount of time for the additional complexity that it adds?

kevinAlbs (Collaborator, PR author):

Good question. Another benchmark does not show significant change with this early-return removed:

| Test | Baseline (all conditional pruning) | Without conditional client pruning | Without conditional pruning (simple prune) |
| --- | --- | --- | --- |
| Parallel/Pool/ShareClients/Threads:1 | 249159 | 246554 (-1.05%) | 235143 (-5.63%) |
| Parallel/Pool/ShareClients/Threads:10 | 1312571 | 1313206 (0.05%) | 866941 (-33.95%) |
| Parallel/Pool/ShareClients/Threads:100 | 1932369 | 1917569 (-0.77%) | 1194404 (-38.19%) |
| Parallel/Pool/ShareClients/Threads:150 | 1555300 | 1564356 (0.58%) | 1564161 (0.57%) |

The "Baseline" is the current changeset. The "Without conditional client pruning" removes this early-return.

Updated to remove the early-return.

kevinAlbs added 2 commits May 30, 2024 12:46
And rename `server_ids` to `known_server_ids` for consistency
@kevinAlbs merged commit e80ae7f into mongodb:master on May 30, 2024 (40 of 42 checks passed)
kevinAlbs added a commit to kevinAlbs/mongo-c-driver that referenced this pull request May 30, 2024
…1618)

* add regression test: check connection counts of servers removed by SRV polling

* prune clients on push

Track last known server IDs in the client pool. Before doing an expensive prune, do a cheaper check to see if server IDs changed.

* test direct topology changes

Servers may be removed from the topology in other scenarios aside from SRV polling (e.g. during SDAM)

* document expected use of pop/push

---------

Co-authored-by: Ezra Chung <[email protected]>