CDRIVER-3645 Close operation connections to removed servers #1618
Conversation
Track last known server IDs in the client pool. Before doing an expensive prune, do a cheaper check to see if server IDs changed.
Servers may be removed from the topology in other scenarios aside from SRV polling (e.g. during SDAM)
Co-authored-by: Ezra Chung <[email protected]>
Reduces the scope holding the shared topology lock.
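As a rough, generic sketch of that lock-scope reduction (plain pthreads with illustrative names, not the driver's actual code): copy what is needed while holding the shared lock, then do the expensive work after releasing it.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
   pthread_mutex_t lock; /* stands in for the shared topology lock */
   uint32_t *server_ids; /* current server IDs, protected by lock */
   size_t n_server_ids;
} shared_topology_t;

/* Copy the server IDs while holding the lock, then release it. Any expensive
 * comparison or pruning runs on the copy, outside the critical section.
 * (Error handling omitted for brevity; caller frees the returned array.) */
static uint32_t *
snapshot_server_ids (shared_topology_t *topo, size_t *n_out)
{
   pthread_mutex_lock (&topo->lock);
   uint32_t *copy = malloc (topo->n_server_ids * sizeof (uint32_t));
   memcpy (copy, topo->server_ids, topo->n_server_ids * sizeof (uint32_t));
   *n_out = topo->n_server_ids;
   pthread_mutex_unlock (&topo->lock);
   return copy;
}
```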
LGTM with minor remarks
```c
static bool
maybe_prune (void *item, void *ctx_)
{
   mongoc_cluster_node_t *cn = (mongoc_cluster_node_t *) item;
   prune_ctx *ctx = (prune_ctx *) ctx_;
   uint32_t server_id = cn->handshake_sd->id;

   if (!bsearch (&server_id, ctx->server_ids->data, ctx->server_ids->len, sizeof (uint32_t), server_id_cmp)) {
      mongoc_cluster_disconnect_node (ctx->cluster, server_id);
   }
   return true;
}
```
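For reference, `server_id_cmp` is not shown in this excerpt; a minimal sketch of the kind of comparator the `bsearch` call above assumes (the driver's actual implementation may differ) could look like:

```c
#include <stdint.h>

/* Sketch of a bsearch-style comparator over uint32_t server IDs. */
static int
server_id_cmp (const void *a, const void *b)
{
   const uint32_t lhs = *(const uint32_t *) a;
   const uint32_t rhs = *(const uint32_t *) b;

   if (lhs < rhs) {
      return -1;
   }
   if (lhs > rhs) {
      return 1;
   }
   return 0;
}
```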
Recommend adding commentary on what this function does and why.
```c
for (; idx1 < cluster->nodes->items_len; idx1++) {
   // Compare both sorted lists in order. `cluster->nodes` may be a smaller list if not all servers were used.
   mongoc_set_item_t *cn = &cluster->nodes->items[idx1];
   bool found = false;
   for (; idx2 < known_server_ids->len; idx2++) {
      uint32_t last_known = _mongoc_array_index (known_server_ids, uint32_t, idx2);
      if (cn->id == last_known) {
         found = true;
         break;
      }
   }
   if (!found) {
      // A server in the cluster is not in the last known server ids. Prune it.
      needs_prune = true;
      break;
   }
}

if (!needs_prune) {
   return;
}
```
Is it likely that this early-return saves a significant amount of time for the additional complexity that it adds?
Good question. Another benchmark does not show significant change with this early-return removed:
| Test | Baseline (all conditional pruning) | Without conditional client pruning | Without conditional pruning (simple prune) |
| --- | --- | --- | --- |
| Parallel/Pool/ShareClients/Threads:1 | 249159 | 246554 (-1.05%) | 235143 (-5.63%) |
| Parallel/Pool/ShareClients/Threads:10 | 1312571 | 1313206 (0.05%) | 866941 (-33.95%) |
| Parallel/Pool/ShareClients/Threads:100 | 1932369 | 1917569 (-0.77%) | 1194404 (-38.19%) |
| Parallel/Pool/ShareClients/Threads:150 | 1555300 | 1564356 (0.58%) | 1564161 (0.57%) |
The "Baseline" is the current changeset. The "Without conditional client pruning" removes this early-return.
Updated to remove the early-return.
And rename `server_ids` to `known_server_ids` for consistency
…1618)

* add regression test: check connection counts of servers removed by SRV polling

* prune clients on push

Track last known server IDs in the client pool. Before doing an expensive prune, do a cheaper check to see if server IDs changed.

* test direct topology changes

Servers may be removed from the topology in other scenarios aside from SRV polling (e.g. during SDAM)

* document expected use of pop/push

---------

Co-authored-by: Ezra Chung <[email protected]>
Summary
Close operation connections to removed servers.
Verified with this patch build: https://spruce.mongodb.com/version/665489593f686b000739adf0
Background & Motivation
Motivated by an issue reported in HELP-59749. Tests observed an unexpectedly increasing connection count. This may have been triggered by (misconfigured?) DNS records resulting in servers being frequently added to and removed from the topology during SRV polling.
Rejected alternative: callback
I considered adding a callback to `_mongoc_topology_description_remove_server` to more immediately signal to close connections. This was rejected to reduce the risk of an AB-BA deadlock: `_mongoc_topology_description_remove_server` expects the caller to hold `mongoc_topology_t::tpld_modification_mtx`, while modifying a pooled client requires holding `mongoc_client_pool_t::mutex`. There is an existing pattern of locking `mongoc_client_pool_t::mutex` then `mongoc_topology_t::tpld_modification_mtx` (via `mongoc_client_pool_try_pop` => `_start_scanner_if_needed` => `_mongoc_topology_background_monitoring_start` => `mc_tpld_modify_begin`).
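For illustration only (plain pthreads, not driver code): the AB-BA hazard is that one path acquires the pool mutex then the topology mutex, while a removal callback would acquire them in the opposite order.

```c
#include <pthread.h>

static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER; /* stands in for mongoc_client_pool_t::mutex */
static pthread_mutex_t tpld_mutex = PTHREAD_MUTEX_INITIALIZER; /* stands in for tpld_modification_mtx */

/* Existing pattern: pool mutex first, then topology mutex. */
static void
pop_path (void)
{
   pthread_mutex_lock (&pool_mutex);
   pthread_mutex_lock (&tpld_mutex);
   pthread_mutex_unlock (&tpld_mutex);
   pthread_mutex_unlock (&pool_mutex);
}

/* Rejected callback: the topology mutex is already held, then the pool mutex
 * would be needed -- the opposite order, which can deadlock against pop_path. */
static void
remove_server_callback (void)
{
   pthread_mutex_lock (&tpld_mutex);
   pthread_mutex_lock (&pool_mutex);
   pthread_mutex_unlock (&pool_mutex);
   pthread_mutex_unlock (&tpld_mutex);
}
```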
Instead, this PR proposes lazily pruning stale connections on `mongoc_client_pool_push`. I expect multi-threaded applications to pop/push clients frequently to encourage reuse of clients among threads. A note is added to the documentation to make this recommendation explicit.
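A minimal sketch of that recommended pop/push pattern, using the public pool API (the database name and command here are placeholders):

```c
#include <mongoc/mongoc.h>
#include <stdio.h>

/* Each worker pops a client, runs its operation, and pushes the client back
 * promptly. Per this PR, stale connections to removed servers are pruned on
 * the push. */
static void
run_one_operation (mongoc_client_pool_t *pool)
{
   mongoc_client_t *client = mongoc_client_pool_pop (pool);
   bson_t *cmd = BCON_NEW ("ping", BCON_INT32 (1));
   bson_error_t error;

   if (!mongoc_client_command_simple (client, "admin", cmd, NULL, NULL, &error)) {
      fprintf (stderr, "ping failed: %s\n", error.message);
   }

   bson_destroy (cmd);
   mongoc_client_pool_push (pool, client);
}
```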
Performance Impact
A benchmark was run to compare throughput of repeated use of the pool through pop+command+push with pruning:
An initial solution ("With simple pruning") iterated over all pooled clients to prune on each call to `mongoc_client_pool_push`. Due to the performance impact, checks were added to only prune if a topology description change was detected ("With conditional pruning"). This motivated storing the `last_known_serverids` in the `mongoc_client_pool_t` to quickly determine if the set of server IDs changed. I expect the set of server IDs changing to be a "rare" event (e.g. HELP-59749 notes an extreme case where SRV records unexpectedly change; even then, I expect the set of server IDs would change about once per minute).
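As a sketch of why that check is cheap (illustrative only, not the driver's exact code): with both ID lists kept sorted, detecting a change in the set of server IDs is a length comparison plus a single pass.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Both arrays are sorted ascending, so equal sets of IDs imply byte-equal arrays. */
static bool
server_ids_changed (const uint32_t *last_known, size_t last_known_len,
                    const uint32_t *current, size_t current_len)
{
   if (last_known_len != current_len) {
      return true;
   }
   return memcmp (last_known, current, current_len * sizeof (uint32_t)) != 0;
}
```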