Skip to content

Load balancer: SDAM and Server Selection #816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Jul 22, 2021

Conversation

kevinAlbs
Copy link
Collaborator

@kevinAlbs kevinAlbs commented Jul 3, 2021

  • CDRIVER-4053 Load balancer: SDAM changes
    • Add a LoadBalanced topology type and LoadBalancer server type.
    • Manually set the topology description to contain one server of type LoadBalancer when opening.
    • Prevent updates to the topology description after opening.
    • Disable SDAM monitoring for a LoadBalanced topology.
    • Emit correct SDAM monitoring events for a LoadBalanced topology.
    • Bypass the cooldown period for the single-threaded topology scanner.
  • CDRIVER-4055 Load balancer: Server Selection
    • Always return the one load balancer in server selection.

@kevinAlbs kevinAlbs force-pushed the sdam.loadbalancer branch 2 times, most recently from b83116d to b346c4e Compare July 7, 2021 02:29
@@ -317,9 +320,6 @@ process_sdam_test_hello_responses (bson_t *phase,
generation);
bson_mutex_unlock (&topology->mutex);
}
} else {
test_error (
"test does not have 'responses' or 'applicationErrors' array");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new discover_load_balancer SDAM test does not include the responses field. I think this is intentional since no hello responses need to be processed to transition to a LoadBalanced topology.

#include "mongoc/mongoc-client-session-private.h"
#include "test-conveniences.h"
#include "test-libmongoc.h"
#include "TestSuite.h"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are not run on Evergreen. I have run this locally to verify, but will add to Evergreen as part of CDRIVER-4062.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these tests/where are they required in the spec?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not required by the spec. I would like to include them as a simple validation of the implementation up to this point.

Load balancer support is still incomplete. My process has been run a narrowly scoped failing test, make the minimal behavior change, then refactor. ("red" => "green" => "refactor").

The server selection JSON spec tests verify server selection works.

The SDAM JSON tests verify that SDAM is disabled.

/loadbalanced/connect/pooled and /loadbalanced/connect/single verify that sending a command uses the server description tied to the connection. Without the recent fix of CDRIVER-3653, those tests fail.

/loadbalanced/cooldown_is_bypassed/single verifies behavior specific to the C implementation. Connection creation in single-threaded goes through the same machinery as SDAM. Without an additional change to bypass the 5 second cooldown, a network error could wait 5 seconds before successfully reconnecting.

/loadbalanced/server_selection_establishes_connection/single verifies a requirement from wrapping drivers: successful server selection implies a connection was established to a server and is ready for commands.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

@kevinAlbs kevinAlbs requested a review from jmikola July 7, 2021 02:46
@kevinAlbs kevinAlbs marked this pull request as ready for review July 7, 2021 02:46
@kevinAlbs kevinAlbs removed the request for review from jmikola July 8, 2021 00:27
@kevinAlbs
Copy link
Collaborator Author

As I was working on handshake changes, I realize some crucial details are missing:

  • Topology updates are not being skipped on the hello response to application connections.
  • Single threaded connections are not getting created correctly since they are created as part of monitoring.

I will add integration tests before re-request review once these are resolved. Moving back to draft.

@kevinAlbs kevinAlbs changed the title Load balancer: SDAM, Server Selection, and Sessions [WIP] Load balancer: SDAM, Server Selection, and Sessions Jul 8, 2021
@kevinAlbs kevinAlbs marked this pull request as draft July 14, 2021 21:01
@kevinAlbs kevinAlbs force-pushed the sdam.loadbalancer branch from 0fb7eef to 472242b Compare July 15, 2021 00:43
@kevinAlbs kevinAlbs changed the title [WIP] Load balancer: SDAM, Server Selection, and Sessions [WIP] Load balancer: SDAM and Server Selection Jul 15, 2021
@kevinAlbs kevinAlbs force-pushed the sdam.loadbalancer branch from d1c6507 to 83f9d8f Compare July 19, 2021 00:10
@@ -807,6 +807,9 @@ mongoc_server_description_new_copy (
&description->error);
} else {
mongoc_server_description_reset (copy);
/* preserve the original server description type, which is manually set
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, this was not necessary for regular (non load balanced) server descriptions, which determine the type when processing the last last_hello_response.

But a load balanced server description does not have a hello response. It is manually created.

@@ -352,6 +352,10 @@ _mongoc_topology_description_server_is_candidate (
default:
return false;
}

case MONGOC_TOPOLOGY_LOAD_BALANCED:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, there is currently no call path to this function. But I added a case to silence compiler warnings on an unhandled enum value.

Load balanced server selection starts in _mongoc_topology_select_server_id_loadbalanced => mongoc_topology_description_select => mongoc_topology_description_suitable_servers.

I alternatively considered making this a BSON_ASSERT (false) since this is untested. But it seemed low impact, and this is the behavior I'd expect if _mongoc_topology_description_server_is_candidate was called on a load balancer topology type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A brief comment to that effect in the code might be nice, as this GitHub context won't easily be available to the reader.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* create connections. The cooldown period does not apply. A network
* error to a load balanced connection does not imply subsequent
* connection attempts will be to the same server and that a delay
* should occur. */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cooldown period is described in the server monitoring specification.

Single threaded connections are created in the topology scanner so go through the machinery that does SDAM monitoring. But in load balanced mode, once a connection is created, it is never monitored. If the connection is closed (e.g. through a network error) the scanner should create a new connection as soon as one is requested.

@@ -86,6 +96,41 @@ _topology_has_description (mongoc_topology_description_t *topology,
BSON_ITER_HOLDS_INT32 (&iter));
expected_generation = bson_iter_int32 (&iter);
ASSERT_CMPINT32 (expected_generation, ==, sd->generation);
} else if (strcmp ("logicalSessionTimeoutMinutes",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are to support the new discover_load_balancer.yml test. The test asserts many fields on a load balancer are unset by setting the expectation to null.

@kevinAlbs kevinAlbs changed the title [WIP] Load balancer: SDAM and Server Selection Load balancer: SDAM and Server Selection Jul 19, 2021
@kevinAlbs kevinAlbs marked this pull request as ready for review July 19, 2021 13:36
@kevinAlbs kevinAlbs requested review from benjirewis and chardan July 19, 2021 13:50
Copy link
Contributor

@chardan chardan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! I think I only see one thing that probably should be addressed (mongoc_server_description_new_copy() is unchecked), otherwise +1 from me-- but please
bear in mind that I'm not too familiar with this behavior or the codebase yet.

* an error would have occurred when constructing the topology. */
BSON_ASSERT (td->servers->items_len == 1);
sd = (mongoc_server_description_t *) mongoc_set_get_item (td->servers, 0);
prev_sd = mongoc_server_description_new_copy (sd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that mongoc_server_description_new_copy() can return NULL on failure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Added a BSON_ASSERT since this would be an OOM error.

if (!selected_server) {
_mongoc_server_selection_error (
"No suitable server found in load balanced deployment", NULL, error);
bson_mutex_unlock (&topology->mutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It looks like what you've written here correctly releases the mutex, but you might also consider the goto-return idiom, which makes this somewhat easier to clean up in the face of new code being added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you might want to release the mutex as soon as possible, rather than holding it for the duration of the function. Ok. (Although in that case, would it be better as its own function? Tough call here IMO-- it's fine as-is.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in this case I did not want to hold the mutex for the remainder of the function. I will leave as-is.

int topology_changed_events;
int topology_opening_events;
int topology_closed_events;
} stats_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though this is in test code, "stats_t" seems generic enough that it has a high chance of conflicting with another symbol that the test might link to.

{
mongoc_apm_callbacks_t *cbs;

cbs = mongoc_apm_callbacks_new ();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: ok to ignore result, bson_malloc0() promises to call abort() on failure.

MONGOC_ERROR_STREAM_NOT_ESTABLISHED,
"Could not establish stream");

/* Failing to "scan" would normally cause the node to be in cooldown and fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM mod @chardan's comments. Thanks again for all the offline explanation.

#include "mongoc/mongoc-client-session-private.h"
#include "test-conveniences.h"
#include "test-libmongoc.h"
#include "TestSuite.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

@kevinAlbs kevinAlbs merged commit f9db623 into mongodb:master Jul 22, 2021
@kevinAlbs kevinAlbs deleted the sdam.loadbalancer branch July 22, 2021 21:13
@kevinAlbs kevinAlbs restored the sdam.loadbalancer branch February 3, 2022 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants