-
Notifications
You must be signed in to change notification settings - Fork 455
Load balancer: SDAM and Server Selection #816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b83116d
to
b346c4e
Compare
@@ -317,9 +320,6 @@ process_sdam_test_hello_responses (bson_t *phase, | |||
generation); | |||
bson_mutex_unlock (&topology->mutex); | |||
} | |||
} else { | |||
test_error ( | |||
"test does not have 'responses' or 'applicationErrors' array"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new discover_load_balancer SDAM test does not include the responses
field. I think this is intentional since no hello responses need to be processed to transition to a LoadBalanced
topology.
#include "mongoc/mongoc-client-session-private.h" | ||
#include "test-conveniences.h" | ||
#include "test-libmongoc.h" | ||
#include "TestSuite.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are not run on Evergreen. I have run this locally to verify, but will add to Evergreen as part of CDRIVER-4062.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these tests/where are they required in the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not required by the spec. I would like to include them as a simple validation of the implementation up to this point.
Load balancer support is still incomplete. My process has been run a narrowly scoped failing test, make the minimal behavior change, then refactor. ("red" => "green" => "refactor").
The server selection JSON spec tests verify server selection works.
The SDAM JSON tests verify that SDAM is disabled.
/loadbalanced/connect/pooled
and /loadbalanced/connect/single
verify that sending a command uses the server description tied to the connection. Without the recent fix of CDRIVER-3653, those tests fail.
/loadbalanced/cooldown_is_bypassed/single
verifies behavior specific to the C implementation. Connection creation in single-threaded goes through the same machinery as SDAM. Without an additional change to bypass the 5 second cooldown, a network error could wait 5 seconds before successfully reconnecting.
/loadbalanced/server_selection_establishes_connection/single
verifies a requirement from wrapping drivers: successful server selection implies a connection was established to a server and is ready for commands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
As I was working on handshake changes, I realize some crucial details are missing:
I will add integration tests before re-request review once these are resolved. Moving back to draft. |
0fb7eef
to
472242b
Compare
This reverts commit c18947e035b94e095d394dbd1764ee50a3fc9a3e.
d1c6507
to
83f9d8f
Compare
@@ -807,6 +807,9 @@ mongoc_server_description_new_copy ( | |||
&description->error); | |||
} else { | |||
mongoc_server_description_reset (copy); | |||
/* preserve the original server description type, which is manually set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, this was not necessary for regular (non load balanced) server descriptions, which determine the type when processing the last last_hello_response
.
But a load balanced server description does not have a hello response. It is manually created.
@@ -352,6 +352,10 @@ _mongoc_topology_description_server_is_candidate ( | |||
default: | |||
return false; | |||
} | |||
|
|||
case MONGOC_TOPOLOGY_LOAD_BALANCED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, there is currently no call path to this function. But I added a case to silence compiler warnings on an unhandled enum value.
Load balanced server selection starts in _mongoc_topology_select_server_id_loadbalanced
=> mongoc_topology_description_select
=> mongoc_topology_description_suitable_servers
.
I alternatively considered making this a BSON_ASSERT (false)
since this is untested. But it seemed low impact, and this is the behavior I'd expect if _mongoc_topology_description_server_is_candidate
was called on a load balancer topology type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A brief comment to that effect in the code might be nice, as this GitHub context won't easily be available to the reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* create connections. The cooldown period does not apply. A network | ||
* error to a load balanced connection does not imply subsequent | ||
* connection attempts will be to the same server and that a delay | ||
* should occur. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cooldown period is described in the server monitoring specification.
Single threaded connections are created in the topology scanner so go through the machinery that does SDAM monitoring. But in load balanced mode, once a connection is created, it is never monitored. If the connection is closed (e.g. through a network error) the scanner should create a new connection as soon as one is requested.
@@ -86,6 +96,41 @@ _topology_has_description (mongoc_topology_description_t *topology, | |||
BSON_ITER_HOLDS_INT32 (&iter)); | |||
expected_generation = bson_iter_int32 (&iter); | |||
ASSERT_CMPINT32 (expected_generation, ==, sd->generation); | |||
} else if (strcmp ("logicalSessionTimeoutMinutes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes are to support the new discover_load_balancer.yml test. The test asserts many fields on a load balancer are unset by setting the expectation to null
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! I think I only see one thing that probably should be addressed (mongoc_server_description_new_copy() is unchecked), otherwise +1 from me-- but please
bear in mind that I'm not too familiar with this behavior or the codebase yet.
* an error would have occurred when constructing the topology. */ | ||
BSON_ASSERT (td->servers->items_len == 1); | ||
sd = (mongoc_server_description_t *) mongoc_set_get_item (td->servers, 0); | ||
prev_sd = mongoc_server_description_new_copy (sd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that mongoc_server_description_new_copy() can return NULL on failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Added a BSON_ASSERT
since this would be an OOM error.
if (!selected_server) { | ||
_mongoc_server_selection_error ( | ||
"No suitable server found in load balanced deployment", NULL, error); | ||
bson_mutex_unlock (&topology->mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It looks like what you've written here correctly releases the mutex, but you might also consider the goto-return idiom, which makes this somewhat easier to clean up in the face of new code being added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, you might want to release the mutex as soon as possible, rather than holding it for the duration of the function. Ok. (Although in that case, would it be better as its own function? Tough call here IMO-- it's fine as-is.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in this case I did not want to hold the mutex for the remainder of the function. I will leave as-is.
int topology_changed_events; | ||
int topology_opening_events; | ||
int topology_closed_events; | ||
} stats_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though this is in test code, "stats_t" seems generic enough that it has a high chance of conflicting with another symbol that the test might link to.
{ | ||
mongoc_apm_callbacks_t *cbs; | ||
|
||
cbs = mongoc_apm_callbacks_new (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: ok to ignore result, bson_malloc0() promises to call abort() on failure.
MONGOC_ERROR_STREAM_NOT_ESTABLISHED, | ||
"Could not establish stream"); | ||
|
||
/* Failing to "scan" would normally cause the node to be in cooldown and fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM mod @chardan's comments. Thanks again for all the offline explanation.
#include "mongoc/mongoc-client-session-private.h" | ||
#include "test-conveniences.h" | ||
#include "test-libmongoc.h" | ||
#include "TestSuite.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
Uh oh!
There was an error while loading. Please reload this page.