PYTHON-2462 Avoid connection storms: implement pool PAUSED state #531
Conversation
Mark server unknown and clear the pool when background connections fail

Eagerly evict threads from the wait queue when pool is paused. Evicted threads will raise the following error:
AutoReconnect('localhost:27017: connection pool paused')

CMAP spec test changes:
- CMAP unit tests should not use real monitors
- Assert that CMAP threads complete all scheduled operations
test_client_disconnect (test.test_threads.TestThreads) ... Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/shane/git/mongo-python-driver/test/test_threads.py", line 61, in run
    for document in self.collection.find():
  File "/Users/shane/git/mongo-python-driver/pymongo/cursor.py", line 1207, in next
    if len(self.__data) or self._refresh():
  File "/Users/shane/git/mongo-python-driver/pymongo/cursor.py", line 1144, in _refresh
    self.__send_message(g)
  File "/Users/shane/git/mongo-python-driver/pymongo/cursor.py", line 1001, in __send_message
    address=self.__address)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1372, in _run_operation_with_response
    exhaust=exhaust)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1465, in _retryable_read
    exhaust=exhaust) as (sock_info,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1309, in _slaveok_for_server
    with self._get_socket(server, session, exhaust=exhaust) as sock_info:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1247, in _get_socket
    self.__all_credentials, checkout=exhaust) as sock_info:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 1320, in get_socket
    sock_info = self._get_socket(all_credentials)
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 1370, in _get_socket
    self._raise_if_not_ready()
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 1342, in _raise_if_not_ready
    self.address, AutoReconnect('connection pool paused'))
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 288, in _raise_connection_failure
    raise AutoReconnect(msg)
pymongo.errors.AutoReconnect: localhost:27017: connection pool paused
FAIL

======================================================================
FAIL: test_client_disconnect (test.test_threads.TestThreads)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shane/git/mongo-python-driver/test/test_threads.py", line 203, in test_client_disconnect
    self.assertTrue(t.passed)
AssertionError: False is not true
@prashantmital, some tests are failing on latest but this is still ready for your review.
Can you clarify what all the different counter variables are tracking? Specifically: `active_sockets`, `operation_count`, `requests`, `waiters`.
pymongo/pool.py
Outdated
if self.enabled_for_cmap:
    self.opts.event_listeners.publish_connection_check_out_failed(
        self.address, ConnectionCheckOutFailedReason.CONN_ERROR)
# TODO: ensure this error is retryable
Do you intend to address this TODO in this PR? Does it need a test?
@patrickfreed have you thought about this question yet? How will we test that this error is retryable in all drivers?
I left a comment discussing a few of the options on the SPEC PR: mongodb/specifications#878 (comment)
As of right now, I'd like to wait until retryable reads gets the unified format tests before personally writing any of them for the spec so we can get this project closed out, though I could add a prose test if you think we should.
Something like:
- Client with maxPoolSize=1
- failCommand insert with blockTimeMS=1000, errorCode: <something retryable that clears the pool>
- insert in thread1
- insert in thread2
- ensure both inserts succeed
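A rough Python sketch of such a prose test (the fail point parameters, error code, and database/collection names here are my assumptions; it also presumes a replica set with failCommand support and retryable writes enabled):

```python
# Hypothetical sketch of the proposed prose test, not spec wording.
import threading
from pymongo import MongoClient

client = MongoClient("localhost:27017", maxPoolSize=1, retryWrites=True)
coll = client.test.pool_paused_retry  # assumed namespace
coll.drop()

# Block "insert" for 1s, then fail it once with a retryable, pool-clearing
# error; 91 (ShutdownInProgress) is one candidate code.
client.admin.command({
    "configureFailPoint": "failCommand",
    "mode": {"times": 1},
    "data": {
        "failCommands": ["insert"],
        "blockConnection": True,
        "blockTimeMS": 1000,
        "errorCode": 91,
    },
})
try:
    threads = [
        threading.Thread(target=coll.insert_one, args=({"_id": i},))
        for i in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Both inserts should succeed: the failed one retries instead of the
    # other thread being evicted with a non-retryable "pool paused" error.
    assert coll.count_documents({}) == 2
finally:
    client.admin.command({"configureFailPoint": "failCommand", "mode": "off"})
```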
Can you open a new DRIVERS ticket to add this test?
Opened https://jira.mongodb.org/browse/DRIVERS-1483, though it made me realize that errors during connection establishment may or may not be retryable. The spec is mostly silent on it. I think we definitely want this error to be retryable though, since no attempt at the command is actually made. I think given this confusion, we'll need to sort out this test before we close out DRIVERS-781.
So it seems to hinge on https://jira.mongodb.org/browse/DRIVERS-746, which may be a larger effort than we'd want to drag into DRIVERS-781. We could consider not testing this behavior since we're technically not making anything worse than it already is (those threads in the WaitQueue will likely fail to make their connections anyways). Ideally we would complete DRIVERS-746 and then could have this retry though.
I added a test for this in python. Pymongo already retries after (retryable) connection handshake errors so the test passes as expected.
while not (self.requests < self.max_pool_size):
    if not _cond_wait(self.size_cond, deadline):
        # Timed out, notify the next thread to ensure a
        # timeout doesn't consume the condition.
I am not sure I follow the logic here. Under what circumstance would we enter this next `if` block? The check seems to directly contradict the entry condition of the `while` above. Also, if the number of requests did drop below `max_pool_size` while we were waiting to acquire the lock, shouldn't the condition variable have been notified by whatever caused the reduction in `requests` (e.g. checking a connection back in)? Why are we manually notifying in this case?
I know this is similar to how we notify the `_max_connecting_cond` condition, but it would be great if you could explain what is going on here.
The comment attempts to explain it. This extra logic fixes the following race:
- Thread 1 and 2 are waiting on this condition variable.
- Thread 3 completes and calls self.size_cond.notify() which wakes Thread 1.
- Thread 1 wakes up and realizes that its timeout has expired and raises WaitQueueTimeout.
- Thread 1 has "consumed" the notification but did not proceed to run an operation.
- Thread 2 is stuck waiting for the next notification or timeout.
This change fixes the bug by notifying the next thread in the wait queue before raising a timeout.
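A minimal standalone sketch of that "pass the notification along before timing out" pattern (assumed names, much simplified from the actual pool code):

```python
import threading
import time

class SizeLimiter:
    def __init__(self, max_pool_size):
        self.max_pool_size = max_pool_size
        self.requests = 0
        self.size_cond = threading.Condition()

    def acquire(self, timeout):
        deadline = time.monotonic() + timeout
        with self.size_cond:
            while not (self.requests < self.max_pool_size):
                remaining = deadline - time.monotonic()
                if remaining <= 0 or not self.size_cond.wait(remaining):
                    # Timed out: hand the (possibly already consumed)
                    # notification to the next waiter before giving up,
                    # so a wakeup from release() is never lost.
                    self.size_cond.notify()
                    raise TimeoutError("wait queue timeout")
            self.requests += 1

    def release(self):
        with self.size_cond:
            self.requests -= 1
            self.size_cond.notify()
```

Without the extra `notify()`, Thread 2 in the race above could keep sleeping until its own timeout even though capacity became available.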
pymongo/pool.py
Outdated
@@ -1058,6 +1066,8 @@ class _PoolClosedError(PyMongoError):
    pass


PAUSED, READY, CLOSED = range(3)
Can we use a `class` enum for this like we do elsewhere (e.g. `UuidRepresentation`)?
Done.
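For reference, a minimal sketch of the class-style constants (names assumed; the merged code may differ), following the `UuidRepresentation` pattern:

```python
class PoolState(object):
    PAUSED = 1
    READY = 2
    CLOSED = 3
```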
pymongo/pool.py
Outdated
@@ -1289,6 +1335,15 @@ def get_socket(self, all_credentials, checkout=False):
            if not checkout:
                self.return_socket(sock_info)

    def _raise_if_not_ready(self):
        if self.opts.pause_enabled and self.state == PAUSED:
Any reason we don't simply check `self.state != READY`? Seems like we are leaving out the possibility of the state being CLOSED?
Fixed.
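A sketch of what the revised check might look like (the helper call mirrors the traceback above; treating a CLOSED pool the same way is an assumption):

```python
from pymongo.errors import AutoReconnect

def _raise_if_not_ready(self):
    # Refuse the checkout for any non-READY state (PAUSED or CLOSED),
    # not only PAUSED.
    if self.state != PoolState.READY:
        _raise_connection_failure(  # existing helper in pymongo.pool
            self.address, AutoReconnect('connection pool paused'))
```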
self.closed = True
self.state = CLOSED
# Clear the wait queue
self._max_connecting_cond.notify_all()
So when this happens and there are threads waiting to create a connection, they will get notified (via either size_cond or max_connecting_cond) and end up raising an exception since the pool is not ready?
Yes, exactly that.
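A rough, self-contained sketch of that wake-and-raise behavior (assumed names; the real pool uses two conditions and also tracks capacity, which is omitted here):

```python
import threading

class MiniPool:
    READY, PAUSED, CLOSED = range(3)

    def __init__(self):
        self.state = self.READY
        self.size_cond = threading.Condition()

    def close(self):
        with self.size_cond:
            self.state = self.CLOSED
            # Clear the wait queue: wake everyone blocked in checkout().
            self.size_cond.notify_all()

    def checkout(self, timeout=5.0):
        with self.size_cond:
            while self.state == self.READY:  # real code also waits for a free slot
                if not self.size_cond.wait(timeout):
                    raise TimeoutError("wait queue timeout")
            raise RuntimeError("connection pool is not ready")

pool = MiniPool()

def waiter():
    try:
        pool.checkout()
    except RuntimeError as exc:
        print("evicted:", exc)

t = threading.Thread(target=waiter)
t.start()
pool.close()  # the waiting thread wakes up and raises instead of hanging
t.join()
```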
@prashantmital I believe I have addressed all the open comments. This is ready for another look.
…ks with errorCode, not closeConnection SERVER-53512
Changes LGTM. Any idea why that test fails?
Edit: I've opened SERVER-53624.
Yes, I suspect it's actually a server bug. I'll open a server ticket and skip the test on mongos >= 4.4 for now. For future reference, here's the test failure:
Updated to skip the test.
PYTHON-2462 Avoid connection storms: implement pool PAUSED state
Significant behavior changes:

- Threads evicted from the wait queue while the pool is paused raise `AutoReconnect('localhost:27017: connection pool paused')`.
- Add `PoolReadyEvent` and `ConnectionPoolListener.pool_ready()`, which is published when a PAUSED pool is marked READY.

Significant internal code changes:

- Remove `pymongo.thread_util`, which is no longer used.
- Replace the `socket_semaphore`, which enforces `maxPoolSize` and `waitQueueMultiple`, with a condition variable (named `size_cond`). A condition variable is now required because drivers are required to eagerly evict threads from the wait queue when the pool is paused, which is not possible with a semaphore. To do this we add a few counters to track the number of waiters and the number of requests.
- Evicted threads raise `AutoReconnect('localhost:27017: connection pool paused')`.

Significant test changes: