PYTHON-2462 Avoid connection storms: implement pool PAUSED state #531


Merged: 15 commits, Jan 6, 2021

Conversation

@ShaneHarvey ShaneHarvey commented Dec 9, 2020

PYTHON-2462 Avoid connection storms: implement pool PAUSED state

Significant behavior changes:

  • Mark server unknown and clear the pool when background connections fail.
  • Eagerly evict threads from the wait queue when the pool is paused. Evicted threads raise the following error: AutoReconnect('localhost:27017: connection pool paused')
  • Introduces PoolReadyEvent and ConnectionPoolListener.pool_ready(), published when a PAUSED pool is marked READY.

Significant internal code changes:

  • Remove pymongo.thread_util which is no longer used.

  • Replaces the socket_semaphore (which enforced maxPoolSize and waitQueueMultiple) with a condition variable named "size_cond". A condition variable is now required because drivers must eagerly evict threads from the wait queue when the pool is paused, which is not possible with a semaphore. To do this we add a few counters to track the number of waiters and requests.

    • Introduces "pool.requests" which counts the number of concurrent requests being serviced by the pool. "requests" is used to enforce maxPoolSize.
    • Introduces "pool.waiters" which counts the number of concurrent requests in the wait queue.
    • "size_cond" enforces maxPoolSize by acting as a barrier that only allows maxPoolSize threads past at a time.
    • When the pool is cleared, threads in the wait queue (waiting on size_cond or max_connecting_cond) are notified and each raises AutoReconnect('localhost:27017: connection pool paused').
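The barrier behavior described above can be sketched roughly as follows. This is an illustrative, simplified model: PoolSketch, PoolPausedError, and the method names are hypothetical, not PyMongo's actual classes.

```python
import threading

PAUSED, READY, CLOSED = range(3)

class PoolPausedError(Exception):
    """Stands in for AutoReconnect('... connection pool paused')."""

class PoolSketch:
    def __init__(self, max_pool_size):
        self.max_pool_size = max_pool_size
        self.state = READY
        self.requests = 0   # concurrent requests being serviced by the pool
        self.waiters = 0    # concurrent requests blocked in the wait queue
        self.size_cond = threading.Condition()

    def _raise_if_not_ready(self):
        if self.state != READY:
            raise PoolPausedError('connection pool paused')

    def check_out(self, timeout=None):
        with self.size_cond:
            self._raise_if_not_ready()
            # size_cond acts as a barrier: only max_pool_size threads pass.
            while self.requests >= self.max_pool_size:
                self.waiters += 1
                try:
                    if not self.size_cond.wait(timeout):
                        raise TimeoutError('wait queue timeout')
                finally:
                    self.waiters -= 1
                # clear() may have paused the pool while we waited.
                self._raise_if_not_ready()
            self.requests += 1

    def check_in(self):
        with self.size_cond:
            self.requests -= 1
            self.size_cond.notify()

    def clear(self):
        """Pause the pool and eagerly evict every thread in the wait queue."""
        with self.size_cond:
            self.state = PAUSED
            self.size_cond.notify_all()
```

A semaphore cannot express clear(): there is no way to wake all blocked acquirers and have them fail, which is why the counters plus a condition variable are needed.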

Significant test changes:

  • CMAP unit tests should not use real monitors. The real monitor class calls pool.ready() which messes up the test assertions.
  • Refactored MockPool classes to clean up some duplication.

Mark server unknown and clear the pool when background connections fail
Eagerly evict threads from the wait queue when pool is paused. Evicted
threads will raise the following error:
AutoReconnect('localhost:27017: connection pool paused')

CMAP spec test changes:
- CMAP unit tests should not use real monitors
- Assert that CMAP threads complete all scheduled operations
test_client_disconnect (test.test_threads.TestThreads) ... Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/shane/git/mongo-python-driver/test/test_threads.py", line 61, in run
    for document in self.collection.find():
  File "/Users/shane/git/mongo-python-driver/pymongo/cursor.py", line 1207, in next
    if len(self.__data) or self._refresh():
  File "/Users/shane/git/mongo-python-driver/pymongo/cursor.py", line 1144, in _refresh
    self.__send_message(g)
  File "/Users/shane/git/mongo-python-driver/pymongo/cursor.py", line 1001, in __send_message
    address=self.__address)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1372, in _run_operation_with_response
    exhaust=exhaust)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1465, in _retryable_read
    exhaust=exhaust) as (sock_info,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1309, in _slaveok_for_server
    with self._get_socket(server, session, exhaust=exhaust) as sock_info:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1247, in _get_socket
    self.__all_credentials, checkout=exhaust) as sock_info:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 1320, in get_socket
    sock_info = self._get_socket(all_credentials)
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 1370, in _get_socket
    self._raise_if_not_ready()
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 1342, in _raise_if_not_ready
    self.address, AutoReconnect('connection pool paused'))
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 288, in _raise_connection_failure
    raise AutoReconnect(msg)
pymongo.errors.AutoReconnect: localhost:27017: connection pool paused

FAIL

======================================================================
FAIL: test_client_disconnect (test.test_threads.TestThreads)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shane/git/mongo-python-driver/test/test_threads.py", line 203, in test_client_disconnect
    self.assertTrue(t.passed)
AssertionError: False is not true
@ShaneHarvey
Copy link
Member Author

@prashantmital, some tests are failing on latest but this is still ready for your review.

@prashantmital (Contributor) left a comment:

Can you clarify what all the different counter variables are tracking? Specifically: active_sockets, operation_count, requests, waiters.

pymongo/pool.py Outdated
if self.enabled_for_cmap:
self.opts.event_listeners.publish_connection_check_out_failed(
self.address, ConnectionCheckOutFailedReason.CONN_ERROR)
# TODO: ensure this error is retryable
Contributor:

Do you intend to address this TODO in this PR? Does it need a test?

Member Author:

@patrickfreed have you thought about this question yet? How will we test that this error is retryable in all drivers?


I left a comment discussing a few of the options on the SPEC PR: mongodb/specifications#878 (comment)

As of right now, I'd like to wait until retryable reads gets the unified format tests before personally writing any of them for the spec so we can get this project closed out, though I could add a prose test if you think we should.

Something like:

  1. Client with maxPoolSize=1
  2. failCommand insert with blockTimeMS=1000, errorCode: <something retryable that clears the pool>
  3. insert in thread1
  4. insert in thread2
  5. ensure both inserts succeed
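The prose test outlined above might be sketched like this. Everything here is a hypothetical illustration, not the spec's eventual test: the failpoint document follows the server's failCommand interface, and run_test() (which requires a live replica set and a client created with maxPoolSize=1) is defined but not executed.

```python
import threading

# Step 2: fail the first two "insert" commands with a retryable,
# pool-clearing error (91 = ShutdownInProgress), blocking each for
# 1000ms so both threads are in flight at once.
fail_point = {
    'configureFailPoint': 'failCommand',
    'mode': {'times': 2},
    'data': {
        'failCommands': ['insert'],
        'blockConnection': True,
        'blockTimeMS': 1000,
        'errorCode': 91,
    },
}

def run_test(client):
    # Step 1: `client` is assumed to have been created with maxPoolSize=1.
    coll = client.test.test
    client.admin.command(fail_point)
    try:
        results = []
        def insert():
            # Steps 3-4: one insert per thread.
            results.append(coll.insert_one({}).acknowledged)
        threads = [threading.Thread(target=insert) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Step 5: both inserts succeed after the driver retries.
        assert results == [True, True]
    finally:
        client.admin.command(
            {'configureFailPoint': 'failCommand', 'mode': 'off'})
```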

Member Author:

Can you open a new DRIVERS ticket to add this test?


Opened https://jira.mongodb.org/browse/DRIVERS-1483, though it made me realize that errors during connection establishment may or may not be retryable. The spec is mostly silent on it. I think we definitely want this error to be retryable though, since no attempt at the command is actually made. I think given this confusion, we'll need to sort out this test before we close out DRIVERS-781.


So it seems to hinge on https://jira.mongodb.org/browse/DRIVERS-746, which may be a larger effort than we'd want to drag into DRIVERS-781. We could consider not testing this behavior since we're technically not making anything worse than it already is (those threads in the WaitQueue will likely fail to make their connections anyways). Ideally we would complete DRIVERS-746 and then could have this retry though.

Member Author:

I added a test for this in python. Pymongo already retries after (retryable) connection handshake errors so the test passes as expected.

while not (self.requests < self.max_pool_size):
if not _cond_wait(self.size_cond, deadline):
# Timed out, notify the next thread to ensure a
# timeout doesn't consume the condition.
Contributor:

I am not sure I follow the logic here. Under what circumstance would we enter this next if block? The check seems to directly contradict the entry condition of the while loop above. Also, if the number of requests did drop below max_pool_size while we were waiting to acquire the lock, shouldn't the condition variable have been notified by whatever caused the reduction in requests (e.g. checking a connection back in)? Why are we manually notifying in this case?
I know this is similar to how we notify the _max_connecting_cond condition, but it would be great if you could explain what is going on here.

Member Author:

The comment attempts to explain it. This extra logic fixes the following race:

  1. Thread 1 and 2 are waiting on this condition variable.
  2. Thread 3 completes and calls self.size_cond.notify() which wakes Thread 1.
  3. Thread 1 wakes up, realizes that its timeout has expired, and raises WaitQueueTimeout.
  4. Thread 1 has "consumed" the notification but did not proceed to run an operation.
  5. Thread 2 is stuck waiting for the next notification or timeout.

This change fixes the bug by notifying the next thread in the wait queue before raising a timeout.
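A minimal illustration of that fix, using a hypothetical WaitQueue class (not PyMongo's actual code): a thread that times out while a usable notification is pending passes the notification on before raising, so another waiter is not left stuck.

```python
import threading

class WaitQueue:
    def __init__(self):
        self.cond = threading.Condition()
        self.permits = 0

    def release(self):
        with self.cond:
            self.permits += 1
            self.cond.notify()  # wakes at most one waiter

    def acquire(self, timeout):
        with self.cond:
            while self.permits == 0:
                if not self.cond.wait(timeout):
                    # Timed out. A concurrent release() may still have
                    # notified this thread; re-notify so the wakeup is
                    # not "consumed" by a thread that only raises.
                    if self.permits > 0:
                        self.cond.notify()
                    raise TimeoutError('wait queue timeout')
            self.permits -= 1
```

Without the re-notify, a waiter that wakes only to raise WaitQueueTimeout swallows the signal, and the next waiter must sit out its full timeout even though capacity is available.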

pymongo/pool.py Outdated
@@ -1058,6 +1066,8 @@ class _PoolClosedError(PyMongoError):
pass


PAUSED, READY, CLOSED = range(3)
Contributor:

Can we use a class enum for this like we do elsewhere (e.g. UuidRepresentation)?

Member Author:

Done.

pymongo/pool.py Outdated
@@ -1289,6 +1335,15 @@ def get_socket(self, all_credentials, checkout=False):
if not checkout:
self.return_socket(sock_info)

def _raise_if_not_ready(self):
if self.opts.pause_enabled and self.state == PAUSED:
Contributor:

Any reason we don't simply check self.state != READY? Seems like we are leaving out the possibility of the state being CLOSED?

Member Author:

Fixed.
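The outcome of the two review threads above (a class-based enum plus a check against READY rather than against PAUSED) might look roughly like this; names and the error type are illustrative, the merged code may differ.

```python
class PoolState:
    PAUSED = 1
    READY = 2
    CLOSED = 3

def raise_if_not_ready(state, pause_enabled=True):
    # Checking `state != PoolState.READY` (instead of `state == PAUSED`)
    # also rejects checkouts against a CLOSED pool.
    if pause_enabled and state != PoolState.READY:
        raise RuntimeError('connection pool paused')
```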

self.closed = True
self.state = CLOSED
# Clear the wait queue
self._max_connecting_cond.notify_all()
Contributor:

So when this happens and there are threads waiting to create a connection, they will get notified (via either size_cond or max_connecting_cond) and end up raising an exception since the pool is not ready?

Member Author:

Yes, exactly that.


ShaneHarvey commented Dec 23, 2020

@prashantmital I believe I have addressed all the open comments. This is ready for another look.

…ks with errorCode, not closeConnection SERVER-53512
@prashantmital (Contributor) left a comment:

Changes LGTM. Any idea why that test fails?


ShaneHarvey commented Jan 6, 2021

Edit: I've opened SERVER-53624.

Any idea why that test fails?

Yes, I suspect it's actually a server bug. I'll open a server ticket and skip the test on mongos 4.4+ for now. For future reference here's the test failure:

 [2020/12/24 00:32:49.790]   test_pool_paused_error_is_retryable (test_cmap.TestCMAP) ... Exception in thread Thread-241:
 [2020/12/24 00:32:49.790] Traceback (most recent call last):
 [2020/12/24 00:32:49.790]   File "/opt/python/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
 [2020/12/24 00:32:49.790]     self.run()
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/test/test_cmap.py", line 480, in run
 [2020/12/24 00:32:49.790]     self.collection.insert_one({})
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/collection.py", line 698, in insert_one
 [2020/12/24 00:32:49.790]     self._insert(document,
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/collection.py", line 613, in _insert
 [2020/12/24 00:32:49.790]     return self._insert_one(
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/collection.py", line 602, in _insert_one
 [2020/12/24 00:32:49.790]     self.__database.client._retryable_write(
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/mongo_client.py", line 1498, in _retryable_write
 [2020/12/24 00:32:49.790]     return self._retry_with_session(retryable, func, s, None)
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/mongo_client.py", line 1384, in _retry_with_session
 [2020/12/24 00:32:49.790]     return self._retry_internal(retryable, func, session, bulk)
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/mongo_client.py", line 1416, in _retry_internal
 [2020/12/24 00:32:49.790]     return func(session, sock_info, retryable)
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/collection.py", line 590, in _insert_command
 [2020/12/24 00:32:49.790]     result = sock_info.command(
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/pool.py", line 732, in command
 [2020/12/24 00:32:49.790]     self._raise_connection_failure(error)
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/pool.py", line 716, in command
 [2020/12/24 00:32:49.790]     return command(self, dbname, spec, slave_ok,
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/network.py", line 159, in command
 [2020/12/24 00:32:49.790]     helpers._check_command_response(
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/helpers.py", line 147, in _check_command_response
 [2020/12/24 00:32:49.790]     raise NotMasterError(errmsg, response)
 [2020/12/24 00:32:49.790] pymongo.errors.NotMasterError: Failing command due to 'failCommand' failpoint, full error: {'ok': 0.0, 'errmsg': "Failing command due to 'failCommand' failpoint", 'code': 91, 'codeName': 'ShutdownInProgress', 'operationTime': Timestamp(1608769963, 1), '$clusterTime': {'clusterTime': Timestamp(1608769963, 1), 'signature': {'hash': b'^:\x86N\xb4\x95\xff\xc4\x8e4\xe1L\xff\xa9o8\xce\xed\xd5\xe8', 'keyId': 6909613231115862033}}}
 
...

 [2020/12/24 00:37:16.698] FAIL [1.524s]: test_pool_paused_error_is_retryable (test_cmap.TestCMAP)
 [2020/12/24 00:37:16.698] ----------------------------------------------------------------------
 [2020/12/24 00:37:16.698] Traceback (most recent call last):
 [2020/12/24 00:37:16.698]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/test/__init__.py", line 444, in wrap
 [2020/12/24 00:37:16.698]     return f(*args, **kwargs)
 [2020/12/24 00:37:16.698]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/test/__init__.py", line 444, in wrap
 [2020/12/24 00:37:16.698]     return f(*args, **kwargs)
 [2020/12/24 00:37:16.698]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/test/test_cmap.py", line 457, in test_pool_paused_error_is_retryable
 [2020/12/24 00:37:16.698]     self.assertTrue(thread.passed)
 [2020/12/24 00:37:16.698] AssertionError: False is not true

@ShaneHarvey (Member Author):

Updated to skip the test_pool_paused_error_is_retryable test on 4.4+ until SERVER-53624 is fixed.

@ShaneHarvey ShaneHarvey merged commit 86b40c1 into mongodb:master Jan 6, 2021
3 participants