PYTHON-2462 Avoid connection storms: implement pool PAUSED state #531


Merged: 15 commits, Jan 6, 2021

Conversation

@ShaneHarvey ShaneHarvey commented Dec 9, 2020

PYTHON-2462 Avoid connection storms: implement pool PAUSED state

Significant behavior changes:

  • Mark server unknown and clear the pool when background connections fail.
  • Eagerly evict threads from the wait queue when the pool is paused. Evicted threads raise the following error: AutoReconnect('localhost:27017: connection pool paused')
  • Introduces PoolReadyEvent and ConnectionPoolListener.pool_ready(), published when a PAUSED pool is marked READY.

Significant internal code changes:

  • Remove pymongo.thread_util which is no longer used.

  • Replaces the socket_semaphore (which enforced maxPoolSize and waitQueueMultiple) with a condition variable named "size_cond". A condition variable is now required because drivers must eagerly evict threads from the wait queue when the pool is paused, which is not possible with a semaphore. To do this we add a few counters to track the number of waiters and requests.

    • Introduces "pool.requests" which counts the number of concurrent requests being serviced by the pool. "requests" is used to enforce maxPoolSize.
    • Introduces "pool.waiters" which counts the number of concurrent requests in the wait queue.
    • "size_cond" enforces maxPoolSize by acting as a barrier that only allows maxPoolSize threads past at a time.
    • When the pool is cleared, threads in the wait queue (waiting on size_cond or max_connecting_cond) are notified and each raises AutoReconnect('localhost:27017: connection pool paused').
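The barrier behavior described above can be sketched roughly as follows. This is an illustrative, simplified model: PoolSketch, PoolPausedError, and the method names are hypothetical, not PyMongo's actual classes.

```python
import threading

PAUSED, READY, CLOSED = range(3)

class PoolPausedError(Exception):
    """Stands in for AutoReconnect('... connection pool paused')."""

class PoolSketch:
    def __init__(self, max_pool_size):
        self.max_pool_size = max_pool_size
        self.state = READY
        self.requests = 0   # concurrent requests being serviced by the pool
        self.waiters = 0    # concurrent requests blocked in the wait queue
        self.size_cond = threading.Condition()

    def _raise_if_not_ready(self):
        if self.state != READY:
            raise PoolPausedError('connection pool paused')

    def check_out(self, timeout=None):
        with self.size_cond:
            self._raise_if_not_ready()
            # size_cond acts as a barrier: only max_pool_size threads pass.
            while self.requests >= self.max_pool_size:
                self.waiters += 1
                try:
                    if not self.size_cond.wait(timeout):
                        raise TimeoutError('wait queue timeout')
                finally:
                    self.waiters -= 1
                # clear() may have paused the pool while we waited.
                self._raise_if_not_ready()
            self.requests += 1

    def check_in(self):
        with self.size_cond:
            self.requests -= 1
            self.size_cond.notify()

    def clear(self):
        """Pause the pool and eagerly evict every thread in the wait queue."""
        with self.size_cond:
            self.state = PAUSED
            self.size_cond.notify_all()
```

A semaphore cannot express clear(): there is no way to wake all blocked acquirers and have them fail, which is why the counters plus a condition variable are needed.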

Significant test changes:

  • CMAP unit tests should not use real monitors. The real monitor class calls pool.ready() which messes up the test assertions.
  • Refactored MockPool classes to clean up some duplication.

Mark server unknown and clear the pool when background connections fail
Eagerly evict threads from the wait queue when pool is paused. Evicted
threads will raise the following error:
AutoReconnect('localhost:27017: connection pool paused')

CMAP spec test changes:
- CMAP unit tests should not use real monitors
- Assert that CMAP threads complete all scheduled operations
test_client_disconnect (test.test_threads.TestThreads) ... Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/shane/git/mongo-python-driver/test/test_threads.py", line 61, in run
    for document in self.collection.find():
  File "/Users/shane/git/mongo-python-driver/pymongo/cursor.py", line 1207, in next
    if len(self.__data) or self._refresh():
  File "/Users/shane/git/mongo-python-driver/pymongo/cursor.py", line 1144, in _refresh
    self.__send_message(g)
  File "/Users/shane/git/mongo-python-driver/pymongo/cursor.py", line 1001, in __send_message
    address=self.__address)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1372, in _run_operation_with_response
    exhaust=exhaust)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1465, in _retryable_read
    exhaust=exhaust) as (sock_info,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1309, in _slaveok_for_server
    with self._get_socket(server, session, exhaust=exhaust) as sock_info:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1247, in _get_socket
    self.__all_credentials, checkout=exhaust) as sock_info:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 1320, in get_socket
    sock_info = self._get_socket(all_credentials)
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 1370, in _get_socket
    self._raise_if_not_ready()
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 1342, in _raise_if_not_ready
    self.address, AutoReconnect('connection pool paused'))
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 288, in _raise_connection_failure
    raise AutoReconnect(msg)
pymongo.errors.AutoReconnect: localhost:27017: connection pool paused

FAIL

======================================================================
FAIL: test_client_disconnect (test.test_threads.TestThreads)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shane/git/mongo-python-driver/test/test_threads.py", line 203, in test_client_disconnect
    self.assertTrue(t.passed)
AssertionError: False is not true
@ShaneHarvey
Copy link
Member Author

@prashantmital, some tests are failing on latest but this is still ready for your review.

@prashantmital (Contributor) left a comment:

Can you clarify what all the different counter variables are tracking? Specifically: active_sockets, operation_count, requests, waiters.

pymongo/pool.py Outdated
if self.enabled_for_cmap:
self.opts.event_listeners.publish_connection_check_out_failed(
self.address, ConnectionCheckOutFailedReason.CONN_ERROR)
# TODO: ensure this error is retryable
Contributor:

Do you intend to address this TODO in this PR? Does it need a test?

Member Author:

@patrickfreed have you thought about this question yet? How will we test that this error is retryable in all drivers?


I left a comment discussing a few of the options on the SPEC PR: mongodb/specifications#878 (comment)

As of right now, I'd like to wait until retryable reads gets the unified format tests before personally writing any of them for the spec so we can get this project closed out, though I could add a prose test if you think we should.

Something like:

  1. Client with maxPoolSize=1
  2. failCommand insert with blockTimeMS=1000, errorCode: <something retryable that clears the pool>
  3. insert in thread1
  4. insert in thread2
  5. ensure both inserts succeed
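The prose test outlined above might be sketched like this. Everything here is a hypothetical illustration, not the spec's eventual test: the failpoint document follows the server's failCommand interface, and run_test() (which requires a live replica set and a client created with maxPoolSize=1) is defined but not executed.

```python
import threading

# Step 2: fail the first two "insert" commands with a retryable,
# pool-clearing error (91 = ShutdownInProgress), blocking each for
# 1000ms so both threads are in flight at once.
fail_point = {
    'configureFailPoint': 'failCommand',
    'mode': {'times': 2},
    'data': {
        'failCommands': ['insert'],
        'blockConnection': True,
        'blockTimeMS': 1000,
        'errorCode': 91,
    },
}

def run_test(client):
    # Step 1: `client` is assumed to have been created with maxPoolSize=1.
    coll = client.test.test
    client.admin.command(fail_point)
    try:
        results = []
        def insert():
            # Steps 3-4: one insert per thread.
            results.append(coll.insert_one({}).acknowledged)
        threads = [threading.Thread(target=insert) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Step 5: both inserts succeed after the driver retries.
        assert results == [True, True]
    finally:
        client.admin.command(
            {'configureFailPoint': 'failCommand', 'mode': 'off'})
```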

Member Author:

Can you open a new DRIVERS ticket to add this test?


Opened https://jira.mongodb.org/browse/DRIVERS-1483, though it made me realize that errors during connection establishment may or may not be retryable. The spec is mostly silent on it. I think we definitely want this error to be retryable though, since no attempt at the command is actually made. I think given this confusion, we'll need to sort out this test before we close out DRIVERS-781.


So it seems to hinge on https://jira.mongodb.org/browse/DRIVERS-746, which may be a larger effort than we'd want to drag into DRIVERS-781. We could consider not testing this behavior since we're technically not making anything worse than it already is (those threads in the WaitQueue will likely fail to make their connections anyways). Ideally we would complete DRIVERS-746 and then could have this retry though.

Member Author:

I added a test for this in python. Pymongo already retries after (retryable) connection handshake errors so the test passes as expected.

while not (self.requests < self.max_pool_size):
if not _cond_wait(self.size_cond, deadline):
# Timed out, notify the next thread to ensure a
# timeout doesn't consume the condition.
Contributor:

I am not sure I follow the logic here. Under what circumstance would we enter this next if block? The check seems to directly contradict the entry condition of the while loop above. Also, if the number of requests did drop below max_pool_size while we were waiting to acquire the lock, shouldn't the condition variable have been notified by whatever caused the reduction in requests (e.g. checking a connection back in)? Why are we manually notifying in this case?
I know this is similar to how we notify the _max_connecting_cond condition, but it would be great if you could explain what is going on here.

Member Author:

The comment attempts to explain it. This extra logic fixes the following race:

  1. Thread 1 and 2 are waiting on this condition variable.
  2. Thread 3 completes and calls self.size_cond.notify() which wakes Thread 1.
  3. Thread 1 wakes up, realizes that its timeout has expired, and raises WaitQueueTimeout.
  4. Thread 1 has "consumed" the notification but did not proceed to run an operation.
  5. Thread 2 is stuck waiting for the next notification or timeout.

This change fixes the bug by notifying the next thread in the wait queue before raising a timeout.
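A minimal illustration of that fix, using a hypothetical WaitQueue class (not PyMongo's actual code): a thread that times out while a usable notification is pending passes the notification on before raising, so another waiter is not left stuck.

```python
import threading

class WaitQueue:
    def __init__(self):
        self.cond = threading.Condition()
        self.permits = 0

    def release(self):
        with self.cond:
            self.permits += 1
            self.cond.notify()  # wakes at most one waiter

    def acquire(self, timeout):
        with self.cond:
            while self.permits == 0:
                if not self.cond.wait(timeout):
                    # Timed out. A concurrent release() may still have
                    # notified this thread; re-notify so the wakeup is
                    # not "consumed" by a thread that only raises.
                    if self.permits > 0:
                        self.cond.notify()
                    raise TimeoutError('wait queue timeout')
            self.permits -= 1
```

Without the re-notify, a waiter that wakes only to raise WaitQueueTimeout swallows the signal, and the next waiter must sit out its full timeout even though capacity is available.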

pymongo/pool.py Outdated
@@ -1058,6 +1066,8 @@ class _PoolClosedError(PyMongoError):
pass


PAUSED, READY, CLOSED = range(3)
Contributor:

Can we use a class enum for this like we do elsewhere (e.g. UuidRepresentation)?

Member Author:

Done.

pymongo/pool.py Outdated
@@ -1289,6 +1335,15 @@ def get_socket(self, all_credentials, checkout=False):
if not checkout:
self.return_socket(sock_info)

def _raise_if_not_ready(self):
if self.opts.pause_enabled and self.state == PAUSED:
Contributor:

Any reason we don't simply check self.state != READY? Seems like we are leaving out the possibility of the state being CLOSED?

Member Author:

Fixed.
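The outcome of the two review threads above (a class-based enum plus a check against READY rather than against PAUSED) might look roughly like this; names and the error type are illustrative, the merged code may differ.

```python
class PoolState:
    PAUSED = 1
    READY = 2
    CLOSED = 3

def raise_if_not_ready(state, pause_enabled=True):
    # Checking `state != PoolState.READY` (instead of `state == PAUSED`)
    # also rejects checkouts against a CLOSED pool.
    if pause_enabled and state != PoolState.READY:
        raise RuntimeError('connection pool paused')
```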

self.closed = True
self.state = CLOSED
# Clear the wait queue
self._max_connecting_cond.notify_all()
Contributor:

So when this happens and there are threads waiting to create a connection, they will get notified (via either size_cond or max_connecting_cond) and end up raising an exception since the pool is not ready?

Member Author:

Yes, exactly that.


ShaneHarvey commented Dec 23, 2020

@prashantmital I believe I have addressed all the open comments. This is ready for another look.

…ks with errorCode, not closeConnection SERVER-53512
@prashantmital (Contributor) left a comment:

Changes LGTM. Any idea why that test fails?


ShaneHarvey commented Jan 6, 2021

Edit: I've opened SERVER-53624.

Any idea why that test fails?

Yes, I suspect it's actually a server bug. I'll open a server ticket and skip the test on mongos 4.4+ for now. For future reference here's the test failure:

 [2020/12/24 00:32:49.790]   test_pool_paused_error_is_retryable (test_cmap.TestCMAP) ... Exception in thread Thread-241:
 [2020/12/24 00:32:49.790] Traceback (most recent call last):
 [2020/12/24 00:32:49.790]   File "/opt/python/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
 [2020/12/24 00:32:49.790]     self.run()
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/test/test_cmap.py", line 480, in run
 [2020/12/24 00:32:49.790]     self.collection.insert_one({})
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/collection.py", line 698, in insert_one
 [2020/12/24 00:32:49.790]     self._insert(document,
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/collection.py", line 613, in _insert
 [2020/12/24 00:32:49.790]     return self._insert_one(
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/collection.py", line 602, in _insert_one
 [2020/12/24 00:32:49.790]     self.__database.client._retryable_write(
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/mongo_client.py", line 1498, in _retryable_write
 [2020/12/24 00:32:49.790]     return self._retry_with_session(retryable, func, s, None)
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/mongo_client.py", line 1384, in _retry_with_session
 [2020/12/24 00:32:49.790]     return self._retry_internal(retryable, func, session, bulk)
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/mongo_client.py", line 1416, in _retry_internal
 [2020/12/24 00:32:49.790]     return func(session, sock_info, retryable)
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/collection.py", line 590, in _insert_command
 [2020/12/24 00:32:49.790]     result = sock_info.command(
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/pool.py", line 732, in command
 [2020/12/24 00:32:49.790]     self._raise_connection_failure(error)
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/pool.py", line 716, in command
 [2020/12/24 00:32:49.790]     return command(self, dbname, spec, slave_ok,
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/network.py", line 159, in command
 [2020/12/24 00:32:49.790]     helpers._check_command_response(
 [2020/12/24 00:32:49.790]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/pymongo/helpers.py", line 147, in _check_command_response
 [2020/12/24 00:32:49.790]     raise NotMasterError(errmsg, response)
 [2020/12/24 00:32:49.790] pymongo.errors.NotMasterError: Failing command due to 'failCommand' failpoint, full error: {'ok': 0.0, 'errmsg': "Failing command due to 'failCommand' failpoint", 'code': 91, 'codeName': 'ShutdownInProgress', 'operationTime': Timestamp(1608769963, 1), '$clusterTime': {'clusterTime': Timestamp(1608769963, 1), 'signature': {'hash': b'^:\x86N\xb4\x95\xff\xc4\x8e4\xe1L\xff\xa9o8\xce\xed\xd5\xe8', 'keyId': 6909613231115862033}}}
 
...

 [2020/12/24 00:37:16.698] FAIL [1.524s]: test_pool_paused_error_is_retryable (test_cmap.TestCMAP)
 [2020/12/24 00:37:16.698] ----------------------------------------------------------------------
 [2020/12/24 00:37:16.698] Traceback (most recent call last):
 [2020/12/24 00:37:16.698]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/test/__init__.py", line 444, in wrap
 [2020/12/24 00:37:16.698]     return f(*args, **kwargs)
 [2020/12/24 00:37:16.698]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/test/__init__.py", line 444, in wrap
 [2020/12/24 00:37:16.698]     return f(*args, **kwargs)
 [2020/12/24 00:37:16.698]   File "/data/mci/49571bc03478c1c5b20361e0a8ce7426/src/test/test_cmap.py", line 457, in test_pool_paused_error_is_retryable
 [2020/12/24 00:37:16.698]     self.assertTrue(thread.passed)
 [2020/12/24 00:37:16.698] AssertionError: False is not true

@ShaneHarvey (Member Author):

Updated to skip the test_pool_paused_error_is_retryable test on 4.4+ until SERVER-53624 is fixed.

@ShaneHarvey ShaneHarvey merged commit 86b40c1 into mongodb:master Jan 6, 2021
3 participants