Fix garbage collection deadlock #1578
Conversation
Corresponding celery PR: celery/celery#6969
Fixes #1583
@emorozov I want to noodle on this a bit more. I'm with you in being worried that it's a bit fragile, even though it might be the only way to go. I added the help wanted label, as I'd love input from others as well. Thank you so much for this!
I think this makes sense. I've validated these tests as much as I can, both with and without your changes, and I've tried on machines that should produce some deadlocks (really, really slow arm boxen). Mind merging the latest master into your branch? Specifically, you need to get your dependencies into the dev_dependencies and out of tox, and then I'll merge it in - probably on Sunday. Thanks a tonne!
Hi @emorozov, we're close! The changes you made to tox.ini really belong in dev_requirements.txt. Your tox.ini shouldn't need modifications.
Codecov Report
@@ Coverage Diff @@
## master #1578 +/- ##
==========================================
+ Coverage 89.73% 89.75% +0.01%
==========================================
Files 57 57
Lines 11081 11093 +12
==========================================
+ Hits 9944 9956 +12
Misses 1137 1137
Continue to review full report at Codecov.
Hello @chayim,
Thank you @emorozov, I really appreciate it. Let's get this into the next release candidate. Merging! |
Pull Request check-list
Please make sure to review and check all of these items:
Does $ tox pass with this change (including linting)?

For a long time we experienced weird lockups of the Celery package when using the Redis backend (even when using the redis backend only for results).
After weeks of debugging, I've found the culprit: PubSub objects create a circular reference to self via connection.register_connect_callback. This means that PubSub objects are not garbage collected immediately when the last reference to them is dropped; they're collected asynchronously by the generational GC.
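A minimal, self-contained sketch of that cycle (stub classes for illustration, not redis-py's actual implementation): the connection stores a bound method of the PubSub, so dropping the last outside reference is not enough, and only the generational collector reclaims the object.

import gc
import weakref


class Connection:
    def __init__(self):
        self._connect_callbacks = []

    def register_connect_callback(self, callback):
        # The connection keeps a strong reference to the callback.
        self._connect_callbacks.append(callback)


class PubSub:
    def __init__(self, connection):
        self.connection = connection
        # A bound method holds a strong reference back to self:
        # PubSub -> Connection -> on_connect -> PubSub.
        connection.register_connect_callback(self.on_connect)

    def on_connect(self, connection=None):
        pass


p = PubSub(Connection())
r = weakref.ref(p)
del p
assert r() is not None  # the cycle keeps the object alive past the del
gc.collect()
assert r() is None      # only the cyclic (generational) collector frees it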
Also, PubSub defines a destructor that releases its connection back to the pool. The pool uses a threading.Lock when managing connections. If some pool method grabs the lock and at the same time a PubSub object is garbage collected (which is pretty likely, as our celery instances lock up at least once a week), the PubSub destructor deadlocks forever, waiting for a lock that will never be released.
I've decided that the only solution to this problem is to avoid creating circular references to PubSub objects.

The test looks weird and I'm worried that it is a bit fragile, but I failed to devise a scheme that would make the Python garbage collector run at the exact moment when the pool is inside its critical section. So the test runs enough cycles to trigger the problem. But different interpreters or different gc settings may let the test finish without locking up, and I don't know how to solve that.
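Roughly the shape such a stress test can take (a sketch only, not the PR's actual test; it assumes a local Redis server, the pytest-timeout plugin, and an arbitrary iteration count):

import gc

import pytest
import redis


@pytest.mark.timeout(30)
def test_pubsub_gc_does_not_deadlock():
    client = redis.Redis()
    for _ in range(5000):
        p = client.pubsub()
        p.subscribe("gc-deadlock-test")
        # Drop the only strong reference. With the connect-callback cycle in
        # place, collection is deferred to the generational GC, whose
        # unpredictable timing can fire __del__ while the pool lock is held.
        del p
    gc.collect()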
It does work on CPython 3.9, though: without the weakref changes, the test hangs forever until pytest-timeout cancels it.
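For reference, a simplified sketch of the weakref idea (assumed names and a stub connection, not the exact patch): the connect callback holds only a weak reference to the PubSub, so the object is reclaimed promptly by refcounting at a predictable point instead of at an arbitrary GC pause.

import weakref


class Connection:
    def __init__(self):
        self._connect_callbacks = []

    def register_connect_callback(self, callback):
        self._connect_callbacks.append(callback)


class PubSub:
    def __init__(self, connection):
        self.connection = connection
        self_ref = weakref.ref(self)  # no strong reference back to self

        def _on_connect(conn=None):
            pubsub = self_ref()
            if pubsub is not None:  # the PubSub may already be gone
                pubsub.on_connect(conn)

        connection.register_connect_callback(_on_connect)

    def on_connect(self, connection=None):
        pass


p = PubSub(Connection())
r = weakref.ref(p)
del p
assert r() is None  # no cycle: freed immediately by refcounting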