Skip to content

Fix garbage collection deadlock #1578

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 8, 2021
Merged

Fix garbage collection deadlock #1578

merged 1 commit into from
Nov 8, 2021

Conversation

emorozov
Copy link
Contributor

@emorozov emorozov commented Sep 27, 2021

Pull Request check-list

Please make sure to review and check all of these items:

  • Does $ tox pass with this change (including linting)?
  • Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
  • Is the new or changed code fully tested?
  • Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?

For a long time we experienced weird lockups of the Celery package when using Redis backend (even when using redis backend only for results).

After weeks of debugging, I've found a culprit: PubSub objects create a circular reference to self via connection.register_connect_callback. This means that PubSub objects are not garbage collected immediately, when their refcount decreases to zero, they're collected asynchronously using generational GC.

Also, PubSub defines a destructor that releases a connection from pool. Pool uses threading.Lock when managing connections. If some pool method grabs the lock and at the same time PubSub object is garbage collected (which is pretty likely as our celery instances lock up at least once a week), PubSub destructor deadlocks forever waiting for a lock that will be never released.

I've decided that the only solution to this problem is to avoid creating circular references to PubSub objects.

Test look weird and I'm worried that it is a bit fragile, but I failed to devise a scheme that would cause Python garbage collector to run at exact moment in time when the pool is inside critical section. So the test run enough cycles to cause a problem. But different interpreters/different gc settings may cause a test to finish without locking up. I don't know how to solve that.

It runs on CPython 3.9 though: without weakref changes, the test will hang forever until pytest-timeout will cancel it.

@emorozov
Copy link
Contributor Author

Corresponding celery PR: celery/celery#6969
Initially I thought that the problem is caused by Celery code.

@emorozov
Copy link
Contributor Author

emorozov commented Oct 7, 2021

Fixes #1583

@chayim
Copy link
Contributor

chayim commented Oct 25, 2021

@emorozov I want to noodle on this a bit more. I'm with you about being worried that it's a bit fragile, even though it might only be the way to go. I added the help wanted label, as I'd love input from others as well. Thank you so much for this!

@chayim
Copy link
Contributor

chayim commented Nov 4, 2021

I think this makes sense. I've validated these tests as much as I can both with and without your changes, and I've tried on machines that should produce some deadlocks (really, really slow arm boxen). Mind merging in the latest master into your branch. Specifically you need to get your dependencies into the dev_dependencies and out of tox, and then I'll merge it in - probably on Sunday.

Thanks a tonne!

@chayim
Copy link
Contributor

chayim commented Nov 7, 2021

Hi @emorozov we're close! The changes you made to tox.ini, really belong in the dev_requirements.txt. Your tox.ini shouldn't need modifications,

@codecov-commenter
Copy link

codecov-commenter commented Nov 7, 2021

Codecov Report

Merging #1578 (369214f) into master (4257ceb) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1578      +/-   ##
==========================================
+ Coverage   89.73%   89.75%   +0.01%     
==========================================
  Files          57       57              
  Lines       11081    11093      +12     
==========================================
+ Hits         9944     9956      +12     
  Misses       1137     1137              
Impacted Files Coverage Δ
redis/connection.py 71.85% <100.00%> (+0.10%) ⬆️
tests/test_pubsub.py 99.73% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4257ceb...369214f. Read the comment docs.

@emorozov
Copy link
Contributor Author

emorozov commented Nov 7, 2021

Hello @chayim ,
I've moved the dependency from tox.ini to dev_requirements.txt

@chayim chayim added the bug Bug label Nov 8, 2021
@chayim
Copy link
Contributor

chayim commented Nov 8, 2021

Thank you @emorozov, I really appreciate it. Let's get this into the next release candidate. Merging!

@chayim chayim changed the title Fixes garbage collection deadlock. Fix garbage collection deadlock Nov 8, 2021
@chayim chayim merged commit bba7518 into redis:master Nov 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants