Fix garbage collection deadlock #1578

emorozov · 2021-09-27T13:59:36Z

Pull Request check-list

Please make sure to review and check all of these items:

Does $ tox pass with this change (including linting)?
Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
Is the new or changed code fully tested?
Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?

For a long time we experienced weird lockups in Celery when using the Redis backend (even when Redis was used only as the result backend).

After weeks of debugging, I've found the culprit: PubSub objects create a circular reference to themselves via connection.register_connect_callback. This means that PubSub objects are not freed immediately when the last external reference to them goes away; instead, they are collected asynchronously by the generational (cyclic) GC.
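To illustrate the effect described above, here is a minimal sketch in plain Python. `FakeConnection` and `FakePubSub` are hypothetical stand-ins, not the redis-py classes; only the shape of the reference cycle (owner, connection, callback list, bound method, owner) is taken from the description. The behaviour shown assumes CPython's reference counting.

```python
import gc
import weakref


class FakeConnection:
    """Hypothetical stand-in for a connection that stores connect callbacks."""

    def __init__(self):
        self.callbacks = []

    def register_connect_callback(self, callback):
        # A bound method keeps a strong reference to its instance.
        self.callbacks.append(callback)


class FakePubSub:
    """Hypothetical stand-in for an object that owns a connection."""

    def __init__(self):
        self.connection = FakeConnection()
        # Cycle: self -> connection -> callbacks -> bound method -> self
        self.connection.register_connect_callback(self.on_connect)

    def on_connect(self, connection):
        pass


def observe_collection():
    """Return (freed by refcounting alone, freed after a cyclic GC pass)."""
    gc.disable()  # keep the automatic collector out of the experiment
    try:
        obj = FakePubSub()
        probe = weakref.ref(obj)
        del obj  # drop the only external reference
        freed_by_refcount = probe() is None
        gc.collect()  # run the cyclic collector explicitly
        freed_by_gc = probe() is None
    finally:
        gc.enable()
    return freed_by_refcount, freed_by_gc
```

On CPython, `observe_collection()` should report that the object survives `del` and only disappears once the cyclic collector runs, which is the point at which a destructor would be invoked.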

Also, PubSub defines a destructor that releases its connection back to the pool. The pool uses threading.Lock when managing connections. If a pool method holds the lock and a PubSub object is garbage collected in that same thread at that moment (which is fairly likely, as our Celery instances lock up at least once a week), the PubSub destructor deadlocks forever, waiting for a lock that will never be released.
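The interaction between the destructor and the lock can be sketched as follows. `FakePool` and `CyclicHolder` are hypothetical names, and two things are assumptions made so that the example terminates: the collector is triggered by hand inside the critical section, and the destructor path uses a non-blocking `acquire` where real code would block.

```python
import gc
import threading


class FakePool:
    """Hypothetical pool guarded by a non-reentrant threading.Lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.would_have_deadlocked = False

    def release(self):
        # Real code would block here. A non-blocking acquire is used
        # (an assumption of this sketch) so the failure can be observed.
        if not self._lock.acquire(blocking=False):
            self.would_have_deadlocked = True
            return
        self._lock.release()

    def get_connection(self):
        with self._lock:
            # Stand-in for the collector happening to run while the
            # lock is held by this thread.
            gc.collect()


class CyclicHolder:
    """Hypothetical object that sits in a cycle and has a destructor."""

    def __init__(self, pool):
        self.pool = pool
        self.me = self  # self-reference: only the cyclic GC can free this

    def __del__(self):
        self.pool.release()


def demo():
    gc.disable()  # make sure the garbage is still pending when we take the lock
    try:
        pool = FakePool()
        CyclicHolder(pool)  # becomes cyclic garbage immediately
        pool.get_connection()  # destructor runs inside the critical section
    finally:
        gc.enable()
    return pool.would_have_deadlocked
```

With a blocking `acquire` in `release`, the call inside `__del__` would wait on a lock owned by the very thread that is running the destructor, so it could never proceed.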

I've decided that the only solution to this problem is to avoid creating circular references to PubSub objects.
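One way to avoid such a cycle is to store callbacks weakly. The sketch below uses the standard library's `weakref.WeakMethod`; the class names are hypothetical and this is an illustration of the idea under that assumption, not a copy of the change in this PR.

```python
import gc
import weakref


class FakeConnection:
    """Hypothetical connection that keeps only weak references to callbacks."""

    def __init__(self):
        self._callbacks = []

    def register_connect_callback(self, callback):
        # WeakMethod does not keep the bound method's instance alive.
        self._callbacks.append(weakref.WeakMethod(callback))

    def on_connect(self):
        for ref in self._callbacks:
            callback = ref()  # None once the owner has been freed
            if callback is not None:
                callback(self)


class FakePubSub:
    """Hypothetical owner that registers one of its methods as a callback."""

    def __init__(self):
        self.connection = FakeConnection()
        self.connect_count = 0
        self.connection.register_connect_callback(self.on_connect)

    def on_connect(self, connection):
        self.connect_count += 1


def freed_without_gc():
    """Return (callback invocations, freed by refcounting alone)."""
    gc.disable()  # rule out help from the cyclic collector
    try:
        obj = FakePubSub()
        obj.connection.on_connect()  # the callback still fires
        calls = obj.connect_count
        probe = weakref.ref(obj)
        del obj  # no cycle, so CPython frees it right here
        return calls, probe() is None
    finally:
        gc.enable()
```

Because the object is now released by reference counting at a predictable point (on CPython), its destructor no longer runs at an arbitrary moment chosen by the cyclic collector.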

The test looks weird and I'm worried that it is a bit fragile, but I failed to devise a scheme that would make the Python garbage collector run at the exact moment when the pool is inside its critical section. So the test runs enough cycles to trigger the problem. However, different interpreters or different GC settings may let the test finish without locking up. I don't know how to solve that.

It works on CPython 3.9, though: without the weakref changes, the test hangs until pytest-timeout cancels it.

emorozov · 2021-09-27T14:00:12Z

Corresponding celery PR: celery/celery#6969
Initially I thought that the problem was caused by Celery code.

emorozov · 2021-10-07T08:53:45Z

Fixes #1583

chayim · 2021-10-25T06:36:21Z

@emorozov I want to noodle on this a bit more. I'm with you about being worried that it's a bit fragile, even though it might be the only way to go. I added the help wanted label, as I'd love input from others as well. Thank you so much for this!

chayim · 2021-11-04T15:01:25Z

I think this makes sense. I've validated these tests as much as I can, both with and without your changes, and I've tried on machines that should produce some deadlocks (really, really slow arm boxen). Mind merging the latest master into your branch? Specifically, you need to get your dependencies into the dev_dependencies and out of tox, and then I'll merge it in, probably on Sunday.

Thanks a tonne!

chayim · 2021-11-07T09:00:26Z

Hi @emorozov, we're close! The changes you made to tox.ini really belong in dev_requirements.txt. Your tox.ini shouldn't need modifications.

codecov-commenter · 2021-11-07T19:01:30Z

Codecov Report

Merging #1578 (369214f) into master (4257ceb) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1578      +/-   ##
==========================================
+ Coverage   89.73%   89.75%   +0.01%     
==========================================
  Files          57       57              
  Lines       11081    11093      +12     
==========================================
+ Hits         9944     9956      +12     
  Misses       1137     1137

Impacted Files	Coverage Δ
redis/connection.py	`71.85% <100.00%> (+0.10%)`	⬆️
tests/test_pubsub.py	`99.73% <100.00%> (+<0.01%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4257ceb...369214f.

emorozov · 2021-11-07T19:04:01Z

Hello @chayim ,
I've moved the dependency from tox.ini to dev_requirements.txt

chayim · 2021-11-08T06:59:03Z

Thank you @emorozov, I really appreciate it. Let's get this into the next release candidate. Merging!

emorozov mentioned this pull request Oct 1, 2021

Redis-py deadlocks during garbage collection #1583

Closed

chayim added need more info help-wanted and removed need more info labels Oct 25, 2021

Fixes garbage collection deadlock.

369214f

emorozov force-pushed the master branch from 36dd44e to 369214f Compare November 7, 2021 18:59

chayim added the bug label Nov 8, 2021

chayim changed the title ~~Fixes garbage collection deadlock.~~ Fix garbage collection deadlock Nov 8, 2021

chayim merged commit bba7518 into redis:master Nov 8, 2021
