Skip to content

Block signals during fpm master initialization #4471

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

plmnikulin
Copy link
Contributor

Fix PHP-FPM failure in the case of concurrent reload attempts.

Postpone signal delivery to the fpm master process till proper signal
handlers are set. Prevent the following case:

  • Running master process receives SIGUSR2 and performs execvp().
  • Another SIGUSR2 is arrived before signal handlers are set.
  • Master process dies.
  • Requests to the HTTP server handled by PHP-fpm can not be served
    any more.

Block some signals using sigprocmask() before execvp() and early
in the main() function. Unblock signals as soon as proper
handlers are set.

Fixes bug #74083

Fix PHP-FPM failure in the case of concurrent reload attempts.

Postpone signal delivery to the fpm master process till proper signal
handlers are set. Prevent the following case:

- Running master process receives `SIGUSR2` and performs `execvp()`.
- Another `SIGUSR2` is arrived before signal handlers are set.
- Master process dies.
- Requests to the HTTP server handled by PHP-fpm can not be served
  any more.

Block some signals using `sigprocmask()` before `execvp()` and early
in the `main()` function. Unblock signals as soon as proper
handlers are set.

Fixes bug #74083
@plmnikulin
Copy link
Contributor Author

I have a patch for Bug #76601 but it adds more code to same places so independent pull request will conflict with this one. That is why I am expecting review of this change at first.

@nikic nikic added the Bug label Jul 24, 2019
@nikic nikic requested a review from bukka July 24, 2019 10:30
@bukka
Copy link
Member

bukka commented Jul 24, 2019

As I said in the bug report, this one needs a test.

I have got actually some changes in fpmi (sort of my testing version of fpm) for that so it's possible to enable automatically opcache in FPMi tests. Essentially these are the changes in the tester (might be missing some later changes maybe):

bukka/fpmi@9baff5d

and this is the test:

https://github.com/bukka/fpmi/blob/e0c2e8ec725ce5ee7749407f4b404d190cc37829/tests/bug76895-reload-after-error.phpt

I will need to port it to FPM and do a bit more testing.

I think it looks good in general but need to think about it a bit more. In the worst case we might just merge it to 7.4 and see what happens... :)

@plmnikulin
Copy link
Contributor Author

This particular issue #74083 has nothing common with opcache.

The full story is the following. We were receiving complains concerning PHP-FPM crashes during reload and so web-sites downtime but the root cause was unclear. Concurrent reload #74083 was the most obvious problem. Before this patch was ready, I faced #76601 that is even more weird (the state could be unnoticed by watchdogs) and has no common with opcache as well. The latter is currently hidden by the bug #74083, and it is easier to reproduce bug #76601 when this patch is applied. I believe that #76601 is quite rare in real life scenarios. I prepared a patch currently submitted as the pull request #4460 that is enough to make php bug #76601 invisible (the pull request #4465 by @nikic is likely even better). Later we accidentally realized that #76895 may be the most probable origin of reload "hangs" and the pull request #4460 surprisingly fixes it. It is not a crash however. You know, people tell you that it was PHP-FPM crash even actually their site was unavailable due to "scoreboard is full" in apache.

In #76601 thread on bugs.php.net I posted my unsuccessful attempts to write a test for the bug #76895. I could not figure out that syntax error problem is tightly bound to opcache before. Now @nikic is trying to fix bug #76895 in #4461 so I believe that tests involving opcache should be discussed there.

I have tried to add a test, but race condition tests usually require some non-negligible time to achieve high enough probability of failure and depend on performance of an executor. People usually complain that they already have enough slow tests.

From my point of view this pull request should be merged after #4465 or #4460. Additional patch against #76601 is preferable but not strictly required.

@plmnikulin
Copy link
Contributor Author

Look e.g. at fpm_events.c#L440 I suspect that such workarounds are required due to lack of proper signal blocking during some operations. This patch is the first part to fix such cases.

Have not expected side effects of `include`.
echo "Reached interval $interval us with $step us steps\n";
$tester->expectLogNotice('Reloading in progress ...');
/* Consume mix of 'Reloading in progress ...' and 'reloading: .*' */
$tester->getLogLines(2000);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test spends 2 seconds here due to timeout. Test will be unreliable with exact number of lines since number of actual reloads may vary. Reload may be interrupted issued only a part of logs. More careful analysis of output requires reimplementation of functions like expectLogStartNotices(). That is why extra delay could hardly be avoided.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I will try to find some time to take a look on this so we can remove the slow test flag...

@plmnikulin plmnikulin mentioned this pull request Jul 30, 2019
@krakjoe
Copy link
Member

krakjoe commented Sep 30, 2019

@bukka could you take care of this and the related requests please ?

@@ -109,6 +109,10 @@ int __riscosify_control = __RISCOSIFY_STRICT_UNIX_SPECS;
#include "fpm_log.h"
#include "zlog.h"

#if HAVE_SIGNAL_H
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that HAVE_SIGNAL_H has been removed in 7.4 5f89157

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the rebase...

@plmnikulin
Copy link
Contributor Author

This pull request requires #4465 (merged to 7.4 only yet) or additional changes in children spawning that use functions introduced here.

@bukka
Copy link
Member

bukka commented Oct 6, 2019

I tried to run the test in FPMi: bukka/fpmi@9d02a13

But getting this:

sh: line 1: 13619 User defined signal 2   /home/jakub/prog/php/74/sapi/fpmi/php-fpmi -F -O -y /home/jakub/prog/php/74/sapi/fpmi/tests/bug74083-concurrent-reload.ini 2>&1
sh: line 0: kill: (13619) - No such process
ERROR: master process is dead
Reached interval 4000 us with 1000 us steps
sh: line 0: kill: (13619) - No such process
ERROR: The NOTICE does not match expected message:
- PATTERN: /^\[\d\d-\w\w\w-\d{4} \d\d:\d\d:\d\d\] NOTICE: Reloading in progress ...$/
- MESSAGE: 
- EXPECT: 'Reloading in progress ...'
- ACTUAL: ''
Done

This fix should be in it but might be because I'm missing some other fixes. Will probably need to look a bit more and port other stuff as well or will just try it in FPM.

@bukka
Copy link
Member

bukka commented Oct 6, 2019

@plmnikulin just to double check if the test is working fine for you in 7.4?

@plmnikulin
Copy link
Contributor Author

Sorry, I am quite busy these days.

  1. I rebased 3 commits of this pull request to 5b44560,
  2. tried a minimal variant in Ubuntu-18.04 lxc container ../php-src/configure --prefix /opt/issue-74083-7.4 --disable-fpmi --enable-fpm --enable-pcntl --enable-opcache, however opcache and fpmi are irrelevant here
  3. make test VERBOSE=1 TESTS='sapi/fpm/tests/bug74083-concurrent-reload.phpt' passed
  4. To check that test works I tried 5b44560 and sapi/fpm/tests/bug74083-concurrent-reload.phpt taken from the rebased branch. The test fails.
---- EXPECTED OUTPUT
Reached interval 25000 us with 1000 us steps
Done
---- ACTUAL OUTPUT
Reached interval 25000 us with 1000 us steps
ERROR: The NOTICE does not match expected message:
- PATTERN: /^\[\d\d-\w\w\w-\d{4} \d\d:\d\d:\d\d\] NOTICE: Reloading in progress ...$/
- MESSAGE: 
- EXPECT: 'Reloading in progress ...'
- ACTUAL: ''
Done
---- FAILED

Please, provide more details on your environment (maybe backtrace from coredump could be useful) if you suspect that the problem is not completely fixed. I have not checked if you have a patch fixing SIGKILL for children in the fpmi project.

@bukka
Copy link
Member

bukka commented Oct 20, 2019

@plmnikulin I made it work and the tests passes when patch is applied and fails when it's not which is good.

I merged to PHP 7.4 only for now so it gets in at least. If it's all good after 7.4 gets some usage, we might consider back port to 7.3

@plmnikulin
Copy link
Contributor Author

Promised patch against Bug #76601: #4836

@plmnikulin plmnikulin deleted the issue-74083 branch February 18, 2020 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants