Skip to content

Fix missed SIGKILL to children in PHP-FPM #4460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

plmnikulin
Copy link
Contributor

Use persistent timer to avoid infinite wait for children completion and
to prevent web-sites down time. During reload (reexec) SIGKILL was
never sent by master process to workers due to single-shot timer
implementation details. It just can not be rescheduled after SIGTERM
from the handler.

  • Make Bug #76895 ("'break' not in the 'loop' or 'switch' context" when
    opcache is enabled) harmless. It is not a fix however.
  • Bug #76601 "Partially working php-fpm ater incomplete reload"
    (SIGUSR2 received at an inconvenient moment during reload)
    could be considered fixed but it is better to add one more patch.

More bugs have similar symptoms:

  • Bug #77140 "FPM occassionally hangs"
  • Bug #77443 "FPM hangs on reload due to children handling SIGTERM
    since PHP 7.1"

Use persistent timer to avoid infinite wait for children completion and
to prevent web-sites down time. During reload (reexec) SIGKILL was
never sent by master process to workers due to single-shot timer
implementation details. It just can not be rescheduled after SIGTERM
from the handler.

- Make Bug #76895 ("'break' not in the 'loop' or 'switch' context" when
  opcache is enabled) harmless. It is not a fix however.
- Bug #76601 "Partially working php-fpm ater incomplete reload"
  (SIGUSR2 received at an inconvenient moment during reload)
  could be considered fixed but it is better to add one more patch.

More bugs have similar symptoms:

- Bug #77140 "FPM occassionally hangs"
- Bug #77443 "FPM hangs on reload due to children handling SIGTERM
  since PHP 7.1"
@plmnikulin
Copy link
Contributor Author

Bug #76895 needs a proper fix but in the current state it is terrible that one buggy PHP script on a server could make all pools non-responding after an attempt of PHP-FPM reload. It may happen several hours after request to the bad script that is why such case is usually quite obscure. Maybe I do not realize some specific of pools configuration but I consider this bug as really ugly and easily reproducible.

Anyway this patch just makes working the code that was dead for years.

@nikic nikic requested a review from bukka July 23, 2019 07:59
@nikic
Copy link
Member

nikic commented Jul 23, 2019

During reload (reexec) SIGKILL was never sent by master process to workers due to single-shot timer implementation details.

Is this explained anywhere?

@nikic
Copy link
Member

nikic commented Jul 23, 2019

I've opened #4461 to fix the Zend signals issue, which also takes some steps to make sure we notice problems like this in the future. With that change, is this one still necessary as well?

@plmnikulin
Copy link
Contributor Author

During reload (reexec) SIGKILL was never sent by master process to workers due to single-shot timer implementation details.

Is this explained anywhere?

Look at fpm_event_loop()
https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_events.c#L446
non-persistent timer is unconditionally deleted after its handler is finished.

First time fpm_pctl_action_next() https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L171 is activated by by the signal taht initiates reload. It sends SIGQUIT and scheduling SIGTERM in https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L197 After sending of SIGTERM the same timer is rescheduled for SIGKILL at the same line https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L197 but in the case of single-shot timer it is removed just after timer event completion. I may confuse details a bit, the same patch attached to the bug is one year old.

You can set debug log level for PHP-FPM to confirm that SIGKILL is never sent.

I've opened #4461 to fix the Zend signals issue, which also takes some steps to make sure we notice problems like this in the future. With that change, is this one still necessary as well?

Glad to see that the original problem is about to be fixed.

I believe that this patch still provides more reliability against Bug #76601 Though in the cases I have seen, blocking signals for the interval of spawning new child was enough.

There are may be other reason to have SIGKILL as the last measure. Unsure if it is realistic: child may be replaced by exec() to some process blocking SIGTERM, it might be a buggy extension. If you are sure that SIGKILL is not necessary, you may remove all code related to it.

@nikic
Copy link
Member

nikic commented Jul 23, 2019

During reload (reexec) SIGKILL was never sent by master process to workers due to single-shot timer implementation details.

Is this explained anywhere?

Look at fpm_event_loop()
https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_events.c#L446
non-persistent timer is unconditionally deleted after its handler is finished.

First time fpm_pctl_action_next() https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L171 is activated by by the signal taht initiates reload. It sends SIGQUIT and scheduling SIGTERM in https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L197 After sending of SIGTERM the same timer is rescheduled for SIGKILL at the same line https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L197 but in the case of single-shot timer it is removed just after timer event completion. I may confuse details a bit, the same patch attached to the bug is one year old.

You can set debug log level for PHP-FPM to confirm that SIGKILL is never sent.

Thanks for the explanation. I think in that case the better way to patch this would be to either a) prevent it from being unregistered or b) register a separate single-shot timer event for the sigkill. I think option b) will be simpler. With your current implementation, wouldn't the timer just continue to fire every 1 second?

I've opened #4461 to fix the Zend signals issue, which also takes some steps to make sure we notice problems like this in the future. With that change, is this one still necessary as well?

Glad to see that the original problem is about to be fixed.

I believe that this patch still provides more reliability against Bug #76601 Though in the cases I have seen, blocking signals for the interval of spawning new child was enough.

There are may be other reason to have SIGKILL as the last measure. Unsure if it is realistic: child may be replaced by exec() to some process blocking SIGTERM, it might be a buggy extension. If you are sure that SIGKILL is not necessary, you may remove all code related to it.

Okay, I agree that sending the SIGKILL as a last measure is a good idea.

@nikic
Copy link
Member

nikic commented Jul 23, 2019

Here is a test I came up with:

--TEST--
If SIGQUIT and SIGTERM durin reloading fail, SIGKILL should be sent
--SKIPIF--
<?php
include "skipif.inc";
if (!function_exists('pcntl_sigprocmask')) die('skip Requires pcntl_sigprocmask()');
?>
--FILE--
<?php

require_once "tester.inc";

$cfg = <<<EOT
[global]
error_log = {{FILE:LOG}}
pid = {{FILE:PID}}
process_control_timeout=1
[unconfined]
listen = {{ADDR}}
ping.path = /ping
ping.response = pong
pm = dynamic
pm.max_children = 5
pm.start_servers = 1
pm.min_spare_servers = 1
pm.max_spare_servers = 1
EOT;

$code = <<<EOT
<?php
pcntl_sigprocmask(SIG_BLOCK, [SIGQUIT]);
pcntl_sigprocmask(SIG_BLOCK, [SIGTERM]);
EOT;

$tester = new FPM\Tester($cfg, $code);
$tester->start();
$tester->expectLogStartNotices();
$tester->request()->expectEmptyBody();
$tester->signal('USR2');
$tester->expectLogNotice('Reloading in progress ...');
$tester->expectLogNotice('reloading: .*');
$tester->expectLogNotice('using inherited socket fd=\d+, "127.0.0.1:\d+"');
$tester->expectLogStartNotices();
$tester->ping('{{ADDR}}');
$tester->terminate();
$tester->expectLogTerminatingNotices();
$tester->close();

?>
Done
--EXPECT--
Done
--CLEAN--
<?php
require_once "tester.inc";
FPM\Tester::clean();
?>

Best combined with a slight bump of the socket_select in getLogLine().

@nikic
Copy link
Member

nikic commented Jul 23, 2019

Thanks for the explanation. I think in that case the better way to patch this would be to either a) prevent it from being unregistered or b) register a separate single-shot timer event for the sigkill. I think option b) will be simpler. With your current implementation, wouldn't the timer just continue to fire every 1 second?

After thinking about this, maybe your approach is fine after all. If all children are killed, then we'll either exit or execvp, in which case we'll lose the timer at that point (right?) or not all children are killed, in which case resending sigkill until they are dead is fine.

@nikic
Copy link
Member

nikic commented Jul 23, 2019

This still leaves the implementation somewhat confusing. I think the real issue here might be that event_add will not add the event if it already exists, which might be generally sensible, but not if you try to reschedule a timer event from its handler. Might it make sense to first remove the event from the queue and then invoke the handler, so the rescheduling works properly?

@nikic
Copy link
Member

nikic commented Jul 23, 2019

I've opened #4465 with a possible alternative per my previous comment.

@plmnikulin plmnikulin mentioned this pull request Jul 24, 2019
@plmnikulin
Copy link
Contributor Author

As soon as I realized that the bug could be fixed by the one-line patch, I did not try fix the problem in the proper way. This change is local, it does not affect other calls of fpm_event_set_timer() that may rely on the current behavior. It is easier to spread short patch across various PHP version and support it for new releases.

Concerning repeating SIGKILL, I am aware of such feature. I considered it as not harm or even as an advantage in some rare cases that I could not imagine.

The approach in #4465 is better from my point of view. Feel free to decline this pull request when you variant is merged.

@nikic
Copy link
Member

nikic commented Aug 9, 2019

Now that #4461 and #4465 are merged, I think we can close this.

@nikic nikic closed this Aug 9, 2019
@plmnikulin plmnikulin deleted the issue-76895 branch February 18, 2020 04:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants