Fix missed SIGKILL to children in PHP-FPM #4460

plmnikulin · 2019-07-23T03:58:10Z

Use persistent timer to avoid infinite wait for children completion and
to prevent web-sites down time. During reload (reexec) SIGKILL was
never sent by master process to workers due to single-shot timer
implementation details. It just can not be rescheduled after SIGTERM
from the handler.

Make Bug #76895 ("'break' not in the 'loop' or 'switch' context" when
opcache is enabled) harmless. It is not a fix however.
Bug #76601 "Partially working php-fpm ater incomplete reload"
(SIGUSR2 received at an inconvenient moment during reload)
could be considered fixed but it is better to add one more patch.

More bugs have similar symptoms:

Bug #77140 "FPM occassionally hangs"
Bug #77443 "FPM hangs on reload due to children handling SIGTERM
since PHP 7.1"

Use persistent timer to avoid infinite wait for children completion and to prevent web-sites down time. During reload (reexec) SIGKILL was never sent by master process to workers due to single-shot timer implementation details. It just can not be rescheduled after SIGTERM from the handler. - Make Bug #76895 ("'break' not in the 'loop' or 'switch' context" when opcache is enabled) harmless. It is not a fix however. - Bug #76601 "Partially working php-fpm ater incomplete reload" (SIGUSR2 received at an inconvenient moment during reload) could be considered fixed but it is better to add one more patch. More bugs have similar symptoms: - Bug #77140 "FPM occassionally hangs" - Bug #77443 "FPM hangs on reload due to children handling SIGTERM since PHP 7.1"

plmnikulin · 2019-07-23T04:08:32Z

Bug #76895 needs a proper fix but in the current state it is terrible that one buggy PHP script on a server could make all pools non-responding after an attempt of PHP-FPM reload. It may happen several hours after request to the bad script that is why such case is usually quite obscure. Maybe I do not realize some specific of pools configuration but I consider this bug as really ugly and easily reproducible.

Anyway this patch just makes working the code that was dead for years.

nikic · 2019-07-23T08:01:34Z

During reload (reexec) SIGKILL was never sent by master process to workers due to single-shot timer implementation details.

Is this explained anywhere?

nikic · 2019-07-23T08:49:57Z

I've opened #4461 to fix the Zend signals issue, which also takes some steps to make sure we notice problems like this in the future. With that change, is this one still necessary as well?

plmnikulin · 2019-07-23T11:16:13Z

During reload (reexec) SIGKILL was never sent by master process to workers due to single-shot timer implementation details.

Is this explained anywhere?

Look at fpm_event_loop()
https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_events.c#L446
non-persistent timer is unconditionally deleted after its handler is finished.

First time fpm_pctl_action_next() https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L171 is activated by by the signal taht initiates reload. It sends SIGQUIT and scheduling SIGTERM in https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L197 After sending of SIGTERM the same timer is rescheduled for SIGKILL at the same line https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L197 but in the case of single-shot timer it is removed just after timer event completion. I may confuse details a bit, the same patch attached to the bug is one year old.

You can set debug log level for PHP-FPM to confirm that SIGKILL is never sent.

I've opened #4461 to fix the Zend signals issue, which also takes some steps to make sure we notice problems like this in the future. With that change, is this one still necessary as well?

Glad to see that the original problem is about to be fixed.

I believe that this patch still provides more reliability against Bug #76601 Though in the cases I have seen, blocking signals for the interval of spawning new child was enough.

There are may be other reason to have SIGKILL as the last measure. Unsure if it is realistic: child may be replaced by exec() to some process blocking SIGTERM, it might be a buggy extension. If you are sure that SIGKILL is not necessary, you may remove all code related to it.

nikic · 2019-07-23T12:55:55Z

During reload (reexec) SIGKILL was never sent by master process to workers due to single-shot timer implementation details.

Is this explained anywhere?

Look at fpm_event_loop()
https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_events.c#L446
non-persistent timer is unconditionally deleted after its handler is finished.

First time fpm_pctl_action_next() https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L171 is activated by by the signal taht initiates reload. It sends SIGQUIT and scheduling SIGTERM in https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L197 After sending of SIGTERM the same timer is rescheduled for SIGKILL at the same line https://github.com/php/php-src/blob/master/sapi/fpm/fpm/fpm_process_ctl.c#L197 but in the case of single-shot timer it is removed just after timer event completion. I may confuse details a bit, the same patch attached to the bug is one year old.

You can set debug log level for PHP-FPM to confirm that SIGKILL is never sent.

Thanks for the explanation. I think in that case the better way to patch this would be to either a) prevent it from being unregistered or b) register a separate single-shot timer event for the sigkill. I think option b) will be simpler. With your current implementation, wouldn't the timer just continue to fire every 1 second?

I've opened #4461 to fix the Zend signals issue, which also takes some steps to make sure we notice problems like this in the future. With that change, is this one still necessary as well?

Glad to see that the original problem is about to be fixed.

I believe that this patch still provides more reliability against Bug #76601 Though in the cases I have seen, blocking signals for the interval of spawning new child was enough.

There are may be other reason to have SIGKILL as the last measure. Unsure if it is realistic: child may be replaced by exec() to some process blocking SIGTERM, it might be a buggy extension. If you are sure that SIGKILL is not necessary, you may remove all code related to it.

Okay, I agree that sending the SIGKILL as a last measure is a good idea.

nikic · 2019-07-23T13:35:03Z

Here is a test I came up with:

--TEST--
If SIGQUIT and SIGTERM durin reloading fail, SIGKILL should be sent
--SKIPIF--
<?php
include "skipif.inc";
if (!function_exists('pcntl_sigprocmask')) die('skip Requires pcntl_sigprocmask()');
?>
--FILE--
<?php

require_once "tester.inc";

$cfg = <<<EOT
[global]
error_log = {{FILE:LOG}}
pid = {{FILE:PID}}
process_control_timeout=1
[unconfined]
listen = {{ADDR}}
ping.path = /ping
ping.response = pong
pm = dynamic
pm.max_children = 5
pm.start_servers = 1
pm.min_spare_servers = 1
pm.max_spare_servers = 1
EOT;

$code = <<<EOT
<?php
pcntl_sigprocmask(SIG_BLOCK, [SIGQUIT]);
pcntl_sigprocmask(SIG_BLOCK, [SIGTERM]);
EOT;

$tester = new FPM\Tester($cfg, $code);
$tester->start();
$tester->expectLogStartNotices();
$tester->request()->expectEmptyBody();
$tester->signal('USR2');
$tester->expectLogNotice('Reloading in progress ...');
$tester->expectLogNotice('reloading: .*');
$tester->expectLogNotice('using inherited socket fd=\d+, "127.0.0.1:\d+"');
$tester->expectLogStartNotices();
$tester->ping('{{ADDR}}');
$tester->terminate();
$tester->expectLogTerminatingNotices();
$tester->close();

?>
Done
--EXPECT--
Done
--CLEAN--
<?php
require_once "tester.inc";
FPM\Tester::clean();
?>

Best combined with a slight bump of the socket_select in getLogLine().

nikic · 2019-07-23T13:42:31Z

Thanks for the explanation. I think in that case the better way to patch this would be to either a) prevent it from being unregistered or b) register a separate single-shot timer event for the sigkill. I think option b) will be simpler. With your current implementation, wouldn't the timer just continue to fire every 1 second?

After thinking about this, maybe your approach is fine after all. If all children are killed, then we'll either exit or execvp, in which case we'll lose the timer at that point (right?) or not all children are killed, in which case resending sigkill until they are dead is fine.

nikic · 2019-07-23T13:48:17Z

This still leaves the implementation somewhat confusing. I think the real issue here might be that event_add will not add the event if it already exists, which might be generally sensible, but not if you try to reschedule a timer event from its handler. Might it make sense to first remove the event from the queue and then invoke the handler, so the rescheduling works properly?

nikic · 2019-07-23T13:57:16Z

I've opened #4465 with a possible alternative per my previous comment.

plmnikulin · 2019-07-24T09:17:16Z

As soon as I realized that the bug could be fixed by the one-line patch, I did not try fix the problem in the proper way. This change is local, it does not affect other calls of fpm_event_set_timer() that may rely on the current behavior. It is easier to spread short patch across various PHP version and support it for new releases.

Concerning repeating SIGKILL, I am aware of such feature. I considered it as not harm or even as an advantage in some rare cases that I could not imagine.

The approach in #4465 is better from my point of view. Feel free to decline this pull request when you variant is merged.

nikic · 2019-08-09T09:07:01Z

Now that #4461 and #4465 are merged, I think we can close this.

nikic requested a review from bukka July 23, 2019 07:59

nikic mentioned this pull request Jul 23, 2019

Fix FPM timer event re-registration #4465

Closed

plmnikulin mentioned this pull request Jul 24, 2019

Fix bug #76895 #4461

Closed

nikic added the Bug label Jul 24, 2019

plmnikulin mentioned this pull request Jul 25, 2019

Block signals during fpm master initialization #4471

Closed

nikic closed this Aug 9, 2019

plmnikulin deleted the issue-76895 branch February 18, 2020 04:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix missed SIGKILL to children in PHP-FPM #4460

Fix missed SIGKILL to children in PHP-FPM #4460

Uh oh!

plmnikulin commented Jul 23, 2019

Uh oh!

plmnikulin commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

plmnikulin commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

plmnikulin commented Jul 24, 2019

Uh oh!

nikic commented Aug 9, 2019

Uh oh!

Uh oh!

Fix missed SIGKILL to children in PHP-FPM #4460

Fix missed SIGKILL to children in PHP-FPM #4460

Uh oh!

Conversation

plmnikulin commented Jul 23, 2019

Uh oh!

plmnikulin commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

plmnikulin commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

nikic commented Jul 23, 2019

Uh oh!

plmnikulin commented Jul 24, 2019

Uh oh!

nikic commented Aug 9, 2019

Uh oh!

Uh oh!