-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Fix missed SIGKILL to children in PHP-FPM #4460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Use persistent timer to avoid infinite wait for children completion and to prevent web-sites down time. During reload (reexec) SIGKILL was never sent by master process to workers due to single-shot timer implementation details. It just can not be rescheduled after SIGTERM from the handler. - Make Bug #76895 ("'break' not in the 'loop' or 'switch' context" when opcache is enabled) harmless. It is not a fix however. - Bug #76601 "Partially working php-fpm ater incomplete reload" (SIGUSR2 received at an inconvenient moment during reload) could be considered fixed but it is better to add one more patch. More bugs have similar symptoms: - Bug #77140 "FPM occassionally hangs" - Bug #77443 "FPM hangs on reload due to children handling SIGTERM since PHP 7.1"
Bug #76895 needs a proper fix but in the current state it is terrible that one buggy PHP script on a server could make all pools non-responding after an attempt of PHP-FPM reload. It may happen several hours after request to the bad script that is why such case is usually quite obscure. Maybe I do not realize some specific of pools configuration but I consider this bug as really ugly and easily reproducible. Anyway this patch just makes working the code that was dead for years. |
Is this explained anywhere? |
I've opened #4461 to fix the Zend signals issue, which also takes some steps to make sure we notice problems like this in the future. With that change, is this one still necessary as well? |
Look at First time You can set debug log level for PHP-FPM to confirm that
Glad to see that the original problem is about to be fixed. I believe that this patch still provides more reliability against Bug #76601 Though in the cases I have seen, blocking signals for the interval of spawning new child was enough. There are may be other reason to have |
Thanks for the explanation. I think in that case the better way to patch this would be to either a) prevent it from being unregistered or b) register a separate single-shot timer event for the sigkill. I think option b) will be simpler. With your current implementation, wouldn't the timer just continue to fire every 1 second?
Okay, I agree that sending the SIGKILL as a last measure is a good idea. |
Here is a test I came up with:
Best combined with a slight bump of the socket_select in getLogLine(). |
After thinking about this, maybe your approach is fine after all. If all children are killed, then we'll either exit or execvp, in which case we'll lose the timer at that point (right?) or not all children are killed, in which case resending sigkill until they are dead is fine. |
This still leaves the implementation somewhat confusing. I think the real issue here might be that event_add will not add the event if it already exists, which might be generally sensible, but not if you try to reschedule a timer event from its handler. Might it make sense to first remove the event from the queue and then invoke the handler, so the rescheduling works properly? |
I've opened #4465 with a possible alternative per my previous comment. |
As soon as I realized that the bug could be fixed by the one-line patch, I did not try fix the problem in the proper way. This change is local, it does not affect other calls of Concerning repeating The approach in #4465 is better from my point of view. Feel free to decline this pull request when you variant is merged. |
Use persistent timer to avoid infinite wait for children completion and
to prevent web-sites down time. During reload (reexec) SIGKILL was
never sent by master process to workers due to single-shot timer
implementation details. It just can not be rescheduled after SIGTERM
from the handler.
opcache is enabled) harmless. It is not a fix however.
(SIGUSR2 received at an inconvenient moment during reload)
could be considered fixed but it is better to add one more patch.
More bugs have similar symptoms:
since PHP 7.1"