Commit ebec18a

poettering authored and torvalds committed
prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision

Userspace service managers/supervisors need to track their started services. Many services daemonize by double-forking and get implicitly re-parented to PID 1. The service manager will no longer be able to receive the SIGCHLD signals for them, and is no longer in charge of reaping the children with wait(). All information about the children is lost at the moment PID 1 cleans up the re-parented processes.

With this prctl, a service manager process can mark itself as a sort of 'sub-init', able to stay as the parent for all orphaned processes created by the started services. All SIGCHLD signals will be delivered to the service manager.

Receiving SIGCHLD and doing wait() is in cases of a service-manager much preferred over any possible asynchronous notification about specific PIDs, because the service manager has full access to the child process data in /proc and the PID can not be re-used until the wait(), the service-manager itself is in charge of, has happened.

As a side effect, the relevant parent PID information does not get lost by a double-fork, which results in a more elaborate process tree and 'ps' output:

before:
  # ps afx
  253 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
  294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
  328 ?        S      0:00 /usr/sbin/modem-manager
  608 ?        Sl     0:00 /usr/libexec/colord
  658 ?        Sl     0:00 /usr/libexec/upowerd
  819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
  916 ?        Sl     0:00 /usr/libexec/udisks-daemon
  917 ?        S      0:00  \_ udisks-daemon: not polling any devices

after:
  # ps afx
  294 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
  426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
  449 ?        S      0:00  \_ /usr/sbin/modem-manager
  635 ?        Sl     0:00  \_ /usr/libexec/colord
  705 ?        Sl     0:00  \_ /usr/libexec/upowerd
  959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
  960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
  977 ?        Sl     0:00  \_ /usr/libexec/packagekitd

This prctl is orthogonal to PID namespaces. PID namespaces are isolated from each other, while a service management process usually requires the services to live in the same namespace, to be able to talk to each other.

Users of this will be the systemd per-user instance, which provides init-like functionality for the user's login session and D-Bus, which activates bus services on-demand. Both need init-like capabilities to be able to properly keep track of the services they start.

Many thanks to Oleg for several rounds of review and insights.

[[email protected]: fix comment layout and spelling]
[[email protected]: add lengthy code comment from Oleg]
Reviewed-by: Oleg Nesterov <[email protected]>
Signed-off-by: Lennart Poettering <[email protected]>
Signed-off-by: Kay Sievers <[email protected]>
Acked-by: Valdis Kletnieks <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

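
The double-fork scenario from the commit message can be reproduced from userspace. The following is a minimal sketch, not code from this commit: the function name orphan_is_reparented_to_caller() is hypothetical, and it assumes a Linux kernel that includes this prctl. The caller marks itself as a subreaper, starts an intermediate process that forks a "daemon" and exits, and the daemon reports its new parent PID over a pipe.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PR_SET_CHILD_SUBREAPER
#define PR_SET_CHILD_SUBREAPER 36	/* value taken from the diff to prctl.h */
#endif

/*
 * Hypothetical helper: returns 1 if a double-forked grandchild ends up
 * re-parented to the caller, 0 if it went elsewhere, -1 on error.
 * Note: it reaps every child of the caller before returning.
 */
int orphan_is_reparented_to_caller(void)
{
	int pfd[2];
	pid_t self = getpid();
	pid_t reported = -1;
	pid_t child;
	ssize_t n;

	if (prctl(PR_SET_CHILD_SUBREAPER, 1UL, 0UL, 0UL, 0UL) < 0)
		return -1;
	if (pipe(pfd) < 0)
		return -1;

	child = fork();
	if (child < 0)
		return -1;

	if (child == 0) {
		/* Intermediate process: fork the "daemon", then exit at once. */
		pid_t middle = getpid();
		pid_t daemon = fork();

		if (daemon == 0) {
			pid_t new_parent;

			close(pfd[0]);
			/* Wait until the intermediate process is gone. */
			while (getppid() == middle)
				usleep(1000);
			new_parent = getppid();
			if (write(pfd[1], &new_parent, sizeof(new_parent)) !=
			    (ssize_t)sizeof(new_parent))
				_exit(1);
			_exit(0);
		}
		_exit(daemon < 0 ? 1 : 0);
	}

	close(pfd[1]);
	n = read(pfd[0], &reported, sizeof(reported));
	close(pfd[0]);

	/* As the subreaper, we reap both the child and the orphan. */
	while (wait(NULL) > 0)
		;

	if (n != (ssize_t)sizeof(reported))
		return -1;
	return reported == self;
}
```

Without the prctl() call, the same function would be expected to return 0, because the orphan would be handed to the init process of the PID namespace instead.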
1 parent 953326c commit ebec18a

5 files changed: +54 -5 lines changed

include/linux/prctl.h

Lines changed: 3 additions & 0 deletions
@@ -121,4 +121,7 @@
 #define PR_SET_PTRACER 0x59616d61
 # define PR_SET_PTRACER_ANY ((unsigned long)-1)

+#define PR_SET_CHILD_SUBREAPER 36
+#define PR_GET_CHILD_SUBREAPER 37
+
 #endif /* _LINUX_PRCTL_H */

include/linux/sched.h

Lines changed: 12 additions & 0 deletions
@@ -553,6 +553,18 @@ struct signal_struct {
 	int group_stop_count;
 	unsigned int flags; /* see SIGNAL_* flags below */

+	/*
+	 * PR_SET_CHILD_SUBREAPER marks a process, like a service
+	 * manager, to re-parent orphan (double-forking) child processes
+	 * to this process instead of 'init'. The service manager is
+	 * able to receive SIGCHLD signals and is able to investigate
+	 * the process until it calls wait(). All children of this
+	 * process will inherit a flag if they should look for a
+	 * child_subreaper process at exit.
+	 */
+	unsigned int is_child_subreaper:1;
+	unsigned int has_child_subreaper:1;
+
 	/* POSIX.1b Interval Timers */
 	struct list_head posix_timers;
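
The comment distinguishes the subreaper mark itself (is_child_subreaper) from the flag that children inherit (has_child_subreaper). Only the former is visible through PR_GET_CHILD_SUBREAPER. The sketch below, with the hypothetical function name subreaper_flag_seen_by_child(), checks from userspace that a forked child does not itself become a subreaper. This matches the behaviour described in the prctl(2) manual page; treat it as an illustration rather than part of the commit.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PR_SET_CHILD_SUBREAPER
#define PR_SET_CHILD_SUBREAPER 36	/* values taken from the diff to prctl.h */
#define PR_GET_CHILD_SUBREAPER 37
#endif

/*
 * Hypothetical helper: marks the caller as a subreaper, forks, and
 * returns the value PR_GET_CHILD_SUBREAPER reports inside the child
 * (0 or 1), or -1 on error.
 */
int subreaper_flag_seen_by_child(void)
{
	int status;
	pid_t child;

	if (prctl(PR_SET_CHILD_SUBREAPER, 1UL, 0UL, 0UL, 0UL) < 0)
		return -1;

	child = fork();
	if (child < 0)
		return -1;

	if (child == 0) {
		int flag = -1;

		if (prctl(PR_GET_CHILD_SUBREAPER, (unsigned long)&flag,
			  0UL, 0UL, 0UL) < 0)
			_exit(2);
		/* Pass the flag back to the parent as the exit status. */
		_exit(flag == 0 || flag == 1 ? flag : 2);
	}

	if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	if (WEXITSTATUS(status) > 1)
		return -1;
	return WEXITSTATUS(status);
}
```

The child reports 0 because only the process that called prctl() is the subreaper. Its descendants merely remember that one exists somewhere above them.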

kernel/exit.c

Lines changed: 28 additions & 5 deletions
@@ -687,11 +687,11 @@ static void exit_mm(struct task_struct * tsk)
 }

 /*
- * When we die, we re-parent all our children.
- * Try to give them to another thread in our thread
- * group, and if no such member exists, give it to
- * the child reaper process (ie "init") in our pid
- * space.
+ * When we die, we re-parent all our children, and try to:
+ * 1. give them to another thread in our thread group, if such a member exists
+ * 2. give it to the first ancestor process which prctl'd itself as a
+ *    child_subreaper for its children (like a service manager)
+ * 3. give it to the init process (PID 1) in our pid namespace
  */
 static struct task_struct *find_new_reaper(struct task_struct *father)
 	__releases(&tasklist_lock)
@@ -722,6 +722,29 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
 		 * forget_original_parent() must move them somewhere.
 		 */
 		pid_ns->child_reaper = init_pid_ns.child_reaper;
+	} else if (father->signal->has_child_subreaper) {
+		struct task_struct *reaper;
+
+		/*
+		 * Find the first ancestor marked as child_subreaper.
+		 * Note that the code below checks same_thread_group(reaper,
+		 * pid_ns->child_reaper). This is what we need to DTRT in a
+		 * PID namespace. However we still need the check above, see
+		 * http://marc.info/?l=linux-kernel&m=131385460420380
+		 */
+		for (reaper = father->real_parent;
+		     reaper != &init_task;
+		     reaper = reaper->real_parent) {
+			if (same_thread_group(reaper, pid_ns->child_reaper))
+				break;
+			if (!reaper->signal->is_child_subreaper)
+				continue;
+			thread = reaper;
+			do {
+				if (!(thread->flags & PF_EXITING))
+					return reaper;
+			} while_each_thread(reaper, thread);
+		}
 	}

 	return pid_ns->child_reaper;

kernel/fork.c

Lines changed: 3 additions & 0 deletions
@@ -1051,6 +1051,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;

+	sig->has_child_subreaper = current->signal->has_child_subreaper ||
+				   current->signal->is_child_subreaper;
+
 	mutex_init(&sig->cred_guard_mutex);

 	return 0;
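
How the fork-time flag and the ancestor walk in find_new_reaper() interact can be sketched with a small userspace model. This is a deliberate simplification and not kernel code: struct model_proc and the model_* functions are hypothetical names, and threads, PID namespaces and locking are ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a task and its signal_struct flags. */
struct model_proc {
	int pid;
	struct model_proc *real_parent;
	bool is_child_subreaper;	/* set via prctl() in the real kernel */
	bool has_child_subreaper;	/* inherited at fork time */
	bool exiting;
};

/* Mirrors the rule added to copy_signal(): the child remembers whether
 * any ancestor was a subreaper at the time of the fork. */
static void model_fork(struct model_proc *parent, struct model_proc *child,
		       int pid)
{
	child->pid = pid;
	child->real_parent = parent;
	child->is_child_subreaper = false;
	child->has_child_subreaper = parent->has_child_subreaper ||
				     parent->is_child_subreaper;
	child->exiting = false;
}

/* Mirrors the new branch in find_new_reaper(): walk up the ancestors and
 * pick the first live subreaper, falling back to init. */
static struct model_proc *model_find_new_reaper(struct model_proc *father,
						struct model_proc *init)
{
	struct model_proc *reaper;

	if (father->has_child_subreaper) {
		for (reaper = father->real_parent;
		     reaper != NULL && reaper != init;
		     reaper = reaper->real_parent) {
			if (!reaper->is_child_subreaper)
				continue;
			if (!reaper->exiting)
				return reaper;
		}
	}
	return init;
}

/* Builds init -> manager -> service -> daemon, lets the service exit,
 * and returns the PID of the process that adopts the daemon. */
int model_new_parent_pid(bool manager_is_subreaper)
{
	struct model_proc init = { .pid = 1 };
	struct model_proc manager, service, daemon;

	model_fork(&init, &manager, 2);
	manager.is_child_subreaper = manager_is_subreaper;
	model_fork(&manager, &service, 3);
	model_fork(&service, &daemon, 4);

	service.exiting = true;	/* intermediate process of a double fork */
	return model_find_new_reaper(&service, &init)->pid;
}
```

In this model the manager must be marked before it forks the service. With the copy_signal() change in this commit, has_child_subreaper is only computed at fork time, so a process forked before the mark is set does not carry the flag.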

kernel/sys.c

Lines changed: 8 additions & 0 deletions
@@ -1962,6 +1962,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		case PR_SET_MM:
 			error = prctl_set_mm(arg2, arg3, arg4, arg5);
 			break;
+		case PR_SET_CHILD_SUBREAPER:
+			me->signal->is_child_subreaper = !!arg2;
+			error = 0;
+			break;
+		case PR_GET_CHILD_SUBREAPER:
+			error = put_user(me->signal->is_child_subreaper,
+					 (int __user *) arg2);
+			break;
 		default:
 			error = -EINVAL;
 			break;
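
From userspace, the two new options behave as a boolean setter and a getter that writes through a pointer. The sketch below uses the hypothetical helper name subreaper_roundtrip() to show the calling convention implied by the hunk above: PR_SET_CHILD_SUBREAPER takes the value in arg2 and normalizes it with !!, while PR_GET_CHILD_SUBREAPER expects arg2 to be the address of an int.

```c
#include <sys/prctl.h>

#ifndef PR_SET_CHILD_SUBREAPER
#define PR_SET_CHILD_SUBREAPER 36	/* values taken from the diff to prctl.h */
#define PR_GET_CHILD_SUBREAPER 37
#endif

/*
 * Hypothetical helper: sets the subreaper flag of the calling process to
 * `value`, reads it back, and returns what the kernel reports (0 or 1),
 * or -1 if either prctl() call fails.
 */
int subreaper_roundtrip(unsigned long value)
{
	int flag = -1;

	/* Any non-zero value is stored as 1 because of the !!arg2. */
	if (prctl(PR_SET_CHILD_SUBREAPER, value, 0UL, 0UL, 0UL) < 0)
		return -1;

	/* The kernel writes the flag into the int that arg2 points to. */
	if (prctl(PR_GET_CHILD_SUBREAPER, (unsigned long)&flag,
		  0UL, 0UL, 0UL) < 0)
		return -1;

	return flag;
}
```

Calling subreaper_roundtrip(42) therefore reads back 1, and subreaper_roundtrip(0) clears the mark again and reads back 0.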
