Commit a9d0aab
tracing: Fix regression of filter waiting a long time on RCU synchronization
When faultable trace events were added, a trace event may no longer use
normal RCU for synchronization but instead may use
synchronize_rcu_tasks_trace(), which takes much longer to complete.

The filter logic would free the filters by calling
tracepoint_synchronize_unregister() after it unhooked the filter strings
and before freeing them. With this function now calling
synchronize_rcu_tasks_trace(), the time to free a filter increased
tremendously. On a PREEMPT_RT system, it was even more noticeable:

 # time trace-cmd record -p function sleep 1
 [..]
 real	2m29.052s
 user	0m0.244s
 sys	0m20.136s

As trace-cmd would clear out all the filters before recording, it could
take up to 2 minutes to do a recording of "sleep 1".

To find out where the issue was:

 ~# trace-cmd sqlhist -e -n sched_stack select start.prev_state as state, end.next_comm as comm, TIMESTAMP_DELTA_USECS as delta, start.STACKTRACE as stack from sched_switch as start join sched_switch as end on start.prev_pid = end.next_pid

Which will produce the following commands (and -e will also execute them):

 echo 's:sched_stack s64 state; char comm[16]; u64 delta; unsigned long stack[];' >> /sys/kernel/tracing/dynamic_events
 echo 'hist:keys=prev_pid:__arg_18057_2=prev_state,__arg_18057_4=common_timestamp.usecs,__arg_18057_7=common_stacktrace' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
 echo 'hist:keys=next_pid:__state_18057_1=$__arg_18057_2,__comm_18057_3=next_comm,__delta_18057_5=common_timestamp.usecs-$__arg_18057_4,__stack_18057_6=$__arg_18057_7:onmatch(sched.sched_switch).trace(sched_stack,$__state_18057_1,$__comm_18057_3,$__delta_18057_5,$__stack_18057_6)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger

The above creates a synthetic event that generates a stack trace when a
task schedules out and records it with the time it scheduled back in,
that is, the time the task was off the CPU. It also records the state of
the task when it left the CPU (running, blocked, sleeping, etc.), and
saves the comm of the task as "comm" (needed for the next command).

 ~# echo 'hist:keys=state,stack.stacktrace:vals=delta:sort=state,delta if comm == "trace-cmd" && state & 3' > /sys/kernel/tracing/events/synthetic/sched_stack/trigger

The above creates a histogram with buckets per state, per stack, whose
value is the total time the task was off the CPU for that stack trace.
It filters on tasks with comm == "trace-cmd" and only the sleeping and
blocked states (1 - sleeping, 2 - blocked).

 ~# trace-cmd record -p function sleep 1
 ~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -18

 { state:          2, stack.stacktrace         __schedule+0x1545/0x3700
         schedule+0xe2/0x390
         schedule_timeout+0x175/0x200
         wait_for_completion_state+0x294/0x440
         __wait_rcu_gp+0x247/0x4f0
         synchronize_rcu_tasks_generic+0x151/0x230
         apply_subsystem_event_filter+0xa2b/0x1300
         subsystem_filter_write+0x67/0xc0
         vfs_write+0x1e2/0xeb0
         ksys_write+0xff/0x1d0
         do_syscall_64+0x7b/0x420
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
 } hitcount:        237  delta:   99756288  <<--------------- Delta is 99 seconds!

 Totals:
     Hits: 525
     Entries: 21
     Dropped: 0

This shows that this particular trace waited for 99 seconds on
synchronize_rcu_tasks() in apply_subsystem_event_filter().

In fact, there are a lot of places in the filter code that spend a long
time waiting on synchronize_rcu_tasks_trace() in order to free the
filters.

Add helper functions that use the call_rcu*() variants to free the
filters asynchronously.
This brings the timings back to normal:

 # time trace-cmd record -p function sleep 1
 [..]
 real	0m14.681s
 user	0m0.335s
 sys	0m28.616s

And the histogram also shows this:

 ~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -21

 { state:          2, stack.stacktrace         __schedule+0x1545/0x3700
         schedule+0xe2/0x390
         schedule_timeout+0x175/0x200
         wait_for_completion_state+0x294/0x440
         __wait_rcu_gp+0x247/0x4f0
         synchronize_rcu_normal+0x3db/0x5c0
         tracing_reset_online_cpus+0x8f/0x1e0
         tracing_open+0x335/0x440
         do_dentry_open+0x4c6/0x17a0
         vfs_open+0x82/0x360
         path_openat+0x1a36/0x2990
         do_filp_open+0x1c5/0x420
         do_sys_openat2+0xed/0x180
         __x64_sys_openat+0x108/0x1d0
         do_syscall_64+0x7b/0x420
 } hitcount:          2  delta:      77044

 Totals:
     Hits: 55
     Entries: 28
     Dropped: 0

Where the total waiting time of synchronize_rcu_tasks_trace() is 77
milliseconds.

Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Jan Kiszka <[email protected]>
Cc: Andreas Ziegler <[email protected]>
Cc: Felix MOESSBAUER <[email protected]>
Link: https://lore.kernel.org/[email protected]
Reported-by: "Flot, Julien" <[email protected]>
Tested-by: Julien Flot <[email protected]>
Fixes: a363d27 ("tracing: Allow system call tracepoints to handle page faults")
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Steven Rostedt (Google) <[email protected]>
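For readers who want the deferred-free pattern in isolation, here is a
minimal, illustrative sketch of what the patch does (the names my_obj,
my_obj_free, my_obj_free_tasks, and my_obj_defer_free are hypothetical,
not from the patch). The first callback fires after an RCU-tasks-trace
grace period and re-queues the same rcu_head with call_rcu(), so the
object is freed only after both grace-period flavors have elapsed, the
same guarantee tracepoint_synchronize_unregister() provides, but without
blocking the caller:

 #include <linux/slab.h>
 #include <linux/rcupdate.h>
 #include <linux/rcupdate_trace.h>

 struct my_obj {
 	struct rcu_head	rcu;		/* reused for both call_rcu stages */
 	int		payload;	/* hypothetical data */
 };

 /* Stage 2: runs after a normal RCU grace period; now safe to free. */
 static void my_obj_free(struct rcu_head *rhp)
 {
 	kfree(container_of(rhp, struct my_obj, rcu));
 }

 /* Stage 1: runs after an RCU-tasks-trace grace period and queues
  * the same rcu_head for the normal RCU stage. */
 static void my_obj_free_tasks(struct rcu_head *rhp)
 {
 	call_rcu(rhp, my_obj_free);
 }

 /* Asynchronous stand-in for tracepoint_synchronize_unregister()
  * followed by kfree(): both grace periods elapse before the memory
  * is reclaimed, but the updater never sleeps waiting for them. */
 static void my_obj_defer_free(struct my_obj *obj)
 {
 	call_rcu_tasks_trace(&obj->rcu, my_obj_free_tasks);
 }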
1 parent 70087d2

1 file changed: +138 −48

kernel/trace/trace_events_filter.c

Lines changed: 138 additions & 48 deletions
@@ -1335,22 +1335,139 @@ static void filter_free_subsystem_preds(struct trace_subsystem_dir *dir,
 	}
 }
 
+struct filter_list {
+	struct list_head	list;
+	struct event_filter	*filter;
+};
+
+struct filter_head {
+	struct list_head	list;
+	struct rcu_head		rcu;
+};
+
+
+static void free_filter_list(struct rcu_head *rhp)
+{
+	struct filter_head *filter_list = container_of(rhp, struct filter_head, rcu);
+	struct filter_list *filter_item, *tmp;
+
+	list_for_each_entry_safe(filter_item, tmp, &filter_list->list, list) {
+		__free_filter(filter_item->filter);
+		list_del(&filter_item->list);
+		kfree(filter_item);
+	}
+	kfree(filter_list);
+}
+
+static void free_filter_list_tasks(struct rcu_head *rhp)
+{
+	call_rcu(rhp, free_filter_list);
+}
+
+/*
+ * The tracepoint_synchronize_unregister() is a double rcu call.
+ * It calls synchronize_rcu_tasks_trace() followed by synchronize_rcu().
+ * Instead of waiting for it, simply call these via the call_rcu*()
+ * variants.
+ */
+static void delay_free_filter(struct filter_head *head)
+{
+	call_rcu_tasks_trace(&head->rcu, free_filter_list_tasks);
+}
+
+static void try_delay_free_filter(struct event_filter *filter)
+{
+	struct filter_head *head;
+	struct filter_list *item;
+
+	head = kmalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		goto free_now;
+
+	INIT_LIST_HEAD(&head->list);
+
+	item = kmalloc(sizeof(*item), GFP_KERNEL);
+	if (!item) {
+		kfree(head);
+		goto free_now;
+	}
+
+	item->filter = filter;
+	list_add_tail(&item->list, &head->list);
+	delay_free_filter(head);
+	return;
+
+ free_now:
+	/* Make sure the filter is not being used */
+	tracepoint_synchronize_unregister();
+	__free_filter(filter);
+}
+
 static inline void __free_subsystem_filter(struct trace_event_file *file)
 {
 	__free_filter(file->filter);
 	file->filter = NULL;
 }
 
+static inline void event_set_filter(struct trace_event_file *file,
+				    struct event_filter *filter)
+{
+	rcu_assign_pointer(file->filter, filter);
+}
+
+static inline void event_clear_filter(struct trace_event_file *file)
+{
+	RCU_INIT_POINTER(file->filter, NULL);
+}
+
 static void filter_free_subsystem_filters(struct trace_subsystem_dir *dir,
-					  struct trace_array *tr)
+					  struct trace_array *tr,
+					  struct event_filter *filter)
 {
 	struct trace_event_file *file;
+	struct filter_head *head;
+	struct filter_list *item;
+
+	head = kmalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		goto free_now;
+
+	INIT_LIST_HEAD(&head->list);
+
+	item = kmalloc(sizeof(*item), GFP_KERNEL);
+	if (!item) {
+		kfree(head);
+		goto free_now;
+	}
+
+	item->filter = filter;
+	list_add_tail(&item->list, &head->list);
 
 	list_for_each_entry(file, &tr->events, list) {
 		if (file->system != dir)
 			continue;
-		__free_subsystem_filter(file);
+		item = kmalloc(sizeof(*item), GFP_KERNEL);
+		if (!item)
+			goto free_now;
+		item->filter = event_filter(file);
+		list_add_tail(&item->list, &head->list);
+		event_clear_filter(file);
 	}
+
+	delay_free_filter(head);
+	return;
+ free_now:
+	tracepoint_synchronize_unregister();
+
+	if (head)
+		free_filter_list(&head->rcu);
+
+	list_for_each_entry(file, &tr->events, list) {
+		if (file->system != dir || !file->filter)
+			continue;
+		__free_filter(file->filter);
+	}
+	__free_filter(filter);
 }
 
 int filter_assign_type(const char *type)
@@ -2120,22 +2237,6 @@ static inline void event_set_filtered_flag(struct trace_event_file *file)
 		trace_buffered_event_enable();
 }
 
-static inline void event_set_filter(struct trace_event_file *file,
-				    struct event_filter *filter)
-{
-	rcu_assign_pointer(file->filter, filter);
-}
-
-static inline void event_clear_filter(struct trace_event_file *file)
-{
-	RCU_INIT_POINTER(file->filter, NULL);
-}
-
-struct filter_list {
-	struct list_head	list;
-	struct event_filter	*filter;
-};
-
 static int process_system_preds(struct trace_subsystem_dir *dir,
 				struct trace_array *tr,
 				struct filter_parse_error *pe,
@@ -2144,11 +2245,16 @@ static int process_system_preds(struct trace_subsystem_dir *dir,
 	struct trace_event_file *file;
 	struct filter_list *filter_item;
 	struct event_filter *filter = NULL;
-	struct filter_list *tmp;
-	LIST_HEAD(filter_list);
+	struct filter_head *filter_list;
 	bool fail = true;
 	int err;
 
+	filter_list = kmalloc(sizeof(*filter_list), GFP_KERNEL);
+	if (!filter_list)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&filter_list->list);
+
 	list_for_each_entry(file, &tr->events, list) {
 
 		if (file->system != dir)
@@ -2175,7 +2281,7 @@ static int process_system_preds(struct trace_subsystem_dir *dir,
 		if (!filter_item)
 			goto fail_mem;
 
-		list_add_tail(&filter_item->list, &filter_list);
+		list_add_tail(&filter_item->list, &filter_list->list);
 		/*
 		 * Regardless of if this returned an error, we still
 		 * replace the filter for the call.
@@ -2195,31 +2301,22 @@ static int process_system_preds(struct trace_subsystem_dir *dir,
 	 * Do a synchronize_rcu() and to ensure all calls are
 	 * done with them before we free them.
 	 */
-	tracepoint_synchronize_unregister();
-	list_for_each_entry_safe(filter_item, tmp, &filter_list, list) {
-		__free_filter(filter_item->filter);
-		list_del(&filter_item->list);
-		kfree(filter_item);
-	}
+	delay_free_filter(filter_list);
 	return 0;
  fail:
 	/* No call succeeded */
-	list_for_each_entry_safe(filter_item, tmp, &filter_list, list) {
-		list_del(&filter_item->list);
-		kfree(filter_item);
-	}
+	free_filter_list(&filter_list->rcu);
 	parse_error(pe, FILT_ERR_BAD_SUBSYS_FILTER, 0);
 	return -EINVAL;
  fail_mem:
 	__free_filter(filter);
+
 	/* If any call succeeded, we still need to sync */
 	if (!fail)
-		tracepoint_synchronize_unregister();
-	list_for_each_entry_safe(filter_item, tmp, &filter_list, list) {
-		__free_filter(filter_item->filter);
-		list_del(&filter_item->list);
-		kfree(filter_item);
-	}
+		delay_free_filter(filter_list);
+	else
+		free_filter_list(&filter_list->rcu);
+
 	return -ENOMEM;
 }
 
@@ -2361,9 +2458,7 @@ int apply_event_filter(struct trace_event_file *file, char *filter_string)
 
 		event_clear_filter(file);
 
-		/* Make sure the filter is not being used */
-		tracepoint_synchronize_unregister();
-		__free_filter(filter);
+		try_delay_free_filter(filter);
 
 		return 0;
 	}
@@ -2387,11 +2482,8 @@ int apply_event_filter(struct trace_event_file *file, char *filter_string)
 
 		event_set_filter(file, filter);
 
-		if (tmp) {
-			/* Make sure the call is done with the filter */
-			tracepoint_synchronize_unregister();
-			__free_filter(tmp);
-		}
+		if (tmp)
+			try_delay_free_filter(tmp);
 	}
 
 	return err;
@@ -2417,9 +2509,7 @@ int apply_subsystem_event_filter(struct trace_subsystem_dir *dir,
 		filter = system->filter;
 		system->filter = NULL;
 		/* Ensure all filters are no longer used */
-		tracepoint_synchronize_unregister();
-		filter_free_subsystem_filters(dir, tr);
-		__free_filter(filter);
+		filter_free_subsystem_filters(dir, tr, filter);
 		return 0;
 	}
 
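As a footnote on why two grace periods are needed at all: a regular
tracepoint dereferences file->filter inside a normal RCU read-side
critical section, while a faultable (syscall) tracepoint does so under
rcu_read_lock_trace(). The hypothetical reader below (my_obj and payload
carry over from the sketch after the commit message; none of this is code
from trace_events_filter.c) illustrates the trace-flavor read side that
the call_rcu_tasks_trace() stage waits out:

 /* Illustrative trace-RCU reader; not code from this file. */
 static int my_obj_read(struct my_obj __rcu **slot)
 {
 	struct my_obj *obj;
 	int val = 0;

 	rcu_read_lock_trace();	/* flavor used by faultable tracepoints */
 	obj = rcu_dereference_check(*slot, rcu_read_lock_trace_held());
 	if (obj)
 		val = obj->payload;
 	rcu_read_unlock_trace();	/* after this, a tasks-trace GP can end */
 	return val;
 }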