
Commit 7308748

davemarchevsky authored and Alexei Starovoitov committed
selftests/bpf: Add benchmark for local_storage get
Add benchmarks to demonstrate the performance cliff for local_storage get as the number of local_storage maps increases beyond the current local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of which do many bpf_task_storage_get calls on sets of task local_storage maps of various counts, while considering a single specific map to be 'important' and counting task_storage_gets to the important map separately in addition to the normal 'hits' count of all gets. The goal here is to mimic a scenario where a particular program using one map - the important one - is running on a system where many other local_storage maps exist and are accessed often.

While the "sequential get" benchmark does bpf_task_storage_get for maps 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark interleaves 4 bpf_task_storage_gets for the important map for every 10 map gets. This is meant to highlight performance differences when the important map is accessed far more frequently than the non-important maps.

A "hashmap control" benchmark is also included for easy comparison of standard bpf hashmap lookup vs local_storage get. The benchmark is similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH instead of local storage. Only one inner map is created - a hashmap meant to hold a tid -> data mapping for all tasks. The size of the hashmap is hardcoded to my system's PID_MAX_LIMIT (4,194,304); the number of these keys which are actually fetched as part of the benchmark is configurable.

Addition of this benchmark is inspired by conversation with Alexei in a previous patchset's thread [0], which highlighted the need for such a benchmark to motivate and validate improvements to the local_storage implementation. My approach in that series focused on improving performance for explicitly-marked 'important' maps and was rejected with feedback to make more generally-applicable improvements while avoiding explicitly marking maps as important. Thus the benchmark reports both general and important-map-focused metrics, so the effect of future work on both is clear.
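For illustration, here is a minimal sketch of the access patterns described above. It is not the actual local_storage_bench.c added by this patch; the map layout, identifier names (maps_array, num_maps, important_map_idx, interleave, get_local), and the hooked tracepoint are assumptions made purely for illustration.

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only -- not the selftest's local_storage_bench.c.
 * All identifiers below are assumed names.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_MAPS 1000

/* Template for one task local_storage map; userspace would create
 * num_maps copies and place them in the outer array-of-maps. */
struct local_storage_map {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, long);
} important_map SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
        __uint(max_entries, MAX_MAPS);
        __type(key, int);
        __type(value, int);
        __array(values, struct local_storage_map);
} maps_array SEC(".maps");

const volatile unsigned int num_maps = 10;          /* set before load */
const volatile unsigned int important_map_idx = 0;  /* the 'important' map */
const volatile unsigned int interleave = 0;         /* 0: sequential, 1: interleaved */

long hits, important_hits;

static void task_storage_get(unsigned int idx, struct task_struct *task)
{
        void *inner_map, *data;

        inner_map = bpf_map_lookup_elem(&maps_array, &idx);
        if (!inner_map)
                return;
        data = bpf_task_storage_get(inner_map, task, NULL,
                                    BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (!data)
                return;
        __sync_fetch_and_add(&hits, 1);
        if (idx == important_map_idx)
                __sync_fetch_and_add(&important_hits, 1);
}

SEC("tp/syscalls/sys_enter_getpgid")
int get_local(void *ctx)
{
        struct task_struct *task = bpf_get_current_task_btf();
        unsigned int i;

        for (i = 0; i < num_maps && i < MAX_MAPS; i++) {
                /* sequential: walk maps 0..num_maps-1 in creation order */
                task_storage_get(i, task);
                /* interleaved: 4 extra gets of the important map per 10 gets */
                if (interleave && (i % 10) < 4)
                        task_storage_get(important_map_idx, task);
        }
        return 0;
}

char LICENSE[] SEC("license") = "GPL";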
Regarding the benchmark results, on a powerful system (Skylake, 20 cores, 256gb ram):

Hashmap Control
===============
num keys: 10
  hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
num keys: 1000
  hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
num keys: 10000
  hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
num keys: 100000
  hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
num keys: 4194304
  hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s

Local Storage
=============
num_maps: 1
  local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
  local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s
num_maps: 10
  local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
  local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s
num_maps: 16
  local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
  local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s
num_maps: 17
  local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
  local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s
num_maps: 24
  local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
  local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s
num_maps: 32
  local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
  local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s
num_maps: 100
  local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
  local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s
num_maps: 1000
  local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
  local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s

Looking at the "sequential get" results, it's clear that as the number of task local_storage maps grows beyond the current cache size (16), there's a significant reduction in hits throughput. Note that the current local_storage implementation assigns a cache_idx to maps as they are created. Since "sequential get" creates maps 0..n in order and then does bpf_task_storage_get calls in the same order, the benchmark effectively ensures that a map will not be in cache when the program tries to access it.

For the "interleaved get" results, important-map hits throughput is greatly increased because the important map is more likely to be in cache by virtue of being accessed far more frequently. Throughput still drops as the number of maps increases, though.

To get a sense of the overhead of the benchmark program, I commented out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c and ran the benchmark on the same host as the 'real' run. Results:

Hashmap Control
===============
num keys: 10
  hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s
num keys: 1000
  hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s
num keys: 10000
  hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s
num keys: 100000
  hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s
num keys: 4194304
  hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s

Local Storage
=============
num_maps: 1
  local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s
  local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s
num_maps: 10
  local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s
  local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s
num_maps: 16
  local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s
  local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s
num_maps: 17
  local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s
  local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s
num_maps: 24
  local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s
  local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s
num_maps: 32
  local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s
  local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s
num_maps: 100
  local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s
  local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s
num_maps: 1000
  local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s
  local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s

Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:

hashmap_control_1k:   ~53.8ns
hashmap_control_10k:  ~124.2ns
hashmap_control_100k: ~206.5ns
sequential_get_1:     ~12.1ns
sequential_get_10:    ~16.0ns
sequential_get_16:    ~13.8ns
sequential_get_17:    ~16.8ns
sequential_get_24:    ~40.9ns
sequential_get_32:    ~65.2ns
sequential_get_100:   ~148.2ns
sequential_get_1000:  ~2204ns

(Each adjusted figure is the 'real' run latency minus the corresponding overhead-run latency; e.g. sequential_get_16 is 22.464 ns - 8.647 ns ≈ 13.8 ns.) These clearly demonstrate a cliff.

In the discussion for v1 of this patch, Alexei noted that local_storage was 2.5x faster than a large hashmap when initially implemented [1]. These benchmark results show local_storage to be 5-10x faster: a long-running BPF application putting some pid-specific info into a hashmap for each pid it sees will probably see on the order of 10-100k pids, and bench numbers for hashmaps of this size are ~10x slower than sequential_get_16. But as the number of local_storage maps grows far past the local_storage cache size, the performance advantage shrinks and eventually reverses.

When running the benchmarks it may be necessary to bump the 'open files' ulimit for a successful run.

[0]: https://lore.kernel.org/all/[email protected]
[1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/

Signed-off-by: Dave Marchevsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
1 parent: 7722517


7 files changed: +494 -1 lines changed


tools/testing/selftests/bpf/Makefile

Lines changed: 3 additions & 1 deletion
@@ -571,6 +571,7 @@ $(OUTPUT)/bench_bloom_filter_map.o: $(OUTPUT)/bloom_filter_bench.skel.h
 $(OUTPUT)/bench_bpf_loop.o: $(OUTPUT)/bpf_loop_bench.skel.h
 $(OUTPUT)/bench_strncmp.o: $(OUTPUT)/strncmp_bench.skel.h
 $(OUTPUT)/bench_bpf_hashmap_full_update.o: $(OUTPUT)/bpf_hashmap_full_update_bench.skel.h
+$(OUTPUT)/bench_local_storage.o: $(OUTPUT)/local_storage_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -583,7 +584,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 	 $(OUTPUT)/bench_bloom_filter_map.o \
 	 $(OUTPUT)/bench_bpf_loop.o \
 	 $(OUTPUT)/bench_strncmp.o \
-	 $(OUTPUT)/bench_bpf_hashmap_full_update.o
+	 $(OUTPUT)/bench_bpf_hashmap_full_update.o \
+	 $(OUTPUT)/bench_local_storage.o
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@

tools/testing/selftests/bpf/bench.c

Lines changed: 55 additions & 0 deletions
@@ -150,6 +150,53 @@ void ops_report_final(struct bench_res res[], int res_cnt)
 	printf("latency %8.3lf ns/op\n", 1000.0 / hits_mean * env.producer_cnt);
 }
 
+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns)
+{
+	double important_hits_per_sec, hits_per_sec;
+	double delta_sec = delta_ns / 1000000000.0;
+
+	hits_per_sec = res->hits / 1000000.0 / delta_sec;
+	important_hits_per_sec = res->important_hits / 1000000.0 / delta_sec;
+
+	printf("Iter %3d (%7.3lfus): ", iter, (delta_ns - 1000000000) / 1000.0);
+
+	printf("hits %8.3lfM/s ", hits_per_sec);
+	printf("important_hits %8.3lfM/s\n", important_hits_per_sec);
+}
+
+void local_storage_report_final(struct bench_res res[], int res_cnt)
+{
+	double important_hits_mean = 0.0, important_hits_stddev = 0.0;
+	double hits_mean = 0.0, hits_stddev = 0.0;
+	int i;
+
+	for (i = 0; i < res_cnt; i++) {
+		hits_mean += res[i].hits / 1000000.0 / (0.0 + res_cnt);
+		important_hits_mean += res[i].important_hits / 1000000.0 / (0.0 + res_cnt);
+	}
+
+	if (res_cnt > 1) {
+		for (i = 0; i < res_cnt; i++) {
+			hits_stddev += (hits_mean - res[i].hits / 1000000.0) *
+				       (hits_mean - res[i].hits / 1000000.0) /
+				       (res_cnt - 1.0);
+			important_hits_stddev +=
+				(important_hits_mean - res[i].important_hits / 1000000.0) *
+				(important_hits_mean - res[i].important_hits / 1000000.0) /
+				(res_cnt - 1.0);
+		}
+
+		hits_stddev = sqrt(hits_stddev);
+		important_hits_stddev = sqrt(important_hits_stddev);
+	}
+	printf("Summary: hits throughput %8.3lf \u00B1 %5.3lf M ops/s, ",
+	       hits_mean, hits_stddev);
+	printf("hits latency %8.3lf ns/op, ", 1000.0 / hits_mean);
+	printf("important_hits throughput %8.3lf \u00B1 %5.3lf M ops/s\n",
+	       important_hits_mean, important_hits_stddev);
+}
+
 const char *argp_program_version = "benchmark";
 const char *argp_program_bug_address = "<[email protected]>";
 const char argp_program_doc[] =
@@ -188,12 +235,14 @@ static const struct argp_option opts[] = {
 extern struct argp bench_ringbufs_argp;
 extern struct argp bench_bloom_map_argp;
 extern struct argp bench_bpf_loop_argp;
+extern struct argp bench_local_storage_argp;
 extern struct argp bench_strncmp_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
 	{ &bench_bloom_map_argp, 0, "Bloom filter map benchmark", 0 },
 	{ &bench_bpf_loop_argp, 0, "bpf_loop helper benchmark", 0 },
+	{ &bench_local_storage_argp, 0, "local_storage benchmark", 0 },
 	{ &bench_strncmp_argp, 0, "bpf_strncmp helper benchmark", 0 },
 	{},
 };
@@ -397,6 +446,9 @@ extern const struct bench bench_bpf_loop;
 extern const struct bench bench_strncmp_no_helper;
 extern const struct bench bench_strncmp_helper;
 extern const struct bench bench_bpf_hashmap_full_update;
+extern const struct bench bench_local_storage_cache_seq_get;
+extern const struct bench bench_local_storage_cache_interleaved_get;
+extern const struct bench bench_local_storage_cache_hashmap_control;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -432,6 +484,9 @@ static const struct bench *benchs[] = {
 	&bench_strncmp_no_helper,
 	&bench_strncmp_helper,
 	&bench_bpf_hashmap_full_update,
+	&bench_local_storage_cache_seq_get,
+	&bench_local_storage_cache_interleaved_get,
+	&bench_local_storage_cache_hashmap_control,
 };
 
 static void setup_benchmark()
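For reference, the summary that local_storage_report_final above prints is the mean and Bessel-corrected sample standard deviation of the per-iteration hit counts (in millions per second), with latency derived from the mean. With x_i the i-th iteration's hits divided by 10^6 and n = res_cnt, the code computes:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(\bar{x} - x_i\right)^2}, \qquad
\text{latency} \approx \frac{1000}{\bar{x}} \ \text{ns/op}

The same mean and standard deviation are computed for important_hits.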

tools/testing/selftests/bpf/bench.h

Lines changed: 4 additions & 0 deletions
@@ -34,6 +34,7 @@ struct bench_res {
 	long hits;
 	long drops;
 	long false_hits;
+	long important_hits;
 };
 
 struct bench {
@@ -61,6 +62,9 @@ void false_hits_report_progress(int iter, struct bench_res *res, long delta_ns);
 void false_hits_report_final(struct bench_res res[], int res_cnt);
 void ops_report_progress(int iter, struct bench_res *res, long delta_ns);
 void ops_report_final(struct bench_res res[], int res_cnt);
+void local_storage_report_progress(int iter, struct bench_res *res,
+				   long delta_ns);
+void local_storage_report_final(struct bench_res res[], int res_cnt);
 
 static inline __u64 get_time_ns(void)
 {
