Add MT benchmarks #204

Merged: 3 commits merged into oneapi-src:main from the mt_benchmark branch on Feb 14, 2024

Conversation

@igchor (Member) commented Feb 3, 2024

It's not possible to use ubench for MT benchmarks because we only want to measure the workload, not thread creation time, etc. This means we have to write our own measurement code and use it within each worker thread. We also have to make sure that all threads start running at the same time (hence the use of syncthreads()).
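
To make the description above concrete, here is a minimal sketch of the measurement pattern (not the actual benchmark/multithread.hpp code from this PR; the run_in_parallel name is hypothetical and std::barrier stands in for the syncthreads() helper): every worker waits on a shared barrier and then times only its own workload, so thread creation and join costs stay outside the measurement.

    #include <barrier>
    #include <chrono>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Run `workload` on n_threads threads and return the per-thread workload
    // durations. The barrier ensures all threads start the timed section together.
    template <typename Workload>
    std::vector<std::chrono::milliseconds> run_in_parallel(std::size_t n_threads,
                                                           Workload &&workload) {
        std::vector<std::chrono::milliseconds> results(n_threads);
        std::barrier sync(static_cast<std::ptrdiff_t>(n_threads));
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < n_threads; i++) {
            threads.emplace_back([&, i] {
                sync.arrive_and_wait(); // "syncthreads": start all workloads at once
                auto start = std::chrono::steady_clock::now();
                workload(i); // the only part we want to measure
                auto stop = std::chrono::steady_clock::now();
                results[i] = std::chrono::duration_cast<std::chrono::milliseconds>(
                    stop - start);
            });
        }
        for (auto &t : threads) {
            t.join(); // thread creation/join time never enters results[]
        }
        return results;
    }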

@igchor requested a review from a team as a code owner on February 3, 2024 01:27
@PatKamin (Contributor) commented Feb 5, 2024

Let's wait for PR #167, which fixes the TSAN CI builds, before merging this one.

@igchor (Member, Author) commented Feb 5, 2024

> Let's wait for PR #167, which fixes the TSAN CI builds, before merging this one.

Rebased on current main.

@igchor force-pushed the mt_benchmark branch 2 times, most recently from 8715b28 to 0bba9e3 on February 5, 2024 18:36
@igchor force-pushed the mt_benchmark branch 2 times, most recently from c13fede to a2aaa58 on February 6, 2024 19:39
@ldorau (Contributor) left a comment

The results of the benchmark are sometimes very strange. It seems that the measurement methodology is wrong, since the standard deviation can even exceed the mean, which means the accuracy of the measurements is far too low:

scalable_pool mt_alloc_free: mean: 16.13 [ms] std_dev: 21.2146 [ms] 

https://github.com/oneapi-src/unified-memory-framework/actions/runs/7805176815/job/21288925509?pr=204

@ldorau (Contributor) left a comment

When I run this benchmark on Windows WSL or on a bare-metal server, it hangs after printing the following (I waited 55 minutes and it did not finish):

scalable_pool mt_alloc_free: mean: 24.34 [ms] std_dev: 18.1215 [ms] (total alloc failures: 0 out of 5000000)
jemalloc_pool mt_alloc_free: mean: 1112.15 [ms] std_dev: 116.737 [ms] (total alloc failures: 0 out of 5000000)

probably here:

Thread 1 "multithread_ben" received signal SIGINT, Interrupt.
__futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=11177, futex_word=0x7fffe79ff910) at ./nptl/futex-internal.c:57
57      in ./nptl/futex-internal.c
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=11177, futex_word=0x7fffe79ff910) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=11177, futex_word=0x7fffe79ff910) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fffe79ff910, expected=11177, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139
#3  0x00007ffff77dd624 in __pthread_clockjoin_ex (threadid=140737079408192, thread_return=0x0, clockid=0, abstime=0x0, block=<optimized out>) at ./nptl/pthread_join_common.c:105
#4  0x00007ffff7b352c7 in std::thread::join() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x0000555555559b1b in umf_bench::parallel_exec<umf_bench::measure<std::chrono::duration<long int, std::ratio<1, 1000> >, mt_alloc_free(poolCreateExtParams)::<lambda(auto:2)> >(size_t, size_t, mt_alloc_free(poolCreateExtParams)::<lambda(auto:2)>&&)::<lambda(size_t)> >(size_t, struct {...} &&) (threads_number=20, f=...) at /home/ldorau/work/unified-memory-framework/benchmark/multithread.hpp:32
#6  0x000055555555935f in umf_bench::measure<std::chrono::duration<long int, std::ratio<1, 1000> >, mt_alloc_free(poolCreateExtParams)::<lambda(auto:2)> >(size_t, size_t, struct {...} &&) (iterations=5, concurrency=20, run_workload=...)
    at /home/ldorau/work/unified-memory-framework/benchmark/multithread.hpp:110
#7  0x00005555555595e7 in mt_alloc_free (params=std::tuple containing = {...}) at /home/ldorau/work/unified-memory-framework/benchmark/multithread.cpp:76
#8  0x0000555555559988 in main () at /home/ldorau/work/unified-memory-framework/benchmark/multithread.cpp:114

@igchor (Member, Author) commented Feb 7, 2024

> The results of the benchmark are sometimes very strange. It seems that the measurement methodology is wrong, since the standard deviation can even exceed the mean, which means the accuracy of the measurements is far too low:
>
> scalable_pool mt_alloc_free: mean: 16.13 [ms] std_dev: 21.2146 [ms]
>
> https://github.com/oneapi-src/unified-memory-framework/actions/runs/7805176815/job/21288925509?pr=204

Can you try it again now? scalable_pool seems to need some time to warm up and the first iteration was taking longer. I've made a change to skip the first iteration in the calculations.
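
For clarity, here is a minimal sketch of what "skip the first iteration" means when aggregating the per-iteration times (not the actual change in this PR; the helper name and return type are made up): the warmup run is dropped before the mean and standard deviation are computed.

    #include <chrono>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Summary {
        double mean_ms;
        double std_dev_ms;
    };

    // Compute mean and (population) standard deviation over all iterations
    // except iteration 0, which includes the pool's warmup cost.
    inline Summary summarize_skipping_warmup(
        const std::vector<std::chrono::milliseconds> &iterations) {
        std::vector<double> samples;
        for (std::size_t i = 1; i < iterations.size(); i++) {
            samples.push_back(static_cast<double>(iterations[i].count()));
        }
        if (samples.empty()) {
            return {0.0, 0.0};
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        const double mean = sum / static_cast<double>(samples.size());
        double variance = 0.0;
        for (double s : samples) {
            variance += (s - mean) * (s - mean);
        }
        variance /= static_cast<double>(samples.size());
        return {mean, std::sqrt(variance)};
    }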

@igchor (Member, Author) commented Feb 7, 2024

> When I run this benchmark on Windows WSL or on a bare-metal server, it hangs after printing the following (I waited 55 minutes and it did not finish):
>
> scalable_pool mt_alloc_free: mean: 24.34 [ms] std_dev: 18.1215 [ms] (total alloc failures: 0 out of 5000000)
> jemalloc_pool mt_alloc_free: mean: 1112.15 [ms] std_dev: 116.737 [ms] (total alloc failures: 0 out of 5000000)
>
> probably here: [gdb backtrace quoted in full above]

This only happens in debug mode because of the debug checks in base_alloc. We should either not run the benchmarks in debug builds or optimize those checks.

@ldorau (Contributor) commented Feb 8, 2024

> > The results of the benchmark are sometimes very strange. It seems that the measurement methodology is wrong, since the standard deviation can even exceed the mean, which means the accuracy of the measurements is far too low:
> >
> > scalable_pool mt_alloc_free: mean: 16.13 [ms] std_dev: 21.2146 [ms]
> >
> > https://github.com/oneapi-src/unified-memory-framework/actions/runs/7805176815/job/21288925509?pr=204
>
> Can you try it again now? scalable_pool seems to need some time to warm up and the first iteration was taking longer. I've made a change to skip the first iteration in the calculations.

See the last CI build: https://github.com/oneapi-src/unified-memory-framework/actions/runs/7817812924/job/21327245255?pr=204

scalable_pool mt_alloc_free: mean: 13.3125 [ms] std_dev: 10.5836 [ms]

Still, 13.3 ms +/- 10.5 ms means the result can be anywhere from 2.8 to 23.8 ms - I'm afraid this is not a good result for a benchmark...

ubench enforces a maximum allowed confidence interval of +/- 2.5% (it fails if the confidence interval exceeds 2.5%):

[       OK ] simple.jemalloc_pool_with_os_memory_provider (mean 199.766us, confidence interval +- 2.458446%)
[       OK ] simple.scalable_pool_with_os_memory_provider (mean 94.782us, confidence interval +- 1.989851%)

See the output of the last CI build of ubench: "confidence interval 5.807915% exceeds maximum permitted 2.500000%"
https://github.com/oneapi-src/unified-memory-framework/actions/runs/7817812924/job/21327245255?pr=204

[ RUN      ] simple.scalable_pool_with_os_memory_provider
confidence interval 5.807915% exceeds maximum permitted 2.500000%
[  FAILED  ] simple.scalable_pool_with_os_memory_provider (mean 53.946us, confidence interval +- 5.807915%)

but here we have 10.5836/13.3125 = 79.5%, so almost 80%...

@ldorau (Contributor) left a comment

A confidence interval of about 80% is not good enough, but we can merge this as is and fix it later.
Please add a TODO in the code to fix it later.

@bratpiorka (Contributor) commented

> A confidence interval of about 80% is not good enough, but we can merge this as is and fix it later. Please add a TODO in the code to fix it later.

It would be better to add a GitHub issue.

@igchor (Member, Author) commented Feb 8, 2024

> > A confidence interval of about 80% is not good enough, but we can merge this as is and fix it later. Please add a TODO in the code to fix it later.
>
> It would be better to add a GitHub issue.

I've increased the number of iterations for the scalable pool and now it looks like this:

scalable_pool mt_alloc_free: mean: 437.1 [ms] std_dev: 37.5834 [ms] (total alloc failures: 0 out of 100000000)

I think this is acceptable for a multithreaded benchmark. Even under ubench, scalable_pool reports a higher-than-expected spread (above the 2.5% confidence-interval limit).
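
For reference, a quick back-of-the-envelope calculation on the numbers quoted in this thread (not a figure reported by the benchmark itself) shows how much the relative standard deviation improves:

    37.5834 / 437.1   ≈ 8.6%   (new result, after increasing the iteration count)
    10.5836 / 13.3125 ≈ 79.5%  (earlier result discussed above)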

Commit messages from this PR:
Helper functions taken from pmemstream repo.
so that we do not measure warmup time
@ldorau merged commit 73706ea into oneapi-src:main on Feb 14, 2024
@igchor deleted the mt_benchmark branch on February 14, 2024 15:31