Add MT benchmarks #204

Merged: 3 commits merged into oneapi-src:main from the mt_benchmark branch on Feb 14, 2024

Conversation

@igchor (Member) commented Feb 3, 2024

It's not possible to use ubench for MT benchmarks because we only want to measure the workload, not thread creation time, etc. This means we have to write our own measurement code and use it within each worker thread. We also have to make sure that all threads start running at the same time (hence the use of syncthreads()).
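
To make the description above concrete, here is a minimal sketch of the measurement pattern (not the actual benchmark/multithread.hpp code from this PR; the run_in_parallel name is hypothetical and std::barrier stands in for the syncthreads() helper): every worker waits on a shared barrier and then times only its own workload, so thread creation and join costs stay outside the measurement.

    #include <barrier>
    #include <chrono>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Run `workload` on n_threads threads and return the per-thread workload
    // durations. The barrier ensures all threads start the timed section together.
    template <typename Workload>
    std::vector<std::chrono::milliseconds> run_in_parallel(std::size_t n_threads,
                                                           Workload &&workload) {
        std::vector<std::chrono::milliseconds> results(n_threads);
        std::barrier sync(static_cast<std::ptrdiff_t>(n_threads));
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < n_threads; i++) {
            threads.emplace_back([&, i] {
                sync.arrive_and_wait(); // "syncthreads": start all workloads at once
                auto start = std::chrono::steady_clock::now();
                workload(i); // the only part we want to measure
                auto stop = std::chrono::steady_clock::now();
                results[i] = std::chrono::duration_cast<std::chrono::milliseconds>(
                    stop - start);
            });
        }
        for (auto &t : threads) {
            t.join(); // thread creation/join time never enters results[]
        }
        return results;
    }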

@igchor requested a review from a team as a code owner on February 3, 2024 01:27
@PatKamin (Contributor) commented Feb 5, 2024

Let's wait for PR #167, which fixes the TSAN CI builds, before merging this one.

@igchor (Member, Author) commented Feb 5, 2024

> Let's wait for PR #167, which fixes the TSAN CI builds, before merging this one.

Rebased on current main.

@igchor force-pushed the mt_benchmark branch 2 times, most recently from 8715b28 to 0bba9e3 on February 5, 2024 18:36
@igchor force-pushed the mt_benchmark branch 2 times, most recently from c13fede to a2aaa58 on February 6, 2024 19:39
@ldorau (Contributor) left a comment

The results of the benchmark are sometimes very strange. It seems that the measurement methodology is wrong, since the standard deviation can even exceed the mean, which means the accuracy of the measurements is far too low:

scalable_pool mt_alloc_free: mean: 16.13 [ms] std_dev: 21.2146 [ms] 

https://github.com/oneapi-src/unified-memory-framework/actions/runs/7805176815/job/21288925509?pr=204

@ldorau (Contributor) left a comment

When I run this benchmark on Windows WSL or on a bare-metal server, it hangs after printing the following (I waited 55 minutes and it did not finish):

scalable_pool mt_alloc_free: mean: 24.34 [ms] std_dev: 18.1215 [ms] (total alloc failures: 0 out of 5000000)
jemalloc_pool mt_alloc_free: mean: 1112.15 [ms] std_dev: 116.737 [ms] (total alloc failures: 0 out of 5000000)

probably here:

Thread 1 "multithread_ben" received signal SIGINT, Interrupt.
__futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=11177, futex_word=0x7fffe79ff910) at ./nptl/futex-internal.c:57
57      in ./nptl/futex-internal.c
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=11177, futex_word=0x7fffe79ff910) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=11177, futex_word=0x7fffe79ff910) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fffe79ff910, expected=11177, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139
#3  0x00007ffff77dd624 in __pthread_clockjoin_ex (threadid=140737079408192, thread_return=0x0, clockid=0, abstime=0x0, block=<optimized out>) at ./nptl/pthread_join_common.c:105
#4  0x00007ffff7b352c7 in std::thread::join() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x0000555555559b1b in umf_bench::parallel_exec<umf_bench::measure<std::chrono::duration<long int, std::ratio<1, 1000> >, mt_alloc_free(poolCreateExtParams)::<lambda(auto:2)> >(size_t, size_t, mt_alloc_free(poolCreateExtParams)::<lambda(auto:2)>&&)::<lambda(size_t)> >(size_t, struct {...} &&) (threads_number=20, f=...) at /home/ldorau/work/unified-memory-framework/benchmark/multithread.hpp:32
#6  0x000055555555935f in umf_bench::measure<std::chrono::duration<long int, std::ratio<1, 1000> >, mt_alloc_free(poolCreateExtParams)::<lambda(auto:2)> >(size_t, size_t, struct {...} &&) (iterations=5, concurrency=20, run_workload=...)
    at /home/ldorau/work/unified-memory-framework/benchmark/multithread.hpp:110
#7  0x00005555555595e7 in mt_alloc_free (params=std::tuple containing = {...}) at /home/ldorau/work/unified-memory-framework/benchmark/multithread.cpp:76
#8  0x0000555555559988 in main () at /home/ldorau/work/unified-memory-framework/benchmark/multithread.cpp:114

@igchor (Member, Author) commented Feb 7, 2024

> The results of the benchmark are sometimes very strange. It seems that the measurement methodology is wrong, since the standard deviation can even exceed the mean, which means the accuracy of the measurements is far too low:
>
> scalable_pool mt_alloc_free: mean: 16.13 [ms] std_dev: 21.2146 [ms]
>
> https://github.com/oneapi-src/unified-memory-framework/actions/runs/7805176815/job/21288925509?pr=204

Can you try it again now? scalable_pool seems to need some time to warm up and the first iteration was taking longer. I've made a change to skip the first iteration in the calculations.
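
For clarity, here is a minimal sketch of what "skip the first iteration" means when aggregating the per-iteration times (not the actual change in this PR; the helper name and return type are made up): the warmup run is dropped before the mean and standard deviation are computed.

    #include <chrono>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Summary {
        double mean_ms;
        double std_dev_ms;
    };

    // Compute mean and (population) standard deviation over all iterations
    // except iteration 0, which includes the pool's warmup cost.
    inline Summary summarize_skipping_warmup(
        const std::vector<std::chrono::milliseconds> &iterations) {
        std::vector<double> samples;
        for (std::size_t i = 1; i < iterations.size(); i++) {
            samples.push_back(static_cast<double>(iterations[i].count()));
        }
        if (samples.empty()) {
            return {0.0, 0.0};
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        const double mean = sum / static_cast<double>(samples.size());
        double variance = 0.0;
        for (double s : samples) {
            variance += (s - mean) * (s - mean);
        }
        variance /= static_cast<double>(samples.size());
        return {mean, std::sqrt(variance)};
    }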

@igchor (Member, Author) commented Feb 7, 2024

> When I run this benchmark on Windows WSL or on a bare-metal server, it hangs after printing the following (I waited 55 minutes and it did not finish):
>
> scalable_pool mt_alloc_free: mean: 24.34 [ms] std_dev: 18.1215 [ms] (total alloc failures: 0 out of 5000000)
> jemalloc_pool mt_alloc_free: mean: 1112.15 [ms] std_dev: 116.737 [ms] (total alloc failures: 0 out of 5000000)
>
> probably here: [gdb backtrace quoted in full above]

This only happens in debug mode because of the debug checks in base_alloc. We should either not run the benchmarks in debug builds or optimize those checks.

@ldorau (Contributor) commented Feb 8, 2024

> > The results of the benchmark are sometimes very strange. It seems that the measurement methodology is wrong, since the standard deviation can even exceed the mean, which means the accuracy of the measurements is far too low:
> >
> > scalable_pool mt_alloc_free: mean: 16.13 [ms] std_dev: 21.2146 [ms]
> >
> > https://github.com/oneapi-src/unified-memory-framework/actions/runs/7805176815/job/21288925509?pr=204
>
> Can you try it again now? scalable_pool seems to need some time to warm up and the first iteration was taking longer. I've made a change to skip the first iteration in the calculations.

See the last CI build: https://github.com/oneapi-src/unified-memory-framework/actions/runs/7817812924/job/21327245255?pr=204

scalable_pool mt_alloc_free: mean: 13.3125 [ms] std_dev: 10.5836 [ms]

Still, 13.3 ms +/- 10.5 ms means the result can be anywhere from 2.8 to 23.8 ms - I'm afraid this is not a good result for a benchmark...

ubench enforces a maximum allowed confidence interval of +/- 2.5% (it fails if the confidence interval exceeds 2.5%):

[       OK ] simple.jemalloc_pool_with_os_memory_provider (mean 199.766us, confidence interval +- 2.458446%)
[       OK ] simple.scalable_pool_with_os_memory_provider (mean 94.782us, confidence interval +- 1.989851%)

See the output of the last CI build of ubench: "confidence interval 5.807915% exceeds maximum permitted 2.500000%"
https://github.com/oneapi-src/unified-memory-framework/actions/runs/7817812924/job/21327245255?pr=204

[ RUN      ] simple.scalable_pool_with_os_memory_provider
confidence interval 5.807915% exceeds maximum permitted 2.500000%
[  FAILED  ] simple.scalable_pool_with_os_memory_provider (mean 53.946us, confidence interval +- 5.807915%)

but here we have 10.5836/13.3125 = 79.5%, so almost 80%...

@ldorau (Contributor) left a comment

A confidence interval of about 80% is not good enough, but we can merge this as is and fix it later.
Please add a TODO in the code to fix it later.

@bratpiorka (Contributor) commented

> A confidence interval of about 80% is not good enough, but we can merge this as is and fix it later. Please add a TODO in the code to fix it later.

It would be better to add a GitHub issue.

@igchor (Member, Author) commented Feb 8, 2024

> > A confidence interval of about 80% is not good enough, but we can merge this as is and fix it later. Please add a TODO in the code to fix it later.
>
> It would be better to add a GitHub issue.

I've increased the number of iterations for the scalable pool and now it looks like this:

scalable_pool mt_alloc_free: mean: 437.1 [ms] std_dev: 37.5834 [ms] (total alloc failures: 0 out of 100000000)

I think this is acceptable for a multithreaded benchmark. Even under ubench, scalable_pool reports a higher-than-expected spread (above the 2.5% confidence-interval limit).
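
For reference, a quick back-of-the-envelope calculation on the numbers quoted in this thread (not a figure reported by the benchmark itself) shows how much the relative standard deviation improves:

    37.5834 / 437.1   ≈ 8.6%   (new result, after increasing the iteration count)
    10.5836 / 13.3125 ≈ 79.5%  (earlier result discussed above)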

Commit messages from this PR:
Helper functions taken from pmemstream repo.
so that we do not measure warmup time
@ldorau merged commit 73706ea into oneapi-src:main on Feb 14, 2024
@igchor deleted the mt_benchmark branch on February 14, 2024 15:31