-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[libc++] [sort] Improve performance of std::sort #76616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-libcxx Author: Amirreza Ashouri (AMP999) Changes-Let in a wider variety of element types that are eligible to be processed by branchless sort functions, like POD structs and pointers. This change cuases We considered forbidding the branchless version for unintentionally expensive comparators, e.g. Sanity-check that "branchless sort" is used only for trivially copyable types It relies on copying the elements, so copying must be at least possible, and ideally no more expensive than move; i.e., trivial. No functional change. Full diff: https://github.com/llvm/llvm-project/pull/76616.diff 2 Files Affected:
diff --git a/libcxx/include/__algorithm/sort.h b/libcxx/include/__algorithm/sort.h
index 1b878c33c7a16f..9e72d460f06f69 100644
--- a/libcxx/include/__algorithm/sort.h
+++ b/libcxx/include/__algorithm/sort.h
@@ -28,8 +28,8 @@
#include <__iterator/iterator_traits.h>
#include <__type_traits/conditional.h>
#include <__type_traits/disjunction.h>
-#include <__type_traits/is_arithmetic.h>
#include <__type_traits/is_constant_evaluated.h>
+#include <__type_traits/is_trivially_copyable.h>
#include <__utility/move.h>
#include <__utility/pair.h>
#include <climits>
@@ -144,7 +144,7 @@ template <class _Compare, class _Iter, class _Tp = typename iterator_traits<_Ite
using __use_branchless_sort =
integral_constant<bool,
__libcpp_is_contiguous_iterator<_Iter>::value && sizeof(_Tp) <= sizeof(void*) &&
- is_arithmetic<_Tp>::value && __is_simple_comparator<_Compare>::value>;
+ is_trivially_copyable<_Tp>::value && __is_simple_comparator<_Compare>::value>;
namespace __detail {
@@ -158,7 +158,8 @@ template <class _Compare, class _RandomAccessIterator>
inline _LIBCPP_HIDE_FROM_ABI void __cond_swap(_RandomAccessIterator __x, _RandomAccessIterator __y, _Compare __c) {
// Note: this function behaves correctly even with proxy iterators (because it relies on `value_type`).
using value_type = typename iterator_traits<_RandomAccessIterator>::value_type;
- bool __r = __c(*__x, *__y);
+ static_assert(is_trivially_copyable<value_type>::value, "");
+ bool __r = __c(*__x, *__y);
value_type __tmp = __r ? *__x : *__y;
*__y = __r ? *__y : *__x;
*__x = __tmp;
@@ -171,7 +172,8 @@ inline _LIBCPP_HIDE_FROM_ABI void
__partially_sorted_swap(_RandomAccessIterator __x, _RandomAccessIterator __y, _RandomAccessIterator __z, _Compare __c) {
// Note: this function behaves correctly even with proxy iterators (because it relies on `value_type`).
using value_type = typename iterator_traits<_RandomAccessIterator>::value_type;
- bool __r = __c(*__z, *__x);
+ static_assert(is_trivially_copyable<value_type>::value, "");
+ bool __r = __c(*__z, *__x);
value_type __tmp = __r ? *__z : *__x;
*__z = __r ? *__x : *__z;
__r = __c(__tmp, *__y);
diff --git a/libcxx/test/libcxx/algorithms/sort_stability.pass.cpp b/libcxx/test/libcxx/algorithms/sort_stability.pass.cpp
index 712f12c2559359..8856ca019f721d 100644
--- a/libcxx/test/libcxx/algorithms/sort_stability.pass.cpp
+++ b/libcxx/test/libcxx/algorithms/sort_stability.pass.cpp
@@ -68,8 +68,7 @@ void test_same() {
auto snapshot_custom_v = v;
std::sort(v.begin(), v.end());
std::sort(snapshot_v.begin(), snapshot_v.end());
- std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(),
- [](const EqualType&, const EqualType&) { return false; });
+ std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(), std::less<EqualType>());
bool all_equal = true;
for (int i = 0; i < kSize; ++i) {
if (v[i].value != snapshot_v[i].value || v[i].value != snapshot_custom_v[i].value) {
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
c337b96
to
5ec8ba9
Compare
45081f9
to
85efcee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patch. Can you add libc++ benchmarks so we can evaluate the results?
@@ -68,8 +68,7 @@ void test_same() { | |||
auto snapshot_custom_v = v; | |||
std::sort(v.begin(), v.end()); | |||
std::sort(snapshot_v.begin(), snapshot_v.end()); | |||
std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(), | |||
[](const EqualType&, const EqualType&) { return false; }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove this test instead of just adding the new one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old test case passes [](const EqualType&, const EqualType&) { return false; }
as a comparator, which is not a "simple comparator", and __use_branchless_sort
is written in a way that is true only when the __is_simple_comparator
is true, therefor libc++ chooses to use branchless sort for std::sort(snapshot_v.begin(), snapshot_v.end());
but not for std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(), [](const EqualType&, const EqualType&) { return false; });
but after the change std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(), std::less<EqualType>())
uses branchless sort. After this change, both statements will use branchless sort while the main behavior is preserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that means we start to strongly write our tests based on our current implementation. That's not a good idea in general. Having some duplicated test coverage can help during refactoring where different paths are taken. So I really like adding a test, I'm just less thrilled with the removal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that this is in test/libcxx
, not test/std
, and so there's more lenience for tests that are coupled with the implementation. I do think that we should avoid this practice though.
It's also worth noting that this lambda doesn't meet the requirements of Compare
because it doesn't impose a strict weak order on its elements, so replacing it with something that does is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return false;
imposes a perfectly correct strict weak ordering comparator
@@ -144,7 +144,7 @@ template <class _Compare, class _Iter, class _Tp = typename iterator_traits<_Ite | |||
using __use_branchless_sort = | |||
integral_constant<bool, | |||
__libcpp_is_contiguous_iterator<_Iter>::value && sizeof(_Tp) <= sizeof(void*) && | |||
is_arithmetic<_Tp>::value && __is_simple_comparator<_Compare>::value>; | |||
is_trivially_copyable<_Tp>::value && __is_simple_comparator<_Compare>::value>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there already tests for sorting trivially_copyable types?
85efcee
to
8fc02d5
Compare
We've not done this before, because even comparators for trivially copyable types can be arbitrarily expensive and the types can also be arbitrarily large. Consider e.g. |
@philnik777 I've seen you've demonstrated benchmarks on your two PRs https://reviews.llvm.org/D156956 and https://reviews.llvm.org/D151274. Could you please help me to produce benchmarks for my PR? |
8fc02d5
to
2804972
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. It'd be great if the benchmarks you've recorded were checked into the project (using the benchmark framework) and if you could also put the results into the commit message.
@@ -68,8 +68,7 @@ void test_same() { | |||
auto snapshot_custom_v = v; | |||
std::sort(v.begin(), v.end()); | |||
std::sort(snapshot_v.begin(), snapshot_v.end()); | |||
std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(), | |||
[](const EqualType&, const EqualType&) { return false; }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that this is in test/libcxx
, not test/std
, and so there's more lenience for tests that are coupled with the implementation. I do think that we should avoid this practice though.
It's also worth noting that this lambda doesn't meet the requirements of Compare
because it doesn't impose a strict weak order on its elements, so replacing it with something that does is a good idea.
cc0155b
to
ff6670d
Compare
Let in a wider variety of element types that are eligible to be processed by branchless sort functions, like POD structs and pointers. This change causes `sort_stability.pass.cpp` to prefer branchless sort for `EqualType`, but not if the comparator is an arbitrary lambda. So the test is adjusted to always pass a simple comparator. `nth_element_stability.pass.cpp` and `partial_sort_stability.pass.cpp` already use a simple comparator all the time, so they don't need to be adjusted. We considered forbidding the branchless version for unintentionally expensive comparators, e.g. double a[10]; std::sort(a, a+10, std::less<BigDouble>()); Derived *a[10]; std::sort(a, a+10, std::less<VirtualBase*>()); In both cases, it is safe, but more expensive than the original PR intended (since the comparator will be invoked more often in the branchless version). However, it's invoked only 0.1% more often, and benchmarks show that the branchless version is still (much) faster despite the extra invocations. So we leave this alone. Sample benchmarks: - https://godbolt.org/z/x1Waz3vbY - https://godbolt.org/z/5Ph3hfqbc Sanity-check that "branchless sort" is used only for trivially copyable types It relies on copying the elements, so copying must be at least possible, and ideally no more expensive than move; i.e., trivial. No functional change.
ff6670d
to
99209b2
Compare
@AMP999 please don't force push commits, as it messes up the review. Just pushing subsequent commits is fine: we'll squash and merge into one commit at the very end. |
@cjdb Sure, I will continue the review process as you instructed. |
@philnik777 I'd appreciate it if you ELI5 how to produce a benchmark like yours in #76657 with "old" and "new" comparisons. It'd be a great help for my PR. Thank you in advance. |
I use an incredibly advanced technique called "I have no idea how to do it properly, so I just run the benchmark twice (on the old and new code), copy-paste the old and new numbers together and call the columns 'old' and 'new'". It's a bit wordy, but I think it gets the idea across. |
Oh and you get the layout like this when you run google benchmark in a terminal. |
Closing, since there hasn't been any activity for quite a while. If you want to pursue this again feel free to re-open this PR. |
Let in a wider variety of element types that are eligible to be processed by branchless sort functions, like POD structs and pointers.
This change causes
sort_stability.pass.cpp
to prefer branchless sort forEqualType
, but not if the comparator is an arbitrary lambda. So the test is adjusted to always pass a simple comparator.nth_element_stability.pass.cpp
andpartial_sort_stability.pass.cpp
already use a simple comparator all the time, so they don't need to be adjusted.We considered forbidding the branchless version for unintentionally expensive comparators, e.g.
double a[10]; std::sort(a, a+10, std::less());
Derived a[10]; std::sort(a, a+10, std::less<VirtualBase>());
In both cases, it is safe, but more expensive than the original PR intended (since the comparator will be invoked more often in the branchless version). However, it's invoked only 0.1% more often, and benchmarks show that the branchless version is still (much) faster despite the extra invocations. So we leave this alone.
Sample benchmarks:
Sanity-check that "branchless sort" is used only for trivially copyable types It relies on copying the elements, so copying must be at least possible, and ideally no more expensive than move; i.e., trivial.