Skip to content

[libc++] [sort] Improve performance of std::sort #76616

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

AMP999
Copy link
Contributor

@AMP999 AMP999 commented Dec 30, 2023

Let in a wider variety of element types that are eligible to be processed by branchless sort functions, like POD structs and pointers.

This change causes sort_stability.pass.cpp to prefer branchless sort for EqualType, but not if the comparator is an arbitrary lambda. So the test is adjusted to always pass a simple comparator. nth_element_stability.pass.cpp and partial_sort_stability.pass.cpp already use a simple comparator all the time, so they don't need to be adjusted.

We considered forbidding the branchless version for unintentionally expensive comparators, e.g.
double a[10]; std::sort(a, a+10, std::less());
Derived a[10]; std::sort(a, a+10, std::less<VirtualBase>());
In both cases, it is safe, but more expensive than the original PR intended (since the comparator will be invoked more often in the branchless version). However, it's invoked only 0.1% more often, and benchmarks show that the branchless version is still (much) faster despite the extra invocations. So we leave this alone.
Sample benchmarks:

Sanity-check that "branchless sort" is used only for trivially copyable types It relies on copying the elements, so copying must be at least possible, and ideally no more expensive than move; i.e., trivial.

@AMP999 AMP999 requested a review from a team as a code owner December 30, 2023 14:34
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Dec 30, 2023
@llvmbot
Copy link
Member

llvmbot commented Dec 30, 2023

@llvm/pr-subscribers-libcxx

Author: Amirreza Ashouri (AMP999)

Changes

-Let in a wider variety of element types that are eligible to be processed by branchless sort functions, like POD structs and pointers.

This change cuases sort_stability.pass.cpp prefer branchless sort for EqualType, but not if the comprator is an arbitrary lambda. So the test is adjusted to always pass a simple comprator. nth_element_stability.pass.cpp and partial_sort_stability.pass.cpp already use a simple comparator all the time, so they don't need to be adjusted.

We considered forbidding the branchless version for unintentionally expensive comparators, e.g.
double a[10]; std::sort(a, a+10, std::less<BigDouble>());
Derived a[10]; std::sort(a, a+10, std::less<VirtualBase>());
In both cases it is safe, but more expensive than the original PR intended (since the comparator will be invoked more often in the branchless version). However, it's invoked only 0.1% more often, and benchmarks show that the branchless version is still (much) faster despite the extra invocations. So we leave this alone.
Sample benchmarks:

Sanity-check that "branchless sort" is used only for trivially copyable types It relies on copying the elements, so copying must be at least possible, and ideally no more expensive than move; i.e., trivial. No functional change.


Full diff: https://github.com/llvm/llvm-project/pull/76616.diff

2 Files Affected:

  • (modified) libcxx/include/__algorithm/sort.h (+6-4)
  • (modified) libcxx/test/libcxx/algorithms/sort_stability.pass.cpp (+1-2)
diff --git a/libcxx/include/__algorithm/sort.h b/libcxx/include/__algorithm/sort.h
index 1b878c33c7a16f..9e72d460f06f69 100644
--- a/libcxx/include/__algorithm/sort.h
+++ b/libcxx/include/__algorithm/sort.h
@@ -28,8 +28,8 @@
 #include <__iterator/iterator_traits.h>
 #include <__type_traits/conditional.h>
 #include <__type_traits/disjunction.h>
-#include <__type_traits/is_arithmetic.h>
 #include <__type_traits/is_constant_evaluated.h>
+#include <__type_traits/is_trivially_copyable.h>
 #include <__utility/move.h>
 #include <__utility/pair.h>
 #include <climits>
@@ -144,7 +144,7 @@ template <class _Compare, class _Iter, class _Tp = typename iterator_traits<_Ite
 using __use_branchless_sort =
     integral_constant<bool,
                       __libcpp_is_contiguous_iterator<_Iter>::value && sizeof(_Tp) <= sizeof(void*) &&
-                          is_arithmetic<_Tp>::value && __is_simple_comparator<_Compare>::value>;
+                          is_trivially_copyable<_Tp>::value && __is_simple_comparator<_Compare>::value>;
 
 namespace __detail {
 
@@ -158,7 +158,8 @@ template <class _Compare, class _RandomAccessIterator>
 inline _LIBCPP_HIDE_FROM_ABI void __cond_swap(_RandomAccessIterator __x, _RandomAccessIterator __y, _Compare __c) {
   // Note: this function behaves correctly even with proxy iterators (because it relies on `value_type`).
   using value_type = typename iterator_traits<_RandomAccessIterator>::value_type;
-  bool __r         = __c(*__x, *__y);
+  static_assert(is_trivially_copyable<value_type>::value, "");
+  bool __r = __c(*__x, *__y);
   value_type __tmp = __r ? *__x : *__y;
   *__y             = __r ? *__y : *__x;
   *__x             = __tmp;
@@ -171,7 +172,8 @@ inline _LIBCPP_HIDE_FROM_ABI void
 __partially_sorted_swap(_RandomAccessIterator __x, _RandomAccessIterator __y, _RandomAccessIterator __z, _Compare __c) {
   // Note: this function behaves correctly even with proxy iterators (because it relies on `value_type`).
   using value_type = typename iterator_traits<_RandomAccessIterator>::value_type;
-  bool __r         = __c(*__z, *__x);
+  static_assert(is_trivially_copyable<value_type>::value, "");
+  bool __r = __c(*__z, *__x);
   value_type __tmp = __r ? *__z : *__x;
   *__z             = __r ? *__x : *__z;
   __r              = __c(__tmp, *__y);
diff --git a/libcxx/test/libcxx/algorithms/sort_stability.pass.cpp b/libcxx/test/libcxx/algorithms/sort_stability.pass.cpp
index 712f12c2559359..8856ca019f721d 100644
--- a/libcxx/test/libcxx/algorithms/sort_stability.pass.cpp
+++ b/libcxx/test/libcxx/algorithms/sort_stability.pass.cpp
@@ -68,8 +68,7 @@ void test_same() {
   auto snapshot_custom_v = v;
   std::sort(v.begin(), v.end());
   std::sort(snapshot_v.begin(), snapshot_v.end());
-  std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(),
-            [](const EqualType&, const EqualType&) { return false; });
+  std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(), std::less<EqualType>());
   bool all_equal = true;
   for (int i = 0; i < kSize; ++i) {
     if (v[i].value != snapshot_v[i].value || v[i].value != snapshot_custom_v[i].value) {

Copy link

github-actions bot commented Dec 30, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@AMP999 AMP999 force-pushed the pr4-std-sort-improvement branch from c337b96 to 5ec8ba9 Compare December 30, 2023 14:43
@AMP999 AMP999 changed the title [libc++] [sort] Improve perfomance of std::sort [libc++] [sort] Improve performance of std::sort Dec 30, 2023
@AMP999 AMP999 force-pushed the pr4-std-sort-improvement branch 3 times, most recently from 45081f9 to 85efcee Compare December 30, 2023 14:57
Copy link
Member

@mordante mordante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patch. Can you add libc++ benchmarks so we can evaluate the results?

@@ -68,8 +68,7 @@ void test_same() {
auto snapshot_custom_v = v;
std::sort(v.begin(), v.end());
std::sort(snapshot_v.begin(), snapshot_v.end());
std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(),
[](const EqualType&, const EqualType&) { return false; });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this test instead of just adding the new one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old test case passes [](const EqualType&, const EqualType&) { return false; } as a comparator, which is not a "simple comparator", and __use_branchless_sort is written in a way that is true only when the __is_simple_comparator is true, therefor libc++ chooses to use branchless sort for std::sort(snapshot_v.begin(), snapshot_v.end()); but not for std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(), [](const EqualType&, const EqualType&) { return false; }); but after the change std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(), std::less<EqualType>()) uses branchless sort. After this change, both statements will use branchless sort while the main behavior is preserved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that means we start to strongly write our tests based on our current implementation. That's not a good idea in general. Having some duplicated test coverage can help during refactoring where different paths are taken. So I really like adding a test, I'm just less thrilled with the removal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that this is in test/libcxx, not test/std, and so there's more lenience for tests that are coupled with the implementation. I do think that we should avoid this practice though.

It's also worth noting that this lambda doesn't meet the requirements of Compare because it doesn't impose a strict weak order on its elements, so replacing it with something that does is a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return false; imposes a perfectly correct strict weak ordering comparator

@@ -144,7 +144,7 @@ template <class _Compare, class _Iter, class _Tp = typename iterator_traits<_Ite
using __use_branchless_sort =
integral_constant<bool,
__libcpp_is_contiguous_iterator<_Iter>::value && sizeof(_Tp) <= sizeof(void*) &&
is_arithmetic<_Tp>::value && __is_simple_comparator<_Compare>::value>;
is_trivially_copyable<_Tp>::value && __is_simple_comparator<_Compare>::value>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there already tests for sorting trivially_copyable types?

@AMP999 AMP999 force-pushed the pr4-std-sort-improvement branch from 85efcee to 8fc02d5 Compare December 30, 2023 15:36
@philnik777
Copy link
Contributor

We've not done this before, because even comparators for trivially copyable types can be arbitrarily expensive and the types can also be arbitrarily large. Consider e.g. string_view. It's trivially copyable, but the comparator definitely isn't simple, which may result in very bad code.

@AMP999
Copy link
Contributor Author

AMP999 commented Dec 30, 2023

@philnik777 I've seen you've demonstrated benchmarks on your two PRs https://reviews.llvm.org/D156956 and https://reviews.llvm.org/D151274. Could you please help me to produce benchmarks for my PR?

@AMP999 AMP999 force-pushed the pr4-std-sort-improvement branch from 8fc02d5 to 2804972 Compare January 1, 2024 08:57
Copy link
Contributor

@cjdb cjdb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. It'd be great if the benchmarks you've recorded were checked into the project (using the benchmark framework) and if you could also put the results into the commit message.

@@ -68,8 +68,7 @@ void test_same() {
auto snapshot_custom_v = v;
std::sort(v.begin(), v.end());
std::sort(snapshot_v.begin(), snapshot_v.end());
std::sort(snapshot_custom_v.begin(), snapshot_custom_v.end(),
[](const EqualType&, const EqualType&) { return false; });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that this is in test/libcxx, not test/std, and so there's more lenience for tests that are coupled with the implementation. I do think that we should avoid this practice though.

It's also worth noting that this lambda doesn't meet the requirements of Compare because it doesn't impose a strict weak order on its elements, so replacing it with something that does is a good idea.

@AMP999 AMP999 force-pushed the pr4-std-sort-improvement branch 2 times, most recently from cc0155b to ff6670d Compare January 3, 2024 19:18
@AMP999 AMP999 requested a review from cjdb January 3, 2024 21:11
Let in a wider variety of element types that are eligible to be processed by branchless sort functions, like POD structs and pointers.

This change causes `sort_stability.pass.cpp` to prefer branchless sort for `EqualType`, but not if the comparator is an arbitrary lambda.
So the test is adjusted to always pass a simple comparator. `nth_element_stability.pass.cpp` and `partial_sort_stability.pass.cpp` already
use a simple comparator all the time, so they don't need to be adjusted.

We considered forbidding the branchless version for unintentionally
expensive comparators, e.g.
    double a[10]; std::sort(a, a+10, std::less<BigDouble>());
    Derived *a[10]; std::sort(a, a+10, std::less<VirtualBase*>());
In both cases, it is safe, but more expensive than the original PR
intended (since the comparator will be invoked more often in the branchless
version). However, it's invoked only 0.1% more often, and benchmarks show
that the branchless version is still (much) faster despite the extra
invocations. So we leave this alone.
Sample benchmarks:
- https://godbolt.org/z/x1Waz3vbY
- https://godbolt.org/z/5Ph3hfqbc

Sanity-check that "branchless sort" is used only for trivially copyable types
It relies on copying the elements, so copying must be at least possible,
and ideally no more expensive than move; i.e., trivial. No functional change.
@AMP999 AMP999 force-pushed the pr4-std-sort-improvement branch from ff6670d to 99209b2 Compare January 5, 2024 13:09
@cjdb
Copy link
Contributor

cjdb commented Jan 5, 2024

@AMP999 please don't force push commits, as it messes up the review. Just pushing subsequent commits is fine: we'll squash and merge into one commit at the very end.

@AMP999
Copy link
Contributor Author

AMP999 commented Jan 5, 2024

@cjdb Sure, I will continue the review process as you instructed.

@AMP999
Copy link
Contributor Author

AMP999 commented Jan 11, 2024

@philnik777 I'd appreciate it if you ELI5 how to produce a benchmark like yours in #76657 with "old" and "new" comparisons. It'd be a great help for my PR. Thank you in advance.

@philnik777
Copy link
Contributor

@philnik777 I'd appreciate it if you ELI5 how to produce a benchmark like yours in #76657 with "old" and "new" comparisons. It'd be a great help for my PR. Thank you in advance.

I use an incredibly advanced technique called "I have no idea how to do it properly, so I just run the benchmark twice (on the old and new code), copy-paste the old and new numbers together and call the columns 'old' and 'new'". It's a bit wordy, but I think it gets the idea across.

@philnik777
Copy link
Contributor

Oh and you get the layout like this when you run google benchmark in a terminal.

@philnik777
Copy link
Contributor

Closing, since there hasn't been any activity for quite a while. If you want to pursue this again feel free to re-open this PR.

@philnik777 philnik777 closed this Sep 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants