-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[libc++] Implement std::gcd using the binary version #77747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-libcxx Author: None (serge-sans-paille) ChangesThe binary version is four times faster than current implementation in my setup, and generally considered a better implementation. Code inspired by https://en.algorithmica.org/hpc/algorithms/gcd/ which itself is inspired by https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/ Fix #77648 Full diff: https://github.com/llvm/llvm-project/pull/77747.diff 4 Files Affected:
diff --git a/libcxx/include/__bit/countr.h b/libcxx/include/__bit/countr.h
index 0cc679f87a99d9..e47b8245be3cea 100644
--- a/libcxx/include/__bit/countr.h
+++ b/libcxx/include/__bit/countr.h
@@ -35,10 +35,8 @@ _LIBCPP_NODISCARD inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR int __libcpp_ct
return __builtin_ctzll(__x);
}
-#if _LIBCPP_STD_VER >= 20
-
template <__libcpp_unsigned_integer _Tp>
-_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) noexcept {
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int __countr_zero(_Tp __t) noexcept {
if (__t == 0)
return numeric_limits<_Tp>::digits;
@@ -59,6 +57,13 @@ _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) n
}
}
+#if _LIBCPP_STD_VER >= 20
+
+template <__libcpp_unsigned_integer _Tp>
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) noexcept {
+ return __countr_zero(__t);
+}
+
template <__libcpp_unsigned_integer _Tp>
_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_one(_Tp __t) noexcept {
return __t != numeric_limits<_Tp>::max() ? std::countr_zero(static_cast<_Tp>(~__t)) : numeric_limits<_Tp>::digits;
diff --git a/libcxx/include/__numeric/gcd_lcm.h b/libcxx/include/__numeric/gcd_lcm.h
index 3e9c244f25c285..55199547fa4db2 100644
--- a/libcxx/include/__numeric/gcd_lcm.h
+++ b/libcxx/include/__numeric/gcd_lcm.h
@@ -10,7 +10,9 @@
#ifndef _LIBCPP___NUMERIC_GCD_LCM_H
#define _LIBCPP___NUMERIC_GCD_LCM_H
+#include <__algorithm/min.h>
#include <__assert>
+#include <__bit/countr.h>
#include <__config>
#include <__type_traits/common_type.h>
#include <__type_traits/is_integral.h>
@@ -50,9 +52,25 @@ struct __ct_abs<_Result, _Source, false> {
};
template <class _Tp>
-_LIBCPP_CONSTEXPR _LIBCPP_HIDDEN _Tp __gcd(_Tp __m, _Tp __n) {
- static_assert((!is_signed<_Tp>::value), "");
- return __n == 0 ? __m : std::__gcd<_Tp>(__n, __m % __n);
+_LIBCPP_CONSTEXPR _LIBCPP_HIDDEN _Tp __gcd(_Tp __a, _Tp __b) {
+ static_assert((!is_signed<_Tp>::value), "");
+ if (__a == 0)
+ return __b;
+ if (__b == 0)
+ return __a;
+
+ int __az = std::__countr_zero(__a);
+ int __bz = std::__countr_zero(__b);
+ int __shift = std::min(__az, __bz);
+ __b >>= __bz;
+ while (__a != 0) {
+ __a >>= __az;
+ _Tp __absdiff = __a > __b ? __a - __b : __b - __a;
+ __b = std::min(__a, __b);
+ __a = __absdiff;
+ __az = std::__countr_zero(__absdiff);
+ }
+ return __b << __shift;
}
template <class _Tp, class _Up>
diff --git a/libcxx/test/libcxx/transitive_includes/cxx26.csv b/libcxx/test/libcxx/transitive_includes/cxx26.csv
index 0d8c3fa21b2f37..e3d325169a3fa8 100644
--- a/libcxx/test/libcxx/transitive_includes/cxx26.csv
+++ b/libcxx/test/libcxx/transitive_includes/cxx26.csv
@@ -173,6 +173,29 @@ experimental/simd limits
experimental/type_traits initializer_list
experimental/type_traits type_traits
experimental/utility utility
+experimental/vector experimental/memory_resource
+experimental/vector vector
+ext/hash_map algorithm
+ext/hash_map cmath
+ext/hash_map cstddef
+ext/hash_map cstdint
+ext/hash_map cstring
+ext/hash_map functional
+ext/hash_map initializer_list
+ext/hash_map limits
+ext/hash_map new
+ext/hash_map stdexcept
+ext/hash_map string
+ext/hash_set algorithm
+ext/hash_set cmath
+ext/hash_set cstddef
+ext/hash_set cstdint
+ext/hash_set cstring
+ext/hash_set functional
+ext/hash_set initializer_list
+ext/hash_set limits
+ext/hash_set new
+ext/hash_set string
filesystem compare
filesystem cstddef
filesystem cstdint
diff --git a/libcxx/test/std/numerics/numeric.ops/numeric.ops.gcd/gcd.pass.cpp b/libcxx/test/std/numerics/numeric.ops/numeric.ops.gcd/gcd.pass.cpp
index 831c226f9c8ea1..0435575bd8759a 100644
--- a/libcxx/test/std/numerics/numeric.ops/numeric.ops.gcd/gcd.pass.cpp
+++ b/libcxx/test/std/numerics/numeric.ops/numeric.ops.gcd/gcd.pass.cpp
@@ -17,6 +17,7 @@
#include <cassert>
#include <climits>
#include <cstdint>
+#include <random>
#include <type_traits>
#include "test_macros.h"
@@ -48,6 +49,26 @@ constexpr bool test0(int in1, int in2, int out)
return true;
}
+template <typename T>
+T basic_gcd(T m, T n) {
+ return n == 0 ? m : basic_gcd<T>(n, m % n);
+}
+
+template <typename Input>
+void do_fuzzy_tests() {
+
+ std::random_device rd;
+ std::mt19937 gen(rd());
+ std::uniform_int_distribution<Input> distrib;
+
+ constexpr int nb_rounds = 10000;
+ for(int i = 0; i < nb_rounds; ++i) {
+ Input n = distrib(gen);
+ Input m = distrib(gen);
+ assert(std::gcd(n, m) == basic_gcd(n, m));
+ }
+}
+
template <typename Input1, typename Input2 = Input1>
constexpr bool do_test(int = 0)
@@ -143,5 +164,10 @@ int main(int argc, char**)
assert(res == 2);
}
+ do_fuzzy_tests<std::uint8_t>();
+ do_fuzzy_tests<std::uint16_t>();
+ do_fuzzy_tests<std::uint32_t>();
+ do_fuzzy_tests<std::uint64_t>();
+
return 0;
}
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
5d7517c
to
335d34b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me % testing.
To be sure that we don't change std::gcd
behavior, let's test (compare with naive gcd) also:
- minimal and maximal values for different types.
- all pairs of numbers around 0.
for(T i = -9; i < 10; ++i)
for(T j = -9; j < 10; ++j)
I don't see a benchmark requested on Phabricator.
@philnik777 do changes in transitive_includes
look good to you?
libcxx/test/std/numerics/numeric.ops/numeric.ops.gcd/gcd.pass.cpp
Outdated
Show resolved
Hide resolved
bc0e98d
to
9cc8154
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gcd
implementation and gcd
test look good to me with addressed comments.
Please, add a benchmark in libcxx/benchmarks/
.
libcxx/test/std/numerics/numeric.ops/numeric.ops.gcd/gcd.pass.cpp
Outdated
Show resolved
Hide resolved
Also, while for random values performance gain is consistent, it's worth to point out that it's possible to create benchmarks in which naive GCD is clearly faster. https://quick-bench.com/q/1y8T00mwK3t9EL__IqTKLgmTHwI We may want to consider running naive GCD if one value is very small. |
Until now, I was not really thinking how this algorithm works, but I finally looked at it closer and I don't think we should upstream it. The idea is based on:
Based on that knowledge, it's possible to create an example where the new GCD is hundred times slower. https://quick-bench.com/q/4_kHmlvAN7nwcDaMqk3_6BSGaRM (Edit: notice that there is a problem with that benchmark, look at comments below) Moreover, we still can build a test like that, even if we exclude very small values. This algorithm works well for random values, but unfortunately not for all values. https://quick-bench.com/q/vUh_PAKAip6GLvnH3FvgnmzdK74 Potential performance hit is so big that I don't think we can upstream it. |
@AdvenamTacet You always have to be careful when writing benchmarks. If you look at the generated code, you see that Clang was able to constant-fold the native version and you only benchmark moving
|
@philnik777 thx for looking at it and fixing my benchmark! It's still big performance hit, but you are right that it's probably unlikely to see it in the wild often.
My fast tests show that for random numbers or interval of numbers, it's unlikely to have worse performance with this new implementation when
On every big interval I tested, implementation from this PR is faster, I did not try to check how often it's slower and how often it's faster. I can try to look at it soon. On average, it may have a positive effect, if we are ok with big slowdown in some rare cases. |
The algorithmic complexity of the binary GCD is the number of bits in the largest of the two integers. For the benchmark in question, it should go down as...
So it is difficult to imagine that that there could be gigantic differences. |
Note that my statement is not meant to contradict @AdvenamTacet but to point out that you can actually bound the difference from first principles. Algorithmically speaking, the difference cannot be excessively large, especially because each step in the binary GCD is cheaper. So it is definitively true that the naive algorithm can be sometimes faster, but there is denial-of-service attack lurking. :-) |
That change looks good to me and resolves my performance concerns. Overall, LGTM with my previous in-code comments addressed. |
baed224
to
59bbb53
Compare
@AdvenamTacet does this version look good to you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no more performance concerns and LGTM % two hanging comments.
Please add to test values numbers around limits instead of only limits:
std::numeric_limits<Input>::min(),
std::numeric_limits<Input>::min() + 1,
std::numeric_limits<Input>::min() + 2,
std::numeric_limits<Input>::max(),
std::numeric_limits<Input>::max() - 1,
std::numeric_limits<Input>::max() - 2,
could you also comment on that change:
- return -static_cast<_Result>(__t);
+ return static_cast<_Result>(-static_cast<std::make_unsigned_t<_Result>>(__t));
Why that change? When does it change the result?
26c142f
to
a8a1179
Compare
The binary version is four times faster than current implementation in my setup, and generally considered a better implementation. Code inspired by https://en.algorithmica.org/hpc/algorithms/gcd/ which itself is inspired by https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/ Hybrid approach and benchmarks inspired by `ylchapuy <https://github.com/ylchapuy>`. Fix llvm#77648
a8a1179
to
8ea797c
Compare
done
I wasn't aware that gcd is UB when one of its argument absolute value cannot be represented in the result type. I dismissed this change. |
@AdvenamTacet I'm going to merge this tomorrow, if that's fine with you? |
I think it's ready and I support that change. I would be happy, if someone else gives their positive opinion about the change in transitive includes, because I don't know rules of modifying them, but [except that] I see nothing to change and I support the PR. Don't consider me blocking the merge. |
We should test this algorithm with ubsan |
We do. Every test is run with UBSan in the CI. |
The binary version is four times faster than current implementation in my setup, and generally considered a better implementation.
Code inspired by https://en.algorithmica.org/hpc/algorithms/gcd/ which itself is inspired by https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/
Fix #77648