[libc] Add support for string/memory_utils functions for AArch64 without HW FP/SIMD #137592
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository; in that case you can instead tag reviewers by name in a comment, using @ followed by their GitHub username. If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment "Ping". The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord, or on the forums.
@llvm/pr-subscribers-libc

Author: William (saturn691)

Changes:

Add conditional compilation to add support for AArch64 without vector registers and/or hardware FPUs by using the generic implementation.

Context: A few functions were hard-coded to use vector registers/hardware FPUs. This meant that libc would not compile on architectures that did not support these features. This fix falls back on the generic implementation if a feature is not supported.

Full diff: https://github.com/llvm/llvm-project/pull/137592.diff

8 Files Affected:
diff --git a/libc/src/__support/FPUtil/FEnvImpl.h b/libc/src/__support/FPUtil/FEnvImpl.h
index 1c5a1108ff9e0..4c8f34a435bdf 100644
--- a/libc/src/__support/FPUtil/FEnvImpl.h
+++ b/libc/src/__support/FPUtil/FEnvImpl.h
@@ -17,7 +17,7 @@
#include "src/__support/macros/properties/architectures.h"
#include "src/errno/libc_errno.h"
-#if defined(LIBC_TARGET_ARCH_IS_AARCH64)
+#if defined(LIBC_TARGET_ARCH_IS_AARCH64) && defined(__ARM_FP)
#if defined(__APPLE__)
#include "aarch64/fenv_darwin_impl.h"
#else
diff --git a/libc/src/__support/FPUtil/nearest_integer.h b/libc/src/__support/FPUtil/nearest_integer.h
index 5d0deddd751f5..768f13414bd95 100644
--- a/libc/src/__support/FPUtil/nearest_integer.h
+++ b/libc/src/__support/FPUtil/nearest_integer.h
@@ -16,7 +16,7 @@
#if (defined(LIBC_TARGET_ARCH_IS_X86_64) && defined(LIBC_TARGET_CPU_HAS_SSE4_2))
#include "x86_64/nearest_integer.h"
-#elif defined(LIBC_TARGET_ARCH_IS_AARCH64)
+#elif (defined(LIBC_TARGET_ARCH_IS_AARCH64) && defined(__ARM_FP))
#include "aarch64/nearest_integer.h"
#elif defined(LIBC_TARGET_ARCH_IS_GPU)
diff --git a/libc/src/__support/FPUtil/sqrt.h b/libc/src/__support/FPUtil/sqrt.h
index 89da44ff2970f..1d377ab9a4e2d 100644
--- a/libc/src/__support/FPUtil/sqrt.h
+++ b/libc/src/__support/FPUtil/sqrt.h
@@ -42,7 +42,7 @@ template <> LIBC_INLINE double sqrt<double>(double x) {
// Use inline assembly when __builtin_elementwise_sqrt is not available.
#if defined(LIBC_TARGET_CPU_HAS_SSE2)
#include "x86_64/sqrt.h"
-#elif defined(LIBC_TARGET_ARCH_IS_AARCH64)
+#elif defined(LIBC_TARGET_ARCH_IS_AARCH64) && defined(__ARM_FP)
#include "aarch64/sqrt.h"
#elif defined(LIBC_TARGET_ARCH_IS_ARM)
#include "arm/sqrt.h"
diff --git a/libc/src/string/memory_utils/aarch64/inline_bcmp.h b/libc/src/string/memory_utils/aarch64/inline_bcmp.h
index e41ac202dbaac..2a64ceee10a6d 100644
--- a/libc/src/string/memory_utils/aarch64/inline_bcmp.h
+++ b/libc/src/string/memory_utils/aarch64/inline_bcmp.h
@@ -19,13 +19,36 @@
namespace LIBC_NAMESPACE_DECL {
+#if defined(__ARM_NEON)
+[[maybe_unused]] LIBC_INLINE BcmpReturnType
+inline_bcmp_aarch64_neon(CPtr p1, CPtr p2, size_t count) {
+ if (count <= 32) {
+ return aarch64::Bcmp<16>::head_tail(p1, p2, count);
+ }
+
+ if (count <= 64) {
+ return aarch64::Bcmp<32>::head_tail(p1, p2, count);
+ }
+
+ if (LIBC_UNLIKELY(count > 256)) {
+ if (auto value = aarch64::Bcmp<32>::block(p1, p2))
+ return value;
+ align_to_next_boundary<16, Arg::P1>(p1, p2, count);
+ }
+
+ return aarch64::Bcmp<32>::loop_and_tail(p1, p2, count);
+}
+#else
+[[maybe_unused]] LIBC_INLINE BcmpReturnType
+inline_bcmp_aarch64_no_neon(CPtr p1, CPtr p2, size_t count) {
+ return generic::Bcmp<uint64_t>::loop_and_tail_align_above(256, p1, p2, count);
+}
+#endif // __ARM_NEON
+
[[maybe_unused]] LIBC_INLINE BcmpReturnType inline_bcmp_aarch64(CPtr p1,
CPtr p2,
size_t count) {
- if (LIBC_LIKELY(count <= 32)) {
- if (LIBC_UNLIKELY(count >= 16)) {
- return aarch64::Bcmp<16>::head_tail(p1, p2, count);
- }
+ if (LIBC_LIKELY(count <= 16)) {
switch (count) {
case 0:
return BcmpReturnType::zero();
@@ -54,16 +77,11 @@ namespace LIBC_NAMESPACE_DECL {
}
}
- if (count <= 64)
- return aarch64::Bcmp<32>::head_tail(p1, p2, count);
-
- // Aligned loop if > 256, otherwise normal loop
- if (LIBC_UNLIKELY(count > 256)) {
- if (auto value = aarch64::Bcmp<32>::block(p1, p2))
- return value;
- align_to_next_boundary<16, Arg::P1>(p1, p2, count);
- }
- return aarch64::Bcmp<32>::loop_and_tail(p1, p2, count);
+#if defined(__ARM_NEON)
+ return inline_bcmp_aarch64_neon(p1, p2, count);
+#else
+ return inline_bcmp_aarch64_no_neon(p1, p2, count);
+#endif
}
} // namespace LIBC_NAMESPACE_DECL
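A note on the primitives used above: head_tail and loop_and_tail are libc-internal building blocks. My reading of the head_tail trick, shown as a hypothetical standalone sketch rather than the actual implementation, is that comparing one fixed-size block at each end covers every count in (N, 2N], because for those sizes the two blocks overlap or touch:

#include <cstddef>
#include <cstring>

// Hypothetical sketch only: compares the first N and the last N bytes.
// The two blocks jointly cover the buffer for any N < count <= 2 * N.
template <size_t N>
bool bcmp_head_tail_sketch(const char *p1, const char *p2, size_t count) {
  return std::memcmp(p1, p2, N) == 0 &&
         std::memcmp(p1 + count - N, p2 + count - N, N) == 0;
}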
diff --git a/libc/src/string/memory_utils/aarch64/inline_memcmp.h b/libc/src/string/memory_utils/aarch64/inline_memcmp.h
index 35ca077dab526..f017cc950d2e4 100644
--- a/libc/src/string/memory_utils/aarch64/inline_memcmp.h
+++ b/libc/src/string/memory_utils/aarch64/inline_memcmp.h
@@ -16,16 +16,7 @@
namespace LIBC_NAMESPACE_DECL {
-[[maybe_unused]] LIBC_INLINE MemcmpReturnType
-inline_memcmp_generic_gt16(CPtr p1, CPtr p2, size_t count) {
- if (LIBC_UNLIKELY(count >= 384)) {
- if (auto value = generic::Memcmp<uint8x16_t>::block(p1, p2))
- return value;
- align_to_next_boundary<16, Arg::P1>(p1, p2, count);
- }
- return generic::Memcmp<uint8x16_t>::loop_and_tail(p1, p2, count);
-}
-
+#if defined(__ARM_NEON)
[[maybe_unused]] LIBC_INLINE MemcmpReturnType
inline_memcmp_aarch64_neon_gt16(CPtr p1, CPtr p2, size_t count) {
if (LIBC_UNLIKELY(count >= 128)) { // [128, ∞]
@@ -46,6 +37,13 @@ inline_memcmp_aarch64_neon_gt16(CPtr p1, CPtr p2, size_t count) {
return generic::Memcmp<uint8x16_t>::loop_and_tail(p1 + 32, p2 + 32,
count - 32);
}
+#else
+[[maybe_unused]] LIBC_INLINE MemcmpReturnType
+inline_memcmp_generic_gt16(CPtr p1, CPtr p2, size_t count) {
+ return generic::Memcmp<uint64_t>::loop_and_tail_align_above(384, p1, p2,
+ count);
+}
+#endif
LIBC_INLINE MemcmpReturnType inline_memcmp_aarch64(CPtr p1, CPtr p2,
size_t count) {
@@ -61,10 +59,12 @@ LIBC_INLINE MemcmpReturnType inline_memcmp_aarch64(CPtr p1, CPtr p2,
return generic::Memcmp<uint32_t>::head_tail(p1, p2, count);
if (count <= 16)
return generic::Memcmp<uint64_t>::head_tail(p1, p2, count);
- if constexpr (aarch64::kNeon)
- return inline_memcmp_aarch64_neon_gt16(p1, p2, count);
- else
- return inline_memcmp_generic_gt16(p1, p2, count);
+
+#if defined(__ARM_NEON)
+ return inline_memcmp_aarch64_neon_gt16(p1, p2, count);
+#else
+ return inline_memcmp_generic_gt16(p1, p2, count);
+#endif
}
} // namespace LIBC_NAMESPACE_DECL
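One reason the if constexpr (aarch64::kNeon) dispatch above had to become a preprocessor guard: if constexpr only skips instantiation of the untaken branch inside templates; in a plain function such as inline_memcmp_aarch64, both branches must still parse and type-check. Since uint8x16_t is only declared when arm_neon.h is included, the NEON branch is ill-formed on a no-SIMD target even though it would never be taken, which is presumably why the build failed on such targets.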
diff --git a/libc/src/string/memory_utils/aarch64/inline_memmove.h b/libc/src/string/memory_utils/aarch64/inline_memmove.h
index 2b238031af49d..d8d276966fd27 100644
--- a/libc/src/string/memory_utils/aarch64/inline_memmove.h
+++ b/libc/src/string/memory_utils/aarch64/inline_memmove.h
@@ -8,8 +8,7 @@
#ifndef LIBC_SRC_STRING_MEMORY_UTILS_AARCH64_INLINE_MEMMOVE_H
#define LIBC_SRC_STRING_MEMORY_UTILS_AARCH64_INLINE_MEMMOVE_H
-#include "src/__support/macros/attributes.h" // LIBC_INLINE
-#include "src/string/memory_utils/op_aarch64.h" // aarch64::kNeon
+#include "src/__support/macros/attributes.h" // LIBC_INLINE
#include "src/string/memory_utils/op_builtin.h"
#include "src/string/memory_utils/op_generic.h"
#include "src/string/memory_utils/utils.h"
@@ -19,7 +18,6 @@
namespace LIBC_NAMESPACE_DECL {
LIBC_INLINE void inline_memmove_aarch64(Ptr dst, CPtr src, size_t count) {
- static_assert(aarch64::kNeon, "aarch64 supports vector types");
using uint128_t = generic_v128;
using uint256_t = generic_v256;
using uint512_t = generic_v512;
diff --git a/libc/src/string/memory_utils/aarch64/inline_memset.h b/libc/src/string/memory_utils/aarch64/inline_memset.h
index efcbfd0705983..71b686d92670b 100644
--- a/libc/src/string/memory_utils/aarch64/inline_memset.h
+++ b/libc/src/string/memory_utils/aarch64/inline_memset.h
@@ -18,12 +18,34 @@
namespace LIBC_NAMESPACE_DECL {
+using uint128_t = generic_v128;
+using uint256_t = generic_v256;
+using uint512_t = generic_v512;
+
+#if defined(__ARM_NEON)
+[[maybe_unused]] LIBC_INLINE static void
+inline_memset_aarch64_neon(Ptr dst, uint8_t value, size_t count) {
+ if (count >= 448 && value == 0 && aarch64::neon::hasZva()) {
+ generic::Memset<uint512_t>::block(dst, 0);
+ align_to_next_boundary<64>(dst, count);
+ return aarch64::neon::BzeroCacheLine::loop_and_tail(dst, 0, count);
+ }
+
+ generic::Memset<uint128_t>::block(dst, value);
+ align_to_next_boundary<16>(dst, count);
+ return generic::Memset<uint512_t>::loop_and_tail(dst, value, count);
+}
+#else
+[[maybe_unused]] LIBC_INLINE static void
+inline_memset_aarch64_no_neon(Ptr dst, uint8_t value, size_t count) {
+ generic::Memset<uint128_t>::block(dst, value);
+ align_to_next_boundary<16>(dst, count);
+ return generic::Memset<uint512_t>::loop_and_tail(dst, value, count);
+}
+#endif // __ARM_NEON
+
[[maybe_unused]] LIBC_INLINE static void
inline_memset_aarch64(Ptr dst, uint8_t value, size_t count) {
- static_assert(aarch64::kNeon, "aarch64 supports vector types");
- using uint128_t = generic_v128;
- using uint256_t = generic_v256;
- using uint512_t = generic_v512;
if (count == 0)
return;
if (count <= 3) {
@@ -46,15 +68,12 @@ inline_memset_aarch64(Ptr dst, uint8_t value, size_t count) {
generic::Memset<uint256_t>::tail(dst, value, count);
return;
}
- if (count >= 448 && value == 0 && aarch64::neon::hasZva()) {
- generic::Memset<uint512_t>::block(dst, 0);
- align_to_next_boundary<64>(dst, count);
- return aarch64::neon::BzeroCacheLine::loop_and_tail(dst, 0, count);
- } else {
- generic::Memset<uint128_t>::block(dst, value);
- align_to_next_boundary<16>(dst, count);
- return generic::Memset<uint512_t>::loop_and_tail(dst, value, count);
- }
+
+#if defined(__ARM_NEON)
+ return inline_memset_aarch64_neon(dst, value, count);
+#else
+ return inline_memset_aarch64_no_neon(dst, value, count);
+#endif
}
} // namespace LIBC_NAMESPACE_DECL
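For context on the value == 0 fast path kept in the NEON version: as far as I know, aarch64::neon::hasZva() gates use of the AArch64 DC ZVA instruction, which zeroes an entire naturally aligned block (commonly 64 bytes; the granule size is advertised in DCZID_EL0) without reading it first, which makes it attractive for large zeroing memsets. A hypothetical standalone sketch, not the libc code:

// Hypothetical sketch (AArch64 only): zero one 64-byte-aligned block with
// DC ZVA. Assumes DCZID_EL0 reports a 64-byte granule and DC ZVA is enabled.
static inline void zero_zva_block(void *p) {
  asm volatile("dc zva, %0" : : "r"(p) : "memory");
}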
diff --git a/libc/src/string/memory_utils/op_aarch64.h b/libc/src/string/memory_utils/op_aarch64.h
index 868c64474c0b4..e552601fbb708 100644
--- a/libc/src/string/memory_utils/op_aarch64.h
+++ b/libc/src/string/memory_utils/op_aarch64.h
@@ -25,7 +25,6 @@
#ifdef __ARM_NEON
#include <arm_neon.h>
-#endif //__ARM_NEON
namespace LIBC_NAMESPACE_DECL {
namespace aarch64 {
@@ -176,6 +175,8 @@ template <size_t Size> struct Bcmp {
} // namespace aarch64
} // namespace LIBC_NAMESPACE_DECL
+#endif //__ARM_NEON
+
namespace LIBC_NAMESPACE_DECL {
namespace generic {
@@ -225,6 +226,8 @@ LIBC_INLINE MemcmpReturnType cmp<uint64_t>(CPtr p1, CPtr p2, size_t offset) {
return MemcmpReturnType::zero();
}
+#if defined(__ARM_NEON)
+
///////////////////////////////////////////////////////////////////////////////
// Specializations for uint8x16_t
template <> struct is_vector<uint8x16_t> : cpp::true_type {};
@@ -269,6 +272,9 @@ LIBC_INLINE MemcmpReturnType cmp<uint8x16x2_t>(CPtr p1, CPtr p2,
}
return MemcmpReturnType::zero();
}
+
+#endif // __ARM_NEON
+
} // namespace generic
} // namespace LIBC_NAMESPACE_DECL
Regarding the memory functions (…)
Would something like this be preferable?

// Guard both these functions, or let the compiler optimise it out?
// Seems like you want it guarded, but then this is quite a small change; it is already guarded.
#ifdef __ARM_NEON
inline_bcmp_aarch64_with_fp(...) { ... }
#else
inline_bcmp_aarch64_no_fp(...) { ... }
#endif

inline_bcmp_aarch64_dispatch(...) {
#ifdef __ARM_NEON
  return inline_bcmp_aarch64_with_fp(...);
#else
  return inline_bcmp_aarch64_no_fp(...);
#endif
}

inline_bcmp_aarch64(...) {
  // Shared logic here.
  return inline_bcmp_aarch64_dispatch(...);
}
I meant something along those lines:

#ifdef __ARM_NEON
[[maybe_unused]] bool inline_bcmp_aarch64_with_fp(...) { ... }
#endif
[[maybe_unused]] bool inline_bcmp_aarch64_no_fp(...) { ... }

[[gnu::flatten]] bool inline_bcmp_aarch64_dispatch(...) {
#ifdef __ARM_NEON
  return inline_bcmp_aarch64_with_fp(...);
#else
  return inline_bcmp_aarch64_no_fp(...);
#endif
}

Then in libc/src/string/memory_utils/inline_bcmp.h we need to change the function that gets called accordingly. ([[gnu::flatten]] asks the compiler to inline the calls made from the dispatcher, so the extra level of indirection costs nothing, and [[maybe_unused]] keeps whichever implementation is not selected from triggering warnings.)

Sorry for what looks like nitpicking, but I'm currently working on memory function implementations for armv6/v7 and this approach seems to scale.

Also, now that I think it through, I'm a bit puzzled: aarch64 is supposed to be armv8/v9, and AFAIU armv8 has to have Advanced SIMD (Neon). What is the exact CPU you're targeting? Isn't it more of an issue about 32/64-bit support rather than neon/non-neon?
You're correct that in standard implementations AArch64 is required to have Neon, but there are valid implementations that don't have NEON. For example:
I see what you're suggesting; I'll get to work on the patch.
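(For reproducing such a configuration: assuming a reasonably recent Clang or GCC, the +nofp and +nosimd architecture-extension modifiers, e.g. -march=armv8-a+nofp+nosimd, target AArch64 without hardware FP/SIMD; under those flags neither __ARM_FP nor __ARM_NEON is predefined, so the guards added in this PR select the generic paths.)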
saturn691 force-pushed from f447844 to 2e5f7d2.
[[gnu::flatten]] LIBC_INLINE BcmpReturnType
inline_bcmp_aarch64_dispatch(CPtr p1, CPtr p2, size_t count) {
  if (LIBC_LIKELY(count <= 16)) {
Can you completely duplicate the code in inline_bcmp_aarch64_no_fp and inline_bcmp_aarch64_with_fp, so we can make sure the code is exactly the same as before for the latter? Currently the if (LIBC_UNLIKELY(count >= 16)) is gone.
Do you want the shared code in a shared (possibly inline?) function, so that it looks like this:

inline_bcmp_aarch64_with_fp(...) {
  if (LIBC_LIKELY(count <= 16)) {
    return inline_bcmp_aarch64_shared(...);
  }
  // Specialised logic goes here
  return x;
}

Also, what do you mean by "if (LIBC_UNLIKELY(count >= 16)) is gone"?
> Do you want the shared code in a shared (possibly inline?) function

I want full code duplication. I'm usually quite against this, but in this particular case duplication makes sense. It allows working on an implementation without having to think about the impact on the other ones.

> Also, what do you mean by "if (LIBC_UNLIKELY(count >= 16)) is gone"?

Oops, my bad. I meant: before we had

if (LIBC_LIKELY(count <= 32)) {
  if (LIBC_UNLIKELY(count >= 16)) {
    return aarch64::Bcmp<16>::head_tail(p1, p2, count);
  }
  switch ...

now we have

if (LIBC_LIKELY(count <= 16)) {
  switch ...

I'd rather have an exact match with the previous implementation for the with_fp version if possible. Does it make sense?
Ok I see, I'll push a change soon.
-LIBC_INLINE MemcmpReturnType inline_memcmp_aarch64(CPtr p1, CPtr p2,
-                                                   size_t count) {
+[[gnu::flatten]] LIBC_INLINE MemcmpReturnType
+inline_memcmp_aarch64_dispatch(CPtr p1, CPtr p2, size_t count) {
Same here: the _dispatch functions should only contain the dispatch logic; each implementation should be able to evolve on its own.
}
#endif

[[gnu::flatten]] [[maybe_unused]] LIBC_INLINE static void
ditto
…out HW FP/SIMD

Add conditional compilation to add support for AArch64 without vector registers and/or hardware FPUs by using the generic implementation.

saturn691 force-pushed from 2e5f7d2 to 9028175.
Thx for the PR!
Hi Guillaume, I don't have commit access to LLVM yet, so could you commit this for me? Thanks!
@saturn691 Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues. How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
…out HW FP/SIMD (llvm#137592)

Add conditional compilation to add support for AArch64 without vector registers and/or hardware FPUs by using the generic implementation.

Context: A few functions were hard-coded to use vector registers/hardware FPUs. This meant that libc would not compile on architectures that did not support these features. This fix falls back on the generic implementation if a feature is not supported.
From llvm/llvm-project#137592, these variants for LLVM-libc can be reenabled.