
Commit 903cc71

[libc] mem* framework v3
This version is more composable and also simpler at the expense of being more explicit and more verbose.

This patch provides the rationale for the framework, the implementation, and unit tests, but the functions themselves are still using the previous version. The change in implementation will come in a follow-up patch.

Differential Revision: https://reviews.llvm.org/D136292
1 parent e25ed05 commit 903cc71

10 files changed (+1572, -3 lines)
libc/src/string/memory_utils/CMakeLists.txt

Lines changed: 6 additions & 2 deletions
@@ -2,13 +2,17 @@
 add_header_library(
   memory_utils
   HDRS
-    utils.h
-    elements.h
     bcmp_implementations.h
     bzero_implementations.h
+    elements.h
     memcmp_implementations.h
     memcpy_implementations.h
     memset_implementations.h
+    op_aarch64.h
+    op_builtin.h
+    op_generic.h
+    op_x86.h
+    utils.h
   DEPS
     libc.src.__support.CPP.bit
 )
libc/src/string/memory_utils/README.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
# The mem* framework

The framework handles the following mem* functions:
- `memcpy`
- `memmove`
- `memset`
- `bzero`
- `bcmp`
- `memcmp`

## Building blocks

These functions can be built out of a set of lower-level operations:
- **`block`** : operates on a block of `SIZE` bytes.
- **`tail`** : operates on the last `SIZE` bytes of the buffer (e.g., `[dst + count - SIZE, dst + count]`)
- **`head_tail`** : operates on the first and last `SIZE` bytes. This is the same as calling `block` and `tail`.
- **`loop_and_tail`** : calls `block` in a loop to consume as much as possible of the `count` bytes and handle the remaining bytes with a `tail` operation.

As an illustration, let's take the example of a trivial `memset` implementation:

```C++
extern "C" void memset(char* dst, int value, size_t count) {
  if (count == 0) return;
  if (count == 1) return Memset<1>::block(dst, value);
  if (count == 2) return Memset<2>::block(dst, value);
  if (count == 3) return Memset<3>::block(dst, value);
  if (count <= 8) return Memset<4>::head_tail(dst, value, count);  // Note that 0 to 4 bytes are written twice.
  if (count <= 16) return Memset<8>::head_tail(dst, value, count); // Same here.
  return Memset<16>::loop_and_tail(dst, value, count);
}
```

Now let's have a look at the `Memset` structure:

```C++
template <size_t Size>
struct Memset {
  static constexpr size_t SIZE = Size;

  static inline void block(Ptr dst, uint8_t value) {
    // Implement me
  }

  static inline void tail(Ptr dst, uint8_t value, size_t count) {
    block(dst + count - SIZE, value);
  }

  static inline void head_tail(Ptr dst, uint8_t value, size_t count) {
    block(dst, value);
    tail(dst, value, count);
  }

  static inline void loop_and_tail(Ptr dst, uint8_t value, size_t count) {
    size_t offset = 0;
    do {
      block(dst + offset, value);
      offset += SIZE;
    } while (offset < count - SIZE);
    tail(dst, value, count);
  }
};
```

As you can see, `tail`, `head_tail` and `loop_and_tail` are higher-order functions that build on each other; only `block` really needs to be implemented.
In earlier designs we implemented these higher-order functions with templated functions, but it turned out to be more readable to state the implementations explicitly.
**This design is useful because it provides customization points**. For instance, for `bcmp` on `aarch64` we can provide a better implementation of `head_tail` using vector reduction intrinsics.

## Scoped specializations

We can have several specializations of the `Memset` structure. Depending on the target requirements, we can use one or several scopes for the same implementation.

In the following example we use the `generic` implementation for the small sizes but the `x86` implementation for the loop.
```C++
extern "C" void memset(char* dst, int value, size_t count) {
  if (count == 0) return;
  if (count == 1) return generic::Memset<1>::block(dst, value);
  if (count == 2) return generic::Memset<2>::block(dst, value);
  if (count == 3) return generic::Memset<3>::block(dst, value);
  if (count <= 8) return generic::Memset<4>::head_tail(dst, value, count);
  if (count <= 16) return generic::Memset<8>::head_tail(dst, value, count);
  return x86::Memset<16>::loop_and_tail(dst, value, count);
}
```

### The `builtin` scope

Ultimately we would like the compiler to provide the code for the `block` function. For this we rely on dedicated builtins available in Clang (e.g., [`__builtin_memset_inline`](https://clang.llvm.org/docs/LanguageExtensions.html#guaranteed-inlined-memset)).

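As a rough sketch of the shape of this scope (an illustration only, not the exact contents of `op_builtin.h`; `Ptr` is assumed to be the framework's mutable byte-pointer alias), a `block` simply forwards to the builtin with a compile-time size:

```C++
namespace builtin {
template <size_t Size> struct Memset {
  static constexpr size_t SIZE = Size;

  static inline void block(Ptr dst, uint8_t value) {
    // The size is a compile-time constant, so the compiler is free to pick
    // the best instruction sequence for exactly SIZE bytes.
    __builtin_memset_inline(dst, value, SIZE);
  }

  static inline void tail(Ptr dst, uint8_t value, size_t count) {
    block(dst + count - SIZE, value);
  }
};
} // namespace builtin
```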
### The `generic` scope

In this scope we define pure C++ implementations using native integral types and Clang vector extensions.

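For illustration only (not the actual `op_generic.h` code), an 8-byte `block` built on a native integral type could look like the following; the real code also covers wider sizes through Clang vector extensions:

```C++
namespace generic {
template <size_t Size> struct Memset {
  static constexpr size_t SIZE = Size;

  static inline void block(Ptr dst, uint8_t value) {
    static_assert(Size == sizeof(uint64_t), "this sketch only covers 8 bytes");
    // Broadcast the byte into a native 64-bit integer, then store it with a
    // single (possibly unaligned) 8-byte write.
    const uint64_t splat = 0x0101010101010101ULL * value;
    __builtin_memcpy(dst, &splat, sizeof(splat));
  }

  static inline void tail(Ptr dst, uint8_t value, size_t count) {
    block(dst + count - SIZE, value);
  }
};
} // namespace generic
```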
### The arch-specific scopes

Then come implementations that use architecture- or microarchitecture-specific features (e.g., `rep;movsb` for `x86` or `dc zva` for `aarch64`).

The purpose here is to rely on builtins as much as possible and to fall back to `asm volatile` as a last resort.
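As an illustration of the `asm volatile` fallback (a sketch only, not the actual `op_x86.h` code), an x86 copy loop can be written directly on top of `rep;movsb`:

```C++
namespace x86 {
// Copies `count` bytes from `src` to `dst` with the string-move instruction.
// The "+D", "+S" and "+c" constraints pin dst, src and count to the rdi, rsi
// and rcx registers that `rep movsb` implicitly reads and updates.
static inline void copy_repmovsb(char *dst, const char *src, size_t count) {
  asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(count) : : "memory");
}
} // namespace x86
```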
libc/src/string/memory_utils/op_aarch64.h

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
//===-- aarch64 implementation of memory function building blocks ---------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file provides aarch64 specific building blocks to compose memory
// functions.
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_LIBC_SRC_STRING_MEMORY_UTILS_OP_AARCH64_H
#define LLVM_LIBC_SRC_STRING_MEMORY_UTILS_OP_AARCH64_H

#include "src/__support/architectures.h"

#if defined(LLVM_LIBC_ARCH_AARCH64)

#include "src/__support/common.h"
#include "src/string/memory_utils/op_generic.h"

#ifdef __ARM_NEON
#include <arm_neon.h>
#endif //__ARM_NEON

namespace __llvm_libc::aarch64 {

static inline constexpr bool kNeon = LLVM_LIBC_IS_DEFINED(__ARM_NEON);

namespace neon {

template <size_t Size> struct BzeroCacheLine {
  static constexpr size_t SIZE = Size;

  static inline void block(Ptr dst, uint8_t) {
    static_assert(Size == 64);
#if __SIZEOF_POINTER__ == 4
    asm("dc zva, %w[dst]" : : [dst] "r"(dst) : "memory");
#else
    asm("dc zva, %[dst]" : : [dst] "r"(dst) : "memory");
#endif
  }

  static inline void loop_and_tail(Ptr dst, uint8_t value, size_t count) {
    static_assert(Size > 1, "a loop of size 1 does not need tail");
    size_t offset = 0;
    do {
      block(dst + offset, value);
      offset += SIZE;
    } while (offset < count - SIZE);
    // Unaligned store, we can't use 'dc zva' here.
    static constexpr size_t kMaxSize = kNeon ? 16 : 8;
    generic::Memset<Size, kMaxSize>::tail(dst, value, count);
  }
};

inline static bool hasZva() {
  uint64_t zva_val;
  asm("mrs %[zva_val], dczid_el0" : [zva_val] "=r"(zva_val));
  // DC ZVA is permitted if DZP, bit [4] is zero.
  // BS, bits [3:0] is log2 of the block count in words.
  // So the next line checks whether the instruction is permitted and block
  // count is 16 words (i.e. 64 bytes).
  return (zva_val & 0b11111) == 0b00100;
}

} // namespace neon

///////////////////////////////////////////////////////////////////////////////
// Bcmp
template <size_t Size> struct Bcmp {
  static constexpr size_t SIZE = Size;
  static constexpr size_t BlockSize = 32;

  static const unsigned char *as_u8(CPtr ptr) {
    return reinterpret_cast<const unsigned char *>(ptr);
  }

  static inline BcmpReturnType block(CPtr p1, CPtr p2) {
    if constexpr (Size == BlockSize) {
      auto _p1 = as_u8(p1);
      auto _p2 = as_u8(p2);
      uint8x16_t a = vld1q_u8(_p1);
      uint8x16_t b = vld1q_u8(_p1 + 16);
      uint8x16_t n = vld1q_u8(_p2);
      uint8x16_t o = vld1q_u8(_p2 + 16);
      uint8x16_t an = veorq_u8(a, n);
      uint8x16_t bo = veorq_u8(b, o);
      // anbo = (a ^ n) | (b ^ o). At least one byte is nonzero if there is
      // a difference between the two buffers. We reduce this value down to 4
      // bytes in two steps. First, calculate the saturated move value when
      // going from 2x64b to 2x32b. Second, compute the max of the 2x32b to get
      // a single 32 bit nonzero value if a mismatch occurred.
      uint8x16_t anbo = vorrq_u8(an, bo);
      uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo));
      return vmaxv_u32(anbo_reduced);
    } else if constexpr ((Size % BlockSize) == 0) {
      for (size_t offset = 0; offset < Size; offset += BlockSize)
        if (auto value = Bcmp<BlockSize>::block(p1 + offset, p2 + offset))
          return value;
    } else {
      deferred_static_assert("SIZE not implemented");
    }
    return BcmpReturnType::ZERO();
  }

  static inline BcmpReturnType tail(CPtr p1, CPtr p2, size_t count) {
    return block(p1 + count - SIZE, p2 + count - SIZE);
  }

  static inline BcmpReturnType head_tail(CPtr p1, CPtr p2, size_t count) {
    if constexpr (Size <= 8) {
      return generic::Bcmp<Size>::head_tail(p1, p2, count);
    } else if constexpr (Size == 16) {
      auto _p1 = as_u8(p1);
      auto _p2 = as_u8(p2);
      uint8x16_t a = vld1q_u8(_p1);
      uint8x16_t b = vld1q_u8(_p1 + count - 16);
      uint8x16_t n = vld1q_u8(_p2);
      uint8x16_t o = vld1q_u8(_p2 + count - 16);
      uint8x16_t an = veorq_u8(a, n);
      uint8x16_t bo = veorq_u8(b, o);
      // anbo = (a ^ n) | (b ^ o)
      uint8x16_t anbo = vorrq_u8(an, bo);
      uint32x2_t anbo_reduced = vqmovn_u64(vreinterpretq_u64_u8(anbo));
      return vmaxv_u32(anbo_reduced);
    } else if constexpr (Size == 32) {
      auto _p1 = as_u8(p1);
      auto _p2 = as_u8(p2);
      uint8x16_t a = vld1q_u8(_p1);
      uint8x16_t b = vld1q_u8(_p1 + 16);
      uint8x16_t c = vld1q_u8(_p1 + count - 16);
      uint8x16_t d = vld1q_u8(_p1 + count - 32);
      uint8x16_t n = vld1q_u8(_p2);
      uint8x16_t o = vld1q_u8(_p2 + 16);
      uint8x16_t p = vld1q_u8(_p2 + count - 16);
      uint8x16_t q = vld1q_u8(_p2 + count - 32);
      uint8x16_t an = veorq_u8(a, n);
      uint8x16_t bo = veorq_u8(b, o);
      uint8x16_t cp = veorq_u8(c, p);
      uint8x16_t dq = veorq_u8(d, q);
      uint8x16_t anbo = vorrq_u8(an, bo);
      uint8x16_t cpdq = vorrq_u8(cp, dq);
      // abnocpdq = ((a ^ n) | (b ^ o)) | ((c ^ p) | (d ^ q)). Reduce this to
      // a nonzero 32 bit value if a mismatch occurred.
      uint64x2_t abnocpdq = vreinterpretq_u64_u8(anbo | cpdq);
      uint32x2_t abnocpdq_reduced = vqmovn_u64(abnocpdq);
      return vmaxv_u32(abnocpdq_reduced);
    } else {
      deferred_static_assert("SIZE not implemented");
    }
    return BcmpReturnType::ZERO();
  }

  static inline BcmpReturnType loop_and_tail(CPtr p1, CPtr p2, size_t count) {
    static_assert(Size > 1, "a loop of size 1 does not need tail");
    size_t offset = 0;
    do {
      if (auto value = block(p1 + offset, p2 + offset))
        return value;
      offset += SIZE;
    } while (offset < count - SIZE);
    return tail(p1, p2, count);
  }
};

} // namespace __llvm_libc::aarch64

#endif // LLVM_LIBC_ARCH_AARCH64

#endif // LLVM_LIBC_SRC_STRING_MEMORY_UTILS_OP_AARCH64_H
