Skip to content

Commit 86e99e1

Browse files
[libc] [search] improve hsearch robustness (#73896)
Following up the discussion at #73469 (comment) by @nickdesaulniers. According to FreeBSD implementation (https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/upstream-freebsd/lib/libc/stdlib/hcreate.c), `hsearch` is able to handle the cases where the global table is not properly initialized. To do this, FreeBSD actually allows hash table to be dynamically resized. If the global table is uninitialized at the first call, the table will be initialized with a minimal size; hence subsequent insertion will be reasonable as the table grows automatically. This patch mimic such behaviors. More precisely, this patch introduces: 1. a full table iterator that scans each element in the table, 2. a resize routine that is automatically triggered whenever the load factor is reached where it iterates the old table and insert the entries into a new one, 3. more tests that cover the growth of the table.
1 parent 67f9b5a commit 86e99e1

File tree

14 files changed

+320
-104
lines changed

14 files changed

+320
-104
lines changed

libc/docs/dev/undefined_behavior.rst

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -62,3 +62,10 @@ Often the standard will imply an intended behavior through what it states is und
6262
Ignoring Bug-For-Bug Compatibility
6363
----------------------------------
6464
Any long running implementations will have bugs and deviations from the standard. Hyrum's Law states that “all observable behaviors of your system will be depended on by somebody” which includes these bugs. An example of a long-standing bug is glibc's scanf float parsing behavior. The behavior is specifically defined in the standard, but it isn't adhered to by all libc implementations. There is a longstanding bug in glibc where it incorrectly parses the string 100er and this caused the C standard to add that specific example to the definition for scanf. The intended behavior is for scanf, when parsing a float, to parse the longest possibly valid prefix and then accept it if and only if that complete parsed value is a float. In the case of 100er the longest possibly valid prefix is 100e but the float parsed from that string is only 100. Since there is no number after the e it shouldn't be included in the float, so scanf should return a parsing error. For LLVM's libc it was decided to follow the standard, even though glibc's version is slightly simpler to implement and this edge case is rare. Following the standard must be the first priority, since that's the goal of the library.
65+
66+
Design Decisions
67+
================
68+
69+
Resizable Tables for hsearch
70+
----------------------------
71+
The POSIX.1 standard does not delineate the behavior consequent to invoking hsearch or hdestroy without prior initialization of the hash table via hcreate. Furthermore, the standard does not specify the outcomes of successive invocations of hsearch absent intervening hdestroy calls. Libraries such as MUSL and Glibc do not apply checks to these scenarios, potentially leading to memory corruption or leakage. Conversely, FreeBSD's libc and Bionic automatically initialize the hash table to a minimal size if it is found uninitialized, and proceeding to destroy the table only if initialization has occurred. This approach also avoids redundant table allocation if an initialized hash table is already present. Given that the hash table starts with a minimal size, resizing becomes necessary to accommodate additional user insertions. LLVM's libc mirrors the approach of FreeBSD's libc and Bionic, owing to its enhanced robustness and user-friendliness. Notably, such resizing behavior itself aligns with POSIX.1 standards, which explicitly permit implementations to modify the capacity of the hash table.

libc/src/__support/HashTable/CMakeLists.txt

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,6 @@ add_header_library(
2828
libc.include.llvm-libc-types.ENTRY
2929
libc.src.__support.CPP.bit
3030
libc.src.__support.CPP.new
31-
libc.src.__support.CPP.type_traits
3231
libc.src.__support.hash
3332
libc.src.__support.macros.attributes
3433
libc.src.__support.macros.optimization

libc/src/__support/HashTable/bitmask.h

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -31,9 +31,7 @@ namespace internal {
3131
// | Available | 0b1xxx'xxxx |
3232
// | Occupied | 0b0xxx'xxxx |
3333
// =============================
34-
template <typename T, T WORD_MASK, size_t WORD_STRIDE> struct BitMaskAdaptor {
35-
// A masked constant whose bits are all set.
36-
LIBC_INLINE_VAR constexpr static T MASK = WORD_MASK;
34+
template <typename T, size_t WORD_STRIDE> struct BitMaskAdaptor {
3735
// A stride in the bitmask may use multiple bits.
3836
LIBC_INLINE_VAR constexpr static size_t STRIDE = WORD_STRIDE;
3937

libc/src/__support/HashTable/generic/bitmask_impl.inc

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -34,14 +34,14 @@ LIBC_INLINE constexpr bitmask_t repeat_byte(bitmask_t byte) {
3434
return byte;
3535
}
3636

37-
using BitMask = BitMaskAdaptor<bitmask_t, repeat_byte(0x80), 0x8ull>;
37+
using BitMask = BitMaskAdaptor<bitmask_t, 0x8ull>;
3838
using IteratableBitMask = IteratableBitMaskAdaptor<BitMask>;
3939

4040
struct Group {
4141
bitmask_t data;
4242

4343
// Load a group of control words from an arbitary address.
44-
LIBC_INLINE static Group load(const void *__restrict addr) {
44+
LIBC_INLINE static Group load(const void *addr) {
4545
union {
4646
bitmask_t value;
4747
char bytes[sizeof(bitmask_t)];
@@ -51,6 +51,11 @@ struct Group {
5151
return {data.value};
5252
}
5353

54+
// Load a group of control words from an aligned address.
55+
LIBC_INLINE static Group load_aligned(const void *addr) {
56+
return *static_cast<const Group *>(addr);
57+
}
58+
5459
// Find out the lanes equal to the given byte and return the bitmask
5560
// with corresponding bits set.
5661
LIBC_INLINE IteratableBitMask match_byte(uint8_t byte) const {
@@ -106,6 +111,10 @@ struct Group {
106111
LIBC_INLINE BitMask mask_available() const {
107112
return {LIBC_NAMESPACE::Endian::to_little_endian(data) & repeat_byte(0x80)};
108113
}
114+
115+
LIBC_INLINE IteratableBitMask occupied() const {
116+
return {static_cast<bitmask_t>(mask_available().word ^ repeat_byte(0x80))};
117+
}
109118
};
110119
} // namespace internal
111120
} // namespace LIBC_NAMESPACE

libc/src/__support/HashTable/sse2/bitmask_impl.inc

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -12,17 +12,22 @@ namespace internal {
1212
// With SSE2, every bitmask is iteratable as
1313
// we use single bit to encode the data.
1414

15-
using BitMask = BitMaskAdaptor<uint16_t, 0xffffu, 0x1u>;
15+
using BitMask = BitMaskAdaptor<uint16_t, 0x1u>;
1616
using IteratableBitMask = IteratableBitMaskAdaptor<BitMask>;
1717

1818
struct Group {
1919
__m128i data;
2020

2121
// Load a group of control words from an arbitary address.
22-
LIBC_INLINE static Group load(const void *__restrict addr) {
22+
LIBC_INLINE static Group load(const void *addr) {
2323
return {_mm_loadu_si128(static_cast<const __m128i *>(addr))};
2424
}
2525

26+
// Load a group of control words from an aligned address.
27+
LIBC_INLINE static Group load_aligned(const void *addr) {
28+
return {_mm_load_si128(static_cast<const __m128i *>(addr))};
29+
}
30+
2631
// Find out the lanes equal to the given byte and return the bitmask
2732
// with corresponding bits set.
2833
LIBC_INLINE IteratableBitMask match_byte(uint8_t byte) const {
@@ -35,6 +40,10 @@ struct Group {
3540
auto bitmask = static_cast<uint16_t>(_mm_movemask_epi8(data));
3641
return {bitmask};
3742
}
43+
44+
LIBC_INLINE IteratableBitMask occupied() const {
45+
return {static_cast<uint16_t>(~mask_available().word)};
46+
}
3847
};
3948
} // namespace internal
4049
} // namespace LIBC_NAMESPACE

0 commit comments

Comments
 (0)