Skip to content

[libc] [search] improve hsearch robustness #73896

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Dec 5, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions libc/docs/dev/undefined_behavior.rst
Original file line number Diff line number Diff line change
Expand Up @@ -62,3 +62,10 @@ Often the standard will imply an intended behavior through what it states is und
Ignoring Bug-For-Bug Compatibility
----------------------------------
Any long running implementations will have bugs and deviations from the standard. Hyrum's Law states that “all observable behaviors of your system will be depended on by somebody” which includes these bugs. An example of a long-standing bug is glibc's scanf float parsing behavior. The behavior is specifically defined in the standard, but it isn't adhered to by all libc implementations. There is a longstanding bug in glibc where it incorrectly parses the string 100er and this caused the C standard to add that specific example to the definition for scanf. The intended behavior is for scanf, when parsing a float, to parse the longest possibly valid prefix and then accept it if and only if that complete parsed value is a float. In the case of 100er the longest possibly valid prefix is 100e but the float parsed from that string is only 100. Since there is no number after the e it shouldn't be included in the float, so scanf should return a parsing error. For LLVM's libc it was decided to follow the standard, even though glibc's version is slightly simpler to implement and this edge case is rare. Following the standard must be the first priority, since that's the goal of the library.

Design Decisions
================

Resizable Tables for hsearch
----------------------------
The POSIX.1 standard does not delineate the behavior consequent to invoking hsearch or hdestroy without prior initialization of the hash table via hcreate. Furthermore, the standard does not specify the outcomes of successive invocations of hsearch absent intervening hdestroy calls. Libraries such as MUSL and Glibc do not apply checks to these scenarios, potentially leading to memory corruption or leakage. Conversely, FreeBSD's libc and Bionic automatically initialize the hash table to a minimal size if it is found uninitialized, and proceeding to destroy the table only if initialization has occurred. This approach also avoids redundant table allocation if an initialized hash table is already present. Given that the hash table starts with a minimal size, resizing becomes necessary to accommodate additional user insertions. LLVM's libc mirrors the approach of FreeBSD's libc and Bionic, owing to its enhanced robustness and user-friendliness. Notably, such resizing behavior itself aligns with POSIX.1 standards, which explicitly permit implementations to modify the capacity of the hash table.
1 change: 0 additions & 1 deletion libc/src/__support/HashTable/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,6 @@ add_header_library(
libc.include.llvm-libc-types.ENTRY
libc.src.__support.CPP.bit
libc.src.__support.CPP.new
libc.src.__support.CPP.type_traits
libc.src.__support.hash
libc.src.__support.macros.attributes
libc.src.__support.macros.optimization
Expand Down
4 changes: 1 addition & 3 deletions libc/src/__support/HashTable/bitmask.h
Original file line number Diff line number Diff line change
Expand Up @@ -31,9 +31,7 @@ namespace internal {
// | Available | 0b1xxx'xxxx |
// | Occupied | 0b0xxx'xxxx |
// =============================
template <typename T, T WORD_MASK, size_t WORD_STRIDE> struct BitMaskAdaptor {
// A masked constant whose bits are all set.
LIBC_INLINE_VAR constexpr static T MASK = WORD_MASK;
template <typename T, size_t WORD_STRIDE> struct BitMaskAdaptor {
// A stride in the bitmask may use multiple bits.
LIBC_INLINE_VAR constexpr static size_t STRIDE = WORD_STRIDE;

Expand Down
13 changes: 11 additions & 2 deletions libc/src/__support/HashTable/generic/bitmask_impl.inc
Original file line number Diff line number Diff line change
Expand Up @@ -34,14 +34,14 @@ LIBC_INLINE constexpr bitmask_t repeat_byte(bitmask_t byte) {
return byte;
}

using BitMask = BitMaskAdaptor<bitmask_t, repeat_byte(0x80), 0x8ull>;
using BitMask = BitMaskAdaptor<bitmask_t, 0x8ull>;
using IteratableBitMask = IteratableBitMaskAdaptor<BitMask>;

struct Group {
bitmask_t data;

// Load a group of control words from an arbitary address.
LIBC_INLINE static Group load(const void *__restrict addr) {
LIBC_INLINE static Group load(const void *addr) {
union {
bitmask_t value;
char bytes[sizeof(bitmask_t)];
Expand All @@ -51,6 +51,11 @@ struct Group {
return {data.value};
}

// Load a group of control words from an aligned address.
LIBC_INLINE static Group load_aligned(const void *addr) {
return *static_cast<const Group *>(addr);
}

// Find out the lanes equal to the given byte and return the bitmask
// with corresponding bits set.
LIBC_INLINE IteratableBitMask match_byte(uint8_t byte) const {
Expand Down Expand Up @@ -106,6 +111,10 @@ struct Group {
LIBC_INLINE BitMask mask_available() const {
return {LIBC_NAMESPACE::Endian::to_little_endian(data) & repeat_byte(0x80)};
}

LIBC_INLINE IteratableBitMask occupied() const {
return {static_cast<bitmask_t>(mask_available().word ^ repeat_byte(0x80))};
}
};
} // namespace internal
} // namespace LIBC_NAMESPACE
13 changes: 11 additions & 2 deletions libc/src/__support/HashTable/sse2/bitmask_impl.inc
Original file line number Diff line number Diff line change
Expand Up @@ -12,17 +12,22 @@ namespace internal {
// With SSE2, every bitmask is iteratable as
// we use single bit to encode the data.

using BitMask = BitMaskAdaptor<uint16_t, 0xffffu, 0x1u>;
using BitMask = BitMaskAdaptor<uint16_t, 0x1u>;
using IteratableBitMask = IteratableBitMaskAdaptor<BitMask>;

struct Group {
__m128i data;

// Load a group of control words from an arbitary address.
LIBC_INLINE static Group load(const void *__restrict addr) {
LIBC_INLINE static Group load(const void *addr) {
return {_mm_loadu_si128(static_cast<const __m128i *>(addr))};
}

// Load a group of control words from an aligned address.
LIBC_INLINE static Group load_aligned(const void *addr) {
return {_mm_load_si128(static_cast<const __m128i *>(addr))};
}

// Find out the lanes equal to the given byte and return the bitmask
// with corresponding bits set.
LIBC_INLINE IteratableBitMask match_byte(uint8_t byte) const {
Expand All @@ -35,6 +40,10 @@ struct Group {
auto bitmask = static_cast<uint16_t>(_mm_movemask_epi8(data));
return {bitmask};
}

LIBC_INLINE IteratableBitMask occupied() const {
return {static_cast<uint16_t>(~mask_available().word)};
}
};
} // namespace internal
} // namespace LIBC_NAMESPACE
Loading