Skip to content

CDRIVER-5641: BSON Binary Vector Subtype Support #1868

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
54 commits merged into from
Mar 6, 2025

Conversation

ghost
Copy link

@ghost ghost commented Feb 15, 2025

  • Resolves CDRIVER-5641 by documenting and implementing new BSON Binary Vector APIs
  • Synced bson_binary_vector and bson_corpus spec tests from specifications commit 585b797c110b6709f81def6200b946b94c8d9c55
  • tests: Full support for the existing version of the spec test suite.
  • tests: TODOs mark spec test improvements that depend on DRIVERS-3095 and DRIVERS-3097.
  • tests: Additional API usage and fuzzing.
  • New BSON Binary APIs: bson_append_binary_uninit, bson_iter_binary_equal, bson_iter_binary_subtype, bson_iter_overwrite_binary
  • Extend existing big-endian byte swapping API for 32-bit float
  • Compiler support for restricted pointer aliasing using BSON_RESTRICT.

The concept: no new owned data types are defined. Instead, we encourage access to vector data that's already allocated inside a bson_t. When you append a vector to a bson_t, the result is a view. Views are effectively pointers decorated with type and length information, conferring a validity guarantee about the memory they reference.

The view data types are included in the ABI, and several performance-critical accessors are implemented as inline functions.

For review I'd recommend starting with the new binary_vector.rst documentation page and following the links from there. The documentation page for each view type includes a short example. These examples are also included in the test suite, in an adapted form.

It's tempting to automatically generate the repetitive portions of this patch, but so far I have opted instead to avoid the extra complexity.

Commits are separated into new files, synced spec files, and modified files.

Verified with:

Micah Scott added 5 commits February 14, 2025 15:25
From specifications repository commit 585b797c110b6709f81def6200b946b94c8d9c55
From specifications repository commit 585b797c110b6709f81def6200b946b94c8d9c55
@ghost ghost requested review from eramongodb and vector-of-bool February 15, 2025 06:51
mdbmes and others added 16 commits February 25, 2025 12:45
* replace s/packed_bits/packed_bit/
* in filenames also
* and fixup documentation headings and indents
Micah Scott added 13 commits February 26, 2025 21:29
The earlier fix for a bug in allocating bson_t near the max size revealed a flawed test that was only succeeding because of the allocation bug.
This fixes the test to identify both sides of the boundary: we fail to allocate a bson_t that's one byte too large, and on 64-bit platforms we allocate a max-sized bson_t.
The main reason for this change is to avoid an unhelpful implementation of -Warray-bounds in gcc 11
which notices that the memcpy in these tests can be out-of-range, but doesn't notice that the out-of-range
memcpy will never be executed. The easiest fix was to dynamically allocate the value buffers, to prevent
gcc from associating range info with these pointers.

This replaces some arbitrary nonzero test values with zeroes, out of convenience.
The new name doesn't misleadingly imply that the result is always a power of two.
@ghost
Copy link
Author

ghost commented Feb 27, 2025

This PR has commits to fix allocating max sized bson_t, to facilitate binary vector tests. I copied them out into #1891 for separate review.

@ghost ghost requested a review from eramongodb February 28, 2025 01:44
Copy link
Contributor

@eramongodb eramongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions remaining; otherwise, LGTM.

int8_t *expected_elements = bson_malloc (MAX_TESTED_VECTOR_LENGTH * sizeof *expected_elements);
int8_t *actual_elements = bson_malloc (MAX_TESTED_VECTOR_LENGTH * sizeof *actual_elements);
for (int fuzz_iter = 0; fuzz_iter < FUZZ_TEST_ITERS; fuzz_iter++) {
int r = rand ();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rand functions definitely leave a lot to be desired w.r.t. performance. Their addition in #898 was very much a "correctness first, performance later" decision. The suggestion to use them here was only for rand()-avoidance and reuse of existing API to easily achieve size_t type consistency without requiring explicit casts (i.e. in module expressions). imo given the execution of these tests are still under a second in total, I do not think the performance tradeoff is especially significant given the overall execution time of the test suite. If you still prefer the explicit rand(), I am fine with that decision.

@ghost
Copy link
Author

ghost commented Feb 28, 2025

Not sure why the Github UI is not giving me an option to reply directly to this above..

The rand functions definitely leave a lot to be desired w.r.t. performance. Their addition in #898 was very much a "correctness first, performance later" decision. The suggestion to use them here was only for rand()-avoidance and reuse of existing API to easily achieve size_t type consistency without requiring explicit casts (i.e. in module expressions). imo given the execution of these tests are still under a second in total, I do not think the performance tradeoff is especially significant given the overall execution time of the test suite. If you still prefer the explicit rand(), I am fine with that decision.

Switching to the wrapper just for consistency here would really not be an improvement. It's not just a waste of CPU time, it's a waste of the (extremely limited!) state space of the 32-bit PRNG on windows. Each call to the wrapper consumes 5 rand() calls, so the result is a 64-bit wide number with a period of only (2^32)/5.

Copy link
Contributor

@vector-of-bool vector-of-bool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is the large amount of code duplication between vector types and the const/non-const variants. I can't think of a pretty solution that doesn't involve macro trickery, though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I feel similarly. It's been tempting to macro-ize it more, but I feel like I'd be setting a trap for folks who are trying to debug this code later. The lowest hanging fruit for de-duplication at this point might be some kind of template engine for the documentation pages?

@ghost ghost merged commit 1126b6e into mongodb:master Mar 6, 2025
40 of 42 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants