Skip to content

New internal bounded string append APIs #1816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
39 commits merged into from
Jan 13, 2025
Merged

Conversation

ghost
Copy link

@ghost ghost commented Dec 30, 2024

To efficiently support JSON serialization with bounded length, this is a new internal string API that can express bounded UTF-8-preserving append operations directly.

This patch on its own should have limited effect, but it's a prerequisite for work on CDRIVER-3775 and CDRIVER-4814.

The main benefit to this new API is that a single buffer can be shared by all nesting levels of a complex document. Previously, all nesting levels of a BSON document would allocate their own string append buffer, and truncation happened as a separate step on the way out of these nested levels.

Several other temporary string buffers were eliminated, including the temporary buffer that was needed for every printf. Now mcommon_string_append_printf writes directly to the destination, growing it as needed.

This work wasn't focused on performance, but it has an overall positive effect. The new mcommon_string implementations can be optimized more effectively at compile-time, and the updated string escape implementation treats fewer bytes as 'special'.

Other than the internal API change itself, the effects should be:

  • bson_as_json_with_opts now always truncates output at a UTF-8 boundary (this behavior was unspecified before)
  • bson_as_json serialization is overall faster, about 2x
  • worst case performance of bson_utf8_escape_for_json is better by about 2x
  • best case performance of bson_utf8_escape_for_json with plain ASCII data is worse by about 3x, due to an extra bson_utf8_validate pass that was needed to maintain API compatibility. The new internal equivalent for this API doesn't have the extra overhead.

The new internal APIs are documented in header comments. I've added test cases to cover UTF-8 truncation behavior, but most of the new code's test coverage is provided by existing tests.

@ghost ghost requested review from kevinAlbs, eramongodb and joshbsiegel December 30, 2024 20:36
@kevinAlbs kevinAlbs removed the request for review from joshbsiegel December 30, 2024 20:39
Micah Scott added 11 commits December 30, 2024 13:32
If TXT strings extended past the end of the TXT record, this had no check to prevent a read past the buffer.
I don't think the libc resolver does its own validation on these records at all, so we do need to validate here.
This isn't actually needed unless int is 64-bit, not common, but it was bugging me.
Copy link
Contributor

@eramongodb eramongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial feedback. Still in review.

@eramongodb eramongodb self-requested a review December 31, 2024 17:13
micah and others added 3 commits December 31, 2024 15:18
Co-authored-by: Ezra Chung <[email protected]>
Many of these are redundant, but some at least should be optimized out.
This will improve some of our error messages at the expense of code size.
Copy link
Collaborator

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remarkably, I only have minor comments. LGTM. Great work.


// Compatibility wrapper; deprecated private API.
bson_string_t *
_bson_string_alloc (const size_t size)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest removing _bson_string_alloc and _bson_string_append_ex and associated tests. They were private API before, and appear no longer used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the wrappers here are only needed for the tests. I was trying to preserve existing test coverage during the refactor. If you think it'd be cleaner to remove them now, I'll include that change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think remove. I see no reason to preserve test coverage. Test code using _bson_string_alloc and _bson_string_append_ex appear only to be test the functions themselves.

/* Parameters and outcome for a bounded append operation on a mcommon_string_t. Individual type-specific append
* functions can consume this struct to communicate bounds info. "max_len_exceeded" can be tested any time an
* algorithmic exit is convenient; the actual appended content will be limited by max_len. Truncation is guaranteed not
* to split a valid UTF-8 byte sequence.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect the behavior of truncation in bson_as_json_with_opts was previously unspecified. Request add an entry to src/libbson/NEWS to note the possible output change:

libbson 1.30.0 (Unreleased)
===========================

Fixes:

  * Truncated output of `bson_as_json_with_opts` is changed to no longer split valid UTF-8 sequences.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Yes, my understanding is that the behavior was unspecified before.

@ghost
Copy link
Author

ghost commented Jan 10, 2025

Ok, I'll need a minute to write a polished patch for this but I have a proposal for a new naming to make the ownership of strings and string append operations clearer.

I considered renaming the string_append to be an "appender", but my main frustration with this is that it implies the existence of methods like "string_appender_append" and "string_appender_printf", whereas I'd rather treat the "append" as the conceptual object that various appending methods act on. The mcommon_string_append() method itself feels conceptually like operator() on a string_append object. I think this makes the naming of most actual append operations clearer, if the confusion about lifetimes can be cleared up.

I'm avoiding the word "init" in describing how we come up with a mcommon_string_append_t. Something like "begin" or "do" could work, describing the expected placement of the operation, but I'd prefer to be more straightforward about what's happening. So, I'm considering that instead of a mcommon_string_append_t constructor, this could be written as a mcommon_string method that "sets" an append to itself. So mcommon_string_set_append(string, new_append), mcommon_string_set_append_with_limit(string, new_append, max_len). With string as the first param. And this means when we actually allocate a string, it can be written as a special type of string_new, like mcommon_string_new_as_append(), mcommon_string_new_as_fixed_capacity_append().

@ghost ghost requested a review from eramongodb January 11, 2025 04:24
Copy link
Contributor

@eramongodb eramongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subject-object reversal of string and append is a much appreciated improvement. Thank you very much! One last suggestion remaining; otherwise, LGTM.

…ior, rename 'atomic' append to 'all-or-none'
@ghost ghost merged commit f2c1bb7 into mongodb:master Jan 13, 2025
45 checks passed
ghost pushed a commit that referenced this pull request Jan 22, 2025
#1826)

This PR is a follow-up for #1795, #1821, and #1816. The new internal string and json APIs are used to implement missing functionality and tests for the structured logging subsystem.

* Adds public APIs for the max_document_length within a mongoc_structured_log_opts_t.
* Addresses the remainder of CDRIVER-4485: JSON documents within structured log messages are truncated correctly according to the rules set out in the Logging specification.
* Addresses the remainder of CDRIVER-4486 by implementing prose tests.
* Addresses CDRIVER-4814. Command or payload data beyond the truncation limit no longer degrades logging performance. (New serializer for command-with-payload)
* Unified tests use the suggested max document length of 10000 from the spec
* Replace atomic counter with atomic flag for environment error guards
* Tests for environment defaults now skip if relevant variables are set externally
ghost pushed a commit that referenced this pull request Jan 23, 2025
This is a follow-up to #1816 and mongodb/mongo-cxx-driver#1322, documenting minor changes to JSON output subsequent to change f2c1bb7
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants