DOCSP-43242: Improve UTF-8 validation documentation to clarify validation occurs on decoded data only #908

Merged: 12 commits, Oct 14, 2024
13 changes: 7 additions & 6 deletions source/fundamentals/bson/utf8-validation.txt
@@ -25,15 +25,16 @@ processing overhead since it needs to check the data.
 If you *disable* validation, your application avoids the validation processing
 overhead, but cannot guarantee consistent presentation of invalid UTF-8 data.

-The driver enables UTF-8 validation by default. It checks documents for any
-characters that are not encoded in a valid UTF-8 format when it transfers data
-between your application and MongoDB.
+By default, the driver enables UTF-8 validation on data from MongoDB.
+It checks incoming documents for any characters that are not encoded in a
+valid UTF-8 format when it parses data sent from MongoDB to your application.

 .. note::

-   The current version of the {+driver-short+} automatically substitutes
-   invalid UTF-8 characters with alternate valid UTF-8 ones before
-   validation when you send data to MongoDB. Therefore, the validation
+   The current version of the {+driver-short+} automatically substitutes invalid
+   `lone surrogates <https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#utf-16_characters_unicode_code_points_and_grapheme_clusters>`__
+   with the `replacement character <https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/toWellFormed>`__
+   before validation when you send data to MongoDB. Therefore, the validation
    only throws an error when the setting is enabled and the driver
    receives invalid UTF-8 document data from MongoDB.
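
The asymmetry described in the revised note (substitution when encoding, errors only when decoding) can be illustrated with the standard `TextEncoder` and `TextDecoder` globals in Node.js. This is only a sketch of the general behavior and is not the driver's internal BSON code; the variable names are made up for the example, and the strict decoder here merely stands in for a validation setting that is enabled.

```javascript
// Illustration only: standard TextEncoder/TextDecoder globals,
// not the driver's own serialization or validation code.

// A string ending in a lone (unpaired) high surrogate.
const input = 'abc\uD800';

// Encoding direction: the lone surrogate is replaced with U+FFFD
// (the replacement character), so the bytes are always valid UTF-8.
const encoded = new TextEncoder().encode(input);

// Strict decoding of those bytes therefore succeeds.
const strictDecoder = new TextDecoder('utf-8', { fatal: true });
const roundTripped = strictDecoder.decode(encoded);

// Decoding direction: bytes that are not valid UTF-8 (0xFF never
// appears in UTF-8) make a strict decoder throw.
const invalidBytes = new Uint8Array([0x61, 0xff, 0x62]);
let decodeError = null;
try {
  strictDecoder.decode(invalidBytes);
} catch (err) {
  decodeError = err;
}

// A lenient decoder substitutes U+FFFD instead of throwing.
const lenient = new TextDecoder('utf-8').decode(invalidBytes);

console.log(roundTripped === 'abc\uFFFD', decodeError !== null, lenient === 'a\uFFFDb');
```

Because the substitution happens before any bytes are produced, an error can only surface on the decoding side, which matches the wording change in this diff.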
