Skip to content

CDRIVER-4119 Replace ICU SASLprep with internal implementation #1317

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Jul 13, 2023

Conversation

joshbsiegel
Copy link
Contributor

@joshbsiegel joshbsiegel commented Jun 21, 2023

Summary

This PR will replace the ICU library that was used for SASLprep with an internal implementation of SASLprep. Once merged, a separate PR will be made to remove the ENABLE_ICU option and related code.

Background

SASLprep defines the steps required to prepare usernames and passwords for comparison and authentication. This is necessary as the vast amount of UTF-8 characters that a user can input for a password presents the need for some standardization. For the C driver, we use it to prepare passwords when authenticating with SCRAM-SHA-256. SASLprep is simply a profile built upon stringprep that explains specifically what parts of the stringprep document to abide by. From a high level, here is what is required by SASLprep:

  1. Mapping: There are some characters that are similar to a space character and should be mapped to that. There are also characters that should be mapped to nothing (deletion).
  2. Normalization: SASLprep requires Unicode normalization form KC. There are many changes that this normalization will make. For example, a character such as "Ⅳ" (one character) will be modified to "IV" (two characters).
  3. Prohibited characters: Some characters are fully prohibited.
  4. Bidirectional characters: While most characters are left to right, there are some characters that are right to left. These characters are allowed but must be checked for some requirements (see code comments).

Previously SASLprep was done with the ICU library and the goal of this ticket is to get rid of that dependency.

utf8proc

The normalization described above is the most involved step and we opted to use a library to handle it. utf8proc is bundled as a part of this ticket and handles this.

Preliminary Work

All of the references to characters in the stringprep and SASLprep RFC documents (such as what characters are mapped to what, prohibited characters, etc.) are displayed using their Unicode codepoints. This is not the same as a character's byte encoding in UTF-8 (see here). Because of this, all of the lookup tables that are required use Unicode codepoints, which is different than the const char *in_utf8 parameter in the existing _mongoc_sasl_prep_impl function. As such, it was necessary to convert the UTF-8 characters to their Unicode codepoints before doing any other work.

UTF-8 characters can be 1-4 bytes, so that is why _mongoc_utf8_string_length and _mongoc_utf8_char_length were added instead of using strlen (which will return the number of bytes in the string). The length of each UTF8 character can be determined by the first byte. _mongoc_utf8_to_unicode handles the conversion.

Lookup Tables

The mapping, prohibited characters, and bidirectional steps all use lookup tables to easily find whether a codepoint belongs in a particular category. The tables work by using ranges, where the 2n element is the lower range and the 2n + 1 element is the upper range (both inclusive). _mongoc_is_code_in_table will search through the table passed through and return whether the codepoint exists in the table. There might be a potential improvement here as I believe a binary search could be implemented for a faster lookup.

Misc

All of the tables and much of the instructions on how to perform SASLprep actually come from the stringprep RFC document. The SASLprep document just specifies what parts of stringprep to follow (which is most of it).

What's New

  • _mongoc_sasl_prep_impl was almost completely scrapped for the SASLprep steps listed above
  • _mongoc_utf8_char_length
  • _mongoc_utf8_string_length
  • _mongoc_utf8_is_valid
  • _mongoc_char_between_chars
  • _mongoc_is_code_in_table
  • _mongoc_utf8_to_unicode
  • _mongoc_unicode_to_utf8
  • _mongoc_unicode_codepoint_length
  • Tests added to test_mongoc_scram_sasl_prep
  • test_mongoc_utf8_char_length
  • test_mongoc_utf8_string_length
  • test_mongoc_utf8_to_unicode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were a few functions and tables that I needed to add to this file (and mongoc-scram.cpp) and I believe they belong elsewhere to improve organization but I wasn't sure where to put them as I wanted to follow the codebase structure/protocol. Please let me know what suggestions you have on where to place these additions and it should be an easy change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest moving the declarations only needed by mongoc-scram.c into mongoc-scram.c (e.g. _mongoc_utf8_is_valid). Declaring in the header may suggest they are intended to be used in other files.

Suggest adding a comment to symbols that are only included in the header for testing purposes (e.g. _mongoc_utf8_char_length ): is exposed for testing.

@joshbsiegel joshbsiegel requested a review from kevinAlbs June 21, 2023 18:01
Copy link
Collaborator

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest splitting the bundling of utf8proc into a separate PR. That may help keep the scope of the review smaller and to get passing on CI.

I have an open question about the bundling of utf8proc and I want input from others on the team. I can post this open question on the separate PR:

  • If utf8proc is bundled, is an option needed to use a system install of utf8proc?
    • Would Debian and other package managers require having an option to use a system install of utf8proc if available?
    • IMO utf8proc is a small library. I prefer not having another option between using a system install or the bundled if it is not needed.

COPYONLY
)
set (UTF8PROC_INCLUDE_DIRS "")
# if (ENABLE_UTF8PROC MATCHES "SYSTEM|AUTO")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented out section.

@joshbsiegel joshbsiegel changed the title CDRIVER-4119 Replace ICU SASLPrep with internal implementation CDRIVER-4119 Replace ICU SASLprep with internal implementation Jun 27, 2023
int
_mongoc_unicode_to_utf8 (uint32_t c, uint8_t *out);

/* the tables below all fome from RFC 3454. They are all range tables, with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* the tables below all fome from RFC 3454. They are all range tables, with
/* the tables below all come from RFC 3454. They are all range tables, with

@@ -123,6 +123,348 @@ _mongoc_sasl_prep_required (const char *str);
char *
_mongoc_sasl_prep (const char *in_utf8, int in_utf8_len, bson_error_t *err);

/* returns how many bytes a UTF-8 character is. */
size_t
_mongoc_utf8_char_length (const uint8_t *c);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion to use/cast-to uint8_t was intended to be within the implementation, not in the interface, of the utf8 functions. Keeping the interface char* will avoid requiring users to explicitly cast to uint8_t as currently being done in the updated test cases.

Suggest reverting the function declarations to use char* while still defining their operations in terms of uint8_t in their implementation.

const size_t utf8_char_length = _mongoc_utf8_char_length (c);

if (!_mongoc_utf8_is_valid (c, utf8_char_length))
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we really must return a special value to indicate invalid lengths (instead of requiring/asserting that the given UTF-8 string is valid and length computation will always succeed), may need to use ssize_t for the return type of this function to handle negative values.


/* converts a Unicode code point to UTF-8 character. Returns how many bytes the
* character converted is. */
int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either size_t (if we forbid invalid UTF-8 strings as input) or ssize_t (if we must return -1 on invalid input).

/* converts a Unicode code point to UTF-8 character. Returns how many bytes the
* character converted is. */
int
_mongoc_unicode_to_utf8 (uint32_t c, uint8_t *out);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_mongoc_unicode_to_utf8 (uint32_t c, uint8_t *out);
_mongoc_utf8_code_point_to_str (uint32_t c, uint8_t *out);

Suggest renaming to more accurate description + consistency with other UTF-8 functions.

@@ -1035,12 +1036,12 @@ _mongoc_sasl_prep_impl (const char *name,
int in_utf8_len,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter is now unused: should _mongoc_utf8_string_length account for the possibility of in_utf8_len < strlen(in_utf8)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little unsure on what you mean by this. In all cases of _mongoc_sasl_prep_impl being called, in_utf8_len is passed through as strlen(in_utf8), meaning that in_utf8_len == strlen(in_utf8).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I suggest removing the redundant/unused parameter from _mongoc_sasl_prep and _mongoc_sasl_prep_impl to address the unused parameter warning(s).

* is that the 2*n element is the lower bound and the 2*n + 1 is the upper bound
* (both inclusive). */
bool
_mongoc_is_code_point_in_table (uint32_t code,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_mongoc_is_code_point_in_table (uint32_t code,
_mongoc_utf8_code_point_is_in_table (uint32_t code,

Suggest renaming for consistency with other UTF-8 functions.


/* returns how many bytes a unicode codepoint is. */
size_t
_mongoc_unicode_codepoint_length (uint32_t c);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_mongoc_unicode_codepoint_length (uint32_t c);
_mongoc_utf8_code_point_length (uint32_t c);

Suggest renaming for consistency with other UTF-8 functions.


uint8_t *loc = utf8_pre_norm;
for (size_t i = 0; i < num_chars; ++i) {
utf8_char_length = _mongoc_unicode_to_utf8 (utf8_codepoints[i], loc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing validity check.

Comment on lines 1118 to 1119
utf8_pre_norm_len +=
_mongoc_unicode_codepoint_length (utf8_codepoints[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing validity check.

@joshbsiegel joshbsiegel requested a review from eramongodb July 11, 2023 19:52
Copy link
Contributor

@eramongodb eramongodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions; otherwise, LGTM.


const uint8_t *c = (uint8_t *) s;

size_t str_length = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
size_t str_length = 0;
ssize_t str_length = 0;

Consistent signedness with return value.

static void
test_mongoc_utf8_string_length (void)
{
ASSERT_CMPINT (_mongoc_utf8_string_length (",ase"), ==, 4u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ASSERT_CMPINT (_mongoc_utf8_string_length (",ase"), ==, 4u);
ASSERT_CMPSIZE_T (_mongoc_utf8_string_length (",ase"), ==, 4u);

Use ASSERT_CMPSIZE_T to compare size_t.

static void
test_mongoc_utf8_to_unicode (void)
{
ASSERT_CMPINT (_mongoc_utf8_get_first_code_point (",", 1), ==, 0x002C);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ASSERT_CMPINT (_mongoc_utf8_get_first_code_point (",", 1), ==, 0x002C);
ASSERT_CMPUINT32 (_mongoc_utf8_get_first_code_point (",", 1), ==, 0x002C);

Use ASSERT_CMPUINT32 to compare uint32_t.

size_t
_mongoc_utf8_char_length (const char *c);

/* returns how many characters are in a UTF-8 string. Returns -1 on error. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* returns how many characters are in a UTF-8 string. Returns -1 on error. */
/* Returns the byte length of the UTF-8 string. Returns -1 if `s` is not a valid
* UTF-8 string. */

_mongoc_sasl_prep (const char *in_utf8, int in_utf8_len, bson_error_t *err);
_mongoc_sasl_prep (const char *in_utf8, bson_error_t *err);

/* returns how many bytes a UTF-8 character is. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* returns how many bytes a UTF-8 character is. */
/* Returns the byte length of the first UTF-8 code point in `s`. */

ssize_t
_mongoc_utf8_string_length (const char *s);

/* returns whether a UTF-8 character is valid or not. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* returns whether a UTF-8 character is valid or not. */
/* Returns true if the first UTF-8 code point in `s` is valid. */

const uint32_t *table,
size_t size);

/* returns the first Unicode codepoint of a UTF-8 character. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* returns the first Unicode codepoint of a UTF-8 character. */
/* Returns the first Unicode code point in `c`. Returns 0 if length is 0.
* `c` must be a valid UTF-8 string. */

Comment on lines 156 to 157
/* returns how many bytes a unicode codepoint is. Returns -1 on invalid code
* point. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* returns how many bytes a unicode codepoint is. Returns -1 on invalid code
* point. */
/* Returns the byte length of the UTF-8 code point. Returns -1 if `c` is not a
* valid UTF-8 code point. */

@joshbsiegel joshbsiegel requested a review from kevinAlbs July 11, 2023 20:36
@joshbsiegel joshbsiegel removed the request for review from vector-of-bool July 13, 2023 18:29
Copy link
Collaborator

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. LGTM with minor comments addressed.

Comment on lines 170 to 173
/* the tables below all come from RFC 3454. They are all range tables, with
* value 2*n being the lower range, and value 2*n + 1 being the upper range.
* Both ranges are exclusive.
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* the tables below all come from RFC 3454. They are all range tables, with
* value 2*n being the lower range, and value 2*n + 1 being the upper range.
* Both ranges are exclusive.
*/
/* the tables below all come from RFC 3454. They are all range tables, with
* value 2*n being the lower bound, and value 2*n + 1 being the upper bound.
* Both bounds are inclusive.
*/

Suggest using bound to refer to the range bounds. Comment above _mongoc_utf8_code_point_is_in_table and implementation suggests the bounds are inclusive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest moving the declarations only needed by mongoc-scram.c into mongoc-scram.c (e.g. _mongoc_utf8_is_valid). Declaring in the header may suggest they are intended to be used in other files.

Suggest adding a comment to symbols that are only included in the header for testing purposes (e.g. _mongoc_utf8_char_length ): is exposed for testing.

@@ -121,7 +121,354 @@ _mongoc_sasl_prep_required (const char *str);
* null on error and sets err.
* `name` should be "username" or "password". */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `name` should be "username" or "password". */


/* returns the byte length of the first UTF-8 code point in `s`. */
size_t
_mongoc_utf8_char_length (const char *c);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_mongoc_utf8_char_length (const char *c);
_mongoc_utf8_char_length (const char *s);


/* returns true if the first UTF-8 code point in `s` is valid. */
bool
_mongoc_utf8_is_valid (const char *c, size_t length);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_mongoc_utf8_is_valid (const char *c, size_t length);
_mongoc_utf8_first_code_point_is_valid (const char *s, size_t length);

Suggest renaming to emphasize only the first code point is considered, not the full string.

if (error_code) {
/* convert to unicode. */
utf8_codepoints = bson_malloc (sizeof (uint32_t) *
(num_chars + 1)); /* add one for null byte. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(num_chars + 1)); /* add one for null byte. */
(num_chars + 1)); /* add one for trailing 0 value. */

Last value in utf8_codepoints is a 0 uint32_t, not one byte.

#ifdef MONGOC_ENABLE_ICU
return _mongoc_sasl_prep_impl ("password", in_utf8, in_utf8_len, err);
return _mongoc_sasl_prep_impl ("password", in_utf8, err);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return _mongoc_sasl_prep_impl ("password", in_utf8, err);
if (_mongoc_sasl_prep_required (in_utf8)) {
return _mongoc_sasl_prep_impl ("password", in_utf8, err);
}
return bson_strdup (in_utf8);

Suggest only running _mongoc_sasl_prep_impl if a password includes non-ASCII characters. I expect the majority of passwords are ASCII only characters and do not need to undergo SASLPrep.

num_chars = curr;


// b. Normalize - normalize the result of step 1 using Unicode
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// b. Normalize - normalize the result of step 1 using Unicode
// b. Normalize - normalize the result of step `a` using Unicode

Comment on lines 1191 to 1192
if (contains_RandALCar)
break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (contains_RandALCar)
break;

Suggest removing unnecessary early break for consistency with assignment of contains_RandALCar = true;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants