-
Notifications
You must be signed in to change notification settings - Fork 455
CDRIVER-4119 Replace ICU SASLprep with internal implementation #1317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were a few functions and tables that I needed to add to this file (and mongoc-scram.cpp
) and I believe they belong elsewhere to improve organization but I wasn't sure where to put them as I wanted to follow the codebase structure/protocol. Please let me know what suggestions you have on where to place these additions and it should be an easy change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest moving the declarations only needed by mongoc-scram.c into mongoc-scram.c (e.g. _mongoc_utf8_is_valid
). Declaring in the header may suggest they are intended to be used in other files.
Suggest adding a comment to symbols that are only included in the header for testing purposes (e.g. _mongoc_utf8_char_length
): is exposed for testing
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest splitting the bundling of utf8proc
into a separate PR. That may help keep the scope of the review smaller and to get passing on CI.
I have an open question about the bundling of utf8proc
and I want input from others on the team. I can post this open question on the separate PR:
- If
utf8proc
is bundled, is an option needed to use a system install ofutf8proc
?- Would Debian and other package managers require having an option to use a system install of
utf8proc
if available? - IMO
utf8proc
is a small library. I prefer not having another option between using a system install or the bundled if it is not needed.
- Would Debian and other package managers require having an option to use a system install of
src/libmongoc/CMakeLists.txt
Outdated
COPYONLY | ||
) | ||
set (UTF8PROC_INCLUDE_DIRS "") | ||
# if (ENABLE_UTF8PROC MATCHES "SYSTEM|AUTO") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented out section.
int | ||
_mongoc_unicode_to_utf8 (uint32_t c, uint8_t *out); | ||
|
||
/* the tables below all fome from RFC 3454. They are all range tables, with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* the tables below all fome from RFC 3454. They are all range tables, with | |
/* the tables below all come from RFC 3454. They are all range tables, with |
@@ -123,6 +123,348 @@ _mongoc_sasl_prep_required (const char *str); | |||
char * | |||
_mongoc_sasl_prep (const char *in_utf8, int in_utf8_len, bson_error_t *err); | |||
|
|||
/* returns how many bytes a UTF-8 character is. */ | |||
size_t | |||
_mongoc_utf8_char_length (const uint8_t *c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggestion to use/cast-to uint8_t
was intended to be within the implementation, not in the interface, of the utf8
functions. Keeping the interface char*
will avoid requiring users to explicitly cast to uint8_t
as currently being done in the updated test cases.
Suggest reverting the function declarations to use char*
while still defining their operations in terms of uint8_t
in their implementation.
const size_t utf8_char_length = _mongoc_utf8_char_length (c); | ||
|
||
if (!_mongoc_utf8_is_valid (c, utf8_char_length)) | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we really must return a special value to indicate invalid lengths (instead of requiring/asserting that the given UTF-8 string is valid and length computation will always succeed), may need to use ssize_t
for the return type of this function to handle negative values.
|
||
/* converts a Unicode code point to UTF-8 character. Returns how many bytes the | ||
* character converted is. */ | ||
int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either size_t
(if we forbid invalid UTF-8 strings as input) or ssize_t
(if we must return -1
on invalid input).
/* converts a Unicode code point to UTF-8 character. Returns how many bytes the | ||
* character converted is. */ | ||
int | ||
_mongoc_unicode_to_utf8 (uint32_t c, uint8_t *out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_mongoc_unicode_to_utf8 (uint32_t c, uint8_t *out); | |
_mongoc_utf8_code_point_to_str (uint32_t c, uint8_t *out); |
Suggest renaming to more accurate description + consistency with other UTF-8 functions.
@@ -1035,12 +1036,12 @@ _mongoc_sasl_prep_impl (const char *name, | |||
int in_utf8_len, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parameter is now unused: should _mongoc_utf8_string_length
account for the possibility of in_utf8_len < strlen(in_utf8)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little unsure on what you mean by this. In all cases of _mongoc_sasl_prep_impl
being called, in_utf8_len
is passed through as strlen(in_utf8)
, meaning that in_utf8_len == strlen(in_utf8)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I suggest removing the redundant/unused parameter from _mongoc_sasl_prep
and _mongoc_sasl_prep_impl
to address the unused parameter warning(s).
* is that the 2*n element is the lower bound and the 2*n + 1 is the upper bound | ||
* (both inclusive). */ | ||
bool | ||
_mongoc_is_code_point_in_table (uint32_t code, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_mongoc_is_code_point_in_table (uint32_t code, | |
_mongoc_utf8_code_point_is_in_table (uint32_t code, |
Suggest renaming for consistency with other UTF-8 functions.
|
||
/* returns how many bytes a unicode codepoint is. */ | ||
size_t | ||
_mongoc_unicode_codepoint_length (uint32_t c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_mongoc_unicode_codepoint_length (uint32_t c); | |
_mongoc_utf8_code_point_length (uint32_t c); |
Suggest renaming for consistency with other UTF-8 functions.
|
||
uint8_t *loc = utf8_pre_norm; | ||
for (size_t i = 0; i < num_chars; ++i) { | ||
utf8_char_length = _mongoc_unicode_to_utf8 (utf8_codepoints[i], loc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing validity check.
utf8_pre_norm_len += | ||
_mongoc_unicode_codepoint_length (utf8_codepoints[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing validity check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestions; otherwise, LGTM.
|
||
const uint8_t *c = (uint8_t *) s; | ||
|
||
size_t str_length = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size_t str_length = 0; | |
ssize_t str_length = 0; |
Consistent signedness with return value.
static void | ||
test_mongoc_utf8_string_length (void) | ||
{ | ||
ASSERT_CMPINT (_mongoc_utf8_string_length (",ase"), ==, 4u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASSERT_CMPINT (_mongoc_utf8_string_length (",ase"), ==, 4u); | |
ASSERT_CMPSIZE_T (_mongoc_utf8_string_length (",ase"), ==, 4u); |
Use ASSERT_CMPSIZE_T
to compare size_t
.
static void | ||
test_mongoc_utf8_to_unicode (void) | ||
{ | ||
ASSERT_CMPINT (_mongoc_utf8_get_first_code_point (",", 1), ==, 0x002C); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASSERT_CMPINT (_mongoc_utf8_get_first_code_point (",", 1), ==, 0x002C); | |
ASSERT_CMPUINT32 (_mongoc_utf8_get_first_code_point (",", 1), ==, 0x002C); |
Use ASSERT_CMPUINT32
to compare uint32_t
.
size_t | ||
_mongoc_utf8_char_length (const char *c); | ||
|
||
/* returns how many characters are in a UTF-8 string. Returns -1 on error. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* returns how many characters are in a UTF-8 string. Returns -1 on error. */ | |
/* Returns the byte length of the UTF-8 string. Returns -1 if `s` is not a valid | |
* UTF-8 string. */ |
_mongoc_sasl_prep (const char *in_utf8, int in_utf8_len, bson_error_t *err); | ||
_mongoc_sasl_prep (const char *in_utf8, bson_error_t *err); | ||
|
||
/* returns how many bytes a UTF-8 character is. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* returns how many bytes a UTF-8 character is. */ | |
/* Returns the byte length of the first UTF-8 code point in `s`. */ |
ssize_t | ||
_mongoc_utf8_string_length (const char *s); | ||
|
||
/* returns whether a UTF-8 character is valid or not. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* returns whether a UTF-8 character is valid or not. */ | |
/* Returns true if the first UTF-8 code point in `s` is valid. */ |
const uint32_t *table, | ||
size_t size); | ||
|
||
/* returns the first Unicode codepoint of a UTF-8 character. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* returns the first Unicode codepoint of a UTF-8 character. */ | |
/* Returns the first Unicode code point in `c`. Returns 0 if length is 0. | |
* `c` must be a valid UTF-8 string. */ |
/* returns how many bytes a unicode codepoint is. Returns -1 on invalid code | ||
* point. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* returns how many bytes a unicode codepoint is. Returns -1 on invalid code | |
* point. */ | |
/* Returns the byte length of the UTF-8 code point. Returns -1 if `c` is not a | |
* valid UTF-8 code point. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work. LGTM with minor comments addressed.
/* the tables below all come from RFC 3454. They are all range tables, with | ||
* value 2*n being the lower range, and value 2*n + 1 being the upper range. | ||
* Both ranges are exclusive. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* the tables below all come from RFC 3454. They are all range tables, with | |
* value 2*n being the lower range, and value 2*n + 1 being the upper range. | |
* Both ranges are exclusive. | |
*/ | |
/* the tables below all come from RFC 3454. They are all range tables, with | |
* value 2*n being the lower bound, and value 2*n + 1 being the upper bound. | |
* Both bounds are inclusive. | |
*/ |
Suggest using bound
to refer to the range bounds. Comment above _mongoc_utf8_code_point_is_in_table
and implementation suggests the bounds are inclusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest moving the declarations only needed by mongoc-scram.c into mongoc-scram.c (e.g. _mongoc_utf8_is_valid
). Declaring in the header may suggest they are intended to be used in other files.
Suggest adding a comment to symbols that are only included in the header for testing purposes (e.g. _mongoc_utf8_char_length
): is exposed for testing
.
@@ -121,7 +121,354 @@ _mongoc_sasl_prep_required (const char *str); | |||
* null on error and sets err. | |||
* `name` should be "username" or "password". */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `name` should be "username" or "password". */ |
|
||
/* returns the byte length of the first UTF-8 code point in `s`. */ | ||
size_t | ||
_mongoc_utf8_char_length (const char *c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_mongoc_utf8_char_length (const char *c); | |
_mongoc_utf8_char_length (const char *s); |
|
||
/* returns true if the first UTF-8 code point in `s` is valid. */ | ||
bool | ||
_mongoc_utf8_is_valid (const char *c, size_t length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_mongoc_utf8_is_valid (const char *c, size_t length); | |
_mongoc_utf8_first_code_point_is_valid (const char *s, size_t length); |
Suggest renaming to emphasize only the first code point is considered, not the full string.
if (error_code) { | ||
/* convert to unicode. */ | ||
utf8_codepoints = bson_malloc (sizeof (uint32_t) * | ||
(num_chars + 1)); /* add one for null byte. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(num_chars + 1)); /* add one for null byte. */ | |
(num_chars + 1)); /* add one for trailing 0 value. */ |
Last value in utf8_codepoints
is a 0 uint32_t, not one byte.
#ifdef MONGOC_ENABLE_ICU | ||
return _mongoc_sasl_prep_impl ("password", in_utf8, in_utf8_len, err); | ||
return _mongoc_sasl_prep_impl ("password", in_utf8, err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return _mongoc_sasl_prep_impl ("password", in_utf8, err); | |
if (_mongoc_sasl_prep_required (in_utf8)) { | |
return _mongoc_sasl_prep_impl ("password", in_utf8, err); | |
} | |
return bson_strdup (in_utf8); |
Suggest only running _mongoc_sasl_prep_impl
if a password includes non-ASCII characters. I expect the majority of passwords are ASCII only characters and do not need to undergo SASLPrep.
num_chars = curr; | ||
|
||
|
||
// b. Normalize - normalize the result of step 1 using Unicode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// b. Normalize - normalize the result of step 1 using Unicode | |
// b. Normalize - normalize the result of step `a` using Unicode |
if (contains_RandALCar) | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (contains_RandALCar) | |
break; |
Suggest removing unnecessary early break for consistency with assignment of contains_RandALCar = true;
Summary
This PR will replace the ICU library that was used for SASLprep with an internal implementation of SASLprep. Once merged, a separate PR will be made to remove the
ENABLE_ICU
option and related code.Background
SASLprep defines the steps required to prepare usernames and passwords for comparison and authentication. This is necessary as the vast amount of UTF-8 characters that a user can input for a password presents the need for some standardization. For the C driver, we use it to prepare passwords when authenticating with SCRAM-SHA-256. SASLprep is simply a profile built upon stringprep that explains specifically what parts of the stringprep document to abide by. From a high level, here is what is required by SASLprep:
Previously SASLprep was done with the ICU library and the goal of this ticket is to get rid of that dependency.
utf8proc
The normalization described above is the most involved step and we opted to use a library to handle it. utf8proc is bundled as a part of this ticket and handles this.
Preliminary Work
All of the references to characters in the stringprep and SASLprep RFC documents (such as what characters are mapped to what, prohibited characters, etc.) are displayed using their Unicode codepoints. This is not the same as a character's byte encoding in UTF-8 (see here). Because of this, all of the lookup tables that are required use Unicode codepoints, which is different than the
const char *in_utf8
parameter in the existing_mongoc_sasl_prep_impl
function. As such, it was necessary to convert the UTF-8 characters to their Unicode codepoints before doing any other work.UTF-8 characters can be 1-4 bytes, so that is why
_mongoc_utf8_string_length
and_mongoc_utf8_char_length
were added instead of usingstrlen
(which will return the number of bytes in the string). The length of each UTF8 character can be determined by the first byte._mongoc_utf8_to_unicode
handles the conversion.Lookup Tables
The mapping, prohibited characters, and bidirectional steps all use lookup tables to easily find whether a codepoint belongs in a particular category. The tables work by using ranges, where the
2n
element is the lower range and the2n + 1
element is the upper range (both inclusive)._mongoc_is_code_in_table
will search through the table passed through and return whether the codepoint exists in the table. There might be a potential improvement here as I believe a binary search could be implemented for a faster lookup.Misc
All of the tables and much of the instructions on how to perform SASLprep actually come from the stringprep RFC document. The SASLprep document just specifies what parts of stringprep to follow (which is most of it).
What's New
_mongoc_sasl_prep_impl
was almost completely scrapped for the SASLprep steps listed above_mongoc_utf8_char_length
_mongoc_utf8_string_length
_mongoc_utf8_is_valid
_mongoc_char_between_chars
_mongoc_is_code_in_table
_mongoc_utf8_to_unicode
_mongoc_unicode_to_utf8
_mongoc_unicode_codepoint_length
test_mongoc_scram_sasl_prep
test_mongoc_utf8_char_length
test_mongoc_utf8_string_length
test_mongoc_utf8_to_unicode