-
Notifications
You must be signed in to change notification settings - Fork 13.5k
Improved code generation for char
s UTF-8/UTF-16 encoding methods
#16498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I have no objection to the second commit, but it should include a comment explaining why it's using unsafe indexing. Overall, I think returning |
@@ -224,7 +227,10 @@ pub fn len_utf8_bytes(c: char) -> uint { | |||
_ if code < MAX_TWO_B => 2u, | |||
_ if code < MAX_THREE_B => 3u, | |||
_ if code < MAX_FOUR_B => 4u, | |||
_ => fail!("invalid character!"), | |||
_ => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think this would be reasonable as:
debug_assert!(code < MAX_FOUR_B, "invalid character!");
match () {
_ if code < MAX_ONE_B => 1,
_ if code < MAX_TWO_B => 2,
_ if code < MAX_THREE_B => 3,
_ => 4
}
since it's undefined behaviour for a char
to be out of range. (Or even dropping the debug_assert
entirely.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right that makes more sense :)
Without concrete data showing that The other changes all look good to me though, thanks! |
Okay, I removed the unsafe indexing. |
// The BMP falls through (assuming non-surrogate, as it should) | ||
assert!(ch <= 0xD7FF_u32 || ch >= 0xE000_u32); | ||
debug_assert!(ch <= 0xD7FF_u32 || ch >= 0xE000_u32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's undefined behavior for a char
to be a surrogate character. I think we should actually just delete the debug_assert!()
altogether (and remove the parenthetical from the comment).
Note that encode_utf8()
doesn't bother with this test, even though a surrogate character encoding is considered invalid utf-8.
You could change the |
- Both can now be inlined and constant folded away - Both can no longer cause failure - Both now return an `Option` instead Removed debug `assert!()`s over the valid ranges of a `char` - It affected optimizations due to unwinding - Char handling is now sound enought that they became uneccessary
@sfackler: That's what I did before in this PR... |
But why keep them around at all? It's literally undefined behavior in Rust for a char to be invalid. No point in asserting that. |
The first commit improves code generation through a few changes: - The `#[inline]` attributes allow llvm to constant fold the encoding step away in certain situations. For example, code like this changes from a call to `encode_utf8` in a inner loop to the pushing of a byte constant: ```rust let mut s = String::new(); for _ in range(0u, 21) { s.push_char('a'); } ``` - Both methods changed their semantic from causing run time failure if the target buffer is not large enough to returning `None` instead. This makes llvm no longer emit code for causing failure for these methods. - A few debug `assert!()` calls got removed because they affected code generation due to unwinding, and where basically unnecessary with today's sound handling of `char` as a Unicode scalar value. ~~The second commit is optional. It changes the methods from regular indexing with the `dst[i]` syntax to unsafe indexing with `dst.unsafe_mut_ref(i)`. This does not change code generation directly - in both cases llvm is smart enough to see that there can never be an out-of-bounds access. But it makes it emit a `nounwind` attribute for the function. However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.~~ This changes how the methods behave on a too small buffer, so this is a [breaking-change]
The first commit improves code generation through a few changes:
The
#[inline]
attributes allow llvm to constant fold the encoding step away in certain situations. For example, code like this changes from a call toencode_utf8
in a inner loop to the pushing of a byte constant:Both methods changed their semantic from causing run time failure if the target buffer is not large enough to returning
None
instead. This makes llvm no longer emit code for causing failure for these methods.A few debug
assert!()
calls got removed because they affected code generation due to unwinding, and where basically unnecessary with today's sound handling ofchar
as a Unicode scalar value.The second commit is optional. It changes the methods from regular indexing with thedst[i]
syntax to unsafe indexing withdst.unsafe_mut_ref(i)
. This does not change code generation directly - in both cases llvm is smart enough to see that there can never be an out-of-bounds access. But it makes it emit anounwind
attribute for the function.However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.
This changes how the methods behave on a too small buffer, so this is a
[breaking-change]