Improved code generation for `char`s UTF-8/UTF-16 encoding methods #16498

Kimundi · 2014-08-14T17:24:59Z

The first commit improves code generation through a few changes:

The #[inline] attributes allow llvm to constant fold the encoding step away in certain situations. For example, code like this changes from a call to encode_utf8 in a inner loop to the pushing of a byte constant:
```
let mut s = String::new();
for _ in range(0u, 21) {
    s.push_char('a');
}
```
Both methods changed their semantic from causing run time failure if the target buffer is not large enough to returning None instead. This makes llvm no longer emit code for causing failure for these methods.
A few debug assert!() calls got removed because they affected code generation due to unwinding, and where basically unnecessary with today's sound handling of char as a Unicode scalar value.

The second commit is optional. It changes the methods from regular indexing with the dst[i] syntax to unsafe indexing with dst.unsafe_mut_ref(i). This does not change code generation directly - in both cases llvm is smart enough to see that there can never be an out-of-bounds access. But it makes it emit a nounwind attribute for the function.
However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.

This changes how the methods behave on a too small buffer, so this is a

[breaking-change]

lilyball · 2014-08-14T17:44:32Z

I have no objection to the second commit, but it should include a comment explaining why it's using unsafe indexing.

Overall, I think returning 0 is better than failing. The more explicit failure we can get rid of, the better.

huonw · 2014-08-15T10:59:31Z

src/libcore/char.rs

@@ -224,7 +227,10 @@ pub fn len_utf8_bytes(c: char) -> uint {
        _ if code < MAX_TWO_B   => 2u,
        _ if code < MAX_THREE_B => 3u,
        _ if code < MAX_FOUR_B  => 4u,
-        _                       => fail!("invalid character!"),
+        _ => {


I actually think this would be reasonable as:

debug_assert!(code < MAX_FOUR_B, "invalid character!"); match () { _ if code < MAX_ONE_B => 1, _ if code < MAX_TWO_B => 2, _ if code < MAX_THREE_B => 3, _ => 4 }

since it's undefined behaviour for a char to be out of range. (Or even dropping the debug_assert entirely.)

Oh, right that makes more sense :)

alexcrichton · 2014-08-15T21:54:05Z

However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.

Without concrete data showing that unsafe improves performance one way or another, this should probably be left off for now.

The other changes all look good to me though, thanks!

Kimundi · 2014-08-15T22:40:34Z

Okay, I removed the unsafe indexing.

lilyball · 2014-08-16T02:01:44Z

src/libcore/char.rs

            // The BMP falls through (assuming non-surrogate, as it should)
-            assert!(ch <= 0xD7FF_u32 || ch >= 0xE000_u32);
+            debug_assert!(ch <= 0xD7FF_u32 || ch >= 0xE000_u32);


It's undefined behavior for a char to be a surrogate character. I think we should actually just delete the debug_assert!() altogether (and remove the parenthetical from the comment).

Note that encode_utf8() doesn't bother with this test, even though a surrogate character encoding is considered invalid utf-8.

sfackler · 2014-08-16T18:31:47Z

You could change the assert!s to debug_assert!s to keep them around without being forced to pay for the hit everywhere.

- Both can now be inlined and constant folded away - Both can no longer cause failure - Both now return an `Option` instead Removed debug `assert!()`s over the valid ranges of a `char` - It affected optimizations due to unwinding - Char handling is now sound enought that they became uneccessary

Kimundi · 2014-08-16T19:16:52Z

@sfackler: That's what I did before in this PR...

lilyball · 2014-08-16T19:35:26Z

But why keep them around at all? It's literally undefined behavior in Rust for a char to be invalid. No point in asserting that.

The first commit improves code generation through a few changes: - The `#[inline]` attributes allow llvm to constant fold the encoding step away in certain situations. For example, code like this changes from a call to `encode_utf8` in a inner loop to the pushing of a byte constant: ```rust let mut s = String::new(); for _ in range(0u, 21) { s.push_char('a'); } ``` - Both methods changed their semantic from causing run time failure if the target buffer is not large enough to returning `None` instead. This makes llvm no longer emit code for causing failure for these methods. - A few debug `assert!()` calls got removed because they affected code generation due to unwinding, and where basically unnecessary with today's sound handling of `char` as a Unicode scalar value. ~~The second commit is optional. It changes the methods from regular indexing with the `dst[i]` syntax to unsafe indexing with `dst.unsafe_mut_ref(i)`. This does not change code generation directly - in both cases llvm is smart enough to see that there can never be an out-of-bounds access. But it makes it emit a `nounwind` attribute for the function. However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.~~ This changes how the methods behave on a too small buffer, so this is a [breaking-change]

huonw reviewed Aug 15, 2014
View reviewed changes

lilyball reviewed Aug 16, 2014
View reviewed changes

bors closed this Aug 17, 2014

bors merged commit 13079c1 into rust-lang:master Aug 17, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improved code generation for `char`s UTF-8/UTF-16 encoding methods #16498

Improved code generation for `char`s UTF-8/UTF-16 encoding methods #16498

Kimundi commented Aug 14, 2014

Uh oh!

lilyball commented Aug 14, 2014

Uh oh!

huonw Aug 15, 2014

Uh oh!

Kimundi Aug 15, 2014

Uh oh!

alexcrichton commented Aug 15, 2014

Uh oh!

Kimundi commented Aug 15, 2014

Uh oh!

lilyball Aug 16, 2014

Uh oh!

sfackler commented Aug 16, 2014

Uh oh!

Kimundi commented Aug 16, 2014

Uh oh!

lilyball commented Aug 16, 2014

Uh oh!

Uh oh!

Improved code generation for chars UTF-8/UTF-16 encoding methods #16498

Improved code generation for chars UTF-8/UTF-16 encoding methods #16498

Conversation

Kimundi commented Aug 14, 2014

Uh oh!

lilyball commented Aug 14, 2014

Uh oh!

huonw Aug 15, 2014

Choose a reason for hiding this comment

Uh oh!

Kimundi Aug 15, 2014

Choose a reason for hiding this comment

Uh oh!

alexcrichton commented Aug 15, 2014

Uh oh!

Kimundi commented Aug 15, 2014

Uh oh!

lilyball Aug 16, 2014

Choose a reason for hiding this comment

Uh oh!

sfackler commented Aug 16, 2014

Uh oh!

Kimundi commented Aug 16, 2014

Uh oh!

lilyball commented Aug 16, 2014

Uh oh!

Uh oh!

Improved code generation for `char`s UTF-8/UTF-16 encoding methods #16498

Improved code generation for `char`s UTF-8/UTF-16 encoding methods #16498