Skip to content

Improved code generation for chars UTF-8/UTF-16 encoding methods #16498

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 17, 2014

Conversation

Kimundi
Copy link
Member

@Kimundi Kimundi commented Aug 14, 2014

The first commit improves code generation through a few changes:

  • The #[inline] attributes allow llvm to constant fold the encoding step away in certain situations. For example, code like this changes from a call to encode_utf8 in a inner loop to the pushing of a byte constant:

    let mut s = String::new();
    for _ in range(0u, 21) {
        s.push_char('a');
    }
  • Both methods changed their semantic from causing run time failure if the target buffer is not large enough to returning None instead. This makes llvm no longer emit code for causing failure for these methods.

  • A few debug assert!() calls got removed because they affected code generation due to unwinding, and where basically unnecessary with today's sound handling of char as a Unicode scalar value.

The second commit is optional. It changes the methods from regular indexing with the dst[i] syntax to unsafe indexing with dst.unsafe_mut_ref(i). This does not change code generation directly - in both cases llvm is smart enough to see that there can never be an out-of-bounds access. But it makes it emit a nounwind attribute for the function.
However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.

This changes how the methods behave on a too small buffer, so this is a

[breaking-change]

@lilyball
Copy link
Contributor

I have no objection to the second commit, but it should include a comment explaining why it's using unsafe indexing.

Overall, I think returning 0 is better than failing. The more explicit failure we can get rid of, the better.

@@ -224,7 +227,10 @@ pub fn len_utf8_bytes(c: char) -> uint {
_ if code < MAX_TWO_B => 2u,
_ if code < MAX_THREE_B => 3u,
_ if code < MAX_FOUR_B => 4u,
_ => fail!("invalid character!"),
_ => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this would be reasonable as:

debug_assert!(code < MAX_FOUR_B, "invalid character!");
match () {
    _ if code < MAX_ONE_B => 1,
    _ if code < MAX_TWO_B => 2,
    _ if code < MAX_THREE_B => 3,
    _ => 4
}

since it's undefined behaviour for a char to be out of range. (Or even dropping the debug_assert entirely.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right that makes more sense :)

@alexcrichton
Copy link
Member

However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.

Without concrete data showing that unsafe improves performance one way or another, this should probably be left off for now.

The other changes all look good to me though, thanks!

@Kimundi
Copy link
Member Author

Kimundi commented Aug 15, 2014

Okay, I removed the unsafe indexing.

// The BMP falls through (assuming non-surrogate, as it should)
assert!(ch <= 0xD7FF_u32 || ch >= 0xE000_u32);
debug_assert!(ch <= 0xD7FF_u32 || ch >= 0xE000_u32);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's undefined behavior for a char to be a surrogate character. I think we should actually just delete the debug_assert!() altogether (and remove the parenthetical from the comment).

Note that encode_utf8() doesn't bother with this test, even though a surrogate character encoding is considered invalid utf-8.

@sfackler
Copy link
Member

You could change the assert!s to debug_assert!s to keep them around without being forced to pay for the hit everywhere.

- Both can now be inlined and constant folded away
- Both can no longer cause failure
- Both now return an `Option` instead

Removed debug `assert!()`s over the valid ranges of a `char`
- It affected optimizations due to unwinding
- Char handling is now sound enought that they became uneccessary
@Kimundi
Copy link
Member Author

Kimundi commented Aug 16, 2014

@sfackler: That's what I did before in this PR...

@lilyball
Copy link
Contributor

But why keep them around at all? It's literally undefined behavior in Rust for a char to be invalid. No point in asserting that.

bors added a commit that referenced this pull request Aug 17, 2014
The first commit improves code generation through a few changes:
- The `#[inline]` attributes allow llvm to constant fold the encoding step away in certain situations. For example, code like this changes from a call to `encode_utf8` in a inner loop to the pushing of a byte constant:

 ```rust
let mut s = String::new();
for _ in range(0u, 21) {
        s.push_char('a');
}
```
- Both methods changed their semantic from causing run time failure if the target buffer is not large enough to returning `None` instead. This makes llvm no longer emit code for causing failure for these methods.
- A few debug `assert!()` calls got removed because they affected code generation due to unwinding, and where basically unnecessary with today's sound handling of `char` as a Unicode scalar value.

~~The second commit is optional. It changes the methods from regular indexing with the `dst[i]` syntax to unsafe indexing with `dst.unsafe_mut_ref(i)`. This does not change code generation directly - in both cases llvm is smart enough to see that there can never be an out-of-bounds access. But it makes it emit a `nounwind` attribute for the function. 
However, I'm not sure whether that is a real improvement, so if there is any objection to this I'll remove the commit.~~

This changes how the methods behave on a too small buffer, so this is a 

[breaking-change]
@bors bors closed this Aug 17, 2014
@bors bors merged commit 13079c1 into rust-lang:master Aug 17, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants