Skip to content

[stdlib] Rewrite UTF8 decoding #1525

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 22, 2016

Conversation

PatrickPijnappel
Copy link
Contributor

What's in this pull request?

Expanding upon #1477, this is a thorough rewrite of UTF8.decode() for optimization, code simplification and clarity.

  • Merged maximal subpart determination and conversion to code point with _isValidUTF8() (renamed to _decodeOne()), significantly reducing total amount of code and eliminating unnecessary branching.
  • Very significant ASCII decoding speed-up by skipping the buffer.
  • Overall performance improvement, branch elimination and code simplification.
  • Added several clarification comments throughout.
  • Expanded validation testing to count sequence lengths and maximal subparts.
  • Validated against original implementation for all inputs up to 4 code units (i.e. all 2^32 possibilities).

Before merging this pull request to apple/swift repository:

  • Test pull request on Swift continuous integration.

Triggering Swift CI

The swift-ci is triggered by writing a comment on this PR addressed to the GitHub user @swift-ci. Different tests will run depending on the specific comment that you use. The currently available comments are:

Smoke Testing

Platform Comment
All supported platforms @swift-ci Please smoke test
OS X platform @swift-ci Please smoke test OS X platform
Linux platform @swift-ci Please smoke test Linux platform

Validation Testing

Platform Comment
All supported platforms @swift-ci Please test
OS X platform @swift-ci Please test OS X platform
Linux platform @swift-ci Please test Linux platform

Note: Only members of the Apple organization can trigger swift-ci.

@dabrahams
Copy link
Contributor

@PatrickPijnappel Hi; can you please merge master into your branch and resolve conflicts? Thanks.

/// - parameter next: A generator of code units to be decoded.
/// - Parameter next: A generator of code units to be decoded. Repeated
/// calls to this method on the same instance should always reuse the same
/// generator. Failing to do so will result in undefined behavior.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please indent following lines of a bullet by two spaces.
What do you mean by “reuse” here? Would “pass” be a better word?

I don’t think you mean undefined behavior; that actually includes memory and type-safety violations. I think you mean the results are unspecified. I think you also want to say that the generator or copies thereof shouldn’t be used for anything else between calls to this method. This seems like a complicated-enough restriction to warrant an example of correct usage.

@PatrickPijnappel
Copy link
Contributor Author

@dabrahams Resolved conflicts and fixed documentation.

These methods would be much more self-explanatory if we just had:

static func decode<S where S.Generator.Element == CodeUnit>(sequence: S) -> SomeSequence<UnicodeScalar>
static func encode<S where S.Generator.Element == UnicodeScalar>(sequence: S) -> SomeSequence<CodeUnit>

Is there a specific reason for the current design to be preferable?

@dabrahams
Copy link
Contributor

@swift-ci Please test

@PatrickPijnappel
Copy link
Contributor Author

@dabrahams Please test!

@gottesmm
Copy link
Contributor

@swift-ci Please test

return UTF8._isValidUTF8(UInt32(truncatingBitPattern: _buffer)) || _isAtEnd
let buffer = UInt32(truncatingBitPattern: _buffer)
let (codePoint, _) = UTF8._decodeOne(buffer)
return codePoint != nil || _isAtEnd
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PatrickPijnappel
Just out of curiosity, why return !UTF8.isContinuation(UTF8.CodeUnit(truncatingBitPattern: _buffer)) is insufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rintaro That would change the behavior slightly w.r.t. the original version, as not all non-continuation code units are valid starting code units (e.g. 0xff). Though the difference is somewhat inconsequential as _buffer should be well-formed, and in general it's not clear from the documentation what the handling of mal-formed sequences should be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying :)

@PatrickPijnappel
Copy link
Contributor Author

@gottesmm I can't reproduce these test failures on my machine (using build-script -RcT) and they don't seem to be related to the PR. Any suggestions?

@dabrahams
Copy link
Contributor

@swift-ci Please test

@dabrahams
Copy link
Contributor

on Mon Mar 14 2016, Patrick Pijnappel <notifications-AT-i.8713187.xyz> wrote:

@gottesmm I can't reproduce these test failures on my machine (using
build-script -RcT) and they don't seem to be related to the PR. Any
suggestions?

I'm re-testing. Upstream seems to be healthy now, but maybe this test
ran at a time that it was broken.

-Dave

@PatrickPijnappel
Copy link
Contributor Author

@dabrahams Ok, now OS X fails but Linux passes. Still seems unrelated to the changes though.

@dabrahams
Copy link
Contributor

@@ -63,13 +64,17 @@ public protocol UnicodeCodec {
/// Because of buffering, it is impossible to find the corresponding position
/// in the iterator for a given returned `UnicodeScalar` or an error.
///
/// - parameter next: An iterator over the code units to be decoded.
/// - Parameter next: A generator of code units to be decoded. Repeated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change "generator" to "iterator"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I also found a few stray cases of generator in the rest of the stdlib, see #1705.

@@ -829,7 +724,7 @@ extension UTF16 {
_precondition(width(x) == 2)
return UTF16.CodeUnit(
(x.value - 0x1_0000) & (((1 as UInt32) << 10) - 1)
) + 0xDC00
) + 0xDC00
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please unindent 2 spaces

@dabrahams
Copy link
Contributor

We're really close to being able to accept this one!

@PatrickPijnappel
Copy link
Contributor Author

@dabrahams Cool :).

@gribozavr
Copy link
Contributor

@swift-ci Please test

@gribozavr gribozavr merged commit 168e7e4 into swiftlang:master Mar 22, 2016
@PatrickPijnappel PatrickPijnappel deleted the utf8-rewrite branch March 22, 2016 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants