-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[stdlib] Rewrite UTF8 decoding #1525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Based on the API Design Guidelines.
_atEnd was confusing because it did not mean 'at the end of the sequence', because of buffering.
@PatrickPijnappel Hi; can you please merge master into your branch and resolve conflicts? Thanks. |
/// - parameter next: A generator of code units to be decoded. | ||
/// - Parameter next: A generator of code units to be decoded. Repeated | ||
/// calls to this method on the same instance should always reuse the same | ||
/// generator. Failing to do so will result in undefined behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please indent following lines of a bullet by two spaces.
What do you mean by “reuse” here? Would “pass” be a better word?
I don’t think you mean undefined behavior; that actually includes memory and type-safety violations. I think you mean the results are unspecified. I think you also want to say that the generator or copies thereof shouldn’t be used for anything else between calls to this method. This seems like a complicated-enough restriction to warrant an example of correct usage.
@dabrahams Resolved conflicts and fixed documentation. These methods would be much more self-explanatory if we just had: static func decode<S where S.Generator.Element == CodeUnit>(sequence: S) -> SomeSequence<UnicodeScalar>
static func encode<S where S.Generator.Element == UnicodeScalar>(sequence: S) -> SomeSequence<CodeUnit> Is there a specific reason for the current design to be preferable? |
@swift-ci Please test |
@dabrahams Please test! |
@swift-ci Please test |
return UTF8._isValidUTF8(UInt32(truncatingBitPattern: _buffer)) || _isAtEnd | ||
let buffer = UInt32(truncatingBitPattern: _buffer) | ||
let (codePoint, _) = UTF8._decodeOne(buffer) | ||
return codePoint != nil || _isAtEnd | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PatrickPijnappel
Just out of curiosity, why return !UTF8.isContinuation(UTF8.CodeUnit(truncatingBitPattern: _buffer))
is insufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rintaro That would change the behavior slightly w.r.t. the original version, as not all non-continuation code units are valid starting code units (e.g. 0xff
). Though the difference is somewhat inconsequential as _buffer
should be well-formed, and in general it's not clear from the documentation what the handling of mal-formed sequences should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying :)
@gottesmm I can't reproduce these test failures on my machine (using |
@swift-ci Please test |
on Mon Mar 14 2016, Patrick Pijnappel <notifications-AT-i.8713187.xyz> wrote:
I'm re-testing. Upstream seems to be healthy now, but maybe this test -Dave |
@dabrahams Ok, now OS X fails but Linux passes. Still seems unrelated to the changes though. |
This looks like it was a failure in upstream: https://ci.swift.org/job/oss-swift_tools-RA_stdlib-DA_test-simulator/790/consoleFull#-1806112821fca400bf-2f4a-462e-b517-e058d770b2d7 |
@@ -63,13 +64,17 @@ public protocol UnicodeCodec { | |||
/// Because of buffering, it is impossible to find the corresponding position | |||
/// in the iterator for a given returned `UnicodeScalar` or an error. | |||
/// | |||
/// - parameter next: An iterator over the code units to be decoded. | |||
/// - Parameter next: A generator of code units to be decoded. Repeated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change "generator" to "iterator"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I also found a few stray cases of generator in the rest of the stdlib, see #1705.
@@ -829,7 +724,7 @@ extension UTF16 { | |||
_precondition(width(x) == 2) | |||
return UTF16.CodeUnit( | |||
(x.value - 0x1_0000) & (((1 as UInt32) << 10) - 1) | |||
) + 0xDC00 | |||
) + 0xDC00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please unindent 2 spaces
We're really close to being able to accept this one! |
7707f66
to
771a815
Compare
@dabrahams Cool :). |
@swift-ci Please test |
What's in this pull request?
Expanding upon #1477, this is a thorough rewrite of UTF8.decode() for optimization, code simplification and clarity.
_isValidUTF8()
(renamed to_decodeOne()
), significantly reducing total amount of code and eliminating unnecessary branching.Before merging this pull request to apple/swift repository:
Triggering Swift CI
The swift-ci is triggered by writing a comment on this PR addressed to the GitHub user @swift-ci. Different tests will run depending on the specific comment that you use. The currently available comments are:
Smoke Testing
Validation Testing
Note: Only members of the Apple organization can trigger swift-ci.