[stdlib] Rewrite UTF8._isValidUTF8() #1477

Merged
merged 7 commits into from
Feb 29, 2016

PatrickPijnappel
Contributor

What's in this pull request?

A rewrite of UTF8._isValidUTF8(), which further improves performance (mainly by removing branches) and reduces code size.

  • Non-ASCII: 35-45% speed-up
  • Invalid sequences: ~15% speed-up
  • ASCII: identical performance

Tested against the original implementation for all input values (0...0xffffffff); the results are identical.
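The exhaustive comparison described above can be sketched as a small differential test. This is only an illustration in current Swift syntax, not the harness used for this pull request: `firstMismatch`, `referenceTwoByteCheck`, and `maskedTwoByteCheck` are hypothetical names, the reference check is a simple per-byte stand-in rather than the original implementation, and the range is limited to 16-bit inputs so it finishes instantly.

```swift
// Hypothetical harness: returns the first input on which the two
// predicates disagree, or nil if they agree over the whole range.
func firstMismatch(
    in values: ClosedRange<UInt32>,
    reference: (UInt32) -> Bool,
    candidate: (UInt32) -> Bool
) -> UInt32? {
    for value in values where reference(value) != candidate(value) {
        return value
    }
    return nil
}

// Straightforward per-byte check for a two-byte sequence (assumption:
// the first byte sits in the low 8 bits of the buffer).
func referenceTwoByteCheck(_ buffer: UInt32) -> Bool {
    let first = UInt8(truncatingIfNeeded: buffer)        // lead byte
    let second = UInt8(truncatingIfNeeded: buffer >> 8)  // continuation byte
    let leadIsValid = first >= 0xC2 && first <= 0xDF     // excludes overlong C0, C1
    let continuationIsValid = second >= 0x80 && second <= 0xBF
    return leadIsValid && continuationIsValid
}

// The two mask checks from this pull request's diff, wrapped in a function.
func maskedTwoByteCheck(_ buffer: UInt32) -> Bool {
    if buffer & 0xc0e0 != 0x80c0 { return false }
    if buffer & 0x001e == 0x0000 { return false }
    return true
}

// Compare the two checks over every 16-bit input.
let mismatch = firstMismatch(
    in: 0...0xffff,
    reference: referenceTwoByteCheck,
    candidate: maskedTwoByteCheck
)
print(mismatch == nil ? "identical" : "mismatch found")
```

Widening the range to 0...0xffffffff follows the same pattern, but both functions would then have to handle sequences of every length.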


Before merging this pull request to apple/swift repository:

  • Test pull request on Swift continuous integration.

Triggering Swift CI

The swift-ci is triggered by writing a comment on this PR addressed to the github user @swift-ci. Different tests will run depending on the specific comment that you use. The currently available comments are:

Smoke Testing

Platform | Comment
All supported platforms | @swift-ci Please smoke test
OS X platform | @swift-ci Please smoke test OS X platform
Linux platform | @swift-ci Please smoke test Linux platform

Validation Testing

Platform | Comment
All supported platforms | @swift-ci Please test
OS X platform | @swift-ci Please test OS X platform
Linux platform | @swift-ci Please test Linux platform

Note: Only members of the Apple organization can trigger swift-ci.

@gribozavr
Contributor

@swift-ci Please test

@gribozavr
Contributor

@PatrickPijnappel There's a build failure, would you mind taking a look?

@PatrickPijnappel
Contributor Author

@gribozavr My bad! Will take a look and resolve.

// Require 10xx xxxx 110x xxxx.
if buffer & 0xc0e0 != 0x80c0 { return false }
// Disallow xxxx xxxx xxx0 000x (<= 7 bits case).
if buffer & 0x001e == 0x0000 { return false }
Contributor

Sorry, I don't understand this case. I think you meant to test against 0x1f00 instead of 0x001e.

Contributor

The bytes come in reverse order. Never mind.
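For readers following this exchange, here is a minimal sketch of why the mask is 0x001e and not 0x1f00. It assumes the buffer holds the first code unit in its lowest 8 bits, which is what "reverse order" means here. `packFirstByteLowest` and `passesTwoByteChecks` are hypothetical names for illustration; only the two `if` lines come from the diff, and the syntax is current Swift, not the 2016 syntax of the PR.

```swift
// Hypothetical helper: pack up to four UTF-8 code units into a UInt32,
// with the first byte in the lowest 8 bits.
func packFirstByteLowest(_ bytes: [UInt8]) -> UInt32 {
    precondition(bytes.count <= 4, "a UInt32 holds at most four code units")
    var buffer: UInt32 = 0
    for (index, byte) in bytes.enumerated() {
        buffer |= UInt32(byte) << (8 * index)
    }
    return buffer
}

// The two checks under review, wrapped in a function for illustration.
func passesTwoByteChecks(_ buffer: UInt32) -> Bool {
    // Low byte is the lead byte (110x xxxx); next byte is the
    // continuation byte (10xx xxxx).
    if buffer & 0xc0e0 != 0x80c0 { return false }
    // The lead byte's upper four payload bits live in bits 1...4 of the
    // low byte, so 0x001e selects them. All zero means the scalar fits
    // in 7 bits, i.e. an overlong encoding (lead byte 0xC0 or 0xC1).
    if buffer & 0x001e == 0x0000 { return false }
    return true
}

let eAcute = packFirstByteLowest([0xC3, 0xA9])    // "é", U+00E9
let overlong = packFirstByteLowest([0xC0, 0x80])  // overlong encoding of U+0000

print(String(eAcute, radix: 16))      // prints "a9c3"
print(passesTwoByteChecks(eAcute))    // prints "true"
print(passesTwoByteChecks(overlong))  // prints "false"
```

With the first byte in the high bits instead, the same test would need different masks, which is the reading the first review comment had in mind.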

@gribozavr
Contributor

@PatrickPijnappel This is brilliant! Please fix the build issue, and I'll run the benchmarks.

@practicalswift
Contributor

@PatrickPijnappel Great stuff! 👍

This is a replacement for usages of UTF8._numTrailingBytes().
Note that the sanityCheck was redundant at both call sites.
The checks are technically different: the previous check rejected only malformed initial code units, not all malformed sequences. Which is more correct is debatable, but since _buffer is only filled by transcoding from UTF-16, it should always be well-formed anyway, so the difference is not very relevant.
Replaces the tests for the removed _numTrailingBytes().
@PatrickPijnappel
Contributor Author

@gribozavr OK fixed the issues and added some validation tests as well!

@gribozavr
Contributor

@swift-ci Please test

}
}
return true
public static func _isValidUTF8(buffer: UInt32) -> Bool {
Contributor

Please use public // @testable (like we do in other places), for documentation purposes, and to make it easy to fix up the code when @testable works for the standard library.

@PatrickPijnappel
Contributor Author

Added @testable and fixed the failing test; it now passes on my machine.

@gribozavr
Contributor

@swift-ci Please test

@gribozavr
Contributor

Running benchmarks.

@gribozavr
Contributor

I'm seeing >10% improvements for ErrorHandling, NSError, NSStringConversion, and SevenBoom.

@PatrickPijnappel If you have a targeted microbenchmark, feel free to contribute it to a new file under benchmarks.

gribozavr added a commit that referenced this pull request Feb 29, 2016
[stdlib] Rewrite UTF8._isValidUTF8()
@gribozavr gribozavr merged commit 56785e8 into swiftlang:master Feb 29, 2016
@PatrickPijnappel PatrickPijnappel deleted the patch-3 branch March 1, 2016 04:25
@PatrickPijnappel PatrickPijnappel restored the patch-3 branch March 1, 2016 04:25
@PatrickPijnappel PatrickPijnappel deleted the patch-3 branch March 1, 2016 04:25
@PatrickPijnappel
Contributor Author

@gribozavr Added a UTF-8 benchmark (#1493). I'm in the process of simplifying/optimizing the other parts of UTF-8 decoding so it'll be useful to have a benchmark!
