Skip to content

[SE-0221] Character properties #15880

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

milseman
Copy link
Member

@milseman
Copy link
Member Author

@swift-ci please smoke test compiler performance

@swift-ci
Copy link
Contributor

Build comment file:

Compilation-performance test failed

@milseman
Copy link
Member Author

I just measured locally, this is ~1% to ~1.5% size increase of stdlib for both character properties and scalar properties together.

@milseman milseman force-pushed the character_properties branch 3 times, most recently from 2fe42a1 to 67c905b Compare April 26, 2018 01:37
@milseman milseman force-pushed the character_properties branch 2 times, most recently from 1cf928b to 72099ff Compare July 26, 2018 20:48
@milseman
Copy link
Member Author

@swift-ci please test source compatibility

@benrimmington
Copy link
Contributor

benrimmington commented Jul 29, 2018

Outdated comment

Some documentation (isNewline, isNumber, wholeNumberValue, isHexDigit) uses a mixture of /// and // comments.

Should the Unicode.GeneralCategory extension be moved to the UnicodeScalarProperties.swift file?

Should isNewline also recognize U+000B: LINE TABULATION (VT) and U+000C: FORM FEED (FF) characters? Is there an ICU property, instead of hard-coding these?

Should isEmoji be available for Apple platforms only? But if the emoji properties were introduced in ICU 57, they won't be available for all supported target platforms (e.g. macOS 10.9 and iOS 7).

Should the _crlf and _lf be stored (rather than computed) properties?

A couple more of the boolean properties could be based on Unicode.Scalar APIs.

 public var isASCII: Bool {
-  return asciiValue != nil
+  return (_isSingleScalar && _firstScalar.isASCII) || (self == ._crlf)
 }
 
 public var isHexDigit: Bool {
-  return hexDigitValue != nil
+  return (_isSingleScalar && _firstScalar.properties.isHexDigit)
 }

@benrimmington
Copy link
Contributor

benrimmington commented Jul 30, 2018

Outdated comment

For the following APIs, it might be possible to avoid temporary string creation, at least for single-scalar characters.

-public var isUppercase: Bool { return _isUppercased && isCased }
+public var isUppercase: Bool {
+  return (_isSingleScalar && _firstScalar.properties.isUppercase)
+      || (_isUppercased && isCased)
+}
 
-public var isLowercase: Bool { return _isLowercased && isCased }
+public var isLowercase: Bool {
+  return (_isSingleScalar && _firstScalar.properties.isLowercase)
+      || (_isLowercased && isCased)
+}
 
 public var isCased: Bool {
-  return !_isUppercased || !_isLowercased
+  return (_isSingleScalar && _firstScalar.properties.isCased)
+      || (!_isUppercased || !_isLowercased)
 }

Would it be possible to validate isEmoji with the contents of emoji-test.txt, if the validation-test can require the latest macOS?

@milseman
Copy link
Member Author

Some documentation (isNewline, isNumber, wholeNumberValue, isHexDigit) uses a mixture of /// and // comments.

Triple slash identifies the comment is for processing as formal documentation. As far as the individual blank line being double or triple, I'm indifferent. @natecook1000 would know more.

Should the Unicode.GeneralCategory extension be moved to the UnicodeScalarProperties.swift file?

This is just a (functional) prototype implementation, from before those properties landed. Putting it here minimized merge conflicts, but yes, it could go there as well. I was going to ask @allevato if he was interested in the higher-level categorization.

Should isNewline also recognize U+000B: LINE TABULATION (VT) and U+000C: FORM FEED (FF) characters? Is there an ICU property, instead of hard-coding these?

There is not, at least for general purpose in Unicode. Much of the specification and behavior surround handling them as they emerge with compatibility with other conventions.

To quote some sections: "Newlines are represented on different platforms by carriage return (CR), line feed (LF), CRLF, or next line (NEL)" ... "The Unicode Standard defines two unambiguous separator characters: U+2029 PARA- GRAPH SEPARATOR (PS) and U+2028 LINE SEPARATOR (LS). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous." ... "Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpreta- tion. Only on output is it necessary to distinguish between them.".

Now, FF might be worth adding: "R4: A readline function should stop at NLF, LS, FF, or PS. In the typical implemen- tation, it does not include the NLF, LS, PS, or FF that caused it to stop."

For VT, I don't really see anything except compatibility with Microsoft Word internal document conventions. Do you see a use for including it?

Should isEmoji be available for Apple platforms only? But if the emoji properties were introduced in ICU 57, they won't be available for all supported target platforms (e.g. macOS 10.9 and iOS 7).

At the very least, it may need some availability restrictions before merging.

Should the _crlf and _lf be stored (rather than computed) properties?

Err, why?

A couple more of the boolean properties could be based on Unicode.Scalar APIs.

I don't understand, what is the advantage of writing it this way?

For the following APIs, it might be possible to avoid temporary string creation, at least for single-scalar characters.

Yes, we can write more single-scalar fast-paths. BTW, with unicode-rich small string support in the future, these will all be stored directly.

Would it be possible to validate isEmoji with the contents of emoji-test.txt, if the validation-test can require the latest macOS?

isEmoji is careful not to perform emoji validation, but rather check for wellformedness. These are two different concerns, the former is recommended for any facility that is producing emoji rather than querying them.

@benrimmington
Copy link
Contributor

benrimmington commented Jul 31, 2018

Outdated comment

Triple slash identifies the comment is for processing as formal documentation. As far as the individual blank line being double or triple, I'm indifferent.

My concern was if the // makes the documentation parser ignore any portion of the comments.

For VT, I don't really see anything except compatibility with Microsoft Word internal document conventions. Do you see a use for including it?

UAX #14 includes it in the BK class, but "implementations are not required to support the VT character."

It doesn't seem like an improvement, but you might be able to use __swift_stdlib_u_getIntPropertyValue(_firstScalar.value, __swift_stdlib_UCHAR_LINE_BREAK) and check for U_LB_MANDATORY_BREAK, U_LB_CARRIAGE_RETURN, U_LB_LINE_FEED, U_LB_NEXT_LINE, etc.

Should the _crlf and _lf be stored (rather than computed) properties?

Err, why?

There used to be a performance issue with character literals, but it has probably been fixed by #6850 and/or #9225.

A couple more of the boolean properties could be based on Unicode.Scalar APIs.

I don't understand, what is the advantage of writing it this way?

For example, if _firstScalar.properties.isHexDigit is faster than your hexDigitValue property.

@milseman
Copy link
Member Author

milseman commented Oct 11, 2018

For example, if _firstScalar.properties.isHexDigit is faster than your hexDigitValue property.

The given implementation is 6 ~11% faster, even without making hexDigitValue @inlinable.

@milseman milseman force-pushed the character_properties branch from 72099ff to aa33b23 Compare October 11, 2018 03:52
@milseman
Copy link
Member Author

@swift-ci please test

@milseman
Copy link
Member Author

@swift-ci please test compiler performance

@swift-ci
Copy link
Contributor

Build failed
Swift Test OS X Platform
Git Sha - 72099ff11143ccaace3819a48bbe9385702267a8

@swift-ci
Copy link
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 72099ff11143ccaace3819a48bbe9385702267a8

@swift-ci
Copy link
Contributor

Build comment file:


@benrimmington
Copy link
Contributor

@milseman Can you hide my outdated comments, or shall I delete them?

(I should have used proper review comments instead).

@milseman
Copy link
Member Author

@benrimmington your comments are valuable, so I'd hate to see them go away. If you must, you can put under a disclosure triangle, but I don't see any harm in leaving them here.

Provide convenience properties on Character.
@milseman milseman force-pushed the character_properties branch from aa33b23 to e2dc4fb Compare October 15, 2018 23:32
@milseman milseman changed the title [DO NOT MERGE] Character properties Character properties Oct 15, 2018
@milseman milseman changed the title Character properties [SE-0221] Character properties Oct 15, 2018
@milseman
Copy link
Member Author

@swift-ci please test

@swift-ci
Copy link
Contributor

Build failed
Swift Test Linux Platform
Git Sha - aa33b232176ff0321979c4dd632e1fb0cf366b9d

@swift-ci
Copy link
Contributor

Build failed
Swift Test OS X Platform
Git Sha - aa33b232176ff0321979c4dd632e1fb0cf366b9d

@milseman
Copy link
Member Author

@airspeedswift ready whenever you are

static internal var _crlf: Character { return "\r\n" }

@inlinable
static internal var _lf: Character { return "\n" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _lf property isn't used in this pull request.

The _crlf property is only used once.

var CharacterPropertiesTests = TestSuite("StringTests")

CharacterPropertiesTests.test("ASCII queries") {
for cu in (0 as UInt32)...(0x7F as UInt32) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "\r\n" and some non-ASCII characters also be tested?

expectTrue(Character("Π").isUppercase)
expectTrue(Character("π").isLowercase)

expectEqual("SS", Character("ß").uppercased())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the Capital ẞ (U+1E9E: LATIN CAPITAL LETTER SHARP S) supported by ICU yet?

expectEqual("ß", Character("ẞ").lowercased())

The system font (SF Pro) displays it as an "SS" ligature.

Copy link
Member

@airspeedswift airspeedswift left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably need benchmarks too (though they can land separately). Particularly of the numeric conversion parts.

}
@inlinable
internal var _isSingleScalar: Bool {
return self.unicodeScalars.count == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize lengthy multi-scalar characters are a rare case, but this is O(n) when it could be done in constant time.

@milseman milseman closed this Oct 21, 2018
@milseman milseman deleted the character_properties branch October 21, 2018 05:16
@milseman milseman restored the character_properties branch October 21, 2018 05:16
@milseman
Copy link
Member Author

Err, not sure why I closed this PR, but it can't be re-opened now that there was an update. Tracking in #20520

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants