[SE-0221] Character properties #15880

milseman · 2018-04-11T19:36:09Z

https://forums.swift.org/t/pitch-character-and-string-properties/11620/68

milseman · 2018-04-11T19:39:07Z

@swift-ci please smoke test compiler performance

swift-ci · 2018-04-11T20:06:21Z

Build comment file:

Compilation-performance test failed

milseman · 2018-04-11T20:24:46Z

I just measured locally, this is ~1% to ~1.5% size increase of stdlib for both character properties and scalar properties together.

milseman · 2018-07-26T20:48:25Z

@swift-ci please test source compatibility

benrimmington · 2018-07-29T17:29:32Z

Outdated comment

Some documentation (isNewline, isNumber, wholeNumberValue, isHexDigit) uses a mixture of /// and // comments.

Should the Unicode.GeneralCategory extension be moved to the UnicodeScalarProperties.swift file?

Should isNewline also recognize U+000B: LINE TABULATION (VT) and U+000C: FORM FEED (FF) characters? Is there an ICU property, instead of hard-coding these?

Should isEmoji be available for Apple platforms only? But if the emoji properties were introduced in ICU 57, they won't be available for all supported target platforms (e.g. macOS 10.9 and iOS 7).

Should the _crlf and _lf be stored (rather than computed) properties?

A couple more of the boolean properties could be based on Unicode.Scalar APIs.

 public var isASCII: Bool {
-  return asciiValue != nil
+  return (_isSingleScalar && _firstScalar.isASCII) || (self == ._crlf)
 }
 
 public var isHexDigit: Bool {
-  return hexDigitValue != nil
+  return (_isSingleScalar && _firstScalar.properties.isHexDigit)
 }

benrimmington · 2018-07-30T07:42:45Z

Outdated comment

For the following APIs, it might be possible to avoid temporary string creation, at least for single-scalar characters.

-public var isUppercase: Bool { return _isUppercased && isCased }
+public var isUppercase: Bool {
+  return (_isSingleScalar && _firstScalar.properties.isUppercase)
+      || (_isUppercased && isCased)
+}
 
-public var isLowercase: Bool { return _isLowercased && isCased }
+public var isLowercase: Bool {
+  return (_isSingleScalar && _firstScalar.properties.isLowercase)
+      || (_isLowercased && isCased)
+}
 
 public var isCased: Bool {
-  return !_isUppercased || !_isLowercased
+  return (_isSingleScalar && _firstScalar.properties.isCased)
+      || (!_isUppercased || !_isLowercased)
 }

Would it be possible to validate isEmoji with the contents of emoji-test.txt, if the validation-test can require the latest macOS?

milseman · 2018-07-30T22:55:23Z

Some documentation (isNewline, isNumber, wholeNumberValue, isHexDigit) uses a mixture of /// and // comments.

Triple slash identifies the comment is for processing as formal documentation. As far as the individual blank line being double or triple, I'm indifferent. @natecook1000 would know more.

Should the Unicode.GeneralCategory extension be moved to the UnicodeScalarProperties.swift file?

This is just a (functional) prototype implementation, from before those properties landed. Putting it here minimized merge conflicts, but yes, it could go there as well. I was going to ask @allevato if he was interested in the higher-level categorization.

Should isNewline also recognize U+000B: LINE TABULATION (VT) and U+000C: FORM FEED (FF) characters? Is there an ICU property, instead of hard-coding these?

There is not, at least for general purpose in Unicode. Much of the specification and behavior surround handling them as they emerge with compatibility with other conventions.

To quote some sections: "Newlines are represented on different platforms by carriage return (CR), line feed (LF), CRLF, or next line (NEL)" ... "The Unicode Standard defines two unambiguous separator characters: U+2029 PARA- GRAPH SEPARATOR (PS) and U+2028 LINE SEPARATOR (LS). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous." ... "Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpreta- tion. Only on output is it necessary to distinguish between them.".

Now, FF might be worth adding: "R4: A readline function should stop at NLF, LS, FF, or PS. In the typical implemen- tation, it does not include the NLF, LS, PS, or FF that caused it to stop."

For VT, I don't really see anything except compatibility with Microsoft Word internal document conventions. Do you see a use for including it?

Should isEmoji be available for Apple platforms only? But if the emoji properties were introduced in ICU 57, they won't be available for all supported target platforms (e.g. macOS 10.9 and iOS 7).

At the very least, it may need some availability restrictions before merging.

Should the _crlf and _lf be stored (rather than computed) properties?

Err, why?

A couple more of the boolean properties could be based on Unicode.Scalar APIs.

I don't understand, what is the advantage of writing it this way?

For the following APIs, it might be possible to avoid temporary string creation, at least for single-scalar characters.

Yes, we can write more single-scalar fast-paths. BTW, with unicode-rich small string support in the future, these will all be stored directly.

Would it be possible to validate isEmoji with the contents of emoji-test.txt, if the validation-test can require the latest macOS?

isEmoji is careful not to perform emoji validation, but rather check for wellformedness. These are two different concerns, the former is recommended for any facility that is producing emoji rather than querying them.

benrimmington · 2018-07-31T07:09:10Z

Outdated comment

Triple slash identifies the comment is for processing as formal documentation. As far as the individual blank line being double or triple, I'm indifferent.

My concern was if the // makes the documentation parser ignore any portion of the comments.

For VT, I don't really see anything except compatibility with Microsoft Word internal document conventions. Do you see a use for including it?

UAX #14 includes it in the BK class, but "implementations are not required to support the VT character."

It doesn't seem like an improvement, but you might be able to use __swift_stdlib_u_getIntPropertyValue(_firstScalar.value, __swift_stdlib_UCHAR_LINE_BREAK) and check for U_LB_MANDATORY_BREAK, U_LB_CARRIAGE_RETURN, U_LB_LINE_FEED, U_LB_NEXT_LINE, etc.

Should the _crlf and _lf be stored (rather than computed) properties?

Err, why?

There used to be a performance issue with character literals, but it has probably been fixed by #6850 and/or #9225.

A couple more of the boolean properties could be based on Unicode.Scalar APIs.

I don't understand, what is the advantage of writing it this way?

For example, if _firstScalar.properties.isHexDigit is faster than your hexDigitValue property.

milseman · 2018-10-11T00:41:06Z

For example, if _firstScalar.properties.isHexDigit is faster than your hexDigitValue property.

The given implementation is 6 ~11% faster, even without making hexDigitValue @inlinable.

milseman · 2018-10-11T03:53:00Z

@swift-ci please test

milseman · 2018-10-11T03:53:11Z

@swift-ci please test compiler performance

swift-ci · 2018-10-11T03:53:29Z

Build failed
Swift Test OS X Platform
Git Sha - 72099ff11143ccaace3819a48bbe9385702267a8

swift-ci · 2018-10-11T03:54:42Z

Build failed
Swift Test Linux Platform
Git Sha - 72099ff11143ccaace3819a48bbe9385702267a8

swift-ci · 2018-10-11T03:55:09Z

Build comment file:

benrimmington · 2018-10-11T09:09:09Z

@milseman Can you hide my outdated comments, or shall I delete them?

(I should have used proper review comments instead).

milseman · 2018-10-11T17:30:17Z

@benrimmington your comments are valuable, so I'd hate to see them go away. If you must, you can put under a disclosure triangle, but I don't see any harm in leaving them here.

Provide convenience properties on Character.

milseman · 2018-10-15T23:33:33Z

@swift-ci please test

swift-ci · 2018-10-15T23:35:40Z

Build failed
Swift Test Linux Platform
Git Sha - aa33b232176ff0321979c4dd632e1fb0cf366b9d

swift-ci · 2018-10-15T23:37:22Z

Build failed
Swift Test OS X Platform
Git Sha - aa33b232176ff0321979c4dd632e1fb0cf366b9d

milseman · 2018-10-16T02:56:23Z

@airspeedswift ready whenever you are

benrimmington · 2018-10-17T14:01:32Z

stdlib/public/core/CharacterProperties.swift

+  static internal var _crlf: Character { return "\r\n" }
+
+  @inlinable
+  static internal var _lf: Character { return "\n" }


The _lf property isn't used in this pull request.

The _crlf property is only used once.

benrimmington · 2018-10-17T14:02:54Z

test/stdlib/CharacterProperties.swift

+var CharacterPropertiesTests = TestSuite("StringTests")
+
+CharacterPropertiesTests.test("ASCII queries") {
+  for cu in (0 as UInt32)...(0x7F as UInt32) {


Should "\r\n" and some non-ASCII characters also be tested?

benrimmington · 2018-10-17T14:04:38Z

test/stdlib/CharacterProperties.swift

+  expectTrue(Character("Π").isUppercase)
+  expectTrue(Character("π").isLowercase)
+
+  expectEqual("SS", Character("ß").uppercased())


Is the Capital ẞ (U+1E9E: LATIN CAPITAL LETTER SHARP S) supported by ICU yet?

expectEqual("ß", Character("ẞ").lowercased())

The system font (SF Pro) displays it as an "SS" ligature.

airspeedswift

I think we probably need benchmarks too (though they can land separately). Particularly of the numeric conversion parts.

airspeedswift · 2018-10-16T14:37:16Z

stdlib/public/core/CharacterProperties.swift

+  }
+  @inlinable
+  internal var _isSingleScalar: Bool {
+    return self.unicodeScalars.count == 1


I realize lengthy multi-scalar characters are a rare case, but this is O(n) when it could be done in constant time.

milseman · 2018-11-12T21:17:48Z

Err, not sure why I closed this PR, but it can't be re-opened now that there was an update. Tracking in #20520

milseman force-pushed the character_properties branch 3 times, most recently from 2fe42a1 to 67c905b Compare April 26, 2018 01:37

milseman force-pushed the character_properties branch 2 times, most recently from 1cf928b to 72099ff Compare July 26, 2018 20:48

milseman force-pushed the character_properties branch from 72099ff to aa33b23 Compare October 11, 2018 03:52

milseman requested a review from airspeedswift October 11, 2018 17:29

Implement SE-0221: Character Properties

e2dc4fb

Provide convenience properties on Character.

milseman force-pushed the character_properties branch from aa33b23 to e2dc4fb Compare October 15, 2018 23:32

milseman changed the title ~~[DO NOT MERGE] Character properties~~ Character properties Oct 15, 2018

milseman changed the title ~~Character properties~~ [SE-0221] Character properties Oct 15, 2018

benrimmington reviewed Oct 17, 2018

View reviewed changes

airspeedswift approved these changes Oct 17, 2018

View reviewed changes

milseman closed this Oct 21, 2018

milseman deleted the character_properties branch October 21, 2018 05:16

milseman restored the character_properties branch October 21, 2018 05:16

milseman mentioned this pull request Nov 12, 2018

Implement SE-0221: Character Properties #20520

Merged

[SE-0221] Character properties #15880

[SE-0221] Character properties #15880

Uh oh!

Conversation

milseman commented Apr 11, 2018

Uh oh!

milseman commented Apr 11, 2018

Uh oh!

swift-ci commented Apr 11, 2018

Build comment file:

Compilation-performance test failed

Uh oh!

milseman commented Apr 11, 2018

Uh oh!

milseman commented Jul 26, 2018

Uh oh!

benrimmington commented Jul 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benrimmington commented Jul 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

milseman commented Jul 30, 2018

Uh oh!

benrimmington commented Jul 31, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

milseman commented Oct 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

milseman commented Oct 11, 2018

Uh oh!

milseman commented Oct 11, 2018

Uh oh!

swift-ci commented Oct 11, 2018

Uh oh!

swift-ci commented Oct 11, 2018

Uh oh!

swift-ci commented Oct 11, 2018

Build comment file:

Uh oh!

benrimmington commented Oct 11, 2018

Uh oh!

milseman commented Oct 11, 2018

Uh oh!

milseman commented Oct 15, 2018

Uh oh!

swift-ci commented Oct 15, 2018

Uh oh!

swift-ci commented Oct 15, 2018

Uh oh!

milseman commented Oct 16, 2018

Uh oh!

benrimmington Oct 17, 2018

Choose a reason for hiding this comment

Uh oh!

benrimmington Oct 17, 2018

Choose a reason for hiding this comment

Uh oh!

benrimmington Oct 17, 2018

Choose a reason for hiding this comment

Uh oh!

airspeedswift left a comment

Choose a reason for hiding this comment

Uh oh!

airspeedswift Oct 16, 2018

Choose a reason for hiding this comment

Uh oh!

milseman commented Nov 12, 2018

Uh oh!

Uh oh!

benrimmington commented Jul 29, 2018 •

edited

Loading

benrimmington commented Jul 30, 2018 •

edited

Loading

benrimmington commented Jul 31, 2018 •

edited

Loading

milseman commented Oct 11, 2018 •

edited

Loading