-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[SE-0221] Character properties #15880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@swift-ci please smoke test compiler performance |
Build comment file:Compilation-performance test failed |
I just measured locally, this is ~1% to ~1.5% size increase of stdlib for both character properties and scalar properties together. |
2fe42a1
to
67c905b
Compare
1cf928b
to
72099ff
Compare
@swift-ci please test source compatibility |
Outdated commentSome documentation ( Should the Should Should Should the A couple more of the boolean properties could be based on public var isASCII: Bool {
- return asciiValue != nil
+ return (_isSingleScalar && _firstScalar.isASCII) || (self == ._crlf)
}
public var isHexDigit: Bool {
- return hexDigitValue != nil
+ return (_isSingleScalar && _firstScalar.properties.isHexDigit)
} |
Outdated commentFor the following APIs, it might be possible to avoid temporary string creation, at least for single-scalar characters. -public var isUppercase: Bool { return _isUppercased && isCased }
+public var isUppercase: Bool {
+ return (_isSingleScalar && _firstScalar.properties.isUppercase)
+ || (_isUppercased && isCased)
+}
-public var isLowercase: Bool { return _isLowercased && isCased }
+public var isLowercase: Bool {
+ return (_isSingleScalar && _firstScalar.properties.isLowercase)
+ || (_isLowercased && isCased)
+}
public var isCased: Bool {
- return !_isUppercased || !_isLowercased
+ return (_isSingleScalar && _firstScalar.properties.isCased)
+ || (!_isUppercased || !_isLowercased)
} Would it be possible to validate |
Triple slash identifies the comment is for processing as formal documentation. As far as the individual blank line being double or triple, I'm indifferent. @natecook1000 would know more.
This is just a (functional) prototype implementation, from before those properties landed. Putting it here minimized merge conflicts, but yes, it could go there as well. I was going to ask @allevato if he was interested in the higher-level categorization.
There is not, at least for general purpose in Unicode. Much of the specification and behavior surround handling them as they emerge with compatibility with other conventions. To quote some sections: "Newlines are represented on different platforms by carriage return (CR), line feed (LF), CRLF, or next line (NEL)" ... "The Unicode Standard defines two unambiguous separator characters: U+2029 PARA- GRAPH SEPARATOR (PS) and U+2028 LINE SEPARATOR (LS). In Unicode text, the PS and LS characters should be used wherever the desired function is unambiguous." ... "Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpreta- tion. Only on output is it necessary to distinguish between them.". Now, FF might be worth adding: "R4: A readline function should stop at NLF, LS, FF, or PS. In the typical implemen- tation, it does not include the NLF, LS, PS, or FF that caused it to stop." For VT, I don't really see anything except compatibility with Microsoft Word internal document conventions. Do you see a use for including it?
At the very least, it may need some availability restrictions before merging.
Err, why?
I don't understand, what is the advantage of writing it this way?
Yes, we can write more single-scalar fast-paths. BTW, with unicode-rich small string support in the future, these will all be stored directly.
isEmoji is careful not to perform emoji validation, but rather check for wellformedness. These are two different concerns, the former is recommended for any facility that is producing emoji rather than querying them. |
Outdated comment
My concern was if the
UAX #14 includes it in the BK class, but "implementations are not required to support the VT character." It doesn't seem like an improvement, but you might be able to use
There used to be a performance issue with character literals, but it has probably been fixed by #6850 and/or #9225.
For example, if |
The given implementation is |
72099ff
to
aa33b23
Compare
@swift-ci please test |
@swift-ci please test compiler performance |
Build failed |
Build failed |
Build comment file: |
@milseman Can you hide my outdated comments, or shall I delete them? (I should have used proper review comments instead). |
@benrimmington your comments are valuable, so I'd hate to see them go away. If you must, you can put under a disclosure triangle, but I don't see any harm in leaving them here. |
Provide convenience properties on Character.
aa33b23
to
e2dc4fb
Compare
@swift-ci please test |
Build failed |
Build failed |
@airspeedswift ready whenever you are |
static internal var _crlf: Character { return "\r\n" } | ||
|
||
@inlinable | ||
static internal var _lf: Character { return "\n" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _lf
property isn't used in this pull request.
The _crlf
property is only used once.
var CharacterPropertiesTests = TestSuite("StringTests") | ||
|
||
CharacterPropertiesTests.test("ASCII queries") { | ||
for cu in (0 as UInt32)...(0x7F as UInt32) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should "\r\n"
and some non-ASCII characters also be tested?
expectTrue(Character("Π").isUppercase) | ||
expectTrue(Character("π").isLowercase) | ||
|
||
expectEqual("SS", Character("ß").uppercased()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the Capital ẞ (U+1E9E: LATIN CAPITAL LETTER SHARP S) supported by ICU yet?
expectEqual("ß", Character("ẞ").lowercased())
The system font (SF Pro) displays it as an "SS" ligature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we probably need benchmarks too (though they can land separately). Particularly of the numeric conversion parts.
} | ||
@inlinable | ||
internal var _isSingleScalar: Bool { | ||
return self.unicodeScalars.count == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize lengthy multi-scalar characters are a rare case, but this is O(n) when it could be done in constant time.
Err, not sure why I closed this PR, but it can't be re-opened now that there was an update. Tracking in #20520 |
https://forums.swift.org/t/pitch-character-and-string-properties/11620/68