-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[SE-0211] Add Unicode properties to Unicode.Scalar #15593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
….Scalar.Properties
The UCD also defines a http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt |
This enumeration is exposed via the What's a bit tricky is that the So we have a couple options:
WDYT? Given how closely we've adhered to the standard elsewhere, I think it's hard to rationalize doing anything other than 2. |
Oh yuck. This enum seems more like an ICU internal detail than a Unicode definition, so I'd say omit it. |
Oops, I have to correct myself: It looks like I'd have to do some more digging to see why they're not an exact match. |
@swift-ci please test compiler performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for a prototype, some small adjustments
let prop = __swift_stdlib_UCHAR_FULL_COMPOSITION_EXCLUSION | ||
return __swift_stdlib_u_hasBinaryProperty(value, prop) != 0 | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
) -> String? { | ||
let initialCapacity = 256 | ||
|
||
var storage = _SwiftStringStorage<UTF8.CodeUnit>.create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use String._fromUTF8CodeUnitSequence/_fromWellFormedUTF8CodeUnitSequence/_fromASCI/
, which will attempt to form a small string if possible. Try to avoid creating a storage class.
You could make a small string and try forming into that, falling back to an Array+_fromASCII
otherwise. See e.g. _SmallString._withMutableExcessCapacityBytes
. Micro-optimizing further is almost certainly not worth it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The good news is that this code doesn't compile anymore, so it's a great time to fix it!
/Users/buildnode/jenkins/workspace-private/swift-PR-compiler-performance-macOS/master/new/swift/stdlib/public/core/UnicodeScalarProperties.swift:1091:12: error: argument labels '(_storage:)' do not match any available overloads
16:35:05 return String(_storage: storage)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a fix for case conversion (that uses a stack buffer). I'll try to do _scalarName too and share that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fixed _scalarName
and pushed it—thanks for the pointers to the new APIs.
I hadn't gotten to case conversion yet since those need to be written into UTF-16 buffers and I didn't see a UTF-16 analogue of _SmallUTF8String
, so I figured there would be some extra work that I haven't been able to dive into yet. If you already have a fix, I'll just wait for that!
/// This property is a `String` because some mappings may transform a single | ||
/// scalar into multiple scalars. For example, the character "İ" (U+0130 | ||
/// LATIN CAPITAL LETTER I WITH DOT ABOVE) becomes two scalars (U+0069 LATIN | ||
/// SMALL LETTER I, U+0307 COMBINING DOT ABOVE) when converted to lowercase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it's not a Character
because it can change a single scalar into multiple graphemes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
public var lowercaseMapping: String { | ||
let initialCapacity = 1 | ||
|
||
var storage = _SwiftStringStorage<UTF16.CodeUnit>.create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely want to avoid creating storage for these.
|
||
internal var _value: __swift_stdlib_UChar32 | ||
internal var _utf16: (UTF16.CodeUnit, UTF16.CodeUnit) | ||
internal var _utf16Length: Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encoding a scalar into UTF-16 is pretty trivial and almost certainly not worth storing an extra word or two to avoid re-computing it. It's also not worth doing every time a user writes myScalar.properties.isWhitespace
. I suggest computing this lazily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Since you said above that you're working on a fix for the case mappings, I'll hold off on fixing this until that's ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think you could cherry-pick this: milseman@0fdd3d7 ?
I tried to open a PR against this PR, but I couldn't find your repo in the list...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any tests? I want to make sure I didn't break anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! These new case mappings look like they did the trick (I've pushed an update with your changes, and also removed the UTF-16 caching).
I don't have any proper unit tests, I've just been spot-testing with the REPL. I'll get those together next.
Build comment file:Compilation-performance test failed |
@swift-ci please test compiler performance |
(I measured about a 1% size increase to optimized stdlib binary size) |
var err = __swift_stdlib_U_ZERO_ERROR | ||
let correctSizeRaw = smallString._withMutableExcessCapacityBytes { ptr in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this would only allow ICU to store 14 ASCII characters rather than the full 15. See my alternative from the hash I linked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Updated with your version.
@swift-ci please smoke test compiler performance |
Build comment file:Compilation-performance test failed |
I'll work on some tests next and get those pushed. |
I've updated some tests for the properties with more complex logic (like case mappings and age); I plan to cover the remainder shortly. Any idea what the compiler performance failures are about, since there are no logs here? I still need to verify that this builds on Linux, so I wonder if that has any relation. |
@swift-ci please test |
Build failed |
Build failed |
That is... weird if it's on 64-bit, but makes sense on 32-bit. This would need to be For case-conversion, I think it's better to just form a String and call a method on that. SmallString will support UTF-8 in the future, so that would include all scalars in the small form anyways. For the name of a scalar, I think it's better to use a fixed size array and call String._fromASCII on that. I can try a patch to this effect later. |
This property is too specific in that it forces a particular normalization; let's not expose it this way, but instead in the future with a full normalization API.
- numericValue returns nil instead of .nan for non-numerics - Remove small-string optimizations from _scalarName that failed on 32-bit archs - Put case mappings back into U.S.Properties - Added more sanity tests
@milseman I pushed some updates now that SE-0211 has been accepted. PTAL and let me know if there's anything I've missed. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@swift-ci please test |
Did the tests actually run? I couldn't dig up any results on Jenkins a few hours after you kicked it off. |
@swift-ci please test |
Build failed |
Urgh, that test failure is gross.
The cheap way to fix it is to remove those tests, but that just hides the underlying issue. I can think of a couple options, both kind of awful:
What are your thoughts? |
It's really unfortunate that these APIs aren't available on the oldest Ubuntu LTS. Let me check up on what we can do here. For now, I would conditionally test based either on Darwin or a recent Linux. edit: You could also make these few specific APIs available on Darwin only for now. We can add Linux support when we figure out a solution and/or Ubuntu gets version-bumped. |
Ubuntu 16.04 doesn't have a recent enough ICU to support these; we need a better long-term solution, such as bundling ICU with the toolchain.
Sounds good—I've updated the PR to only define the four emoji properties on Darwin OSes. |
@swift-ci please test |
Build failed |
Build failed |
Tests look good! Anything left to do before we merge? |
@allevato Great job! |
/// The general category of a scalar is its "first-order, most usual | ||
/// categorization". It does not attempt to cover multiple uses of some | ||
/// scalars, such as the use of letters to represent Roman numerals. | ||
public enum GeneralCategory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋
I thought the discussion agreed on making that enum RawRepresentable (which would make it useful for building regular expressions for example), was this idea just lost in the thread or discarded for any reason?
https://forums.swift.org/t/se-0211-add-unicode-properties-to-unicode-scalar/12121/6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the proposal was accepted by the core team with the only revision being to the numericValue
property, I didn't make any changes from the original proposal other than the one that was requested (and to remove a couple ICU-only properties that accidentally got into the list).
@milseman What do you think? Did the core team have any feelings about that specifically, or was it just an idea that just got lost in the discussion thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw I don't think it would be that useful to be able to do in the direction of init(rawValue: String)
using "Lu"
for example, but would definitely be useful to go in the other direction (.uppercaseLetter
=> "Lu"
or "\\p{Lu}"
). So only implementing var description: String
for that purpose (instead of making it RawRepresentable where RawValue: StringProtocol
) could also be enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's many ways of describing it. There's the short name, the long name, how it might appear in various regex conventions, etc. I don't have any strong opinions here, other than description
and debugDescription
probably should choose user-readable names over a particular regex convention.
Long term, Swift built in regexes and/or regex packages of various flavors could provide their Custom[PCRE|POSIX|...]RegexConvertible
protocols. The would probably produces types that themselves could be CustomStringConvertible
, producing e.g. "\\p{Lu}"
or "<:Lu>"
, etc.
This PR is a work-in-progress for the
Unicode.Scalar
additions being proposed in this thread. I'm opening the PR now to link to it in the proposal text.I still need to test these changes on Linux and add some unit tests (and likely some benchmarks) for it. Please feel free to comment on these changes—I'm very interested in that feedback—but let's hold off on running the CI bots to preserve resources until I've completed those tasks.