[SE-0211] Add Unicode properties to Unicode.Scalar #15593

allevato · 2018-03-29T05:26:40Z

This PR is a work-in-progress for the Unicode.Scalar additions being proposed in this thread. I'm opening the PR now to link to it in the proposal text.

I still need to test these changes on Linux and add some unit tests (and likely some benchmarks) for it. Please feel free to comment on these changes—I'm very interested in that feedback—but let's hold off on running the CI bots to preserve resources until I've completed those tasks.

milseman · 2018-03-29T22:43:37Z

The UCD also defines an isNewline property. Do you know if ICU exposes this?

http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt

allevato · 2018-03-30T03:17:49Z

@milseman:

The UCD also defines an isNewline property. Do you know if ICU exposes this?

This enumeration is exposed via the UCHAR_WORD_BREAK property. I don't believe this is an "isNewline" property as such (i.e., it's not a binary property), but just one case in the enumeration.

What's a bit tricky is that the Newline word break type excludes CR and LF, because those are represented as distinct word break types (whereas, conceptually, the word break algorithm in UAX #29 treats them equivalently).

So we have a couple options:

1. We could expose a computed isNewline property, but we'd have to decide whether we also want to include CR and LF or not; either way, we'd be diverging slightly from the standard.
2. We can just expose a Unicode.WordBreak enum through a wordBreak property and let users decide how to handle it. That's closest to standard compliance.

WDYT? Given how closely we've adhered to the standard elsewhere, I think it's hard to rationalize doing anything other than 2.
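
To make the trade-off between the two options concrete, here is a minimal sketch. The enum `WordBreakSketch` and all three functions are hypothetical names invented for illustration; nothing like them is part of this PR. The code point assignments follow the CR, LF, and Newline rows of the word break property table in UAX #29, and every other value is collapsed into `.other`.

```swift
// Hypothetical model of three Word_Break values from UAX #29.
// This type is NOT part of the PR; it only illustrates the discussion.
enum WordBreakSketch {
  case cr, lf, newline, other
}

// Classify a scalar by hand instead of asking ICU (an assumption made so the
// sketch is self-contained).
func wordBreakSketch(of scalar: Unicode.Scalar) -> WordBreakSketch {
  switch scalar.value {
  case 0x000D: return .cr
  case 0x000A: return .lf
  case 0x000B, 0x000C, 0x0085, 0x2028, 0x2029: return .newline
  default: return .other
  }
}

// Option 1, strict reading: only the Newline value counts.
func isNewlineStrict(_ scalar: Unicode.Scalar) -> Bool {
  return wordBreakSketch(of: scalar) == .newline
}

// Option 1, inclusive reading: CR and LF count as well.
func isNewlineInclusive(_ scalar: Unicode.Scalar) -> Bool {
  switch wordBreakSketch(of: scalar) {
  case .cr, .lf, .newline: return true
  case .other: return false
  }
}
```

With option 2, only something like `wordBreakSketch(of:)` would be exposed, and callers would choose between the strict and inclusive readings themselves.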

milseman · 2018-03-30T14:24:26Z

Oh yuck. This enum seems more like an ICU internal detail than a Unicode definition, so I'd say omit it.

allevato · 2018-03-30T14:32:58Z

Oops, I have to correct myself: UCHAR_WORD_BREAK returns UWordBreakValues—they had to use a different name because they had already used UWordBreak for their break iterator constants.

It looks like UWordBreakValues corresponds almost exactly to the word break property values in WordBreakProperty.txt, which are defined in Table 3 of UAX #29. The ICU enum is a subset; for example, MidLetter doesn't have a corresponding ICU constant.

I'd have to do some more digging to see why they're not an exact match.

milseman · 2018-03-30T20:55:08Z

@swift-ci please test compiler performance

milseman

LGTM for a prototype, some small adjustments

milseman · 2018-03-30T20:57:11Z

stdlib/public/core/StringNormalization.swift

-    let prop = __swift_stdlib_UCHAR_FULL_COMPOSITION_EXCLUSION
-    return __swift_stdlib_u_hasBinaryProperty(value, prop) != 0
-  }
-}


milseman · 2018-03-30T21:03:57Z

stdlib/public/core/UnicodeScalarProperties.swift

+  ) -> String? {
+    let initialCapacity = 256
+
+    var storage = _SwiftStringStorage<UTF8.CodeUnit>.create(


Use String._fromUTF8CodeUnitSequence, _fromWellFormedUTF8CodeUnitSequence, or _fromASCII, which will attempt to form a small string if possible. Try to avoid creating a storage class.

You could make a small string and try forming into that, falling back to an Array+_fromASCII otherwise. See e.g. _SmallString._withMutableExcessCapacityBytes. Micro-optimizing further is almost certainly not worth it here.

The good news is that this code doesn't compile anymore, so it's a great time to fix it!

/Users/buildnode/jenkins/workspace-private/swift-PR-compiler-performance-macOS/master/new/swift/stdlib/public/core/UnicodeScalarProperties.swift:1091:12: error: argument labels '(_storage:)' do not match any available overloads
16:35:05     return String(_storage: storage)

I have a fix for case conversion (that uses a stack buffer). I'll try to do _scalarName too and share that.

I've fixed _scalarName and pushed it—thanks for the pointers to the new APIs.

I hadn't gotten to case conversion yet since those need to be written into UTF-16 buffers and I didn't see a UTF-16 analogue of _SmallUTF8String, so I figured there would be some extra work that I haven't been able to dive into yet. If you already have a fix, I'll just wait for that!
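
The buffer strategy discussed in this thread (try a small fixed-size buffer first, and fall back to a larger allocation only when needed) can be sketched with public API only. The names `writeName` and `makeString` are hypothetical, and `writeName` merely imitates an ICU-style call that reports the full length it needs; the real code in this PR uses internal stdlib entry points such as `_fromASCII` instead of `String(decoding:as:)`.

```swift
// Hypothetical stand-in for an ICU-style C function: it writes as many bytes
// as fit and returns the total length required, like a preflighting call.
func writeName(_ name: String, into buffer: UnsafeMutableBufferPointer<UInt8>) -> Int {
  let bytes = Array(name.utf8)
  for i in 0..<min(bytes.count, buffer.count) {
    buffer[i] = bytes[i]
  }
  return bytes.count
}

// Try a small buffer first; retry once with the exact size if it was too small.
func makeString(
  capacity: Int = 16,
  fill: (UnsafeMutableBufferPointer<UInt8>) -> Int
) -> String {
  var storage = [UInt8](repeating: 0, count: capacity)
  var needed = storage.withUnsafeMutableBufferPointer { fill($0) }
  if needed > capacity {
    storage = [UInt8](repeating: 0, count: needed)
    needed = storage.withUnsafeMutableBufferPointer { fill($0) }
  }
  // Public API used here as an assumption; the PR uses internal initializers.
  return String(decoding: storage[..<needed], as: UTF8.self)
}

let shortName = makeString { writeName("DIGIT ONE", into: $0) }
let longName = makeString(capacity: 4) { writeName("LATIN SMALL LETTER A", into: $0) }
```

The second call deliberately starts with a buffer that is too small, so it exercises the retry path.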

milseman · 2018-03-30T21:16:35Z

stdlib/public/core/UnicodeScalarProperties.swift

+  /// This property is a `String` because some mappings may transform a single
+  /// scalar into multiple scalars. For example, the character "İ" (U+0130
+  /// LATIN CAPITAL LETTER I WITH DOT ABOVE) becomes two scalars (U+0069 LATIN
+  /// SMALL LETTER I, U+0307 COMBINING DOT ABOVE) when converted to lowercase.


And it's not a Character because it can change a single scalar into multiple graphemes.
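
A short illustration of both points, assuming the property names that were eventually merged (`lowercaseMapping` and `uppercaseMapping` on `Unicode.Scalar.Properties`). U+0130 lowercases to two scalars, and U+00DF (LATIN SMALL LETTER SHARP S) uppercases to two separate characters, so neither `Unicode.Scalar` nor `Character` could hold every result.

```swift
// U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
let dottedI: Unicode.Scalar = "\u{0130}"
let lowered = dottedI.properties.lowercaseMapping
// One input scalar, two output scalars: U+0069 followed by U+0307.
let loweredScalars = lowered.unicodeScalars.map { $0.value }

// U+00DF LATIN SMALL LETTER SHARP S
let sharpS: Unicode.Scalar = "\u{00DF}"
let uppered = sharpS.properties.uppercaseMapping
// One input scalar, two output graphemes ("S" and "S").
let upperedCharacterCount = uppered.count
```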

milseman · 2018-03-30T21:17:21Z

stdlib/public/core/UnicodeScalarProperties.swift

+  public var lowercaseMapping: String {
+    let initialCapacity = 1
+
+    var storage = _SwiftStringStorage<UTF16.CodeUnit>.create(


We definitely want to avoid creating storage for these.

milseman · 2018-03-30T21:22:57Z

stdlib/public/core/UnicodeScalarProperties.swift

+
+    internal var _value: __swift_stdlib_UChar32
+    internal var _utf16: (UTF16.CodeUnit, UTF16.CodeUnit)
+    internal var _utf16Length: Int


Encoding a scalar into UTF-16 is pretty trivial and almost certainly not worth storing an extra word or two to avoid re-computing it. It's also not worth doing every time a user writes myScalar.properties.isWhitespace. I suggest computing this lazily.
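
For reference, a minimal sketch of how cheap the on-demand encoding is. The function name `utf16CodeUnits(of:)` is hypothetical and is not the helper used in the PR; it applies the standard UTF-16 surrogate-pair arithmetic.

```swift
// Encode a single scalar into one or two UTF-16 code units on demand.
// Hypothetical helper for illustration only.
func utf16CodeUnits(of scalar: Unicode.Scalar) -> [UInt16] {
  let value = scalar.value
  if value < 0x10000 {
    // Basic Multilingual Plane: the scalar value is the code unit.
    return [UInt16(value)]
  }
  // Supplementary planes: split the 20-bit offset into two surrogates.
  let offset = value - 0x10000
  let high = UInt16(0xD800 + (offset >> 10))
  let low = UInt16(0xDC00 + (offset & 0x3FF))
  return [high, low]
}

let bmpUnits = utf16CodeUnits(of: "A")
let astralUnits = utf16CodeUnits(of: "\u{1F600}")
```

Because the work is a comparison plus a few shifts and masks, recomputing it inside the properties that need it avoids the extra stored fields shown in the diff.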

Sounds good. Since you said above that you're working on a fix for the case mappings, I'll hold off on fixing this until that's ready.

Do you think you could cherry-pick this: milseman@0fdd3d7 ?

I tried to open a PR against this PR, but I couldn't find your repo in the list...

Do you have any tests? I want to make sure I didn't break anything

Thanks! These new case mappings look like they did the trick (I've pushed an update with your changes, and also removed the UTF-16 caching).

I don't have any proper unit tests, I've just been spot-testing with the REPL. I'll get those together next.

swift-ci · 2018-03-30T21:35:11Z

Build comment file:

Compilation-performance test failed

milseman · 2018-04-03T14:09:01Z

@swift-ci please test compiler performance

milseman · 2018-04-03T14:09:16Z

(I measured about a 1% size increase to optimized stdlib binary size)

milseman · 2018-04-03T14:10:40Z

stdlib/public/core/UnicodeScalarProperties.swift

    var err = __swift_stdlib_U_ZERO_ERROR
+    let correctSizeRaw = smallString._withMutableExcessCapacityBytes { ptr in


I believe that this would only allow ICU to store 14 ASCII characters rather than the full 15. See my alternative from the hash I linked.

👍 Updated with your version.

milseman · 2018-04-03T16:44:29Z

@swift-ci please smoke test compiler performance

swift-ci · 2018-04-03T16:54:31Z

Build comment file:

Compilation-performance test failed

allevato · 2018-04-04T07:28:52Z

I'll work on some tests next and get those pushed.

allevato · 2018-04-06T13:32:36Z

I've updated some tests for the properties with more complex logic (like case mappings and age); I plan to cover the remainder shortly.

Any idea what the compiler performance failures are about, since there are no logs here?

I still need to verify that this builds on Linux, so I wonder if that has any relation.

milseman · 2018-04-06T13:35:28Z

@swift-ci please test

swift-ci · 2018-04-06T13:36:00Z

Build failed
Swift Test OS X Platform
Git Sha - 5a50f27

swift-ci · 2018-04-06T13:39:57Z

Build failed
Swift Test Linux Platform
Git Sha - 5a50f27

milseman · 2018-04-06T13:43:07Z

https://ci.swift.org/view/Pull%20Request/job/swift-PR-compiler-performance-smoke-test-macOS/214/consoleFull#-10877886173122a513-f36a-4c87-8ed7-cbc36a1ec144

/Users/buildnode/jenkins/workspace-private/swift-PR-compiler-performance-smoke-test-macOS/master/new/swift/stdlib/public/core/UnicodeScalarProperties.swift:1089:17: error: value of type '_SmallUTF8String' has no member '_withAllUnsafeMutableBytes'
15:50:41     let count = smol._withAllUnsafeMutableBytes { bufPtr -> Int in
15:50:41                 ^~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~

That is... weird if it's on 64-bit, but makes sense on 32-bit. This would need to be #if guarded for 32-bit.

For case-conversion, I think it's better to just form a String and call a method on that. SmallString will support UTF-8 in the future, so that would include all scalars in the small form anyways.

For the name of a scalar, I think it's better to use a fixed size array and call String._fromASCII on that. I can try a patch to this effect later.

allevato · 2018-07-06T04:14:35Z

@milseman I pushed some updates now that SE-0211 has been accepted. numericValue returns nil instead of .nan for non-numerics, I removed the small-string optimizations from _scalarName that were causing 32-bit issues up above, I moved the case mappings back into Unicode.Scalar.Properties with names derived from the full Unicode property name (i.e., var Unicode.Scalar.Properties.lowercaseMapping instead of func Unicode.Scalar.lowercased()).

PTAL and let me know if there's anything I've missed. Thanks!
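
As a quick illustration of the revised `numericValue` behavior described above, the following sketch assumes the API as merged, where the property has type `Double?`. Scalars without a numeric value yield `nil`, so callers can use optional binding rather than checking for NaN.

```swift
let digitFive: Unicode.Scalar = "5"
let oneHalf: Unicode.Scalar = "\u{00BD}"  // VULGAR FRACTION ONE HALF
let letterA: Unicode.Scalar = "A"

let five = digitFive.properties.numericValue    // Optional(5.0)
let half = oneHalf.properties.numericValue      // Optional(0.5)
let none = letterA.properties.numericValue      // nil, not .nan

// Optional binding replaces an isNaN check at the call site.
if let value = half {
  print("numeric value:", value)
}
```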

milseman

LGTM

milseman · 2018-07-06T21:29:45Z

@swift-ci please test

allevato · 2018-07-09T18:36:06Z

Did the tests actually run? I couldn't dig up any results on Jenkins a few hours after you kicked it off.

milseman · 2018-07-09T18:45:44Z

@swift-ci please test

swift-ci · 2018-07-09T20:03:18Z

Build failed
Swift Test Linux Platform
Git Sha - d0e93ac

allevato · 2018-07-09T20:14:44Z

Urgh, that test failure is gross.

UCHAR_EMOJI and the other three properties we expose were added in ICU 57, but as far as I can tell, Ubuntu 16.04 comes with ICU 55. So the return value for those properties can't be relied on there.

The cheap way to fix it is to remove those tests, but that just hides the underlying issue. I can think of a couple options, both kind of awful:

1. Provide a way for users to explicitly check the ICU version that stdlib was built against. But then users also have to be able to discover whether a property is supported by that version, which is a lot of boilerplate code.
2. Add some conditional compilation directives to UnicodeScalarProperties.swift that only define the properties that actually exist in the version of ICU being included. That effectively introduces dialects of the standard library based on what OS you're compiling on, which would be horrible.

What are your thoughts?

milseman · 2018-07-09T21:04:23Z

It's really unfortunate that these APIs aren't available on the oldest Ubuntu LTS. Let me check up on what we can do here. For now, I would conditionally test based either on Darwin or a recent Linux.

edit: You could also make these few specific APIs available on Darwin only for now. We can add Linux support when we figure out a solution and/or Ubuntu gets version-bumped.

allevato · 2018-07-10T01:50:44Z

Sounds good—I've updated the PR to only define the four emoji properties on Darwin OSes.
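
The shape of a Darwin-only definition can be sketched as follows. The type `SketchProperties` and both of its members are hypothetical placeholders, not the code in this PR, and the range check merely stands in for a real ICU query; the point is only that the guarded member does not exist at all when compiling for other platforms.

```swift
// Hypothetical stand-in for Unicode.Scalar.Properties.
struct SketchProperties {
  let scalar: Unicode.Scalar

  // Defined on every platform.
  var isASCII: Bool {
    return scalar.isASCII
  }

#if os(macOS) || os(iOS) || os(tvOS) || os(watchOS)
  // Compiled on Darwin only. Placeholder logic: membership in the
  // Emoticons block (U+1F600...U+1F64F), not a real emoji property.
  var isInEmoticonsBlock: Bool {
    return (0x1F600...0x1F64F).contains(scalar.value)
  }
#endif
}

let sketch = SketchProperties(scalar: "\u{1F600}")
```

Any caller of the guarded member needs a matching `#if`, which is the "dialects" cost raised earlier in the thread.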

milseman · 2018-07-10T22:42:14Z

@swift-ci please test

swift-ci · 2018-07-10T22:43:49Z

Build failed
Swift Test Linux Platform
Git Sha - d0e93ac

swift-ci · 2018-07-10T22:43:59Z

Build failed
Swift Test OS X Platform
Git Sha - d0e93ac

allevato · 2018-07-11T15:38:15Z

Tests look good! Anything left to do before we merge?

gottesmm · 2018-07-11T21:05:54Z

@allevato Great job!

AliSoftware · 2018-07-12T00:11:20Z

stdlib/public/core/UnicodeScalarProperties.swift

+  /// The general category of a scalar is its "first-order, most usual
+  /// categorization". It does not attempt to cover multiple uses of some
+  /// scalars, such as the use of letters to represent Roman numerals.
+  public enum GeneralCategory {


👋
I thought the discussion had agreed on making that enum RawRepresentable (which would make it useful for building regular expressions, for example). Was this idea just lost in the thread, or was it discarded for some reason?

https://forums.swift.org/t/se-0211-add-unicode-properties-to-unicode-scalar/12121/6

Since the proposal was accepted by the core team with the only revision being to the numericValue property, I didn't make any changes from the original proposal other than the one that was requested (and to remove a couple ICU-only properties that accidentally got into the list).

@milseman What do you think? Did the core team have any feelings about that specifically, or was it an idea that just got lost in the discussion thread?

Btw, I don't think it would be that useful to go in the direction of init(rawValue: String) using "Lu", for example, but it would definitely be useful to go in the other direction (.uppercaseLetter => "Lu" or "\\p{Lu}"). So implementing only var description: String for that purpose (instead of making it RawRepresentable where RawValue: StringProtocol) could also be enough.

There are many ways of describing it. There's the short name, the long name, how it might appear in various regex conventions, etc. I don't have any strong opinions here, other than that description and debugDescription probably should choose user-readable names over a particular regex convention.

Long term, Swift's built-in regexes and/or regex packages of various flavors could provide their own Custom[PCRE|POSIX|...]RegexConvertible protocols. These would probably produce types that themselves could be CustomStringConvertible, producing e.g. "\\p{Lu}" or "<:Lu>", etc.
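
The one-directional mapping discussed in this thread (category to short name, and from there to one regex convention) could be sketched as an extension. The property names `shortNameSketch` and `regexClassSketch` are hypothetical and not part of the PR, and only a few categories are filled in; the rest return `nil` for brevity.

```swift
extension Unicode.GeneralCategory {
  // Hypothetical helper: the two-letter General_Category alias.
  // Only a handful of cases are covered in this sketch.
  var shortNameSketch: String? {
    switch self {
    case .uppercaseLetter: return "Lu"
    case .lowercaseLetter: return "Ll"
    case .decimalNumber: return "Nd"
    case .spaceSeparator: return "Zs"
    default: return nil  // remaining categories omitted
    }
  }

  // Hypothetical helper: one possible regex spelling, \p{..} style.
  var regexClassSketch: String? {
    guard let name = shortNameSketch else { return nil }
    return "\\p{\(name)}"
  }
}

let capitalA: Unicode.Scalar = "A"
let category = capitalA.properties.generalCategory
let shortName = category.shortNameSketch
let regexClass = category.regexClassSketch
```

Keeping the regex spelling separate from the short name leaves room for other conventions, in line with the point above that description should not commit to one regex syntax.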

allevato added 9 commits March 19, 2018 13:54

[stdlib] Add binary properties to Unicode.Scalar

6726645

[stdlib] Add "age" to Unicode.Scalar.Properties

d6ee54f

[stdlib] Add "generalCategory" to Unicode.Scalar.Properties

af798fa

[stdlib] Add "name", "nameAlias" to Unicode.Scalar.Properties

9858d4e

[stdlib] Add "{lower,title,upper}caseMapping" to Unicode.Scalar.Properties

e7fa499

[stdlib] Add "canonicalCombiningClass" to Unicode.Scalar.Properties

e7078a4

[stdlib] Add "numeric{Type,Value}" to Unicode.Scalar.Properties

354f2ad

[stdlib] Add "isDefined", "hasNormalizationBoundaryBefore" to Unicode.Scalar.Properties

3a2ad05

[stdlib] Migrate normalization usage to public properties

5a50f27

allevato mentioned this pull request Mar 29, 2018

Proposal: Add Unicode properties to Unicode.Scalar swiftlang/swift-evolution#820

Merged

milseman reviewed Mar 30, 2018

allevato added 5 commits March 31, 2018 09:54

Merge branch 'master' into unicode-properties

fb9f7ec

[stdlib] Fix _scalarName to use small string if possible

5807bb9

[stdlib] Update documentation for case mappings

4a17940

[stdlib] Update case mappings to use small strings if possible

5a7d7d3

[stdlib] Compute scalar's UTF-16 on demand instead of caching

47deadc

milseman reviewed Apr 3, 2018

allevato added 2 commits April 3, 2018 19:20

[stdlib] Rewrite _scalarName to fully use a small string

56d04be

[stdlib] Lift case mappings directly into Unicode.Scalar

f06af77

allevato added 3 commits April 22, 2018 12:01

[stdlib] Revert hasNormalizationBoundaryBefore

54f4c77

This property is too specific in that it forces a particular normalization; let's not expose it this way, but instead in the future with a full normalization API.

Merge branch 'master' into unicode-properties

8eef50f

Various fixes to Unicode.Scalar.Properties.

d0e93ac

- numericValue returns nil instead of .nan for non-numerics
- Remove small-string optimizations from _scalarName that failed on 32-bit archs
- Put case mappings back into U.S.Properties
- Added more sanity tests

allevato changed the title from "[WIP] Add Unicode properties to Unicode.Scalar" to "[SE-0211] Add Unicode properties to Unicode.Scalar" Jul 6, 2018

milseman approved these changes Jul 6, 2018

Make emoji properties Darwin only.

b454e8d

Ubuntu 16.04 doesn't have a recent enough ICU to support these; we need a better long-term solution, such as bundling ICU with the toolchain.

milseman merged commit 3045067 into swiftlang:master Jul 11, 2018

allevato deleted the unicode-properties branch July 11, 2018 20:30

BasThomas mentioned this pull request Jul 11, 2018

[113] Issue #113 - July 12, 2018 SwiftWeekly/swiftweekly.github.io#399

Closed

AliSoftware reviewed Jul 12, 2018

allevato mentioned this pull request Jul 17, 2018

[stdlib] NFC: Unicode.Scalar.Properties documentation fixes #17923

Merged

allevato mentioned this pull request Jul 10, 2018

[SR-6076] [String] var count: String.CharacterView.IndexDistance { get } returns a wrong value on Linux when "Regional Indicator Symbols" are contained. #48631

Closed
