Skip to content

[SE-0211] Add Unicode properties to Unicode.Scalar #15593

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Jul 11, 2018

Conversation

allevato
Copy link
Member

This PR is a work-in-progress for the Unicode.Scalar additions being proposed in this thread. I'm opening the PR now to link to it in the proposal text.

I still need to test these changes on Linux and add some unit tests (and likely some benchmarks) for it. Please feel free to comment on these changes—I'm very interested in that feedback—but let's hold off on running the CI bots to preserve resources until I've completed those tasks.

@milseman
Copy link
Member

The UCD also defines a isNewline property. Do you know if ICU exposes this?

http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt

@allevato
Copy link
Member Author

@milseman:

The UCD also defines a isNewline property. Do you know if ICU exposes this?

This enumeration is exposed via the U_CHAR_WORD_BREAK property. I don't believe this is a "isNewline" property as such (i.e., not a binary property), but just one case in the enumeration.

What's a bit tricky is that the Newline word break type excludes CR and LF, because those are represented as distinct word break types (whereas, conceptually, the word break algorithm in UAX #29 treats them equivalently).

So we have a couple options:

  1. We could expose a computed isNewline property, but we'd have to decide whether we also want to include CR and LF or not; either way, we'd be diverging slightly from the standard.
  2. We can just expose a Unicode.WordBreak enum through a wordBreak property and let users decide how to handle it. That's closest to standard compliance.

WDYT? Given how closely we've adhered to the standard elsewhere, I think it's hard to rationalize doing anything other than 2.

@milseman
Copy link
Member

Oh yuck. This enum seems more like an ICU internal detail than a Unicode definition, so I'd say omit it.

@allevato
Copy link
Member Author

Oops, I have to correct myself: UCHAR_WORD_BREAK returns UWordBreakValues—they had to use a different name because they had already used UWordBreak for their break iterator constants.

It looks like UWordBreakValues corresponds almost to the WordBreakProperty.txt Unicode property definitions defined in Table 3 of UAX #29. The ICU enum is a subset—for example, MidLetter doesn't have a corresponding ICU constant.

I'd have to do some more digging to see why they're not an exact match.

@milseman
Copy link
Member

@swift-ci please test compiler performance

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for a prototype, some small adjustments

let prop = __swift_stdlib_UCHAR_FULL_COMPOSITION_EXCLUSION
return __swift_stdlib_u_hasBinaryProperty(value, prop) != 0
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

) -> String? {
let initialCapacity = 256

var storage = _SwiftStringStorage<UTF8.CodeUnit>.create(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use String._fromUTF8CodeUnitSequence/_fromWellFormedUTF8CodeUnitSequence/_fromASCI/, which will attempt to form a small string if possible. Try to avoid creating a storage class.

You could make a small string and try forming into that, falling back to an Array+_fromASCII otherwise. See e.g. _SmallString._withMutableExcessCapacityBytes. Micro-optimizing further is almost certainly not worth it here.

Copy link
Member

@milseman milseman Mar 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The good news is that this code doesn't compile anymore, so it's a great time to fix it!

/Users/buildnode/jenkins/workspace-private/swift-PR-compiler-performance-macOS/master/new/swift/stdlib/public/core/UnicodeScalarProperties.swift:1091:12: error: argument labels '(_storage:)' do not match any available overloads
16:35:05     return String(_storage: storage)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a fix for case conversion (that uses a stack buffer). I'll try to do _scalarName too and share that

Copy link
Member Author

@allevato allevato Apr 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed _scalarName and pushed it—thanks for the pointers to the new APIs.

I hadn't gotten to case conversion yet since those need to be written into UTF-16 buffers and I didn't see a UTF-16 analogue of _SmallUTF8String, so I figured there would be some extra work that I haven't been able to dive into yet. If you already have a fix, I'll just wait for that!

/// This property is a `String` because some mappings may transform a single
/// scalar into multiple scalars. For example, the character "İ" (U+0130
/// LATIN CAPITAL LETTER I WITH DOT ABOVE) becomes two scalars (U+0069 LATIN
/// SMALL LETTER I, U+0307 COMBINING DOT ABOVE) when converted to lowercase.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it's not a Character because it can change a single scalar into multiple graphemes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

public var lowercaseMapping: String {
let initialCapacity = 1

var storage = _SwiftStringStorage<UTF16.CodeUnit>.create(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely want to avoid creating storage for these.


internal var _value: __swift_stdlib_UChar32
internal var _utf16: (UTF16.CodeUnit, UTF16.CodeUnit)
internal var _utf16Length: Int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encoding a scalar into UTF-16 is pretty trivial and almost certainly not worth storing an extra word or two to avoid re-computing it. It's also not worth doing every time a user writes myScalar.properties.isWhitespace. I suggest computing this lazily.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Since you said above that you're working on a fix for the case mappings, I'll hold off on fixing this until that's ready.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you could cherry-pick this: milseman@0fdd3d7 ?

I tried to open a PR against this PR, but I couldn't find your repo in the list...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any tests? I want to make sure I didn't break anything

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! These new case mappings look like they did the trick (I've pushed an update with your changes, and also removed the UTF-16 caching).

I don't have any proper unit tests, I've just been spot-testing with the REPL. I'll get those together next.

@swift-ci
Copy link
Contributor

Build comment file:

Compilation-performance test failed

@milseman
Copy link
Member

milseman commented Apr 3, 2018

@swift-ci please test compiler performance

@milseman
Copy link
Member

milseman commented Apr 3, 2018

(I measured about a 1% size increase to optimized stdlib binary size)

var err = __swift_stdlib_U_ZERO_ERROR
let correctSizeRaw = smallString._withMutableExcessCapacityBytes { ptr in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this would only allow ICU to store 14 ASCII characters rather than the full 15. See my alternative from the hash I linked

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Updated with your version.

@milseman
Copy link
Member

milseman commented Apr 3, 2018

@swift-ci please smoke test compiler performance

@swift-ci
Copy link
Contributor

swift-ci commented Apr 3, 2018

Build comment file:

Compilation-performance test failed

@allevato
Copy link
Member Author

allevato commented Apr 4, 2018

I'll work on some tests next and get those pushed.

@allevato
Copy link
Member Author

allevato commented Apr 6, 2018

I've updated some tests for the properties with more complex logic (like case mappings and age); I plan to cover the remainder shortly.

Any idea what the compiler performance failures are about, since there are no logs here?

I still need to verify that this builds on Linux, so I wonder if that has any relation.

@milseman
Copy link
Member

milseman commented Apr 6, 2018

@swift-ci please test

@swift-ci
Copy link
Contributor

swift-ci commented Apr 6, 2018

Build failed
Swift Test OS X Platform
Git Sha - 5a50f27

@swift-ci
Copy link
Contributor

swift-ci commented Apr 6, 2018

Build failed
Swift Test Linux Platform
Git Sha - 5a50f27

@milseman
Copy link
Member

milseman commented Apr 6, 2018

https://ci.swift.org/view/Pull%20Request/job/swift-PR-compiler-performance-smoke-test-macOS/214/consoleFull#-10877886173122a513-f36a-4c87-8ed7-cbc36a1ec144

/Users/buildnode/jenkins/workspace-private/swift-PR-compiler-performance-smoke-test-macOS/master/new/swift/stdlib/public/core/UnicodeScalarProperties.swift:1089:17: error: value of type '_SmallUTF8String' has no member '_withAllUnsafeMutableBytes'
15:50:41     let count = smol._withAllUnsafeMutableBytes { bufPtr -> Int in
15:50:41                 ^~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~

That is... weird if it's on 64-bit, but makes sense on 32-bit. This would need to be #if guarded for 32-bit.

For case-conversion, I think it's better to just form a String and call a method on that. SmallString will support UTF-8 in the future, so that would include all scalars in the small form anyways.

For the name of a scalar, I think it's better to use a fixed size array and call String._fromASCII on that. I can try a patch to this effect later.

This property is too specific in that it forces a particular normalization; let's not expose it this way, but instead in the future with a full normalization API.
- numericValue returns nil instead of .nan for non-numerics
- Remove small-string optimizations from _scalarName that failed on 32-bit archs
- Put case mappings back into U.S.Properties
- Added more sanity tests
@allevato allevato changed the title [WIP] Add Unicode properties to Unicode.Scalar [SE-0211] Add Unicode properties to Unicode.Scalar Jul 6, 2018
@allevato
Copy link
Member Author

allevato commented Jul 6, 2018

@milseman I pushed some updates now that SE-0211 has been accepted. numericValue returns nil instead of .nan for non-numerics, I removed the small-string optimizations from _scalarName that were causing 32-bit issues up above, I moved the case mappings back into Unicode.Scalar.Properties with names derived from the full Unicode property name (i.e., var Unicode.Scalar.Properties.lowercaseMapping instead of func Unicode.Scalar.lowercased()).

PTAL and let me know if there's anything I've missed. Thanks!

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@milseman
Copy link
Member

milseman commented Jul 6, 2018

@swift-ci please test

@allevato
Copy link
Member Author

allevato commented Jul 9, 2018

Did the tests actually run? I couldn't dig up any results on Jenkins a few hours after you kicked it off.

@milseman
Copy link
Member

milseman commented Jul 9, 2018

@swift-ci please test

@swift-ci
Copy link
Contributor

swift-ci commented Jul 9, 2018

Build failed
Swift Test Linux Platform
Git Sha - d0e93ac

@allevato
Copy link
Member Author

allevato commented Jul 9, 2018

Urgh, that test failure is gross.

UCHAR_EMOJI and the other three properties we expose were added in ICU 57, but as far as I can tell, Ubuntu 16.04 comes with ICU 55. So the return value for those properties can't be relied on there.

The cheap way to fix it is to remove those tests, but that just hides the underlying issue. I can think of a couple options, both kind of awful:

  1. Provide a way for users to either explicitly check the ICU version that stdlib was built against, but then users have to also be able to discover if a property is supported by that version. That's a lot of boilerplate code.

  2. Add some conditional compilation directives to UnicodeScalarProperties.swift that only defines properties that actually exist based on which version of ICU is being included. That effectively introduces dialects of the standard library based on what OS you're compiling on, which would be horrible.

What are your thoughts?

@milseman
Copy link
Member

milseman commented Jul 9, 2018

It's really unfortunate that these APIs aren't available on the oldest Ubuntu LTS. Let me check up on what we can do here. For now, I would conditionally test based either on Darwin or a recent Linux.

edit: You could also make these few specific APIs available on Darwin only for now. We can add Linux support when we figure out a solution and/or Ubuntu gets version-bumped.

Ubuntu 16.04 doesn't have a recent enough ICU to support these; we need a better long-term solution, such as bundling ICU with the toolchain.
@allevato
Copy link
Member Author

Sounds good—I've updated the PR to only define the four emoji properties on Darwin OSes.

@milseman
Copy link
Member

@swift-ci please test

@swift-ci
Copy link
Contributor

Build failed
Swift Test Linux Platform
Git Sha - d0e93ac

@swift-ci
Copy link
Contributor

Build failed
Swift Test OS X Platform
Git Sha - d0e93ac

@allevato
Copy link
Member Author

Tests look good! Anything left to do before we merge?

@milseman milseman merged commit 3045067 into swiftlang:master Jul 11, 2018
@allevato allevato deleted the unicode-properties branch July 11, 2018 20:30
@gottesmm
Copy link
Contributor

@allevato Great job!

/// The general category of a scalar is its "first-order, most usual
/// categorization". It does not attempt to cover multiple uses of some
/// scalars, such as the use of letters to represent Roman numerals.
public enum GeneralCategory {
Copy link
Contributor

@AliSoftware AliSoftware Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋
I thought the discussion agreed on making that enum RawRepresentable (which would make it useful for building regular expressions for example), was this idea just lost in the thread or discarded for any reason?

https://forums.swift.org/t/se-0211-add-unicode-properties-to-unicode-scalar/12121/6

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the proposal was accepted by the core team with the only revision being to the numericValue property, I didn't make any changes from the original proposal other than the one that was requested (and to remove a couple ICU-only properties that accidentally got into the list).

@milseman What do you think? Did the core team have any feelings about that specifically, or was it just an idea that just got lost in the discussion thread?

Copy link
Contributor

@AliSoftware AliSoftware Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw I don't think it would be that useful to be able to do in the direction of init(rawValue: String) using "Lu" for example, but would definitely be useful to go in the other direction (.uppercaseLetter => "Lu" or "\\p{Lu}"). So only implementing var description: String for that purpose (instead of making it RawRepresentable where RawValue: StringProtocol) could also be enough

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's many ways of describing it. There's the short name, the long name, how it might appear in various regex conventions, etc. I don't have any strong opinions here, other than description and debugDescription probably should choose user-readable names over a particular regex convention.

Long term, Swift built in regexes and/or regex packages of various flavors could provide their Custom[PCRE|POSIX|...]RegexConvertible protocols. The would probably produces types that themselves could be CustomStringConvertible, producing e.g. "\\p{Lu}" or "<:Lu>", etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants