[stdlib] Implement String.WordView #42414

Azoy · 2022-04-16T23:25:04Z

No description provided.

Azoy · 2022-05-09T18:36:04Z

@swift-ci please smoke test

lorentey

This looks good! It would be useful if we could make this public instead of landing it as SPI.

Important:

* We need a test that applies every index from every string view to the new view APIs, as in StringIndex/Fully exhaustive index interchange. (Which will also need to be extended to test word view indices on all the other views.) This is crucial -- the index validation code will not work correctly unless we exercise every case in the tests.
* _WordView must not use Slice as its SubSequence -- it needs a custom slice type. As in the character case, we must limit word breaking to remain within the slice. (And because we'll want/need the freedom to tweak implementations as we like.)
* We'll also need a Substring.words property. (Returning the word view's SubSequence.)
* It might make sense for the word view to be RangeReplaceable. (Although using Substring as the Element type may make that less useful than it could be.)
  * Related: Should this just use String as its element type?

lorentey · 2022-05-10T01:52:04Z

stdlib/public/core/StringIndexValidation.swift

+      return i
+    }
+
+    return roundDownToNearestCharacter(


I think this should read

Suggested change

-    return roundDownToNearestCharacter(
+    return roundDownToNearestWord(

lorentey · 2022-05-10T01:57:59Z

stdlib/public/core/StringWordBreaking.swift

+  @inline(never)
+  @available(SwiftStdlib 5.7, *)
+  internal func _slowRoundDownToNearestWord(_ i: String.Index) -> String.Index {
+    let words = String._WordView(self)


This is a small layering violation of the sort that tends to result in subtle infinite recursion problems as the stdlib evolves. (What tends to happen is that someone adds a path to a word view method that somehow gets here, then proceeds to call back to the same entry point.)

It might be worth moving most/all of the word view's _uncheckedIndex(before/after:) implementations down to StringGuts instead. (Alternatively, we could try moving these up to the word view.)

lorentey · 2022-05-10T02:18:50Z

stdlib/public/core/StringWordView.swift

+extension String {
+  @_spi(_Unicode)
+  @available(SwiftStdlib 5.7, *)
+  public struct _WordView {


Hm, it may be worth considering putting this through Swift Evolution and making it fully public in this release. I know we have too many proposals in flight already, but it feels bad to use up an ABI-stable bit in String.Index for a feature that isn't public. (E.g., we don't need that bit unless there is a risk someone will actually feed invalid indices to the word view. Since we are in full control over SPI use sites, we could instead require them to explicitly call a special conversion/rounding method.)

Do you want to work with Alejandro on this? We also had a few minor improvements to String that we wanted to land, but it's pushing it already.

E.g. a String.init(utf8: Collection<UInt8>) would be a huge discoverability win and it could also convert to NFC. If you want exact-scalar-sequence preservation, that's what the decoding argument label is for.

If we can't get it through SE, is there any reason this core functionality can't be a few SPI functions rather than a whole type? (Not sure if that makes a difference)

Yep, we can also just have methods for advancing an index to the next/previous word boundary

lorentey · 2022-05-10T02:27:12Z

stdlib/public/core/StringWordView.swift

+  @available(SwiftStdlib 5.7, *)
+  public subscript(position: Index) -> Element {
+    let position = _guts.validateWordIndex(position)
+    let indexAfter = _uncheckedIndex(after: position)


It would be useful to know if caching the word size in the index (like we do in the character views) would have a performance benefit here, and if so, how much it matters.

It's usually more important to implement Iterator. But this index could store a range for the word, if that makes sense. (I haven't thought it through.)
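
To make the range-in-index idea above concrete, here is a minimal sketch. It is not the PR's implementation: `ToyWordView` and its toy breaking rule (a run of letters or digits versus a run of anything else) are assumptions made purely for illustration, and real Unicode word breaking is considerably more involved. Because each index carries the bounds of its word, the subscript does no scanning of its own.

```swift
// Hypothetical toy word view. The breaking rule here is NOT Unicode word
// breaking: a "word" is a maximal run of letters/digits, or a maximal run of
// everything else.
struct ToyWordView: Collection {
  struct Index: Comparable {
    // Bounds of the word starting at this position (empty at endIndex).
    let range: Range<String.Index>

    static func < (a: Index, b: Index) -> Bool {
      a.range.lowerBound < b.range.lowerBound
    }
    static func == (a: Index, b: Index) -> Bool {
      a.range.lowerBound == b.range.lowerBound
    }
  }

  let base: String

  // Scans forward once from `start` and records where the word ends.
  private func word(startingAt start: String.Index) -> Index {
    guard start < base.endIndex else { return Index(range: start..<start) }
    let isWord = base[start].isLetter || base[start].isNumber
    var end = base.index(after: start)
    while end < base.endIndex,
          (base[end].isLetter || base[end].isNumber) == isWord {
      end = base.index(after: end)
    }
    return Index(range: start..<end)
  }

  var startIndex: Index { word(startingAt: base.startIndex) }
  var endIndex: Index { Index(range: base.endIndex..<base.endIndex) }

  func index(after i: Index) -> Index { word(startingAt: i.range.upperBound) }

  // No scanning here: the index already knows the word's extent.
  subscript(position: Index) -> Substring { base[position.range] }
}

let words = Array(ToyWordView(base: "hello, world"))
// words contains "hello", ", ", "world"
```

The sketch keeps the default Slice SubSequence only for brevity. Whether caching the range pays off in practice would still need to be measured, as the comment above asks.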

lorentey · 2022-05-10T02:29:18Z

stdlib/public/core/StringWordView.swift

+extension String {
+  @_spi(_Unicode)
+  @available(SwiftStdlib 5.7, *)
+  public func _isOnWordBoundary(_ i: String.Index) -> Bool {


What's the intended use case for this? (We won't be calling it in a loop for every index in some string, right?)

Regex's \b zero-width assertion maps directly to this.

OK, so is \b going to call this on every index then?

I'm asking because rounding down operates by doing an index(before:) + index(after:) dance, which (when done repeatedly) is going to be significantly slower than memoizing where the nearest word boundaries are.

> OK, so is \b going to call this on every index then?

I mean, it's going to do what the regex algorithm requests that it does. By itself, no: it will only call it once at the current position, resulting in success or failure. If it's the first candidate in an alternation like /(?:\b|.)*/, then it will get called until it succeeds, which is what the algorithm is specifying.

My point is that this entry point laboriously calculates the nearest boundary only to throw it away immediately after comparing it to the passed-in index.

Replacing this with an entry point that rounds an index down (or up) to the nearest word boundary would allow us to memoize the results in the regex library, and therefore avoid needless repetitions of all this work for successive indices within the same word. (We would be far better off if we could amortize the cost of _guts.roundDownToNearestWord(i) across multiple invocations, rather than repeating it.)

Of course, if this is only a temporary implementation, and we'll be able to replace it with an O(1) variant later, then this would still make a plausible entry point.
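
The amortization argument above can be sketched as follows. Everything here is hypothetical: `WordBoundaryCache`, `roundDownToWordBoundary`, and `nextWordBoundary` are illustration-only names, the breaking rule is a toy (letter/digit runs versus everything else) rather than Unicode word breaking, and indices are assumed to be character-aligned. The point is only that a caller who gets a rounding entry point can remember the enclosing word and answer later boundary queries inside it without rescanning.

```swift
// Toy classification; NOT the Unicode word-breaking rules.
func isWordCharacter(_ c: Character) -> Bool { c.isLetter || c.isNumber }

// Hypothetical rounding entry point: nearest toy boundary at or before `i`.
func roundDownToWordBoundary(_ i: String.Index, in s: String) -> String.Index {
  guard i > s.startIndex, i < s.endIndex else { return i }
  var j = i
  while j > s.startIndex {
    let prev = s.index(before: j)
    if isWordCharacter(s[prev]) != isWordCharacter(s[j]) { break }
    j = prev
  }
  return j
}

// Hypothetical: first toy boundary strictly after `i`.
func nextWordBoundary(after i: String.Index, in s: String) -> String.Index {
  precondition(i < s.endIndex)
  var j = s.index(after: i)
  while j < s.endIndex,
        isWordCharacter(s[j]) == isWordCharacter(s[s.index(before: j)]) {
    j = s.index(after: j)
  }
  return j
}

// Remembers the most recently found word so that successive queries inside
// the same word do not repeat the scan.
struct WordBoundaryCache {
  let string: String
  private var current: Range<String.Index>? = nil
  private(set) var scans = 0

  init(_ string: String) { self.string = string }

  mutating func isOnBoundary(_ i: String.Index) -> Bool {
    if i == string.startIndex || i == string.endIndex { return true }
    if let r = current, r.contains(i) || i == r.upperBound {
      return i == r.lowerBound || i == r.upperBound
    }
    scans += 1
    let lower = roundDownToWordBoundary(i, in: string)
    let upper = nextWordBoundary(after: lower, in: string)
    current = lower..<upper
    return i == lower
  }
}

var cache = WordBoundaryCache("hello, world")
var boundaries = 0
var i = cache.string.startIndex
while true {
  if cache.isOnBoundary(i) { boundaries += 1 }
  if i == cache.string.endIndex { break }
  i = cache.string.index(after: i)
}
// boundaries == 4 and cache.scans == 3: one scan per toy word rather than
// one per interior index.
```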

milseman · 2022-05-10T14:05:30Z

> _WordView must not use Slice as its SubSequence -- it needs a custom slice type. As in the character case, we must limit word breaking to remain within the slice. (And because we'll want/need the freedom to tweak implementations as we like.)

Why?

> It might make sense for the word view to be RangeReplaceable. (Although using Substring as the Element type may make that less useful than it could be.)

Is the idea that it would preserve the existing separators? That's an interesting idea because trying to re-join them with a space (Haskell's unwords) isn't content-preserving.

milseman · 2022-05-10T14:21:02Z

stdlib/public/core/StringIndex.swift

+
+    If set, the index is known to be on a Unicode word boundary.
+    (Introduced in Swift 5.7)
+


How important is this bit? It feels a little off to be trying to add this to every index, especially since the word view can have its own index type.

Azoy · 2022-06-05T22:16:42Z

@swift-ci please test

Azoy · 2022-06-06T00:30:00Z

@swift-ci please test

Azoy · 2022-06-07T17:15:53Z

@swift-ci please test

Azoy · 2022-06-07T21:32:03Z

@swift-ci please test macOS

Azoy · 2022-06-08T14:29:22Z

@swift-ci please test macOS

milseman · 2022-06-09T17:49:30Z

stdlib/public/core/StringWordView.swift

+  // Should this be:
+  //    var words: WordView
+  // or perhaps
+  //    func words(_ level: ...) -> some BidirectionalCollection<Substring>


Using an opaque result type for this would make it impossible to integrate this new collection type into the existing String design in any meaningful sense.

E.g., it would mean that we'd be giving up on ever being able to add methods to convert between String.Index and the custom Index type of this collection (assuming that we actually want a custom index type for this thing, which I continue to be skeptical about). To me it seems like a reasonable expectation that the stdlib would provide a way to quickly find the index of a word that contains a particular character or scalar index.

The stdlib ought to be a cohesive library with well-integrated parts, not just a disjoint collection of independent components.

> Using an opaque result type for this would make it impossible to integrate this new collection type into the existing String design in any meaningful sense.

Exactly

> Exactly

What do you mean? Could you elaborate please?

lorentey · 2022-06-09T20:12:01Z

> _WordView must not use Slice as its SubSequence -- it needs a custom slice type. As in the character case, we must limit word breaking to remain within the slice. (And because we'll want/need the freedom to tweak implementations as we like.)
> Why?

Slice does not support cases where its startIndex and/or endIndex aren't reachable indices in the base collection, and it makes undocumented & unfounded assumptions about RangeReplaceable mutations preserving indices preceding the mutated range. The latter problem hopefully isn't relevant for WordView, but the former seems like a thing.

In any case, having a custom SubSequence seems like a good idea in general, for the flexibility it gives us about customizing each individual operation.
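
A small sketch of the reachability problem described above, under stated assumptions: `toyWords(in:)` is a hypothetical helper and its breaking rule (letter/digit runs versus everything else) is a toy, not Unicode word breaking. It shows a substring whose bounds are not word boundaries of the base string, so the words of the slice have to be computed within the slice rather than taken from the base.

```swift
// Toy classification; NOT the Unicode word-breaking rules.
func isWordCharacter(_ c: Character) -> Bool { c.isLetter || c.isNumber }

// Hypothetical helper: splits `slice` into toy words without ever looking at
// characters outside the slice's own bounds.
func toyWords(in slice: Substring) -> [Substring] {
  var result: [Substring] = []
  var start = slice.startIndex
  while start < slice.endIndex {
    var end = slice.index(after: start)
    while end < slice.endIndex,
          isWordCharacter(slice[end]) == isWordCharacter(slice[start]) {
      end = slice.index(after: end)
    }
    result.append(slice[start..<end])
    start = end
  }
  return result
}

let text = "hello world"
let lower = text.index(text.startIndex, offsetBy: 2)
let upper = text.index(text.startIndex, offsetBy: 8)
let slice = text[lower..<upper]          // "llo wo"

let whole = toyWords(in: text[...])      // "hello", " ", "world"
let part = toyWords(in: slice)           // "llo", " ", "wo"

// Word boundaries of the base string. Neither bound of `slice` is among them,
// so a generic slice over the base's word indices could not represent `slice`.
let baseBoundaries = whole.map(\.startIndex) + [text.endIndex]
```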

lorentey · 2022-06-09T20:24:40Z

stdlib/public/core/StringWordBreaking.swift

+
+        guard $0 > 0 else {
+          return nil
+        }


I fully expect we'll need to have a separate entry point for slices, where we compare against the slice's start index.

lorentey · 2022-06-09T20:33:41Z

stdlib/public/core/StringWordView.swift

+extension String {
+  @_spi(_Unicode)
+  @available(SwiftStdlib 5.7, *)
+  public func _isOnWordBoundary(_ i: String.Index) -> Bool {


OK, so is \b going to call this on every index then?

I'm asking because rounding down operates by doing an index(before:) + index(after:) dance, which (when done repeatedly) is going to be significantly slower than memoizing where the nearest word boundaries are.

lorentey · 2022-06-09T20:42:37Z

stdlib/public/core/StringWordView.swift

+  // Should this be:
+  //    var words: WordView
+  // or perhaps
+  //    func words(_ level: ...) -> some BidirectionalCollection<Substring>


Using an opaque result type for this would make it impossible to integrate this new collection type into the existing String design in any meaningful sense.

E.g., it would mean that we'd be giving up on ever being able to add methods to convert between String.Index and the custom Index type of this collection (assuming that we actually want a custom index type for this thing, which I continue to be skeptical about). To me it seems like a reasonable expectation that the stdlib would provide a way to quickly find the index of a word that contains a particular character or scalar index.

The stdlib ought to be a cohesive library with well-integrated parts, not just a disjoint collection of independent components.

Azoy · 2022-06-15T17:04:45Z

@swift-ci please test

Azoy · 2022-06-17T16:49:46Z

@swift-ci please test

lorentey

AIUI, this PR only introduces the String._isOnWordBoundary and String._words() entry points, neither of which need us to define a _WordView in the stdlib code base.

So let's remove _WordView.

As noted above, I have serious doubts that _isOnWordBoundary is the right interface for finding word boundaries in a string -- it does a significant amount of work that we would be much better off amortizing over successive invocations.

Instead of _words() and _isOnWordBoundary, why not e.g. expose entry points to go from an arbitrary string index to the previous/next word boundary within the string?
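
As a rough illustration of what such index-based entry points could look like from the caller's side, here is a sketch. The method names `toyWordIndex(after:)` and `toyWordIndex(before:)` are hypothetical, and the breaking rule is a toy (letter/digit runs versus everything else), not the Unicode algorithm the stdlib implements. Input indices are assumed to be character-aligned.

```swift
extension String {
  // Toy classification; NOT the Unicode word-breaking rules.
  private func toyIsWordCharacter(at i: Index) -> Bool {
    self[i].isLetter || self[i].isNumber
  }

  // Hypothetical: first toy word boundary strictly after `i`.
  func toyWordIndex(after i: Index) -> Index {
    precondition(i < endIndex, "cannot advance past endIndex")
    var j = index(after: i)
    while j < endIndex,
          toyIsWordCharacter(at: j) == toyIsWordCharacter(at: index(before: j)) {
      j = index(after: j)
    }
    return j
  }

  // Hypothetical: last toy word boundary strictly before `i`.
  func toyWordIndex(before i: Index) -> Index {
    precondition(i > startIndex, "cannot move before startIndex")
    var j = index(before: i)
    while j > startIndex,
          toyIsWordCharacter(at: j) == toyIsWordCharacter(at: index(before: j)) {
      j = index(before: j)
    }
    return j
  }
}

// Walking a string word by word needs only the forward entry point.
let text = "hello, world"
var pieces: [Substring] = []
var i = text.startIndex
while i < text.endIndex {
  let next = text.toyWordIndex(after: i)
  pieces.append(text[i..<next])
  i = next
}
// pieces contains "hello", ", ", "world"
```

With entry points of this shape, a boundary check is just a comparison against a rounded index, and the caller decides whether to keep the rounded result around.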

Azoy · 2022-06-21T17:43:36Z

@swift-ci please test

lorentey

Looks good! I noted an index validation error -- let's not make that particular mistake again.

lorentey · 2022-06-21T20:21:00Z

stdlib/public/core/StringWordBreaking.swift

+
+    if offset == 0 || offset == count {
+      return i
+    }


The roundDownToNearestWord/_slowRoundDownToNearestWord split is left over from the version where we had performance flags for word boundaries in String.Index. Now that we don't have them, this scheme doesn't seem particularly useful, as we'll almost always fall into the "slow" case.

(Feel free to leave it in place if for some reason you think this is worth keeping, though.)

Azoy · 2022-06-21T23:25:00Z

@swift-ci please test

Azoy · 2022-06-22T04:15:01Z

@swift-ci please test

* Implement String.WordView
* Add isWordAligned bit
* Hide WordView for now (also separate Index type)
  add bidirectional conformance
  Fix tests
* Address comments from Karoy and Michael
* Remove word view, use index methods
* Address Karoy's comments
  aaa

Azoy requested review from milseman and lorentey April 16, 2022 23:25

Azoy force-pushed the string-word-view branch from 2748380 to 0ca2eae Compare May 9, 2022 18:35

lorentey approved these changes May 10, 2022

milseman reviewed May 10, 2022

Azoy added 2 commits June 5, 2022 13:33

Implement String.WordView

3b4b475

Add isWordAligned bit

ec900f9

Azoy force-pushed the string-word-view branch from 0ca2eae to c25513e Compare June 5, 2022 21:28

Azoy force-pushed the string-word-view branch from c25513e to f96eec0 Compare June 6, 2022 00:29

Hide WordView for now (also separate Index type)

b9c94df

add bidirectional conformance Fix tests

Azoy force-pushed the string-word-view branch from f96eec0 to b9c94df Compare June 7, 2022 17:15

milseman reviewed Jun 9, 2022

lorentey reviewed Jun 9, 2022

Address comments from Karoy and Michael

45f1aec

Azoy marked this pull request as ready for review June 15, 2022 17:00

milseman approved these changes Jun 16, 2022

lorentey requested changes Jun 17, 2022

Remove word view, use index methods

b555c43

milseman approved these changes Jun 21, 2022

lorentey approved these changes Jun 21, 2022

lorentey approved these changes Jun 22, 2022

Address Karoy's comments

32d8a63

aaa

Azoy force-pushed the string-word-view branch from 676ff8f to 32d8a63 Compare June 22, 2022 04:14

Azoy merged commit 95da55b into swiftlang:main Jun 22, 2022

Azoy deleted the string-word-view branch June 22, 2022 16:10

Azoy mentioned this pull request Jun 29, 2022

[5.7] [stdlib] Implement String.WordView #59793

Merged
