Skip to content

[stdlib] Implement String.WordView #42414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 22, 2022
Merged

[stdlib] Implement String.WordView #42414

merged 6 commits into from
Jun 22, 2022

Conversation

Azoy
Copy link
Contributor

@Azoy Azoy commented Apr 16, 2022

No description provided.

@Azoy Azoy requested review from milseman and lorentey April 16, 2022 23:25
@Azoy Azoy force-pushed the string-word-view branch from 2748380 to 0ca2eae Compare May 9, 2022 18:35
@Azoy
Copy link
Contributor Author

Azoy commented May 9, 2022

@swift-ci please smoke test

Copy link
Member

@lorentey lorentey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! It would be useful if we could make this public instead of landing it as SPI.

Important:

  1. We need a test that applies every index from every string view to the new view APIs, as in StringIndex/Fully exhaustive index interchange. (Which will also need to be extended to test word view indices on all the other views.) This is crucial -- the index validation code will not work correctly unless we exercise every case in the tests.
  2. _WordView must not use Slice as its SubSequence -- it needs a custom slice type. As in the character case, we must limit word breaking to remain within the slice. (And because we'll want/need the freedom to tweak implementations as we like.)
  3. We'll also need a Substring.words property. (Returning the word view's SubSequence.)
  4. It might make sense for the word view to be RangeReplaceable. (Although using Substring as the Element type may make that less useful than it could be.)
  5. Related: Should this just use String as its element type?

return i
}

return roundDownToNearestCharacter(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should read

Suggested change
return roundDownToNearestCharacter(
return roundDownToNearestWord(

@inline(never)
@available(SwiftStdlib 5.7, *)
internal func _slowRoundDownToNearestWord(_ i: String.Index) -> String.Index {
let words = String._WordView(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a small layering violation of the sort that tends to result in subtle infinite recursion problems as the stdlib evolves. (What tends to happen is that someone adds a path to a word view method that somehow gets here, then proceeds to call back to the same entry point.)

It might be worth moving most/all of the word view's _uncheckedIndex(before/after:) implementations down to StringGuts instead. (Alternatively, we could try moving these up to the word view.)

extension String {
@_spi(_Unicode)
@available(SwiftStdlib 5.7, *)
public struct _WordView {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it may be worth considering putting this through Swift Evolution and making it fully public in this release. I know we have too many proposals in flight already, but it feels bad to use up an ABI stable bit in String.Index for a feature that isn't public. (E.g., we don't need that bit unless there is a risk someone will actually feed invalid indices to the word view. Since we are in full control over SPI use sites, we could also choose to rather require them to explicitly call a special conversion/rounding method instead.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to work with Alejandro on this? We had a few minor improvements to String as well we wanted to land but it's pushing it already.

E.g. a String.init(utf8: Collection<UInt8>) would be a huge discoverability win and it could also convert to NFC. If you want exact-scalar-sequence preservation, that's what the decoding argument label is for.

Copy link
Member

@milseman milseman May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't get it through SE, is there any reason this core functionality can't be a few SPI functions rather than a whole type? (Not sure if that makes a difference)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, we can also just have methods for advancing an index to the next/previous word boundary

@available(SwiftStdlib 5.7, *)
public subscript(position: Index) -> Element {
let position = _guts.validateWordIndex(position)
let indexAfter = _uncheckedIndex(after: position)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to know if caching the word size in the index (like we do in the character views) would have a performance benefit here, and if so, how much it matters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually more important to implement Iterator. But this index could store a range for the word, if that makes sense. (Haven't thought through it).

extension String {
@_spi(_Unicode)
@available(SwiftStdlib 5.7, *)
public func _isOnWordBoundary(_ i: String.Index) -> Bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the intended use case for this? (We won't be calling it in a loop for every index in some string, right?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regex's \b zero-width assertion maps directly to this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so is \b going to call this on every index then?

I'm asking because rounding down operates by doing an index(before:) + index(after:) dance, which (when done repeatedly) is going to be significantly slower than memoizing where the nearest word boundaries are.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so is \b going to call this on every index then?

I mean, it's going to do what the regex algorithm requests that it does. By itself, no it will only call it once at the current position resulting in success or failure. If it's the first candidate in an alternation like /(?:\b|.)*/ then it will get called until it succeeds, which is what the algorithm is specifying.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that this entry point laboriously calculates the nearest boundary only to throw it away immediately after comparing it to the passed-in index.

Replacing this with an entry point that rounds an index down (or up) to the nearest word boundary would allow us to memoize the results in the regex library, and therefore avoid needless repetitions of all this work for successive indices within the same word. (We would be far better off if we could amortize the cost of _guts.roundDownToNearestWord(i) across multiple invocations, rather than repeating it.)

Of course, if this is only a temporary implementation, and we'll be able to replace it with an O(1) variant later, then this would still make a plausible entry point.

@milseman
Copy link
Member

_WordView must not use Slice as its SubSequence -- it needs a custom slice type. As in the character case, we must limit word breaking to remain within the slice. (And because we'll want/need the freedom to tweak implementations as we like.)

Why?

It might make sense for the word view to be RangeReplaceable. (Although using Substring as the Element type may make that less useful than it could be.)

Is the idea that it would preserve the existing separators? That's an interesting idea because trying to re-join them with a space (Haskell's unwords) isn't content-preserving.


If set, the index is known to be on a Unicode word boundary.
(Introduced in Swift 5.7)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How important is this bit? It feels a little off to be trying to add this to every index, especially since the word view can have its own index type.

@Azoy Azoy force-pushed the string-word-view branch from 0ca2eae to c25513e Compare June 5, 2022 21:28
@Azoy
Copy link
Contributor Author

Azoy commented Jun 5, 2022

@swift-ci please test

@Azoy Azoy force-pushed the string-word-view branch from c25513e to f96eec0 Compare June 6, 2022 00:29
@Azoy
Copy link
Contributor Author

Azoy commented Jun 6, 2022

@swift-ci please test

add bidirectional conformance

Fix tests
@Azoy Azoy force-pushed the string-word-view branch from f96eec0 to b9c94df Compare June 7, 2022 17:15
@Azoy
Copy link
Contributor Author

Azoy commented Jun 7, 2022

@swift-ci please test

@Azoy
Copy link
Contributor Author

Azoy commented Jun 7, 2022

@swift-ci please test macOS

1 similar comment
@Azoy
Copy link
Contributor Author

Azoy commented Jun 8, 2022

@swift-ci please test macOS

// Should this be:
// var words: WordView
// or perhaps
// func words(_ level: ...) -> some BidirectionalCollection<Substring>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

☝️

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an opaque result type for this would make it impossible to integrate this new collection type into the existing String design in any meaningful sense.

E.g., it would mean that we'd be giving up on ever being able to add methods to convert between String.Index and the custom Index type of this collection (assuming that we actually want a custom index type for this thing, which I continue to be skeptical about). To me it seems like a reasonable expectation that the stdlib would provide a way to quickly find the index of a word that contains a particular character or scalar index.

The stdlib ought to be a cohesive library with well-integrated parts, not just a disjoint collection of independent components.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an opaque result type for this would make it impossible to integrate this new collection type into the existing String design in any meaningful sense.

Exactly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly

What do you mean? Could you elaborate please?

@lorentey
Copy link
Member

lorentey commented Jun 9, 2022

_WordView must not use Slice as its SubSequence -- it needs a custom slice type. As in the character case, we must limit word breaking to remain within the slice. (And because we'll want/need the freedom to tweak implementations as we like.)
Why?

Slice does not support cases where its startIndex and/or endIndex aren't reachable indices in the base collection, and it makes undocumented & unfounded assumptions about RangeReplaceable mutations preserving indices preceding the mutated range. The latter problem hopefully isn't relevant for WordView, but the former seems like a thing.

In any case, having a custom SubSequence seems like a good idea in general, for the flexibility it gives us about customizing each individual operation.


guard $0 > 0 else {
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully expect we'll need to have separate entry point for slices, where we compare against the slice's start index.

extension String {
@_spi(_Unicode)
@available(SwiftStdlib 5.7, *)
public func _isOnWordBoundary(_ i: String.Index) -> Bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so is \b going to call this on every index then?

I'm asking because rounding down operates by doing an index(before:) + index(after:) dance, which (when done repeatedly) is going to be significantly slower than memoizing where the nearest word boundaries are.

// Should this be:
// var words: WordView
// or perhaps
// func words(_ level: ...) -> some BidirectionalCollection<Substring>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an opaque result type for this would make it impossible to integrate this new collection type into the existing String design in any meaningful sense.

E.g., it would mean that we'd be giving up on ever being able to add methods to convert between String.Index and the custom Index type of this collection (assuming that we actually want a custom index type for this thing, which I continue to be skeptical about). To me it seems like a reasonable expectation that the stdlib would provide a way to quickly find the index of a word that contains a particular character or scalar index.

The stdlib ought to be a cohesive library with well-integrated parts, not just a disjoint collection of independent components.

@Azoy Azoy marked this pull request as ready for review June 15, 2022 17:00
@Azoy
Copy link
Contributor Author

Azoy commented Jun 15, 2022

@swift-ci please test

@Azoy
Copy link
Contributor Author

Azoy commented Jun 17, 2022

@swift-ci please test

Copy link
Member

@lorentey lorentey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AIUI, this PR only introduces the String._isOnWordBoundary and String._words() entry points, neither of which need us to define a _WordView in the stdlib code base.

So let's remove _WordView.

As noted above, I have serious doubts that _isOnWordBoundary is the right interface for finding word boundaries in a string -- it does a significant amount work that we would be much better off amortizing over successive invocations.

Instead of _words() and _isOnWordBoundary, why not e.g. expose entry points to go from an arbitrary string index to the previous/next word boundary within the string?

@Azoy
Copy link
Contributor Author

Azoy commented Jun 21, 2022

@swift-ci please test

Copy link
Member

@lorentey lorentey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I noted an index validation error -- let's not make that particular mistake again.


if offset == 0 || offset == count {
return i
}
Copy link
Member

@lorentey lorentey Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The roundDownToNearestWord/_slowRoundDownToNearestWord split is left over from the version where we had performance flags for word boundaries in String.Index. Now that we don't have them, this scheme doesn't seem particularly useful, as we'll almost always fall into the "slow" case.

(Feel free to leave it in place if for some reason you think this is worth keeping, though.)

@Azoy
Copy link
Contributor Author

Azoy commented Jun 21, 2022

@swift-ci please test

@Azoy Azoy force-pushed the string-word-view branch from 676ff8f to 32d8a63 Compare June 22, 2022 04:14
@Azoy
Copy link
Contributor Author

Azoy commented Jun 22, 2022

@swift-ci please test

@Azoy Azoy merged commit 95da55b into swiftlang:main Jun 22, 2022
@Azoy Azoy deleted the string-word-view branch June 22, 2022 16:10
Azoy added a commit to Azoy/swift that referenced this pull request Jun 29, 2022
* Implement String.WordView

* Add isWordAligned bit

* Hide WordView for now (also separate Index type)

add bidirectional conformance

Fix tests

* Address comments from Karoy and Michael

* Remove word view, use index methods

* Address Karoy's comments

aaa
Azoy added a commit that referenced this pull request Jun 30, 2022
* Implement String.WordView

* Add isWordAligned bit

* Hide WordView for now (also separate Index type)

add bidirectional conformance

Fix tests

* Address comments from Karoy and Michael

* Remove word view, use index methods

* Address Karoy's comments

aaa
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants