-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[stdlib] Implement String.WordView #42414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@swift-ci please smoke test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! It would be useful if we could make this public instead of landing it as SPI.
Important:
- We need a test that applies every index from every string view to the new view APIs, as in
StringIndex/Fully exhaustive index interchange
. (Which will also need to be extended to test word view indices on all the other views.) This is crucial -- the index validation code will not work correctly unless we exercise every case in the tests. _WordView
must not useSlice
as itsSubSequence
-- it needs a custom slice type. As in the character case, we must limit word breaking to remain within the slice. (And because we'll want/need the freedom to tweak implementations as we like.)- We'll also need a
Substring.words
property. (Returning the word view's SubSequence.) - It might make sense for the word view to be
RangeReplaceable
. (Although using Substring as the Element type may make that less useful than it could be.) - Related: Should this just use
String
as its element type?
return i | ||
} | ||
|
||
return roundDownToNearestCharacter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should read
return roundDownToNearestCharacter( | |
return roundDownToNearestWord( |
@inline(never) | ||
@available(SwiftStdlib 5.7, *) | ||
internal func _slowRoundDownToNearestWord(_ i: String.Index) -> String.Index { | ||
let words = String._WordView(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a small layering violation of the sort that tends to result in subtle infinite recursion problems as the stdlib evolves. (What tends to happen is that someone adds a path to a word view method that somehow gets here, then proceeds to call back to the same entry point.)
It might be worth moving most/all of the word view's _uncheckedIndex(before/after:)
implementations down to StringGuts instead. (Alternatively, we could try moving these up to the word view.)
extension String { | ||
@_spi(_Unicode) | ||
@available(SwiftStdlib 5.7, *) | ||
public struct _WordView { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, it may be worth considering putting this through Swift Evolution and making it fully public in this release. I know we have too many proposals in flight already, but it feels bad to use up an ABI stable bit in String.Index
for a feature that isn't public. (E.g., we don't need that bit unless there is a risk someone will actually feed invalid indices to the word view. Since we are in full control over SPI use sites, we could also choose to rather require them to explicitly call a special conversion/rounding method instead.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to work with Alejandro on this? We had a few minor improvements to String as well we wanted to land but it's pushing it already.
E.g. a String.init(utf8: Collection<UInt8>)
would be a huge discoverability win and it could also convert to NFC. If you want exact-scalar-sequence preservation, that's what the decoding
argument label is for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can't get it through SE, is there any reason this core functionality can't be a few SPI functions rather than a whole type? (Not sure if that makes a difference)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, we can also just have methods for advancing an index to the next/previous word boundary
@available(SwiftStdlib 5.7, *) | ||
public subscript(position: Index) -> Element { | ||
let position = _guts.validateWordIndex(position) | ||
let indexAfter = _uncheckedIndex(after: position) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to know if caching the word size in the index (like we do in the character views) would have a performance benefit here, and if so, how much it matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually more important to implement Iterator
. But this index could store a range for the word, if that makes sense. (Haven't thought through it).
extension String { | ||
@_spi(_Unicode) | ||
@available(SwiftStdlib 5.7, *) | ||
public func _isOnWordBoundary(_ i: String.Index) -> Bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the intended use case for this? (We won't be calling it in a loop for every index in some string, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regex's \b
zero-width assertion maps directly to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so is \b
going to call this on every index then?
I'm asking because rounding down operates by doing an index(before:) + index(after:) dance, which (when done repeatedly) is going to be significantly slower than memoizing where the nearest word boundaries are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so is \b going to call this on every index then?
I mean, it's going to do what the regex algorithm requests that it does. By itself, no it will only call it once at the current position resulting in success or failure. If it's the first candidate in an alternation like /(?:\b|.)*/
then it will get called until it succeeds, which is what the algorithm is specifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that this entry point laboriously calculates the nearest boundary only to throw it away immediately after comparing it to the passed-in index.
Replacing this with an entry point that rounds an index down (or up) to the nearest word boundary would allow us to memoize the results in the regex library, and therefore avoid needless repetitions of all this work for successive indices within the same word. (We would be far better off if we could amortize the cost of _guts.roundDownToNearestWord(i)
across multiple invocations, rather than repeating it.)
Of course, if this is only a temporary implementation, and we'll be able to replace it with an O(1) variant later, then this would still make a plausible entry point.
Why?
Is the idea that it would preserve the existing separators? That's an interesting idea because trying to re-join them with a space (Haskell's |
stdlib/public/core/StringIndex.swift
Outdated
|
||
If set, the index is known to be on a Unicode word boundary. | ||
(Introduced in Swift 5.7) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How important is this bit? It feels a little off to be trying to add this to every index, especially since the word view can have its own index type.
@swift-ci please test |
@swift-ci please test |
add bidirectional conformance Fix tests
@swift-ci please test |
@swift-ci please test macOS |
1 similar comment
@swift-ci please test macOS |
// Should this be: | ||
// var words: WordView | ||
// or perhaps | ||
// func words(_ level: ...) -> some BidirectionalCollection<Substring> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
☝️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using an opaque result type for this would make it impossible to integrate this new collection type into the existing String design in any meaningful sense.
E.g., it would mean that we'd be giving up on ever being able to add methods to convert between String.Index
and the custom Index type of this collection (assuming that we actually want a custom index type for this thing, which I continue to be skeptical about). To me it seems like a reasonable expectation that the stdlib would provide a way to quickly find the index of a word that contains a particular character or scalar index.
The stdlib ought to be a cohesive library with well-integrated parts, not just a disjoint collection of independent components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using an opaque result type for this would make it impossible to integrate this new collection type into the existing String design in any meaningful sense.
Exactly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly
What do you mean? Could you elaborate please?
In any case, having a custom SubSequence seems like a good idea in general, for the flexibility it gives us about customizing each individual operation. |
|
||
guard $0 > 0 else { | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully expect we'll need to have separate entry point for slices, where we compare against the slice's start index.
extension String { | ||
@_spi(_Unicode) | ||
@available(SwiftStdlib 5.7, *) | ||
public func _isOnWordBoundary(_ i: String.Index) -> Bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so is \b
going to call this on every index then?
I'm asking because rounding down operates by doing an index(before:) + index(after:) dance, which (when done repeatedly) is going to be significantly slower than memoizing where the nearest word boundaries are.
// Should this be: | ||
// var words: WordView | ||
// or perhaps | ||
// func words(_ level: ...) -> some BidirectionalCollection<Substring> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using an opaque result type for this would make it impossible to integrate this new collection type into the existing String design in any meaningful sense.
E.g., it would mean that we'd be giving up on ever being able to add methods to convert between String.Index
and the custom Index type of this collection (assuming that we actually want a custom index type for this thing, which I continue to be skeptical about). To me it seems like a reasonable expectation that the stdlib would provide a way to quickly find the index of a word that contains a particular character or scalar index.
The stdlib ought to be a cohesive library with well-integrated parts, not just a disjoint collection of independent components.
@swift-ci please test |
@swift-ci please test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AIUI, this PR only introduces the String._isOnWordBoundary
and String._words()
entry points, neither of which need us to define a _WordView
in the stdlib code base.
So let's remove _WordView
.
As noted above, I have serious doubts that _isOnWordBoundary
is the right interface for finding word boundaries in a string -- it does a significant amount work that we would be much better off amortizing over successive invocations.
Instead of _words()
and _isOnWordBoundary
, why not e.g. expose entry points to go from an arbitrary string index to the previous/next word boundary within the string?
@swift-ci please test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I noted an index validation error -- let's not make that particular mistake again.
|
||
if offset == 0 || offset == count { | ||
return i | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The roundDownToNearestWord
/_slowRoundDownToNearestWord
split is left over from the version where we had performance flags for word boundaries in String.Index
. Now that we don't have them, this scheme doesn't seem particularly useful, as we'll almost always fall into the "slow" case.
(Feel free to leave it in place if for some reason you think this is worth keeping, though.)
@swift-ci please test |
@swift-ci please test |
* Implement String.WordView * Add isWordAligned bit * Hide WordView for now (also separate Index type) add bidirectional conformance Fix tests * Address comments from Karoy and Michael * Remove word view, use index methods * Address Karoy's comments aaa
* Implement String.WordView * Add isWordAligned bit * Hide WordView for now (also separate Index type) add bidirectional conformance Fix tests * Address comments from Karoy and Michael * Remove word view, use index methods * Address Karoy's comments aaa
No description provided.