Skip to content

AttributedString APIs for operations over noncontiguous ranges #1145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jmschonfeld
Copy link
Contributor

This PR implements the API changes proposed in SF-0014 by adding a new DiscontiguousAttributedSubstring type (the result of slicing an AttributedString with a RangeSet) which vends out slices of each underlying view, an upgraded Runs view which now handles discontiguous segments, and performs operations over the discontiguous subranges.

@jmschonfeld
Copy link
Contributor Author

@swift-ci please test

@@ -351,7 +471,8 @@ extension AttributedString.Runs {
internal func _slicedRunBoundary(
after i: AttributedString.Index,
attributeNames: [String],
constraints: [AttributeRunBoundaries]
constraints: [AttributeRunBoundaries],
endOfCurrent: Bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not quite sure what endOfCurrent means semantically. Would it be possible to show a simple example of how it affects the result of this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, so this function is attempting to find the boundary of a run following the given index (which can be either at the start of a run or anywhere within a run). endOfCurrent determines whether the index returned is the end of the current run or the start of the next run. For contiguous runs, the end of the current is always equal to the start of the next since runs are back-to-back when contiguous. However for discontiguous runs, the end of the current may be before the start of the next.

For example, if I have the string "AAABBB" where the A's have one set of attributes and the B's have another (i.e. 2 attribute runs). For a contiguous runs view of the full string, calling _slicedRunBoundary(after: startIndex) will always return the index of the first B regardless of what endOfCurrent is since the first B is the end of the A run and the start of the B run.

But now, let's imagine I have a discontiguous slice that only includes these subranges: "[AA]AB[BB]" (so the discontiguous slice looks like "AABB"). In this case, _slicedRunBoundary(after: startIndex) will return the index of the third A for endOfCurrent: true and the index of the second B for endOfCurrent: false because the first run ends at the third A but the second run doesn't start until the second B.

The reason we need the ability to calculate both is because we need both to support enumeration of the runs view (which enumerates runs by providing the start and end index of every run) but also to allow subscripting the runs view (which provides the range of a singular run containing an arbitrary index without looking to the next run). Previously we only had the ability to find the start of the next run and we used it for both that and finding the end of the current run, but now that those indices can differ we need a way to determine both.

@available(FoundationPreview 6.2, *)
extension DiscontiguousAttributedSubstring : Equatable {
public static func == (lhs: Self, rhs: Self) -> Bool {
AttributedString.Guts._characterwiseIsEqual(lhs.runs, to: rhs.runs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this equivalent to looping over the ranges, and compare their characters one by one, like this?

for range in  _indices.ranges {
    AttributedString.Guts.characterwiseIsEqual(lhs.guts, in: range, to: rhs.guts, in: range)
}

Asking because in the hash implementation below, you're doing it range by range. That looks like a different way to think about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's how _characterwiseIsEqual is implemented - it loops over each run's range and performs a character-wise equality check on the text content and an equality check on the attribute containers

}
}

public subscript(bounds: some RangeExpression<AttributedString.Index>) -> DiscontiguousAttributedSubstring {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a new API or a protocol requirement? It doesn't look like a protocol requirement for AttributedStringAttributeMutation. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a protocol requirement - it's just new API for this type. It's an almost-requirement of AttributedStringProtocol to which DiscontiguousAttributedSubstring partially conforms, but is slightly different because this returns a DiscontiguousAttributedSubstring instead of just an AttributedSubstring (which is required because a singular range of an already-discontiguous substring may produce another discontiguous substring)

@jmschonfeld
Copy link
Contributor Author

@swift-ci please test

@jmschonfeld
Copy link
Contributor Author

@swift-ci please test

@jmschonfeld jmschonfeld requested a review from itingliu February 11, 2025 20:54
func testSubCharacterSlicing() {
// This test validates that DiscontiguousAttributedSubstring.characters (i.e. DiscontiguousSlice<AttributedString.CharacterView>) behaves the same as DiscontiguousSlice<String> w.r.t. slicing on sub-character boundaries
// Note: Country flag emojis are excluded from the text below because AttributedString.CharacterView is not self-slicing and Slice<AttributedString.CharacterView> does not properly handle sub-character slicing within country flag emojis
let complexString = "ABCDáëĩoō№Ωאבג🎺🌈WXYZ"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This string appears to have no multi-scalar Characters -- neither (1) in its input string, nor (2) in the slices it generates. There are no "sub-character boundaries", so the test is not validating them -- what is the point of this test, then? Ideally we should verify behavior in both of these cases.

Regarding point (1): I expect that the flag emoji problem applies to all multi-scalar characters in the input string: a\u{301}, , 🤞🏾, etc. Given that there will be no easy way to avoid slicing causing crashes, this greatly limits the usefulness of attributed string slices, although hopefully it doesn't entirely eliminate it. (Note that UI-level text editing operations do not use the same grapheme breaking algorithms that Swift implements, so users are sometimes able to position cursors and select text on what Swift considers to be sub-character positions.) Solving this is probably outside the scope of this change; it may indeed be best to treat it as a temporary limitation.

However, point (2) above is independent of this. The issue with CharacterView incorrectly using Slice for its SubSequence does not (immediately) affect slicing that produces multi-scalar characters within the result, though -- so this test can carry some weight by exercising that at least. For example, take this input string:

let str = "a\n\u{301}"

str has three single-scalar Characters, a, \n, and \u{301} (the newline does not combine with the combining accent following it). Cutting the a and \u{301} off into a discontiguous slice produces a result that technically should (according to Unicode) consist of a single two-scalar Character, á. We expect that both String and AttributedString would handle this slicing without traps/crashes, although we expect them to both count the ASCII a and the combining accent as distinct Characters. Can we use this test to verify that the two types produce consistent behavior in this second case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, you're right on both counts. Indeed any slicing within a grapheme cluster causes the same crash as I experienced with the country flag emojis. For point (1) I agree that's definitely something we should look into soon as a followup, my hope is that a temporary limitation is ok as this limitation has existed without much uproar for noncontiguous slicing since AttributedString's inception. However we should definitely put a plan in place to resolve it so it doesn't continue to linger.

For point (2) that's a great point, and I've added a unit test in place of this one that asserts the behavior of String is the same as AttributedString.CharacterView in this regard to make sure that the behavior is the same (and indeed you are correct that both count the ASCII a and the combining accent as distinct Characters.

@jmschonfeld jmschonfeld requested a review from lorentey February 13, 2025 18:36
@jmschonfeld
Copy link
Contributor Author

@swift-ci please test

Copy link
Member

@lorentey lorentey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jmschonfeld
Copy link
Contributor Author

@swift-ci please test

@jmschonfeld
Copy link
Contributor Author

@swift-ci please test

@jmschonfeld jmschonfeld merged commit 3b31f75 into swiftlang:main Feb 17, 2025
2 of 3 checks passed
@jmschonfeld jmschonfeld deleted the attrstr/noncontiguous-operations branch February 17, 2025 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants