-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[String] Scalar-alignment bug fixes. #23834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@swift-ci please test |
@swift-ci please benchmark |
Build failed |
Performance: -O
Code size: -O
Performance: -Osize
Code size: -Osize
Performance: -Onone
How to read the dataThe tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview
|
@swift-ci please benchmark |
Performance: -O
Code size: -O
Performance: -Osize
Code size: -Osize
Performance: -Onone
Code size: -swiftlibsHow to read the dataThe tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview
|
Whoops, I had a stray debugging print statement that's triggering on that benchmark. |
@swift-ci please benchmark |
(still need to fix up a couple tests that should've been failing before, but weren't) |
Performance: -O
Code size: -O
Performance: -Osize
Code size: -Osize
Performance: -Onone
Code size: -swiftlibsHow to read the dataThe tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview
|
@swift-ci please test |
@swift-ci please benchmark |
Build failed |
Build failed |
Build failed before running benchmark. |
@swift-ci please test |
Build failed |
ICU will return different results if we call with an offset into a code unit buffer vs if we slice the buffer first and provide an offset of zero. Slicing more closely models the semantics of SE-0180, so use that. Test case coming in subsequent commit enforcing index scalar-alignment.
Ah, I see something got dropped from a recent rebase. |
Build failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Admittedly this is a difficult one to review well -- the intricacies of utf-8 ↔︎ utf-16 transcoding vs indices make my head hurt, and I don't think I can fully appreciate the backward/forward deployment implications.
I added some review comments/questions/nitpicks.
stdlib/public/core/StringIndex.swift
Outdated
@@ -188,6 +197,15 @@ extension String.Index { | |||
transcodedOffset: self.transcodedOffset &- 1) | |||
} | |||
|
|||
@_alwaysEmitIntoClient // Swift 5.1 | |||
@inline(__always) | |||
internal var aligned: String.Index { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: The current name doesn't really reflect what this does, and this should probably be a function.
Perhaps scalarAligned()
/isScalarAligned
would work a bit better? Admittedly that still has the issue that this function doesn't actually align anything, it just sets a bit.
ensureScalarAligned()
/ isKnownScalarAligned
?
certifiedScalarAligned()
/ isKnownScalarAligned
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. Why a function, and why is
if it's not a question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming change (if any) should apply to the predicate property (currently isAligned
) too; the is*
parts were intended as suggestions for that.
aligned
as a property feels weird to me. Its effect feels closer to Float.rounded()
than Float.magnitude
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how this is any different than the transcoded or encoded var
s. This is an internal name, where we have terminology with specific implications in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm merely telling you the specific implications that come to my mind when I read index.aligned
do not match what the property does. You're of course free to take it or leave it, it's a nitpick.
More importantly, these names should all be underscored -- String.Index is a public type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, is that a new policy? Could you update the programmer's manual then? We're pretty screwed right now, but we can at least make all non-public additions to public types be underscored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we have always put at least one underscore in the name of non-public types and members.
I guess this is why we need to document these things...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: #25810
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we have always put at least one underscore in the name of non-public types and members.
Unfortunately, there are many many places in the stdlib where we have internal types or declarations without a leading underscore, predating either your or my involvement in the project.
Going forwards, sure, we'll try to do that. For now, we can fixup non-ABI internal names, and we can move ABI names to the deprecated file. I'll underscore these names from this PR, and fixing up other names can come elsewhere. We need to do an audit; do you want to open a SR for that?
I guess this is why we need to document these things...
Sure, and let's nest the rule in the programmer's manual. I'll reply to that PR on how best to integrate this.
return position == 0 ? ( | ||
endIndex == 1 ? UTF16.CodeUnit(value.value) : UTF16.leadSurrogate(value) | ||
) : UTF16.trailSurrogate(value) | ||
_internalInvariant((0..<self.count).contains(position)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There really should be a precondition checking for out of bounds indices here; the debug check is not strong enough to catch accidental misuse, but it's strong enough to interfere with code elsewhere in the stdlib that assumes it's okay to subscript this view with any random integer. (See my comment above.)
I can currently do this:
let sadness: UnicodeScalar = "\u{1F614}"
sadness.utf16[42] // ⟹ 5682
sadness.utf16[-9000] // ⟹ 5682
I think we should trap instead, and fix stdlib code not to rely on this misbehavior. (Doing it in a way that's compatible with code compiled on 5.0 would be challenge, I expect.) :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change just made the control flow logic easier to understand. I can also not touch this declaration in this PR. We can trap in a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a functional change. With the new _internalInvariant
, this no longer allows subscripting with arbitrary indices for non-BMP runes in debug stdlibs, while the original code did allow that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's adding an assert, because it would be a bug if the stdlib does so. I don't think it's pragmatic to consider all assertions as being "functional changes" to the standard library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT, the stdlib does indeed call this with unvalidated transcoded offsets:
https://github.com/apple/swift/pull/23834/files#diff-610243ed72b2ff2c91619cdc75fea5b1R254
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that _internalInvariant
is not the correct way to check for coding errors outside of the stdlib.
If you intend to trap if String.UTF16View.subscript
is given an index with some nonsensical transcodedOffset, then we should add a precondition checking for that, either here or there.
If you intend to continue allowing such indices there, then this check needs to be removed.
I don't think it is desirable for user code to generate _internalInvariant
failures in Debug stdlibs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that _internalInvariant is not the correct way to check for coding errors outside of the stdlib.
I agree. I added this assertion to detect unexpected inputs coming from the stdlib itself. I could remove this assertion from this PR, but it would degrade the quality of our testing.
If you intend to trap if String.UTF16View.subscript is given an index with some nonsensical transcodedOffset, then we should add a precondition checking for that, either here or there.
For this PR, I do not intend to change the semantics or behavior of this public API from user code. That does sound like a good thing to add, though, and so we'd have two methods here:
public subscript
would check and trap, and subscriptImpl
(or whatever it's named) would do the _internalInvariant
. The stdlib's code would call the latter.
Would you like to open a SR for that change?
I don't think it is desirable for user code to generate _internalInvariant failures in Debug stdlibs.
I do not understand this position, could you clarify? This is true of any assertions, are you arguing the stdlib shouldn't have assertions? Why is user code running with a assertions-enabled stdlib relevant to this discussion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR, I do not intend to change the semantics or behavior of this public API from user code.
I'm trying to point out that you did! The addition of that _internalInvariant
is a strong value judgement on the validity of the expression ("😞" as UnicodeScalar).utf16[9000]
.
My position is that it's not okay to let this expression run "successfully" in release builds of the stdlib, but to make it hit an assert in debug builds. This is not a valid use of _internalInvariant
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My position is that it's not okay to let this expression run "successfully" in release builds of the stdlib, but to make it hit an assert in debug builds. This is not a valid use of _internalInvariant.
That's all uses of _internalInvariant
. If we had a compile-time proof, we wouldn't need _internalInvariant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a debug build of the stdlib, a user error without any bugs in the stdlib itself should not trigger an _internalInvariant
.
I think what @lorentey is saying is that _internalInvariant
should not be used to diagnose errors on the part of the caller of a public API--even when the intention is to use this for internal diagnostics. The diagnostics cannot distinguish between calls by the stdlib itself and calls by an end user, and errors on the part of the latter cannot by themselves be in violation of an internal invariant.
let offset = __swift_stdlib_ubrk_following( | ||
iterator, Int32(truncatingIfNeeded: i)) | ||
// ICU will gives us a different result if we feed in the whole buffer, so | ||
// slice it appropriately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😱 I remember hearing about this, but this still gives me goosebumps. Does it merely jump back to the nearest scalar boundary, or does it also look at previous scalars? Should we have a dedicated test for the difference in behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We care about the behavior of String, for which we have tests. Why would we care if a detail of an ICU API that we no longer use changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value of that test would be to call out the specific difference in behaviour between 5.0 and 5.1. This has an observable effect on how indexing works, doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have that test in this PR, and it would fail on 5.0. We don't usually add test cases trying to mimic prior bugs by subverting the stdlib's interfaces.
} | ||
return scalar.utf16[i.transcodedOffset] | ||
startingAt: _guts.scalarAlign(idx)._encodedOffset) | ||
return scalar.utf16[idx.transcodedOffset] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the original code and the new code assume that it's safe to subscript the scalar's UTF-16 view with anything we find in transcodedOffset
. This is no longer the case, since we now have _internalInvariant
implementing a range check there. (And I think it should actually be a precondition -- see my separate review comment.)
Is it okay to trap if there is junk in transcodedOffset
?
It's rather unfortunate that indices don't hold information on which view they are coming from. Is it too late to add that? If we knew the provenance of an index we would be able to only look at transcodedOffset when the index was actually generated by an UTF-16 view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would that information give us? We could also add that in the future if we want.
Regardless, we want a dedicated bit. It's more efficient and smaller code size than OR-ing up multiple bits (let along multiple branches), and this check happens everywhere.
Transcoded indices have scalar-aligned offsets, which we already detect (and set when we move on to the next scalar). We might want to clarify the semantics of the aligned bit to mean that there could still be a transcoded offset, i.e. it means the encoded offset was aligned. Then transcoded views would also always set the bit, rather than only setting it when we advance to the next scalar. It would also clean up a branch in transcoded code unit views where we check if transcoded is zero or alignment bit is set. On the flip side, anything that needs a aligned and non-transcoded index (e.g. distance
) would have to check the alignment bit and transcoded offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I added more comments and a table in the new update, such as https://github.com/apple/swift/pull/23834/files#diff-8ec9c169d4bc3c678838b4067481979eR211
@swift-ci please test |
Build failed |
Build failed |
Fixes a general category (pun intended) of scalar-alignment bugs surrounding exchanging non-scalar-aligned indices between views and for slicing. SE-0180 unifies the Index type of String and all its views and allows non-scalar-aligned indices to be used across views. In order to guarantee behavior, we often have to check and perform scalar alignment. To speed up these checks, we allocate a bit denoting known-to-be-aligned, so that the alignment check can skip the load. The below shows what views need to check for alignment before they can operate, and whether the indices they produce are aligned. ┌───────────────╥────────────────────┬──────────────────────────┐ │ View ║ Requires Alignment │ Produces Aligned Indices │ ╞═══════════════╬════════════════════╪══════════════════════════╡ │ Native UTF8 ║ no │ no │ ├───────────────╫────────────────────┼──────────────────────────┤ │ Native UTF16 ║ yes │ no │ ╞═══════════════╬════════════════════╪══════════════════════════╡ │ Foreign UTF8 ║ yes │ no │ ├───────────────╫────────────────────┼──────────────────────────┤ │ Foreign UTF16 ║ no │ no │ ╞═══════════════╬════════════════════╪══════════════════════════╡ │ UnicodeScalar ║ yes │ yes │ ├───────────────╫────────────────────┼──────────────────────────┤ │ Character ║ yes │ yes │ └───────────────╨────────────────────┴──────────────────────────┘ The "requires alignment" applies to any operation taking a String.Index that's not defined entirely in terms of other operations taking a String.Index. These include: * index(after:) * index(before:) * subscript * distance(from:to:) (since `to` is compared against directly) * UTF16View._nativeGetOffset(for:)
@swift-ci please test |
Build failed |
Build failed |
CI failed with:
Seems unrelated. |
@swift-ci please test |
Build failed |
Build failed |
[String] Scalar-alignment bug fixes.
Fixes a general category (pun intended) of scalar-alignment bugs
surrounding exchanging non-scalar-aligned indices between views and
for slicing.
SE-0180 unifies the Index type of String and all its views and allows
non-scalar-aligned indices to be used across views. In order to
guarantee behavior, we often have to check and perform scalar
alignment. To speed up these checks, we allocate a bit denoting
known-to-be-aligned, so that the alignment check can skip the
load. The below shows what views need to check for alignment before
they can operate, and whether the indices they produce are aligned.
The "requires alignment" applies to any operation taking a
String.Index that's not defined entirely in terms of other operations
taking a String.Index. These include:
to
is compared against directly)Bugs: https://bugs.swift.org/browse/SR-9802, https://bugs.swift.org/browse/SR-10124, https://bugs.swift.org/browse/SR-10451, https://bugs.swift.org/browse/SR-10452, https://bugs.swift.org/browse/SR-10453.