-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[stdlib] Remove _HasContiguousBytes #22028
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@swift-ci please benchmark |
@swift-ci please test |
I'm in favor of this change. @airspeedswift this removes some things from the ABI which are no longer needed, are we taking removals? |
Build comment file:Performance: -O
Code size: -O
Performance: -Osize
Code size: -Osize
Performance: -Onone
How to read the dataThe tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.If you see any unexpected regressions, you should consider fixing the Noise: Sometimes the performance results (not code size!) contain false Hardware Overview
|
One important thing I just realised is that On the one hand, the UTF8 validation should probably work with a raw buffer internally to just avoid the binding entirely. On the other hand, you do see custom types with contiguous storage, and the point of the evolution proposal was largely that there should be a consistent interface for them to benefit from these kind of fast-paths. Would be awesome if we could have both. |
What about if we add |
To address @karwa 's question, but also for @milseman @airspeedswift... In general, if you decide to expose a customization point at the level of Pointers, and you need to handle nontrivial types, then you should to define two variants, one for strictly typed, potentially nontrivial elements ( String's customization points (or internal APIs) can just use raw pointers because it doesn't need to handle nontrivial element types (assuming these APIs are below the level of checking the type of CodeUnit). Part of the assumption behind this design is that extensible APIs will not expose Pointers at all. Instead they should expose something like: We can easily have a Sorry, to coopt this PR to make broad design points. I just want to make sure this is clear so others can make the short term decision on this or the other PRs. If it's really important, then we can start a dev forum thread. |
Are you saying that we shouldn't use |
@milseman If the API is meant to be used with arbitrary byte buffers, then that is what I'm saying. If you only expect the String API to be used with buffers that are owned by a |
None of the stdlib APIs or SPIs are supposed to work with arbitrary byte buffers that may have originated or already be bound to a different type. The API in this PR has a generic constraint that the Collection is typed to contain elements of the encoding's code unit type. The SPI in the other PR is aimed at creating a String from a typed contiguous collection of UInt8s. These only accept valid UTF-8 contents, so it's nonsense to use with arbitrary bytes that might be e.g. the memory addresses of I don't know how best to square Data's circle. |
We're on the wrong PR now, but I didn't want to complicate the other one. I think The problem is that you cannot do all of the following at once:
You can give up on concrete types and do something like Also note that if we did somehow allow any byte buffer to be exposed as I think you're arguing that we can only create |
Just to make sure it's clear, unless this makes it into Swift 5, it can't be done, since these symbols are referenced in inlinable functions. |
@jrose-apple I was hoping to get this in to Swift 5. I think the current String ABI promotes undefined behaviour by using typed pointers for internal functions (like I'm working on a version which goes the other way (albeit with much more ABI churn) - making String's internal functions use URBPs, which I think matches what @atrick said. I know time is critical so I should have it tomorrow. |
I'm saying the stdlib has no need/desire to allow the user to circumvent the type system here, other than perhaps what's needed for C interop such as
I don't think I'm arguing that. I don't understand Data's design regarding ownership of the memory and vending typed pointers into it. If someone owns a bunch of bytes in memory, say fresh off the network, and hasn't vended any typed pointers into that (or at least none are currently live), then I see no reason why they shouldn't be able to produce a typed UInt8 UBP to give to String for initialization. What am I missing? |
That's fine; the problem is that if they do that, they won't get the String fast-path because URBP doesn't provide a typed pointer. NIO's ByteBuffer does this, for example, although I think they could also use a typed pointer. It's just unfortunate if one way of writing it has very different performance characteristics to the other. The way I see it is that UTF8 is a binary encoding, and by itself does not necessarily require memory of any particular type - decoding a contiguous raw buffer directly should totally be okay. There may also be a place to add an initialiser which decodes UTF8/16/32 from a raw buffer directly, but this one just takes a generic (and possibly non-contiguous) Collection of things which happen to already match the encoding's CodeUnit. |
They currently do with BTW, if the user has a memory region typed to be In talking with @atrick, we're not going to rewrite the utility methods to be on If and when we add a public For the purposes of this PR, it's always nice to delete code, but I'm debating whether this is worth the ABI churn this late into the release. If we're going to do this, we'll have to break it into 2 steps. The first PR removes the uses by using We'll later add an overload taking some |
Long term: Conditional conformance on a hypothetical "is trivial" type constraint should solve the problem of disallowing nontrivial types. e.g. you can initialize String from |
0217c10
to
8541626
Compare
8541626
to
a850878
Compare
@milseman OK, I've redone this to make it easier to cherry-pick each part. The first part removes the UB by creating Strings from raw bytes (keeping Could you please summon some tests and benchmarks? |
@atrick IIUC, a So it would only allow you to use raw bytes to pass the static check - i.e. if you wanted to decode a URBP or Data as UTF-16 encoded bytes. |
That's right. URBP is dangerous when used in APIs, because it's possible to create one from a buffer of non-trivial types. (It's documented as illegal to perform non-trivial typed operations on that buffer, but the type system can't enforce that). Any kind of higher-level byte buffer, like Data, should ideally have constraints on their initializers to prevent initialization from non-trivial element type. As a concrete type, ...it's also perfectly memory safe to create a |
Sorry to say that this can no longer be done as-is. This breaks ABI. Going to close this out. |
_HasContiguousBytes
is from the early UTF8 String work and is only used by a single fast-path on a single initialiser which effectively constrains the collection to containUInt8
s anyway. This means thatString
andSubstring
s conformances are redundant. Since it isn't adding anything significant, it would be nice to remove this protocol from ABI.The only thing which SE-237 doesn't include is
_providesContiguousBytesNoCopy
, which is only customised by bridged Arrays. If that's worth keeping, it's probably worth moving up toSequence
(and if time, probably worth making public as an extension to SE-237).@milseman