Skip to content

[stdlib] Remove _HasContiguousBytes #22028

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

karwa
Copy link
Contributor

@karwa karwa commented Jan 21, 2019

_HasContiguousBytes is from the early UTF8 String work and is only used by a single fast-path on a single initialiser which effectively constrains the collection to contain UInt8s anyway. This means that String and Substrings conformances are redundant. Since it isn't adding anything significant, it would be nice to remove this protocol from ABI.

The only thing which SE-237 doesn't include is _providesContiguousBytesNoCopy, which is only customised by bridged Arrays. If that's worth keeping, it's probably worth moving up to Sequence (and if time, probably worth making public as an extension to SE-237).

@milseman

@milseman
Copy link
Member

@swift-ci please benchmark

@milseman
Copy link
Member

@swift-ci please test

@milseman
Copy link
Member

I'm in favor of this change. @airspeedswift this removes some things from the ABI which are no longer needed, are we taking removals?

@swift-ci
Copy link
Contributor

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Improvement
DataToStringEmpty 2650 2000 -24.5% 1.32x (?)
DataToStringSmall 4700 4050 -13.8% 1.16x (?)
StringComparison_slowerPrenormal 1780 1620 -9.0% 1.10x (?)
CountAlgoString 1845 1705 -7.6% 1.08x (?)
DataToStringMedium 11600 10750 -7.3% 1.08x (?)

Code size: -O

TEST OLD NEW DELTA RATIO
Improvement
UTF8Decode.o 12558 12314 -1.9% 1.02x
DataBenchmarks.o 51066 50506 -1.1% 1.01x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
DataCreateSmall 2000 3000 +50.0% 0.67x
DataSubscriptSmall 25 31 +24.0% 0.81x
DataCountSmall 25 28 +12.0% 0.89x
DataCountMedium 31 34 +9.7% 0.91x (?)
FlattenListLoop 4054 4425 +9.2% 0.92x (?)
Array2D 6912 7520 +8.8% 0.92x
Improvement
DataToStringEmpty 2650 2050 -22.6% 1.29x
DataToStringSmall 4700 4050 -13.8% 1.16x (?)
CharacterLiteralsLarge 111 100 -9.9% 1.11x
DataToStringMedium 11600 10600 -8.6% 1.09x (?)
EqualSubstringSubstringGenericEquatable 43 40 -7.0% 1.07x (?)
DictionaryBridgeToObjC_Access 991 923 -6.9% 1.07x (?)
CharacterLiteralsSmall 345 322 -6.7% 1.07x

Code size: -Osize

TEST OLD NEW DELTA RATIO
Improvement
DataBenchmarks.o 38466 37842 -1.6% 1.02x
UTF8Decode.o 10878 10737 -1.3% 1.01x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Regression
ReversedDictionary2 29871 35739 +19.6% 0.84x (?)
DictionaryOfAnyHashableStrings_insert 8346 9157 +9.7% 0.91x (?)
Improvement
DataAppendDataLargeToLarge 51200 38600 -24.6% 1.33x (?)
DataToStringEmpty 2800 2250 -19.6% 1.24x
UTF8Decode_InitDecoding_ascii 706 600 -15.0% 1.18x
UTF8Decode_InitDecoding 238 205 -13.9% 1.16x
DataToStringSmall 4850 4250 -12.4% 1.14x (?)
ArrayOfPOD 853 775 -9.1% 1.10x (?)
StrComplexWalk 7350 6680 -9.1% 1.10x
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

@karwa
Copy link
Contributor Author

karwa commented Jan 22, 2019

One important thing I just realised is that UnsafeRawBufferPointer doesn't implement withContiguousStorageIfAvailable. I guess because it doesn't agree with doing the assumingMemoryBound thing that we were doing here before, which is fair. /CC @atrick

On the one hand, the UTF8 validation should probably work with a raw buffer internally to just avoid the binding entirely. On the other hand, you do see custom types with contiguous storage, and the point of the evolution proposal was largely that there should be a consistent interface for them to benefit from these kind of fast-paths.

Would be awesome if we could have both.

@karwa
Copy link
Contributor Author

karwa commented Jan 22, 2019

What about if we add _withUnsafeBytesIfAvailable to Sequence, with a default implementation that provides a raw-buffer over the typed buffer. The idea is that while some sequences can guarantee typed contiguous access, others can only guarantee raw access (like URBP and Data). UTF8 decoding can safely be done at the raw level, but something like copying to a typed Array would want a typed pointer.

@atrick
Copy link
Contributor

atrick commented Jan 22, 2019

To address @karwa 's question, but also for @milseman @airspeedswift...

In general, if you decide to expose a customization point at the level of Pointers, and you need to handle nontrivial types, then you should to define two variants, one for strictly typed, potentially nontrivial elements (UnsafeBufferPointer<T>), and another for trivial elements that will be viewed as UInt8 (UnsafeRawBufferPointer). It is a very deliberate language design choice to disallow generic Pointers that may or may not be strictly typed because those two things have different semantics!

String's customization points (or internal APIs) can just use raw pointers because it doesn't need to handle nontrivial element types (assuming these APIs are below the level of checking the type of CodeUnit).

Part of the assumption behind this design is that extensible APIs will not expose Pointers at all. Instead they should expose something like: ContiguousStorage<T> where T == Element.

We can easily have a ReinterpretedBufferPointer<T> : ContiguousStorage<T> that can be initialized from UnsafeRawBufferPointer where T must be trivial (the trivial constraint can be enforced when the language catches up).

Sorry, to coopt this PR to make broad design points. I just want to make sure this is clear so others can make the short term decision on this or the other PRs. If it's really important, then we can start a dev forum thread.

@milseman
Copy link
Member

Are you saying that we shouldn't use UnsafeBufferPointer<UInt8> as a concrete type in the standard library, but only UnsafeRawBufferPointer?

@atrick
Copy link
Contributor

atrick commented Jan 23, 2019

@milseman If the API is meant to be used with arbitrary byte buffers, then that is what I'm saying. If you only expect the String API to be used with buffers that are owned by a String, then UnsafeBufferPointer<UInt8> makes sense and is hypothetically more performant.

@milseman
Copy link
Member

None of the stdlib APIs or SPIs are supposed to work with arbitrary byte buffers that may have originated or already be bound to a different type. The API in this PR has a generic constraint that the Collection is typed to contain elements of the encoding's code unit type. The SPI in the other PR is aimed at creating a String from a typed contiguous collection of UInt8s. These only accept valid UTF-8 contents, so it's nonsense to use with arbitrary bytes that might be e.g. the memory addresses of AnyObjects.

I don't know how best to square Data's circle.

@atrick
Copy link
Contributor

atrick commented Jan 23, 2019

We're on the wrong PR now, but I didn't want to complicate the other one.

I think Data is correctly serving its role as a type erased bag of bytes. Statically enforcing the CodeUnit's type at the point of creating String from Data makes no sense.

The problem is that you cannot do all of the following at once:

  • allow the user to cirumvent the type system
  • effectively circumvent the compiler magic by using concrete types
  • statically enforce the CodeUnit type

You can give up on concrete types and do something like ContiguouslyStored<UInt8> or you can give up on statically checked CodeUnits and use UnsafeRawBufferPointer.

Also note that if we did somehow allow any byte buffer to be exposed as UnsafeBufferPointer<UInt8> then there would be no way to enforce the type of CodeUnit in general.

I think you're arguing that we can only create String from a Data that was itself created from a String-like object controlled by the stdlib, and you don't want users to be able to create a String from a byte buffer that they just happen to know is UTF-8. I'm sure we need such an API eventually but maybe this isn't the right time and place.

@jrose-apple
Copy link
Contributor

Just to make sure it's clear, unless this makes it into Swift 5, it can't be done, since these symbols are referenced in inlinable functions.

@karwa
Copy link
Contributor Author

karwa commented Jan 23, 2019

@jrose-apple I was hoping to get this in to Swift 5. I think the current String ABI promotes undefined behaviour by using typed pointers for internal functions (like String._fromUTF8Repairing and ._allASCII), even though important memory containers like URBP, Data and NIO's ByteBuffer use raw buffers. This PR removes the UB only providing the fast-path for typed pointers.

I'm working on a version which goes the other way (albeit with much more ABI churn) - making String's internal functions use URBPs, which I think matches what @atrick said. I know time is critical so I should have it tomorrow.

@milseman
Copy link
Member

allow the user to cirumvent the type system

I'm saying the stdlib has no need/desire to allow the user to circumvent the type system here, other than perhaps what's needed for C interop such as UInt8/Int8 or if they got a poorly-typed C pointer. Trying to treat a collection of UInt32s as though they were UTF-8 code units in host-endianness is highly dubious and the user should have to contort themselves through URBP as an intermediary.

I think you're arguing that we can only create String from a Data that was itself created from a String-like object controlled by the stdlib, and you don't want users to be able to create a String from a byte buffer that they just happen to know is UTF-8. I'm sure we need such an API eventually but maybe this isn't the right time and place.

I don't think I'm arguing that. I don't understand Data's design regarding ownership of the memory and vending typed pointers into it.

If someone owns a bunch of bytes in memory, say fresh off the network, and hasn't vended any typed pointers into that (or at least none are currently live), then I see no reason why they shouldn't be able to produce a typed UInt8 UBP to give to String for initialization. What am I missing?

@karwa
Copy link
Contributor Author

karwa commented Jan 24, 2019

Trying to treat a collection of UInt32s as though they were UTF-8 code units in host-endianness is highly dubious and the user should have to contort themselves through URBP as an intermediary.

That's fine; the problem is that if they do that, they won't get the String fast-path because URBP doesn't provide a typed pointer. NIO's ByteBuffer does this, for example, although I think they could also use a typed pointer. It's just unfortunate if one way of writing it has very different performance characteristics to the other.

The way I see it is that UTF8 is a binary encoding, and by itself does not necessarily require memory of any particular type - decoding a contiguous raw buffer directly should totally be okay. There may also be a place to add an initialiser which decodes UTF8/16/32 from a raw buffer directly, but this one just takes a generic (and possibly non-contiguous) Collection of things which happen to already match the encoding's CodeUnit.

@milseman
Copy link
Member

That's fine; the problem is that if they do that, they won't get the String fast-path because URBP doesn't provide a typed pointer.

They currently do with _HasContiguousBytes. If you want to remove _HasContiguousBytes in this PR and preserve this, you can add in explicit type checks.

BTW, if the user has a memory region typed to be AnyObject, then I actually do want a theoretical UBSan for Swift to trigger a failure if they try to treat any subregion of that as containing UTF-8. The only case where I don't want the failure is when it was allocated with a higher-stride trivial type (e.g. Word) out of convenience or alignment concerns, and a sub-region of that is UTF-8. But, honoring one and not the other might require an extension to the model.


In talking with @atrick, we're not going to rewrite the utility methods to be on URBP for Swift 5. The exiting interfaces on UBP<UInt8> are much less bug prone for stdlib-internal usage, even if they are overly restrictive.

If and when we add a public ContiguousRawStorage protocol or Sequence customization point or whatever, which vends URBP, we can rewrite the utilities to be over URBP. We would keep the old UBP<UInt8>entry points, which just call the new URBP methods. Since the URBP utilities are error-prone, we'd probably want to restrict them with as tight of access-control as we can while the UPB<UInt8> ones remain internal.


For the purposes of this PR, it's always nice to delete code, but I'm debating whether this is worth the ABI churn this late into the release.

If we're going to do this, we'll have to break it into 2 steps. The first PR removes the uses by using withContiguousStorageIfAvailable alongside some hard-coded checks specifically for UnsafeRawBufferPointer and maybe even Slice<UnsafeRawBufferPointer>, depending on what SwiftNIO has or needs. In a second PR, we can drop the symbols, and it's up to the release managers how/when/if they take the second one.

We'll later add an overload taking some ContiguousRawStorage or whatever when that happens.

@atrick
Copy link
Contributor

atrick commented Jan 24, 2019

Long term: Conditional conformance on a hypothetical "is trivial" type constraint should solve the problem of disallowing nontrivial types. e.g. you can initialize String from ContiguouslyStored<UInt8>; both UBP<UInt8> and a hypothetical ReinterpretedPointer<UInt8> conform. And of course you wouldn't be able to reinterpret a pointer to AnyObject to a pointer to UInt8. The problem is that ultimately some utilities will exist below the resilience boundary and those need a concrete type. To be fully general, that concrete type would need to be either URBP or ReinterpretedPointer<UInt8>. I'm not concerned about the performance of that impact precisely because the string utilities are self-contained loops--they don't execute arbitrary user code within those loops.

@karwa karwa force-pushed the goodbyte_sweetheart branch from 0217c10 to 8541626 Compare January 31, 2019 16:41
@karwa karwa force-pushed the goodbyte_sweetheart branch from 8541626 to a850878 Compare January 31, 2019 16:46
@karwa
Copy link
Contributor Author

karwa commented Jan 31, 2019

@milseman OK, I've redone this to make it easier to cherry-pick each part. The first part removes the UB by creating Strings from raw bytes (keeping _HasContiguousBytes to access those bytes. Very small ABI impact, but I think the implementation of _allASCII also looks much nicer). Then a bit of cleanup and finally replacing HCB.

Could you please summon some tests and benchmarks?

@karwa
Copy link
Contributor Author

karwa commented Jan 31, 2019

@atrick IIUC, a Reinterpreted(Buffer)Pointer<T> would only differ from a URBP because its Collection conformance would load/store a different type, not UInt8s. I guess it wouldn't support withContiguousStorageIfAvailable, giving a bound UBP, because that's what withMemoryRebound is for.

So it would only allow you to use raw bytes to pass the static check - i.e. if you wanted to decode a URBP or Data as UTF-16 encoded bytes.

@atrick
Copy link
Contributor

atrick commented Jan 31, 2019

That's right. withContiguousStorageIfAvailable works with any Element type, preserving the semantics of copying/destroying that type. A fully customizable Sequence would also need something like a withContiguousBytesIfAvailable to handle byte buffers or reinterpreted pointers.

URBP is dangerous when used in APIs, because it's possible to create one from a buffer of non-trivial types. (It's documented as illegal to perform non-trivial typed operations on that buffer, but the type system can't enforce that). Any kind of higher-level byte buffer, like Data, should ideally have constraints on their initializers to prevent initialization from non-trivial element type.

As a concrete type, Reinterpreted(Buffer)Pointer<UInt8> is hypothetically safer than URBP because you could create one directly from UBP<T> while enforcing that T is a trivial type. It also helps communicate directly in the API that the client expects valid UInt8 elements, not just arbitrary bytes.

...it's also perfectly memory safe to create a Reinterpreted(Buffer)Pointer<UInt8> from URBP as long we know the URBP came from a trusted source, like Data and can't contain nontrivial types.

@CodaFi
Copy link
Contributor

CodaFi commented Nov 18, 2019

Sorry to say that this can no longer be done as-is. This breaks ABI.

Going to close this out.

@CodaFi CodaFi closed this Nov 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants