[stdlib] Remove _HasContiguousBytes #22028

karwa · 2019-01-21T21:50:25Z

_HasContiguousBytes is from the early UTF8 String work and is only used by a single fast-path on a single initialiser which effectively constrains the collection to contain UInt8s anyway. This means that String and Substrings conformances are redundant. Since it isn't adding anything significant, it would be nice to remove this protocol from ABI.

The only thing which SE-237 doesn't include is _providesContiguousBytesNoCopy, which is only customised by bridged Arrays. If that's worth keeping, it's probably worth moving up to Sequence (and if time, probably worth making public as an extension to SE-237).

@milseman

milseman · 2019-01-22T17:48:36Z

@swift-ci please benchmark

milseman · 2019-01-22T17:48:42Z

@swift-ci please test

milseman · 2019-01-22T17:50:09Z

I'm in favor of this change. @airspeedswift this removes some things from the ABI which are no longer needed, are we taking removals?

swift-ci · 2019-01-22T18:33:42Z

Build comment file:

Performance: -O

TEST	OLD	NEW	DELTA	RATIO
Improvement
DataToStringEmpty	2650	2000	-24.5%	1.32x (?)
DataToStringSmall	4700	4050	-13.8%	1.16x (?)
StringComparison_slowerPrenormal	1780	1620	-9.0%	1.10x (?)
CountAlgoString	1845	1705	-7.6%	1.08x (?)
DataToStringMedium	11600	10750	-7.3%	1.08x (?)

Code size: -O

TEST	OLD	NEW	DELTA	RATIO
Improvement
UTF8Decode.o	12558	12314	-1.9%	1.02x
DataBenchmarks.o	51066	50506	-1.1%	1.01x

Performance: -Osize

TEST	OLD	NEW	DELTA	RATIO
Regression
DataCreateSmall	2000	3000	+50.0%	0.67x
DataSubscriptSmall	25	31	+24.0%	0.81x
DataCountSmall	25	28	+12.0%	0.89x
DataCountMedium	31	34	+9.7%	0.91x (?)
FlattenListLoop	4054	4425	+9.2%	0.92x (?)
Array2D	6912	7520	+8.8%	0.92x
Improvement
DataToStringEmpty	2650	2050	-22.6%	1.29x
DataToStringSmall	4700	4050	-13.8%	1.16x (?)
CharacterLiteralsLarge	111	100	-9.9%	1.11x
DataToStringMedium	11600	10600	-8.6%	1.09x (?)
EqualSubstringSubstringGenericEquatable	43	40	-7.0%	1.07x (?)
DictionaryBridgeToObjC_Access	991	923	-6.9%	1.07x (?)
CharacterLiteralsSmall	345	322	-6.7%	1.07x

Code size: -Osize

TEST	OLD	NEW	DELTA	RATIO
Improvement
DataBenchmarks.o	38466	37842	-1.6%	1.02x
UTF8Decode.o	10878	10737	-1.3%	1.01x

Performance: -Onone

TEST	OLD	NEW	DELTA	RATIO
Regression
ReversedDictionary2	29871	35739	+19.6%	0.84x (?)
DictionaryOfAnyHashableStrings_insert	8346	9157	+9.7%	0.91x (?)
Improvement
DataAppendDataLargeToLarge	51200	38600	-24.6%	1.33x (?)
DataToStringEmpty	2800	2250	-19.6%	1.24x
UTF8Decode_InitDecoding_ascii	706	600	-15.0%	1.18x
UTF8Decode_InitDecoding	238	205	-13.9%	1.16x
DataToStringSmall	4850	4250	-12.4%	1.14x (?)
ArrayOfPOD	853	775	-9.1%	1.10x (?)
StrComplexWalk	7350	6680	-9.1%	1.10x

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview

  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

--------------

karwa · 2019-01-22T19:23:50Z

One important thing I just realised is that UnsafeRawBufferPointer doesn't implement withContiguousStorageIfAvailable. I guess because it doesn't agree with doing the assumingMemoryBound thing that we were doing here before, which is fair. /CC @atrick

On the one hand, the UTF8 validation should probably work with a raw buffer internally to just avoid the binding entirely. On the other hand, you do see custom types with contiguous storage, and the point of the evolution proposal was largely that there should be a consistent interface for them to benefit from these kind of fast-paths.

Would be awesome if we could have both.

karwa · 2019-01-22T19:57:17Z

What about if we add _withUnsafeBytesIfAvailable to Sequence, with a default implementation that provides a raw-buffer over the typed buffer. The idea is that while some sequences can guarantee typed contiguous access, others can only guarantee raw access (like URBP and Data). UTF8 decoding can safely be done at the raw level, but something like copying to a typed Array would want a typed pointer.

atrick · 2019-01-22T23:51:43Z

To address @karwa 's question, but also for @milseman @airspeedswift...

In general, if you decide to expose a customization point at the level of Pointers, and you need to handle nontrivial types, then you should to define two variants, one for strictly typed, potentially nontrivial elements (UnsafeBufferPointer<T>), and another for trivial elements that will be viewed as UInt8 (UnsafeRawBufferPointer). It is a very deliberate language design choice to disallow generic Pointers that may or may not be strictly typed because those two things have different semantics!

String's customization points (or internal APIs) can just use raw pointers because it doesn't need to handle nontrivial element types (assuming these APIs are below the level of checking the type of CodeUnit).

Part of the assumption behind this design is that extensible APIs will not expose Pointers at all. Instead they should expose something like: ContiguousStorage<T> where T == Element.

We can easily have a ReinterpretedBufferPointer<T> : ContiguousStorage<T> that can be initialized from UnsafeRawBufferPointer where T must be trivial (the trivial constraint can be enforced when the language catches up).

Sorry, to coopt this PR to make broad design points. I just want to make sure this is clear so others can make the short term decision on this or the other PRs. If it's really important, then we can start a dev forum thread.

milseman · 2019-01-22T23:58:10Z

Are you saying that we shouldn't use UnsafeBufferPointer<UInt8> as a concrete type in the standard library, but only UnsafeRawBufferPointer?

atrick · 2019-01-23T00:13:25Z

@milseman If the API is meant to be used with arbitrary byte buffers, then that is what I'm saying. If you only expect the String API to be used with buffers that are owned by a String, then UnsafeBufferPointer<UInt8> makes sense and is hypothetically more performant.

milseman · 2019-01-23T00:25:27Z

None of the stdlib APIs or SPIs are supposed to work with arbitrary byte buffers that may have originated or already be bound to a different type. The API in this PR has a generic constraint that the Collection is typed to contain elements of the encoding's code unit type. The SPI in the other PR is aimed at creating a String from a typed contiguous collection of UInt8s. These only accept valid UTF-8 contents, so it's nonsense to use with arbitrary bytes that might be e.g. the memory addresses of AnyObjects.

I don't know how best to square Data's circle.

atrick · 2019-01-23T00:54:18Z

We're on the wrong PR now, but I didn't want to complicate the other one.

I think Data is correctly serving its role as a type erased bag of bytes. Statically enforcing the CodeUnit's type at the point of creating String from Data makes no sense.

The problem is that you cannot do all of the following at once:

allow the user to cirumvent the type system
effectively circumvent the compiler magic by using concrete types
statically enforce the CodeUnit type

You can give up on concrete types and do something like ContiguouslyStored<UInt8> or you can give up on statically checked CodeUnits and use UnsafeRawBufferPointer.

Also note that if we did somehow allow any byte buffer to be exposed as UnsafeBufferPointer<UInt8> then there would be no way to enforce the type of CodeUnit in general.

I think you're arguing that we can only create String from a Data that was itself created from a String-like object controlled by the stdlib, and you don't want users to be able to create a String from a byte buffer that they just happen to know is UTF-8. I'm sure we need such an API eventually but maybe this isn't the right time and place.

jrose-apple · 2019-01-23T01:53:12Z

Just to make sure it's clear, unless this makes it into Swift 5, it can't be done, since these symbols are referenced in inlinable functions.

karwa · 2019-01-23T07:12:53Z

@jrose-apple I was hoping to get this in to Swift 5. I think the current String ABI promotes undefined behaviour by using typed pointers for internal functions (like String._fromUTF8Repairing and ._allASCII), even though important memory containers like URBP, Data and NIO's ByteBuffer use raw buffers. This PR removes the UB only providing the fast-path for typed pointers.

I'm working on a version which goes the other way (albeit with much more ABI churn) - making String's internal functions use URBPs, which I think matches what @atrick said. I know time is critical so I should have it tomorrow.

milseman · 2019-01-23T23:07:47Z

allow the user to cirumvent the type system

I'm saying the stdlib has no need/desire to allow the user to circumvent the type system here, other than perhaps what's needed for C interop such as UInt8/Int8 or if they got a poorly-typed C pointer. Trying to treat a collection of UInt32s as though they were UTF-8 code units in host-endianness is highly dubious and the user should have to contort themselves through URBP as an intermediary.

I think you're arguing that we can only create String from a Data that was itself created from a String-like object controlled by the stdlib, and you don't want users to be able to create a String from a byte buffer that they just happen to know is UTF-8. I'm sure we need such an API eventually but maybe this isn't the right time and place.

I don't think I'm arguing that. I don't understand Data's design regarding ownership of the memory and vending typed pointers into it.

If someone owns a bunch of bytes in memory, say fresh off the network, and hasn't vended any typed pointers into that (or at least none are currently live), then I see no reason why they shouldn't be able to produce a typed UInt8 UBP to give to String for initialization. What am I missing?

karwa · 2019-01-24T00:34:30Z

Trying to treat a collection of UInt32s as though they were UTF-8 code units in host-endianness is highly dubious and the user should have to contort themselves through URBP as an intermediary.

That's fine; the problem is that if they do that, they won't get the String fast-path because URBP doesn't provide a typed pointer. NIO's ByteBuffer does this, for example, although I think they could also use a typed pointer. It's just unfortunate if one way of writing it has very different performance characteristics to the other.

The way I see it is that UTF8 is a binary encoding, and by itself does not necessarily require memory of any particular type - decoding a contiguous raw buffer directly should totally be okay. There may also be a place to add an initialiser which decodes UTF8/16/32 from a raw buffer directly, but this one just takes a generic (and possibly non-contiguous) Collection of things which happen to already match the encoding's CodeUnit.

milseman · 2019-01-24T01:29:54Z

That's fine; the problem is that if they do that, they won't get the String fast-path because URBP doesn't provide a typed pointer.

They currently do with _HasContiguousBytes. If you want to remove _HasContiguousBytes in this PR and preserve this, you can add in explicit type checks.

BTW, if the user has a memory region typed to be AnyObject, then I actually do want a theoretical UBSan for Swift to trigger a failure if they try to treat any subregion of that as containing UTF-8. The only case where I don't want the failure is when it was allocated with a higher-stride trivial type (e.g. Word) out of convenience or alignment concerns, and a sub-region of that is UTF-8. But, honoring one and not the other might require an extension to the model.

In talking with @atrick, we're not going to rewrite the utility methods to be on URBP for Swift 5. The exiting interfaces on UBP<UInt8> are much less bug prone for stdlib-internal usage, even if they are overly restrictive.

If and when we add a public ContiguousRawStorage protocol or Sequence customization point or whatever, which vends URBP, we can rewrite the utilities to be over URBP. We would keep the old UBP<UInt8>entry points, which just call the new URBP methods. Since the URBP utilities are error-prone, we'd probably want to restrict them with as tight of access-control as we can while the UPB<UInt8> ones remain internal.

For the purposes of this PR, it's always nice to delete code, but I'm debating whether this is worth the ABI churn this late into the release.

If we're going to do this, we'll have to break it into 2 steps. The first PR removes the uses by using withContiguousStorageIfAvailable alongside some hard-coded checks specifically for UnsafeRawBufferPointer and maybe even Slice<UnsafeRawBufferPointer>, depending on what SwiftNIO has or needs. In a second PR, we can drop the symbols, and it's up to the release managers how/when/if they take the second one.

We'll later add an overload taking some ContiguousRawStorage or whatever when that happens.

atrick · 2019-01-24T16:54:57Z

Long term: Conditional conformance on a hypothetical "is trivial" type constraint should solve the problem of disallowing nontrivial types. e.g. you can initialize String from ContiguouslyStored<UInt8>; both UBP<UInt8> and a hypothetical ReinterpretedPointer<UInt8> conform. And of course you wouldn't be able to reinterpret a pointer to AnyObject to a pointer to UInt8. The problem is that ultimately some utilities will exist below the resilience boundary and those need a concrete type. To be fully general, that concrete type would need to be either URBP or ReinterpretedPointer<UInt8>. I'm not concerned about the performance of that impact precisely because the string utilities are self-contained loops--they don't execute arbitrary user code within those loops.

conformance from String/Substring.

karwa · 2019-01-31T16:51:51Z

@milseman OK, I've redone this to make it easier to cherry-pick each part. The first part removes the UB by creating Strings from raw bytes (keeping _HasContiguousBytes to access those bytes. Very small ABI impact, but I think the implementation of _allASCII also looks much nicer). Then a bit of cleanup and finally replacing HCB.

Could you please summon some tests and benchmarks?

karwa · 2019-01-31T18:10:19Z

@atrick IIUC, a Reinterpreted(Buffer)Pointer<T> would only differ from a URBP because its Collection conformance would load/store a different type, not UInt8s. I guess it wouldn't support withContiguousStorageIfAvailable, giving a bound UBP, because that's what withMemoryRebound is for.

So it would only allow you to use raw bytes to pass the static check - i.e. if you wanted to decode a URBP or Data as UTF-16 encoded bytes.

atrick · 2019-01-31T18:43:17Z

That's right. withContiguousStorageIfAvailable works with any Element type, preserving the semantics of copying/destroying that type. A fully customizable Sequence would also need something like a withContiguousBytesIfAvailable to handle byte buffers or reinterpreted pointers.

URBP is dangerous when used in APIs, because it's possible to create one from a buffer of non-trivial types. (It's documented as illegal to perform non-trivial typed operations on that buffer, but the type system can't enforce that). Any kind of higher-level byte buffer, like Data, should ideally have constraints on their initializers to prevent initialization from non-trivial element type.

As a concrete type, Reinterpreted(Buffer)Pointer<UInt8> is hypothetically safer than URBP because you could create one directly from UBP<T> while enforcing that T is a trivial type. It also helps communicate directly in the API that the client expects valid UInt8 elements, not just arbitrary bytes.

...it's also perfectly memory safe to create a Reinterpreted(Buffer)Pointer<UInt8> from URBP as long we know the URBP came from a trusted source, like Data and can't contain nontrivial types.

CodaFi · 2019-11-18T22:07:03Z

Sorry to say that this can no longer be done as-is. This breaks ABI.

Going to close this out.

karwa mentioned this pull request Jan 22, 2019

[String] Add UTF-8 fast-paths for Foundation initializers #21959

Merged

karwa force-pushed the goodbyte_sweetheart branch from 0217c10 to 8541626 Compare January 31, 2019 16:41

karwa added 4 commits January 31, 2019 17:45

Allow creation of validated UTF8 strings from raw memory.

bcb874c

[ABI] Clean up StringCreate, remove unused _HasContiguousBytes

19075e0

conformance from String/Substring.

Add Sequence._withRawContiguousStorageIfAvailable

f2c126a

[ABI] Remove _HasContiguousBytes

a850878

karwa force-pushed the goodbyte_sweetheart branch from 8541626 to a850878 Compare January 31, 2019 16:46

karwa mentioned this pull request Feb 14, 2019

[String] Speed up UTF-8 initialization from non-stdlib types #22560

Merged

CodaFi closed this Nov 18, 2019

karwa mentioned this pull request Apr 9, 2020

[string] String(decoding:as:) fast path for withContiguousStorageIfAvailable #30729

Merged

milseman mannequin mentioned this pull request Jan 22, 2019

[SR-9700] String(decoding:as:UTF8) fastpath doesn't work for Slice<UBP<UInt8> | URBP> #52140

Open

[stdlib] Remove _HasContiguousBytes #22028

[stdlib] Remove _HasContiguousBytes #22028

Uh oh!

Conversation

karwa commented Jan 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

milseman commented Jan 22, 2019

Uh oh!

milseman commented Jan 22, 2019

Uh oh!

milseman commented Jan 22, 2019

Uh oh!

swift-ci commented Jan 22, 2019

Build comment file:

Performance: -O

Code size: -O

Performance: -Osize

Code size: -Osize

Performance: -Onone

Uh oh!

karwa commented Jan 22, 2019

Uh oh!

karwa commented Jan 22, 2019

Uh oh!

atrick commented Jan 22, 2019

Uh oh!

milseman commented Jan 22, 2019

Uh oh!

atrick commented Jan 23, 2019

Uh oh!

milseman commented Jan 23, 2019

Uh oh!

atrick commented Jan 23, 2019

Uh oh!

jrose-apple commented Jan 23, 2019

Uh oh!

karwa commented Jan 23, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

milseman commented Jan 23, 2019

Uh oh!

karwa commented Jan 24, 2019

Uh oh!

milseman commented Jan 24, 2019

Uh oh!

atrick commented Jan 24, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

karwa commented Jan 31, 2019

Uh oh!

karwa commented Jan 31, 2019

Uh oh!

atrick commented Jan 31, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CodaFi commented Nov 18, 2019

Uh oh!

Uh oh!

karwa commented Jan 21, 2019 •

edited

Loading

karwa commented Jan 23, 2019 •

edited

Loading

atrick commented Jan 24, 2019 •

edited

Loading

atrick commented Jan 31, 2019 •

edited

Loading