Remove as much checked math as possible from buffer pointers. #37424
Conversation
@swift-ci please test
Build failed
Thanks. I agree with the premise. UnsafeBufferPointer indices are only checked in debug builds, so this change makes the implementation more consistent. And these are inlinable critical code paths where we don't want extra traps.
UnsafeRawBufferPointer may have the same issue, at least in the iterator.
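To make the pattern concrete, here is a minimal sketch, illustrative only and not the stdlib's actual source (the `ByteViewIterator` type and its members are invented for this example): once the bounds comparison has been made, the index arithmetic that follows cannot overflow, so the wrapping operator can replace the trapping one.

```swift
// Illustrative sketch: the `offset < count` check has already bounded the
// offset, so advancing it cannot overflow Int and needs no extra trap.
struct ByteViewIterator: IteratorProtocol {
    let base: UnsafeRawPointer
    let count: Int
    var offset = 0

    mutating func next() -> UInt8? {
        guard offset < count else { return nil }
        let byte = base.load(fromByteOffset: offset, as: UInt8.self)
        // `offset < count <= Int.max`, so `offset &+ 1` cannot wrap.
        offset &+= 1
        return byte
    }
}
```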
Build failed
@swift-ci please test
UnsafeRawBufferPointer definitely also needs cleaning up; I'll tackle that as well.
Build failed
Build failed
Force-pushed from 0fedd0b to b5d176b.
@swift-ci please test
Force-pushed from b5d176b to a38d9f6.
@swift-ci please test
@atrick I've added URBP to this as well, the same kinds of changes really, though I also removed a division by replacing it with a series of unchecked additions in a loop we were going to execute anyway.
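As a rough illustration of that division-removal idea (a sketch only; the function name and shape are invented here, and this is not the code in the patch): instead of computing an element count up front with a division, the loop can advance a byte offset with additions it was going to perform anyway.

```swift
// Sketch: walk a raw buffer in element-sized steps without dividing
// byteCount by the stride up front. Assumes the buffer is suitably aligned
// for UInt32 loads. `offset` never exceeds `buffer.count`, which is a valid
// Int for any real allocation, so the wrapping additions cannot actually wrap.
func sumUInt32s(in buffer: UnsafeRawBufferPointer) -> UInt64 {
    let stride = MemoryLayout<UInt32>.stride
    guard let base = buffer.baseAddress else { return 0 }  // nil base means empty
    var total: UInt64 = 0
    var offset = 0
    while offset &+ stride <= buffer.count {
        total &+= UInt64(base.load(fromByteOffset: offset, as: UInt32.self))
        offset &+= stride
    }
    return total
}
```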
Build failed
Build failed
Force-pushed from a38d9f6 to bb36d84.
@swift-ci please test
Build failed
Build failed
Force-pushed from bb36d84 to 7e98c9c.
@swift-ci please test
Fantastic! I've noticed something like a 20% improvement to performance in my own code from this kind of change. As discussed on the forums, I think this is actually a general improvement that applies to all Collections, including the safe ones.

The comments in this patch reason that it is okay to ignore overflow because this is an unsafe buffer; that is not entirely correct, or at least not the whole picture. It is okay to ignore overflow here because a Collection's indexing operations are not arithmetic operations, so arithmetic overflow is a meaningless signal, even in debug builds. If there were a boundary possibly worth raising a debug-mode error for, it would be incrementing past the collection's endIndex, not past the limits of the integer type.

The thing that gives safe types their memory safety is bounds-checking on access, and the reason UBP is unsafe is that it doesn't do that. It has nothing to do with its indexing operations trapping on overflow or not. The only thing trapping on overflow gives us is slightly earlier termination if somebody increments an index all the way to Int.max.
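A small sketch of that argument, using an invented Int-indexed wrapper rather than any real stdlib type: the safety of a safe collection comes from the bounds check at the point of access, so the index arithmetic itself can use wrapping operations without giving anything up.

```swift
// Illustrative only. Producing an out-of-range index is not itself an error;
// trying to *use* one is, and the subscript catches that in all build modes.
struct SafeByteView {
    private var storage: [UInt8]
    init(_ bytes: [UInt8]) { storage = bytes }

    var startIndex: Int { 0 }
    var endIndex: Int { storage.count }

    // Index arithmetic: no overflow trap needed here.
    func index(after i: Int) -> Int { i &+ 1 }

    // Memory safety lives here, in the bounds check on access.
    subscript(i: Int) -> UInt8 {
        precondition(i >= 0 && i < storage.count, "index out of bounds")
        return storage[i]
    }
}
```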
Build failed
Build failed
@gottesmm Any idea what's going on with the SILOptimizer pass here? I think this is just the result of the loop being restructured, but I haven't gotten around to rebuilding this toolchain locally to be sure.
Ok, here are my comparative SIL outputs. Firstly, from the existing 5.4-RELEASE toolchain:

Then, from my new toolchain:
I think I've convinced myself that this IR is strictly better. We don't do an immediate jump to the end of the function if the base pointer is nil. Might need to chat to @gottesmm to work out what we think is better here. LLVM seems to have a strong preference for this structure.
In the interest of code size, LLVM should be able to reorder traps (if the frontend says it's ok). That's probably worth an LLVM radar. Although this case might be fixable in SIL. Would need to see the SIL.
To convince myself, I replicated this code in C and compiled to LLVM IR, and then stripped out the debug noise. My code was:

```c
#include <stddef.h>

long testSubscript(char *base, char *end) {
    size_t count;
    long accumulator = 0;
    if (base == NULL) {
        count = 0;
    } else {
        count = end - base;
    }
    for (size_t i = 0; i < count; i++) {
        accumulator += base[i];
    }
    return accumulator;
}
```

This includes the slightly weird pattern of NULL leading to zero count, which is unavoidable due to the way we have constructed URBP. The resulting IR is, modulo some differences in signedness and the extra precondition branch on the end pointer, basically the same as what we have now, including the loop preamble. So I've convinced myself this is fine.
@swift-ci Apple Silicon benchmark
Performance (x86_64) and code size results for -O, -Osize, and -Onone (plus -swiftlibs code size). The tables contain differences in performance larger than 8% and differences in code size larger than 1%; performance results (not code size) can contain false positives due to noise.
Performance (arm64) and code size results for -O, -Osize, and -Onone (plus -swiftlibs code size). The tables contain differences in performance larger than 8% and differences in code size larger than 1%; performance results (not code size) can contain false positives due to noise.
Thoughts on those results @lorentey?
IIRC,
I think the performance of this object fairly critically relies on the idea that the function calls are being inlined into client code.
That's true, but if those functions are called and inlined within the stdlib core, the debug-configuration check is still not eliminated. That could e.g. explain the perf regressions in the string benchmarks.
Hmm, so we don't really have an appropriate kind of precondition for this?
I don't think so. Is
The benchmarks are largely in line with my expectations -- absolutely everything about these low-level types is load bearing, and I don't expect large performance improvements from eliminating highly predictable branches. (#34961 wasn't exactly worth the effort it took to land it, either, when we only look at the performance results. I consider this stuff to be mostly cosmetic.) I was hoping for a somewhat larger impact on code size, though.
I don't think
The next step towards landing this would probably be to look at codegen changes that could explain the regressions -- for that we'd need to pinpoint what exactly uses UBPs on the affected benchmarks. If it turns out to be a precondition thing, then the usual approach to mitigate it is to create internal-only variants of the affected entry points that use an unchecked implementation.
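For illustration, the mitigation pattern described above might look roughly like this (type and method names are hypothetical, not actual stdlib entry points): the public entry point keeps its check for clients, while internal call sites use a variant that skips the already-validated work.

```swift
// Hypothetical sketch of the "internal unchecked variant" approach.
struct CheckedCursor {
    var elements: [Int]

    // Public-facing entry point: keeps its precondition for clients.
    func index(after i: Int) -> Int {
        precondition(i >= 0 && i < elements.count, "cannot advance past endIndex")
        return _uncheckedIndex(after: i)
    }

    // Internal-only variant: the caller has already validated `i`, so the
    // wrapping addition cannot overflow and no check is paid twice.
    @inline(__always)
    private func _uncheckedIndex(after i: Int) -> Int {
        i &+ 1
    }
}
```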
The branches may be highly predictable, but that is a runtime consideration. It definitely blocks the compiler from performing other optimisations, although the performance impact will of course vary based on the kind of code you're benchmarking. This isn't theoretical; it is practical experience. This is the command I use to run the benchmarks (from the repo's Benchmarks folder):

Also, I run each command twice. I don't trust the benchmark results directly after a build. The simplest way to test the impact of indexing overflow checks would be to change this property's return value. If the only thing to consider was how predictable the branches are, and you don't expect a significant performance impact from eliminating them, there should be ~no change (within the margin of error). That is demonstrably not the case. Run the benchmarks a few times and you'll see the Google benchmark library is very good at getting consistent results for the same code. They are reliable indicators of the code's performance and it's not a fluke.

I already optimised this code to death before I switched to the custom buffer pointer, and doing so gave me about a 20% increase in performance. It's huge. The best change I ever made, and I'm not sure I could reach that level of performance without it. By all means, copy the type and test it in your own projects. It's very easy to adopt and to compare results. If you have code which uses wCSIA, just tack it on.
I would like to (respectfully) challenge this. Why does it need to trap on overflow? What is it actually protecting against? As I mentioned in my earlier comment, if there were salient indexes which the developer should not increment past, they would be the bounds of the collection, not the bounds of the integer type.

I've asked a number of times now, but haven't had a direct answer to these questions, and I'm quite disappointed about that. This is an area that I care about and have spent a lot of time digging through, and my experience shows that it can yield significant benefits for a great number of Swift developers. It would be a terrible shame if I had to advise people that the standard library has known performance issues which won't be fixed for nebulous reasons. I know that I am not owed answers to any of these questions, but I would really appreciate some. I think we're missing a trick here that could unlock better performance for other developers. Thanks.
This is a gentle reminder that the single best way to convince us stdlib engineers that a potential change would be a performance improvement is to contribute a self-contained benchmark that demonstrates the issue. 😉 As it stands, this PR does not look like much of a performance improvement, and neither did #34961. (Which is not to say I'm arguing against landing it! I just don't think it's fair to push for this as a performance fix when the available data indicates it's more of a regression.) Clearly, you have some specific algorithms that make this a measurable performance win -- have you considered porting & reducing some of them and adding them to the benchmarking suite? (If you aren't able to do that, I'd be more than happy to take on the work of understanding your project myself, but please understand that it will have to be on my own schedule.)
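For what it's worth, a standalone microbenchmark of the affected code paths might look something like the sketch below; this is not part of the repo's benchmark suite and would need porting to its conventions, and the function name and iteration counts are arbitrary.

```swift
// Rough standalone sketch: sums a buffer through UnsafeBufferPointer indexing
// so that index handling and bounds checks dominate the measured time.
import Dispatch

@inline(never)
func sumViaBufferPointer(_ values: [UInt8]) -> Int {
    values.withUnsafeBufferPointer { buffer -> Int in
        var total = 0
        var i = buffer.startIndex
        while i < buffer.endIndex {
            total &+= Int(buffer[i])
            i = buffer.index(after: i)
        }
        return total
    }
}

let values = [UInt8](repeating: 1, count: 1 << 20)
let start = DispatchTime.now().uptimeNanoseconds
var checksum = 0
for _ in 0..<100 {
    checksum &+= sumViaBufferPointer(values)
}
let elapsed = DispatchTime.now().uptimeNanoseconds - start
print("checksum: \(checksum), elapsed: \(elapsed) ns")
```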
Nice. To be extremely clear -- removing stdlib preconditions is an extraordinarily risky proposition. We are treading on extremely thin ice here. I don't think it's fair to characterize the desire to check, double check, and triple check these changes as "nebulous". If we make a mistake here, it could result in irreparable damage to the language. (Re-adding a precondition we've previously removed would be an ABI breaking change, and because this is in inlinable code, merely updating libswiftCore.dylib wouldn't be enough to patch up a potential issue.) Searching for some evidence that this is actually going to be a beneficial change (and investigating actually measurable regressions) is simply due diligence. The way you seem to be arguing against doing this work (up to and including intimidation tactics like the quote above) makes me want to look through this change again in case there is a hidden issue I'm not seeing -- even though I've gone through it many times now, and I've already indicated I'd like to merge this.
By default, we like Swift code to perform overflow checks on arithmetic operations, despite the obvious fact that this can sometimes have a performance impact. These unsafe constructs are, by definition, unable to fully check their preconditions, but we keep widening this safety hole -- we decided to systematically weaken conditions that we can check, starting from our original sin: disabling the range check on the UBP subscript in optimized code. This is a risky process, but we still have one safety net left: we are still leaving the precondition checks enabled in debug builds. I dislike the idea of removing even that, unless we can prove that the precondition cannot possibly trigger. I think we can prove this in the case of the range initializer calls (which is why I'm fine with using the unchecked Range initializer). We obviously can't prove this in the case of the indexing operations -- these can trivially overflow. Disabling debug-mode overflow checks in the indexing operations is therefore not something I can get behind.

* Edit: Actually, this requires multiple
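For reference, the unchecked Range initializer mentioned above is `Range(uncheckedBounds:)`; a sketch of the situation where using it is provably fine (the function and its parameters are invented for illustration):

```swift
// Sketch: once the bounds have been validated, re-checking lower <= upper in
// Range.init is redundant, so the unchecked initializer can be used safely.
func validatedRange(in buffer: UnsafeRawBufferPointer, from lower: Int, to upper: Int) -> Range<Int> {
    precondition(lower >= 0 && upper <= buffer.count && lower <= upper,
                 "bounds out of range")
    // The precondition above already established `lower <= upper`, which is
    // exactly what Range(uncheckedBounds:) would otherwise have to check.
    return Range(uncheckedBounds: (lower: lower, upper: upper))
}
```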
Ok, I think we need to take a step back for a moment. Based on @lorentey's comment above, it is clear that I have been labouring under a misapprehension. Specifically, my understanding was that the position of the core standard library maintainers is that URBP should not be responsible for policing misuse of its indices: that given that we did not consider it an error to subscript URBP with an invalid index, and given that URBP's index is an Int, it was not URBP's job to trap when arithmetic on those indices overflows.

This understanding is clearly not true! @lorentey my read of what you're saying is that in an ideal world you'd prefer URBP to have more checks, not fewer. This is definitely against the tide of this PR, and I'd like to better understand the situation before we do anything, including merging this.

Importantly, for me, I'd like to try to come to an understanding of what we think URBP is and is for. This understanding must necessarily be holistic: it needs to take into account our similar views about URP and the other stdlib Collection types. Who should be using URBP? When? How does this differ from our ideal world? In the absence of a clear, shared understanding of what this data type is for, we are going to go around in circles on this kind of issue. I was more than happy to go digging to try to find out why the benchmarks got slower, but if this PR represents a direction the stdlib team aren't happy with then I'd rather not waste my time.
This isn't what I'm arguing. I'm arguing that we need to preserve assertions in unoptimized debug builds. That's the contract we ended up with for these unsafe buffer pointer types. (My personal opinion is indeed that eliminating the bounds check in the subscript was a mistake we'll eventually regret -- in hindsight, I would've preferred if UBP had a separate, explicitly unchecked subscript instead, leaving the default subscript safe. But this opinion doesn't matter in the slightest -- in the stdlib we have, UBP types are completely unsafe in optimized code, and I want to make sure we implement this the best we can.)
This is not my position. Have I not been actively working on landing this? What I do have an issue with is @karwa's push to merge this quickly by violating the debug mode contract, or by taking other shortcuts. I don't feel there is any reason to land this PR as soon as possible, but there is every reason to tread carefully. We should investigate, understand, and fix issues, and make sure that this isn't going to actually make things worse. These are fundamental abstractions; absolutely any change we make to them will show up as a regression in some benchmark. This is okay! -- we can fix the egregious ones without affecting the goals of this PR. But is it really too much to ask to add a benchmark that is dedicated to exercising these code paths? UBP operations are woefully underrepresented in our benchmarking suite. (If the answer is yes, that's fine -- I'm more than willing to do it myself! In my own time, of course.)
I'm about to throw more mud into the water, so apologies for that. I personally think the Swift compiler has always been over-optimizing trap operations. Generally, reordering traps and hoisting them above memory side effects is just fine. Hoisting traps above potential program exits, I/O, and memory barriers leads to a mystifying programming model and ultimately systems that can't be easily debugged. If we ever hope to improve that without too much pushback from performance watchdogs, it will help if critical paths are not laden with traps in -O code. Add to that the fact that inlining heuristics are impossible to predict, and I think it's fair to write contrived benchmarks here.

As a rule though, there needs to be some way of checking UB assumptions. For bounds/overflow, the best way to do that is testing with -Onone. So let's not remove any checks from debug builds. If these critical paths are slowing down code within the stdlib itself, rather than being inlined at least into the application code, then I think we'll need to address that as a separate stdlib performance problem.
I honestly had no idea that was a thing swiftc was doing! 🤯
@Lukasa you can ignore the DictionaryOfAnyHashableStrings_insert benchmark regression. I'll be fixing that if someone else doesn't get to it first: rdar://80746149 (Fix 90% performance regression in benchmark DictionaryOfAnyHashableStrings_insert)
Just a quick reminder that this patch exists.
Thanks for bumping this. It's an important thing to address.
I have an even bigger issue with this mischaracterisation, @lorentey. I'm not trying to push to merge this quickly -- indeed, I've filed many, many bug reports and started forum topics (and you were pinged in many/all of them, but did not respond), trying to actually have a discussion about whether it is meaningful to trap on overflow for indexing operations which do not actually read from/write to memory. Precisely because I believe it is important to be careful, investigate, understand, and fix issues. This characterisation that I'm being lazy or hasty or "taking shortcuts" is grossly inaccurate and I'm deeply offended by the suggestion.

My understanding is that, in general, for bounds-checked collections, indexing operations which do not touch memory do not need to trap on overflow. Overflow is only possible in those situations if the developer makes a mistake using the collection, and bounds-checking will catch that when they actually try to touch memory. For the unsafe buffers, bounds-checking is disabled in release mode. That means that -- even without overflow -- there are many, many more mistakes a developer can make which will allow them to touch memory outside of the buffer's bounds. When using unsafe buffers, we rely on the developer to perform their own bounds-checking (in whichever form that takes), and if they do so correctly, that will also cover integer overflows. If they don't, they will see those problems long before they even come close to overflowing a 64-bit integer.

But -- again -- I'm very open to hearing alternative opinions. Preferably on the forum topic I linked to previously, because I think it is an important discussion for the entire Swift community.
@swift-ci test
@swift-ci benchmark
Performance (x86_64) and code size results for -O, -Osize, and -Onone (plus -swiftlibs code size). The tables contain differences in performance larger than 8% and differences in code size larger than 1%; performance results (not code size) can contain false positives due to noise.
This patch removes as much checked math as I was able to find from the unsafe buffer pointer types. In particular, I removed two kinds of checked math:

1. Checks on arithmetic whose result has already been validated by checks done in the same function, or in a function that must have been called before the current one. For example, in index(_:offsetBy:limitedBy:), when we do the final index addition we have already validated that it must be in-bounds. Therefore, it cannot overflow (see the sketch below).

2. Checks that only guard against misuse of the buffer's indices. As previously discussed with Karoy and Andrew, the unsafe raw buffer types are not obligated to crash if you misuse their indices; they are free to invoke undefined behaviour. In these cases, I added defensive overflow checks in debug mode.

The result of this change should be reductions in code size in almost all usage sites of the raw buffer types.
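Here is the sketch referenced in point 1 -- an illustrative, simplified Int-indexed version rather than the actual patch: once the limit check has passed, any well-formed call produces an in-bounds result, so the final addition cannot overflow.

```swift
// Simplified sketch of the reasoning: `limit` and `i` are both valid indices,
// so their distance cannot overflow, and after the guard below any legitimate
// advancement lands within the buffer's index range.
func index(_ i: Int, offsetBy distance: Int, limitedBy limit: Int) -> Int? {
    let d = limit - i
    // If the limit lies in the direction of travel and would be passed,
    // the advancement fails.
    if distance > 0 ? (d >= 0 && d < distance) : (d <= 0 && d > distance) {
        return nil
    }
    // For any valid use the result is an in-bounds index; misuse of an unsafe
    // buffer is undefined behaviour, so no overflow trap is owed here.
    return i &+ distance
}
```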