Optimize performance of SIMD binary operations via polymorphic builtins #26699
Conversation
Make sure you munge the stuff (i.e. get rid of IRGenPrepare/etc) before the review starts.

@gottesmm Can you elaborate why the IRGenPrepare stuff should go into IRGen directly?

Is there any reason that it /shouldn't/ be in IRGen?
```
 #ifndef BUILTIN_BINARY_OPERATION
-#define BUILTIN_BINARY_OPERATION(Id, Name, Attrs, Overload) \
-  BUILTIN(Id, Name, Attrs)
+#define BUILTIN_BINARY_OPERATION(Id, Name, Attrs) BUILTIN(Id, Name, Attrs)
```
Can't we just get rid of `BUILTIN_BINARY_OPERATION` and use `BUILTIN` instead, then?
No. Sometimes you want to `#define` something just for Builtin Binary Operations and not for all builtins.
…TIONs.

TLDR: This patch introduces a new kind of builtin, a "polymorphic builtin". One calls it like any other builtin, e.g.:

```
Builtin.generic_add(x, y)
```

but it has a contract: eventually it must become concrete. At constant propagation time, the optimizer attempts to specialize the generic_add to the concrete builtin for the argument type.

DISCUSSION
----------

Today there are polymorphic-like instructions in LLVM IR. Yet, at the Swift and SIL level we represent these operations instead as builtins whose names are resolved by splatting the type into the name. For example, adding two things in LLVM:

```
%2 = add i64 %0, %1
%2 = add <2 x i64> %0, %1
%2 = add <4 x i64> %0, %1
%2 = add <8 x i64> %0, %1
```

Each of the add operations is done by the same polymorphic instruction. In contrast, we splat out these builtins in Swift today, i.e.:

```
let x, y: Builtin.Int32
Builtin.add_Int32(x, y)

let x, y: Builtin.Vec2xInt32
Builtin.add_Vec2xInt32(x, y)
...
```

In SIL, we translate these verbatim and then IRGen lowers them to the appropriate polymorphic instruction. Beyond being verbose, this prevents these builtins (which need static types) from being used in polymorphic contexts. These operations in Swift look like: Builtin.add_Vec2
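To make the contract concrete, here is a minimal sketch contrasting the two forms. It assumes compilation with `-parse-stdlib` so the `Builtin` module is visible; the wrapper function names are hypothetical, not part of this patch.

```swift
import Swift

// Splatted form: the operand type is baked into the builtin's name,
// so this only works at one concrete type.
func addVec4(_ x: Builtin.Vec4xInt32,
             _ y: Builtin.Vec4xInt32) -> Builtin.Vec4xInt32 {
  return Builtin.add_Vec4xInt32(x, y)
}

// Polymorphic form: one name for every payload type. The contract is
// that the optimizer must later replace this with the concrete add for
// whatever T turns out to be; a call that is never specialized is
// lowered to a trap (see the next commit).
func addAny<T>(_ x: T, _ y: T) -> T {
  return Builtin.generic_add(x, y)
}
```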
…ns into traps in IRGenPrepare.
…pecialize polymorphic builtins as it inlines.

The reason for doing this is that today, the builtin concrete type specialization happens in DiagnosticConstantPropagation. That is not for any deep reason: it is just a peephole optimizer that does this sort of work (and emits diagnostics), so since we are emitting diagnostics anyway, it was a convenient place to plug in. Sadly, it actually runs /after/ the predictable memory access optimizations. This means that if, without loss of generality, we transform a generic_add into an add_Vec4xInt32 whose arguments and results are loaded/stored through stack slots before and after the call, we get unnecessary temporaries.

In contrast, by teaching the SILCloner how to specialize polymorphic builtins, the specialization occurs during Mandatory Inlining, before both the predictable memory access optimizations and DiagnosticConstantPropagation. This means we get a chance to eliminate any temporary stack slots, improving -Onone codegen.
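A minimal sketch of the ordering win this buys, again assuming `-parse-stdlib` and hypothetical names: because the wrapper is `@_transparent`, mandatory inlining clones its body into the caller, and the cloner can specialize the polymorphic builtin at that moment.

```swift
import Swift

@_transparent
func genericAdd<T>(_ x: T, _ y: T) -> T {
  return Builtin.generic_add(x, y)
}

func caller(_ x: Builtin.Vec4xInt32,
            _ y: Builtin.Vec4xInt32) -> Builtin.Vec4xInt32 {
  // After mandatory inlining, the cloner sees T == Builtin.Vec4xInt32
  // here and can rewrite generic_add into add_Vec4xInt32 before the
  // predictable memory access optimizations run, so the temporary
  // stack slots around the call can still be cleaned up at -Onone.
  return genericAdd(x, y)
}
```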
If the SIMD type is known to have an inner vector representation that LLVM understands, a fast path is used that calls into a polymorphic builtin operation.
These changes allow for expressing the vector semantics of our `SIMD{n}<T>` types directly to LLVM. This is done via polymorphic builtins defined by @gottesmm. These builtins are only called for stdlib types, guarded by `Swift._isConcrete` (which calls `Builtin.isConcrete`), introduced in #26466.

This allows LLVM to generate efficient SIMD code in Debug builds, which may result in up to a 120x performance improvement for some operations such as addition. It also yields some performance improvements for Release builds, since we no longer rely on the optimizer to auto-vectorize the loop code.

The public-facing API is only changed through underscored operations on protocols like `SIMDStorage` and newly introduced underscored types such as `_SIMDNever` and `_SIMDGenericNever<T>`.
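The guarded fast-path pattern the description refers to looks roughly like the following sketch. It assumes `-parse-stdlib`; `VectorStorage`, `raw`, and `addingElementwise` are hypothetical stand-ins for the underscored stdlib hooks, not the actual `SIMDStorage` additions.

```swift
import Swift

protocol VectorStorage {
  associatedtype Raw
  var raw: Raw { get }
  init(raw: Raw)
  // Slow fallback: per-element arithmetic.
  func addingElementwise(_ other: Self) -> Self
}

@_transparent
func add<V: VectorStorage>(_ a: V, _ b: V) -> V {
  if _isConcrete(V.self) {
    // Fast path: for a concrete stdlib type whose Raw is a builtin
    // vector (e.g. Builtin.Vec4xInt32), the polymorphic builtin
    // specializes to a single LLVM vector add. The _isConcrete guard
    // keeps unspecialized generic code off this path, since an
    // unspecialized polymorphic builtin would lower to a trap.
    return V(raw: Builtin.generic_add(a.raw, b.raw))
  }
  // Slow path: an elementwise loop the optimizer must auto-vectorize.
  return a.addingElementwise(b)
}
```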