
Optimize performance of SIMD binary operations via polymorphic builtins #26699

Closed
wants to merge 4 commits from the simd_binary_ops branch

Conversation

@nvzqz (Contributor) commented Aug 16, 2019

These changes allow for expressing the vector semantics of our SIMD{n}<T> types directly to LLVM. This is done via polymorphic builtins defined by @gottesmm. These builtins are only called for stdlib types, and only when Swift._isConcrete (which calls Builtin.isConcrete, introduced in #26466) returns true.

This allows LLVM to generate efficient SIMD code in Debug builds, which may result in up to a 120x performance improvement for some operations such as addition. This also results in some performance improvements for Release builds since we no longer rely on the optimizer to auto-vectorize the loop code.

The only public-facing API changes are underscored requirements on protocols such as SIMDStorage and newly introduced underscored types such as _SIMDNever and _SIMDGenericNever<T>.
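For context, the operations affected are the element-wise operators on the stdlib SIMD types. A minimal example of the kind of code this PR aims to speed up in Debug builds (only existing stdlib API is used here; the codegen claim paraphrases the PR description):

```swift
// Element-wise wrapping addition on a standard-library SIMD type.
// With this PR's fast path, a call like this can lower to a single
// LLVM vector add instead of a per-lane scalar loop in Debug builds.
let a = SIMD4<Int32>(1, 2, 3, 4)
let b = SIMD4<Int32>(10, 20, 30, 40)
let sum = a &+ b  // SIMD4<Int32>(11, 22, 33, 44)

// In a generic context the scalar type is not statically known, so the
// existing loop-based generic path still applies:
func addAll<V: SIMD>(_ x: V, _ y: V) -> V where V.Scalar: FixedWidthInteger {
    return x &+ y
}
```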

@gottesmm (Contributor)

Make sure you munge the stuff (i.e. get rid of IRGenPrepare/etc) before the review starts.

@nvzqz (Author) commented Aug 16, 2019

@gottesmm Can you elaborate on why the IRGenPrepare stuff should go into IRGen directly?

@gottesmm (Contributor)

Is there any reason that it /shouldn't/ be in IRGen?

```
#ifndef BUILTIN_BINARY_OPERATION
#define BUILTIN_BINARY_OPERATION(Id, Name, Attrs, Overload) \
  BUILTIN(Id, Name, Attrs)
#define BUILTIN_BINARY_OPERATION(Id, Name, Attrs) BUILTIN(Id, Name, Attrs)
```

Can't we just get rid of BUILTIN_BINARY_OPERATION and use BUILTIN instead, then?


No. Sometimes you want to #define something just for Builtin Binary Operations and not all builtins.

gottesmm and others added 4 commits August 29, 2019 15:05
…TIONs.

TLDR: This patch introduces a new kind of builtin, a "polymorphic builtin". One
calls it like any other builtin, e.g.:

```
Builtin.generic_add(x, y)
```

but it comes with a contract: at constant propagation time, the optimizer
attempts to specialize the generic_add into the concrete builtin for its
operands' static type (e.g. add_Vec4xInt32), emitting a diagnostic if it
cannot.

DISCUSSION
----------

Today there are polymorphic instructions in LLVM IR. Yet, at the Swift and SIL
level, we instead represent these operations as builtins whose names are
resolved by splatting the operand type into the name. For example, adding
two things in LLVM:

```
  %2 = add i64 %0, %1
  %2 = add <2 x i64> %0, %1
  %2 = add <4 x i64> %0, %1
  %2 = add <8 x i64> %0, %1
```

Each of these add operations is performed by the same polymorphic instruction.
In contrast, we splat out these builtins in Swift today, i.e.:

```
let x, y: Builtin.Int32
Builtin.add_Int32(x, y)
let x, y: Builtin.Vec2xInt32
Builtin.add_Vec2xInt32(x, y)
...
```

In SIL, we translate these verbatim, and IRGen then lowers them to the
appropriate polymorphic instruction. Beyond being verbose, this prevents these
builtins (which need static types) from being used in polymorphic contexts.

These operations in Swift look like:

Builtin.add_Vec2
…pecialize polymorphic builtins as it inlines.

The reason I am doing this is that today, the builtin concrete-type
specialization happens in DiagnosticConstantPropagation. This is not for any
deep reason; it is just a peephole optimizer where we already do this sort of
thing (and emit diagnostics), so since we are emitting diagnostics it makes
sense to plug in there. Sadly, this is actually /after/ predictable memory
access optimizations. This means that if (without loss of generality) we
transform a generic_add into an add_Vec4xInt32, with loads/stores before/after
the builtin's arguments/results, we get unnecessary temporaries.

In contrast, by teaching the SILCloner how to specialize polymorphic builtins,
the specialization occurs during Mandatory Inlining, before both predictable
memory access optimizations and DiagnosticConstantPropagation. This means we
get a chance to eliminate any temporary stack slots, improving -Onone codegen.
If the SIMD type is known to have an inner vector representation that LLVM
understands, a fast path calls into a polymorphic builtin operation.
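The fast/slow path split described in these commits can be sketched roughly as follows. `Swift._isConcrete` and `Builtin.generic_add` come from the PR itself, but the surrounding function is a simplified illustration rather than the actual stdlib source; since `Builtin` is only visible inside the standard library, the fast path is shown as a comment and only the generic fallback loop is live here:

```swift
// Simplified sketch (not the actual stdlib implementation) of the
// concrete-type fast path for SIMD addition.
func wrappingAdd<V: SIMD>(_ a: V, _ b: V) -> V where V.Scalar: FixedWidthInteger {
    // Fast path from the PR (stdlib-internal, shown for illustration only):
    // if _isConcrete(V.self) {
    //     // Specializes to one LLVM vector add via the polymorphic builtin.
    //     return V(fromBuiltin: Builtin.generic_add(a, b))
    // }

    // Generic slow path: a per-lane loop that the optimizer must
    // auto-vectorize at -O, and which stays scalar at -Onone.
    var result = V()
    for i in result.indices {
        result[i] = a[i] &+ b[i]
    }
    return result
}
```

At -O the loop version is typically auto-vectorized anyway; as the PR description notes, the main beneficiary of the builtin path is Debug (-Onone) codegen.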
@nvzqz force-pushed the simd_binary_ops branch 3 times, most recently from 1ff040a to eb6f889 on August 30, 2019
@shahmishal (Member)

Please update the base branch to main by Oct 5th otherwise the pull request will be closed automatically.

  • How to change the base branch: (Link)
  • More detail about the branch update: (Link)

@shahmishal closed this Oct 5, 2020