[RISCV] Decompose single source shuffles (without exact VLEN)
This is a continuation of the work started in llvm#125735 to lower
selected VLA shuffles in linear m1 components instead of generating
O(LMUL^2) or O(LMUL*Log2(LMUL)) high LMUL shuffles.
This pattern focuses on shuffles where all the elements being used
across the entire destination register group come from a single
register in the source register group. Such cases come up fairly
frequently via e.g. spread(N) and repeat(N) idioms.
One subtlety to this patch is the handling of the index vector
for vrgatherei16.vv. Because the index and source registers can
have different EEW, the index vector for the Nth chunk of the
destination is not guaranteed to be register aligned. In fact,
it is common for e.g. an EEW=64 shuffle to have EEW=16 indices, so
that one index register holds four chunks. Given this, we have
to pay a cost for extracting these chunks into the low position
before performing each shuffle.
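
As a rough illustration of the alignment issue (a sketch only, not code
from this patch), the position of the Nth chunk's indices can be worked
out as below. The helper name index_chunk_location and the concrete VLEN
are hypothetical; the real lowering does not know VLEN exactly, and the
sketch assumes the data SEW is at least 16.

```python
INDEX_EEW = 16  # vrgatherei16.vv always uses 16-bit index elements


def index_chunk_location(n, data_sew, vlen):
    """Return (index register, element offset) where the indices for the
    n-th m1 destination chunk begin. Hypothetical helper for illustration."""
    assert data_sew >= INDEX_EEW, "sketch only covers SEW >= 16"
    elems_per_data_reg = vlen // data_sew    # indices needed per chunk
    elems_per_index_reg = vlen // INDEX_EEW  # indices held per register
    first = n * elems_per_data_reg
    return first // elems_per_index_reg, first % elems_per_index_reg


# With SEW=64 and an assumed VLEN=128, four chunks share one index
# register, so only every fourth chunk starts register aligned.
locations = [index_chunk_location(n, 64, 128) for n in range(5)]
```

A non-zero offset is the case where the chunk has to be moved into the
low position before it can feed a vrgatherei16.vv.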
I'd initially expressed this as a naive extract sub-vector for each
data parallel piece. However, at high LMUL, this quickly caused
register pressure problems since we could at worst need 4x the
temporary registers for the index. Instead, this patch uses a
repeating slidedown chained from previous iterations. This increases
critical path by at worst 3 slides (SEW=64 is the worst case),
but reduces register pressure to at worst 2x - and only if the
original index vector is reused elsewhere. I view this as arguably
a bit of a workaround (since our scheduling should have done better
with the plain extract variant), but probably a necessary one.
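
The chained slidedown can be modelled roughly as follows. This is a toy
model on Python lists, not the DAG lowering itself: the function name is
hypothetical, vacated tail elements are modelled as zeros, and it
assumes the chain restarts at each register-aligned index register,
which is what would give the bound of 3 slides for SEW=64.

```python
def chunks_by_chained_slidedown(index_reg, elems_per_chunk):
    """Peel per-chunk indices off one index register by repeatedly
    sliding the same temporary down. Returns (chunks, slide count)."""
    chunks = []
    cur = list(index_reg)  # the single live temporary
    slides = 0
    while True:
        # The low elems_per_chunk elements feed the next m1 shuffle.
        chunks.append(cur[:elems_per_chunk])
        if len(chunks) * elems_per_chunk >= len(index_reg):
            break
        # Model of a slidedown by elems_per_chunk, chained on the
        # previous iteration's result instead of the original register.
        cur = cur[elems_per_chunk:] + [0] * elems_per_chunk
        slides += 1
    return chunks, slides


# SEW=64 with an assumed VLEN=128: 8 indices per register, 2 per chunk.
chunks, slides = chunks_by_chained_slidedown(list(range(8)), 2)
```

Only one temporary is live at a time in this model, in contrast to
extracting every chunk up front, at the cost of serialising the slides.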