-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[RISCV] Improve legalization of e8 m8 VL>256 shuffles #79330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RISCV] Improve legalization of e8 m8 VL>256 shuffles #79330
Conversation
@llvm/pr-subscribers-backend-risc-v Author: Philip Reames (preames) ChangesIf we can't produce a large enough index vector in i8, we may need to legalize the shuffle (via splitting or scalarization). We were doing this before sub-dividing, but the actual vselect doesn't have this legality property. If our two arms can be generated without resorting to the vrgather path, we were failing for no reason. This is a functional change, but I didn't include a test (mostly because the existing logic didn't seem to have a test either.) Full diff: https://github.com/llvm/llvm-project/pull/79330.diff 1 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index a38352e8e87f21..46d75529a2a151 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -4965,14 +4965,6 @@ static SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG,
if (SDValue V = lowerVECTOR_SHUFFLEAsRotate(SVN, DAG, Subtarget))
return V;
- if (VT.getScalarSizeInBits() == 8 && VT.getVectorNumElements() > 256) {
- // On such a large vector we're unable to use i8 as the index type.
- // FIXME: We could promote the index to i16 and use vrgatherei16, but that
- // may involve vector splitting if we're already at LMUL=8, or our
- // user-supplied maximum fixed-length LMUL.
- return SDValue();
- }
-
// As a backup, shuffles can be lowered via a vrgather instruction, possibly
// merged with a second vrgather.
SmallVector<int> ShuffleMaskLHS, ShuffleMaskRHS;
@@ -5002,6 +4994,14 @@ static SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG,
// single source permutation. Note that all the splat variants
// are handled above.
if (V2.isUndef()) {
+ if (VT.getScalarSizeInBits() == 8 && VT.getVectorNumElements() > 256) {
+ // On such a large vector we're unable to use i8 as the index type.
+ // FIXME: We could promote the index to i16 and use vrgatherei16, but that
+ // may involve vector splitting if we're already at LMUL=8, or our
+ // user-supplied maximum fixed-length LMUL.
+ return SDValue();
+ }
+
unsigned GatherVVOpc = RISCVISD::VRGATHER_VV_VL;
MVT IndexVT = VT.changeTypeToInteger();
// Since we can't introduce illegal index types at this stage, use i16 and
|
The description here is misleading. I believe returning SDValue() here causes the shuffle to get expanded through memory. It will go through the Expand path in LegalizeDAG. |
It doesn't go through memory. I guess I was thinking of build_vector. I'm not sure what shuffle expansion does. I guess it uses extracts and_vector? |
Triggered by discussion on #79330. In the process of writing this, realized one of my recent refactorings appears to have broken the legalization for the single source case here. Fix to follow in separate patch.
If we're lowering an e8 m8 shuffle and we have an index value greater than 255, we have no available space to generate an e16 index vector. The code had originally handled this correctly, but in a recent refactoring I had moved the single source code above the check, and thus broke the single source by accident. I have a change on review to rework this (#79330), but for now, go with the most obvious fix.
If we can't produce a large enough index vector in i8, we may need to legalize the shuffle (via scalarization - which in turn gets lowered into stack usage). This change makes two related changes: * Defering legalization until we actually need to generate the vrgather instruction. With the new recursive structure, this only happens when doing the fallback for one of the arms. * Check the actual mask values for something outside of the representable range. Both are covered by recently added tests.
fdf385c
to
dde80b7
Compare
✅ With the latest revision this PR passed the C/C++ code formatter. |
I went ahead and added test coverage for both the original code and the new case. In the process, I discovered that one my earlier refactorings actually broke this logic - I think my mental state was assuming the prior version of this patch despite it not having landed - and I landed a separate change to get us back to a working baseline. Pushed a reworked change, please take a look at the description from scratch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
if (VT.getScalarSizeInBits() == 8 && VT.getVectorNumElements() > 256) { | ||
// On such a large vector we're unable to use i8 as the index type. | ||
if (VT.getScalarSizeInBits() == 8 && | ||
any_of(enumerate(Mask), [&](const auto &Idx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need enumerate?
If we can't produce a large enough index vector in i8, we may need to legalize the shuffle (via scalarization - which in turn gets lowered into stack usage). This change makes two related changes: * Deferring legalization until we actually need to generate the vrgather instruction. With the new recursive structure, this only happens when doing the fallback for one of the arms. * Check the actual mask values for something outside of the representable range. Both are covered by recently added tests.
If we can't produce a large enough index vector in i8, we may need to legalize
the shuffle (via scalarization - which in turn gets lowered into stack usage).
This change makes two related changes:
instruction. With the new recursive structure, this only happens when
doing the fallback for one of the arms.
range.
Both are covered by recently added tests.