[MLIR][NVVM] Add Op for TMA Store with reduction #118853
Merged: joker-eph merged 1 commit into llvm:main from durga4github:durgadossr/mlir_nvvm_tma_reduce on Dec 11, 2024.
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
---

krzysz00:
So, as a general design note, coming from the AMD side: I'd argue that there should be an NVVM operation that is an exact 1:1 map to the LLVM intrinsic, and an operation in some other dialect (NVGPU, I think, might be it) that takes the enum and has the user-friendly syntax. Then the conversion to LLVM becomes an MLIR rewrite pattern instead of hiding non-trivial logic in the LLVMBuilder.
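For illustration, a minimal sketch of what such a rewrite pattern could look like. The `nvgpu.tma.store.reduce` wrapper op, the NVVM op names, the enum, and the accessors below are all hypothetical, invented for this sketch; they are not the API added by this PR.

```cpp
// Hypothetical sketch only: lowering a user-friendly wrapper op
// (carrying the reduction-kind enum) to a raw NVVM op that maps 1:1
// to a single LLVM intrinsic. Op names and accessors are invented.
#include "mlir/Transforms/DialectConversion.h"

struct TmaStoreReduceLowering
    : public mlir::OpConversionPattern<nvgpu::TmaStoreReduceOp> {
  using OpConversionPattern::OpConversionPattern;

  mlir::LogicalResult
  matchAndRewrite(nvgpu::TmaStoreReduceOp op, OpAdaptor adaptor,
                  mlir::ConversionPatternRewriter &rewriter) const override {
    // The enum dispatch happens here, as a visible MLIR->MLIR rewrite,
    // instead of inside the LLVM translation.
    switch (op.getRedKind()) {
    case nvgpu::TmaReduceKind::add:
      rewriter.replaceOpWithNewOp<mlir::NVVM::TmaReduceAddOp>(
          op, adaptor.getTmaDescriptor(), adaptor.getSrc(),
          adaptor.getCoordinates());
      return mlir::success();
    // ... one case per reduction kind ...
    default:
      return mlir::failure();
    }
  }
};
```

Under this split, the enum never reaches the NVVM level; each NVVM op stays 1:1 with one intrinsic.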
---

joker-eph:
LLVM intrinsics don't support polymorphism, quite unfortunately...
The NVVM dialect does embrace it, though, which makes it much more "MLIR friendly". The mapping becomes 1:M in the translation to LLVM IR, where "name mangling" is performed by selecting the right LLVM intrinsic. However, that is still a purely "mechanical" translation, which remains trivial and can be completely round-tripped (there is no complex logic involved here).
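As a rough, self-contained sketch of that mechanical selection (the intrinsic name scheme and the enum below are illustrative placeholders, not the exact identifiers in LLVM):

```cpp
// Hypothetical sketch of the translation-time "name mangling": one
// NVVM op plus its enum attribute selects one member of an intrinsic
// family. The name scheme is illustrative; the real translation
// returns an llvm::Intrinsic::ID rather than a string.
#include <cstdio>
#include <string>

enum class TmaReduceKind { Add, Min, Max, Inc, Dec, And, Or, Xor };

std::string selectIntrinsicName(TmaReduceKind kind, unsigned dims) {
  std::string name = "llvm.nvvm.cp.async.bulk.tensor.reduce";
  // Purely mechanical 1:M dispatch: append the reduction kind...
  switch (kind) {
  case TmaReduceKind::Add: name += ".add"; break;
  case TmaReduceKind::Min: name += ".min"; break;
  case TmaReduceKind::Max: name += ".max"; break;
  case TmaReduceKind::Inc: name += ".inc"; break;
  case TmaReduceKind::Dec: name += ".dec"; break;
  case TmaReduceKind::And: name += ".and"; break;
  case TmaReduceKind::Or:  name += ".or";  break;
  case TmaReduceKind::Xor: name += ".xor"; break;
  }
  // ...and the tensor rank, mirroring how each variant needs its own
  // intrinsic because the argument lists differ.
  name += "." + std::to_string(dims) + "d";
  return name;
}

int main() {
  // Prints: llvm.nvvm.cp.async.bulk.tensor.reduce.add.2d
  std::printf("%s\n", selectIntrinsicName(TmaReduceKind::Add, 2).c_str());
}
```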
Adding another layer in MLIR is doable, but it has a cost: dialect conversion means one extra pass in the pipeline (in the same vein, we discussed at the Dev Summit adding a direct translation from SCF/arith to LLVM IR to avoid paying for an extra dialect conversion).
---

durga4github (PR author):
Thank you, Krzysztof, for your thoughts!

As you probably observed, this Op models only one PTX instruction family. As Mehdi also pointed out, we had to model this with many intrinsics at the LLVM level, mainly due to the differing arguments of each variant. Since the MLIR infrastructure offers enhanced type support like `Variadic<>` and `Optional<>`, and the existing TMA Ops here already leverage these features, I followed the same approach to model this Op too.
---

krzysz00:
Fair, I'm not going to get in the way of existing code and convention here. And perhaps "LLVM intrinsics" isn't exactly the right word here. The approach I've taken with AMDGPU/ROCDL is that, when there's a "prickly" feature (like, on our side, DPP, where there are a lot of magic constants, or MFMA with its implicit type conversions), you get a more user-friendly wrapper under `amdgpu.*` and the raw underlying primitives that trivially translate to the export format in `rocdl.*`. I was advocating for that approach because I figure it's both a better developer experience and an improvement to debuggability.
---

joker-eph:
You're absolutely right that this is a more "pure" design; I tend to look at the tradeoff in terms of compile time (dialect conversion is non-trivial, and that means one extra pass in the pipeline). The developer experience should be identical, I believe, and the debuggability difference only comes from the fact that the 1:M expansion happens during translation rather than inside MLIR. Since this is basically a single switch statement doing the dispatch, the impact is hopefully quite minimal: from a debugging standpoint there isn't much difference between observing this 1:M mapping with mlir-opt as an MLIR->MLIR transformation and with mlir-translate as an MLIR->LLVM IR translation. It is all at the same level of abstraction, and a very local transformation (one op -> one instruction).
---

krzysz00:
I suppose another way to phrase things is that I expect `mlir-translate` to be a fairly simple operation, though something like this sort of switch statement, mapping an MLIR op + attributes to one of several underlying operations, is probably fine. That is, I'd be surprised by a "translation" that has a bunch of complex logic in it, chipset-dependence, etc.
---
The design philosophy for `NVGPU` and `NVVM` has long been clear and practical:

- `NVGPU` was created to serve as a bridge between high-level dialects like `memref` and `vector` and the NVVM dialect.
- `NVVM` was designed to generate PTX code, but this was never a 1:1 mapping. It follows a 1:N mapping, where the N variants are always the same PTX instruction with different traits. We've relied on this extensively.
- `NVGPU` can generate N NVVM ops where the N are distinct PTX instructions, e.g., the tensor core `fence + mma + commit + wait` chain.
- `NVGPU` is also a good place when something doesn't fit neatly into the GPU dialects, such as NVIDIA-specific driver calls.

See [RFC] Add NV-GPU dialect.
---
@krzysz00, I see where you're coming from regarding moving the enum to the upper dialect. It might be a nice idea for a small ISA, but our situation is a bit different. We have Ops with multiple enums, so splitting them out would mean creating numerous NVVM ops and moving the enums to the NVGPU dialect (for example, an op with two enum attributes of four values each would fan out into sixteen distinct NVVM ops). Even if we wanted to do that, the approach doesn't add much benefit; unfortunately, the ROI is not clear to me.