[mlir][AMDGPU] Support vector<2xf16> inputs to buffer atomic fadd #108238


Merged
1 commit merged into llvm:main on Sep 11, 2024

Conversation

krzysz00 (Contributor)

Extend the lowering of atomic.fadd to support the v2f16 variant available on some AMDGPU chips.

Extend the lowering of atomic.fadd to support the v2f16 variant
available on some AMDGPU chips.

Co-authored-by: Giuseppe Rossini <[email protected]>
llvmbot (Member) commented Sep 11, 2024

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-mlir-gpu
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-amdgpu

Author: Krzysztof Drewniak (krzysz00)

Changes

Extend the lowering of atomic.fadd to support the v2f16 variant available on some AMDGPU chips.


Full diff: https://github.com/llvm/llvm-project/pull/108238.diff

3 Files Affected:

  • (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+2-2)
  • (modified) mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp (+5-2)
  • (modified) mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir (+11)
diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
index 8a1ef94c853a58..64db4448bc2f2b 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
@@ -253,8 +253,8 @@ def AMDGPU_RawBufferAtomicCmpswapOp :
 // Raw buffer atomic floating point add
 def AMDGPU_RawBufferAtomicFaddOp :
     AMDGPU_Op<"raw_buffer_atomic_fadd", [AllElementTypesMatch<["value", "memref"]>,
-      AttrSizedOperandSegments]>,
-    Arguments<(ins F32:$value,
+    AttrSizedOperandSegments]>,
+    Arguments<(ins AnyTypeOf<[F32, VectorOfLengthAndType<[2], [F16]>]>:$value,
                    Arg<AnyMemRef, "buffer to operate on", [MemRead, MemWrite]>:$memref,
                    Variadic<I32>:$indices,
                    DefaultValuedAttr<BoolAttr, "true">:$boundsCheck,
diff --git a/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp b/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
index 96b433294d258a..fc5dd7c5602129 100644
--- a/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
+++ b/mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp
@@ -115,15 +115,18 @@ struct RawBufferOpLowering : public ConvertOpToLLVMPattern<GpuOp> {
             rewriter.getIntegerType(floatType.getWidth()));
     }
     if (auto dataVector = dyn_cast<VectorType>(wantedDataType)) {
+      uint32_t vecLen = dataVector.getNumElements();
       uint32_t elemBits = dataVector.getElementTypeBitWidth();
-      uint32_t totalBits = elemBits * dataVector.getNumElements();
+      uint32_t totalBits = elemBits * vecLen;
+      bool usePackedFp16 =
+          dyn_cast_or_null<RawBufferAtomicFaddOp>(*gpuOp) && vecLen == 2;
       if (totalBits > maxVectorOpWidth)
         return gpuOp.emitOpError(
             "Total width of loads or stores must be no more than " +
             Twine(maxVectorOpWidth) + " bits, but we call for " +
             Twine(totalBits) +
             " bits. This should've been caught in validation");
-      if (elemBits < 32) {
+      else if (!usePackedFp16 && elemBits < 32) {
         if (totalBits > 32) {
           if (totalBits % 32 != 0)
             return gpuOp.emitOpError("Load or store of more than 32-bits that "
diff --git a/mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir b/mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir
index 717667c22af800..cc51a8c40942f9 100644
--- a/mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir
+++ b/mlir/test/Conversion/AMDGPUToROCDL/amdgpu-to-rocdl.mlir
@@ -151,6 +151,17 @@ func.func @gpu_gcn_raw_buffer_atomic_fadd_f32(%value: f32, %buf: memref<64xf32>,
   func.return
 }
 
+// CHECK-LABEL: func @gpu_gcn_raw_buffer_atomic_fadd_v2f16
+func.func @gpu_gcn_raw_buffer_atomic_fadd_v2f16(%value: vector<2xf16>, %buf: memref<64xf16>, %idx: i32) {
+  // CHECK: %[[numRecords:.*]] = llvm.mlir.constant(128 : i32)
+  // GFX9:  %[[flags:.*]] = llvm.mlir.constant(159744 : i32)
+  // RDNA:  %[[flags:.*]] = llvm.mlir.constant(822243328 : i32)
+  // CHECK: %[[resource:.*]] = rocdl.make.buffer.rsrc %{{.*}}, %{{.*}}, %[[numRecords]], %[[flags]]
+  // CHECK: rocdl.raw.ptr.buffer.atomic.fadd %{{.*}}, %[[resource]], %{{.*}}, %{{.*}}, %{{.*}} : vector<2xf16>
+  amdgpu.raw_buffer_atomic_fadd {boundsCheck = true} %value -> %buf[%idx] : vector<2xf16> -> memref<64xf16>, i32
+  func.return
+}
+
 // CHECK-LABEL: func @gpu_gcn_raw_buffer_atomic_fmax_f32
 func.func @gpu_gcn_raw_buffer_atomic_fmax_f32(%value: f32, %buf: memref<64xf32>, %idx: i32) {
   // CHECK: %[[numRecords:.*]] = llvm.mlir.constant(256 : i32)

arsenm (Contributor) left a comment

Might as well handle v2bf16 too

krzysz00 (Contributor, Author)

Oh, that's a thing now? Sure!

krzysz00 (Contributor, Author)

Update: bf16 support would need me to back out the "bf16 is i16" change that still lives in MLIR from back when bfloat wasn't implemented on AMDGPU.

krzysz00 merged commit 0d48d4d into llvm:main on Sep 11, 2024
13 checks passed
kuhar (Member) left a comment

Did this land without an approval?

       if (totalBits > maxVectorOpWidth)
         return gpuOp.emitOpError(
             "Total width of loads or stores must be no more than " +
             Twine(maxVectorOpWidth) + " bits, but we call for " +
             Twine(totalBits) +
             " bits. This should've been caught in validation");
-      if (elemBits < 32) {
+      else if (!usePackedFp16 && elemBits < 32) {

krzysz00 (Contributor, Author)

... I thought I saw an approval, reverting, thanks

krzysz00 added a commit to krzysz00/llvm-project that referenced this pull request Sep 11, 2024
krzysz00 added a commit that referenced this pull request Sep 11, 2024
…fadd (#108238)" (#108256)

This reverts commit 0d48d4d.

Mistakenly landed without approval
-      uint32_t totalBits = elemBits * dataVector.getNumElements();
+      uint32_t totalBits = elemBits * vecLen;
+      bool usePackedFp16 =
+          dyn_cast_or_null<RawBufferAtomicFaddOp>(*gpuOp) && vecLen == 2;
Reviewer (Member)

Use isa_and_present
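
A minimal sketch of what that suggestion would look like, reusing the variables from the hunk quoted above (a hypothetical rewording for illustration, not the exact re-landed code; llvm::isa_and_present is LLVM's null-tolerant form of isa, which fits here because only a yes/no answer is needed rather than the cast result):

      // Hypothetical form of the check above, following the review note.
      // isa_and_present<T>(v) asks "is v non-null and a T?", so it can stand in
      // for dyn_cast_or_null<T>(v) whenever the cast result itself is unused.
      bool usePackedFp16 =
          llvm::isa_and_present<RawBufferAtomicFaddOp>(*gpuOp) && vecLen == 2;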

krzysz00 added a commit to krzysz00/llvm-project that referenced this pull request Sep 11, 2024
Extend the lowering of atomic.fadd to support the v2f16 variant
available on some AMDGPU chips.

Re-lands llvm#108238 (and addresses review comments from there)

Co-authored-by: Giuseppe Rossini <[email protected]>
krzysz00 added a commit that referenced this pull request Sep 11, 2024
…08286)

Extend the lowering of atomic.fadd to support the v2f16 variant
available on some AMDGPU chips.

Re-lands #108238 (and addresses review comments from there)

Co-authored-by: Giuseppe Rossini <[email protected]>