
[mlir][gpu] Eliminate redundant gpu.barrier ops #71575


Merged
merged 1 commit into llvm:main from dedup
Nov 9, 2023

Conversation

spaceotter
Contributor

Adds a canonicalizer for gpu.barrier that gets rid of duplicates.

@llvmbot
Member

llvmbot commented Nov 7, 2023

@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: None (spaceotter)

Changes

Adds a canonicalizer for gpu.barrier that gets rid of duplicates.


Full diff: https://github.com/llvm/llvm-project/pull/71575.diff

2 Files Affected:

  • (modified) mlir/include/mlir/Dialect/GPU/IR/GPUOps.td (+1)
  • (modified) mlir/lib/Dialect/GPU/IR/GPUDialect.cpp (+28)
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index 6375d35f4311295..632cdd96c6d4c2b 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -1010,6 +1010,7 @@ def GPU_BarrierOp : GPU_Op<"barrier"> {
     in convergence.
   }];
   let assemblyFormat = "attr-dict";
+  let hasCanonicalizer = 1;
 }
 
 def GPU_GPUModuleOp : GPU_Op<"module", [
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index 5eb2cadc884e151..d9ffacfd0d54f59 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -1139,6 +1139,34 @@ void ShuffleOp::build(OpBuilder &builder, OperationState &result, Value value,
         mode);
 }
 
+//===----------------------------------------------------------------------===//
+// BarrierOp
+//===----------------------------------------------------------------------===//
+
+namespace {
+
+/// Remove gpu.barrier after gpu.barrier, the threads are already synchronized!
+struct EraseRedundantGpuBarrierOpPairs : public OpRewritePattern<BarrierOp> {
+public:
+  using OpRewritePattern::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(BarrierOp op,
+                                PatternRewriter &rewriter) const final {
+    if (isa<BarrierOp>(op->getNextNode())) {
+      rewriter.eraseOp(op->getNextNode());
+      return success();
+    }
+    return failure();
+  }
+};
+
+} // end anonymous namespace
+
+void BarrierOp::getCanonicalizationPatterns(RewritePatternSet &results,
+                                            MLIRContext *context) {
+  results.add<EraseRedundantGpuBarrierOpPairs>(context);
+}
+
 //===----------------------------------------------------------------------===//
 // GPUFuncOp
 //===----------------------------------------------------------------------===//

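To illustrate the effect of the pattern, IR like the following (a minimal sketch, not taken from the patch) would be rewritten by -canonicalize:

// Before: two back-to-back barriers; the second synchronizes nothing new.
gpu.barrier
gpu.barrier

// After: the pattern erases the trailing barrier.
gpu.barrier
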
@grypp
Member

grypp commented Nov 7, 2023

The PR is a good start, but it doesn't really optimize anything. It only removes barriers that immediately follow one another. The program runs at the same speed even if we don't remove them, because the threads are already synchronized at the first barrier.

I think this PR needs to be extended substantially. A proper barrier elimination can remove many more redundant barriers.

Example 1:
Here the first barrier can be deleted. The second might also be eliminable, but that requires alias analysis and memref region information.

Read(%0 : memref<?xf32,3>)
gpu.barrier
Write(%0 : memref<?xf32,3>)
gpu.barrier
Read(%0 : memref<?xf32,3>)

Example 2:

We also need to consider if-else statements. In the example below, we can eliminate the first two barriers, because the last barrier already synchronizes the threads.

if(){ 
…
gpu.barrier
} else {
…
gpu.barrier
}
gpu.barrier

@nirvedhmeshram
Contributor

In the first example, wouldn't alias analysis be needed to remove the first barrier, since removing it could cause a write-after-read hazard?

@antiagainst
Member

@grypp Agreed that we can do better at eliding unnecessary barriers. To me, that can happen as a next step; we'd likely need a dedicated pass with some complicated analysis. Using a canonicalization pattern for this simple cleanup is actually a good match here, I think.

@spaceotter We need a test for this.

@spaceotter force-pushed the dedup branch 2 times, most recently from b81301e to c0c473e on November 7, 2023 at 21:45
@spaceotter
Contributor Author

@antiagainst I added the test.
Executing the barriers does have a performance cost, even if the threads are already synchronized, perhaps more on some architectures than others.
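
For reference, a minimal FileCheck test for this pattern might look like the following sketch (the function name is hypothetical, and this is not necessarily the test that was added to the PR):

// RUN: mlir-opt %s -canonicalize | FileCheck %s

// CHECK-LABEL: func.func @erase_duplicate_barriers
//       CHECK: gpu.barrier
//   CHECK-NOT: gpu.barrier
func.func @erase_duplicate_barriers() {
  gpu.barrier
  gpu.barrier
  gpu.barrier
  return
}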


github-actions bot commented Nov 7, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@spaceotter force-pushed the dedup branch 2 times, most recently from a3bd73a to 0088612 on November 7, 2023 at 21:52
@grypp
Member

grypp commented Nov 8, 2023

@grypp Agreed that we can do better at eliding unnecessary barriers. To me, that can happen as a next step; we'd likely need a dedicated pass with some complicated analysis.

I want to start by saying that I'm not against this barrier elimination, but at the moment the PR doesn't seem to offer any performance improvement. Additionally, if we were to introduce a dedicated pass, this code might become unnecessary.

I was wondering if we could explore ways to improve the value of this PR. For instance, could we consider checking for arithmetic operations between barriers and removing those barriers as well, to make the pattern more useful?

gpu.barrier
arith scalar
arith scalar
gpu.barrier -> remove this
arith scalar
arith scalar
gpu.barrier
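
Rendered as concrete MLIR, this sketch might look like the following (hypothetical values %a, %b, %c assumed to be defined earlier; thread-local scalar arithmetic never requires synchronization):

gpu.barrier
%0 = arith.addf %a, %b : f32
%1 = arith.mulf %0, %c : f32
// The next barrier separates only thread-local scalar computation,
// so a smarter pattern could remove it.
gpu.barrier
%2 = arith.addf %1, %c : f32
%3 = arith.mulf %2, %2 : f32
gpu.barrier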

@ftynse
Member

ftynse commented Nov 8, 2023

In case you missed it, we have added a more advanced barrier elimination in 9ab3468. It does some relatively expensive computations, including a localized alias analysis, so it absolutely shouldn't run as part of canonicalization. Having a simple cleanup as part of canonicalization looks good to me.

@grypp
Member

grypp commented Nov 8, 2023

I was not aware that there is a pass; that's awesome!

It does some relatively expensive computations including a localized alias analysis, so it absolutely shouldn't run as part of canonicalization.

Why shouldn't we run this as part of canonicalization? There is one and only one reason we use GPUs, and it is performance. This pass sounds essential to me.

@spaceotter
Contributor Author

@ftynse Yes, I am aware of that code. I've opened a separate PR to make it usable as a pass: #71762.

@ftynse
Member

ftynse commented Nov 9, 2023

Why shouldn't we run this as part of canonicalization? There is one and only one reason we use GPUs, and it is performance.

It doesn't mean it should run as part of canonicalization. The purpose of canonicalization isn't to make the generated code fast; it is to make the code easier for the compiler to reason about by bringing it into a canonical form (even though I have doubts about the existence of such a form in MLIR). Since canonicalization is called repeatedly in the pipeline as a generalized cleanup, we don't want canonicalization patterns to be expensive to apply. This pattern can still be applied by a separate pass, just not canonicalization.

@grypp
Member

grypp commented Nov 9, 2023

I'd say that removing barriers can be considered part of canonicalization, since it helps the compiler reason about the code (we have passes that act based on the number of barriers, etc.). However, I agree that we should not be calling an expensive pass too frequently in the pipeline.

I found this discussion valuable. Thanks for taking the time to discuss it.

@qedawkins merged commit 51af040 into llvm:main on Nov 9, 2023
zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Nov 20, 2023
Adds a canonicalizer for gpu.barrier that gets rid of duplicates.

Co-authored-by: Eric Eaton <[email protected]>