[MLIR] Ensure deterministic parallel verification #134963

nacgarg · 2025-04-09T02:16:00Z

`failableParallelForEach` terminates early upon failure, and it does so non-deterministically, leading to inconsistent and potentially missing diagnostics.

This PR uses `parallelForEach` to ensure all operations are verified and all diagnostics are handled, while tracking the failure state separately.
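To make the problem concrete, here is a minimal standalone sketch that models it with plain `std::thread` workers instead of MLIR's thread pool. Everything in it (`DiagnosticSink`, `verifyItem`, `countDiagnostics`) is a hypothetical stand-in, not MLIR API; it only assumes that an early-exit loop stops handing out work once any worker has reported a failure.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical diagnostic collector; a mutex keeps concurrent emits safe.
struct DiagnosticSink {
  std::mutex mutex;
  std::vector<std::string> messages;
  void emit(const std::string &msg) {
    std::lock_guard<std::mutex> lock(mutex);
    messages.push_back(msg);
  }
};

// Hypothetical verifier: every odd item fails and emits exactly one diagnostic.
bool verifyItem(int item, DiagnosticSink &sink) {
  if (item % 2 != 0) {
    sink.emit("item " + std::to_string(item) + " failed verification");
    return false;
  }
  return true;
}

// Verifies `items` on `numThreads` workers and returns how many diagnostics
// were emitted. With stopOnFailure set, workers stop claiming new items as
// soon as any failure has been observed, so the count depends on scheduling.
std::size_t countDiagnostics(const std::vector<int> &items, unsigned numThreads,
                             bool stopOnFailure) {
  DiagnosticSink sink;
  std::atomic<std::size_t> next{0};
  std::atomic<bool> sawFailure{false};
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t) {
    workers.emplace_back([&] {
      for (;;) {
        if (stopOnFailure && sawFailure.load())
          return; // Early exit: remaining items are never verified.
        std::size_t i = next.fetch_add(1);
        if (i >= items.size())
          return;
        if (!verifyItem(items[i], sink))
          sawFailure.store(true);
      }
    });
  }
  for (std::thread &w : workers)
    w.join();
  return sink.messages.size();
}
```

With four failing items in the input, the run-all mode always reports four diagnostics, while the early-exit mode can report anywhere from one to four depending on thread timing.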

Other potential fixes include:

- Making `failableParallelForEach` have deterministic early-exit behavior (or have an option for it)
  - I didn't want to change more than what was required (and potentially incur perf hits for unrelated code), but if this is a better fix I'm happy to submit a patch.
  - I think all diagnostics that can be detected from verification failures should be reported, so I don't even think this would be correct behavior anyway
- Adding an option for `failableParallelForEach` to still execute on every element on the range while still returning `LogicalResult`
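The last alternative could look roughly like the following standalone sketch. The names `Result` and `forEachNoEarlyExit` are invented for illustration and stand in for `LogicalResult` and a hypothetical non-early-exit mode; the sketch spawns one `std::thread` per element for simplicity, whereas a real implementation would presumably reuse the context's thread pool.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a success/failure result type.
enum class Result { Success, Failure };

// Calls fn on every element of range, never skipping elements after a
// failure, and reports Failure if any call failed. fn is invoked concurrently
// and must therefore be thread-safe.
template <typename Range, typename Fn>
Result forEachNoEarlyExit(const Range &range, Fn fn) {
  std::atomic<bool> anyFailed{false};
  std::vector<std::thread> workers;
  for (const auto &element : range) {
    const auto *elementPtr = &element; // Stable address inside the range.
    workers.emplace_back([&anyFailed, &fn, elementPtr] {
      if (fn(*elementPtr) == Result::Failure)
        anyFailed.store(true, std::memory_order_relaxed);
    });
  }
  // Joining guarantees every element was visited before the flag is read.
  for (std::thread &w : workers)
    w.join();
  return anyFailed.load(std::memory_order_relaxed) ? Result::Failure
                                                   : Result::Success;
}
```

The caller still gets a single pass/fail answer, but the callback has run on every element by the time that answer is returned.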

llvmbot · 2025-04-09T02:16:35Z

@llvm/pr-subscribers-mlir-core

Author: Nachi G (nacgarg)

Full diff: https://github.com/llvm/llvm-project/pull/134963.diff

1 Files Affected:

(modified) mlir/lib/IR/Verifier.cpp (+8-3)

diff --git a/mlir/lib/IR/Verifier.cpp b/mlir/lib/IR/Verifier.cpp
index 90ff8ef3b497f..20f259391c9d4 100644
--- a/mlir/lib/IR/Verifier.cpp
+++ b/mlir/lib/IR/Verifier.cpp
@@ -226,10 +226,15 @@ LogicalResult OperationVerifier::verifyOnExit(Operation &op) {
               o.hasTrait<OpTrait::IsIsolatedFromAbove>())
             opsWithIsolatedRegions.push_back(&o);
   }
-  if (failed(failableParallelForEach(
-          op.getContext(), opsWithIsolatedRegions,
-          [&](Operation *o) { return verifyOpAndDominance(*o); })))
+
+  std::atomic<bool> opFailedVerify = false;
+  parallelForEach(op.getContext(), opsWithIsolatedRegions, [&](Operation *o) {
+    if (failed(verifyOpAndDominance(*o)))
+      opFailedVerify.store(true, std::memory_order_relaxed);
+  });
+  if (opFailedVerify.load(std::memory_order_relaxed))
     return failure();
+
   OperationName opName = op.getName();
   std::optional<RegisteredOperationName> registeredInfo =
       opName.getRegisteredInfo();

llvmbot · 2025-04-09T02:16:36Z

@llvm/pr-subscribers-mlir

sabauma

LGTM. I've hit the non-deterministic verifier diagnostics several times. Thanks for fixing.

lattner

Makes sense to me, seems obvious

joker-eph

LG, Thanks for fixing.

joker-eph · 2025-04-09T17:47:22Z

mlir/lib/IR/Verifier.cpp

+    if (failed(verifyOpAndDominance(*o)))
+      opFailedVerify.store(true, std::memory_order_relaxed);
+  });
+  if (opFailedVerify.load(std::memory_order_relaxed))


Do we really need the `memory_order_relaxed` for this kind of granularity here?

nacgarg · 2025-04-09

I don't think it really matters. Technically we only need atomicity here, so by specifying `memory_order_relaxed` we don't impose any additional ordering constraints (like the default `memory_order_seq_cst` would).
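As a rough illustration of why relaxed ordering is enough for a flag like this, consider the standalone sketch below. It uses plain `std::thread` rather than MLIR's `parallelForEach`, and `anyWorkerFailed` is a hypothetical name. The point it relies on is standard C++: the completion of a thread synchronizes with the return of `join()`, so every store made by a worker happens before the load that follows the joins, regardless of the memory order used on the atomic itself. The assumption carried over to the PR is that `parallelForEach` likewise waits for all of its tasks before returning.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Returns true if at least one simulated worker outcome is a failure.
// Each entry in outcomes is true for success and false for failure.
bool anyWorkerFailed(const std::vector<bool> &outcomes) {
  std::atomic<bool> failed{false};
  std::vector<std::thread> workers;
  for (bool ok : outcomes) {
    workers.emplace_back([&failed, ok] {
      // Only atomicity is needed: several workers may set the flag at once,
      // and none of them reads any other data through it.
      if (!ok)
        failed.store(true, std::memory_order_relaxed);
    });
  }
  // join() provides the happens-before edge between the stores and the load.
  for (std::thread &w : workers)
    w.join();
  return failed.load(std::memory_order_relaxed);
}
```

The default `memory_order_seq_cst` would give the same result here; relaxed simply avoids asking for ordering guarantees the code does not use.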

llvmbot added the mlir:core (MLIR Core Infrastructure) and mlir labels Apr 9, 2025

nacgarg force-pushed the fix-mlir-verifier-diagnostics-race branch 2 times, most recently from 8e7e6b5 to fdd386a April 9, 2025 12:03

Add test

f6edf25

nacgarg force-pushed the fix-mlir-verifier-diagnostics-race branch from fdd386a to f6edf25 April 9, 2025 12:07

sabauma requested review from alexander-shaposhnikov, matthias-springer and River707 April 9, 2025 14:04

sabauma approved these changes Apr 9, 2025

sabauma requested a review from lattner April 9, 2025 14:07

lattner approved these changes Apr 9, 2025

joker-eph approved these changes Apr 9, 2025

PeimingLiu merged commit 2f7e685 into llvm:main Apr 9, 2025
11 checks passed
