Skip to content

[MLIR] Ensure deterministic parallel verification #134963

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 9, 2025

Conversation

nacgarg
Copy link
Contributor

@nacgarg nacgarg commented Apr 9, 2025

failableParallelForEach will non-deterministically early terminate upon failure, leading to inconsistent and potentially missing diagnostics.

This PR uses parallelForEach to ensure all operations are verified and all diagnostics are handled, while tracking the failure state separately.

Other potential fixes include:

  • Making failableParallelForEach have deterministic early-exit behavior (or have an option for it)

    • I didn't want to change more than what was required (and potentially incur perf hits for unrelated code), but if this is a better fix I'm happy to submit a patch.
    • I think all diagnostics that can be detected from verification failures should be reported, so I don't even think this would be correct behavior anyway
  • Adding an option for failableParallelForEach to still execute on every element on the range while still returning LogicalResult

@llvmbot llvmbot added mlir:core MLIR Core Infrastructure mlir labels Apr 9, 2025
@llvmbot
Copy link
Member

llvmbot commented Apr 9, 2025

@llvm/pr-subscribers-mlir-core

Author: Nachi G (nacgarg)

Changes

failableParallelForEach will non-deterministically early terminate upon failure, leading to inconsistent and potentially missing diagnostics.

This PR uses parallelForEach to ensure all operations are verified and all diagnostics are handled, while tracking the failure state separately.

Other potential fixes include:

  • Making failableParallelForEach have deterministic early-exit behavior (or have an option for it)

    • I didn't want to change more than what was required (and potentially incur perf hits for unrelated code), but if this is a better fix I'm happy to submit a patch.
    • I think all diagnostics that can be detected from verification failures should be reported, so I don't even think this would be correct behavior anyway
  • Adding an option for failableParallelForEach to still execute on every element on the range while still returning LogicalResult


Full diff: https://github.com/llvm/llvm-project/pull/134963.diff

1 Files Affected:

  • (modified) mlir/lib/IR/Verifier.cpp (+8-3)
diff --git a/mlir/lib/IR/Verifier.cpp b/mlir/lib/IR/Verifier.cpp
index 90ff8ef3b497f..20f259391c9d4 100644
--- a/mlir/lib/IR/Verifier.cpp
+++ b/mlir/lib/IR/Verifier.cpp
@@ -226,10 +226,15 @@ LogicalResult OperationVerifier::verifyOnExit(Operation &op) {
               o.hasTrait<OpTrait::IsIsolatedFromAbove>())
             opsWithIsolatedRegions.push_back(&o);
   }
-  if (failed(failableParallelForEach(
-          op.getContext(), opsWithIsolatedRegions,
-          [&](Operation *o) { return verifyOpAndDominance(*o); })))
+
+  std::atomic<bool> opFailedVerify = false;
+  parallelForEach(op.getContext(), opsWithIsolatedRegions, [&](Operation *o) {
+    if (failed(verifyOpAndDominance(*o)))
+      opFailedVerify.store(true, std::memory_order_relaxed);
+  });
+  if (opFailedVerify.load(std::memory_order_relaxed))
     return failure();
+
   OperationName opName = op.getName();
   std::optional<RegisteredOperationName> registeredInfo =
       opName.getRegisteredInfo();

@llvmbot
Copy link
Member

llvmbot commented Apr 9, 2025

@llvm/pr-subscribers-mlir

Author: Nachi G (nacgarg)

Changes

failableParallelForEach will non-deterministically early terminate upon failure, leading to inconsistent and potentially missing diagnostics.

This PR uses parallelForEach to ensure all operations are verified and all diagnostics are handled, while tracking the failure state separately.

Other potential fixes include:

  • Making failableParallelForEach have deterministic early-exit behavior (or have an option for it)

    • I didn't want to change more than what was required (and potentially incur perf hits for unrelated code), but if this is a better fix I'm happy to submit a patch.
    • I think all diagnostics that can be detected from verification failures should be reported, so I don't even think this would be correct behavior anyway
  • Adding an option for failableParallelForEach to still execute on every element on the range while still returning LogicalResult


Full diff: https://github.com/llvm/llvm-project/pull/134963.diff

1 Files Affected:

  • (modified) mlir/lib/IR/Verifier.cpp (+8-3)
diff --git a/mlir/lib/IR/Verifier.cpp b/mlir/lib/IR/Verifier.cpp
index 90ff8ef3b497f..20f259391c9d4 100644
--- a/mlir/lib/IR/Verifier.cpp
+++ b/mlir/lib/IR/Verifier.cpp
@@ -226,10 +226,15 @@ LogicalResult OperationVerifier::verifyOnExit(Operation &op) {
               o.hasTrait<OpTrait::IsIsolatedFromAbove>())
             opsWithIsolatedRegions.push_back(&o);
   }
-  if (failed(failableParallelForEach(
-          op.getContext(), opsWithIsolatedRegions,
-          [&](Operation *o) { return verifyOpAndDominance(*o); })))
+
+  std::atomic<bool> opFailedVerify = false;
+  parallelForEach(op.getContext(), opsWithIsolatedRegions, [&](Operation *o) {
+    if (failed(verifyOpAndDominance(*o)))
+      opFailedVerify.store(true, std::memory_order_relaxed);
+  });
+  if (opFailedVerify.load(std::memory_order_relaxed))
     return failure();
+
   OperationName opName = op.getName();
   std::optional<RegisteredOperationName> registeredInfo =
       opName.getRegisteredInfo();

`failableParallelForEach` will non-deterministically early terminate upon failure, leading to inconsistent and potentially missing diagnostics.

This PR uses `parallelForEach` to ensure all operations are verified and all diagnostics are handled, while tracking the failure state separately.

Other potential fixes include:
- Making `failableParallelForEach` have deterministic early-exit behavior (or have an option for it)
  - I didn't want to change more than what was required (and potentially incur perf hits for unrelated code), but if this is a better fix I'm happy to submit a patch.
  - I think all diagnostics that can be detected from verification failures should be reported, so I don't even think this would be correct behavior anyway

- Adding an option for `failableParallelForEach` to still execute on every element on the range while still returning `LogicalResult`
@nacgarg nacgarg force-pushed the fix-mlir-verifier-diagnostics-race branch 2 times, most recently from 8e7e6b5 to fdd386a Compare April 9, 2025 12:03
@nacgarg nacgarg force-pushed the fix-mlir-verifier-diagnostics-race branch from fdd386a to f6edf25 Compare April 9, 2025 12:07
Copy link
Contributor

@sabauma sabauma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I've hit the non-deterministic verifier diagnostics several times. Thanks for fixing.

@sabauma sabauma requested a review from lattner April 9, 2025 14:07
Copy link
Collaborator

@lattner lattner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, seems obvious

Copy link
Collaborator

@joker-eph joker-eph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG, Thanks for fixing.

if (failed(verifyOpAndDominance(*o)))
opFailedVerify.store(true, std::memory_order_relaxed);
});
if (opFailedVerify.load(std::memory_order_relaxed))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the memory_order_relaxed for this kind of granularity here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it really matters. Technically we only need atomicity here, so by specifying memory_order_relaxed we don't impose any additional ordering constraints (like the default memory_order_seq_cst would).

@PeimingLiu PeimingLiu merged commit 2f7e685 into llvm:main Apr 9, 2025
11 checks passed
AllinLeeYL pushed a commit to AllinLeeYL/llvm-project that referenced this pull request Apr 10, 2025
`failableParallelForEach` will non-deterministically early terminate
upon failure, leading to inconsistent and potentially missing
diagnostics.

This PR uses `parallelForEach` to ensure all operations are verified and
all diagnostics are handled, while tracking the failure state
separately.

Other potential fixes include:
- Making `failableParallelForEach` have deterministic early-exit
behavior (or have an option for it)
- I didn't want to change more than what was required (and potentially
incur perf hits for unrelated code), but if this is a better fix I'm
happy to submit a patch.
- I think all diagnostics that can be detected from verification
failures should be reported, so I don't even think this would be correct
behavior anyway

- Adding an option for `failableParallelForEach` to still execute on
every element on the range while still returning `LogicalResult`
var-const pushed a commit to ldionne/llvm-project that referenced this pull request Apr 17, 2025
`failableParallelForEach` will non-deterministically early terminate
upon failure, leading to inconsistent and potentially missing
diagnostics.

This PR uses `parallelForEach` to ensure all operations are verified and
all diagnostics are handled, while tracking the failure state
separately.

Other potential fixes include:
- Making `failableParallelForEach` have deterministic early-exit
behavior (or have an option for it)
- I didn't want to change more than what was required (and potentially
incur perf hits for unrelated code), but if this is a better fix I'm
happy to submit a patch.
- I think all diagnostics that can be detected from verification
failures should be reported, so I don't even think this would be correct
behavior anyway

- Adding an option for `failableParallelForEach` to still execute on
every element on the range while still returning `LogicalResult`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
mlir:core MLIR Core Infrastructure mlir
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants