[OpenMP] Fix num_iters in __kmpc_*_loop DeviceRTL functions #133435


Merged: 1 commit, Apr 1, 2025

@skatrak skatrak commented Mar 28, 2025

This patch removes the addition of 1 to the number of iterations when calling the following DeviceRTL functions:

  • `__kmpc_distribute_for_static_loop*`
  • `__kmpc_distribute_static_loop*`
  • `__kmpc_for_static_loop*`

Calls to these functions are currently only produced by the OMPIRBuilder from flang, which already passes the correct number of iterations. Because the runtime adds 1 to the received `num_iters` value, worksharing can produce incorrect results. This affects flang OpenMP offloading of `do`, `distribute` and `distribute parallel do` constructs.

Requiring the application to pass `tripcount - 1` as the argument would be surprising as well, so rather than updating flang I think it makes more sense to update the runtime.
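The off-by-one described above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: `run_static_loop`, `record_iteration` and `iterations_executed` are hypothetical names, and the loop runs sequentially, whereas the real DeviceRTL chunker distributes iterations across teams and threads. Only the callback shape `void (*fn)(TY, void *)` is taken from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sequential stand-in for a static loop chunker: it calls
// fn(iv, arg) once for every iv in [0, num_iters). The real runtime splits
// this range across teams and threads instead of running it in one loop.
template <typename Ty>
void run_static_loop(void (*fn)(Ty, void *), void *arg, Ty num_iters) {
  for (Ty iv = 0; iv < num_iters; ++iv)
    fn(iv, arg);
}

// Loop body: records each induction-variable value it is called with.
static void record_iteration(uint32_t iv, void *arg) {
  static_cast<std::vector<uint32_t> *>(arg)->push_back(iv);
}

// Returns the iterations executed when an entry point forwards
// trip_count + extra to the chunker. extra = 1 models the behavior before
// this patch; extra = 0 models the behavior after it.
std::vector<uint32_t> iterations_executed(uint32_t trip_count,
                                          uint32_t extra) {
  std::vector<uint32_t> seen;
  run_static_loop<uint32_t>(record_iteration, &seen, trip_count + extra);
  return seen;
}
```

In this model, `iterations_executed(4, 0)` visits iterations 0 through 3, while `iterations_executed(4, 1)` also visits iteration 4, which lies outside the loop the caller asked for.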


llvmbot commented Mar 28, 2025

@llvm/pr-subscribers-offload

Author: Sergio Afonso (skatrak)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/133435.diff

1 File Affected:

  • (modified) offload/DeviceRTL/src/Workshare.cpp (+3-3)
diff --git a/offload/DeviceRTL/src/Workshare.cpp b/offload/DeviceRTL/src/Workshare.cpp
index 861b9ca371ccd..a8759307b42bd 100644
--- a/offload/DeviceRTL/src/Workshare.cpp
+++ b/offload/DeviceRTL/src/Workshare.cpp
@@ -911,19 +911,19 @@ template <typename Ty> class StaticLoopChunker {
           IdentTy *loc, void (*fn)(TY, void *), void *arg, TY num_iters,       \
           TY num_threads, TY block_chunk, TY thread_chunk) {                   \
     ompx::StaticLoopChunker<TY>::DistributeFor(                                \
-        loc, fn, arg, num_iters + 1, num_threads, block_chunk, thread_chunk);  \
+        loc, fn, arg, num_iters, num_threads, block_chunk, thread_chunk);      \
   }                                                                            \
   [[gnu::flatten, clang::always_inline]] void                                  \
       __kmpc_distribute_static_loop##BW(IdentTy *loc, void (*fn)(TY, void *),  \
                                         void *arg, TY num_iters,               \
                                         TY block_chunk) {                      \
-    ompx::StaticLoopChunker<TY>::Distribute(loc, fn, arg, num_iters + 1,       \
+    ompx::StaticLoopChunker<TY>::Distribute(loc, fn, arg, num_iters,           \
                                             block_chunk);                      \
   }                                                                            \
   [[gnu::flatten, clang::always_inline]] void __kmpc_for_static_loop##BW(      \
       IdentTy *loc, void (*fn)(TY, void *), void *arg, TY num_iters,           \
       TY num_threads, TY thread_chunk) {                                       \
-    ompx::StaticLoopChunker<TY>::For(loc, fn, arg, num_iters + 1, num_threads, \
+    ompx::StaticLoopChunker<TY>::For(loc, fn, arg, num_iters, num_threads,     \
                                      thread_chunk);                            \
   }
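The hunk above edits the body of a macro that uses `##BW` token pasting to generate one entry point per integer type. A reduced host-side sketch of that pattern follows. It rests on assumptions: `MockLoopChunker`, `MOCK_LOOP_ENTRY`, the `mock_` prefix and the `_4u`/`_8u` suffixes are hypothetical, and the `loc`, `num_threads` and chunk parameters from the diff are omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ompx::StaticLoopChunker<Ty>: runs the loop body
// sequentially for iv in [0, num_iters).
template <typename Ty> struct MockLoopChunker {
  static void For(void (*fn)(Ty, void *), void *arg, Ty num_iters) {
    for (Ty iv = 0; iv < num_iters; ++iv)
      fn(iv, arg);
  }
};

// One macro expansion per width: BW is pasted onto the function name and TY
// selects the induction-variable type. num_iters is forwarded unchanged,
// matching the patched behavior.
#define MOCK_LOOP_ENTRY(BW, TY)                                                \
  inline void mock_for_static_loop##BW(void (*fn)(TY, void *), void *arg,      \
                                       TY num_iters) {                         \
    MockLoopChunker<TY>::For(fn, arg, num_iters);                              \
  }

MOCK_LOOP_ENTRY(_4u, uint32_t) // defines mock_for_static_loop_4u
MOCK_LOOP_ENTRY(_8u, uint64_t) // defines mock_for_static_loop_8u

// Loop bodies that count how many times they are invoked.
static void count_u32(uint32_t, void *arg) { ++*static_cast<int *>(arg); }
static void count_u64(uint64_t, void *arg) { ++*static_cast<int *>(arg); }

// Helpers that run a loop of n iterations and return the number of calls.
int calls_u32(uint32_t n) {
  int calls = 0;
  mock_for_static_loop_4u(count_u32, &calls, n);
  return calls;
}

int calls_u64(uint64_t n) {
  int calls = 0;
  mock_for_static_loop_8u(count_u64, &calls, n);
  return calls;
}
```

Because every generated wrapper shares the same macro body, a single `num_iters + 1` in that body affects every width at once, which is presumably why the fix touches only three lines.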
 


@jhuber6 jhuber6 left a comment


If it passes tests, seems fine.


@jsjodin jsjodin left a comment


LGTM

@skatrak skatrak merged commit 66fca06 into llvm:main Apr 1, 2025
11 checks passed
@skatrak skatrak deleted the fix-worksharing-devicertl-tripcount branch April 1, 2025 09:29
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Apr 1, 2025
This PR should only be merged if llvm#133435 is approved upstream.

It includes the changes from that PR, plus codegen changes that undo a
downstream-only workaround for the issue. That workaround would break if the
upstream PR were merged on its own.
Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request Apr 2, 2025