-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[OpenMP] Fix num_iters in __kmpc_*_loop DeviceRTL functions #133435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This patch removes the addition of 1 to the number of iterations when calling the following DeviceRTL functions: - `__kmpc_distribute_for_static_loop*` - `__kmpc_distribute_static_loop*` - `__kmpc_for_static_loop*` Calls to these functions are currently only produced by the OMPIRBuilder from flang, which already passes the correct number of iterations to these functions. By adding 1 to the received `num_iters` variable, worksharing can produce incorrect results. This impacts flang OpenMP offloading for `do`, `distribute` and `distribute parallel do` constructs. Expecting the application to pass `tripcount - 1` as the argument seems unexpected as well, so rather than updating flang I think it makes more sense to update the runtime.
@llvm/pr-subscribers-offload Author: Sergio Afonso (skatrak) ChangesThis patch removes the addition of 1 to the number of iterations when calling the following DeviceRTL functions:
Calls to these functions are currently only produced by the OMPIRBuilder from flang, which already passes the correct number of iterations to these functions. By adding 1 to the received Expecting the application to pass Full diff: https://github.com/llvm/llvm-project/pull/133435.diff 1 Files Affected:
diff --git a/offload/DeviceRTL/src/Workshare.cpp b/offload/DeviceRTL/src/Workshare.cpp
index 861b9ca371ccd..a8759307b42bd 100644
--- a/offload/DeviceRTL/src/Workshare.cpp
+++ b/offload/DeviceRTL/src/Workshare.cpp
@@ -911,19 +911,19 @@ template <typename Ty> class StaticLoopChunker {
IdentTy *loc, void (*fn)(TY, void *), void *arg, TY num_iters, \
TY num_threads, TY block_chunk, TY thread_chunk) { \
ompx::StaticLoopChunker<TY>::DistributeFor( \
- loc, fn, arg, num_iters + 1, num_threads, block_chunk, thread_chunk); \
+ loc, fn, arg, num_iters, num_threads, block_chunk, thread_chunk); \
} \
[[gnu::flatten, clang::always_inline]] void \
__kmpc_distribute_static_loop##BW(IdentTy *loc, void (*fn)(TY, void *), \
void *arg, TY num_iters, \
TY block_chunk) { \
- ompx::StaticLoopChunker<TY>::Distribute(loc, fn, arg, num_iters + 1, \
+ ompx::StaticLoopChunker<TY>::Distribute(loc, fn, arg, num_iters, \
block_chunk); \
} \
[[gnu::flatten, clang::always_inline]] void __kmpc_for_static_loop##BW( \
IdentTy *loc, void (*fn)(TY, void *), void *arg, TY num_iters, \
TY num_threads, TY thread_chunk) { \
- ompx::StaticLoopChunker<TY>::For(loc, fn, arg, num_iters + 1, num_threads, \
+ ompx::StaticLoopChunker<TY>::For(loc, fn, arg, num_iters, num_threads, \
thread_chunk); \
}
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it passes tests, seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR should only be merged if llvm#133435 is approved upstream. It includes changes in that PR and changes to codegen undoing a currently downstream-only workaround for the issue that would break if the upstream PR is merged on its own.
) This patch removes the addition of 1 to the number of iterations when calling the following DeviceRTL functions: - `__kmpc_distribute_for_static_loop*` - `__kmpc_distribute_static_loop*` - `__kmpc_for_static_loop*` Calls to these functions are currently only produced by the OMPIRBuilder from flang, which already passes the correct number of iterations to these functions. By adding 1 to the received `num_iters` variable, worksharing can produce incorrect results. This impacts flang OpenMP offloading of `do`, `distribute` and `distribute parallel do` constructs. Expecting the application to pass `tripcount - 1` as the argument seems unexpected as well, so rather than updating flang I think it makes more sense to update the runtime.
This patch removes the addition of 1 to the number of iterations when calling the following DeviceRTL functions:
__kmpc_distribute_for_static_loop*
__kmpc_distribute_static_loop*
__kmpc_for_static_loop*
Calls to these functions are currently only produced by the OMPIRBuilder from flang, which already passes the correct number of iterations to these functions. By adding 1 to the received
num_iters
variable, worksharing can produce incorrect results. This impacts flang OpenMP offloading ofdo
,distribute
anddistribute parallel do
constructs.Expecting the application to pass
tripcount - 1
as the argument seems unexpected as well, so rather than updating flang I think it makes more sense to update the runtime.