Skip to content

[OpenMP] Address __kmp_dist_for_static_init issue #129902

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 17, 2025

Conversation

skatrak
Copy link
Member

@skatrak skatrak commented Mar 5, 2025

This patch attempts to provide a fix for an issue that appears when the __kmp_dist_for_static_init function is called from a serialized team.

This is triggered by code generated by flang for distribute parallel do constructs whenever an if clause for the parallel leaf construct is present. This results in the introduction of a call to __kmpc_fork_call_if in place of __kmpc_fork_call. When it evaluates to false, it defers execution to __kmp_serialized_parallel, which creates a new serial team that is picked up by __kmp_dist_for_static_init, resulting in an incorrect team pointer that causes the nteams == (kmp_uint32)team->t.t_parent->t.t_nproc assertion to fail.

The sequence of calls replicating this issue can be summarized as:

  • __kmpc_fork_teams
  • __kmpc_fork_call_if
  • __kmpc_dist_for_static_init_*

Since I am not familiar with the implementation of the OpenMP runtime, it is possible that the above sequence of calls is incorrect, or that the bug can be better fixed in another way, so I am open to discussing this.

The following Fortran program can be compiled with flang to show the issue:

! Compile and run: flang -fopenmp test.f90 -o test && ./test
! Check LLVM IR: flang -fc1 -emit-llvm -fopenmp test.f90 -o -

program main
  implicit none
  integer, parameter :: n = 10
  integer :: i, idx(n)

  !$omp teams
  !$omp distribute parallel do if(.false.)
  do i=1,n
    idx(i) = i
  end do
  !$omp end teams

  print *, idx
end program

@llvmbot llvmbot added the openmp:libomp OpenMP host runtime label Mar 5, 2025
skatrak added a commit to skatrak/aomp that referenced this pull request Mar 5, 2025
This test is currently causing a runtime failure related to some interaction
between serial `teams` regions and `distribute parallel do` execution,
triggered by `if` clauses attached to `parallel` evaluating to `false`.

A potential fix to the issue is being proposed: llvm/llvm-project#129902.
skatrak added a commit to skatrak/aomp that referenced this pull request Mar 6, 2025
This test is currently causing a runtime failure related to some interaction
between serial `teams` regions and `distribute parallel do` execution,
triggered by `if` clauses attached to `parallel` evaluating to `false`.

A potential fix to the issue is being proposed: llvm/llvm-project#129902.
skatrak added a commit to ROCm/aomp that referenced this pull request Mar 6, 2025
This test is currently causing a runtime failure related to some interaction
between serial `teams` regions and `distribute parallel do` execution,
triggered by `if` clauses attached to `parallel` evaluating to `false`.

A potential fix to the issue is being proposed: llvm/llvm-project#129902.
@hansangbae
Copy link
Contributor

@skatrak can you try the following patch instead?

Use of the __kmp_aux_get...() functions ensures the runtime gets the correct numbers whether it is serialized or not.

diff --git a/openmp/runtime/src/kmp_sched.cpp b/openmp/runtime/src/kmp_sched.cpp
index 2e0dfac6eeb3..8404928c6ff3 100644
--- a/openmp/runtime/src/kmp_sched.cpp
+++ b/openmp/runtime/src/kmp_sched.cpp
@@ -542,9 +542,10 @@ static void __kmp_dist_for_static_init(ident_t *loc, kmp_int32 gtid,
   nth = th->th.th_team_nproc;
   team = th->th.th_team;
   KMP_DEBUG_ASSERT(th->th.th_teams_microtask); // we are in the teams construct
-  nteams = th->th.th_teams_size.nteams;
-  team_id = team->t.t_master_tid;
-  KMP_DEBUG_ASSERT(nteams == (kmp_uint32)team->t.t_parent->t.t_nproc);
+  nteams = __kmp_aux_get_num_teams();
+  team_id = __kmp_aux_get_team_num();
+  KMP_DEBUG_ASSERT(nteams == 1 ||
+                   nteams == (kmp_uint32)team->t.t_parent->t.t_nproc);
 
   // compute global trip count
   if (incr == 1) {
}

@skatrak
Copy link
Member Author

skatrak commented Mar 7, 2025

@hansangbae thank you for checking this. Your suggestion is not working for me, unfortunately. It's somehow picking up another team of threads, because nteams is not affected by num_teams clauses in the way it should (we're calling __kmpc_push_num_teams_51 to set the number of teams prior to calling __kmpc_fork_teams, in case that has something to do with it).

Interestingly, on a 24-core CPU, I get nteams=24 by default and when passing the num_teams(1) clause. If I pass num_teams(2), then nteams=12, num_teams(3) -> nteams=8, etc.

Also, team_id=0 always, so it doesn't properly calculate the iteration range of each thread, so every team executes the same chunk of iterations.

@mjklemm
Copy link
Contributor

mjklemm commented Mar 11, 2025

@skatrak @hansangbae Can you guys dig a bit deeper why this is the case? I'm tempted to think that there's an underlying bug that surfaces through the containing target region. Does the same weirdness happen for a teams region w/o enclosing target?

@skatrak
Copy link
Member Author

skatrak commented Mar 12, 2025

@mjklemm If you look closely at the test I shared, you can see it actually contains no target regions in it. But the same issue arises if an if clause for target evaluates to false and it also evaluates to false for the parallel construct, e.g.:

!$omp target teams distribute parallel do if(.false.)
...
!$omp target teams if(.false.)
!$omp distribute parallel do if(.false.)
...
!$omp target teams distribute parallel do if(target:.false.) if(parallel:.false.)
...

Basically, this bug is triggered anytime distribute parallel do runs on the host and an if clause for the parallel leaf construct evaluates to false. Which I narrowed down to the __kmpc_fork_teams + __kmpc_fork_call_if + __kmpc_dist_for_static_init_* sequence of OpenMP runtime calls, but I don't know enough about how it works to actually dig much deeper than that.

@jprotze
Copy link
Collaborator

jprotze commented Mar 13, 2025

The assertion breaks, because of an "optimization" in the kmp code, that always triggers problems:
When encountering nested serialized teams, t_serialized encodes their nesting level and no explicit kmp_base_team object is allocated for the nested team. So, accessing parent is not the actual parent, but some ancestor. @hansangbae 's suggestion for the fix sounds reasonable. Alternatively, you could be more explicit: KMP_DEBUG_ASSERT(team->t.t_serialized > 1 || nteams == (kmp_uint32)team->t.t_parent->t.t_nproc);

Without checking, I would assume that

!$omp parallel num_threads(4)
!$omp masked
!$omp teams distribute parallel do if(teams:.false.) if(parallel:.false.)

Breaks with your patch. You will compare 4 with 1

@skatrak
Copy link
Member Author

skatrak commented Mar 13, 2025

I just created a gist to reproduce this exact issue without having to use flang: https://gist.github.com/skatrak/b6da0b65c433d35cd7170e4d3a007f03. It directly calls the relevant OpenMP runtime functions, so it's hopefully easier to debug.

@skatrak
Copy link
Member Author

skatrak commented Mar 13, 2025

Thank you @jprotze for the explanation. The problem with @hansangbae's solution is that for some reason it does still not fix the issue in this case, as I shared previously. Also, replacing the assert with your suggestion on its own doesn't stop it from asserting either, because team->t.t_serialized=1 in this case.

The example you shared where my patch would break is, as far as I understand, invalid OpenMP. This is because teams regions must be strictly nested within the implicit parallel region surrounding the whole program or a target region (5.2: 232, 5-6). So it doesn't look like it could fail in that way. But maybe I'm just misunderstanding this, so let me know.

I can imagine it's possible that the bug is elsewhere in the runtime or the sequence of runtime calls is somehow incorrect and we're reaching the call to __kmp_dist_for_static_init in an invalid state, that's why I created the gist. That way it's easier to at least evaluate the second point and to debug the first.

@mjklemm
Copy link
Contributor

mjklemm commented Mar 13, 2025

The example you shared where my patch would break is, as far as I understand, invalid OpenMP. This is because teams regions must be strictly nested within the implicit parallel region surrounding the whole program or a target region (5.2: 232, 5-6). So it doesn't look like it could fail in that way. But maybe I'm just misunderstanding this, so let me know.

Yes, that is correct. The sequence of directives that @jprotze shared should not be allowed as per the nesting rules for teams

@jprotze
Copy link
Collaborator

jprotze commented Mar 13, 2025

Sure, you are right. I just thought about the nesting from the perspective of lomp data structures, but not whether the code actually makes sense.

With the code from your gist, I can reproduce the issue after I set OMP_NUM_THREADS>1 (while 1 is the default on my test system :)
Since the assertion also triggers with ./a.out T F in which case team->t.t_parent->t.t_nproc = nthreads-var (OMP_NUM_THREADS) and nteams = 2, the solution proposed by @hansangbae doesn't work.

When spawning the teams, the runtime already forks nteams*nthreads-var threads, even if they are not used. These are organized in this additional base_team structure. In case of __kmp_fork_call, a special code path is taken if (__kmp_is_fork_in_teams(...)) is true, but in case of __kmpc_serialized_parallel is used in __kmpc_fork_call_if, a new serialized team is nested into the preallocated team.

For now, I think, your patch is the best solution to fix the broken assertion.

As a long-term solution, I think that __kmpc_serialized_parallel should eventually also call __kmp_fork_in_teams to keep the nesting of teams structures clean and provide better results with OMPD/OMPT.

This patch attempts to provide a fix for an issue that appears when the
`__kmp_dist_for_static_init` function is called from a serialized team.

This is triggered by code generated by flang for `distribute parallel do`
constructs whenever an `if` clause for the `parallel` leaf construct is
present. This results in the introduction of a call to `__kmpc_fork_call_if` in
place of `__kmpc_fork_call`. When it evaluates to `false`, it defers execution
to `__kmp_serialized_parallel`, which creates a new serial team that is picked
up by `__kmp_dist_for_static_init`, resulting in an incorrect `team` pointer
that causes the `nteams == (kmp_uint32)team->t.t_parent->t.t_nproc` assertion
to fail.

The sequence of calls replicating this issue can be summarized as:
  - __kmpc_fork_teams
  - __kmpc_fork_call_if
  - __kmpc_dist_for_static_init_*

Since I am not familiar with the implementation of the OpenMP runtime, it is
possible that the previous sequence of calls is incorrect, or that the bug can
be better fixed in another way, so I am open to discussing this.

The following Fortran program can be compiled with flang to show the issue:

```f90
! Compile and run: flang -fopenmp test.f90 -o test && ./test
! Check LLVM IR: flang -fc1 -emit-llvm -fopenmp test.f90 -o -

program main
  implicit none
  integer, parameter :: n = 10
  integer :: i, idx(n)

  !$omp teams
  !$omp distribute parallel do if(.false.)
  do i=1,n
    idx(i) = i
  end do
  !$omp end teams

  print *, idx
end program
```
@skatrak skatrak force-pushed the distribute-parallel-do-if-host branch from 4aeb31f to 2b04a0f Compare March 14, 2025 16:12
@skatrak
Copy link
Member Author

skatrak commented Mar 14, 2025

Thank you again @jprotze for the detailed look, your explanation does line up with what I was seeing. I just updated the patch to guard against the while loop from setting team=nullptr, just to be safe, and updated the comment to hopefully make it clearer. Let me know if you'd like to see any further changes or if it can be merged.

Edit: There's one other spot where this appears to also make sense to do (__kmp_team_static_init). Let me know if you'd like me to add the fix there as well.

Copy link
Collaborator

@jprotze jprotze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@skatrak skatrak merged commit 6085f3f into llvm:main Mar 17, 2025
9 checks passed
@skatrak skatrak deleted the distribute-parallel-do-if-host branch March 17, 2025 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
openmp:libomp OpenMP host runtime
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants