Skip to content

[flang][OpenMP] Fix LASTPRIVATE for iteration variables #69773

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 24, 2023

Conversation

luporl
Copy link
Contributor

@luporl luporl commented Oct 20, 2023

Iteration variables behave slightly different with LASTPRIVATE,
as they must be assigned the value that the copy would have after
sequential execution of the loop. This case is now handled.

This patch also fixes LASTPRIVATE for loops where the step is
different from 1.

Fixes #64055

Iteration variables behave slightly different with LASTPRIVATE,
as they must be assigned the value that the copy would have after
sequential execution of the loop. This case is now handled.

This patch also fixes LASTPRIVATE for loops where the step is
different from 1.

Fixes llvm#64055
@llvmbot llvmbot added flang Flang issues not falling into any other category flang:fir-hlfir flang:openmp labels Oct 20, 2023
@llvmbot
Copy link
Member

llvmbot commented Oct 20, 2023

@llvm/pr-subscribers-flang-openmp

@llvm/pr-subscribers-flang-fir-hlfir

Author: Leandro Lupori (luporl)

Changes

Iteration variables behave slightly different with LASTPRIVATE,
as they must be assigned the value that the copy would have after
sequential execution of the loop. This case is now handled.

This patch also fixes LASTPRIVATE for loops where the step is
different from 1.

Fixes #64055


Full diff: https://github.com/llvm/llvm-project/pull/69773.diff

5 Files Affected:

  • (modified) flang/lib/Lower/OpenMP.cpp (+43-15)
  • (modified) flang/test/Lower/OpenMP/FIR/lastprivate-commonblock.f90 (+12-6)
  • (modified) flang/test/Lower/OpenMP/FIR/parallel-lastprivate-clause-scalar.f90 (+48-12)
  • (modified) flang/test/Lower/OpenMP/lastprivate-commonblock.f90 (+7-1)
  • (added) flang/test/Lower/OpenMP/lastprivate-iv.f90 (+66)
diff --git a/flang/lib/Lower/OpenMP.cpp b/flang/lib/Lower/OpenMP.cpp
index 5f5e968eaaa6414..91250fbaf42d6dd 100644
--- a/flang/lib/Lower/OpenMP.cpp
+++ b/flang/lib/Lower/OpenMP.cpp
@@ -109,6 +109,7 @@ class DataSharingProcessor {
   bool hasLastPrivateOp;
   mlir::OpBuilder::InsertPoint lastPrivIP;
   mlir::OpBuilder::InsertPoint insPt;
+  mlir::Value loopIV;
   // Symbols in private, firstprivate, and/or lastprivate clauses.
   llvm::SetVector<const Fortran::semantics::Symbol *> privatizedSymbols;
   llvm::SetVector<const Fortran::semantics::Symbol *> defaultSymbols;
@@ -157,6 +158,11 @@ class DataSharingProcessor {
   // dealocation code as well.
   void processStep1();
   void processStep2(mlir::Operation *op, bool isLoop);
+
+  void setLoopIV(mlir::Value iv) {
+    assert(!loopIV && "Loop iteration variable already set");
+    loopIV = iv;
+  }
 };
 
 void DataSharingProcessor::processStep1() {
@@ -270,7 +276,6 @@ void DataSharingProcessor ::insertBarrier() {
 }
 
 void DataSharingProcessor::insertLastPrivateCompare(mlir::Operation *op) {
-  mlir::arith::CmpIOp cmpOp;
   bool cmpCreated = false;
   mlir::OpBuilder::InsertPoint localInsPt = firOpBuilder.saveInsertionPoint();
   for (const Fortran::parser::OmpClause &clause : opClauseList.v) {
@@ -349,18 +354,17 @@ void DataSharingProcessor::insertLastPrivateCompare(mlir::Operation *op) {
           }
         }
       } else if (mlir::isa<mlir::omp::WsLoopOp>(op)) {
-        mlir::Operation *lastOper = op->getRegion(0).back().getTerminator();
-        firOpBuilder.setInsertionPoint(lastOper);
-
         // Update the original variable just before exiting the worksharing
         // loop. Conversion as follows:
         //
         //                       omp.wsloop {
         // omp.wsloop {            ...
         //    ...                  store
-        //    store       ===>     %cmp = llvm.icmp "eq" %iv %ub
-        //    omp.yield            fir.if %cmp {
-        // }                         ^%lpv_update_blk:
+        //    store       ===>     %v = arith.addi %iv, %step
+        //    omp.yield            %cmp = %step < 0 ? %v < %ub : %v > %ub
+        // }                       fir.if %cmp {
+        //                           fir.store %v to %loopIV
+        //                           ^%lpv_update_blk:
         //                         }
         //                         omp.yield
         //                       }
@@ -368,15 +372,37 @@ void DataSharingProcessor::insertLastPrivateCompare(mlir::Operation *op) {
 
         // Only generate the compare once in presence of multiple LastPrivate
         // clauses.
-        if (!cmpCreated) {
-          cmpOp = firOpBuilder.create<mlir::arith::CmpIOp>(
-              op->getLoc(), mlir::arith::CmpIPredicate::eq,
-              op->getRegion(0).front().getArguments()[0],
-              mlir::dyn_cast<mlir::omp::WsLoopOp>(op).getUpperBound()[0]);
-        }
-        auto ifOp =
-            firOpBuilder.create<fir::IfOp>(op->getLoc(), cmpOp, /*else*/ false);
+        if (cmpCreated)
+          continue;
+        cmpCreated = true;
+
+        mlir::Location loc = op->getLoc();
+        mlir::Operation *lastOper = op->getRegion(0).back().getTerminator();
+        firOpBuilder.setInsertionPoint(lastOper);
+
+        mlir::Value iv = op->getRegion(0).front().getArguments()[0];
+        mlir::Value ub =
+            mlir::dyn_cast<mlir::omp::WsLoopOp>(op).getUpperBound()[0];
+        mlir::Value step = mlir::dyn_cast<mlir::omp::WsLoopOp>(op).getStep()[0];
+
+        // v = iv + step
+        // cmp = step < 0 ? v < ub : v > ub
+        mlir::Value v = firOpBuilder.create<mlir::arith::AddIOp>(loc, iv, step);
+        mlir::Value zero =
+            firOpBuilder.createIntegerConstant(loc, step.getType(), 0);
+        mlir::Value negativeStep = firOpBuilder.create<mlir::arith::CmpIOp>(
+            loc, mlir::arith::CmpIPredicate::slt, step, zero);
+        mlir::Value vLT = firOpBuilder.create<mlir::arith::CmpIOp>(
+            loc, mlir::arith::CmpIPredicate::slt, v, ub);
+        mlir::Value vGT = firOpBuilder.create<mlir::arith::CmpIOp>(
+            loc, mlir::arith::CmpIPredicate::sgt, v, ub);
+        mlir::Value cmpOp = firOpBuilder.create<mlir::arith::SelectOp>(
+            loc, negativeStep, vLT, vGT);
+
+        auto ifOp = firOpBuilder.create<fir::IfOp>(loc, cmpOp, /*else*/ false);
         firOpBuilder.setInsertionPointToStart(&ifOp.getThenRegion().front());
+        assert(loopIV && "loopIV was not set");
+        firOpBuilder.create<fir::StoreOp>(op->getLoc(), v, loopIV);
         lastPrivIP = firOpBuilder.saveInsertionPoint();
       } else {
         TODO(converter.getCurrentLocation(),
@@ -2128,6 +2154,8 @@ static void createBodyOfOp(
       proc.processStep1();
       proc.processStep2(op, is_loop);
     } else {
+      if (is_loop && args.size() > 0)
+        dsp->setLoopIV(converter.getSymbolAddress(*args[0]));
       dsp->processStep2(op, is_loop);
     }
 
diff --git a/flang/test/Lower/OpenMP/FIR/lastprivate-commonblock.f90 b/flang/test/Lower/OpenMP/FIR/lastprivate-commonblock.f90
index 06f3e1ca82234ee..7d2118305fb4c49 100644
--- a/flang/test/Lower/OpenMP/FIR/lastprivate-commonblock.f90
+++ b/flang/test/Lower/OpenMP/FIR/lastprivate-commonblock.f90
@@ -19,12 +19,18 @@
 !CHECK: %[[val_c1_i32_0:.*]] = arith.constant 1 : i32
 !CHECK: omp.wsloop   for (%[[arg:.*]]) : i32 = (%[[val_c1_i32]]) to (%[[val_c100_i32]]) inclusive step (%[[val_c1_i32_0]]) {
 !CHECK: fir.store %[[arg]] to %[[val_0]] : !fir.ref<i32>
-!CHECK: %[[val_11:.*]] = arith.cmpi eq, %[[arg]], %[[val_c100_i32]] : i32
-!CHECK: fir.if %[[val_11]] {
-!CHECK: %[[val_12:.*]] = fir.load %[[val_9]] : !fir.ref<f32>
-!CHECK: fir.store %[[val_12]] to %[[val_5]] : !fir.ref<f32>
-!CHECK: %[[val_13:.*]] = fir.load %[[val_10]] : !fir.ref<f32>
-!CHECK: fir.store %[[val_13]] to %[[val_8]] : !fir.ref<f32>
+!CHECK: %[[val_11:.*]] = arith.addi %[[arg]], %[[val_c1_i32_0]] : i32
+!CHECK: %[[val_c0_i32:.*]] = arith.constant 0 : i32
+!CHECK: %[[val_12:.*]] = arith.cmpi slt, %[[val_c1_i32_0]], %[[val_c0_i32]] : i32
+!CHECK: %[[val_13:.*]] = arith.cmpi slt, %[[val_11]], %[[val_c100_i32]] : i32
+!CHECK: %[[val_14:.*]] = arith.cmpi sgt, %[[val_11]], %[[val_c100_i32]] : i32
+!CHECK: %[[val_15:.*]] = arith.select %[[val_12]], %[[val_13]], %[[val_14]] : i1
+!CHECK: fir.if %[[val_15]] {
+!CHECK: fir.store %[[val_11]] to %[[val_0]] : !fir.ref<i32>
+!CHECK: %[[val_16:.*]] = fir.load %[[val_9]] : !fir.ref<f32>
+!CHECK: fir.store %[[val_16]] to %[[val_5]] : !fir.ref<f32>
+!CHECK: %[[val_17:.*]] = fir.load %[[val_10]] : !fir.ref<f32>
+!CHECK: fir.store %[[val_17]] to %[[val_8]] : !fir.ref<f32>
 !CHECK: }
 !CHECK: omp.yield
 !CHECK: }
diff --git a/flang/test/Lower/OpenMP/FIR/parallel-lastprivate-clause-scalar.f90 b/flang/test/Lower/OpenMP/FIR/parallel-lastprivate-clause-scalar.f90
index aa95c0712fbb88a..67e24088c533867 100644
--- a/flang/test/Lower/OpenMP/FIR/parallel-lastprivate-clause-scalar.f90
+++ b/flang/test/Lower/OpenMP/FIR/parallel-lastprivate-clause-scalar.f90
@@ -1,4 +1,4 @@
-! This test checks lowering of `FIRSTPRIVATE` clause for scalar types.
+! This test checks lowering of `LASTPRIVATE` clause for scalar types.
 
 ! RUN: bbc -fopenmp -emit-fir %s -o - | FileCheck %s
 ! RUN: flang-new -fc1 -fopenmp -emit-fir %s -o - | FileCheck %s
@@ -24,8 +24,14 @@
 !CHECK-NEXT: %[[CALL_END_IO:.*]] = fir.call @_FortranAioEndIoStatement(%[[CALL_BEGIN_IO]])
 
 ! Testing last iteration check
-!CHECK-NEXT: %[[IV_CMP:.*]] = arith.cmpi eq, %[[INDX_WS]]
+!CHECK: %[[V:.*]] = arith.addi %[[INDX_WS]], %{{.*}} : i32
+!CHECK: %[[C0:.*]] = arith.constant 0 : i32
+!CHECK: %[[T1:.*]] = arith.cmpi slt, %{{.*}}, %[[C0]] : i32
+!CHECK: %[[T2:.*]] = arith.cmpi slt, %[[V]], %{{.*}} : i32
+!CHECK: %[[T3:.*]] = arith.cmpi sgt, %[[V]], %{{.*}} : i32
+!CHECK: %[[IV_CMP:.*]] = arith.select %[[T1]], %[[T2]], %[[T3]] : i1
 !CHECK: fir.if %[[IV_CMP]] {
+!CHECK: fir.store %[[V]] to %{{.*}} : !fir.ref<i32>
 
 ! Testing lastprivate val update
 !CHECK-DAG: %[[CVT:.*]] = fir.convert %[[ARG1_REF]] : (!fir.ref<!fir.char<1,5>>) -> !fir.ref<i8>
@@ -52,8 +58,14 @@ subroutine lastprivate_character(arg1)
 !CHECK: omp.wsloop for (%[[INDX_WS:.*]]) : {{.*}} {
 
 ! Testing last iteration check
-!CHECK-DAG: %[[IV_CMP:.*]] = arith.cmpi eq, %[[INDX_WS]]
-!CHECK-DAG: fir.if %[[IV_CMP]] {
+!CHECK: %[[V:.*]] = arith.addi %[[INDX_WS]], %{{.*}} : i32
+!CHECK: %[[C0:.*]] = arith.constant 0 : i32
+!CHECK: %[[T1:.*]] = arith.cmpi slt, %{{.*}}, %[[C0]] : i32
+!CHECK: %[[T2:.*]] = arith.cmpi slt, %[[V]], %{{.*}} : i32
+!CHECK: %[[T3:.*]] = arith.cmpi sgt, %[[V]], %{{.*}} : i32
+!CHECK: %[[IV_CMP:.*]] = arith.select %[[T1]], %[[T2]], %[[T3]] : i1
+!CHECK: fir.if %[[IV_CMP]] {
+!CHECK: fir.store %[[V]] to %{{.*}} : !fir.ref<i32>
 
 ! Testing lastprivate val update
 !CHECK-NEXT: %[[CLONE_LD:.*]] = fir.load %[[CLONE]] : !fir.ref<i32>
@@ -81,8 +93,14 @@ subroutine lastprivate_int(arg1)
 !CHECK: omp.wsloop for (%[[INDX_WS:.*]]) : {{.*}} {
 
 ! Testing last iteration check
-!CHECK: %[[IV_CMP1:.*]] = arith.cmpi eq, %[[INDX_WS]]
-!CHECK-NEXT: fir.if %[[IV_CMP1]] {
+!CHECK: %[[V:.*]] = arith.addi %[[INDX_WS]], %{{.*}} : i32
+!CHECK: %[[C0:.*]] = arith.constant 0 : i32
+!CHECK: %[[T1:.*]] = arith.cmpi slt, %{{.*}}, %[[C0]] : i32
+!CHECK: %[[T2:.*]] = arith.cmpi slt, %[[V]], %{{.*}} : i32
+!CHECK: %[[T3:.*]] = arith.cmpi sgt, %[[V]], %{{.*}} : i32
+!CHECK: %[[IV_CMP:.*]] = arith.select %[[T1]], %[[T2]], %[[T3]] : i1
+!CHECK: fir.if %[[IV_CMP]] {
+!CHECK: fir.store %[[V]] to %{{.*}} : !fir.ref<i32>
 ! Testing lastprivate val update
 !CHECK-DAG: %[[CLONE_LD1:.*]] = fir.load %[[CLONE1]] : !fir.ref<i32>
 !CHECK-DAG: fir.store %[[CLONE_LD1]] to %[[ARG1]] : !fir.ref<i32>
@@ -112,8 +130,14 @@ subroutine mult_lastprivate_int(arg1, arg2)
 !CHECK: omp.wsloop for (%[[INDX_WS:.*]]) : {{.*}} {
 
 !Testing last iteration check
-!CHECK: %[[IV_CMP1:.*]] = arith.cmpi eq, %[[INDX_WS]]
-!CHECK-NEXT: fir.if %[[IV_CMP1]] {
+!CHECK: %[[V:.*]] = arith.addi %[[INDX_WS]], %{{.*}} : i32
+!CHECK: %[[C0:.*]] = arith.constant 0 : i32
+!CHECK: %[[T1:.*]] = arith.cmpi slt, %{{.*}}, %[[C0]] : i32
+!CHECK: %[[T2:.*]] = arith.cmpi slt, %[[V]], %{{.*}} : i32
+!CHECK: %[[T3:.*]] = arith.cmpi sgt, %[[V]], %{{.*}} : i32
+!CHECK: %[[IV_CMP:.*]] = arith.select %[[T1]], %[[T2]], %[[T3]] : i1
+!CHECK: fir.if %[[IV_CMP]] {
+!CHECK: fir.store %[[V]] to %{{.*}} : !fir.ref<i32>
 !Testing lastprivate val update
 !CHECK-DAG: %[[CLONE_LD2:.*]] = fir.load %[[CLONE2]] : !fir.ref<i32>
 !CHECK-DAG: fir.store %[[CLONE_LD2]] to %[[ARG2]] : !fir.ref<i32>
@@ -148,8 +172,14 @@ subroutine mult_lastprivate_int2(arg1, arg2)
 !CHECK: omp.wsloop for (%[[INDX_WS:.*]]) : {{.*}} {
 
 ! Testing last iteration check
-!CHECK: %[[IV_CMP1:.*]] = arith.cmpi eq, %[[INDX_WS]]
-!CHECK-NEXT: fir.if %[[IV_CMP1]] {
+!CHECK: %[[V:.*]] = arith.addi %[[INDX_WS]], %{{.*}} : i32
+!CHECK: %[[C0:.*]] = arith.constant 0 : i32
+!CHECK: %[[T1:.*]] = arith.cmpi slt, %{{.*}}, %[[C0]] : i32
+!CHECK: %[[T2:.*]] = arith.cmpi slt, %[[V]], %{{.*}} : i32
+!CHECK: %[[T3:.*]] = arith.cmpi sgt, %[[V]], %{{.*}} : i32
+!CHECK: %[[IV_CMP:.*]] = arith.select %[[T1]], %[[T2]], %[[T3]] : i1
+!CHECK: fir.if %[[IV_CMP]] {
+!CHECK: fir.store %[[V]] to %{{.*}} : !fir.ref<i32>
 ! Testing lastprivate val update
 !CHECK-NEXT: %[[CLONE_LD:.*]] = fir.load %[[CLONE2]] : !fir.ref<i32>
 !CHECK-NEXT: fir.store %[[CLONE_LD]] to %[[ARG2]] : !fir.ref<i32>
@@ -179,8 +209,14 @@ subroutine firstpriv_lastpriv_int(arg1, arg2)
 !CHECK-NEXT: omp.barrier
 !CHECK: omp.wsloop for (%[[INDX_WS:.*]]) : {{.*}} {
 ! Testing last iteration check
-!CHECK: %[[IV_CMP1:.*]] = arith.cmpi eq, %[[INDX_WS]]
-!CHECK-NEXT: fir.if %[[IV_CMP1]] {
+!CHECK: %[[V:.*]] = arith.addi %[[INDX_WS]], %{{.*}} : i32
+!CHECK: %[[C0:.*]] = arith.constant 0 : i32
+!CHECK: %[[T1:.*]] = arith.cmpi slt, %{{.*}}, %[[C0]] : i32
+!CHECK: %[[T2:.*]] = arith.cmpi slt, %[[V]], %{{.*}} : i32
+!CHECK: %[[T3:.*]] = arith.cmpi sgt, %[[V]], %{{.*}} : i32
+!CHECK: %[[IV_CMP:.*]] = arith.select %[[T1]], %[[T2]], %[[T3]] : i1
+!CHECK: fir.if %[[IV_CMP]] {
+!CHECK: fir.store %[[V]] to %{{.*}} : !fir.ref<i32>
 ! Testing lastprivate val update
 !CHECK-NEXT: %[[CLONE_LD:.*]] = fir.load %[[CLONE1]] : !fir.ref<i32>
 !CHECK-NEXT: fir.store %[[CLONE_LD]] to %[[ARG1]] : !fir.ref<i32>
diff --git a/flang/test/Lower/OpenMP/lastprivate-commonblock.f90 b/flang/test/Lower/OpenMP/lastprivate-commonblock.f90
index 06fa0a12107763f..a11bdee156637b0 100644
--- a/flang/test/Lower/OpenMP/lastprivate-commonblock.f90
+++ b/flang/test/Lower/OpenMP/lastprivate-commonblock.f90
@@ -16,8 +16,14 @@
 !CHECK:    %[[PRIVATE_Y_REF:.*]] = fir.alloca f32 {bindc_name = "y", pinned, uniq_name = "_QFlastprivate_commonEy"}
 !CHECK:    %[[PRIVATE_Y_DECL:.*]]:2 = hlfir.declare %[[PRIVATE_Y_REF]] {uniq_name = "_QFlastprivate_commonEy"} : (!fir.ref<f32>) -> (!fir.ref<f32>, !fir.ref<f32>)
 !CHECK:    omp.wsloop   for  (%[[I:.*]]) : i32 = (%{{.*}}) to (%{{.*}}) inclusive step (%{{.*}}) {
-!CHECK:      %[[LAST_ITER:.*]] = arith.cmpi eq, %[[I]], %{{.*}} : i32
+!CHECK:      %[[V:.*]] = arith.addi %[[I]], %{{.*}} : i32
+!CHECK:      %[[C0:.*]] = arith.constant 0 : i32
+!CHECK:      %[[NEG_STEP:.*]] = arith.cmpi slt, %{{.*}}, %[[C0]] : i32
+!CHECK:      %[[V_LT:.*]] = arith.cmpi slt, %[[V]], %{{.*}} : i32
+!CHECK:      %[[V_GT:.*]] = arith.cmpi sgt, %[[V]], %{{.*}} : i32
+!CHECK:      %[[LAST_ITER:.*]] = arith.select %[[NEG_STEP]], %[[V_LT]], %[[V_GT]] : i1
 !CHECK:      fir.if %[[LAST_ITER]] {
+!CHECK:        fir.store %[[V]] to %{{.*}} : !fir.ref<i32>
 !CHECK:        %[[PRIVATE_X_VAL:.*]] = fir.load %[[PRIVATE_X_DECL]]#0 : !fir.ref<f32>
 !CHECK:        hlfir.assign %[[PRIVATE_X_VAL]] to %[[X_DECL]]#0 temporary_lhs : f32, !fir.ref<f32>
 !CHECK:        %[[PRIVATE_Y_VAL:.*]] = fir.load %[[PRIVATE_Y_DECL]]#0 : !fir.ref<f32>
diff --git a/flang/test/Lower/OpenMP/lastprivate-iv.f90 b/flang/test/Lower/OpenMP/lastprivate-iv.f90
new file mode 100644
index 000000000000000..70fe500129d128f
--- /dev/null
+++ b/flang/test/Lower/OpenMP/lastprivate-iv.f90
@@ -0,0 +1,66 @@
+! Test LASTPRIVATE with iteration variable.
+! RUN: %flang_fc1 -emit-hlfir -fopenmp -o - %s 2>&1 | FileCheck %s
+
+!CHECK-LABEL: func @_QPlastprivate_iv_inc
+!CHECK:    %[[I_MEM:.*]] = fir.alloca i32 {adapt.valuebyref, pinned}
+!CHECK:    %[[I:.*]]:2 = hlfir.declare %[[I_MEM]] {uniq_name = "_QFlastprivate_iv_incEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+!CHECK:    %[[I2_MEM:.*]] = fir.alloca i32 {bindc_name = "i", uniq_name = "_QFlastprivate_iv_incEi"}
+!CHECK:    %[[I2:.*]]:2 = hlfir.declare %[[I2_MEM]] {uniq_name = "_QFlastprivate_iv_incEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+!CHECK:    %[[LB:.*]] = arith.constant 4 : i32
+!CHECK:    %[[UB:.*]] = arith.constant 10 : i32
+!CHECK:    %[[STEP:.*]]  = arith.constant 3 : i32
+!CHECK:    omp.wsloop for  (%[[IV:.*]]) : i32 = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
+!CHECK:      fir.store %[[IV]] to %[[I]]#1 : !fir.ref<i32>
+!CHECK:      %[[V:.*]] = arith.addi %[[IV]], %[[STEP]] : i32
+!CHECK:      %[[C0:.*]] = arith.constant 0 : i32
+!CHECK:      %[[STEP_NEG:.*]] = arith.cmpi slt, %[[STEP]], %[[C0]] : i32
+!CHECK:      %[[V_LT:.*]] = arith.cmpi slt, %[[V]], %[[UB]] : i32
+!CHECK:      %[[V_GT:.*]] = arith.cmpi sgt, %[[V]], %[[UB]] : i32
+!CHECK:      %[[CMP:.*]] = arith.select %[[STEP_NEG]], %[[V_LT]], %[[V_GT]] : i1
+!CHECK:      fir.if %[[CMP]] {
+!CHECK:        fir.store %[[V]] to %[[I]]#1 : !fir.ref<i32>
+!CHECK:        %[[I_VAL:.*]] = fir.load %[[I]]#0 : !fir.ref<i32>
+!CHECK:        hlfir.assign %[[I_VAL]] to %[[I2]]#0 temporary_lhs : i32, !fir.ref<i32>
+!CHECK:      }
+!CHECK:      omp.yield
+!CHECK:    }
+subroutine lastprivate_iv_inc()
+  integer :: i
+
+  !$omp do lastprivate(i)
+  do i = 4, 10, 3
+  end do
+  !$omp end do
+end subroutine
+
+!CHECK-LABEL: func @_QPlastprivate_iv_dec
+!CHECK:    %[[I_MEM:.*]] = fir.alloca i32 {adapt.valuebyref, pinned}
+!CHECK:    %[[I:.*]]:2 = hlfir.declare %[[I_MEM]] {uniq_name = "_QFlastprivate_iv_decEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+!CHECK:    %[[I2_MEM:.*]] = fir.alloca i32 {bindc_name = "i", uniq_name = "_QFlastprivate_iv_decEi"}
+!CHECK:    %[[I2:.*]]:2 = hlfir.declare %[[I2_MEM]] {uniq_name = "_QFlastprivate_iv_decEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+!CHECK:    %[[LB:.*]] = arith.constant 10 : i32
+!CHECK:    %[[UB:.*]] = arith.constant 1 : i32
+!CHECK:    %[[STEP:.*]]  = arith.constant -3 : i32
+!CHECK:    omp.wsloop for  (%[[IV:.*]]) : i32 = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
+!CHECK:      fir.store %[[IV]] to %[[I]]#1 : !fir.ref<i32>
+!CHECK:      %[[V:.*]] = arith.addi %[[IV]], %[[STEP]] : i32
+!CHECK:      %[[C0:.*]] = arith.constant 0 : i32
+!CHECK:      %[[STEP_NEG:.*]] = arith.cmpi slt, %[[STEP]], %[[C0]] : i32
+!CHECK:      %[[V_LT:.*]] = arith.cmpi slt, %[[V]], %[[UB]] : i32
+!CHECK:      %[[V_GT:.*]] = arith.cmpi sgt, %[[V]], %[[UB]] : i32
+!CHECK:      %[[CMP:.*]] = arith.select %[[STEP_NEG]], %[[V_LT]], %[[V_GT]] : i1
+!CHECK:      fir.if %[[CMP]] {
+!CHECK:        fir.store %[[V]] to %[[I]]#1 : !fir.ref<i32>
+!CHECK:        %[[I_VAL:.*]] = fir.load %[[I]]#0 : !fir.ref<i32>
+!CHECK:        hlfir.assign %[[I_VAL]] to %[[I2]]#0 temporary_lhs : i32, !fir.ref<i32>
+!CHECK:      }
+!CHECK:      omp.yield
+!CHECK:    }
+subroutine lastprivate_iv_dec()
+  integer :: i
+
+  !$omp do lastprivate(i)
+  do i = 10, 1, -3
+  end do
+  !$omp end do
+end subroutine

Copy link
Contributor

@NimishMishra NimishMishra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patch!

If I understand correctly, the previous implementation was generating the fir.if depending on a equality check between v and loopIV. As such, the final update to the iteration variable was being missed.

I'm ok with the approach taken here. Basically, we force the iteration variable to advance by step, and then check if it crosses the upperbound ub. Depending on whether step is positive or negative, a > or a < predicate is needed for comparison. As such, the ternary select operation seems fine.

I've just one point to raise (@kiranchandramohan can comment better). Through this approach, we have introduced a mandatory set of 1 arith.addi, 3 arith.cmpi, and 1 arith.select in every iteration of the loop. Again, not very sure about this, but can this have a bearing on performance, since loops are extensively present across benchmarks we use for validation?

Not analysed this in depth, but could the following work?

%cmp = arith.cmpi eq, %u, %iv
fir.if %cmp {
    %temp = arith.addi %arg0, %step
     fir.store %temp to %iv
}

Basically, we let the current flow (generating if on u = ub) be, and add the extra operations within the fir.if? Would it cause some problems I have missed? I expect if the loop is well formed (say both ub and step are negative), then also the equality condition should work.

// store ===> %v = arith.addi %iv, %step
// omp.yield %cmp = %step < 0 ? %v < %ub : %v > %ub
// } fir.if %cmp {
// fir.store %v to %loopIV
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: An additional fit.store here is redundant. We expect all last private operations to be abstracted within ^%lpv_update_blk

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fir.store here is needed. Considering the example of the previous conversation, loopIV is updated in the beginning of the loop body only, to have the same value as iv, but we need it to hold v instead (iv + step). Also, ^%lpv_update_blk copies loopIV to ia and not iv to loopIV. In the end this store gets optimized, because ^%lpv_update_blk generates a load loopIV right after it.

Copy link
Contributor

@kiranchandramohan kiranchandramohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @luporl for the fix. Looks fine to me.

Is the update of the original loop variable unconditionally done now? Should it be only if lastprivate(loop_variable) is present? Or is the point that the loop_variable is always private and hence the value of the original variable is undefined if it is not lastprivate and hence it is OK to update always?

I've just one point to raise (@kiranchandramohan can comment better). Through this approach, we have introduced a mandatory set of 1 arith.addi, 3 arith.cmpi, and 1 arith.select in every iteration of the loop. Again, not very sure about this, but can this have a bearing on performance, since loops are extensively present across benchmarks we use for validation?

Please refer to https://discourse.llvm.org/t/rfc-privatisation-in-openmp-dialect/3526/4. The kmpc_for_static_init calls provide a value called %.omp.is_last that can be used to check if this is the last iteration. And updates can be placed by checking this value on the last iteration in the exiting path of the loop. One way to achieve this is by having a new operation in the dialect (in-place of the fir.if and related comparisons) say, omp.last_private_loop_update, that takes in the loop_variable, step-size as arguments. This can be extracted out and put in the exit path of the loop. Other alternatives include just modelling the lastprivate variables as part of the omp.wsloop construct and appropriately lowering them with the correct conditions and updates only during lowering to LLVM IR.

Please wait for @NimishMishra.

@luporl
Copy link
Contributor Author

luporl commented Oct 23, 2023

Thank you for the review @NimishMishra!

If I understand correctly, the previous implementation was generating the fir.if depending on a equality check between v and loopIV. As such, the final update to the iteration variable was being missed.

Not exactly. Consider this code snippet:

integer :: ia
ia = 0
!$omp parallel do lastprivate(ia)
do ia = 1, 10
enddo
!$omp end parallel do

In the generated FIR, there will be 3 versions of the ia variable:

%ia = fir.alloca i32 {bindc_name = "ia", uniq_name = "_QFEia"}
omp.parallel {
    %loopIV = fir.alloca i32 {adapt.valuebyref, pinned}
    omp.wsloop for  (%iv) : i32 = (%lb) to (%ub) inclusive step (%step) {
    }
}

%ia - the original variable
%loopIV - this is the private copy of %ia
%iv - this is the iteration variable created and managed by omp.wsloop

So, the previous implementation was actually checking for equality between iv and ub. The problem is that iv goes from lb to ub, but LASTPRIVATE expects ia to be updated with iv + step, where iv's value is that of the last loop iteration. v was added by this patch and is just the result of iv + step.

I've just one point to raise (@kiranchandramohan can comment better). Through this approach, we have introduced a mandatory set of 1 arith.addi, 3 arith.cmpi, and 1 arith.select in every iteration of the loop. Again, not very sure about this, but can this have a bearing on performance, since loops are extensively present across benchmarks we use for validation?

It is true that now we need to perform more operations to find out when we are in the last loop iteration. However, these are added just when LASTPRIVATE is used and is optimized to 2 instructions when lb, ub and step are
constant. In the example above, it becomes:

%v = add i32 %iv, %step
%cmp = icmp sgt i32 %v, 10

But when all of the loop parameters are variables, then I guess it is not always possible to optimize it.

Not analysed this in depth, but could the following work?

%cmp = arith.cmpi eq, %u, %iv
fir.if %cmp {
    %temp = arith.addi %arg0, %step
     fir.store %temp to %iv
}

Basically, we let the current flow (generating if on u = ub) be, and add the extra operations within the fir.if? Would it cause some problems I have missed? I expect if the loop is well formed (say both ub and step are negative), then also the equality condition should work.

This works only when step is 1 or -1. Consider this loop:

ia = 0
!$omp do lastprivate(ia)
do ia = 1, 10, 4
enddo
!$omp end do

iv will have the following values: 1, 5 and 9. After exiting the loop, ia must be updated with 13. With this approach (and without this patch), iv will never be equal to ub, resulting in ia not being updated and remaining with value 0.

@NimishMishra
Copy link
Contributor

Thank you for the detailed response. I'm happy with the current changes. Please go ahead with the merge.

@luporl
Copy link
Contributor Author

luporl commented Oct 23, 2023

Thanks for the review @kiranchandramohan!

Is the update of the original loop variable unconditionally done now? Should it be only if lastprivate(loop_variable) is present? Or is the point that the loop_variable is always private and hence the value of the original variable is undefined if it is not lastprivate and hence it is OK to update always?

The update of the original loop variable is still performed only if lastprivate(loop_variable) is present. But the private copy is now being updated, in the last iteration, whenever a lastprivate clause is present, even if it doesn't include the loop variable. Do you think it's better to omit the store in this case? I've noticed that this store (and the load after it, for loop variables) is usually optimized out, when llvm bitcode is generated, and only the store to the original variable remains.

I've just one point to raise (@kiranchandramohan can comment better). Through this approach, we have introduced a mandatory set of 1 arith.addi, 3 arith.cmpi, and 1 arith.select in every iteration of the loop. Again, not very sure about this, but can this have a bearing on performance, since loops are extensively present across benchmarks we use for validation?

Please refer to https://discourse.llvm.org/t/rfc-privatisation-in-openmp-dialect/3526/4. The kmpc_for_static_init calls provide a value called %.omp.is_last that can be used to check if this is the last iteration. And updates can be placed by checking this value on the last iteration in the exiting path of the loop. One way to achieve this is by having a new operation in the dialect (in-place of the fir.if and related comparisons) say, omp.last_private_loop_update, that takes in the loop_variable, step-size as arguments. This can be extracted out and put in the exit path of the loop. Other alternatives include just modelling the lastprivate variables as part of the omp.wsloop construct and appropriately lowering them with the correct conditions and updates only during lowering to LLVM IR.

If I understand correctly, this is a suggested optimization for a future patch, correct?
This would be especially useful when support for lastprivate with collapse is added. Otherwise, we would need to test the loop variables of each nested loop.

@kiranchandramohan
Copy link
Contributor

The update of the original loop variable is still performed only if lastprivate(loop_variable) is present. But the private copy is now being updated, in the last iteration, whenever a lastprivate clause is present, even if it doesn't include the loop variable. Do you think it's better to omit the store in this case? I've noticed that this store (and the load after it, for loop variables) is usually optimized out, when llvm bitcode is generated, and only the store to the original variable remains.

OK. That is fine.

If I understand correctly, this is a suggested optimization for a future patch, correct?
This would be especially useful when support for lastprivate with collapse is added. Otherwise, we would need to test the loop variables of each nested loop.

Yes, it is something for future work.

@luporl luporl merged commit ed97932 into llvm:main Oct 24, 2023
@luporl luporl deleted the luporl-lastpriv branch October 24, 2023 12:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flang:fir-hlfir flang:openmp flang Flang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flang][OpenMP]Incorrect execution result of LASTPRIVATE clause specified in DO construct
4 participants