Skip to content

[SLP] Cluster SortedBases before sorting. #101144

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 30, 2024

Conversation

davemgreen
Copy link
Collaborator

In order to enforce a strict-weak ordering, this patch clusters the bases that are being sorted by the root - the first value in a gep chain. The sorting is then performed in each cluster.

In order to enforce a strict-weak ordering, this patch clusters the bases that
are being sorted by the root - the first value in a gep chain. The sorting is
then performed in each cluster.
Root = Gep->getOperand(0);
SortedBases.emplace_back(Base.first, Strip, Root);
}
if (SortedBases.size() <= 16) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why limit is 16?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just limiting the complexity in case it was somehow very large. I can remove it if you think that's fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to have some named constants

Comment on lines 4858 to 4859
auto Begin = SortedBases.begin();
auto End = SortedBases.end();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto *

Root = Gep->getOperand(0);
SortedBases.emplace_back(Base.first, Strip, Root);
}
if (SortedBases.size() <= 16) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to have some named constants

Comment on lines 4866 to 4870
while (auto *Gep = dyn_cast<GetElementPtrInst>(V)) {
if (Gep->getOperand(0) == std::get<1>(V1))
return true;
V = Gep->getOperand(0);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do this preliminary, store the result in the container and look into the container rather than running in the loop for each element?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you mean? I was wondering if there would be a nicer way to pre-calculate the results, but wasn't sure how that would look in a way that was better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, instead of a loop in a sort function, do some kind of analysis before or use standard functions, like getUnderlying....

@llvmbot
Copy link
Member

llvmbot commented Jul 30, 2024

@llvm/pr-subscribers-llvm-transforms

Author: David Green (davemgreen)

Changes

In order to enforce a strict-weak ordering, this patch clusters the bases that are being sorted by the root - the first value in a gep chain. The sorting is then performed in each cluster.


Full diff: https://github.com/llvm/llvm-project/pull/101144.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+34-16)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll (+4-4)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 6501a14d87789..98befd1fd51f5 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -4843,25 +4843,43 @@ static bool clusterSortPtrAccesses(ArrayRef<Value *> VL, Type *ElemTy,
     return false;
 
   // If we have a better order, also sort the base pointers by increasing
-  // (variable) values if possible, to try and keep the order more regular.
-  SmallVector<std::pair<Value *, Value *>> SortedBases;
-  for (auto &Base : Bases)
-    SortedBases.emplace_back(Base.first,
-                             Base.first->stripInBoundsConstantOffsets());
-  llvm::stable_sort(SortedBases, [](std::pair<Value *, Value *> V1,
-                                    std::pair<Value *, Value *> V2) {
-    const Value *V = V2.second;
-    while (auto *Gep = dyn_cast<GetElementPtrInst>(V)) {
-      if (Gep->getOperand(0) == V1.second)
-        return true;
-      V = Gep->getOperand(0);
-    }
-    return false;
-  });
+  // (variable) values if possible, to try and keep the order more regular. In
+  // order to create a valid strict-weak order we cluster by the Root of gep
+  // chains and sort within each.
+  SmallVector<std::tuple<Value *, Value *, Value *>> SortedBases;
+  for (auto &Base : Bases) {
+    Value *Strip = Base.first->stripInBoundsConstantOffsets();
+    Value *Root = Strip;
+    while (auto *Gep = dyn_cast<GetElementPtrInst>(Root))
+      Root = Gep->getOperand(0);
+    SortedBases.emplace_back(Base.first, Strip, Root);
+  }
+  auto *Begin = SortedBases.begin();
+  auto *End = SortedBases.end();
+  while (Begin != End) {
+    Value *Root = std::get<2>(*Begin);
+    auto *Mid = std::stable_partition(
+        Begin, End, [&Root](auto V) { return std::get<2>(V) == Root; });
+    DenseMap<Value *, DenseMap<Value *, bool>> LessThan;
+    for (auto I = Begin; I < Mid; ++I)
+      LessThan[std::get<1>(*I)] = DenseMap<Value *, bool>();
+    for (auto I = Begin; I < Mid; ++I) {
+      Value *V = std::get<1>(*I);
+      while (auto *Gep = dyn_cast<GetElementPtrInst>(V)) {
+        V = Gep->getOperand(0);
+        if (LessThan.contains(V))
+          LessThan[V][std::get<1>(*I)] = true;
+      }
+    }
+    std::stable_sort(Begin, Mid, [&LessThan](auto &V1, auto &V2) {
+      return LessThan[std::get<1>(V1)][std::get<1>(V2)];
+    });
+    Begin = Mid;
+  }
 
   // Collect the final order of sorted indices
   for (auto Base : SortedBases)
-    for (auto &T : Bases[Base.first])
+    for (auto &T : Bases[std::get<0>(Base)])
       SortedIndices.push_back(std::get<2>(T));
 
   assert(SortedIndices.size() == VL.size() &&
diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
index 6b5503f26fabf..d79aed89b0be7 100644
--- a/llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
+++ b/llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll
@@ -428,14 +428,14 @@ define i32 @reduce_blockstrided4x4(ptr nocapture noundef readonly %p1, i32 nound
 ; CHECK-NEXT:    [[TMP5:%.*]] = load <4 x i8>, ptr [[ADD_PTR64]], align 1
 ; CHECK-NEXT:    [[TMP6:%.*]] = load <4 x i8>, ptr [[ARRAYIDX3_1]], align 1
 ; CHECK-NEXT:    [[TMP7:%.*]] = load <4 x i8>, ptr [[ARRAYIDX5_1]], align 1
-; CHECK-NEXT:    [[TMP8:%.*]] = shufflevector <4 x i8> [[TMP0]], <4 x i8> [[TMP1]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
-; CHECK-NEXT:    [[TMP9:%.*]] = shufflevector <4 x i8> [[TMP4]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; CHECK-NEXT:    [[TMP8:%.*]] = shufflevector <4 x i8> [[TMP0]], <4 x i8> [[TMP4]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; CHECK-NEXT:    [[TMP9:%.*]] = shufflevector <4 x i8> [[TMP1]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; CHECK-NEXT:    [[TMP10:%.*]] = shufflevector <16 x i8> [[TMP8]], <16 x i8> [[TMP9]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 poison, i32 poison, i32 poison, i32 poison>
 ; CHECK-NEXT:    [[TMP11:%.*]] = shufflevector <4 x i8> [[TMP5]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; CHECK-NEXT:    [[TMP12:%.*]] = shufflevector <16 x i8> [[TMP10]], <16 x i8> [[TMP11]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>
 ; CHECK-NEXT:    [[TMP13:%.*]] = zext <16 x i8> [[TMP12]] to <16 x i32>
-; CHECK-NEXT:    [[TMP14:%.*]] = shufflevector <4 x i8> [[TMP2]], <4 x i8> [[TMP3]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
-; CHECK-NEXT:    [[TMP15:%.*]] = shufflevector <4 x i8> [[TMP6]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; CHECK-NEXT:    [[TMP14:%.*]] = shufflevector <4 x i8> [[TMP2]], <4 x i8> [[TMP6]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; CHECK-NEXT:    [[TMP15:%.*]] = shufflevector <4 x i8> [[TMP3]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; CHECK-NEXT:    [[TMP16:%.*]] = shufflevector <16 x i8> [[TMP14]], <16 x i8> [[TMP15]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 poison, i32 poison, i32 poison, i32 poison>
 ; CHECK-NEXT:    [[TMP17:%.*]] = shufflevector <4 x i8> [[TMP7]], <4 x i8> poison, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
 ; CHECK-NEXT:    [[TMP18:%.*]] = shufflevector <16 x i8> [[TMP16]], <16 x i8> [[TMP17]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 18, i32 19>

Begin, End, [&Root](auto V) { return std::get<2>(V) == Root; });
DenseMap<Value *, DenseMap<Value *, bool>> LessThan;
for (auto I = Begin; I < Mid; ++I)
LessThan[std::get<1>(*I)] = DenseMap<Value *, bool>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LessThan[std::get<1>(*I)] = DenseMap<Value *, bool>();
LessThan.try_mplace(std::get<1>(*I));

@davemgreen davemgreen force-pushed the gh-slp-clusterbeforesort branch from 8b9906e to 2900800 Compare July 30, 2024 19:35
Copy link
Member

@alexey-bataev alexey-bataev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

@davemgreen davemgreen merged commit 89b67a6 into llvm:main Jul 30, 2024
4 of 6 checks passed
@davemgreen davemgreen deleted the gh-slp-clusterbeforesort branch July 30, 2024 21:12
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jul 30, 2024

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 5 "compile-openmp".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/2925

Here is the relevant piece of the build log for the reference:

Step 5 (compile-openmp) failure: build (failure)
...
0.828 [202/34/279] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.atoi.dir/atoi.cpp.o
0.852 [201/34/280] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.fread.dir/fread.cpp.o
0.853 [200/34/281] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.fgetc.dir/fgetc.cpp.o
0.853 [199/34/282] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.fgets.dir/fgets.cpp.o
0.853 [198/34/283] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.remove.dir/remove.cpp.o
0.854 [197/34/284] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.strtol.dir/strtol.cpp.o
0.860 [196/34/285] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.getc.dir/getc.cpp.o
0.861 [195/34/286] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.getchar.dir/getchar.cpp.o
0.872 [194/34/287] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.canonicalizef16.dir/canonicalizef16.cpp.o
0.872 [193/34/288] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncat.dir/strncat.cpp.o
FAILED: libc/src/string/CMakeFiles/libc.src.string.strncat.dir/strncat.cpp.o 
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang++ --target=amdgcn-amd-amdhsa -DLIBC_NAMESPACE=__llvm_libc_20_0_0_git -I/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/libc -isystem /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/include/amdgcn-amd-amdhsa -O3 -DNDEBUG --target=amdgcn-amd-amdhsa -DLIBC_QSORT_IMPL=LIBC_QSORT_QUICK_SORT -fpie -DLIBC_FULL_BUILD -nostdlibinc -ffixed-point -fno-exceptions -fno-lax-vector-conversions -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -Wall -Wextra -Werror -Wconversion -Wno-sign-conversion -Wimplicit-fallthrough -Wwrite-strings -Wextra-semi -Wnewline-eof -Wnonportable-system-include-path -Wstrict-prototypes -Wthread-safety -Wglobal-constructors "-DLIBC_MATH=(LIBC_MATH_SKIP_ACCURATE_PASS | LIBC_MATH_SMALL_TABLES | LIBC_MATH_NO_ERRNO | LIBC_MATH_NO_EXCEPT)" -nogpulib -fvisibility=hidden -fconvergent-functions -flto -Wno-multi-gpu -Xclang -mcode-object-version=none -DLIBC_COPT_PUBLIC_PACKAGING -UNDEBUG -MD -MT libc/src/string/CMakeFiles/libc.src.string.strncat.dir/strncat.cpp.o -MF libc/src/string/CMakeFiles/libc.src.string.strncat.dir/strncat.cpp.o.d -o libc/src/string/CMakeFiles/libc.src.string.strncat.dir/strncat.cpp.o -c /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/libc/src/string/strncat.cpp
In file included from /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/libc/src/string/strncat.cpp:9:
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/libc/src/string/strncat.h:13:10: fatal error: 'string.h' file not found
   13 | #include <string.h>
      |          ^~~~~~~~~~
1 error generated.
0.907 [193/33/289] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.ceilf16.dir/ceilf16.cpp.o
0.921 [193/32/290] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.putchar.dir/putchar.cpp.o
0.922 [193/31/291] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.fputc.dir/fputc.cpp.o
0.938 [193/30/292] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.fwrite.dir/fwrite.cpp.o
0.939 [193/29/293] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.putc.dir/putc.cpp.o
0.947 [193/28/294] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.fputs.dir/fputs.cpp.o
0.952 [193/27/295] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.floorf16.dir/floorf16.cpp.o
0.953 [193/26/296] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.truncf16.dir/truncf16.cpp.o
0.967 [193/25/297] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.fabsf16.dir/fabsf16.cpp.o
0.978 [193/24/298] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.roundf16.dir/roundf16.cpp.o
0.994 [193/23/299] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.roundf.dir/roundf.cpp.o
0.995 [193/22/300] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.roundevenf16.dir/roundevenf16.cpp.o
1.016 [193/21/301] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.sinf.dir/sinf.cpp.o
1.026 [193/20/302] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.tanf.dir/tanf.cpp.o
1.035 [193/19/303] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.cosf.dir/cosf.cpp.o
1.037 [193/18/304] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.lroundf.dir/lroundf.cpp.o
1.038 [193/17/305] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.lround.dir/lround.cpp.o
1.040 [193/16/306] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.sincosf.dir/sincosf.cpp.o
1.050 [193/15/307] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.printf.dir/printf.cpp.o
1.069 [193/14/308] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.rintf16.dir/rintf16.cpp.o
1.084 [193/13/309] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.lroundf16.dir/lroundf16.cpp.o
1.086 [193/12/310] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.llroundf.dir/llroundf.cpp.o
1.086 [193/11/311] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.llround.dir/llround.cpp.o
1.086 [193/10/312] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.llroundf16.dir/llroundf16.cpp.o
1.094 [193/9/313] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.fprintf.dir/fprintf.cpp.o
1.100 [193/8/314] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.vprintf.dir/vprintf.cpp.o
1.108 [193/7/315] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.lrint.dir/lrint.cpp.o
1.109 [193/6/316] Building CXX object libc/src/stdio/gpu/CMakeFiles/libc.src.stdio.gpu.vfprintf.dir/vfprintf.cpp.o
1.135 [193/5/317] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.fadd.dir/fadd.cpp.o
1.155 [193/4/318] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.tan.dir/tan.cpp.o
1.182 [193/3/319] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.sin.dir/sin.cpp.o
1.183 [193/2/320] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.cos.dir/cos.cpp.o

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants