Skip to content

[AMDGPU] Make max dwords of memory cluster configurable #119342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Dec 18, 2024

Conversation

ruiling
Copy link
Contributor

@ruiling ruiling commented Dec 10, 2024

We find it helpful to increase the value for graphics workload. Make it configurable so we can experiment with a different value. It might be more helpful we can have a per-function value, but I am not sure how this can be done properly.

We find it helpful to increase the value for graphics workload.
Make it configurable so we can experiment with a different value.
It might be more helpful we can have a per-function value, but I am not
sure how this can be done properly.
@llvmbot
Copy link
Member

llvmbot commented Dec 10, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Ruiling, Song (ruiling)

Changes

We find it helpful to increase the value for graphics workload. Make it configurable so we can experiment with a different value. It might be more helpful we can have a per-function value, but I am not sure how this can be done properly.


Full diff: https://github.com/llvm/llvm-project/pull/119342.diff

1 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+12-5)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 4a94d690297949..ed651adcc8cd2e 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -60,6 +60,11 @@ static cl::opt<bool> Fix16BitCopies(
   cl::init(true),
   cl::ReallyHidden);
 
+static cl::opt<unsigned> MaxMemoryClusterDWORDS(
+    "amdgpu-max-memory-cluster-dwords", cl::Hidden, cl::init(8),
+    cl::desc(
+        "Restrict the maximum dwords for memory cluster during scheduler"));
+
 SIInstrInfo::SIInstrInfo(const GCNSubtarget &ST)
   : AMDGPUGenInstrInfo(AMDGPU::ADJCALLSTACKUP, AMDGPU::ADJCALLSTACKDOWN),
     RI(ST), ST(ST) {
@@ -565,12 +570,14 @@ bool SIInstrInfo::shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
   }
 
   // In order to avoid register pressure, on an average, the number of DWORDS
-  // loaded together by all clustered mem ops should not exceed 8. This is an
-  // empirical value based on certain observations and performance related
-  // experiments.
+  // loaded together by all clustered mem ops should not exceed
+  // MaxMemoryClusterDWORDS. This is an empirical value based on certain
+  // observations and performance related experiments.
   // The good thing about this heuristic is - it avoids clustering of too many
   // sub-word loads, and also avoids clustering of wide loads. Below is the
-  // brief summary of how the heuristic behaves for various `LoadSize`.
+  // brief summary of how the heuristic behaves for various `LoadSize` when
+  // MaxMemoryClusterDWORDS is 8.
+  //
   // (1) 1 <= LoadSize <= 4: cluster at max 8 mem ops
   // (2) 5 <= LoadSize <= 8: cluster at max 4 mem ops
   // (3) 9 <= LoadSize <= 12: cluster at max 2 mem ops
@@ -578,7 +585,7 @@ bool SIInstrInfo::shouldClusterMemOps(ArrayRef<const MachineOperand *> BaseOps1,
   // (5) LoadSize >= 17: do not cluster
   const unsigned LoadSize = NumBytes / ClusterSize;
   const unsigned NumDWORDs = ((LoadSize + 3) / 4) * ClusterSize;
-  return NumDWORDs <= 8;
+  return NumDWORDs <= MaxMemoryClusterDWORDS;
 }
 
 // FIXME: This behaves strangely. If, for example, you have 32 load + stores,

@ruiling ruiling requested review from jayfoad and arsenm December 10, 2024 08:24
Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function attribute

@ruiling
Copy link
Contributor Author

ruiling commented Dec 12, 2024

function attribute

Thanks! What's the suggested way to access that value in function attribute from the SIInstrInfo? Maybe keep a record of that value in MachineFunctionInfo and get from it?

@arsenm
Copy link
Contributor

arsenm commented Dec 12, 2024

function attribute

Thanks! What's the suggested way to access that value in function attribute from the SIInstrInfo? Maybe keep a record of that value in MachineFunctionInfo and get from it?

That's just a matter of caching it in MFI or whether read directly from the IR attribute. In either case, the MachineFunction needs to be passed into the hook

@ruiling
Copy link
Contributor Author

ruiling commented Dec 17, 2024

function attribute

Thanks! What's the suggested way to access that value in function attribute from the SIInstrInfo? Maybe keep a record of that value in MachineFunctionInfo and get from it?

That's just a matter of caching it in MFI or whether read directly from the IR attribute. In either case, the MachineFunction needs to be passed into the hook

Added a function attribute for it, do we want to keep the command line option for it?

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the mir test changes parse the value?

Comment on lines 566 to 567
if (MFI->getMaxMemoryClusterDWords())
MaxMemoryClusterDWords = MFI->getMaxMemoryClusterDWords();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should just directly take MFI->getMaxMemoryClusterDWords()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's about maintaining the default value 8. Without MFI, shouldClusterMemOps() should still have a copy of the default value 8, right? If we directly take the MFI->getMaxMemoryClusterDWords(), that would mean we need to keep the default value in both MFI and within shouldClusterMemOps(). Do we want that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have fixed this by adding a default limit value in SIInstrInfo.h, which could be accessible from both places.

@ruiling
Copy link
Contributor Author

ruiling commented Dec 17, 2024

None of the mir test changes parse the value?

Do you mean parsing the value from function attribute? I will add one.

@ruiling
Copy link
Contributor Author

ruiling commented Dec 17, 2024

None of the mir test changes parse the value?

Do you mean parsing the value from function attribute? I will add one.
I just added a small test to parse the value from MFI.

@ruiling ruiling merged commit 67c55b1 into llvm:main Dec 18, 2024
8 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Dec 18, 2024

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/10539

Here is the relevant piece of the build log for the reference
Step 7 (Add check check-offload) failure: 1200 seconds without output running [b'ninja', b'-j 32', b'check-offload'], attempting to kill
...
PASS: libomptarget :: x86_64-unknown-linux-gnu-LTO :: offloading/test_libc.cpp (974 of 987)
PASS: libomptarget :: x86_64-unknown-linux-gnu-LTO :: offloading/bug49779.cpp (975 of 987)
PASS: libomptarget :: x86_64-unknown-linux-gnu-LTO :: offloading/bug50022.cpp (976 of 987)
PASS: libomptarget :: x86_64-unknown-linux-gnu-LTO :: offloading/bug53727.cpp (977 of 987)
PASS: libomptarget :: x86_64-unknown-linux-gnu-LTO :: offloading/wtime.c (978 of 987)
PASS: libomptarget :: x86_64-unknown-linux-gnu :: offloading/bug49021.cpp (979 of 987)
PASS: libomptarget :: x86_64-unknown-linux-gnu :: offloading/std_complex_arithmetic.cpp (980 of 987)
PASS: libomptarget :: x86_64-unknown-linux-gnu-LTO :: offloading/complex_reduction.cpp (981 of 987)
PASS: libomptarget :: x86_64-unknown-linux-gnu-LTO :: offloading/bug49021.cpp (982 of 987)
PASS: libomptarget :: x86_64-unknown-linux-gnu-LTO :: offloading/std_complex_arithmetic.cpp (983 of 987)
command timed out: 1200 seconds without output running [b'ninja', b'-j 32', b'check-offload'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1237.294405

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants