[InstCombine] Allow load to store forwarding for scalable structs #123908
Conversation
This attempts to fix a regression caused when scalable types started to be emitted as structs. A __builtin_bit_cast will create a load/store pair that we currently cannot break up. This patch allows load-store forwarding in InstCombine to split up the structs if it can detect that each part is bitcast-able between the source and destination types. https://alive2.llvm.org/ce/z/ewWXqQ

Non-scalable structs are decomposed through unpackStoreToAggregate before they get here. GEPs with scalable offsets are not valid, so they would require i8 GEPs with vscale offsets that do not easily get cleaned up into bitcasts.
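To make the regression concrete, here is a minimal sketch of the kind of source that produces such a load/store pair. It assumes an AArch64 target with SVE and `<arm_sve.h>`; the function name is made up, and the exact IR Clang emits may differ from what the comments describe.

```cpp
// Hypothetical reproducer: __builtin_bit_cast between two same-sized SVE
// tuple types. Where these are lowered as structs of scalable vectors, the
// bit_cast becomes a store of {<vscale x 4 x i32>, <vscale x 4 x i32>}
// followed by a load of {<vscale x 16 x i8>, <vscale x 16 x i8>} from the
// same stack slot, which is the pair this patch lets InstCombine forward
// element-wise.
#include <arm_sve.h>

svuint8x2_t bitcast_tuple(svuint32x2_t x) {
  return __builtin_bit_cast(svuint8x2_t, x);
}
```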
@llvm/pr-subscribers-llvm-analysis

Author: David Green (davemgreen)

Changes: This attempts to fix a regression caused when scalable types started to be emitted as structs. A __builtin_bit_cast will create a load/store pair that we currently cannot break up. This patch allows load-store forwarding in InstCombine to split up the structs if it can detect that each part is bitcast-able between the source and destination types. https://alive2.llvm.org/ce/z/ewWXqQ

Non-scalable structs are decomposed through unpackStoreToAggregate before they get here. GEPs with scalable offsets are not valid, so they would require i8 GEPs with vscale offsets that do not easily get cleaned up into bitcasts.

Full diff: https://github.com/llvm/llvm-project/pull/123908.diff

4 Files Affected:
diff --git a/llvm/include/llvm/Analysis/Loads.h b/llvm/include/llvm/Analysis/Loads.h
index 639070c07897b0..0ddd0098615ab9 100644
--- a/llvm/include/llvm/Analysis/Loads.h
+++ b/llvm/include/llvm/Analysis/Loads.h
@@ -156,7 +156,8 @@ Value *FindAvailableLoadedValue(LoadInst *Load, BasicBlock *ScanBB,
/// This overload cannot be used to scan across multiple blocks.
Value *FindAvailableLoadedValue(LoadInst *Load, BatchAAResults &AA,
bool *IsLoadCSE,
- unsigned MaxInstsToScan = DefMaxInstsToScan);
+ unsigned MaxInstsToScan = DefMaxInstsToScan,
+ bool AllowPartwiseBitcastStructs = false);
/// Scan backwards to see if we have the value of the given pointer available
/// locally within a small number of instructions.
diff --git a/llvm/lib/Analysis/Loads.cpp b/llvm/lib/Analysis/Loads.cpp
index 11ccfa33821cad..c72000430ec4e3 100644
--- a/llvm/lib/Analysis/Loads.cpp
+++ b/llvm/lib/Analysis/Loads.cpp
@@ -531,7 +531,8 @@ static bool areNonOverlapSameBaseLoadAndStore(const Value *LoadPtr,
static Value *getAvailableLoadStore(Instruction *Inst, const Value *Ptr,
Type *AccessTy, bool AtLeastAtomic,
- const DataLayout &DL, bool *IsLoadCSE) {
+ const DataLayout &DL, bool *IsLoadCSE,
+ bool AllowPartwiseBitcastStructs = false) {
// If this is a load of Ptr, the loaded value is available.
// (This is true even if the load is volatile or atomic, although
// those cases are unlikely.)
@@ -572,6 +573,19 @@ static Value *getAvailableLoadStore(Instruction *Inst, const Value *Ptr,
if (CastInst::isBitOrNoopPointerCastable(Val->getType(), AccessTy, DL))
return Val;
+ if (AllowPartwiseBitcastStructs) {
+ if (StructType *SrcStructTy = dyn_cast<StructType>(Val->getType())) {
+ if (StructType *DestStructTy = dyn_cast<StructType>(AccessTy)) {
+ if (SrcStructTy->getNumElements() == DestStructTy->getNumElements() &&
+ all_of_zip(SrcStructTy->elements(), DestStructTy->elements(),
+ [](Type *T1, Type *T2) {
+ return CastInst::isBitCastable(T1, T2);
+ }))
+ return Val;
+ }
+ }
+ }
+
TypeSize StoreSize = DL.getTypeSizeInBits(Val->getType());
TypeSize LoadSize = DL.getTypeSizeInBits(AccessTy);
if (TypeSize::isKnownLE(LoadSize, StoreSize))
@@ -704,8 +718,8 @@ Value *llvm::findAvailablePtrLoadStore(
}
Value *llvm::FindAvailableLoadedValue(LoadInst *Load, BatchAAResults &AA,
- bool *IsLoadCSE,
- unsigned MaxInstsToScan) {
+ bool *IsLoadCSE, unsigned MaxInstsToScan,
+ bool AllowPartwiseBitcastStructs) {
const DataLayout &DL = Load->getDataLayout();
Value *StrippedPtr = Load->getPointerOperand()->stripPointerCasts();
BasicBlock *ScanBB = Load->getParent();
@@ -727,8 +741,9 @@ Value *llvm::FindAvailableLoadedValue(LoadInst *Load, BatchAAResults &AA,
if (MaxInstsToScan-- == 0)
return nullptr;
- Available = getAvailableLoadStore(&Inst, StrippedPtr, AccessTy,
- AtLeastAtomic, DL, IsLoadCSE);
+ Available =
+ getAvailableLoadStore(&Inst, StrippedPtr, AccessTy, AtLeastAtomic, DL,
+ IsLoadCSE, AllowPartwiseBitcastStructs);
if (Available)
break;
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp b/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
index 61f1c17592e966..833feefcbde3c6 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
@@ -1010,10 +1010,24 @@ Instruction *InstCombinerImpl::visitLoadInst(LoadInst &LI) {
// separated by a few arithmetic operations.
bool IsLoadCSE = false;
BatchAAResults BatchAA(*AA);
- if (Value *AvailableVal = FindAvailableLoadedValue(&LI, BatchAA, &IsLoadCSE)) {
+ if (Value *AvailableVal =
+ FindAvailableLoadedValue(&LI, BatchAA, &IsLoadCSE, DefMaxInstsToScan,
+ /*AllowPartwiseBitcastStructs=*/true)) {
if (IsLoadCSE)
combineMetadataForCSE(cast<LoadInst>(AvailableVal), &LI, false);
+ if (AvailableVal->getType() != LI.getType() &&
+ isa<StructType>(LI.getType())) {
+ StructType *DstST = cast<StructType>(LI.getType());
+ Value *R = PoisonValue::get(LI.getType());
+ for (unsigned I = 0, E = DstST->getNumElements(); I < E; I++) {
+ Value *Ext = Builder.CreateExtractValue(AvailableVal, I);
+ Value *BC =
+ Builder.CreateBitOrPointerCast(Ext, DstST->getElementType(I));
+ R = Builder.CreateInsertValue(R, BC, I);
+ }
+ return replaceInstUsesWith(LI, R);
+ }
return replaceInstUsesWith(
LI, Builder.CreateBitOrPointerCast(AvailableVal, LI.getType(),
LI.getName() + ".cast"));
diff --git a/llvm/test/Transforms/InstCombine/availableloadstruct.ll b/llvm/test/Transforms/InstCombine/availableloadstruct.ll
new file mode 100644
index 00000000000000..5413fc05970ad6
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/availableloadstruct.ll
@@ -0,0 +1,86 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -passes=instcombine < %s | FileCheck %s
+
+target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
+
+define {<16 x i8>, <16 x i8>} @check_v16i8_v4i32({<4 x i32>, <4 x i32>} %x, ptr %p) nounwind {
+; CHECK-LABEL: define { <16 x i8>, <16 x i8> } @check_v16i8_v4i32(
+; CHECK-SAME: { <4 x i32>, <4 x i32> } [[X:%.*]], ptr [[P:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[X_ELT:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[X]], 0
+; CHECK-NEXT: store <4 x i32> [[X_ELT]], ptr [[P]], align 16
+; CHECK-NEXT: [[P_REPACK1:%.*]] = getelementptr inbounds nuw i8, ptr [[P]], i64 16
+; CHECK-NEXT: [[X_ELT2:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[X]], 1
+; CHECK-NEXT: store <4 x i32> [[X_ELT2]], ptr [[P_REPACK1]], align 16
+; CHECK-NEXT: [[R_UNPACK_CAST:%.*]] = bitcast <4 x i32> [[X_ELT]] to <16 x i8>
+; CHECK-NEXT: [[TMP0:%.*]] = insertvalue { <16 x i8>, <16 x i8> } poison, <16 x i8> [[R_UNPACK_CAST]], 0
+; CHECK-NEXT: [[R_UNPACK4_CAST:%.*]] = bitcast <4 x i32> [[X_ELT2]] to <16 x i8>
+; CHECK-NEXT: [[R5:%.*]] = insertvalue { <16 x i8>, <16 x i8> } [[TMP0]], <16 x i8> [[R_UNPACK4_CAST]], 1
+; CHECK-NEXT: ret { <16 x i8>, <16 x i8> } [[R5]]
+;
+entry:
+ store {<4 x i32>, <4 x i32>} %x, ptr %p
+ %r = load {<16 x i8>, <16 x i8>}, ptr %p
+ ret {<16 x i8>, <16 x i8>} %r
+}
+
+define {<vscale x 16 x i8>, <vscale x 16 x i8>} @check_nxv16i8_nxv4i32({<vscale x 4 x i32>, <vscale x 4 x i32>} %x, ptr %p) nounwind {
+; CHECK-LABEL: define { <vscale x 16 x i8>, <vscale x 16 x i8> } @check_nxv16i8_nxv4i32(
+; CHECK-SAME: { <vscale x 4 x i32>, <vscale x 4 x i32> } [[X:%.*]], ptr [[P:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: store { <vscale x 4 x i32>, <vscale x 4 x i32> } [[X]], ptr [[P]], align 16
+; CHECK-NEXT: [[TMP0:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[X]], 0
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast <vscale x 4 x i32> [[TMP0]] to <vscale x 16 x i8>
+; CHECK-NEXT: [[TMP2:%.*]] = insertvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } poison, <vscale x 16 x i8> [[TMP1]], 0
+; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[X]], 1
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <vscale x 4 x i32> [[TMP3]] to <vscale x 16 x i8>
+; CHECK-NEXT: [[R:%.*]] = insertvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[TMP2]], <vscale x 16 x i8> [[TMP4]], 1
+; CHECK-NEXT: ret { <vscale x 16 x i8>, <vscale x 16 x i8> } [[R]]
+;
+entry:
+ store {<vscale x 4 x i32>, <vscale x 4 x i32>} %x, ptr %p
+ %r = load {<vscale x 16 x i8>, <vscale x 16 x i8>}, ptr %p
+ ret {<vscale x 16 x i8>, <vscale x 16 x i8>} %r
+}
+
+define {<vscale x 16 x i8>, <vscale x 16 x i8>} @alloca_nxv16i8_nxv4i32({<vscale x 4 x i32>, <vscale x 4 x i32>} %x) nounwind {
+; CHECK-LABEL: define { <vscale x 16 x i8>, <vscale x 16 x i8> } @alloca_nxv16i8_nxv4i32(
+; CHECK-SAME: { <vscale x 4 x i32>, <vscale x 4 x i32> } [[X:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[X]], 0
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast <vscale x 4 x i32> [[TMP0]] to <vscale x 16 x i8>
+; CHECK-NEXT: [[TMP2:%.*]] = insertvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } poison, <vscale x 16 x i8> [[TMP1]], 0
+; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[X]], 1
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <vscale x 4 x i32> [[TMP3]] to <vscale x 16 x i8>
+; CHECK-NEXT: [[R:%.*]] = insertvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[TMP2]], <vscale x 16 x i8> [[TMP4]], 1
+; CHECK-NEXT: ret { <vscale x 16 x i8>, <vscale x 16 x i8> } [[R]]
+;
+entry:
+ %p = alloca {<vscale x 4 x i32>, <vscale x 4 x i32>}
+ store {<vscale x 4 x i32>, <vscale x 4 x i32>} %x, ptr %p
+ %r = load {<vscale x 16 x i8>, <vscale x 16 x i8>}, ptr %p
+ ret {<vscale x 16 x i8>, <vscale x 16 x i8>} %r
+}
+
+define { <16 x i8>, <32 x i8> } @differenttypes({ <4 x i32>, <8 x i32> } %a, ptr %p) {
+; CHECK-LABEL: define { <16 x i8>, <32 x i8> } @differenttypes(
+; CHECK-SAME: { <4 x i32>, <8 x i32> } [[A:%.*]], ptr [[P:%.*]]) {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: call void @llvm.lifetime.start.p0(i64 -1, ptr nonnull [[P]])
+; CHECK-NEXT: store { <4 x i32>, <8 x i32> } [[A]], ptr [[P]], align 16
+; CHECK-NEXT: [[TMP5:%.*]] = extractvalue { <4 x i32>, <8 x i32> } [[A]], 0
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast <4 x i32> [[TMP5]] to <16 x i8>
+; CHECK-NEXT: [[TMP2:%.*]] = insertvalue { <16 x i8>, <32 x i8> } poison, <16 x i8> [[TMP1]], 0
+; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { <4 x i32>, <8 x i32> } [[A]], 1
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast <8 x i32> [[TMP3]] to <32 x i8>
+; CHECK-NEXT: [[TMP0:%.*]] = insertvalue { <16 x i8>, <32 x i8> } [[TMP2]], <32 x i8> [[TMP4]], 1
+; CHECK-NEXT: call void @llvm.lifetime.end.p0(i64 -1, ptr nonnull [[P]])
+; CHECK-NEXT: ret { <16 x i8>, <32 x i8> } [[TMP0]]
+;
+entry:
+ call void @llvm.lifetime.start.p0(i64 -1, ptr nonnull %p) #5
+ store { <4 x i32>, <8 x i32> } %a, ptr %p, align 16
+ %2 = load { <16 x i8>, <32 x i8> }, ptr %p, align 16
+ call void @llvm.lifetime.end.p0(i64 -1, ptr nonnull %p) #5
+ ret { <16 x i8>, <32 x i8> } %2
+}
Non-scalable structs are decomposed through unpackStoreToAggregate before they get here. GEPs with scalable offsets are not valid, so they would require i8 GEPs with vscale offsets that do not easily get cleaned up into bitcasts.
Could you provide some more detail on what the problem with this approach is? I'd expect this to work (not for InstCombine alone, but in conjunction with GVN for example).
Generally we really shouldn't have aggregate load/stores in IR, it's a shame clang is emitting something like this.
          return Val;
      }
    }
  }
I think this may incorrectly handle the case where the struct members are bitcastable but have different offsets because of different member alignment?
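If that turns out to be a problem, one possible extra guard, not part of the posted patch, would be to also require matching element offsets according to the data layout. A rough sketch, with the helper name made up:

```cpp
// Hypothetical additional check: beyond element-wise bitcastability, require
// that corresponding struct members sit at the same offsets in both layouts,
// so differing member alignment or padding cannot shift which bytes a member
// occupies.
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DerivedTypes.h"

static bool elementOffsetsMatch(llvm::StructType *SrcTy,
                                llvm::StructType *DstTy,
                                const llvm::DataLayout &DL) {
  const llvm::StructLayout *SrcSL = DL.getStructLayout(SrcTy);
  const llvm::StructLayout *DstSL = DL.getStructLayout(DstTy);
  for (unsigned I = 0, E = SrcTy->getNumElements(); I != E; ++I)
    if (SrcSL->getElementOffset(I) != DstSL->getElementOffset(I))
      return false;
  return true;
}
```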
This is before and after InstCombine, and before GVN, with the patch I have that generated i8 GEPs for the scalable vectors: https://godbolt.org/z/hMnP111qM. GVN wasn't printing much debug info, but I think it might not be allowing scalable vectors here:
Yeah, I expect that's the problem. I think it would be a good idea to support scalable types in VNCoercion, at least for the simple case where the load and store have the same type size.
I can give that a try. It looks like it should be doable. I was thinking that SROA could be another alternative, but it would have to handle scalable types too, and as this is a regression since the last release I was hoping for something on the simpler side.
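For reference, the VNCoercion direction suggested above amounts to accepting forwarding when the stored and loaded types have provably equal sizes, including the scalable case. A rough sketch of that check follows; the helper name is made up and this is not taken from the actual VNCoercion code.

```cpp
// Hypothetical guard for the "simple case": the stored and loaded types have
// the same TypeSize, i.e. identical minimum size and identical scalability,
// so the stored bits can be reinterpreted without any offset arithmetic.
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Type.h"

static bool haveSameTypeSize(llvm::Type *StoredTy, llvm::Type *LoadTy,
                             const llvm::DataLayout &DL) {
  llvm::TypeSize StoreSize = DL.getTypeSizeInBits(StoredTy);
  llvm::TypeSize LoadSize = DL.getTypeSizeInBits(LoadTy);
  return StoreSize == LoadSize;
}
```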