[LLVM] Fix incorrect alignment on AMDGPU variadics #96370


Closed

Conversation

@jhuber6 (Contributor) commented Jun 22, 2024

Summary:
The variadics lowering for AMDGPU puts all the arguments into a void
pointer struct. The current logic dictates that the minimum alignment is
four regardless of what the underlying type is. This is incorrect in
the following case.

```c
void foo(int, ...);

void bar() {
  int x;
  void *p;
  foo(0, x, p);
}
```

Here, because the minimum alignment is 4, we will only increment the
buffer by 4, resulting in an incorrect alignment when we then try to
access the void pointer. We need to set a minimum of 4, but increase it
to 8 in cases like this.
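
To make the failure concrete, here is a small host-side illustration (a hypothetical sketch for a 64-bit target, not the pass's actual code) of the packed layout the current lowering produces versus the naturally aligned layout this patch aims for:

```c
#include <stddef.h>
#include <stdio.h>

// Packed frame under the blanket minimum alignment of 4: the 8-byte
// pointer lands at offset 4. (GCC/Clang attribute syntax.)
struct packed_frame {
  int x;
  void *p;
} __attribute__((packed));

// Naturally aligned frame: explicit padding pushes the pointer to offset 8.
struct padded_frame {
  int x;
  char pad[4];
  void *p;
};

int main(void) {
  printf("packed: p at offset %zu\n", offsetof(struct packed_frame, p)); // 4
  printf("padded: p at offset %zu\n", offsetof(struct padded_frame, p)); // 8
  return 0;
}
```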

@llvmbot added the clang, backend:AMDGPU, clang:codegen, and llvm:transforms labels on Jun 22, 2024
@llvmbot (Member) commented Jun 22, 2024

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-codegen

Author: Joseph Huber (jhuber6)

Changes

(Summary repeated from the PR description above.)

Patch is 56.64 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96370.diff

4 Files Affected:

  • (modified) clang/lib/CodeGen/Targets/AMDGPU.cpp (+3-8)
  • (modified) clang/test/CodeGen/amdgpu-variadic-call.c (+12-20)
  • (modified) llvm/lib/Transforms/IPO/ExpandVariadics.cpp (+5-1)
  • (modified) llvm/test/CodeGen/AMDGPU/expand-variadic-call.ll (+296-278)
```diff
diff --git a/clang/lib/CodeGen/Targets/AMDGPU.cpp b/clang/lib/CodeGen/Targets/AMDGPU.cpp
index 4d3275e17c386..a169a7d920456 100644
--- a/clang/lib/CodeGen/Targets/AMDGPU.cpp
+++ b/clang/lib/CodeGen/Targets/AMDGPU.cpp
@@ -121,7 +121,7 @@ void AMDGPUABIInfo::computeInfo(CGFunctionInfo &FI) const {
 RValue AMDGPUABIInfo::EmitVAArg(CodeGenFunction &CGF, Address VAListAddr,
                                 QualType Ty, AggValueSlot Slot) const {
   const bool IsIndirect = false;
-  const bool AllowHigherAlign = false;
+  const bool AllowHigherAlign = true;
   return emitVoidPtrVAArg(CGF, VAListAddr, Ty, IsIndirect,
                           getContext().getTypeInfoInChars(Ty),
                           CharUnits::fromQuantity(4), AllowHigherAlign, Slot);
@@ -212,13 +212,8 @@ ABIArgInfo AMDGPUABIInfo::classifyArgumentType(QualType Ty, bool Variadic,
 
   Ty = useFirstFieldIfTransparentUnion(Ty);
 
-  if (Variadic) {
-    return ABIArgInfo::getDirect(/*T=*/nullptr,
-                                 /*Offset=*/0,
-                                 /*Padding=*/nullptr,
-                                 /*CanBeFlattened=*/false,
-                                 /*Align=*/0);
-  }
+  if (Variadic)
+    return ABIArgInfo::getDirect();
 
   if (isAggregateTypeForABI(Ty)) {
     // Records with non-trivial destructors/copy-constructors should not be
diff --git a/clang/test/CodeGen/amdgpu-variadic-call.c b/clang/test/CodeGen/amdgpu-variadic-call.c
index 17eda215211a2..0529d6b3171c8 100644
--- a/clang/test/CodeGen/amdgpu-variadic-call.c
+++ b/clang/test/CodeGen/amdgpu-variadic-call.c
@@ -1,4 +1,3 @@
-// REQUIRES: amdgpu-registered-target
 // NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --function-signature
 // RUN: %clang_cc1 -cc1 -std=c23 -triple amdgcn-amd-amdhsa -emit-llvm -O1 %s -o - | FileCheck %s
 
@@ -179,11 +178,9 @@ typedef struct
 // CHECK-LABEL: define {{[^@]+}}@one_pair_f64
 // CHECK-SAME: (i32 noundef [[F0:%.*]], double noundef [[F1:%.*]], double [[V0_COERCE0:%.*]], double [[V0_COERCE1:%.*]]) local_unnamed_addr #[[ATTR0]] {
 // CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[DOTFCA_0_INSERT:%.*]] = insertvalue [[STRUCT_PAIR_F64:%.*]] poison, double [[V0_COERCE0]], 0
-// CHECK-NEXT:    [[DOTFCA_1_INSERT:%.*]] = insertvalue [[STRUCT_PAIR_F64]] [[DOTFCA_0_INSERT]], double [[V0_COERCE1]], 1
-// CHECK-NEXT:    tail call void (...) @sink_0([[STRUCT_PAIR_F64]] [[DOTFCA_1_INSERT]]) #[[ATTR2]]
-// CHECK-NEXT:    tail call void (i32, ...) @sink_1(i32 noundef [[F0]], [[STRUCT_PAIR_F64]] [[DOTFCA_1_INSERT]]) #[[ATTR2]]
-// CHECK-NEXT:    tail call void (double, i32, ...) @sink_2(double noundef [[F1]], i32 noundef [[F0]], [[STRUCT_PAIR_F64]] [[DOTFCA_1_INSERT]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (...) @sink_0(double [[V0_COERCE0]], double [[V0_COERCE1]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (i32, ...) @sink_1(i32 noundef [[F0]], double [[V0_COERCE0]], double [[V0_COERCE1]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (double, i32, ...) @sink_2(double noundef [[F1]], i32 noundef [[F0]], double [[V0_COERCE0]], double [[V0_COERCE1]]) #[[ATTR2]]
 // CHECK-NEXT:    ret void
 //
 void one_pair_f64(int f0, double f1, pair_f64 v0)
@@ -220,10 +217,9 @@ typedef union
 // CHECK-SAME: (i32 noundef [[F0:%.*]], double noundef [[F1:%.*]], i32 [[V0_COERCE:%.*]]) local_unnamed_addr #[[ATTR0]] {
 // CHECK-NEXT:  entry:
 // CHECK-NEXT:    [[TMP0:%.*]] = bitcast i32 [[V0_COERCE]] to float
-// CHECK-NEXT:    [[DOTFCA_0_INSERT:%.*]] = insertvalue [[UNION_UNION_F32_I32:%.*]] poison, float [[TMP0]], 0
-// CHECK-NEXT:    tail call void (...) @sink_0([[UNION_UNION_F32_I32]] [[DOTFCA_0_INSERT]]) #[[ATTR2]]
-// CHECK-NEXT:    tail call void (i32, ...) @sink_1(i32 noundef [[F0]], [[UNION_UNION_F32_I32]] [[DOTFCA_0_INSERT]]) #[[ATTR2]]
-// CHECK-NEXT:    tail call void (double, i32, ...) @sink_2(double noundef [[F1]], i32 noundef [[F0]], [[UNION_UNION_F32_I32]] [[DOTFCA_0_INSERT]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (...) @sink_0(float [[TMP0]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (i32, ...) @sink_1(i32 noundef [[F0]], float [[TMP0]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (double, i32, ...) @sink_2(double noundef [[F1]], i32 noundef [[F0]], float [[TMP0]]) #[[ATTR2]]
 // CHECK-NEXT:    ret void
 //
 void one_pair_union_f32_i32(int f0, double f1, union_f32_i32 v0)
@@ -242,10 +238,9 @@ typedef union
 // CHECK-LABEL: define {{[^@]+}}@one_pair_transparent_union_f32_i32
 // CHECK-SAME: (i32 noundef [[F0:%.*]], double noundef [[F1:%.*]], i32 [[V0_COERCE:%.*]]) local_unnamed_addr #[[ATTR0]] {
 // CHECK-NEXT:  entry:
-// CHECK-NEXT:    [[DOTFCA_0_INSERT:%.*]] = insertvalue [[UNION_TRANSPARENT_UNION_F32_I32:%.*]] poison, i32 [[V0_COERCE]], 0
-// CHECK-NEXT:    tail call void (...) @sink_0([[UNION_TRANSPARENT_UNION_F32_I32]] [[DOTFCA_0_INSERT]]) #[[ATTR2]]
-// CHECK-NEXT:    tail call void (i32, ...) @sink_1(i32 noundef [[F0]], [[UNION_TRANSPARENT_UNION_F32_I32]] [[DOTFCA_0_INSERT]]) #[[ATTR2]]
-// CHECK-NEXT:    tail call void (double, i32, ...) @sink_2(double noundef [[F1]], i32 noundef [[F0]], [[UNION_TRANSPARENT_UNION_F32_I32]] [[DOTFCA_0_INSERT]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (...) @sink_0(i32 [[V0_COERCE]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (i32, ...) @sink_1(i32 noundef [[F0]], i32 [[V0_COERCE]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (double, i32, ...) @sink_2(double noundef [[F1]], i32 noundef [[F0]], i32 [[V0_COERCE]]) #[[ATTR2]]
 // CHECK-NEXT:    ret void
 //
 void one_pair_transparent_union_f32_i32(int f0, double f1, transparent_union_f32_i32 v0)
@@ -277,12 +272,9 @@ void multiple_one(int f0, double f1, int v0, double v1)
 // CHECK-NEXT:  entry:
 // CHECK-NEXT:    [[TMP0:%.*]] = bitcast i32 [[V2_COERCE]] to float
 // CHECK-NEXT:    [[CONV:%.*]] = fpext float [[V1]] to double
-// CHECK-NEXT:    [[DOTFCA_0_INSERT16:%.*]] = insertvalue [[STRUCT_PAIR_F64:%.*]] poison, double [[V0_COERCE0]], 0
-// CHECK-NEXT:    [[DOTFCA_1_INSERT:%.*]] = insertvalue [[STRUCT_PAIR_F64]] [[DOTFCA_0_INSERT16]], double [[V0_COERCE1]], 1
-// CHECK-NEXT:    [[DOTFCA_0_INSERT:%.*]] = insertvalue [[UNION_UNION_F32_I32:%.*]] poison, float [[TMP0]], 0
-// CHECK-NEXT:    tail call void (...) @sink_0([[STRUCT_PAIR_F64]] [[DOTFCA_1_INSERT]], double noundef [[CONV]], [[UNION_UNION_F32_I32]] [[DOTFCA_0_INSERT]], i32 noundef [[V3]]) #[[ATTR2]]
-// CHECK-NEXT:    tail call void (i32, ...) @sink_1(i32 noundef [[F0]], [[STRUCT_PAIR_F64]] [[DOTFCA_1_INSERT]], double noundef [[CONV]], [[UNION_UNION_F32_I32]] [[DOTFCA_0_INSERT]], i32 noundef [[V3]]) #[[ATTR2]]
-// CHECK-NEXT:    tail call void (double, i32, ...) @sink_2(double noundef [[F1]], i32 noundef [[F0]], [[STRUCT_PAIR_F64]] [[DOTFCA_1_INSERT]], double noundef [[CONV]], [[UNION_UNION_F32_I32]] [[DOTFCA_0_INSERT]], i32 noundef [[V3]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (...) @sink_0(double [[V0_COERCE0]], double [[V0_COERCE1]], double noundef [[CONV]], float [[TMP0]], i32 noundef [[V3]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (i32, ...) @sink_1(i32 noundef [[F0]], double [[V0_COERCE0]], double [[V0_COERCE1]], double noundef [[CONV]], float [[TMP0]], i32 noundef [[V3]]) #[[ATTR2]]
+// CHECK-NEXT:    tail call void (double, i32, ...) @sink_2(double noundef [[F1]], i32 noundef [[F0]], double [[V0_COERCE0]], double [[V0_COERCE1]], double noundef [[CONV]], float [[TMP0]], i32 noundef [[V3]]) #[[ATTR2]]
 // CHECK-NEXT:    ret void
 //
 void multiple_two(int f0, double f1, pair_f64 v0, float v1, union_f32_i32 v2, int v3)
diff --git a/llvm/lib/Transforms/IPO/ExpandVariadics.cpp b/llvm/lib/Transforms/IPO/ExpandVariadics.cpp
index d340bc041ccda..489e13410d3b1 100644
--- a/llvm/lib/Transforms/IPO/ExpandVariadics.cpp
+++ b/llvm/lib/Transforms/IPO/ExpandVariadics.cpp
@@ -934,7 +934,11 @@ struct Amdgpu final : public VariadicABIInfo {
   }
 
   VAArgSlotInfo slotInfo(const DataLayout &DL, Type *Parameter) override {
-    return {Align(4), false};
+    const unsigned MinAlign = 1;
+    Align A = DL.getABITypeAlign(Parameter);
+    if (A < MinAlign)
+      A = Align(MinAlign);
+    return {A, false};
   }
 };
 
diff --git a/llvm/test/CodeGen/AMDGPU/expand-variadic-call.ll b/llvm/test/CodeGen/AMDGPU/expand-variadic-call.ll
index ce55558dabaf1..e7450707ac94a 100644
--- a/llvm/test/CodeGen/AMDGPU/expand-variadic-call.ll
+++ b/llvm/test/CodeGen/AMDGPU/expand-variadic-call.ll
@@ -1,50 +1,48 @@
-; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: -p --function-signature
-; RUN: opt -S --passes=expand-variadics --expand-variadics-override=lowering < %s | FileCheck %s
-; REQUIRES: amdgpu-registered-target
-target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
-target triple = "amdgcn-amd-amdhsa"
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -mtriple=amdgcn-- --passes=expand-variadics \
+; RUN:   --expand-variadics-override=lowering < %s | FileCheck %s
 
-; Check the variables are lowered to the locations this target expects
+%struct.libcS = type { i8, i16, i32, i64, float, double }
 
-; The types show the call frames
+; The types show the call frames.
 ; CHECK: %single_i32.vararg = type <{ i32 }>
 ; CHECK: %single_double.vararg = type <{ double }>
 ; CHECK: %single_v4f32.vararg = type <{ <4 x float> }>
 ; CHECK: %single_v8f32.vararg = type <{ <8 x float> }>
 ; CHECK: %single_v16f32.vararg = type <{ <16 x float> }>
 ; CHECK: %single_v32f32.vararg = type <{ <32 x float> }>
-; CHECK: %i32_double.vararg = type <{ i32, double }>
+; CHECK: %i32_double.vararg = type <{ i32, [4 x i8], double }>
 ; CHECK: %double_i32.vararg = type <{ double, i32 }>
-; CHECK: %i32_libcS.vararg = type <{ i32, %struct.libcS }>
+; CHECK: %i32_libcS.vararg = type <{ i32, [4 x i8], %struct.libcS }>
+; CHECK: %struct.libcS = type { i8, i16, i32, i64, float, double }
 ; CHECK: %libcS_i32.vararg = type <{ %struct.libcS, i32 }>
-; CHECK: %i32_v4f32.vararg = type <{ i32, <4 x float> }>
+; CHECK: %i32_v4f32.vararg = type <{ i32, [12 x i8], <4 x float> }>
 ; CHECK: %v4f32_i32.vararg = type <{ <4 x float>, i32 }>
-; CHECK: %i32_v8f32.vararg = type <{ i32, <8 x float> }>
+; CHECK: %i32_v8f32.vararg = type <{ i32, [28 x i8], <8 x float> }>
 ; CHECK: %v8f32_i32.vararg = type <{ <8 x float>, i32 }>
-; CHECK: %i32_v16f32.vararg = type <{ i32, <16 x float> }>
+; CHECK: %i32_v16f32.vararg = type <{ i32, [60 x i8], <16 x float> }>
 ; CHECK: %v16f32_i32.vararg = type <{ <16 x float>, i32 }>
-; CHECK: %i32_v32f32.vararg = type <{ i32, <32 x float> }>
+; CHECK: %i32_v32f32.vararg = type <{ i32, [124 x i8], <32 x float> }>
 ; CHECK: %v32f32_i32.vararg = type <{ <32 x float>, i32 }>
 ; CHECK: %fptr_single_i32.vararg = type <{ i32 }>
 ; CHECK: %fptr_libcS.vararg = type <{ %struct.libcS }>
 
-%struct.libcS = type { i8, i16, i32, i64, float, double }
-
 @vararg_ptr = hidden addrspace(1) global ptr @vararg, align 8
 
 define hidden void @copy(ptr noundef %va) {
-; CHECK-LABEL: define {{[^@]+}}@copy(ptr noundef %va) {
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    %va.addr = alloca ptr, align 8, addrspace(5)
-; CHECK-NEXT:    %cp = alloca ptr, align 8, addrspace(5)
-; CHECK-NEXT:    %va.addr.ascast = addrspacecast ptr addrspace(5) %va.addr to ptr
-; CHECK-NEXT:    %cp.ascast = addrspacecast ptr addrspace(5) %cp to ptr
-; CHECK-NEXT:    store ptr %va, ptr addrspace(5) %va.addr, align 8
-; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) %cp)
-; CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr %cp.ascast, ptr %va.addr.ascast, i32 8, i1 false)
-; CHECK-NEXT:    %0 = load ptr, ptr addrspace(5) %cp, align 8
-; CHECK-NEXT:    call void @valist(ptr noundef %0)
-; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) %cp)
+; CHECK-LABEL: define hidden void @copy(
+; CHECK-SAME: ptr noundef [[VA:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[VA_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+; CHECK-NEXT:    [[CP:%.*]] = alloca ptr, align 8, addrspace(5)
+; CHECK-NEXT:    [[VA_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[VA_ADDR]] to ptr
+; CHECK-NEXT:    [[CP_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[CP]] to ptr
+; CHECK-NEXT:    store ptr [[VA]], ptr addrspace(5) [[VA_ADDR]], align 8
+; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) [[CP]])
+; CHECK-NEXT:    call void @llvm.memcpy.p0.p0.i32(ptr [[CP_ASCAST]], ptr [[VA_ADDR_ASCAST]], i32 8, i1 false)
+; CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr addrspace(5) [[CP]], align 8
+; CHECK-NEXT:    call void @valist(ptr noundef [[TMP0]])
+; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) [[CP]])
 ; CHECK-NEXT:    ret void
 ;
 entry:
@@ -70,15 +68,16 @@ declare hidden void @valist(ptr noundef)
 declare void @llvm.lifetime.end.p5(i64 immarg, ptr addrspace(5) nocapture)
 
 define hidden void @start_once(...) {
-; CHECK-LABEL: define {{[^@]+}}@start_once(ptr %varargs) {
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    %s = alloca ptr, align 8, addrspace(5)
-; CHECK-NEXT:    %s.ascast = addrspacecast ptr addrspace(5) %s to ptr
-; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) %s)
-; CHECK-NEXT:    store ptr %varargs, ptr %s.ascast, align 8
-; CHECK-NEXT:    %0 = load ptr, ptr addrspace(5) %s, align 8
-; CHECK-NEXT:    call void @valist(ptr noundef %0)
-; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) %s)
+; CHECK-LABEL: define hidden void @start_once(
+; CHECK-SAME: ptr [[VARARGS:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[S:%.*]] = alloca ptr, align 8, addrspace(5)
+; CHECK-NEXT:    [[S_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[S]] to ptr
+; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) [[S]])
+; CHECK-NEXT:    store ptr [[VARARGS]], ptr [[S_ASCAST]], align 8
+; CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr addrspace(5) [[S]], align 8
+; CHECK-NEXT:    call void @valist(ptr noundef [[TMP0]])
+; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) [[S]])
 ; CHECK-NEXT:    ret void
 ;
 entry:
@@ -98,22 +97,23 @@ declare void @llvm.va_start.p0(ptr)
 declare void @llvm.va_end.p0(ptr)
 
 define hidden void @start_twice(...) {
-; CHECK-LABEL: define {{[^@]+}}@start_twice(ptr %varargs) {
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    %s0 = alloca ptr, align 8, addrspace(5)
-; CHECK-NEXT:    %s1 = alloca ptr, align 8, addrspace(5)
-; CHECK-NEXT:    %s0.ascast = addrspacecast ptr addrspace(5) %s0 to ptr
-; CHECK-NEXT:    %s1.ascast = addrspacecast ptr addrspace(5) %s1 to ptr
-; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) %s0)
-; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) %s1)
-; CHECK-NEXT:    store ptr %varargs, ptr %s0.ascast, align 8
-; CHECK-NEXT:    %0 = load ptr, ptr addrspace(5) %s0, align 8
-; CHECK-NEXT:    call void @valist(ptr noundef %0)
-; CHECK-NEXT:    store ptr %varargs, ptr %s1.ascast, align 8
-; CHECK-NEXT:    %1 = load ptr, ptr addrspace(5) %s1, align 8
-; CHECK-NEXT:    call void @valist(ptr noundef %1)
-; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) %s1)
-; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) %s0)
+; CHECK-LABEL: define hidden void @start_twice(
+; CHECK-SAME: ptr [[VARARGS:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[S0:%.*]] = alloca ptr, align 8, addrspace(5)
+; CHECK-NEXT:    [[S1:%.*]] = alloca ptr, align 8, addrspace(5)
+; CHECK-NEXT:    [[S0_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[S0]] to ptr
+; CHECK-NEXT:    [[S1_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[S1]] to ptr
+; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) [[S0]])
+; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) [[S1]])
+; CHECK-NEXT:    store ptr [[VARARGS]], ptr [[S0_ASCAST]], align 8
+; CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr addrspace(5) [[S0]], align 8
+; CHECK-NEXT:    call void @valist(ptr noundef [[TMP0]])
+; CHECK-NEXT:    store ptr [[VARARGS]], ptr [[S1_ASCAST]], align 8
+; CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr addrspace(5) [[S1]], align 8
+; CHECK-NEXT:    call void @valist(ptr noundef [[TMP1]])
+; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) [[S1]])
+; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) [[S0]])
 ; CHECK-NEXT:    ret void
 ;
 entry:
@@ -137,15 +137,16 @@ entry:
 }
 
 define hidden void @single_i32(i32 noundef %x) {
-; CHECK-LABEL: define {{[^@]+}}@single_i32(i32 noundef %x) {
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    %vararg_buffer = alloca %single_i32.vararg, align 4, addrspace(5)
-; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 4, ptr addrspace(5) %vararg_buffer)
-; CHECK-NEXT:    %0 = getelementptr inbounds %single_i32.vararg, ptr addrspace(5) %vararg_buffer, i32 0, i32 0
-; CHECK-NEXT:    store i32 %x, ptr addrspace(5) %0, align 4
-; CHECK-NEXT:    %1 = addrspacecast ptr addrspace(5) %vararg_buffer to ptr
-; CHECK-NEXT:    call void @vararg(ptr %1)
-; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 4, ptr addrspace(5) %vararg_buffer)
+; CHECK-LABEL: define hidden void @single_i32(
+; CHECK-SAME: i32 noundef [[X:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[VARARG_BUFFER:%.*]] = alloca [[SINGLE_I32_VARARG:%.*]], align 4, addrspace(5)
+; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 4, ptr addrspace(5) [[VARARG_BUFFER]])
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr inbounds [[SINGLE_I32_VARARG]], ptr addrspace(5) [[VARARG_BUFFER]], i32 0, i32 0
+; CHECK-NEXT:    store i32 [[X]], ptr addrspace(5) [[TMP0]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = addrspacecast ptr addrspace(5) [[VARARG_BUFFER]] to ptr
+; CHECK-NEXT:    call void @vararg(ptr [[TMP1]])
+; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 4, ptr addrspace(5) [[VARARG_BUFFER]])
 ; CHECK-NEXT:    ret void
 ;
 entry:
@@ -156,15 +157,16 @@ entry:
 declare hidden void @vararg(...)
 
 define hidden void @single_double(double noundef %x) {
-; CHECK-LABEL: define {{[^@]+}}@single_double(double noundef %x) {
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    %vararg_buffer = alloca %single_double.vararg, align 4, addrspace(5)
-; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) %vararg_buffer)
-; CHECK-NEXT:    %0 = getelementptr inbounds %single_double.vararg, ptr addrspace(5) %vararg_buffer, i32 0, i32 0
-; CHECK-NEXT:    store double %x, ptr addrspace(5) %0, align 8
-; CHECK-NEXT:    %1 = addrspacecast ptr addrspace(5) %vararg_buffer to ptr
-; CHECK-NEXT:    call void @vararg(ptr %1)
-; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) %vararg_buffer)
+; CHECK-LABEL: define hidden void @single_double(
+; CHECK-SAME: double noundef [[X:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[VARARG_BUFFER:%.*]] = alloca [[SINGLE_DOUBLE_VARARG:%.*]], align 8, addrspace(5)
+; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 8, ptr addrspace(5) [[VARARG_BUFFER]])
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr inbounds [[SINGLE_DOUBLE_VARARG]], ptr addrspace(5) [[VARARG_BUFFER]], i32 0, i32 0
+; CHECK-NEXT:    store double [[X]], ptr addrspace(5) [[TMP0]], align 8
+; CHECK-NEXT:    [[TMP1:%.*]] = addrspacecast ptr addrspace(5) [[VARARG_BUFFER]] to ptr
+; CHECK-NEXT:    call void @vararg(ptr [[TMP1]])
+; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 8, ptr addrspace(5) [[VARARG_BUFFER]])
 ; CHECK-NEXT:    ret void
 ;
 entry:
@@ -173,15 +175,16 @@ entry:
 }
 
 define hidden void @single_v4f32(<4 x float> noundef %x) {
-; CHECK-LABEL: define {{[^@]+}}@single_v4f32(<4 x float> noundef %x) {
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    %vararg_buffer = alloca %single_v4f32.vararg, align 4, addrspace(5)
-; CHECK-NEXT:    call void @llvm.lifetime.start.p5(i64 16, ptr addrspace(5) %vararg_buffer)
-; CHECK-NEXT:    %0 = getelementptr inbounds %single_v4f32.vararg, ptr addrspace(5) %vararg_buffer, i32 0, i32 0
-; CHECK-NEXT:    store <4 x float> %x, ptr addrspace(5) %0, align 16
-; CHECK-NEXT:    %1 = addrspacecast ptr addrspace(5) %vararg_buffer to ptr
-; CHECK-NEXT:    call void @vararg(ptr %1)
-; CHECK-NEXT:    call void @llvm.lifetime.end.p5(i64 16, ptr addr...
```
[truncated]

@arsenm (Contributor) commented Jun 22, 2024

> Here, because the minimum alignment is 4, we will only increment the buffer by 4,

It should be incrementing by the size? A 4-byte-aligned access of an 8-byte type should work fine.

@jhuber6 (Contributor, Author) commented Jun 22, 2024

> > Here, because the minimum alignment is 4, we will only increment the buffer by 4,
>
> It should be incrementing by the size? A 4-byte-aligned access of an 8-byte type should work fine.

I guess that's an AMD thing, so I'm going to assume @JonChesterfield wrote this intentionally to save stack space? The issue I'm having with my printf implementation is that we then want to copy this struct, and because it doesn't follow natural alignment, the reader has no common-sense way to know where the values are stored. I could change the code to just do `ptr += sizeof(T)` instead of aligning, but some architectures require strict alignment for these accesses, so that wouldn't work in the general case.

@arsenm (Contributor) commented Jun 22, 2024

Incrementing by the alignment is just a bug; of course the size is the real value. Whether we want to continue wasting space is a separate, non-correctness discussion.

@jhuber6 (Contributor, Author) commented Jun 22, 2024

> Incrementing by the alignment is just a bug; of course the size is the real value. Whether we want to continue wasting space is a separate, non-correctness discussion.

Struct padding is pretty universal; AMDGPU seems to be the odd one out here. I wouldn't mind it so much if it didn't require me to know which vendor I'm dealing with in the RPC implementation, but I suppose I could store that information somewhere if we want to use a compressed option and we know it works.

@arsenm (Contributor) commented Jun 25, 2024

> Struct padding is pretty universal; AMDGPU seems to be the odd one out here. I wouldn't mind it so much if it didn't require me to know which vendor I'm dealing with in the RPC implementation, but I suppose I could store that information somewhere if we want to use a compressed option and we know it works.

It's not about struct padding but about the base alignment. Any pointer increment should be `alignTo(ptr + size, align)`, not `+= align`. The `+= align` version won't even work for large structs.
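
A minimal C sketch of an increment in that spirit (hypothetical helper names; `align_to` stands in for LLVM's `llvm::alignTo`): round the cursor up to the argument's alignment before the access, then step past the argument's size, never by the alignment alone.

```c
#include <stddef.h>
#include <stdint.h>

// Round p up to the next multiple of align (align must be a power of two).
static inline uintptr_t align_to(uintptr_t p, uintptr_t align) {
  return (p + align - 1) & ~(align - 1);
}

// Advance a va_arg-style cursor past one argument: align the cursor for the
// argument, record the argument's address, then advance by its size.
static inline char *next_arg(char *cursor, size_t size, size_t align,
                             void **arg_out) {
  uintptr_t p = align_to((uintptr_t)cursor, align);
  *arg_out = (void *)p;       // where the argument actually lives
  return (char *)(p + size);  // step by size, not by align
}
```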

@jhuber6 (Contributor, Author) commented Jun 25, 2024

> It's not about struct padding but about the base alignment. Any pointer increment should be `alignTo(ptr + size, align)`, not `+= align`. The `+= align` version won't even work for large structs.

Hm, that's what I'm doing in the printf implementation, and it doesn't work without this patch. When I looked at the varargs struct it didn't have any padding, which explains why `alignTo(ptr + size, align)` was wrong. I was testing `printf("%d%ld", 1, 1l)`. With this patch I get the following:

```
0xbebebebe00000001
0x0000000000000001
```

Without this patch, I get the dump below. As you can see, there's no struct padding, so the 8-byte value sits right next to the 4-byte one:

```
0x0000000100000001
0xbebebebe00000000
```
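
For reference, here is a hedged reconstruction on a little-endian 64-bit host of how the unpadded frame produces that second dump (`0xbe` stands in for bytes the frame never writes):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Rebuild the packed (unpadded) frame for printf("%d%ld", 1, 1l):
// the i64 starts at offset 4, directly after the i32.
int main(void) {
  unsigned char buf[16];
  memset(buf, 0xbe, sizeof(buf)); // mark bytes the frame never writes
  int32_t x = 1;
  int64_t y = 1;
  memcpy(buf + 0, &x, sizeof(x)); // i32 slot at offset 0
  memcpy(buf + 4, &y, sizeof(y)); // i64 packed at offset 4, no padding
  uint64_t w0, w1;
  memcpy(&w0, buf + 0, 8);
  memcpy(&w1, buf + 8, 8);
  printf("0x%016" PRIx64 "\n0x%016" PRIx64 "\n", w0, w1); // matches the dump
  return 0;
}
```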

@arsenm (Contributor) commented Jun 25, 2024

> Hm, that's what I'm doing in the printf implementation, and it doesn't work without this patch. When I looked at the varargs struct it didn't have any padding, which explains why `alignTo(ptr + size, align)` was wrong. I was testing `printf("%d%ld", 1, 1l)`.

For what IR? Is the small struct getting expanded into individual scalar pieces?

@jhuber6 (Contributor, Author) commented Jun 25, 2024

> For what IR? Is the small struct getting expanded into individual scalar pieces?

The implementation intentionally states that the alignment is always 4 in this case, which results in there being no padding between the four-byte and eight-byte values when they are put into a `void *` buffer for varargs. You should be able to see it in the changes to the tests. Unfortunately, @JonChesterfield is on vacation, so I can't ask about this; it seems to be a deliberate choice, but I don't know how it's correct.

@JonChesterfield (Collaborator) commented

This PR is invalid.

First, the alignment on the eight-byte pointer is supposed to be four; increasing it to 8 makes things worse.

Second, I can't see any support for the claim that the code is incrementing by the alignment of the value, as opposed to the size.

The frame is set up as a struct instance with explicit padding built from Int8Tys, and the calculation there is correct.

The va_arg increment is done in CodeGen's emitVoidPtrVAArg, where DirectSize is ValueInfo.Width, aligned to the 4-byte slot size, then stored. It does not increment the iterator by the alignment of the type.

The lowering pass is doing exactly what was intended.
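
Put as pseudo-C (a hedged sketch of the behavior described above, not the actual emitVoidPtrVAArg code), the cursor steps by the value's size rounded up to the 4-byte slot, regardless of the value's natural alignment:

```c
#include <stddef.h>

// Slot-based va_arg step: advance by the argument's size rounded up to the
// 4-byte slot, not by the argument's natural alignment.
static void *va_arg_slot(char **cursor, size_t size) {
  const size_t slot = 4;
  size_t step = (size + slot - 1) & ~(slot - 1); // alignTo(size, 4)
  void *arg = *cursor;
  *cursor += step;
  return arg;
}
```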

@JonChesterfield (Collaborator) left a review comment


Patch should not land. We need to know what bug this was trying to address before we can guess at the right fix.

@jhuber6 (Contributor, Author) commented Jul 1, 2024

> Patch should not land. We need to know what bug this was trying to address before we can guess at the right fix.

My understanding was that the variadics were lowered to a struct with a minimum alignment of four. The current lowering doesn't do that, hence my confusion: it provides no padding, which I now see is a deliberate choice, presumably to save stack space. The issue is that I laid out my printf code assuming the buffer was a naturally aligned struct, so it won't work when copied to another target via RPC because the alignment isn't what the reader expects.

@JonChesterfield (Collaborator) commented

Ah yes, libc code doing the equivalent of va_arg while assuming natural alignment would not work when the underlying buffer is a packed struct with fields padded to four bytes. That could be "fixed" by changing the compiler to match the assumption libc makes, but it seems much better for libc to do the misaligned load instead; then no wavesize * align-padding stack space is burned at runtime.
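
A hedged sketch of that misaligned read on the libc side: memcpy keeps the access well-defined even on targets where dereferencing an under-aligned pointer would fault or be undefined behavior.

```c
#include <stdint.h>
#include <string.h>

// Read an 8-byte value from a cursor that is only guaranteed 4-byte
// alignment, then advance past it.
static uint64_t read_u64_unaligned(const char **cursor) {
  uint64_t v;
  memcpy(&v, *cursor, sizeof(v));
  *cursor += sizeof(v);
  return v;
}
```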

@jhuber6 (Contributor, Author) commented Jul 1, 2024

Lower-than-native alignment is legal on AMDGPU hardware, and it's possible to work around this in the printf implementation, so I'm closing.

@jhuber6 closed this on Jul 1, 2024