
[transforms] Inline simple variadic functions #81058


Conversation

JonChesterfield
Collaborator

@JonChesterfield JonChesterfield commented Feb 7, 2024

This is an MVP-style commit for eliminating variadic functions as an IR pass. It replaces the ... with a va_list value.

The implementation strategy is to divide the transform into orthogonal parts that can be tested independently. In particular, this seeks to decouple target-specific complexity from LLVM modelling complexity. The original motivation was to provide variadic function support for amdgpu and nvptx. That's tested and working in the downstream branch from which this patch was extracted.

This commit introduces the target-specific part of the transform, implemented for 32- and 64-bit x86. It rewrites very simple variadic calls into calls to an equivalent function taking a va_list. Test coverage of more interesting types from clang requires either pull/80002 or complicated FileCheck patterns.

Subsequent patches are needed to:

  1. rewrite variadic functions that don't happen to already match this pattern
  2. add more targets - aarch64, amdgpu, nvptx are planned at present
  3. call the pass from codegen to expand all variadic functions for GPU targets
  4. close the holes in test coverage (and implementation)
  5. add this transform to one of the optimisation pipelines, maybe O1
  6. maybe expand va_arg instructions (for the GPUs, or maybe from alternative front ends)

This patch is known to be incomplete. It is posted in this form to check the strategy and provide a mostly-correct baseline to extend. Initial discussion is at https://discourse.llvm.org/t/rfc-desugar-variadics-codegen-for-new-targets-optimisation-for-existing/72854

The end result is that calls to known variadic functions are replaced with calls to va_list equivalents and thus exposed to inlining etc. The clang test checks that the variadic functions are eliminated entirely by this pass plus existing transforms.

On nvptx, this lowering is compatible with the PTX documentation.
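For orientation, a minimal C++-level sketch of the pattern involved, using invented names (log_impl, log_wrapper, caller) rather than anything from the patch; the pass itself works on LLVM IR, and the fprintf/vfprintf case in variadic-wrapper-removal.c below is the concrete analogue:

#include <cstdarg>
#include <cstdio>

// The va_list equivalent of the wrapper below.
int log_impl(const char *Fmt, va_list Ap) { return std::vfprintf(stderr, Fmt, Ap); }

// A "sufficiently simple" variadic forwarder: va_start, forward, va_end.
int log_wrapper(const char *Fmt, ...) {
  va_list Ap;
  va_start(Ap, Fmt);
  int R = log_impl(Fmt, Ap);
  va_end(Ap);
  return R;
}

int caller(int X) {
  // Before the pass: an opaque call through the variadic wrapper.
  // After the pass (conceptually): the call site packs X into a
  // va_list-compatible frame and calls log_impl directly, leaving log_wrapper
  // dead so that ordinary inlining/DCE can finish the job.
  return log_wrapper("%d\n", X);
}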

@llvmbot
Member

llvmbot commented Feb 7, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-x86

Author: Jon Chesterfield (JonChesterfield)

Changes

This is an MVP-style commit for eliminating variadic functions as an IR pass. It replaces the ... with a va_list value.

The implementation strategy is to divide the transform into orthogonal parts that can be tested independently. In particular, this seeks to decouple target-specific complexity from LLVM modelling complexity. The original motivation was to provide variadic function support for amdgpu and nvptx. That's tested and working in the downstream branch from which this patch was extracted.

This commit introduces the target-specific part of the transform, implemented for 32- and 64-bit x86. It rewrites very simple variadic calls into calls to an equivalent function taking a va_list. Test coverage of more interesting types from clang requires either pull/80002 or complicated FileCheck patterns.

Subsequent patches are needed to:

  1. rewrite variadic functions that don't happen to already match this pattern
  2. add more targets - aarch64, amdgpu, nvptx are planned at present
  3. call the pass from codegen to expand all variadic functions for GPU targets
  4. close the holes in test coverage (and implementation)
  5. add this transform to one of the optimisation pipelines, maybe O1
  6. maybe expand va_arg instructions (for the GPUs, or maybe from alternative front ends)

This patch is known to be incomplete. It is posted in this form to check the strategy and provide a mostly-correct baseline to extend. Initial discussion is at https://discourse.llvm.org/t/rfc-desugar-variadics-codegen-for-new-targets-optimisation-for-existing/72854


Patch is 176.90 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/81058.diff

16 Files Affected:

  • (added) clang/test/CodeGen/expand-variadic-call.c (+273)
  • (added) clang/test/CodeGen/variadic-wrapper-removal.c (+86)
  • (added) clang/test/CodeGenCXX/inline-then-fold-variadics.cpp (+117)
  • (modified) llvm/include/llvm/CodeGen/Passes.h (+4)
  • (modified) llvm/include/llvm/InitializePasses.h (+1)
  • (added) llvm/include/llvm/Transforms/IPO/ExpandVariadics.h (+17)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassRegistry.def (+1)
  • (modified) llvm/lib/Transforms/IPO/CMakeLists.txt (+1)
  • (added) llvm/lib/Transforms/IPO/ExpandVariadics.cpp (+723)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-i386-darwin.ll (+385)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-i386-linux.ll (+385)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-i686-msvc.ll (+402)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-x64-darwin.ll (+589)
  • (added) llvm/test/CodeGen/X86/expand-variadic-call-x64-linux.ll (+589)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Transforms/IPO/BUILD.gn (+1)
diff --git a/clang/test/CodeGen/expand-variadic-call.c b/clang/test/CodeGen/expand-variadic-call.c
new file mode 100644
index 00000000000000..fa2b984bec08a5
--- /dev/null
+++ b/clang/test/CodeGen/expand-variadic-call.c
@@ -0,0 +1,273 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
+
+// REQUIRES: x86-registered-target
+// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -target-cpu x86-64-v4 -std=c23 -O1 -ffreestanding -emit-llvm -o - %s | FileCheck %s
+
+// This test sanity checks calling a variadic function with the expansion transform disabled.
+// The IR test cases {arch}/expand-variadic-call-*.ll correspond to IR generated from this test case.
+
+typedef __builtin_va_list va_list;
+#define va_copy(dest, src) __builtin_va_copy(dest, src)
+#define va_start(ap, ...) __builtin_va_start(ap, 0)
+#define va_end(ap) __builtin_va_end(ap)
+#define va_arg(ap, type) __builtin_va_arg(ap, type)
+
+// 32 bit x86 alignment uses getTypeStackAlign for special cases
+// Whitebox testing.
+// Needs a type >= 16 which is either a simd or a struct containing a simd
+// darwinvectorabi should force 4 bytes
+// linux vectors with align 16/32/64 return that alignment
+
+
+void wrapped(va_list);
+
+// CHECK-LABEL: @codegen_for_copy(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[CP:%.*]] = alloca [1 x %struct.__va_list_tag], align 16
+// CHECK-NEXT:    call void @llvm.lifetime.start.p0(i64 24, ptr nonnull [[CP]]) #[[ATTR7:[0-9]+]]
+// CHECK-NEXT:    call void @llvm.va_copy(ptr nonnull [[CP]], ptr [[X:%.*]])
+// CHECK-NEXT:    call void @wrapped(ptr noundef nonnull [[CP]]) #[[ATTR8:[0-9]+]]
+// CHECK-NEXT:    call void @llvm.va_end(ptr [[CP]])
+// CHECK-NEXT:    call void @llvm.lifetime.end.p0(i64 24, ptr nonnull [[CP]]) #[[ATTR7]]
+// CHECK-NEXT:    ret void
+//
+void codegen_for_copy(va_list x)
+{
+  va_list cp;
+  va_copy(cp, x);
+  wrapped(cp);
+  va_end(cp);
+}
+
+
+// CHECK-LABEL: @vararg(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[VA:%.*]] = alloca [1 x %struct.__va_list_tag], align 16
+// CHECK-NEXT:    call void @llvm.lifetime.start.p0(i64 24, ptr nonnull [[VA]]) #[[ATTR7]]
+// CHECK-NEXT:    call void @llvm.va_start(ptr nonnull [[VA]])
+// CHECK-NEXT:    call void @wrapped(ptr noundef nonnull [[VA]]) #[[ATTR8]]
+// CHECK-NEXT:    call void @llvm.va_end(ptr [[VA]])
+// CHECK-NEXT:    call void @llvm.lifetime.end.p0(i64 24, ptr nonnull [[VA]]) #[[ATTR7]]
+// CHECK-NEXT:    ret void
+//
+ void vararg(...)
+{
+  va_list va;
+  __builtin_va_start(va, 0);
+  wrapped(va);
+  va_end(va);
+}
+
+// vectors with alignment 16/32/64 are natively aligned on linux x86
+// v32f32 would be a m1024 type, larger than x64 defines at time of writing
+typedef int i32;
+typedef float v4f32 __attribute__((__vector_size__(16), __aligned__(16)));
+typedef float v8f32 __attribute__((__vector_size__(32), __aligned__(32)));
+typedef float v16f32 __attribute__((__vector_size__(64), __aligned__(64)));
+typedef float v32f32 __attribute__((__vector_size__(128), __aligned__(128)));
+
+
+// Pass a single value to wrapped() via vararg(...)
+// CHECK-LABEL: @single_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]]) #[[ATTR9:[0-9]+]]
+// CHECK-NEXT:    ret void
+//
+void single_i32(i32 x)
+{
+  vararg(x);
+}
+
+// CHECK-LABEL: @single_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(double noundef [[X:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void single_double(double x)
+{
+  vararg(x);
+}
+
+// CHECK-LABEL: @single_v4f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<4 x float> noundef [[X:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void single_v4f32(v4f32 x)
+{
+  vararg(x);
+}
+
+// CHECK-LABEL: @single_v8f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<8 x float> noundef [[X:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void single_v8f32(v8f32 x)
+{
+  vararg(x);
+}
+
+// CHECK-LABEL: @single_v16f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<16 x float> noundef [[X:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void single_v16f32(v16f32 x)
+{
+  vararg(x);
+}
+
+// CHECK-LABEL: @single_v32f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[INDIRECT_ARG_TEMP:%.*]] = alloca <32 x float>, align 128
+// CHECK-NEXT:    [[X:%.*]] = load <32 x float>, ptr [[TMP0:%.*]], align 128, !tbaa [[TBAA2:![0-9]+]]
+// CHECK-NEXT:    store <32 x float> [[X]], ptr [[INDIRECT_ARG_TEMP]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    tail call void (...) @vararg(ptr noundef nonnull byval(<32 x float>) align 128 [[INDIRECT_ARG_TEMP]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void single_v32f32(v32f32 x)
+{
+  vararg(x);
+}
+
+
+
+// CHECK-LABEL: @i32_double(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], double noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void i32_double(i32 x, double y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @double_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(double noundef [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void double_i32(double x, i32 y)
+{
+  vararg(x, y);
+}
+
+
+// A struct used by libc variadic tests
+
+typedef struct {
+  char c;
+  short s;
+  int i;
+  long l;
+  float f;
+  double d;
+}  libcS;
+
+// CHECK-LABEL: @i32_libcS(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], ptr noundef nonnull byval([[STRUCT_LIBCS:%.*]]) align 8 [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void i32_libcS(i32 x, libcS y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @libcS_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(ptr noundef nonnull byval([[STRUCT_LIBCS:%.*]]) align 8 [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void libcS_i32(libcS x, i32 y)
+{
+  vararg(x, y);
+}
+
+
+// CHECK-LABEL: @i32_v4f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], <4 x float> noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void i32_v4f32(i32 x, v4f32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @v4f32_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<4 x float> noundef [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void v4f32_i32(v4f32 x, i32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @i32_v8f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], <8 x float> noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void i32_v8f32(i32 x, v8f32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @v8f32_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<8 x float> noundef [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void v8f32_i32(v8f32 x, i32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @i32_v16f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], <16 x float> noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void i32_v16f32(i32 x, v16f32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @v16f32_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    tail call void (...) @vararg(<16 x float> noundef [[X:%.*]], i32 noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void v16f32_i32(v16f32 x, i32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @i32_v32f32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[INDIRECT_ARG_TEMP:%.*]] = alloca <32 x float>, align 128
+// CHECK-NEXT:    [[Y:%.*]] = load <32 x float>, ptr [[TMP0:%.*]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    store <32 x float> [[Y]], ptr [[INDIRECT_ARG_TEMP]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    tail call void (...) @vararg(i32 noundef [[X:%.*]], ptr noundef nonnull byval(<32 x float>) align 128 [[INDIRECT_ARG_TEMP]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void i32_v32f32(i32 x, v32f32 y)
+{
+  vararg(x, y);
+}
+
+// CHECK-LABEL: @v32f32_i32(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    [[INDIRECT_ARG_TEMP:%.*]] = alloca <32 x float>, align 128
+// CHECK-NEXT:    [[X:%.*]] = load <32 x float>, ptr [[TMP0:%.*]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    store <32 x float> [[X]], ptr [[INDIRECT_ARG_TEMP]], align 128, !tbaa [[TBAA2]]
+// CHECK-NEXT:    tail call void (...) @vararg(ptr noundef nonnull byval(<32 x float>) align 128 [[INDIRECT_ARG_TEMP]], i32 noundef [[Y:%.*]]) #[[ATTR9]]
+// CHECK-NEXT:    ret void
+//
+void v32f32_i32(v32f32 x, i32 y)
+{
+  vararg(x, y);
+}
diff --git a/clang/test/CodeGen/variadic-wrapper-removal.c b/clang/test/CodeGen/variadic-wrapper-removal.c
new file mode 100644
index 00000000000000..da41dde16f3d73
--- /dev/null
+++ b/clang/test/CodeGen/variadic-wrapper-removal.c
@@ -0,0 +1,86 @@
+// REQUIRES: x86-registered-target
+// RUN: %clang_cc1 -triple i386-unknown-linux-gnu -O1 -emit-llvm -o - %s | opt --passes=expand-variadics -S | FileCheck %s
+// RUN: %clang_cc1 -triple=x86_64-linux-gnu -O1 -emit-llvm -o - %s | opt --passes=expand-variadics -S | FileCheck %s
+
+// neither arm arch is implemented yet, leaving it here as a reminder
+// armv6 is a ptr as far as the struct is concerned, but possibly also a [1 x i32] passed by value
+// that seems insistent, maybe leave 32 bit arm alone for now
+// aarch64 is a struct of five things passed using byval memcpy
+
+// R-N: %clang_cc1 -triple=armv6-none--eabi -O1 -emit-llvm -o - %s | opt --passes=expand-variadics -S | FileCheck %s
+
+// R-N: %clang_cc1 -triple=aarch64-none-linux-gnu -O1 -emit-llvm -o - %s | opt --passes=expand-variadics -S | FileCheck %s
+
+
+
+// expand-variadics rewrites calls to variadic functions into calls to
+// equivalent functions that take a va_list argument. A property of the
+// implementation is that said "equivalent function" may be a pre-existing one.
+// This is equivalent to inlining a sufficiently simple variadic wrapper.
+
+#include <stdarg.h>
+
+typedef int FILE; // close enough for this test
+
+// fprintf is sometimes implemented as a call to vfprintf. That fits the
+// pattern the transform pass recognises - given an implementation of fprintf
+// in the IR module, calls to it can be rewritten into calls into vfprintf.
+
+// CHECK-LABEL: define{{.*}} i32 @fprintf(
+// CHECK-LABEL: define{{.*}} i32 @call_fprintf(
+// CHECK-NOT:   @fprintf
+// CHECK:       @vfprintf
+int vfprintf(FILE *restrict f, const char *restrict fmt, va_list ap);
+int fprintf(FILE *restrict f, const char *restrict fmt, ...)
+{
+  int ret;
+  va_list ap;
+  va_start(ap, fmt);
+  ret = vfprintf(f, fmt, ap);
+  va_end(ap);
+  return ret;
+}
+int call_fprintf(FILE *f)
+{
+  int x = 42;
+  double y = 3.14;
+  return fprintf(f, "int %d dbl %g\n", x, y);
+}
+
+// Void return type is also OK
+
+// CHECK-LABEL: define{{.*}} void @no_result(
+// CHECK-LABEL: define{{.*}} void @call_no_result(
+// CHECK-NOT:   @no_result
+// CHECK:       @vno_result
+void vno_result(const char * fmt, va_list);
+void no_result(const char * fmt, ...)
+{
+  va_list ap;
+  va_start(ap, fmt);
+  vno_result(fmt, ap);
+  va_end(ap);
+}
+void call_no_result(FILE *f)
+{
+  int x = 101;
+  no_result("", x);
+}
+
+// The vaend in the forwarding implementation is optional where it's a no-op
+
+// CHECK-LABEL: define{{.*}} i32 @no_vaend(
+// CHECK-LABEL: define{{.*}} i32 @call_no_vaend(
+// CHECK-NOT:   @no_vaend
+// CHECK:       @vno_vaend
+int vno_vaend(int x, va_list);
+int no_vaend(int x, ...)
+{
+  va_list ap;
+  va_start(ap, x);
+  return vno_vaend(x, ap);
+}
+int call_no_vaend(int x)
+{
+  return no_vaend(x, 10, 2.5f);
+}
diff --git a/clang/test/CodeGenCXX/inline-then-fold-variadics.cpp b/clang/test/CodeGenCXX/inline-then-fold-variadics.cpp
new file mode 100644
index 00000000000000..cf436ead77a2cb
--- /dev/null
+++ b/clang/test/CodeGenCXX/inline-then-fold-variadics.cpp
@@ -0,0 +1,117 @@
+// RUN: %clang_cc1 -triple i386-unknown-linux-gnu -Wno-varargs -O1 -disable-llvm-passes -emit-llvm -o - %s | opt --passes=instcombine | opt -passes="expand-variadics,default<O1>" -S | FileCheck %s --check-prefixes=CHECK,X86Linux
+
+// RUN: %clang_cc1 -triple x86_64-linux-gnu -Wno-varargs -O1 -disable-llvm-passes -emit-llvm -o - %s | opt --passes=instcombine | opt -passes="expand-variadics,default<O1>" -S | FileCheck %s --check-prefixes=CHECK,X64SystemV
+
+// RUN: %clang_cc1 -triple i386-apple-darwin -Wno-varargs -O1 -disable-llvm-passes -emit-llvm -o - %s | opt --passes=instcombine | opt -passes="expand-variadics,default<O1>" -S | FileCheck %s --check-prefixes=CHECK,X86Darwin
+
+// RUN: %clang_cc1 -triple x86_64-apple-darwin -Wno-varargs -O1 -disable-llvm-passes -emit-llvm -o - %s | opt --passes=instcombine | opt -passes="expand-variadics,default<O1>" -S | FileCheck %s --check-prefixes=CHECK,X64SystemV
+
+// RUN: %clang_cc1 -triple i686-windows-msvc -Wno-varargs -O1 -disable-llvm-passes -emit-llvm -o - %s | opt --passes=instcombine | opt -passes="expand-variadics,default<O1>" -S | FileCheck %s --check-prefixes=CHECK,X86Windows
+
+// 64 bit windows va_arg passes most types indirectly but the call instruction uses the types by value
+// ___: %clang_cc1 -triple x86_64-pc-windows-msvc -Wno-varargs -O1 -disable-llvm-passes -emit-llvm -o - %s | opt --passes=instcombine | opt -passes="expand-variadics,default<O1>" -S | FileCheck %s --check-prefixes=CHECK
+
+// Checks for consistency between clang and expand-variadics
+// 1. Use clang to lower va_arg
+// 2. Use expand-variadics to lower the rest of the variadic operations
+// 3. Use opt -O1 to simplify the result for simpler filecheck patterns
+// The simplification will fail when the two are not consistent, modulo bugs elsewhere.
+
+#include <stdarg.h>
+
+// This test can be simplified when expand-variadics is extended to apply to more patterns.
+// The first_valist and second_valist functions can then be inlined, either in the test or
+// by enabling optimisation passes in the clang invocation.
+// The explicit instcombine pass canonicalises the variadic function IR.
+
+// More complicated tests need instcombine of ptrmask to land first.
+
+template <typename X, typename Y>
+static X first_valist(va_list va) {
+  return va_arg(va, X);
+}
+
+template <typename X, typename Y>
+static X first(...) {
+  va_list va;
+  __builtin_va_start(va, 0);
+  return first_valist<X,Y>(va);
+}
+
+template <typename X, typename Y>
+static Y second_valist(va_list va) {
+  va_arg(va, X);
+  Y r = va_arg(va, Y);
+  return r;
+}
+
+
+template <typename X, typename Y>
+static Y second(...) {
+  va_list va;
+  __builtin_va_start(va, 0);
+  return second_valist<X,Y>(va);
+}
+
+extern "C"
+{
+// CHECK-LABEL: define{{.*}} i32 @first_i32_i32(i32{{.*}} %x, i32{{.*}} %y)
+// CHECK:       entry:
+// CHECK:       ret i32 %x
+int first_i32_i32(int x, int y)
+{
+  return first<int,int>(x, y);
+}
+
+// CHECK-LABEL: define{{.*}} i32 @second_i32_i32(i32{{.*}} %x, i32{{.*}} %y)
+// CHECK:       entry:
+// CHECK:       ret i32 %y
+int second_i32_i32(int x, int y)
+{
+  return second<int,int>(x, y);
+}
+}
+
+// Permutations of an int and a double
+extern "C"
+{
+// CHECK-LABEL: define{{.*}} i32 @first_i32_f64(i32{{.*}} %x, double{{.*}} %y)
+// CHECK:       entry:
+// CHECK:       ret i32 %x
+int first_i32_f64(int x, double y)
+{
+  return first<int,double>(x, y);
+}
+  
+// CHECK-LABEL: define{{.*}} double @second_i32_f64(i32{{.*}} %x, double{{.*}} %y)
+// CHECK:       entry:
+
+// X86Linux:    ret double %y
+// X64SystemV:  ret double %y
+// X86Darwin:   ret double %y
+// X86Windows:  [[TMP0:%.*]] = alloca <{ [4 x i8], double }>, align 4
+// X86Windows:  [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i32 4
+// X86Windows:  store double %y, ptr [[TMP1]], align 4
+// X86Windows:  [[TMP2:%.*]] = load double, ptr [[TMP0]], align 4
+// X86Windows:  ret double [[TMP2]]
+double second_i32_f64(int x, double y)
+{
+  return second<int,double>(x, y);
+}
+
+// CHECK-LABEL: define{{.*}} double @first_f64_i32(double{{.*}} %x, i32{{.*}} %y)
+// CHECK:       entry:
+// CHECK:       ret double %x
+double first_f64_i32(double x, int y)
+{
+  return first<double,int>(x, y);
+}
+
+// CHECK-LABEL: define{{.*}} i32 @second_f64_i32(double{{.*}} %x, i32{{.*}} %y)
+// CHECK:       entry:
+// CHECK:       ret i32 %y
+int second_f64_i32(double x, int y)
+{
+  return second<double,int>(x, y);
+}   
+}
diff --git a/llvm/include/llvm/CodeGen/Passes.h b/llvm/include/llvm/CodeGen/Passes.h
index bbfb8a0dbe26a4..fe3208df7a23b2 100644
--- a/llvm/include/llvm/CodeGen/Passes.h
+++ b/llvm/include/llvm/CodeGen/Passes.h
@@ -600,6 +600,10 @@ namespace llvm {
 
   /// Lowers KCFI operand bundles for indirect calls.
   FunctionPass *createKCFIPass();
+
+  // Inline variadic functions and expand variadic intrinsics.
+  ModulePass *createExpandVariadicsPass();
+
 } // End llvm namespace
 
 #endif
diff --git a/llvm/include/llvm/InitializePasses.h b/llvm/include/llvm/InitializePasses.h
index 3db639a6872407..6487d0a5e26d1b 100644
--- a/llvm/include/llvm/InitializePasses.h
+++ b/llvm/include/llvm/InitializePasses.h
@@ -106,6 +106,7 @@ void initializeExpandLargeDivRemLegacyPassPass(PassRegistry&);
 void initializeExpandMemCmpLegacyPassPass(PassRegistry &);
 void initializeExpandPostRAPass(PassRegistry&);
 void initializeExpandReductionsPass(PassRegistry&);
+void initializeExpandVariadicsPass(PassRegistry &);
 void initializeExpandVectorPredicationPass(PassRegistry &);
 void initializeExternalAAWrapperPassPass(PassRegistry&);
 void initializeFEntryInserterPass(PassRegistry&);
diff --git a/llvm/include/llvm/Transforms/IPO/ExpandVariadics.h b/llvm/include/llvm/Transforms/IPO/ExpandVariadics.h
new file mode 100644
index 00000000000000..e7ffe343b940e9
--- /dev/null
+++ b/llvm/include/llvm/Transforms/IPO/ExpandVariadics.h
@@ -0,0 +1,17 @@
+#ifndef LLVM_TRANSFORMS_IPO_EXPANDVARIADICS_H
+#define LLVM_TRANSFORMS_IPO_EXPANDVARIADICS_H
+
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+
+class Module;
+
+class ExpandVariadicsPass : public PassInfoMixin<ExpandVariadicsPass> {
+public:
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};
+
+} // end namespace llvm
+
+#endif // LLVM_TRANSFORMS_IPO_EXPANDVARIADICS_H
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index c934ec42f6eb15..624fffd233ce56 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -131,6 +131,7 @@
 #include "llvm/Transforms/IPO/DeadArgumentElimination.h"
 #include "llvm/Transforms/IPO/ElimAvailExtern.h"
 #include "llvm/Transforms/IPO/EmbedBitcodePass.h"
+#include "llvm/Transforms/IPO/ExpandVariadics.h"
 #include "llvm/Transforms/IPO/ForceFunctionAttrs.h"
 #include "llvm/Transforms/IPO/FunctionAttrs.h"
 #include "llvm/Transforms/IPO/FunctionImport.h"
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index 44511800ccff8d..4ea9493208315a 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -59,6 +59,7 @@ MODULE_PASS("dot-callgraph", CallGraphDOTPrinterPass())
 MODULE_PASS("dxil-upgrade", DXILUpgradePass())
 MODULE_PASS("elim-avail-extern", EliminateAvailableExternallyPass())
 MODULE_PASS("extract-blocks", BlockExtractorPass({}, false))
+MODULE_PASS("expand-variadics", ExpandVariadicsPass())
 MODULE_PASS("forceattrs", ForceFunctionAttrsPass())
 MODULE_PASS("function-import", FunctionImportPass())
 MODULE_PASS("globalopt", GlobalOptPass())
diff --git a/llvm/lib/Transforms/IPO/CMakeLists.txt b/llvm/lib/Transforms/IPO/CMakeLists.txt
index 034f1587ae8df4..b8bd0be91d2232 100644
--- a/llvm/lib/Transforms/IPO/CMakeLists.txt
+++ b/llvm/lib/Transforms/IPO/CMakeLists.txt
@@ -12,6 +12,7 @@ add_llvm_component_library(LLVMipo
   DeadArgumentElimination.cpp
   ElimAvailExtern.cpp
   EmbedBitcodePass.cpp
+  ExpandVariadics.cpp
   ExtractGV.cpp
   ForceFunctionAttrs.cpp
   FunctionAttrs.cpp
diff --git a/llvm/lib/Transforms/IPO/ExpandVariadics.cpp b/llvm/lib/Transforms/IPO/ExpandVariadics.cpp
new file mode 100644
index 00000000000000..4b4c635cef5092
--- /dev/null
+++ b/llvm/lib/Transforms/IPO/ExpandVariadics.cpp
@@ -0,0 +1,723 @@
+//===-- ExpandVariadicsPass.cpp --------------------------------*- C++ -*-=//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// S...
[truncated]

@llvmbot
Member

llvmbot commented Feb 7, 2024

@llvm/pr-subscribers-llvm-transforms


@JonChesterfield JonChesterfield force-pushed the jc_varargs_inliner_component branch from 7285c68 to c77e660 on February 8, 2024 00:26
Contributor

@jhuber6 jhuber6 left a comment

Thanks for working on this. A few quick nits; overall it looks like lots of straightforward testing with a beefy new pass.

Comment on lines 99 to 100
virtual void initializeVAList(LLVMContext &Ctx, IRBuilder<> &Builder,
AllocaInst *va_list, Value * /*buffer*/) {
Contributor

If we assert a value is non-null, it's best to make it a reference in the interface.

Collaborator Author

Expressing the class invariants through the vtable seems harder than it should be. This function is only called if valistOnStack returns true; I've changed the base implementation to a __builtin_unreachable.
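A minimal sketch of the invariant under discussion, assuming only what the quoted snippets show (the real Interface has more members and full parameter lists): initializeVAList only has meaning when valistOnStack() returns true, so the base implementation guards itself and is otherwise unreachable.

#include <cassert>

struct Interface {
  virtual ~Interface() = default;

  // Targets that build the va_list in a stack slot return true and override
  // initializeVAList; for everyone else the base version must never run.
  virtual bool valistOnStack() { return false; }

  virtual void initializeVAList(/* LLVMContext &, IRBuilder<> &, AllocaInst *, Value * */) {
    assert(!valistOnStack());
    __builtin_unreachable(); // the review below prefers llvm_unreachable here
  }
};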

};

std::unique_ptr<Interface> create(Module &M) {
Triple Trip = Triple(M.getTargetTriple());
Contributor

Nit: kind of a non-standard name. I prefer:

Suggested change
Triple Trip = Triple(M.getTargetTriple());
llvm::Triple Triple(M.getTargetTriple());

uint32_t minimum_slot_align() override { return 4; }
uint32_t maximum_slot_align() override { return 0; }
};
return std::make_unique<X86Windows>();
Contributor

Do these need to be special types? The usage seems like it could just be a config struct to the base class with some pair.

Collaborator Author

They were a struct with two named fields previously; could go with something like:

struct slotAlignTy {
  uint32_t min;
  uint32_t max;
};

slotAlignTy slot_align() override { return {0, 4}; }

Or put state in the base class; that probably optimises better than virtual functions that return constants.

Collaborator Author

I like the state in the base class approach. Const fields at fixed offset, initialisation specified where it was before. Would be clearer with named parameters but the min, max sequence convention is probably good enough.
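A hedged sketch of that shape, with illustrative field and constructor names (not taken from the final patch): the per-target slot alignments become const fields set by each derived constructor instead of virtual functions returning constants.

#include <cstdint>

struct VAListInterface {
  // Per-target constants fixed at construction time.
  const uint32_t MinSlotAlign;
  const uint32_t MaxSlotAlign; // 0 meaning "no maximum", as in the snippet above

  VAListInterface(uint32_t Min, uint32_t Max)
      : MinSlotAlign(Min), MaxSlotAlign(Max) {}
  virtual ~VAListInterface() = default;
};

// The values quoted above for X86Windows: minimum 4, maximum 0.
struct X86Windows final : VAListInterface {
  X86Windows() : VAListInterface(/*Min=*/4, /*Max=*/0) {}
};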

static char ID;
std::unique_ptr<VariadicABIInfo::Interface> ABI;

// todo: drop the boolean
Contributor

Suggested change
// todo: drop the boolean
// TODO: drop the boolean

}
};

Function *ExpandVariadics::isFunctionInlinable(Module &M, Function *F) {
Contributor

I feel a lot of these arguments should be references since we're not checking for null.

Collaborator Author

There are a fair number of pointer comparisons: the target of a call instruction against a function, and similar. I could go with references everywhere, but the &reference == pointer construct reads poorly to me.
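A small illustration of the trade-off; the helper functions are invented, but Function and CallBase are the real LLVM types involved in this kind of identity check.

#include "llvm/IR/Function.h"
#include "llvm/IR/InstrTypes.h"

using namespace llvm;

// Pointer parameter: comparing against the call target is direct.
static bool callsByPtr(const CallBase &CB, const Function *F) {
  return CB.getCalledFunction() == F;
}

// Reference parameter: the same check needs an explicit address-of, which is
// the "&reference == pointer" construct mentioned above.
static bool callsByRef(const CallBase &CB, const Function &F) {
  return CB.getCalledFunction() == &F;
}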

@JonChesterfield JonChesterfield force-pushed the jc_varargs_inliner_component branch 2 times, most recently from 54b09d5 to b9e21af on February 8, 2024 01:24
Contributor

@jhuber6 jhuber6 left a comment

Some more comments

AllocaInst * /*va_list*/, Value * /*buffer*/) {
// Function needs to be implemented if valist is on the stack
assert(!valistOnStack());
__builtin_unreachable();
Contributor

This is confusing. If this function is never to be called, why not make it pure virtual? __builtin_unreachable() is an optimization hint, while llvm_unreachable is a runtime assertion that the code should not be executed.

Contributor

Yeah, llvm_unreachable is usually used inside LLVM, so definitely better to use llvm_unreachable here.

Collaborator Author

I've gone with more comments, but maybe to make this clear enough I need to separate the field alignment from whether va_list is a void* or something that needs more work. The function is called when it is overridden and it's not great to provide overrides that are themselves not called.
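For reference, the difference in play, with invented free-function names: llvm_unreachable (from llvm/Support/ErrorHandling.h) reports and aborts in asserts builds, while __builtin_unreachable is purely an optimisation hint.

#include "llvm/Support/ErrorHandling.h"

void baseHookA() {
  // Runtime assertion in +Asserts builds (prints the message and aborts), and
  // still an unreachability hint in release builds. The usual choice in LLVM.
  llvm_unreachable("only valid when valistOnStack() returns true");
}

void baseHookB() {
  // Pure optimisation hint: undefined behaviour if ever executed, with no
  // diagnostic. This is what the patch currently uses.
  __builtin_unreachable();
}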

Comment on lines 382 to 384
if (!isa<ReturnInst>(BB.getTerminator())) {
return nullptr;
}
Contributor

Style nit: braces around a single-line block.

return nullptr;
}

// Walk the block in order checking for specific instructions, some of them
Contributor

Nit: style. LLVM uses PascalCase for variables in all locations below.

Collaborator Author

Advice sought on how to PascalCase variables with va_list in the name

Contributor

If it's a well-understood thing then we usually ignore it, like argc. But otherwise probably VAList.

// with it, such that target specific va_arg instructions will correctly
// iterate over it. Primarily this means getting the alignment right.

class {
Contributor

Not sure how I feel about a big anonymous class in the middle of this function.

Collaborator Author

I'm expecting some resistance to large functions. The cyclomatic complexity is very low - they read linearly from top to bottom - but nevertheless people might prefer blocks of code with a single use to be moved further from it.

Contributor

I personally don't like it; I'd rather see the class moved outside the function. I understand that it's only used by one function, but IMO a class def in the middle of a function breaks the "flow" and makes it harder to read.

Contributor

Yeah, I don't think I've encountered any anonymous classes in LLVM. I've only ever seen anonymous structs that are POD without any members. You call this Frame, so just move it out and call it FrameTy or something.
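A hedged sketch of the suggested restructuring: the class gets a name and moves to namespace scope. The members follow the snippet quoted in the next thread (fieldTypes, maybeValueIsByval, append, empty), renamed per the PascalCase nit above; the real interface in the patch may differ.

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Type.h"
#include "llvm/IR/Value.h"
#include <utility>

using namespace llvm;

struct FrameTy {
  SmallVector<Type *> FieldTypes;
  SmallVector<std::pair<Value *, bool>> MaybeValueIsByval;

  void append(Type *T, Value *V, bool IsByval) {
    FieldTypes.push_back(T);
    MaybeValueIsByval.push_back({V, IsByval});
  }

  bool empty() const { return FieldTypes.empty(); }
};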

Comment on lines 514 to 520
enum { N = 4 };
SmallVector<Type *, N> fieldTypes;
SmallVector<std::pair<Value *, bool>, N> maybeValueIsByval;
Contributor
Usually it's easiest to leave off the template argument and let the size default to something sane unless there's an explicit reason not to. I think by default you get 64 bytes, so that'll likely be 8 elements or so.

Collaborator Author
The vectors are always the same length as each other, does that affect your recommendation to use a default?

Contributor
Well, the size only matters for the small-size optimization before we fall back to malloc. If you know ahead of time that 4 on both makes malloc less likely, it could be argued for, but if they'd otherwise be different sizes and it's not a performance issue the default is usually cleaner.
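For reference, the no-explicit-size spelling being suggested is just this (field names copied from the snippet above, so purely illustrative):

// SmallVector picks a sensible default inline capacity when the second
// template argument is omitted.
SmallVector<Type *> FieldTypes;
SmallVector<std::pair<Value *, bool>> MaybeValueIsByval;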

Comment on lines +588 to +597
if (Frame.empty()) {
// Not passing anything, hopefully va_arg won't try to dereference it
// Might be a target specific thing whether one can pass nullptr instead
// of undef i32
Frame.append(Type::getInt32Ty(Ctx), nullptr, false);
}
Contributor
Suggested change
if (Frame.empty()) {
// Not passing anything, hopefully va_arg won't try to dereference it
// Might be a target specific thing whether one can pass nullptr instead
// of undef i32
Frame.append(Type::getInt32Ty(Ctx), nullptr, false);
}
// Not passing anything, hopefully va_arg won't try to dereference it
// Might be a target specific thing whether one can pass nullptr instead
// of undef i32
if (Frame.empty())
Frame.append(Type::getInt32Ty(Ctx), nullptr, false);

Collaborator Author
Not keen - the comment is true within the scope of the if and false outside of the scope. If deleting the braces is important I can rephrase the comment (probably by writing 'if' in the comment as well as in the code)

@@ -0,0 +1,17 @@
#ifndef LLVM_TRANSFORMS_IPO_EXPANDVARIADICS_H
Contributor
Needs the LLVM copyright header as well as (brief) documentation of the pass.
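For reference, the standard LLVM header block looks roughly like this (the one-line description here is illustrative):

//===- ExpandVariadics.h - Expand variadic functions ----------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//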

AllocaInst * /*va_list*/, Value * /*buffer*/) {
// Function needs to be implemented if valist is on the stack
assert(!valistOnStack());
__builtin_unreachable();
Contributor
Yeah, llvm_unreachable is usually used inside LLVM, so definitely better to use llvm_unreachable here.
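A minimal sketch of the change being asked for, mirroring the quoted fragment (the enclosing function's name and return type are not shown in the snippet, so this is illustrative only):

AllocaInst * /*va_list*/, Value * /*buffer*/) {
  // Only ABIs whose va_list lives on the stack implement this hook, so the
  // default body must never be reached.
  llvm_unreachable("Only implemented when valistOnStack() is true");
}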

bool VACopyIsMemcpy() { return true; }
};

struct X64SystemV final : public Interface {
Contributor
I wonder if this part needs to be in the backend.

Collaborator Author
@JonChesterfield JonChesterfield Feb 8, 2024
I really hope not. It doesn't duplicate information that the backend already knows, and IR passes reaching into the backend is a mess. Likewise, reaching into the front end to try to deduplicate against clang, which has the complement of this information, would have a severe adverse effect on legibility.

Contributor
But it doesn't look like a good idea to handle multiple targets in one single file. Here it just includes the x86 family, and there are many more targets supporting variadic functions out there. FWIW, I don't think the information needed for this pass becomes "messy" in the backend.

Collaborator Author
It scales sublinearly in number of targets and x86 is about the worst case. Scattering the information across all the backends does nothing to decrease the amount of information needed.

I don't think we should try to put an abstraction over this until the switch has become unwieldy. Partly because that day may not come, partly because the details are likely to change (further) under review, and partly because the last time I tried to export information from the backend the infra didn't work with the new pass manager.

Contributor
I think I fixed your target information issue, you just need to add the TargetMachine to the pass constructor: f7dcabe

Collaborator Author
Thanks for patching the LDS pass; the commit is a useful reference on the pass manager boilerplate.

Do we have a TargetMachine instance available in the middle of opt? I thought that was only created once we hit codegen, in which case depending on one would prevent this from being run as an optimisation in the opt O1 or O2 pipeline.

Contributor
The TargetMachine in PassBuilder seems optional, but I'm wondering if we'd even want to run this pass if there is no TM. I don't especially like seeing target-specific logic in an optimization, but I don't contribute enough to the optimization passes to assert whether that's a bad thing or not.


// All targets currently implemented use a ptr for the valist parameter
Type *vaListParameterType(LLVMContext &Ctx) {
return PointerType::getUnqual(Ctx);
Contributor
I'd expect this to use the alloca addrspace?

Collaborator Author
Unclear - va_list as a type is declared in clang as a void* (or maybe a char * for amdgpu) at present. None of them have addrspace annotations. The IR works out nicer if we can change that to an addrspace(5) void*. Semi-related, amdgpu needs a patch to a pointer rounding function that uses ptrmask to tolerate addrspace. Missing from this patch as amdgpu isn't in the MVP.

Collaborator Author
If we can change that, it doesn't have to be in the MVP

Contributor
Yes, we need to change those to use the right address space
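One possible shape for that, as a sketch only - it assumes the pass can see the module's DataLayout, and the clang-side declaration of va_list would still need updating separately:

Type *vaListParameterType(Module &M) {
  const DataLayout &DL = M.getDataLayout();
  // Use the address space that allocas live in rather than addrspace(0).
  return PointerType::get(M.getContext(), DL.getAllocaAddrSpace());
}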

const bool IsLinuxABI = Triple.isOSLinux() || Triple.isOSCygMing();

switch (Triple.getArch()) {

Contributor
Extra empty line

Comment on lines +268 to +274
if (GlobalValue::isInterposableLinkage(F->getLinkage()))
return false;
Contributor
Simpler as F->isDefinitionExact?

Don't we want to do this unconditionally for AMDGPU?

Collaborator Author
There are a few conditional statements needed when the later patch lands that permits changing the function interfaces from codegen. This is indeed one of them. I'll check isDefinitionExact.
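For comparison, the suggested spelling would read as follows (whether the exact-definition check is the right predicate here is still being confirmed):

  // Interposable definitions are not exact, so this subsumes the linkage test.
  if (!F->isDefinitionExact())
    return false;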

}
}

// Branch funnels look like variadic functions but arent:
Contributor
Suggested change
// Branch funnels look like variadic functions but arent:
// Branch funnels look like variadic functions but aren't:

NewCB->setCallingConv(CB->getCallingConv());
NewCB->copyMetadata(*CB, {LLVMContext::MD_prof, LLVMContext::MD_dbg});

if (!CB->use_empty()) // dead branch?
Contributor
you can just replaceAllUsesWith, it shouldn't die on empty uses
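i.e. the guard can be dropped entirely; replaceAllUsesWith is well-defined when the use list is empty:

  // No-op if CB has no uses.
  CB->replaceAllUsesWith(NewCB);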

NewCB->setAttributes(PAL);
NewCB->takeName(CB);
NewCB->setCallingConv(CB->getCallingConv());
NewCB->copyMetadata(*CB, {LLVMContext::MD_prof, LLVMContext::MD_dbg});
Contributor
I'd expect this to be able to preserve all metadata?

Collaborator Author
Unsure - dead argument elimination preserves exactly these two

Comment on lines +573 to +582
attributes #0 = { nounwind "min-legal-vector-width"="0" "no-builtins" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) }
attributes #2 = { mustprogress nocallback nofree nosync nounwind willreturn }
attributes #3 = { "no-builtins" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #4 = { nounwind "min-legal-vector-width"="128" "no-builtins" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #5 = { nounwind }
attributes #6 = { nobuiltin nounwind "no-builtins" }
attributes #7 = { nobuiltin "no-builtins" }
Contributor
Should be able to drop all of the attributes

attributes #6 = { nobuiltin nounwind "no-builtins" }
attributes #7 = { nobuiltin "no-builtins" }

!llvm.module.flags = !{!0, !1, !2}
Contributor
At most you need whatever is required to show the debug info preservation

Collaborator Author
Debug info is something I haven't really thought about. That's probably something that should have its own test case.

These tests aspire to show that the call frame is correct for the given architecture - variables are at the right offset and so forth. Other things like metadata / attribute propagation are target independent and should have their own tests.

I'll look at removing the attributes from these; that would give a better signal-to-noise ratio.

ret void
}

define dso_local void @v32f32_i32(ptr nocapture noundef readonly byval(<32 x float>) align 128 %0, i32 noundef %y) local_unnamed_addr #0 {
Contributor
Don't need all the local_unnamed_addr / dso_local

Collaborator Author
Yep, gone.

@@ -0,0 +1,117 @@
// RUN: %clang_cc1 -triple i386-unknown-linux-gnu -Wno-varargs -O1 -disable-llvm-passes -emit-llvm -o - %s | opt --passes=instcombine | opt -passes="expand-variadics,default<O1>" -S | FileCheck %s --check-prefixes=CHECK,X86Linux
Contributor
Can this be done in one opt run?

Collaborator Author
Nope. There's something weird/broken with opt here; the failure reproduces with other passes. Issue #81128.

void v32f32_i32(v32f32 x, i32 y)
{
vararg(x, y);
}
Contributor
Also cover the half and bfloat cases.

@JonChesterfield
Collaborator Author

Patch run through clang-tidy --checks=readability-identifier-naming with the config file in tree and recommendations applied. Some of the choices seem poor but it's presumably acceptable.

@JonChesterfield JonChesterfield force-pushed the jc_varargs_inliner_component branch from b9e21af to 122a305 on February 8, 2024 12:52
@JonChesterfield JonChesterfield force-pushed the jc_varargs_inliner_component branch from 122a305 to 956f4f9 on February 8, 2024 13:08
Contributor
@Pierre-vh Pierre-vh left a comment
My comments are mostly about style; I haven't done a deep dive into the logic of the pass yet.

if (!Equivalent)
return Changed;

for (User *U : llvm::make_early_inc_range(F->users()))
Contributor
do you actually need llvm:: for those?

case VariadicABIInfo::ValistCc::value: {
// If it's being passed by value, need a load
// TODO: Check it's loading the right thing
auto *load = dyn_cast<LoadInst>(&*Iter);
Contributor
CamelCase

if (!Call)
return nullptr;

if (auto *end = dyn_cast<VAEndInst>(&*Iter)) {
Contributor
CamelCase

// Not totally convinced this is necessary but dead store elimination
// decides to discard the stores to the alloca and pass uninitialised
// memory along instead when the function is marked tailcall
if (TCK == CallInst::TCK_Tail) {
Contributor
can drop {}

NewCB->copyMetadata(*CB, {LLVMContext::MD_prof, LLVMContext::MD_dbg});

if (!CB->use_empty()) // dead branch?
{
Contributor
can drop {} too

Comment on lines +282 to +288
if (CB->isMustTailCall()) {
return false;
}

if (!CB->isCallee(&U) || CB->getFunctionType() != F->getFunctionType()) {
return false;
}
Contributor
both can drop {}

Comment on lines +229 to +235
if (Triple.isOSDarwin()) {
return std::make_unique<VariadicABIInfo::X64SystemV>();
}

if (IsLinuxABI) {
return std::make_unique<VariadicABIInfo::X64SystemV>();
}
Contributor
both can drop {}

// aarch64 uses byval
enum class ValistCc { value, pointer, /*byval*/ };

struct Interface {
Contributor
Interface is a very generic name, can you make it a bit more specific and add docs?
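Something along these lines would address both points - the name and the member set here are illustrative, not a proposal for the exact API:

/// Describes how a target's ABI builds and passes a va_list: the layout and
/// alignment of the frame the caller constructs, and whether the va_list is
/// passed by value or by pointer.
struct VariadicABIInterface {
  virtual ~VariadicABIInterface() = default;

  /// True if va_copy can be lowered as a memcpy of the va_list storage.
  virtual bool vaCopyIsMemcpy() { return true; }
};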


if (IsLinuxABI) {
struct X86Linux final : public Interface {
X86Linux() : Interface(4, 0) {}
Contributor
I don't understand - why can't this just be a std::make_unique<Interface>(4, 0)? Why the need for a derived type that does not add any feature/override?
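The simplification being suggested, assuming the Interface constructor taking (size, alignment) shown above is sufficient on its own for this target:

  if (IsLinuxABI)
    return std::make_unique<Interface>(4, 0);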

Collaborator
@efriedma-quic efriedma-quic left a comment
I don't really like the whole "sufficiently simple function" thing. It seems fragile. You should be able to just take an arbitrary internal varargs function, rewrite its signature to take a va_list argument, rewrite calls to va_start to make a copy of that va_list, and rewrite the callers to construct that va_list. If that function turns out to be inlinable, great; if not, you haven't really lost anything.

(Rewriting the signature of a function is complicated in its own way because you need to allocate a new Function, then transplant the original function's body into it. But it's not uncharted territory: we should be able to refactor code out of llvm/lib/Transforms/IPO/ArgumentPromotion.cpp .)
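A rough sketch of the signature rewrite described above, in the spirit of ArgumentPromotion (all names are illustrative, and details such as attribute and metadata transfer are omitted):

FunctionType *OldTy = F->getFunctionType();
SmallVector<Type *> Params(OldTy->params().begin(), OldTy->params().end());
Params.push_back(PointerType::getUnqual(F->getContext())); // the va_list
FunctionType *NewTy =
    FunctionType::get(OldTy->getReturnType(), Params, /*isVarArg=*/false);

Function *NewF = Function::Create(NewTy, F->getLinkage(), F->getAddressSpace(),
                                  F->getName() + ".valist", F->getParent());
// Move the body across and rewire the fixed arguments.
NewF->splice(NewF->begin(), F);
for (auto [Old, New] : llvm::zip(F->args(), NewF->args())) {
  Old.replaceAllUsesWith(&New);
  New.takeName(&Old);
}
// Remaining work: rewrite va_start/va_end in NewF in terms of the new
// argument, and rewrite each caller of F to build the va_list and call NewF.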


Do we have a testing plan for this? Messing up calling convention stuff tends to lead to extremely subtle bugs.

// varargs can't be tail called anyway
// Not totally convinced this is necessary but dead store elimination
// decides to discard the stores to the alloca and pass uninitialised
// memory along instead when the function is marked tailcall
Collaborator
As defined in LangRef, "tail" actually just means "doesn't reference any memory allocated in the current call's stack". It's a prerequisite for tail-calling a function, but doesn't indicate the function is actually tail-callable.

Collaborator Author
I'll have to recheck dead store elimination. If the tail call flag isn't removed, DSE turns the IR into undef, which seems wrong to me - but maybe it's fine and I need to reword this comment

@jdoerfert
Member

High level question:
Does this patch eliminate the variadic call edge, or does it perform inlining on very special variadic function definitions?
I thought the former, but isFunctionInlinable sufficiently confused me.

@RKSimon
Collaborator

RKSimon commented Feb 14, 2024

Why have the x86 tests been placed in test\CodeGen\X86 instead of something like test\Transforms\ExpandVariadics\X86 ?

@JonChesterfield
Collaborator Author

JonChesterfield commented Feb 14, 2024

I don't really like the whole "sufficiently simple function" thing. It seems fragile. You should be able to just take a arbitrary internal varargs function, rewrite its signature to take a va_list argument, rewrite calls to va_start to make a copy of that va_list, and rewrite the callers to construct that va_list. If that function turns out to be inlinable, great; if not, you haven't really lost anything.

Yes, you can and I do. That's patch 2 of the series, numbered 1 in the list in the commit message (this one being patch 0).

(Rewriting the signature of a function is complicated in its own way because you need to allocate a new Function, then transplant the original function's body into it. But it's not uncharted territory: we should be able to refactor code out of llvm/lib/Transforms/IPO/ArgumentPromotion.cpp .)

And this is why it's a separate patch.

The rewrite-call-instruction is the target dependent bit, in this patch for an initial target.

The rewrite-function is (almost) target agnostic and involves a surprisingly large amount of book keeping. We have duplicated code in ArgumentPromotion, dead argument removal, the function cloning in attributor and probably elsewhere.

Do we have a testing plan for this? Messing up calling convention stuff tends to lead to extremely subtle bugs.

This would be why I want to do it all in IR, staying away from the backend lowering. It's also why it's structured as orthogonal transforms with independent testing. Fun story, the whole reason this exists for x64 is that I had such a rough time debugging it natively on amdgpu that it was easier to generalise to x64.

@JonChesterfield
Collaborator Author

JonChesterfield commented Feb 14, 2024

High level question: Does this patch eliminate the variadic call edge, or, does it perform inlining on very special variadic function definitions? I thought the former but isFunctionInlinable, sufficiently confused me.

This patch will rewrite calls to a variadic function into calls to a function taking a va_list. Later patches expand that to cover the rest of the cases: more complicated functions, other targets, etc. Note that calls to variadic function pointers cannot generally be rewritten without permitting ABI changes, which is what I plan to do for nvptx and amdgpu.

@efriedma-quic
Collaborator

I don't really like the whole "sufficiently simple function" thing. It seems fragile. You should be able to just take a arbitrary internal varargs function, rewrite its signature to take a va_list argument, rewrite calls to va_start to make a copy of that va_list, and rewrite the callers to construct that va_list. If that function turns out to be inlinable, great; if not, you haven't really lost anything.

Yes, you can and I do. That's patch 2 of the series, numbered 1. in the list in the commit message (for this is 0)

Not sure if this means isFunctionInlinable will go away in the followup patch, or if you plan to rewrite functions in a way that satisfies isFunctionInlinable. I think the end state should be that all functions go down the same codepath, not conditionally do something different based on whether they're "simple". I guess I don't have a strong preference for how you get there, though.

@JonChesterfield
Collaborator Author

Not sure if this means isFunctionInlinable will go away in the followup patch, or if you plan to rewrite functions in a way that satisfies isFunctionInlinable. I think the end state should be that all functions go down the same codepath, not conditionally do something different based on whether they're "simple". I guess I don't have a strong preference for how you get there, though.

The logic I've got at present (which includes the ABI rewriting) is

    bool usefulToSplit =
        splitFunctions() && (!F->isDeclaration() || rewriteABI());

    // F may already be a single basic block calling a known function
    // that takes a va_list, in which case it doesn't need to be split.
    Function *Equivalent = isFunctionInlinable(M, F);

    if (usefulToSplit && !Equivalent) {
      Equivalent = DeriveInlinableVariadicFunctionPair(M, *F);
      assert(Equivalent);
      assert(isFunctionInlinable(M, *F)); // branch doesn't do this presently but it could do
      changed = true;
      functionToInliningTarget[F] = Equivalent;
    }

I'm not especially attached to the specific control flow.

The two transforms - inlining/call-rewrite and splitting an entry block off the variadic function so that it can be inlined - are genuinely orthogonal, and that is a really good property to preserve. It took some thought to get to that structure.

If we ignore that design and run functions through the block splitting unnecessarily, we take on a combinatorial increase in required testing, a decrease in emitted code quality (spurious extra functions), and an inability to pattern match on fprintf->vfprintf style code that is already in the application. We would get to delete the isFunctionInlinable predicate.

The independent-transform pipeline pattern is more important than the no-special-case branching heuristic. If it helps, view it as two complementary transforms where one is skipped when it would be a no-op.

Related - if there's an objection to landing this as an inactive pass (only exercised by test code) we can put it into an optimisation pipeline immediately, it'll still remove some real world variadic calls even if the later patches don't make it.

@efriedma-quic
Collaborator

If we ignore that design and run functions through the block splitting unnecessarily, we win a combinatorial increase in required testing, a decrease in emitted code quality (spurious extra functions), an inability to pattern match on fprintf->vfprintf style code that happens to be in the application already. We would get to delete the isFunctionInlinable predicate.

If we do unnecessary splitting, then run the inliner, we should end up with exactly the same thing you'd get by special-casing isFunctionInlinable functions. So that's... very slightly slower compile-time, in exchange for dropping the delicate isFunctionInlinable check. That seems worthwhile to me.

There's really a lot of stuff you can easily get wrong in a check like that... for example, it looks like you forgot to check the type of the load, or whether the load is volatile.

Not sure what you're referring to with the "combinatorial increase in required testing".

Related - if there's an objection to landing this as an inactive pass (only exercised by test code) we can put it into an optimisation pipeline immediately, it'll still remove some real world variadic calls even if the later patches don't make it.

I'm fine with off-by-default as long as we have a roadmap for enabling it at some point.

@JonChesterfield
Collaborator Author

JonChesterfield commented Feb 20, 2024

Ah OK, so split every variadic definition and let the inliner sort it out afterwards. Yes, I'm good with that. Tests either get messier or add a call to the inliner. Will update the PR correspondingly - solid simplification, thanks!

Discard the combinatorial testing comment - I misunderstood the structure you had in mind.

Unconditional splitting means the transforms need to land in the other order: first the splitter/outliner, then the call instruction rewrite. With the function pattern recogniser deleted, the test cases for this won't fire. After the varargs splitter/outliner lands, the tests for this PR will make sense again.

Contributor
@jhuber6 jhuber6 left a comment
Some nits, mostly just formatting and naming that hasn't been updated.

I agree overall that we should just put this in some canonical form and rely on other LLVM passes to take care of things like inlining. Eager to have this functionality in, so hopefully we can keep this moving.

Comment on lines +229 to +235
if (Triple.isOSDarwin()) {
return std::make_unique<VariadicABIInfo::X64SystemV>();
}

if (IsLinuxABI) {
return std::make_unique<VariadicABIInfo::X64SystemV>();
}
Contributor
Suggested change
if (Triple.isOSDarwin()) {
return std::make_unique<VariadicABIInfo::X64SystemV>();
}
if (IsLinuxABI) {
return std::make_unique<VariadicABIInfo::X64SystemV>();
}
if (Triple.isOSDarwin())
return std::make_unique<VariadicABIInfo::X64SystemV>();
if (IsLinuxABI)
return std::make_unique<VariadicABIInfo::X64SystemV>();

Nit: no braces around single-statement blocks.


// Initialise a va_list pointing to that struct and pass it as the last
// argument
{
Contributor
Why does this need to be in its own block? I'd assume that if the VoidPtr names conflict it would be the same thing. If you don't want VoidBuffer escaping the scope, I'd just pass it directly to its uses in both cases since it's short enough.

Contributor
@arsenm arsenm left a comment
Title should be rephrased; this doesn't have anything to do with inlining

@JonChesterfield
Collaborator Author

Feedback is merged into #89007, which represents this transform with the pattern match Eli recommended against removed (so some test cases now need to run the inliner), the anonymous class moved out, and the multiple-patch sequence discarded in favour of one patch that does both the target-specific and the target-agnostic mangling.

@JonChesterfield JonChesterfield marked this pull request as draft April 17, 2024 00:38
@JonChesterfield
Collaborator Author

Marked as "draft" to approximate the state of "available for reference but not a candidate for merging"

@JonChesterfield
Collaborator Author

The majority of this landed as #93362.

Labels: backend:X86, clang, llvm:transforms