InstCombine: Order shufflevector operands by complexity #113212

Closed

Conversation

MatzeB
Contributor

@MatzeB MatzeB commented Oct 21, 2024

As shufflevector is effectively commutative, we should apply the same
logic as for other commutative operations and order the inputs by their
`getComplexity()` value. This puts things like `undef`, `poison`, and
constants on the right-hand side where possible.

We had a rule that moved `undef` to the right-hand side; it is superseded
by this.
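
For illustration, here is a hypothetical before/after (not taken from the patch) of the kind of reordering this enables. Commuting a shuffle swaps the two inputs and rewrites each mask index i to i + NumElts when i < NumElts, and to i - NumElts otherwise:

    ; Before: the constant vector is the first (left-hand) operand.
    %s = shufflevector <4 x i32> <i32 0, i32 0, i32 0, i32 0>, <4 x i32> %x, <4 x i32> <i32 4, i32 0, i32 5, i32 1>

    ; After: operands ordered by complexity and the mask adjusted; the constant ends up on the RHS.
    %s = shufflevector <4 x i32> %x, <4 x i32> <i32 0, i32 0, i32 0, i32 0>, <4 x i32> <i32 0, i32 4, i32 1, i32 5>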

@MatzeB MatzeB requested a review from nikic as a code owner October 21, 2024 20:18
@llvmbot llvmbot added clang Clang issues not falling into any other category backend:X86 llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms labels Oct 21, 2024
@llvmbot
Member

llvmbot commented Oct 21, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-llvm-transforms

Author: Matthias Braun (MatzeB)

Changes

As shufflevector is effectively commutative, we should apply the same
logic as for other commutative operations and order the inputs by their
`getComplexity()` value. This puts things like `undef`, `poison`, and
constants on the right-hand side where possible.

We had a rule that moved `undef` to the right-hand side; it is superseded
by this.


Patch is 55.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113212.diff

10 Files Affected:

  • (modified) clang/test/CodeGen/X86/avx-shuffle-builtins.c (+450-72)
  • (modified) clang/test/CodeGen/X86/sse.c (+5-6)
  • (modified) llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp (+18-13)
  • (modified) llvm/test/Analysis/ValueTracking/knownbits-x86-hadd-hsub.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/insert-extract-shuffle-inseltpoison.ll (+2-2)
  • (modified) llvm/test/Transforms/InstCombine/insert-extract-shuffle.ll (+3-3)
  • (modified) llvm/test/Transforms/InstCombine/vec_shuffle-inseltpoison.ll (+1-1)
  • (modified) llvm/test/Transforms/InstCombine/vec_shuffle.ll (+41-2)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll (+1-1)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-cast.ll (+1-1)
diff --git a/clang/test/CodeGen/X86/avx-shuffle-builtins.c b/clang/test/CodeGen/X86/avx-shuffle-builtins.c
index d184d28f3e07aa..8d5b2c1d8c4394 100644
--- a/clang/test/CodeGen/X86/avx-shuffle-builtins.c
+++ b/clang/test/CodeGen/X86/avx-shuffle-builtins.c
@@ -1,7 +1,7 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 5
 // REQUIRES: x86-registered-target
-// RUN: %clang_cc1 -ffreestanding %s -O3 -triple=x86_64-apple-darwin -target-feature +avx -emit-llvm -o - | FileCheck %s --check-prefixes=CHECK,X64
-// RUN: %clang_cc1 -ffreestanding %s -O3 -triple=i386-apple-darwin -target-feature +avx -emit-llvm -o - | FileCheck %s --check-prefixes=CHECK,X86
-// FIXME: This is testing optimized generation of shuffle instructions and should be fixed.
+// RUN: %clang_cc1 -ffreestanding %s -target-feature +avx -emit-llvm -o - | FileCheck %s --check-prefixes=CHECK
+// RUN: %clang_cc1 -ffreestanding %s -target-feature +avx -emit-llvm -o - | FileCheck %s --check-prefixes=CHECK
 
 
 #include <immintrin.h>
@@ -10,201 +10,579 @@
 // Test LLVM IR codegen of shuffle instructions, checking if the masks are correct
 //
 
+// CHECK-LABEL: define dso_local <8 x float> @x(
+// CHECK-SAME: <8 x float> noundef [[A:%.*]], <8 x float> noundef [[B:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <8 x float>, align 32
+// CHECK-NEXT:    [[B_ADDR:%.*]] = alloca <8 x float>, align 32
+// CHECK-NEXT:    store <8 x float> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    store <8 x float> [[B]], ptr [[B_ADDR]], align 32
+// CHECK-NEXT:    [[TMP0:%.*]] = load <8 x float>, ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[TMP1:%.*]] = load <8 x float>, ptr [[B_ADDR]], align 32
+// CHECK-NEXT:    [[SHUFP:%.*]] = shufflevector <8 x float> [[TMP0]], <8 x float> [[TMP1]], <8 x i32> <i32 3, i32 2, i32 8, i32 11, i32 7, i32 6, i32 12, i32 15>
+// CHECK-NEXT:    ret <8 x float> [[SHUFP]]
+//
 __m256 x(__m256 a, __m256 b) {
-  // CHECK-LABEL: x
-  // CHECK: shufflevector{{.*}}<i32 3, i32 2, i32 8, i32 11, i32 7, i32 6, i32 12, i32 15>
   return _mm256_shuffle_ps(a, b, 203);
 }
 
+// CHECK-LABEL: define dso_local <2 x double> @test_mm_permute_pd(
+// CHECK-SAME: <2 x double> noundef [[A:%.*]]) #[[ATTR1:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <2 x double>, align 16
+// CHECK-NEXT:    store <2 x double> [[A]], ptr [[A_ADDR]], align 16
+// CHECK-NEXT:    [[TMP0:%.*]] = load <2 x double>, ptr [[A_ADDR]], align 16
+// CHECK-NEXT:    [[PERMIL:%.*]] = shufflevector <2 x double> [[TMP0]], <2 x double> poison, <2 x i32> <i32 1, i32 0>
+// CHECK-NEXT:    ret <2 x double> [[PERMIL]]
+//
 __m128d test_mm_permute_pd(__m128d a) {
-  // CHECK-LABEL: test_mm_permute_pd
-  // CHECK: shufflevector{{.*}}<i32 1, i32 0>
   return _mm_permute_pd(a, 1);
 }
 
+// CHECK-LABEL: define dso_local <4 x double> @test_mm256_permute_pd(
+// CHECK-SAME: <4 x double> noundef [[A:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <4 x double>, align 32
+// CHECK-NEXT:    store <4 x double> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[TMP0:%.*]] = load <4 x double>, ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[PERMIL:%.*]] = shufflevector <4 x double> [[TMP0]], <4 x double> poison, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
+// CHECK-NEXT:    ret <4 x double> [[PERMIL]]
+//
 __m256d test_mm256_permute_pd(__m256d a) {
-  // CHECK-LABEL: test_mm256_permute_pd
-  // CHECK: shufflevector{{.*}}<i32 1, i32 0, i32 3, i32 2>
   return _mm256_permute_pd(a, 5);
 }
 
+// CHECK-LABEL: define dso_local <4 x float> @test_mm_permute_ps(
+// CHECK-SAME: <4 x float> noundef [[A:%.*]]) #[[ATTR1]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <4 x float>, align 16
+// CHECK-NEXT:    store <4 x float> [[A]], ptr [[A_ADDR]], align 16
+// CHECK-NEXT:    [[TMP0:%.*]] = load <4 x float>, ptr [[A_ADDR]], align 16
+// CHECK-NEXT:    [[PERMIL:%.*]] = shufflevector <4 x float> [[TMP0]], <4 x float> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
+// CHECK-NEXT:    ret <4 x float> [[PERMIL]]
+//
 __m128 test_mm_permute_ps(__m128 a) {
-  // CHECK-LABEL: test_mm_permute_ps
-  // CHECK: shufflevector{{.*}}<i32 3, i32 2, i32 1, i32 0>
   return _mm_permute_ps(a, 0x1b);
 }
 
 // Test case for PR12401
+// CHECK-LABEL: define dso_local <4 x float> @test_mm_permute_ps2(
+// CHECK-SAME: <4 x float> noundef [[A:%.*]]) #[[ATTR1]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <4 x float>, align 16
+// CHECK-NEXT:    store <4 x float> [[A]], ptr [[A_ADDR]], align 16
+// CHECK-NEXT:    [[TMP0:%.*]] = load <4 x float>, ptr [[A_ADDR]], align 16
+// CHECK-NEXT:    [[PERMIL:%.*]] = shufflevector <4 x float> [[TMP0]], <4 x float> poison, <4 x i32> <i32 2, i32 1, i32 2, i32 3>
+// CHECK-NEXT:    ret <4 x float> [[PERMIL]]
+//
 __m128 test_mm_permute_ps2(__m128 a) {
-  // CHECK-LABEL: test_mm_permute_ps2
-  // CHECK: shufflevector{{.*}}<i32 2, i32 1, i32 2, i32 3>
   return _mm_permute_ps(a, 0xe6);
 }
 
+// CHECK-LABEL: define dso_local <8 x float> @test_mm256_permute_ps(
+// CHECK-SAME: <8 x float> noundef [[A:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <8 x float>, align 32
+// CHECK-NEXT:    store <8 x float> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[TMP0:%.*]] = load <8 x float>, ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[PERMIL:%.*]] = shufflevector <8 x float> [[TMP0]], <8 x float> poison, <8 x i32> <i32 3, i32 2, i32 1, i32 0, i32 7, i32 6, i32 5, i32 4>
+// CHECK-NEXT:    ret <8 x float> [[PERMIL]]
+//
 __m256 test_mm256_permute_ps(__m256 a) {
-  // CHECK-LABEL: test_mm256_permute_ps
-  // CHECK: shufflevector{{.*}}<i32 3, i32 2, i32 1, i32 0, i32 7, i32 6, i32 5, i32 4>
   return _mm256_permute_ps(a, 0x1b);
 }
 
+// CHECK-LABEL: define dso_local <4 x double> @test_mm256_permute2f128_pd(
+// CHECK-SAME: <4 x double> noundef [[A:%.*]], <4 x double> noundef [[B:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <4 x double>, align 32
+// CHECK-NEXT:    [[B_ADDR:%.*]] = alloca <4 x double>, align 32
+// CHECK-NEXT:    store <4 x double> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    store <4 x double> [[B]], ptr [[B_ADDR]], align 32
+// CHECK-NEXT:    [[TMP0:%.*]] = load <4 x double>, ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[TMP1:%.*]] = load <4 x double>, ptr [[B_ADDR]], align 32
+// CHECK-NEXT:    [[VPERM:%.*]] = shufflevector <4 x double> [[TMP0]], <4 x double> [[TMP1]], <4 x i32> <i32 2, i32 3, i32 6, i32 7>
+// CHECK-NEXT:    ret <4 x double> [[VPERM]]
+//
 __m256d test_mm256_permute2f128_pd(__m256d a, __m256d b) {
-  // CHECK-LABEL: test_mm256_permute2f128_pd
-  // CHECK: shufflevector{{.*}}<i32 2, i32 3, i32 6, i32 7> 
   return _mm256_permute2f128_pd(a, b, 0x31);
 }
 
+// CHECK-LABEL: define dso_local <8 x float> @test_mm256_permute2f128_ps(
+// CHECK-SAME: <8 x float> noundef [[A:%.*]], <8 x float> noundef [[B:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <8 x float>, align 32
+// CHECK-NEXT:    [[B_ADDR:%.*]] = alloca <8 x float>, align 32
+// CHECK-NEXT:    store <8 x float> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    store <8 x float> [[B]], ptr [[B_ADDR]], align 32
+// CHECK-NEXT:    [[TMP0:%.*]] = load <8 x float>, ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[TMP1:%.*]] = load <8 x float>, ptr [[B_ADDR]], align 32
+// CHECK-NEXT:    [[VPERM:%.*]] = shufflevector <8 x float> [[TMP1]], <8 x float> [[TMP0]], <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15>
+// CHECK-NEXT:    ret <8 x float> [[VPERM]]
+//
 __m256 test_mm256_permute2f128_ps(__m256 a, __m256 b) {
-  // CHECK-LABEL: test_mm256_permute2f128_ps
-  // CHECK: shufflevector{{.*}}<i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15>
   return _mm256_permute2f128_ps(a, b, 0x13);
 }
 
+// CHECK-LABEL: define dso_local <4 x i64> @test_mm256_permute2f128_si256(
+// CHECK-SAME: <4 x i64> noundef [[A:%.*]], <4 x i64> noundef [[B:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <4 x i64>, align 32
+// CHECK-NEXT:    [[B_ADDR:%.*]] = alloca <4 x i64>, align 32
+// CHECK-NEXT:    store <4 x i64> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    store <4 x i64> [[B]], ptr [[B_ADDR]], align 32
+// CHECK-NEXT:    [[TMP0:%.*]] = load <4 x i64>, ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[TMP1:%.*]] = bitcast <4 x i64> [[TMP0]] to <8 x i32>
+// CHECK-NEXT:    [[TMP2:%.*]] = load <4 x i64>, ptr [[B_ADDR]], align 32
+// CHECK-NEXT:    [[TMP3:%.*]] = bitcast <4 x i64> [[TMP2]] to <8 x i32>
+// CHECK-NEXT:    [[VPERM:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[TMP3]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
+// CHECK-NEXT:    [[TMP4:%.*]] = bitcast <8 x i32> [[VPERM]] to <4 x i64>
+// CHECK-NEXT:    ret <4 x i64> [[TMP4]]
+//
 __m256i test_mm256_permute2f128_si256(__m256i a, __m256i b) {
-  // CHECK-LABEL: test_mm256_permute2f128_si256
-  // CHECK: shufflevector{{.*}}<i32 0, i32 1, i32 4, i32 5>
   return _mm256_permute2f128_si256(a, b, 0x20);
 }
 
 __m128
+// CHECK-LABEL: define dso_local <4 x float> @test_mm_broadcast_ss(
+// CHECK-SAME: ptr noundef [[__A:%.*]]) #[[ATTR1]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[__A_ADDR_I:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[__F_I:%.*]] = alloca float, align 4
+// CHECK-NEXT:    [[DOTCOMPOUNDLITERAL_I:%.*]] = alloca <4 x float>, align 16
+// CHECK-NEXT:    [[__A_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    store ptr [[__A]], ptr [[__A_ADDR]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__A_ADDR]], align 8
+// CHECK-NEXT:    store ptr [[TMP0]], ptr [[__A_ADDR_I]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[__A_ADDR_I]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load float, ptr [[TMP1]], align 1
+// CHECK-NEXT:    store float [[TMP2]], ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[TMP3:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT_I:%.*]] = insertelement <4 x float> poison, float [[TMP3]], i32 0
+// CHECK-NEXT:    [[TMP4:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT2_I:%.*]] = insertelement <4 x float> [[VECINIT_I]], float [[TMP4]], i32 1
+// CHECK-NEXT:    [[TMP5:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT3_I:%.*]] = insertelement <4 x float> [[VECINIT2_I]], float [[TMP5]], i32 2
+// CHECK-NEXT:    [[TMP6:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT4_I:%.*]] = insertelement <4 x float> [[VECINIT3_I]], float [[TMP6]], i32 3
+// CHECK-NEXT:    store <4 x float> [[VECINIT4_I]], ptr [[DOTCOMPOUNDLITERAL_I]], align 16
+// CHECK-NEXT:    [[TMP7:%.*]] = load <4 x float>, ptr [[DOTCOMPOUNDLITERAL_I]], align 16
+// CHECK-NEXT:    ret <4 x float> [[TMP7]]
+//
 test_mm_broadcast_ss(float const *__a) {
-  // CHECK-LABEL: test_mm_broadcast_ss
-  // CHECK: insertelement <4 x float> {{.*}}, i64 0
-  // CHECK: shufflevector <4 x float> {{.*}}, <4 x float> poison, <4 x i32> zeroinitializer
   return _mm_broadcast_ss(__a);
 }
 
 __m256d
+// CHECK-LABEL: define dso_local <4 x double> @test_mm256_broadcast_sd(
+// CHECK-SAME: ptr noundef [[__A:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[__A_ADDR_I:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[__D_I:%.*]] = alloca double, align 8
+// CHECK-NEXT:    [[DOTCOMPOUNDLITERAL_I:%.*]] = alloca <4 x double>, align 32
+// CHECK-NEXT:    [[__A_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    store ptr [[__A]], ptr [[__A_ADDR]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__A_ADDR]], align 8
+// CHECK-NEXT:    store ptr [[TMP0]], ptr [[__A_ADDR_I]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[__A_ADDR_I]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load double, ptr [[TMP1]], align 1
+// CHECK-NEXT:    store double [[TMP2]], ptr [[__D_I]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load double, ptr [[__D_I]], align 8
+// CHECK-NEXT:    [[VECINIT_I:%.*]] = insertelement <4 x double> poison, double [[TMP3]], i32 0
+// CHECK-NEXT:    [[TMP4:%.*]] = load double, ptr [[__D_I]], align 8
+// CHECK-NEXT:    [[VECINIT2_I:%.*]] = insertelement <4 x double> [[VECINIT_I]], double [[TMP4]], i32 1
+// CHECK-NEXT:    [[TMP5:%.*]] = load double, ptr [[__D_I]], align 8
+// CHECK-NEXT:    [[VECINIT3_I:%.*]] = insertelement <4 x double> [[VECINIT2_I]], double [[TMP5]], i32 2
+// CHECK-NEXT:    [[TMP6:%.*]] = load double, ptr [[__D_I]], align 8
+// CHECK-NEXT:    [[VECINIT4_I:%.*]] = insertelement <4 x double> [[VECINIT3_I]], double [[TMP6]], i32 3
+// CHECK-NEXT:    store <4 x double> [[VECINIT4_I]], ptr [[DOTCOMPOUNDLITERAL_I]], align 32
+// CHECK-NEXT:    [[TMP7:%.*]] = load <4 x double>, ptr [[DOTCOMPOUNDLITERAL_I]], align 32
+// CHECK-NEXT:    ret <4 x double> [[TMP7]]
+//
 test_mm256_broadcast_sd(double const *__a) {
-  // CHECK-LABEL: test_mm256_broadcast_sd
-  // CHECK: insertelement <4 x double> {{.*}}, i64 0
-  // CHECK: shufflevector <4 x double> {{.*}}, <4 x double> poison, <4 x i32> zeroinitializer
   return _mm256_broadcast_sd(__a);
 }
 
 __m256
+// CHECK-LABEL: define dso_local <8 x float> @test_mm256_broadcast_ss(
+// CHECK-SAME: ptr noundef [[__A:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[__A_ADDR_I:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[__F_I:%.*]] = alloca float, align 4
+// CHECK-NEXT:    [[DOTCOMPOUNDLITERAL_I:%.*]] = alloca <8 x float>, align 32
+// CHECK-NEXT:    [[__A_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    store ptr [[__A]], ptr [[__A_ADDR]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[__A_ADDR]], align 8
+// CHECK-NEXT:    store ptr [[TMP0]], ptr [[__A_ADDR_I]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[__A_ADDR_I]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load float, ptr [[TMP1]], align 1
+// CHECK-NEXT:    store float [[TMP2]], ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[TMP3:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT_I:%.*]] = insertelement <8 x float> poison, float [[TMP3]], i32 0
+// CHECK-NEXT:    [[TMP4:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT2_I:%.*]] = insertelement <8 x float> [[VECINIT_I]], float [[TMP4]], i32 1
+// CHECK-NEXT:    [[TMP5:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT3_I:%.*]] = insertelement <8 x float> [[VECINIT2_I]], float [[TMP5]], i32 2
+// CHECK-NEXT:    [[TMP6:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT4_I:%.*]] = insertelement <8 x float> [[VECINIT3_I]], float [[TMP6]], i32 3
+// CHECK-NEXT:    [[TMP7:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT5_I:%.*]] = insertelement <8 x float> [[VECINIT4_I]], float [[TMP7]], i32 4
+// CHECK-NEXT:    [[TMP8:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT6_I:%.*]] = insertelement <8 x float> [[VECINIT5_I]], float [[TMP8]], i32 5
+// CHECK-NEXT:    [[TMP9:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT7_I:%.*]] = insertelement <8 x float> [[VECINIT6_I]], float [[TMP9]], i32 6
+// CHECK-NEXT:    [[TMP10:%.*]] = load float, ptr [[__F_I]], align 4
+// CHECK-NEXT:    [[VECINIT8_I:%.*]] = insertelement <8 x float> [[VECINIT7_I]], float [[TMP10]], i32 7
+// CHECK-NEXT:    store <8 x float> [[VECINIT8_I]], ptr [[DOTCOMPOUNDLITERAL_I]], align 32
+// CHECK-NEXT:    [[TMP11:%.*]] = load <8 x float>, ptr [[DOTCOMPOUNDLITERAL_I]], align 32
+// CHECK-NEXT:    ret <8 x float> [[TMP11]]
+//
 test_mm256_broadcast_ss(float const *__a) {
-  // CHECK-LABEL: test_mm256_broadcast_ss
-  // CHECK: insertelement <8 x float> {{.*}}, i64 0
-  // CHECK: shufflevector <8 x float> {{.*}}, <8 x float> poison, <8 x i32> zeroinitializer
   return _mm256_broadcast_ss(__a);
 }
 
 // Make sure we have the correct mask for each insertf128 case.
 
+// CHECK-LABEL: define dso_local <8 x float> @test_mm256_insertf128_ps_0(
+// CHECK-SAME: <8 x float> noundef [[A:%.*]], <4 x float> noundef [[B:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <8 x float>, align 32
+// CHECK-NEXT:    [[B_ADDR:%.*]] = alloca <4 x float>, align 16
+// CHECK-NEXT:    store <8 x float> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    store <4 x float> [[B]], ptr [[B_ADDR]], align 16
+// CHECK-NEXT:    [[TMP0:%.*]] = load <8 x float>, ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[TMP1:%.*]] = load <4 x float>, ptr [[B_ADDR]], align 16
+// CHECK-NEXT:    [[WIDEN:%.*]] = shufflevector <4 x float> [[TMP1]], <4 x float> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+// CHECK-NEXT:    [[INSERT:%.*]] = shufflevector <8 x float> [[TMP0]], <8 x float> [[WIDEN]], <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7>
+// CHECK-NEXT:    ret <8 x float> [[INSERT]]
+//
 __m256 test_mm256_insertf128_ps_0(__m256 a, __m128 b) {
-  // CHECK-LABEL: test_mm256_insertf128_ps_0
-  // CHECK: shufflevector{{.*}}<i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>
   return _mm256_insertf128_ps(a, b, 0);
 }
 
+// CHECK-LABEL: define dso_local <4 x double> @test_mm256_insertf128_pd_0(
+// CHECK-SAME: <4 x double> noundef [[A:%.*]], <2 x double> noundef [[B:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <4 x double>, align 32
+// CHECK-NEXT:    [[B_ADDR:%.*]] = alloca <2 x double>, align 16
+// CHECK-NEXT:    store <4 x double> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    store <2 x double> [[B]], ptr [[B_ADDR]], align 16
+// CHECK-NEXT:    [[TMP0:%.*]] = load <4 x double>, ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[TMP1:%.*]] = load <2 x double>, ptr [[B_ADDR]], align 16
+// CHECK-NEXT:    [[WIDEN:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+// CHECK-NEXT:    [[INSERT:%.*]] = shufflevector <4 x double> [[TMP0]], <4 x double> [[WIDEN]], <4 x i32> <i32 4, i32 5, i32 2, i32 3>
+// CHECK-NEXT:    ret <4 x double> [[INSERT]]
+//
 __m256d test_mm256_insertf128_pd_0(__m256d a, __m128d b) {
-  // CHECK-LABEL: test_mm256_insertf128_pd_0
-  // CHECK: shufflevector{{.*}}<i32 0, i32 1, i32 6, i32 7>
   return _mm256_insertf128_pd(a, b, 0);
 }
 
+// CHECK-LABEL: define dso_local <4 x i64> @test_mm256_insertf128_si256_0(
+// CHECK-SAME: <4 x i64> noundef [[A:%.*]], <2 x i64> noundef [[B:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <4 x i64>, align 32
+// CHECK-NEXT:    [[B_ADDR:%.*]] = alloca <2 x i64>, align 16
+// CHECK-NEXT:    store <4 x i64> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    store <2 x i64> [[B]], ptr [[B_ADDR]], align 16
+// CHECK-NEXT:    [[TMP0:%.*]] = load <4 x i64>, ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    [[TMP1:%.*]] = bitcast <4 x i64> [[TMP0]] to <8 x i32>
+// CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i64>, ptr [[B_ADDR]], align 16
+// CHECK-NEXT:    [[TMP3:%.*]] = bitcast <2 x i64> [[TMP2]] to <4 x i32>
+// CHECK-NEXT:    [[WIDEN:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+// CHECK-NEXT:    [[INSERT:%.*]] = shufflevector <8 x i32> [[TMP1]], <8 x i32> [[WIDEN]], <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7>
+// CHECK-NEXT:    [[TMP4:%.*]] = bitcast <8 x i32> [[INSERT]] to <4 x i64>
+// CHECK-NEXT:    ret <4 x i64> [[TMP4]]
+//
 __m256i test_mm256_insertf128_si256_0(__m256i a, __m128i b) {
-  // CHECK-LABEL: test_mm256_insertf128_si256_0
-  // X64: shufflevector{{.*}}<i32 0, i32 1, i32 6, i32 7>
-  // X86: shufflevector{{.*}}<i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>
   return _mm256_insertf128_si256(a, b, 0);
 }
 
+// CHECK-LABEL: define dso_local <8 x float> @test_mm256_insertf128_ps_1(
+// CHECK-SAME: <8 x float> noundef [[A:%.*]], <4 x float> noundef [[B:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[A_ADDR:%.*]] = alloca <8 x float>, align 32
+// CHECK-NEXT:    [[B_ADDR:%.*]] = alloca <4 x float>, align 16
+// CHECK-NEXT:    store <8 x float> [[A]], ptr [[A_ADDR]], align 32
+// CHECK-NEXT:    store <4 x float> [[B]], ptr [[B_ADDR]], align 16
+// CHECK-NEXT:    [[TMP0:%.*]] = l...
[truncated]

@efriedma-quic
Collaborator

Please push the revised clang tests as a separate patch.

The idea here makes sense, I guess, but I'm a little worried about unexpected side effects... I'm not sure how reliably the backend handles commuted masks, and I don't think we have any testing infrastructure that would catch that sort of issue.

@goldsteinn
Contributor

@dtcxzyw do you still support riscv codegen diffs?

@RKSimon RKSimon self-requested a review October 24, 2024 14:33
- Remove the `-O3` flag from a couple of clang x86 codegen tests so the
  tests do not need to be updated when optimizations in LLVM change.
- Change the tests to use utils/update_cc_test_checks.py.
- Change from apple/darwin triples to generic x86_64-- and
  i386-- because the triple was not relevant to the tests, and
  `update_cc_test_checks` seems to be unable to handle platforms that
  prepend `_` to function names.
As shufflevector is effectively commutative, we should apply the same
logic as for other commutative operations and order the inputs by their
`getComplexity()` value. This puts things like `undef`, `poison`, and
constants on the right-hand side where possible.

We had a rule that moved `undef` to the right-hand side; it is superseded
by this.
@MatzeB MatzeB force-pushed the instcombine_shufflevector_commutative branch from acb8d21 to f1e2bc1 on October 25, 2024 17:22
@MatzeB
Contributor Author

MatzeB commented Oct 25, 2024

The idea here makes sense, I guess, but I'm a little worried about unexpected side effects... I'm not sure how reliably the backend handles commuted masks, and I don't think we have any testing infrastructure that would catch that sort of issue.

I was wondering about this too, but at least it should be easy to fix things if some targets depend on the order and produce worse code for one ordering but not the other. Given that these targets could have reached this situation even without this change, it may be a good thing to flush out this behavior...

@MatzeB
Contributor Author

MatzeB commented Oct 25, 2024

Please push the revised clang tests as a separate patch.

Submitted #113714 for the test updates (still included here for "stacking")

Collaborator

@davemgreen davemgreen left a comment

I have always expected shuffles to be canonicalized so that the lowest mask lane refers to the first operand. I believe the AArch64 and Arm matching functions rely on that at the moment. https://godbolt.org/z/1rr1E8v1K
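
To illustrate the point (a hypothetical pair, not taken from the godbolt link): both shuffles below compute the same interleave, but only the first has its lowest mask lane referring to the first operand, which is the form the lane-pattern matchers (e.g. zip/uzp recognizers) are said to rely on.

    ; Form with mask lane 0 indexing the first operand, as the backend matchers expect.
    %z = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 4, i32 1, i32 5>

    ; Commuted but equivalent form; mask lane 0 now indexes the second operand.
    %z = shufflevector <4 x i32> %b, <4 x i32> %a, <4 x i32> <i32 4, i32 0, i32 5, i32 1>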

@MatzeB
Contributor Author

MatzeB commented Oct 28, 2024

I have always expected shuffles to be canonicalized so that the lowest mask lane refers to the first operand. I believe the AArch64 and Arm matching functions rely on that at the moment. https://godbolt.org/z/1rr1E8v1K

Thanks for pointing this out! So the change definitely cannot land as-is. I would still argue it is generally the right thing to do: we need to normalize by some rule to enable CSE and to hit the better code-generation paths (as your example also demonstrates).

Normalizing for low complexity on the RHS comes from working on #113746, which slightly simplifies things when we can assume constants are always on the RHS; it is also the well-established pattern for arithmetic operations (add, mul, bitops). But I think that doesn't rule out normalizing to something different at ISel time.
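
For reference, this is the convention InstCombine already applies to commutative arithmetic (an illustrative example, not from this patch):

    %r = add i32 7, %x   ; canonicalized by InstCombine to:
    %r = add i32 %x, 7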

Let me investigate the situation for the popular targets. FWIW: I don't strictly need this rule for my immediate work; this is more of a drive-by fix that seemed worth doing to me. So happy to drop this change if people aren't comfortable with it...

@davemgreen
Collaborator

It sounds OK so long as we can make sure the backend patterns keep working - it sounds like it would be more resilient overall if we matched both forms. I think it was just assumed in the past that it wasn't needed.

@MatzeB
Contributor Author

MatzeB commented Oct 30, 2024

I did some experiments with randomly commuting shufflevectors before instruction selection, and also experimented with additional canonicalization in the SelectionDAG DAGCombine phase. All in all, this produces quite a bit of churn and noise for X86 and AArch64, and I feel somewhat uncomfortable pushing this now. Fixing the backends to be resilient and consistently produce the best code seems like a bigger project, so I will not move forward with this PR for now.

My experiments are at https://github.com/MatzeB/llvm-project/commits/shuffle_rando_experiment/ - in particular, the last commit, which adjusts the test cases to the canonicalization, shows that the backend patterns are currently fragile...

@MatzeB MatzeB closed this Apr 7, 2025
@MatzeB MatzeB deleted the instcombine_shufflevector_commutative branch April 7, 2025 22:39