Skip to content

[BPF] Use mul for certain div/mod operations #110712

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 1, 2024

Conversation

yonghong-song
Copy link
Contributor

@yonghong-song yonghong-song commented Oct 1, 2024

The motivation example likes below

  $ cat t1.c
  struct S {
    int var[3];
  };
  int foo1 (struct S *a, struct S *b)
  {
      return a - b;
  }

For cpu v1/v2/v3, the compilation will fail with the following error:

  $ clang --target=bpf -O2 -c t1.c -mcpu=v3
  t1.c:4:5: error: unsupported signed division, please convert to unsigned div/mod.
  4 | int foo1 (struct S *a, struct S *b)
        |     ^
  1 error generated.

The reason is that sdiv/smod is only supported at -mcpu=v4. At cpu v1/v2/v3, only udiv/umod is supported.

But the above example (for func foo1()) is reasonable common and user has to workaround the compilation failure by using udiv with conditionals.

For x86, for the above t1.c, compile and dump the asm code like below:

  $ clang -O2 -c t1.c && llvm-objdump -d t1.o
  0000000000000000 <foo1>:
       0: 48 29 f7                      subq    %rsi, %rdi
       3: 48 c1 ef 02                   shrq    $0x2, %rdi
       7: 69 c7 ab aa aa aa             imull   $0xaaaaaaab, %edi, %eax # imm = 0xAAAAAAAB
       d: c3                            retq

Basically sdiv can be replaced with sub, shr and imul. Latest gcc-bpf is also able to generate code similar to x86 with -mcpu=v1. See https://godbolt.org/z/feP9ETbjj

Generally multiplication is cheaper than divide. LLVM SelectionDAG already supports to generate alternative codes (mul etc.) for div/mod operations if the divisor is constant and if the cost of division is not cheap. For BPF backend, let us use multiplication instead of divide whenever possible. The following is the list of div/mod operations which can be converted to mul.

  • 32bit signed/unsigned div/mod with constant divisor
  • 64bit signed/unsigned div with constant divisor and with 'exact' flag in IR where 'exact' means if replacing div with mod, it is guaranteed that modulo result will be 0. The above foo1() case belongs here.

The motivation example likes below
  $ cat t1.c
  struct S {
    int var[3];
  };
  int foo1 (struct S *a, struct S *b)
  {
      return a - b;
  }

For cpu v1/v2/v3, the compilation will fail with the following error:
  $ clang --target=bpf -O2 -c t1.c -mcpu=v3
  t1.c:4:5: error: unsupported signed division, please convert to unsigned div/mod.
  4 | int foo1 (struct S *a, struct S *b)
        |     ^
  1 error generated.

The reason is that sdiv/smod is only supported at -mcpu=v4.
At cpu v1/v2/v3, only udiv/umod is supported.

But the above example (for func foo1()) is reasonable common and
user has to workaround the compilation failure by using udiv with
conditionals.

For x86, for the above t1.c, compile and dump the asm code like below:
  $ clang -O2 -c t1.c && llvm-objdump -d t1.o
  0000000000000000 <foo1>:
       0: 48 29 f7                      subq    %rsi, %rdi
       3: 48 c1 ef 02                   shrq    $0x2, %rdi
       7: 69 c7 ab aa aa aa             imull   $0xaaaaaaab, %edi, %eax # imm = 0xAAAAAAAB
       d: c3                            retq

Basically sdiv can be replaced with sub, shr and imul. Latest gcc-bpf
is also able to generate code similar to x86 with -mcpu=v1.
See https://godbolt.org/z/feP9ETbjj

Generally multiplication is cheaper than divide. LLVM SelectionDAG
already supports to generate alternative codes (mul etc.) for div/mod
operations if the divisor is constant and if the cost of division is
not cheap. For BPF backend, let us use multiplication instead of divide
whenever possible. The following is the list of div/mod operations which
can be converted to mul.
  - 32bit signed/unsigned div/mod
  - 64bit signed/unsigned div with 'exact' flag in IR where 'exact' means
    if replacing div with mod, it is guaranteed that modulo result will be 0.
    The above foo1() case belongs here.
@yonghong-song yonghong-song requested review from 4ast and eddyz87 October 1, 2024 17:59
Copy link
Member

@4ast 4ast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -118,7 +118,9 @@ class BPFTargetLowering : public TargetLowering {
return Op.size() >= 8 ? MVT::i64 : MVT::i32;
}

bool isIntDivCheap(EVT VT, AttributeList Attr) const override { return true; }
bool isIntDivCheap(EVT VT, AttributeList Attr) const override {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to keep this "cheap" when cpuv4 is enabled.
E.g. as far as I understand for arm64 BPF backend division/modulo operations are translated as single instructions.
Thus, with the following patch on top allow-diff-for-cpuv4.txt:

$ cat t1.c
struct S {
  int var[3];
};
int foo1 (struct S *a, struct S *b)
{
    return a - b;
}

$ clang --target=bpf -mcpu=v3 -O2 -S t1.c -o -
        ...
foo1:                                   # @foo1
# %bb.0:                                # %entry
	r0 = r1
	r0 -= r2
	r0 >>= 2
	w0 *= -1431655765
	exit
        ...

$ clang --target=bpf -mcpu=v4 -O2 -S t1.c -o -
        ...
foo1:                                   # @foo1
# %bb.0:                                # %entry
	r0 = r1
	r0 -= r2
	r0 s/= 12
	exit
        ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eddyz87 From @4ast the point is that even a single divide is expensive than multiplication/shift/... etc. The number of insns is not a criteria here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to keep this "cheap" when cpuv4 is enabled.

No. Integer div is the slowest operation in pretty much every cpu architecture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see both x86 and arm LLVM backends define this as expensive.
I somehow understood this as a knob whether to generate sdiv at all, while this is only an optimization knob.

However, the commit description says:

The reason is that sdiv/smod is only supported at -mcpu=v4. At cpu v1/v2/v3, only udiv/umod is supported.
...
Basically sdiv can be replaced with sub, shr and imul.

Do we also need to add some generic lowering for cpuv{1,2,3}?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to add some generic lowering for cpuv{1,2,3}?

you mean to lower sdiv into udiv plus conditionals ?
It's doable, but let's do it in the separate patch and it's not a high priority imo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to add some generic lowering for cpuv{1,2,3}?

you mean to lower sdiv into udiv plus conditionals ?

Yes

It's doable, but let's do it in the separate patch and it's not a high priority imo.

If we are sure that optimization pattern that fires in this case would cover all other uses, then yes, lower priority.

@yonghong-song
Copy link
Contributor Author

yonghong-song commented Oct 1, 2024

For

you mean to lower sdiv into udiv plus conditionals ?

This is my original proposal. But after discussing with @4ast, looks like using mul etc. has better performance and that is why we did this by utilizing LLVM existing features. For things not covered (e.g. r1 s/= r2 for cpu v1/v2/v3), udiv+conditional approach can be experimented later if there is a real need.

@yonghong-song yonghong-song merged commit 38a8000 into llvm:main Oct 1, 2024
9 checks passed
Sterling-Augustine pushed a commit to Sterling-Augustine/llvm-project that referenced this pull request Oct 3, 2024
The motivation example likes below
```
  $ cat t1.c
  struct S {
    int var[3];
  };
  int foo1 (struct S *a, struct S *b)
  {
      return a - b;
  }
```
For cpu v1/v2/v3, the compilation will fail with the following error:
```
  $ clang --target=bpf -O2 -c t1.c -mcpu=v3
  t1.c:4:5: error: unsupported signed division, please convert to unsigned div/mod.
  4 | int foo1 (struct S *a, struct S *b)
        |     ^
  1 error generated.
```
The reason is that sdiv/smod is only supported at -mcpu=v4. At cpu
v1/v2/v3, only udiv/umod is supported.

But the above example (for func foo1()) is reasonable common and user
has to workaround the compilation failure by using udiv with
conditionals.

For x86, for the above t1.c, compile and dump the asm code like below:
```
  $ clang -O2 -c t1.c && llvm-objdump -d t1.o
  0000000000000000 <foo1>:
       0: 48 29 f7                      subq    %rsi, %rdi
       3: 48 c1 ef 02                   shrq    $0x2, %rdi
       7: 69 c7 ab aa aa aa             imull   $0xaaaaaaab, %edi, %eax # imm = 0xAAAAAAAB
       d: c3                            retq
```
Basically sdiv can be replaced with sub, shr and imul. Latest gcc-bpf is
also able to generate code similar to x86 with -mcpu=v1. See
https://godbolt.org/z/feP9ETbjj

Generally multiplication is cheaper than divide. LLVM SelectionDAG
already supports to generate alternative codes (mul etc.) for div/mod
operations if the divisor is constant and if the cost of division is not
cheap. For BPF backend, let us use multiplication instead of divide
whenever possible. The following is the list of div/mod operations which
can be converted to mul.
  - 32bit signed/unsigned div/mod with constant divisor
- 64bit signed/unsigned div with constant divisor and with 'exact' flag
in IR where 'exact' means if replacing div with mod, it is guaranteed
that modulo result will be 0. The above foo1() case belongs here.
@yonghong-song yonghong-song deleted the bpf-fix-ptr-sub branch February 8, 2025 06:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants