-
Notifications
You must be signed in to change notification settings - Fork 14.2k
[BPF] Use mul for certain div/mod operations #110712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The motivation example likes below $ cat t1.c struct S { int var[3]; }; int foo1 (struct S *a, struct S *b) { return a - b; } For cpu v1/v2/v3, the compilation will fail with the following error: $ clang --target=bpf -O2 -c t1.c -mcpu=v3 t1.c:4:5: error: unsupported signed division, please convert to unsigned div/mod. 4 | int foo1 (struct S *a, struct S *b) | ^ 1 error generated. The reason is that sdiv/smod is only supported at -mcpu=v4. At cpu v1/v2/v3, only udiv/umod is supported. But the above example (for func foo1()) is reasonable common and user has to workaround the compilation failure by using udiv with conditionals. For x86, for the above t1.c, compile and dump the asm code like below: $ clang -O2 -c t1.c && llvm-objdump -d t1.o 0000000000000000 <foo1>: 0: 48 29 f7 subq %rsi, %rdi 3: 48 c1 ef 02 shrq $0x2, %rdi 7: 69 c7 ab aa aa aa imull $0xaaaaaaab, %edi, %eax # imm = 0xAAAAAAAB d: c3 retq Basically sdiv can be replaced with sub, shr and imul. Latest gcc-bpf is also able to generate code similar to x86 with -mcpu=v1. See https://godbolt.org/z/feP9ETbjj Generally multiplication is cheaper than divide. LLVM SelectionDAG already supports to generate alternative codes (mul etc.) for div/mod operations if the divisor is constant and if the cost of division is not cheap. For BPF backend, let us use multiplication instead of divide whenever possible. The following is the list of div/mod operations which can be converted to mul. - 32bit signed/unsigned div/mod - 64bit signed/unsigned div with 'exact' flag in IR where 'exact' means if replacing div with mod, it is guaranteed that modulo result will be 0. The above foo1() case belongs here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@@ -118,7 +118,9 @@ class BPFTargetLowering : public TargetLowering { | |||
return Op.size() >= 8 ? MVT::i64 : MVT::i32; | |||
} | |||
|
|||
bool isIntDivCheap(EVT VT, AttributeList Attr) const override { return true; } | |||
bool isIntDivCheap(EVT VT, AttributeList Attr) const override { | |||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to keep this "cheap" when cpuv4 is enabled.
E.g. as far as I understand for arm64 BPF backend division/modulo operations are translated as single instructions.
Thus, with the following patch on top allow-diff-for-cpuv4.txt:
$ cat t1.c
struct S {
int var[3];
};
int foo1 (struct S *a, struct S *b)
{
return a - b;
}
$ clang --target=bpf -mcpu=v3 -O2 -S t1.c -o -
...
foo1: # @foo1
# %bb.0: # %entry
r0 = r1
r0 -= r2
r0 >>= 2
w0 *= -1431655765
exit
...
$ clang --target=bpf -mcpu=v4 -O2 -S t1.c -o -
...
foo1: # @foo1
# %bb.0: # %entry
r0 = r1
r0 -= r2
r0 s/= 12
exit
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to keep this "cheap" when cpuv4 is enabled.
No. Integer div is the slowest operation in pretty much every cpu architecture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see both x86 and arm LLVM backends define this as expensive.
I somehow understood this as a knob whether to generate sdiv at all, while this is only an optimization knob.
However, the commit description says:
The reason is that sdiv/smod is only supported at -mcpu=v4. At cpu v1/v2/v3, only udiv/umod is supported.
...
Basically sdiv can be replaced with sub, shr and imul.
Do we also need to add some generic lowering for cpuv{1,2,3}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to add some generic lowering for cpuv{1,2,3}?
you mean to lower sdiv into udiv plus conditionals ?
It's doable, but let's do it in the separate patch and it's not a high priority imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to add some generic lowering for cpuv{1,2,3}?
you mean to lower sdiv into udiv plus conditionals ?
Yes
It's doable, but let's do it in the separate patch and it's not a high priority imo.
If we are sure that optimization pattern that fires in this case would cover all other uses, then yes, lower priority.
For
This is my original proposal. But after discussing with @4ast, looks like using mul etc. has better performance and that is why we did this by utilizing LLVM existing features. For things not covered (e.g. r1 s/= r2 for cpu v1/v2/v3), udiv+conditional approach can be experimented later if there is a real need. |
The motivation example likes below ``` $ cat t1.c struct S { int var[3]; }; int foo1 (struct S *a, struct S *b) { return a - b; } ``` For cpu v1/v2/v3, the compilation will fail with the following error: ``` $ clang --target=bpf -O2 -c t1.c -mcpu=v3 t1.c:4:5: error: unsupported signed division, please convert to unsigned div/mod. 4 | int foo1 (struct S *a, struct S *b) | ^ 1 error generated. ``` The reason is that sdiv/smod is only supported at -mcpu=v4. At cpu v1/v2/v3, only udiv/umod is supported. But the above example (for func foo1()) is reasonable common and user has to workaround the compilation failure by using udiv with conditionals. For x86, for the above t1.c, compile and dump the asm code like below: ``` $ clang -O2 -c t1.c && llvm-objdump -d t1.o 0000000000000000 <foo1>: 0: 48 29 f7 subq %rsi, %rdi 3: 48 c1 ef 02 shrq $0x2, %rdi 7: 69 c7 ab aa aa aa imull $0xaaaaaaab, %edi, %eax # imm = 0xAAAAAAAB d: c3 retq ``` Basically sdiv can be replaced with sub, shr and imul. Latest gcc-bpf is also able to generate code similar to x86 with -mcpu=v1. See https://godbolt.org/z/feP9ETbjj Generally multiplication is cheaper than divide. LLVM SelectionDAG already supports to generate alternative codes (mul etc.) for div/mod operations if the divisor is constant and if the cost of division is not cheap. For BPF backend, let us use multiplication instead of divide whenever possible. The following is the list of div/mod operations which can be converted to mul. - 32bit signed/unsigned div/mod with constant divisor - 64bit signed/unsigned div with constant divisor and with 'exact' flag in IR where 'exact' means if replacing div with mod, it is guaranteed that modulo result will be 0. The above foo1() case belongs here.
The motivation example likes below
For cpu v1/v2/v3, the compilation will fail with the following error:
The reason is that sdiv/smod is only supported at -mcpu=v4. At cpu v1/v2/v3, only udiv/umod is supported.
But the above example (for func foo1()) is reasonable common and user has to workaround the compilation failure by using udiv with conditionals.
For x86, for the above t1.c, compile and dump the asm code like below:
Basically sdiv can be replaced with sub, shr and imul. Latest gcc-bpf is also able to generate code similar to x86 with -mcpu=v1. See https://godbolt.org/z/feP9ETbjj
Generally multiplication is cheaper than divide. LLVM SelectionDAG already supports to generate alternative codes (mul etc.) for div/mod operations if the divisor is constant and if the cost of division is not cheap. For BPF backend, let us use multiplication instead of divide whenever possible. The following is the list of div/mod operations which can be converted to mul.