[NVPTX] Custom lower integer<->bf16 conversions for sm_80 #74827

d0k · 2023-12-08T11:23:11Z

sm_80 only has f32->bf16 conversions; the integer conversions arrived with sm_90. Use a two-step conversion for sm_80.

There doesn't seem to be a way to express this promotion directly within the legalization framework, so fall back on Custom lowering.
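
To make the two-step idea concrete, here is a minimal host-side sketch of routing integer<->bf16 conversions through f32. It is an emulation for illustration only, not what the backend emits: the function names are hypothetical, bf16 values are carried as raw 16-bit patterns, and round-to-nearest-even is assumed for the f32->bf16 step.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// f32 -> bf16 (raw 16-bit pattern), round-to-nearest-even.
// Assumption: NaNs are simply quieted; real hardware may pick another payload.
inline uint16_t f32_to_bf16_bits(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  if ((bits & 0x7fffffffu) > 0x7f800000u)  // NaN input
    return static_cast<uint16_t>((bits >> 16) | 0x0040u);
  // Add 0x7fff plus the lowest kept bit so that exact ties round to even.
  uint32_t rounding_bias = 0x7fffu + ((bits >> 16) & 1u);
  return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

// bf16 -> f32 is exact: bf16 is the upper half of an f32.
inline float bf16_bits_to_f32(uint16_t h) {
  uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Two-step signed int -> bf16: widen to f32 first, then narrow to bf16.
inline uint16_t sint_to_bf16_two_step(int32_t v) {
  float wide = static_cast<float>(v);  // step 1: int -> f32
  return f32_to_bf16_bits(wide);       // step 2: f32 -> bf16
}

// Two-step bf16 -> signed int: extend to f32, then truncate toward zero.
// The caller must keep the value within int32_t range (otherwise undefined).
inline int32_t bf16_to_sint_two_step(uint16_t h) {
  return static_cast<int32_t>(bf16_bits_to_f32(h));
}

// Small usage example with hand-checked bit patterns.
inline void check_examples() {
  assert(sint_to_bf16_two_step(1) == 0x3F80);   // 1.0
  assert(sint_to_bf16_two_step(257) == 0x4380); // tie rounds to 256
  assert(bf16_to_sint_two_step(0x4040) == 3);   // 3.0 -> 3
}
```

Because bf16 keeps only 8 significant bits, integers above 256 are generally not exactly representable; 257 lands halfway between 256 and 258 and rounds to the even neighbour, 256.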


We tried this before with an intrinsic, but that breaks vectorization. Relying on native LLVM types does not, and it delivers the same code improvements. The downside is that LLVM now knows the value is a bfloat instead of an i16 and will optimize based on that. While making this change I had to patch a number of holes in the NVPTX LLVM backend; there might be more. Depends on llvm/llvm-project#74827. PiperOrigin-RevId: 589102456

Artem-B

LGTM in general, but I'm curious whether FP_TO_BF16 and BF16_TO_FP would produce better/worse/same SASS for these conversions.

Artem-B · 2023-12-11T18:37:10Z

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

@@ -766,6 +766,12 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
      AddPromotedToType(Op, MVT::bf16, MVT::f32);
  }

+  for (MVT VT : {MVT::i1, MVT::i16, MVT::i32, MVT::i64}) {
+    setOperationAction(
+        {ISD::SINT_TO_FP, ISD::UINT_TO_FP, ISD::FP_TO_SINT, ISD::FP_TO_UINT},


Should we make it conditional on SM/PTX here, instead of checking in the custom lowering?

Makes sense, done.

It would be useful to update the review with the latest changes. I was puzzled for a bit to see this pull request closed with this item marked as done but unchanged. I then checked the actual commit and found the expected changes present there.
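
The suggestion in this thread, deciding at setup time based on SM/PTX version rather than inside the custom lowering, can be modelled with a small standalone sketch. Everything here is hypothetical: `GpuTarget`, `LoweringAction`, and `int_bf16_conversion_action` are illustrative names, not LLVM APIs, and the PTX 7.8 threshold for the sm_90 conversions is an assumption. Targets older than sm_80, which lack bf16 conversions entirely, are outside the scope of this sketch.

```cpp
// Hypothetical stand-in for a subtarget query; not the LLVM API.
struct GpuTarget {
  unsigned sm_version;   // e.g. 80 for sm_80, 90 for sm_90
  unsigned ptx_version;  // e.g. 78 for PTX 7.8
};

enum class LoweringAction { Legal, Custom };

// Decide once, up front, how integer<->bf16 conversions are treated.
// Assumption: native integer<->bf16 cvt needs both sm_90 and PTX 7.8.
inline LoweringAction int_bf16_conversion_action(const GpuTarget &t) {
  const bool has_native = t.sm_version >= 90 && t.ptx_version >= 78;
  return has_native ? LoweringAction::Legal : LoweringAction::Custom;
}
```

With this shape, an sm_80 target always takes the custom (two-step) path, while an sm_90 target takes it only when the PTX version is too old to express the direct conversion.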


d0k · 2023-12-11T20:04:19Z

Thanks for the review :)



We tried this before with an intrinsic, but that breaks vectorization. Relying on native LLVM types does not, and it delivers the same code improvements. The downside is that LLVM now knows the value is a bfloat instead of an i16 and will optimize based on that. While making this change I had to patch a number of holes in the NVPTX LLVM backend; there might be more. Depends on llvm/llvm-project#74827. PiperOrigin-RevId: 590118269


Before this PR, there was some special handling for conversions to and from bf16. Presumably, this existed because, in Ampere, the PTX "cvt" instruction doesn't support conversions to/from bf16. However, Hopper _does_ support direct conversions to/from bf16, so this PR removes this special handling in order to make use of the direct cvt instructions. For Ampere, it looks like the special handling is no longer needed (perhaps thanks to llvm/llvm-project#74827?)

…ng (#4281)

This PR removes some special handling for int->bf16 and bf16->int conversions in the TritonNVIDIAGPU->LLVM lowerings, in order to support, e.g., `cvt.bf16.s32` and `cvt.s32.bf16` instructions that are now available on Hopper.

Before this PR, there was some special handling for conversions to and from bf16; for int->bf16, the conversion would be done as an int->fp32 followed by fp32->bf16. Presumably, this was done because, before sm90, the PTX "cvt" instruction doesn't support conversions to/from bf16. However, sm90 _does_ support direct conversions to/from bf16, so this PR removes this special handling in order to make use of the direct cvt instructions. For Ampere, it looks like the special handling is no longer needed and LLVM handles the details of different hardware implementations (perhaps thanks to llvm/llvm-project#74827?)

The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.
- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)


d0k requested a review from Artem-B December 8, 2023 11:23

copybara-service bot mentioned this pull request Dec 11, 2023

[XLA:GPU] Use f32->bfloat conversion instructions on sm_80+ openxla/xla#7666

Merged

Artem-B approved these changes Dec 11, 2023


d0k closed this Dec 11, 2023

davidberard98 mentioned this pull request Jul 8, 2024

[BACKEND] Remove special handling for bf16 in fp->int, int->fp handling triton-lang/triton#4281

Merged

7 tasks

Artem-B mentioned this pull request Dec 2, 2024

[NVPTX] Port code to llvm/lib/CodeGen/SelectionDAG/* #116695

Open
