Skip to content

[SYCL] Extend -fsycl-device-obj to dump assembly #17390

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Apr 11, 2025

Conversation

jchlanda
Copy link
Contributor

This patch adds asm value to -fsycl-device-obj to enable dumping assembly (or PTX) of kernels.

This allows one to compile SYCL device code to plain PTX directly:
-fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-obj=ptx -S
@jchlanda jchlanda marked this pull request as ready for review March 11, 2025 13:13
@jchlanda jchlanda requested review from a team as code owners March 11, 2025 13:13
@jchlanda
Copy link
Contributor Author

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

@jchlanda
Copy link
Contributor Author

Fixes: #8797

@mdtoguchi
Copy link
Contributor

What is the expectation when a fat object created with -fsycl-device-obj=asm is consumed? Is there an expectation for this to work - or is this more of a mechanism to dump assembly and not part of a regular workflow when producing final binaries?

@bader
Copy link
Contributor

bader commented Mar 11, 2025

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

@jchlanda, what do you mean? Please, take a look at #17113 or #16834. Do they look like satisfying your requirement?

@jchlanda
Copy link
Contributor Author

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

@jchlanda, what do you mean? Please, take a look at #17113 or #16834. Do they look like satisfying your requirement?

Hi @bader
Apologies, I should have been more clear. With the squash and merge I was wondering if we could somehow preserve the ownership of the commits. As I've mentioned, the first commit in this PR is not my work, I've git amd @stefanatwork older code.

@jchlanda
Copy link
Contributor Author

What is the expectation when a fat object created with -fsycl-device-obj=asm is consumed? Is there an expectation for this to work - or is this more of a mechanism to dump assembly and not part of a regular workflow when producing final binaries?

hi @mdtoguchi

This is meant to provide a convenient way of inspecting kernel's assembly/PTX without going to the effort of saving temps.

I personally would always use it alongside -fsycl-device-only.
For a silly little sample:

Q.submit([&](sycl::handler &cgh) {
   sycl::accessor AccA(BufA, cgh, sycl::write_only);
   cgh.parallel_for(sycl::range<1>{32},
                    [=](sycl::id<1> idx) { AccA[idx] = 42; });
 }).wait_and_throw();

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-obj=asm device_obj.cpp -o -
produces:

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_50
.address_size 64

        // .weak        _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE // -- Begin function _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE
.extern .func  (.param .b64 func_retval0) _Z28__spirv_GlobalInvocationId_xv
()
;
.extern .func  (.param .b64 func_retval0) _Z20__spirv_GlobalSize_xv
()
;
.extern .func  (.param .b64 func_retval0) _Z22__spirv_GlobalOffset_xv
()
;
                                        // @_ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE
.weak .entry _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE(
        .param .align 8 .b8 _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_0[8],
        .param .u64 .ptr .global .align 4 _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_1,
        .param .align 8 .b8 _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_2[8]
)
{
        .reg .pred      %p<3>;
        .reg .b32       %r<2>;
        .reg .b64       %rd<21>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [_ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_0];
        { // callseq 0, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z28__spirv_GlobalInvocationId_xv,
        (
        );
        ld.param.b64    %rd20, [retval0];
        } // callseq 0
        { // callseq 1, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z20__spirv_GlobalSize_xv,
        (
        );
        ld.param.b64    %rd13, [retval0];
        } // callseq 1
        { // callseq 2, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z22__spirv_GlobalOffset_xv,
        (
        );
        ld.param.b64    %rd14, [retval0];
        } // callseq 2
        setp.ge.u64     %p1, %rd20, %rd1;
        @%p1 bra        $L__BB0_3;
// %bb.1:                               // %for.body.i.preheader
        ld.param.u64    %rd11, [_ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_1];
        ld.param.u64    %rd2, [_ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_2];
        shl.b64         %rd16, %rd20, 2;
        shl.b64         %rd17, %rd2, 2;
        add.s64         %rd18, %rd16, %rd17;
        add.s64         %rd19, %rd11, %rd18;
        shl.b64         %rd6, %rd13, 2;
$L__BB0_2:                              // %for.body.i
                                        // =>This Inner Loop Header: Depth=1
        mov.b32         %r1, 42;
        st.global.u32   [%rd19], %r1;
        add.s64         %rd20, %rd20, %rd13;
        add.s64         %rd19, %rd19, %rd6;
        setp.lt.u64     %p2, %rd20, %rd1;
        @%p2 bra        $L__BB0_2;
$L__BB0_3:                              // %_ZNK4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EclES4_.exit
        ret;
                                        // -- End function
}
        // .weak        _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE_ // -- Begin function _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE_
.weak .entry _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE_(
        .param .u64 .ptr .global .align 4 _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE__param_0,
        .param .align 8 .b8 _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE__param_1[8]
)                                       // @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE_
{
        .reg .b32       %r<2>;
        .reg .b64       %rd<13>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE__param_0];
        ld.param.u64    %rd2, [_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE__param_1];
        shl.b64         %rd3, %rd2, 2;
        add.s64         %rd4, %rd1, %rd3;
        { // callseq 3, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z28__spirv_GlobalInvocationId_xv,
        (
        );
        ld.param.b64    %rd5, [retval0];
        } // callseq 3
        { // callseq 4, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z20__spirv_GlobalSize_xv,
        (
        );
        ld.param.b64    %rd7, [retval0];
        } // callseq 4
        { // callseq 5, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z22__spirv_GlobalOffset_xv,
        (
        );
        ld.param.b64    %rd9, [retval0];
        } // callseq 5
        shl.b64         %rd11, %rd5, 2;
        add.s64         %rd12, %rd4, %rd11;
        mov.b32         %r1, 42;
        st.global.u32   [%rd12], %r1;
        ret;
                                        // -- End function
}

Which is what I'd expect to see.
Now, if things were going to get bundled, we would need to inform the compiler that it needs to generate assembly for the host as well, so this:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-obj=asm -S device_obj.cpp -o - also works as expected. As we get a bundled assembly for both PTX and X86 (redacted file):

# __CLANG_OFFLOAD_BUNDLE____START__ sycl-nvptx64-nvidia-cuda-sm_50

... the same PTX as above ...

# __CLANG_OFFLOAD_BUNDLE____END__ sycl-nvptx64-nvidia-cuda-sm_50

# __CLANG_OFFLOAD_BUNDLE____START__ host-x86_64-unknown-linux-gnu
        .file   "ptx_outpu.cpp"
        .text
        .globl  main                            # -- Begin function main

... the rest of X86 asm ...

# __CLANG_OFFLOAD_BUNDLE____END__ host-x86_64-unknown-linux-gnu

And finally, when targeting both host and device, but the -S is not present:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-obj=asm device_obj.cpp -o a.out
we get a warning message and the compiler proceeds to create a binary executable as normal:
clang++: warning: argument unused during compilation: '-fsycl-device-obj=asm' [-Wunused-command-line-argument]

@bader
Copy link
Contributor

bader commented Mar 12, 2025

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

@jchlanda, what do you mean? Please, take a look at #17113 or #16834. Do they look like satisfying your requirement?

Hi @bader Apologies, I should have been more clear. With the squash and merge I was wondering if we could somehow preserve the ownership of the commits. As I've mentioned, the first commit in this PR is not my work, I've git amd @stefanatwork older code.

I understand that. Both mentioned PRs have commits done by multiple authors. My understanding is that git log for squashed commit will look like this:

commit ...
Author: Werner, Stefan <[email protected]>
...
Co-authored-by: Jakub Chlanda <[email protected]>

Author is git commit attribute and Co-authored-by is part of the commit message.

Does it work for you?

@jchlanda
Copy link
Contributor Author

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

@jchlanda, what do you mean? Please, take a look at #17113 or #16834. Do they look like satisfying your requirement?

Hi @bader Apologies, I should have been more clear. With the squash and merge I was wondering if we could somehow preserve the ownership of the commits. As I've mentioned, the first commit in this PR is not my work, I've git amd @stefanatwork older code.

I understand that. Both mentioned PRs have commits done by multiple authors. My understanding is that git log for squashed commit will look like this:

commit ...
Author: Werner, Stefan <[email protected]>
...
Co-authored-by: Jakub Chlanda <[email protected]>

Author is git commit attribute and Co-authored-by is part of the commit message.

Does it work for you?

Yeap, works for me. Thank you.

@mdtoguchi
Copy link
Contributor

mdtoguchi commented Mar 12, 2025

@jchlanda, thanks for the usage information. Your usage model has some overlap with the use of -fsycl-dump-device-code=arg. Use of -fsycl-dump-device-code will essentially intercept the regular -fsycl flow and retain the generated assembly file when using the nvptx64 target. @srividya-sundaram for awareness.

As for the consumption, I'm more concerned about usage of something like:
clang++ -fsycl -fsycl-device-obj=asm -fsycl-targets=nvptx64-nvidia-cuda -c file.cpp
clang++ -fsycl -fsycl-device-obj=asm -fsycl-targets=nvptx64-nvidia-cuda file.o
Where the consumed device side of the object has asm instead of the expected LLVM-IR, and the assembly will go directly to the llvm-link call.

@jchlanda
Copy link
Contributor Author

@jchlanda, thanks for the usage information. Your usage model has some overlap with the use of -fsycl-dump-device-code=arg. Use of -fsycl-dump-device-code will essentially intercept the regular -fsycl flow and retain the generated assembly file when using the nvptx64 target. @srividya-sundaram for awareness.

As for the consumption, I'm more concerned about usage of something like: clang++ -fsycl -fsycl-device-obj=asm -fsycl-targets=nvptx64-nvidia-cuda -c file.cpp clang++ -fsycl -fsycl-device-obj=asm -fsycl-targets=nvptx64-nvidia-cuda file.o Where the consumed device side of the object has asm instead of the expected LLVM-IR, and the assembly will go directly to the llvm-link call.

@mdtoguchi, so it seems that -fsycl-device-obj can only be used alongside -S, or -c. In all honesty I did not realise that you could use it while creating a library, and you are right the switch trips up llvm-link. I don't have a horse in that race, was simply reviving an old patch that seemed quite useful. I suppose the two options we have now are:

  • issue a warning and ignore the -fsycl-device-obj=asm when creating a library (-c),
  • abandon this patch.

Seems like -fsycl-dump-device-code already provides the same functionality, so we are not loosing much. I'm happy to follow your judgement on this though.

@jchlanda
Copy link
Contributor Author

jchlanda commented Apr 1, 2025

@intel/dpcpp-doc-reviewers @mdtoguchi is there anything else you'd like me to do for this PR? Thank you.

@npmiller
Copy link
Contributor

npmiller commented Apr 9, 2025

ping @intel/dpcpp-doc-reviewers @mdtoguchi this is ready to review

@jchlanda
Copy link
Contributor Author

@intel/dpcpp-doc-reviewers could you please have a look at this PR?

@@ -348,7 +348,8 @@ and not recommended to use in production environment.
**`-fsycl-device-obj=<arg>`** [EXPERIMENTAL]

Specify format of device code stored in the resulting object. The <arg> can
be one of the following: "spirv" - SPIR-V is emitted, "llvmir" - LLVM-IR
be one of the following: "spirv" - SPIR-V, "asm" - assembly output when
possible (PTX, when targetting Nvidia devices) , or "llvmir" - LLVM-IR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
possible (PTX, when targetting Nvidia devices) , or "llvmir" - LLVM-IR
possible (PTX, when targeting NVIDIA devices), or "llvmir" - LLVM-IR

@npmiller
Copy link
Contributor

@intel/llvm-gatekeepers this is ready to merge

@jchlanda
Copy link
Contributor Author

@intel/llvm-gatekeepers this should be ready to go.

Thank you.

@martygrant martygrant merged commit a89ae41 into intel:sycl Apr 11, 2025
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants