[SYCL] Extend -fsycl-device-obj to dump assembly #17390

jchlanda · 2025-03-11T10:48:13Z

This patch adds asm value to -fsycl-device-obj to enable dumping assembly (or PTX) of kernels.

This allows one to compile SYCL device code to plain PTX directly: -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-obj=ptx -S

clang/lib/Driver/Driver.cpp

jchlanda · 2025-03-11T13:14:25Z

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

jchlanda · 2025-03-11T14:09:06Z

Fixes: #8797

mdtoguchi · 2025-03-11T14:49:38Z

What is the expectation when a fat object created with -fsycl-device-obj=asm is consumed? Is there an expectation for this to work - or is this more of a mechanism to dump assembly and not part of a regular workflow when producing final binaries?

bader · 2025-03-11T16:04:48Z

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

@jchlanda, what do you mean? Please, take a look at #17113 or #16834. Do they look like satisfying your requirement?

jchlanda · 2025-03-12T06:23:36Z

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

@jchlanda, what do you mean? Please, take a look at #17113 or #16834. Do they look like satisfying your requirement?

Hi @bader
Apologies, I should have been more clear. With the squash and merge I was wondering if we could somehow preserve the ownership of the commits. As I've mentioned, the first commit in this PR is not my work, I've git amd @stefanatwork older code.

jchlanda · 2025-03-12T06:54:09Z

What is the expectation when a fat object created with -fsycl-device-obj=asm is consumed? Is there an expectation for this to work - or is this more of a mechanism to dump assembly and not part of a regular workflow when producing final binaries?

hi @mdtoguchi

This is meant to provide a convenient way of inspecting kernel's assembly/PTX without going to the effort of saving temps.

I personally would always use it alongside -fsycl-device-only.
For a silly little sample:

Q.submit([&](sycl::handler &cgh) {
   sycl::accessor AccA(BufA, cgh, sycl::write_only);
   cgh.parallel_for(sycl::range<1>{32},
                    [=](sycl::id<1> idx) { AccA[idx] = 42; });
 }).wait_and_throw();

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-obj=asm device_obj.cpp -o -
produces:

//
// Generated by LLVM NVPTX Back-End
//

.version 8.5
.target sm_50
.address_size 64

        // .weak        _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE // -- Begin function _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE
.extern .func  (.param .b64 func_retval0) _Z28__spirv_GlobalInvocationId_xv
()
;
.extern .func  (.param .b64 func_retval0) _Z20__spirv_GlobalSize_xv
()
;
.extern .func  (.param .b64 func_retval0) _Z22__spirv_GlobalOffset_xv
()
;
                                        // @_ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE
.weak .entry _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE(
        .param .align 8 .b8 _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_0[8],
        .param .u64 .ptr .global .align 4 _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_1,
        .param .align 8 .b8 _ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_2[8]
)
{
        .reg .pred      %p<3>;
        .reg .b32       %r<2>;
        .reg .b64       %rd<21>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [_ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_0];
        { // callseq 0, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z28__spirv_GlobalInvocationId_xv,
        (
        );
        ld.param.b64    %rd20, [retval0];
        } // callseq 0
        { // callseq 1, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z20__spirv_GlobalSize_xv,
        (
        );
        ld.param.b64    %rd13, [retval0];
        } // callseq 1
        { // callseq 2, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z22__spirv_GlobalOffset_xv,
        (
        );
        ld.param.b64    %rd14, [retval0];
        } // callseq 2
        setp.ge.u64     %p1, %rd20, %rd1;
        @%p1 bra        $L__BB0_3;
// %bb.1:                               // %for.body.i.preheader
        ld.param.u64    %rd11, [_ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_1];
        ld.param.u64    %rd2, [_ZTSN4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE_param_2];
        shl.b64         %rd16, %rd20, 2;
        shl.b64         %rd17, %rd2, 2;
        add.s64         %rd18, %rd16, %rd17;
        add.s64         %rd19, %rd11, %rd18;
        shl.b64         %rd6, %rd13, 2;
$L__BB0_2:                              // %for.body.i
                                        // =>This Inner Loop Header: Depth=1
        mov.b32         %r1, 42;
        st.global.u32   [%rd19], %r1;
        add.s64         %rd20, %rd20, %rd13;
        add.s64         %rd19, %rd19, %rd6;
        setp.lt.u64     %p2, %rd20, %rd1;
        @%p2 bra        $L__BB0_2;
$L__BB0_3:                              // %_ZNK4sycl3_V16detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EclES4_.exit
        ret;
                                        // -- End function
}
        // .weak        _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE_ // -- Begin function _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE_
.weak .entry _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE_(
        .param .u64 .ptr .global .align 4 _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE__param_0,
        .param .align 8 .b8 _ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE__param_1[8]
)                                       // @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE_
{
        .reg .b32       %r<2>;
        .reg .b64       %rd<13>;

// %bb.0:                               // %entry
        ld.param.u64    %rd1, [_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE__param_0];
        ld.param.u64    %rd2, [_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_EUlNS0_2idILi1EEEE__param_1];
        shl.b64         %rd3, %rd2, 2;
        add.s64         %rd4, %rd1, %rd3;
        { // callseq 3, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z28__spirv_GlobalInvocationId_xv,
        (
        );
        ld.param.b64    %rd5, [retval0];
        } // callseq 3
        { // callseq 4, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z20__spirv_GlobalSize_xv,
        (
        );
        ld.param.b64    %rd7, [retval0];
        } // callseq 4
        { // callseq 5, 0
        .param .b64 retval0;
        call.uni (retval0),
        _Z22__spirv_GlobalOffset_xv,
        (
        );
        ld.param.b64    %rd9, [retval0];
        } // callseq 5
        shl.b64         %rd11, %rd5, 2;
        add.s64         %rd12, %rd4, %rd11;
        mov.b32         %r1, 42;
        st.global.u32   [%rd12], %r1;
        ret;
                                        // -- End function
}

Which is what I'd expect to see.
Now, if things were going to get bundled, we would need to inform the compiler that it needs to generate assembly for the host as well, so this:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-obj=asm -S device_obj.cpp -o - also works as expected. As we get a bundled assembly for both PTX and X86 (redacted file):

# __CLANG_OFFLOAD_BUNDLE____START__ sycl-nvptx64-nvidia-cuda-sm_50

... the same PTX as above ...

# __CLANG_OFFLOAD_BUNDLE____END__ sycl-nvptx64-nvidia-cuda-sm_50

# __CLANG_OFFLOAD_BUNDLE____START__ host-x86_64-unknown-linux-gnu
        .file   "ptx_outpu.cpp"
        .text
        .globl  main                            # -- Begin function main

... the rest of X86 asm ...

# __CLANG_OFFLOAD_BUNDLE____END__ host-x86_64-unknown-linux-gnu

And finally, when targeting both host and device, but the -S is not present:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-obj=asm device_obj.cpp -o a.out
we get a warning message and the compiler proceeds to create a binary executable as normal:
clang++: warning: argument unused during compilation: '-fsycl-device-obj=asm' [-Wunused-command-line-argument]

bader · 2025-03-12T14:39:19Z

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

@jchlanda, what do you mean? Please, take a look at #17113 or #16834. Do they look like satisfying your requirement?

Hi @bader Apologies, I should have been more clear. With the squash and merge I was wondering if we could somehow preserve the ownership of the commits. As I've mentioned, the first commit in this PR is not my work, I've git amd @stefanatwork older code.

I understand that. Both mentioned PRs have commits done by multiple authors. My understanding is that git log for squashed commit will look like this:

commit ...
Author: Werner, Stefan <[email protected]>
...
Co-authored-by: Jakub Chlanda <[email protected]>

Author is git commit attribute and Co-authored-by is part of the commit message.

Does it work for you?

jchlanda · 2025-03-12T14:41:28Z

Please note that the original patch (represented as the first commit in this PR) was written by @stefanatwork, and it would be good to keep the attribution of work correct.

@jchlanda, what do you mean? Please, take a look at #17113 or #16834. Do they look like satisfying your requirement?

Hi @bader Apologies, I should have been more clear. With the squash and merge I was wondering if we could somehow preserve the ownership of the commits. As I've mentioned, the first commit in this PR is not my work, I've git amd @stefanatwork older code.

I understand that. Both mentioned PRs have commits done by multiple authors. My understanding is that git log for squashed commit will look like this:
commit ...
Author: Werner, Stefan <[email protected]>
...
Co-authored-by: Jakub Chlanda <[email protected]>
Author is git commit attribute and Co-authored-by is part of the commit message.

Does it work for you?

Yeap, works for me. Thank you.

mdtoguchi · 2025-03-12T16:10:25Z

@jchlanda, thanks for the usage information. Your usage model has some overlap with the use of -fsycl-dump-device-code=arg. Use of -fsycl-dump-device-code will essentially intercept the regular -fsycl flow and retain the generated assembly file when using the nvptx64 target. @srividya-sundaram for awareness.

As for the consumption, I'm more concerned about usage of something like:
clang++ -fsycl -fsycl-device-obj=asm -fsycl-targets=nvptx64-nvidia-cuda -c file.cpp
clang++ -fsycl -fsycl-device-obj=asm -fsycl-targets=nvptx64-nvidia-cuda file.o
Where the consumed device side of the object has asm instead of the expected LLVM-IR, and the assembly will go directly to the llvm-link call.

jchlanda · 2025-03-13T13:43:48Z

@jchlanda, thanks for the usage information. Your usage model has some overlap with the use of -fsycl-dump-device-code=arg. Use of -fsycl-dump-device-code will essentially intercept the regular -fsycl flow and retain the generated assembly file when using the nvptx64 target. @srividya-sundaram for awareness.

As for the consumption, I'm more concerned about usage of something like: clang++ -fsycl -fsycl-device-obj=asm -fsycl-targets=nvptx64-nvidia-cuda -c file.cpp clang++ -fsycl -fsycl-device-obj=asm -fsycl-targets=nvptx64-nvidia-cuda file.o Where the consumed device side of the object has asm instead of the expected LLVM-IR, and the assembly will go directly to the llvm-link call.

@mdtoguchi, so it seems that -fsycl-device-obj can only be used alongside -S, or -c. In all honesty I did not realise that you could use it while creating a library, and you are right the switch trips up llvm-link. I don't have a horse in that race, was simply reviving an old patch that seemed quite useful. I suppose the two options we have now are:

issue a warning and ignore the -fsycl-device-obj=asm when creating a library (-c),
abandon this patch.

Seems like -fsycl-dump-device-code already provides the same functionality, so we are not loosing much. I'm happy to follow your judgement on this though.

jchlanda · 2025-04-01T06:42:50Z

@intel/dpcpp-doc-reviewers @mdtoguchi is there anything else you'd like me to do for this PR? Thank you.

npmiller · 2025-04-09T09:56:49Z

ping @intel/dpcpp-doc-reviewers @mdtoguchi this is ready to review

jchlanda · 2025-04-10T05:49:57Z

@intel/dpcpp-doc-reviewers could you please have a look at this PR?

bader · 2025-04-10T16:45:30Z

sycl/doc/UsersManual.md

@@ -348,7 +348,8 @@ and not recommended to use in production environment.
 **`-fsycl-device-obj=<arg>`** [EXPERIMENTAL]

    Specify format of device code stored in the resulting object. The <arg> can
-    be one of the following:  "spirv" - SPIR-V is emitted, "llvmir" - LLVM-IR
+    be one of the following:  "spirv" - SPIR-V, "asm" - assembly output when
+    possible (PTX, when targetting Nvidia devices) , or "llvmir" - LLVM-IR


Suggested change

possible (PTX, when targetting Nvidia devices) , or "llvmir" - LLVM-IR

possible (PTX, when targeting NVIDIA devices), or "llvmir" - LLVM-IR

npmiller · 2025-04-11T09:11:43Z

@intel/llvm-gatekeepers this is ready to merge

jchlanda · 2025-04-11T09:23:08Z

@intel/llvm-gatekeepers this should be ready to go.

Thank you.

Allow "ptx" as device object format

5e39582

This allows one to compile SYCL device code to plain PTX directly: -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-obj=ptx -S

jchlanda temporarily deployed to WindowsCILock March 11, 2025 10:49 — with GitHub Actions Inactive

[SYCL] Test and generalise -fsycl-deice-obj=asm to support more targets

34f7247

Seanst98 reviewed Mar 11, 2025

View reviewed changes

clang/lib/Driver/Driver.cpp Outdated Show resolved Hide resolved

jchlanda temporarily deployed to WindowsCILock March 11, 2025 11:38 — with GitHub Actions Inactive

jchlanda temporarily deployed to WindowsCILock March 11, 2025 13:11 — with GitHub Actions Inactive

jchlanda marked this pull request as ready for review March 11, 2025 13:13

jchlanda requested review from a team as code owners March 11, 2025 13:13

SmallVector has 3 elems

16d79d3

jchlanda temporarily deployed to WindowsCILock March 11, 2025 13:35 — with GitHub Actions Inactive

jchlanda mentioned this pull request Mar 11, 2025

[SYCL][CUDA] Allow "ptx" as device object format #10801

Closed

Seanst98 approved these changes Mar 11, 2025

View reviewed changes

jchlanda closed this Mar 12, 2025

jchlanda reopened this Mar 12, 2025

jchlanda temporarily deployed to WindowsCILock March 12, 2025 13:02 — with GitHub Actions Inactive

jchlanda temporarily deployed to WindowsCILock March 12, 2025 13:32 — with GitHub Actions Inactive

jchlanda temporarily deployed to WindowsCILock March 26, 2025 10:27 — with GitHub Actions Inactive

PR feedback and nocudalib in tests

5b725dc

jchlanda had a problem deploying to WindowsCILock March 26, 2025 10:45 — with GitHub Actions Error

jchlanda temporarily deployed to WindowsCILock March 26, 2025 11:34 — with GitHub Actions Inactive

Merge remote-tracking branch 'upstream/sycl' into stefanatwork/ptx_out

4681124

jchlanda had a problem deploying to WindowsCILock March 26, 2025 12:34 — with GitHub Actions Error

jchlanda temporarily deployed to WindowsCILock March 26, 2025 12:34 — with GitHub Actions Inactive

jchlanda had a problem deploying to WindowsCILock March 26, 2025 13:48 — with GitHub Actions Failure

Merge remote-tracking branch 'upstream/sycl' into stefanatwork/ptx_out

a5eee25

jchlanda temporarily deployed to WindowsCILock March 26, 2025 14:35 — with GitHub Actions Inactive

jchlanda temporarily deployed to WindowsCILock March 27, 2025 06:10 — with GitHub Actions Inactive

Merge remote-tracking branch 'upstream/sycl' into stefanatwork/ptx_out

bd3df96

jchlanda temporarily deployed to WindowsCILock March 27, 2025 06:49 — with GitHub Actions Inactive

jchlanda temporarily deployed to WindowsCILock March 27, 2025 10:45 — with GitHub Actions Inactive

Merge remote-tracking branch 'upstream/sycl' into stefanatwork/ptx_out

5ac63f4

jchlanda temporarily deployed to WindowsCILock March 27, 2025 11:07 — with GitHub Actions Inactive

jchlanda requested a review from mdtoguchi March 27, 2025 13:29

mdtoguchi approved these changes Apr 9, 2025

View reviewed changes

bader reviewed Apr 10, 2025

View reviewed changes

bader approved these changes Apr 10, 2025

View reviewed changes

martygrant merged commit a89ae41 into intel:sycl Apr 11, 2025
24 checks passed

	possible (PTX, when targetting Nvidia devices) , or "llvmir" - LLVM-IR
	possible (PTX, when targeting NVIDIA devices), or "llvmir" - LLVM-IR

[SYCL] Extend -fsycl-device-obj to dump assembly #17390

[SYCL] Extend -fsycl-device-obj to dump assembly #17390

Uh oh!

Conversation

jchlanda commented Mar 11, 2025

Uh oh!

Uh oh!

jchlanda commented Mar 11, 2025

Uh oh!

jchlanda commented Mar 11, 2025

Uh oh!

mdtoguchi commented Mar 11, 2025

Uh oh!

bader commented Mar 11, 2025

Uh oh!

jchlanda commented Mar 12, 2025

Uh oh!

jchlanda commented Mar 12, 2025

Uh oh!

bader commented Mar 12, 2025

Uh oh!

jchlanda commented Mar 12, 2025

Uh oh!

mdtoguchi commented Mar 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jchlanda commented Mar 13, 2025

Uh oh!

jchlanda commented Apr 1, 2025

Uh oh!

npmiller commented Apr 9, 2025

Uh oh!

jchlanda commented Apr 10, 2025

Uh oh!

bader Apr 10, 2025

Choose a reason for hiding this comment

Uh oh!

npmiller commented Apr 11, 2025

Uh oh!

jchlanda commented Apr 11, 2025

Uh oh!

Uh oh!

Uh oh!

mdtoguchi commented Mar 12, 2025 •

edited

Loading