[SYCL] Do not build device code for sub-devices #5240

bader · 2021-12-28T22:13:59Z

Technically sub-devices are the same as their root device, so we can
build program for root device only and re-use the binary for sub-devices
to avoid "duplicate" builds.
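To make the idea concrete, here is a minimal sketch of a build cache keyed on the root device. `Device`, `rootOf`, and `ProgramCache` are hypothetical stand-ins for illustration, not the DPC++ runtime classes. The sketch also assumes that a program built for the root device is valid on every sub-device, which the review discussion shows is not guaranteed on every backend.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a runtime device handle (not the DPC++ type).
struct Device {
  std::string name;
  const Device *parent; // nullptr for a root device
};

// Walk up the partition hierarchy until the root device is reached.
inline const Device *rootOf(const Device *d) {
  while (d->parent != nullptr)
    d = d->parent;
  return d;
}

// Toy program cache keyed by root device. `builds` counts how many
// (simulated) device-code builds were actually performed.
struct ProgramCache {
  std::map<const Device *, int> programs; // root device -> program id
  int builds = 0;

  int getOrBuild(const Device *d) {
    const Device *root = rootOf(d);
    auto it = programs.find(root);
    if (it != programs.end())
      return it->second; // reuse the program built for the root device
    int id = ++builds;   // stands in for an expensive build
    programs.emplace(root, id);
    return id;
  }
};
```

Requesting a program for a root device, two of its sub-devices, and a nested sub-device through `getOrBuild` triggers a single simulated build in this model.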

Apply LLVM's coding style rule - include as little as possible. https://llvm.org/docs/CodingStandards.html#include-as-little-as-possible

alexbatashev

LGTM

1. The SubDevices unit test fails on CUDA systems with the following message:

   terminate called after throwing an instance of 'cl::sycl::feature_not_supported'
   what(): SPIR-V online compilation is not supported in this context -59 (CL_INVALID_OPERATION)

   It looks like the unit test framework uses the "default" plug-in instead of using OpenCL CPU as the "mock" plug-in. As a short-term solution, I skip the test if the CUDA or HIP back-ends are selected.

2. subdevice_pi from llvm-test-suite fails with:

   terminate called after throwing an instance of 'cl::sycl::compile_program_error'
   what(): The program was built for 1 devices
   Build program log for 'Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz':
   -33 (CL_INVALID_DEVICE)

   It turned out that the implementation re-used a program built for a device associated with a different context. I fixed that problem, but we still can't optimize some cases from the subdevice_pi test due to strange behavior of the Intel OpenCL CPU implementation. See the code comments for more details.

At this point I ran out of strength to fix all the issues with the unit test, so I have temporarily disabled it. I'm going to extend the subdevice_pi test with checks for build program optimizations. The DPC++ runtime internal classes require refactoring to simplify unit testing.

sycl/source/detail/persistent_device_code_cache.cpp

Tests on Intel GPU devices show that build results created for any subdevices can be re-used for other sub-devices.

This reverts commit 7ac48ae.

This reverts commit a1e483a.

smaslov-intel · 2022-02-08T17:22:30Z

Technically sub-devices are the same as their root device, so we can
build program for root device only and re-use the binary for sub-devices
to avoid "duplicate" builds.

Is this really true for any possible backend/device? Is SYCL standard claiming this?
My thinking is that this would be specific to backends/plugins.

smaslov-intel

Should this be moved to plugins? Or at least maybe add a PI device query that building on root-device is sufficient for running on all of its sub-devices?

bader · 2022-02-08T18:51:24Z

Technically sub-devices are the same as their root device, so we can
build program for root device only and re-use the binary for sub-devices
to avoid "duplicate" builds.

Is this really true for any possible backend/device? Is SYCL standard claiming this? My thinking is that this would be specific to backends/plugins.

It might be specific to the device rather than the back-end. @intel/dpcpp-specification-reviewers, could you clarify whether the SYCL standard allows such an optimization "de jure"? "De facto" it works on Intel's CPU and GPU devices.

Should this be moved to plugins? Or at least maybe add a PI device query that building on root-device is sufficient for running on all of its sub-devices?

A new PI device query extension might be required if the SYCL standard doesn't specify the implementation behavior in such cases. I don't think there is value in moving this logic to the plug-in, as the same device (e.g. Intel GPU) can be exposed via multiple plug-ins (e.g. Level Zero and OpenCL).

smaslov-intel · 2022-02-08T19:00:24Z

I'd be OK doing this optimization in SYCL RT rather than individual plugins if we guarantee it is legal to do (presumably by querying new device info from the plugin).

gmlueck · 2022-02-08T19:35:23Z

It might be specific to the device rather than the back-end. @intel/dpcpp-specification-reviewers, could you clarify whether the SYCL standard allows such an optimization "de jure"? "De facto" it works on Intel's CPU and GPU devices.

This is mostly an implementation detail that is not exposed in the spec. Unless the application uses the kernel_bundle APIs, the application just submits a kernel to a device, and it's up to the implementation to decide whether it needs to build the kernel or reuse a cached version. Thus, the implementation can decide whether a cached version of the kernel for device A is also valid on device B. Nothing needs to be clarified in the spec for this case.

The kernel_bundle does expose the issue, though, because the build, compile, and link functions all allow the application to pass a set of devices. I guess you are asking whether the SYCL spec should specifically allow an application to pass only device A to sycl::build() and then implicitly allow the kernel to be run also on any sub-device of A.

I think we should not add this to the spec because this might not work automatically on other backends. If an application wants to run the kernel also on sub-devices, it seems like the application can just add the sub-devices to the device list that is passed to sycl::build(). If the implementation knows that the same kernel will work on all sub-devices, it can optimize the call and only compile the kernel for the root device.
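As a rough illustration of that last point, the sketch below models how an implementation might trim the device list an application passes to a build call. `Device` and `compileTargets` are hypothetical names, and the `subDevicesShareRootBinary` flag is an assumption standing in for whatever backend knowledge the implementation has; none of this is a SYCL or DPC++ API.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical device handle: parent is nullptr for a root device.
struct Device {
  const Device *parent;
};

// Returns the devices the implementation would actually compile for.
// A sub-device is skipped only when its root device is also in the
// requested list and the backend is known to reuse the root binary.
inline std::set<const Device *>
compileTargets(const std::vector<const Device *> &requested,
               bool subDevicesShareRootBinary) {
  std::set<const Device *> asked(requested.begin(), requested.end());
  std::set<const Device *> targets;
  for (const Device *d : requested) {
    const Device *root = d;
    while (root->parent != nullptr)
      root = root->parent;
    bool covered = subDevicesShareRootBinary && root != d &&
                   asked.count(root) > 0;
    if (!covered)
      targets.insert(d);
  }
  return targets;
}
```

The application still lists every device it intends to use. Whether fewer compilations happen is left to the implementation, so application code stays portable across backends where the reuse is not valid.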

Currently, only Level Zero returns true for the new query. Level Zero supports only Intel GPU devices at the moment and, to my knowledge, they should all be homogeneous. All other backends return false, which disables the build optimization.

bader · 2022-02-14T15:58:22Z

I'd be OK doing this optimization in SYCL RT rather than individual plugins if we guarantee it is legal to do (presumably by querying new device info from the plugin).

I've added a new device info and query it from the runtime - 6e310b0.

sycl/plugins/level_zero/pi_level_zero.cpp

smaslov-intel · 2022-02-14T23:45:38Z

sycl/plugins/opencl/pi_opencl.cpp

@@ -203,7 +203,13 @@ pi_result piDeviceGetInfo(pi_device device, pi_device_info paramName,
    std::memcpy(paramValue, &result, sizeof(cl_bool));
    return PI_SUCCESS;
  }
-
+  case PI_DEVICE_INFO_HOMOGENEOUS_ARCH: {
+    // FIXME: conservatively return false due to lack of low-level API exposing


Should we maybe return true for Intel GPUs already, to get OpenCL backend parity with Level Zero?

I added a check for the GPU device type, but without a vendor check. I'm not sure how many OpenCL implementations support device partitioning, but I guess it's done for homogeneous GPUs only. Let me know if you want to harden the check.
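One reading of the per-backend behavior described in this thread can be modeled as a small pure function. The enum names and `rootBuildCoversSubDevices` are hypothetical. The real query was later renamed in this PR, and its exact return-value semantics are not shown here, so treat the mapping below as an assumption rather than the plug-in code.

```cpp
#include <cassert>

// Hypothetical enums for illustration; not the PI types.
enum class Backend { OpenCL, LevelZero, CUDA, HIP };
enum class DeviceType { CPU, GPU, Accelerator };

// Models the answer each plug-in gives when asked whether a program
// built for a root device can be reused on its sub-devices:
//   Level Zero: yes (only Intel GPUs, assumed homogeneous)
//   OpenCL:     yes for GPU-type devices, without a vendor check
//   others:     conservatively no, which disables the optimization
inline bool rootBuildCoversSubDevices(Backend backend, DeviceType type) {
  switch (backend) {
  case Backend::LevelZero:
    return true;
  case Backend::OpenCL:
    return type == DeviceType::GPU;
  default:
    return false;
  }
}
```

Hardening the OpenCL case, as offered above, would amount to an extra vendor condition on the `Backend::OpenCL` branch.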

sycl/include/CL/sycl/detail/pi.h

smaslov-intel · 2022-02-14T23:54:30Z

sycl/source/detail/program_manager/program_manager.cpp

+  // To work around this case we optimize only one case: root device shares the
+  // same context with its sub-device(s). We built for the root device and


Where is "the same context" checked? What if context just has no root-device in it, only all of its sub-devices?

See https://github.com/intel/llvm/pull/5240/files#diff-78dd7f7ba0b6120dece1ae4ab5a09c9936ff654a1de2c31ff2dbb1fc58d90393R490. I've added a comment to emphasize this.

What if context just has no root-device in it, only all of its sub-devices?

The optimization won't be enabled in such a case.
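The rule discussed here (build for the root device only when it shares the context with the sub-device) can be sketched as follows. `Device`, `Context`, and `buildTarget` are hypothetical names, and `rootBuildIsReusable` is an assumed flag representing the device query result; this is not the code in program_manager.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical handles for illustration only.
struct Device {
  const Device *parent; // nullptr for a root device
};
struct Context {
  std::vector<const Device *> devices;
};

inline bool contains(const Context &ctx, const Device *d) {
  return std::find(ctx.devices.begin(), ctx.devices.end(), d) !=
         ctx.devices.end();
}

// Picks the device to build for. The root device is chosen only if the
// backend allows reuse AND the root device is part of the same context;
// otherwise the build targets the requested device itself.
inline const Device *buildTarget(const Context &ctx, const Device *d,
                                 bool rootBuildIsReusable) {
  if (!rootBuildIsReusable)
    return d;
  const Device *root = d;
  while (root->parent != nullptr)
    root = root->parent;
  return contains(ctx, root) ? root : d;
}
```

For a context that holds only sub-devices, `buildTarget` returns the sub-device itself, which matches the answer above that the optimization is not enabled in that case.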

I see, thanks. Maybe as a future optimization we could implicitly add the root device to the context if more than one of its sub-devices is already there (so that we can save on 1+ module builds). If you agree, please consider adding a TODO comment.

I'm not sure whether the SYCL spec allows implicitly adding devices to a context that the runtime created implicitly, but I think it's not allowed if the context is provided by the user.

The latest patch removes the following comment, which I was considering as a direction for future optimizations: e6ca4f9#diff-78dd7f7ba0b6120dece1ae4ab5a09c9936ff654a1de2c31ff2dbb1fc58d90393L509-L511
I think it would be great if Level Zero allowed us to re-use the program built for any (sub-)device and not only for a root device. I tested it on an Intel GPU and it already works, but again it's not guaranteed by the spec wording. In that case we wouldn't need to implicitly add the root device to optimize the build for sub-devices.

@bashbaug, does it make sense to pursue this direction? If so, I can recover the comment.

I can think of some cases in theory at least where a program built for one sub-device wouldn't be valid for a sibling sub-device, so this is not a safe assumption in all cases. If we decide this is a direction we want to pursue we'd need to find a way to detect or request this behavior.

smaslov-intel

Please address a few comments.

- Renamed PI_DEVICE_INFO_HOMOGENEOUS_ARCH to PI_DEVICE_INFO_BUILD_ON_SUBDEVICE
- Aligned OpenCL backend with Level Zero backend

smaslov-intel

LGTM

bader · 2022-02-17T17:24:38Z

@smaslov-intel, I pulled the sycl branch to resolve merge conflicts and added one more code comment, which probably deserves a separate discussion - bf57926. We might want to discuss other side effects of exposing command queues as PiDevices. One of them is that we build the program for each "command queue", i.e. multiple times for the same device.

I don't think the issue mentioned in the comment should block merging this PR - the current implementation solves the problem with multiple builds.

sycl/source/detail/program_manager/program_manager.cpp

smaslov-intel

Please rework the comment

bader added 5 commits December 28, 2021 22:11

[SYCL][NFC] Call getCacheItemPath only if cache is enabled

cd9818b

[SYCL][NFC] Don't include sycl.hpp from headers

04e3869

Apply LLVM's coding style rule - include as little as possible. https://llvm.org/docs/CodingStandards.html#include-as-little-as-possible

[SYCL][NFC] Factor out empty kernel creation boilerplate

ba29bbe

[SYCL] Do not build device code for sub-devices.

f5b380b

Technically sub-devices are the same as their root device, so we can build program for root device only and re-use the binary for sub-devices to avoid "duplicate" builds.

Apply clang-format

5a3587e

bader requested review from alexbatashev, sergey-semenov and a team December 28, 2021 22:13

alexbatashev previously approved these changes Dec 29, 2021

bader added 2 commits December 29, 2021 13:50

[NFC] Fix a few typos in the comments

61e09bd

bader dismissed alexbatashev’s stale review via 28b7f80 December 29, 2021 14:08

This was referenced Dec 29, 2021

[SYCL] Fix kernel program cache for multiple devices and refactor some unit tests #5017

Merged

Check that program is built once for "fused" case intel/llvm-test-suite#697

Closed

bader commented Dec 29, 2021

sycl/source/detail/persistent_device_code_cache.cpp

bader added 7 commits January 20, 2022 11:40

Merge remote-tracking branch 'intel/sycl' into optimize-build

d5b93f0

Improved build results caching for GPU devices.

a1e483a

Tests on Intel GPU devices show that build results created for any subdevices can be re-used for other sub-devices.

Improve GPU caching.

7ac48ae

Revert "Improve GPU caching."

d0f2861

This reverts commit 7ac48ae.

Revert "Improved build results caching for GPU devices."

231a1a3

This reverts commit a1e483a.

Merge remote-tracking branch 'intel/sycl' into optimize-build

8f2d9c4

Fix formatting.

d44e27f

bader marked this pull request as ready for review February 8, 2022 16:48

bader requested a review from a team as a code owner February 8, 2022 16:48

smaslov-intel suggested changes Feb 8, 2022

Merge remote-tracking branch 'intel/sycl' into optimize-build

ce299cd

bader requested a review from a team as a code owner February 14, 2022 15:55

bader requested a review from smaslov-intel February 14, 2022 15:56

smaslov-intel reviewed Feb 14, 2022

sycl/plugins/level_zero/pi_level_zero.cpp

smaslov-intel reviewed Feb 14, 2022

sycl/include/CL/sycl/detail/pi.h

smaslov-intel reviewed Feb 14, 2022

Address code review feedback

e6ca4f9

- Renamed PI_DEVICE_INFO_HOMOGENEOUS_ARCH to PI_DEVICE_INFO_BUILD_ON_SUBDEVICE
- Aligned OpenCL backend with Level Zero backend

smaslov-intel previously approved these changes Feb 15, 2022

bader added 2 commits February 17, 2022 17:04

Added a FIXME comment.

bf57926

Merge remote-tracking branch 'intel/sycl' into optimize-build

d062d77

bader dismissed smaslov-intel’s stale review via d062d77 February 17, 2022 17:06

bader commented Feb 17, 2022

sycl/source/detail/program_manager/program_manager.cpp

Update sycl/source/detail/program_manager/program_manager.cpp

0e650ea

smaslov-intel reviewed Feb 17, 2022

sycl/source/detail/program_manager/program_manager.cpp

smaslov-intel suggested changes Feb 17, 2022

Move comment to Level Zero plug-in.

d1cc7aa

bader requested a review from smaslov-intel February 18, 2022 03:51

alexbatashev approved these changes Feb 18, 2022

smaslov-intel approved these changes Feb 18, 2022

bader merged commit 13a7455 into intel:sycl Feb 18, 2022

bader deleted the optimize-build branch February 18, 2022 15:08
