Skip to content

[SYCL] Do not build device code for sub-devices #5240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Feb 18, 2022
Merged

Conversation

bader
Copy link
Contributor

@bader bader commented Dec 28, 2021

Technically sub-devices are the same as their root device, so we can
build program for root device only and re-use the binary for sub-devices
to avoid "duplicate" builds.

Technically sub-devices are the same as their root device, so we can
build program for root device only and re-use the binary for sub-devices
to avoid "duplicate" builds.
@bader bader requested review from alexbatashev, sergey-semenov and a team December 28, 2021 22:13
alexbatashev
alexbatashev previously approved these changes Dec 29, 2021
Copy link
Contributor

@alexbatashev alexbatashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

1. SubDevices unit test fails on CUDA systems with the following message

terminate called after throwing an instance of 'cl::sycl::feature_not_supported'
  what():  SPIR-V online compilation is not supported in this context -59 (CL_INVALID_OPERATION)

It looks like instead of using OpenCL CPU as "mock" plug-in, unit test
framework uses "default" plugin.
I applied short term solution and skip the test if CUDA or HIP back-ends
are selected.

2. subdevice_pi from llmv-test-suite fails with:

 terminate called after throwing an instance of 'cl::LLVM::compile_program_error'
   what():  The program was built for 1 devices
 Build program log for 'Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz':
  -33 (CL_INVALID_DEVICE)

It turned out that implementation re-uses a program built for a device
associated with a different context.
I fixed that problem, but still we can't optimize some cases from
subdevice_pi test due to a strange behavior of Intel OpenCL CPU
implementation. See code comments for more details.

At this point I ran out of strength to fix all issues with unit test, so
I temporary disable it. I'm going to extend subdevice_pi test with
checks for build program optimizations.
DPC++ runtime internal classes require refactoring to simplify unit
testing.
@bader bader marked this pull request as ready for review February 8, 2022 16:48
@bader bader requested a review from a team as a code owner February 8, 2022 16:48
@smaslov-intel
Copy link
Contributor

Technically sub-devices are the same as their root device, so we can
build program for root device only and re-use the binary for sub-devices
to avoid "duplicate" builds.

Is this really true for any possible backend/device? Is SYCL standard claiming this?
My thinking is that this would be specific to backends/plugins.

Copy link
Contributor

@smaslov-intel smaslov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved to plugins? Or at least maybe add a PI device query that building on root-device is sufficient for running on all of its sub-devices?

@bader
Copy link
Contributor Author

bader commented Feb 8, 2022

Technically sub-devices are the same as their root device, so we can
build program for root device only and re-use the binary for sub-devices
to avoid "duplicate" builds.

Is this really true for any possible backend/device? Is SYCL standard claiming this? My thinking is that this would be specific to backends/plugins.

It might be specific to device rather than back-end. @intel/dpcpp-specification-reviewers, could you clarify if SYCL standard allows such optimization "de jure". "De facto" is works on Intel's CPU and GPU devices.

Should this be moved to plugins? Or at least maybe add a PI device query that building on root-device is sufficient for running on all of its sub-devices?

New PI device query extension might be required if SYCL standard doesn't specify the implementation behavior in such cases. I don't there is a value of moving this logic to the plug-in as the same device (e.g. Intel GPU) can be exposed via multiple plug-ins (e.g. Level Zero and OpenCL).

@smaslov-intel
Copy link
Contributor

I'd be OK doing this optimization in SYCL RT rather than individual plugins if we guarantee it is legal to do (presumably by querying new device info from the plugin).

@gmlueck
Copy link
Contributor

gmlueck commented Feb 8, 2022

It might be specific to device rather than back-end. @intel/dpcpp-specification-reviewers, could you clarify if SYCL standard allows such optimization "de jure". "De facto" is works on Intel's CPU and GPU devices.

This is mostly an implementation detail that is not exposed in the spec. Unless the application uses the kernel_bundle APIs, the application just submits a kernel to a device, and it's up to the implementation to decide whether it needs to build the kernel or reuse a cached version. Thus, the implementation can decide whether a cached version of the kernel for device A is also valid on device B. Nothing needs to be clarified in the spec for this case.

The kernel_bundle does expose the issue, though, because the build, compile, and link functions all allow the application to pass a set of devices. I guess you are asking whether the SYCL spec should specifically allow an application to pass only device A to sycl::build() and then implicitly allow the kernel to be run also on any sub-device of A.

I think we should not add this to the spec because this might not work automatically on other backends. If an application wants to run the kernel also on sub-devices, it seems like the application can just add the sub-devices to the device list that is passed to sycl::build(). If the implementation knows that the same kernel will work on all sub-devices, it can optimize the call and only compile the kernel for the root device.

Currently, only Level Zero returns true for a new query. Level Zero
supports only Intel GPU devices at the moment and to my knowledge they
all should be homogeneous.

All other backends return false, which disables build optimizations.
@bader bader requested a review from a team as a code owner February 14, 2022 15:55
@bader bader requested a review from smaslov-intel February 14, 2022 15:56
@bader
Copy link
Contributor Author

bader commented Feb 14, 2022

I'd be OK doing this optimization in SYCL RT rather than individual plugins if we guarantee it is legal to do (presumably by querying new device info from the plugin).

I've added a new device info and query it from the runtime - 6e310b0.

@@ -203,7 +203,13 @@ pi_result piDeviceGetInfo(pi_device device, pi_device_info paramName,
std::memcpy(paramValue, &result, sizeof(cl_bool));
return PI_SUCCESS;
}

case PI_DEVICE_INFO_HOMOGENEOUS_ARCH: {
// FIXME: conservatively return false due to lack of low-level API exposing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we maybe return true for Intel GPU's already to get OpenCL backend parity with Level-Zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a check for GPU type, but w/o a vendor check. I'm not sure how many OpenCL implementations supports device partition, but I guess it's done for homogeneous GPU only. Let me know if you want to harden the check.

Comment on lines 506 to 507
// To work around this case we optimize only one case: root device shares the
// same context with its sub-device(s). We built for the root device and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is "the same context" checked? What if context just has no root-device in it, only all of its sub-devices?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if context just has no root-device in it, only all of its sub-devices?

The optimization won't be enabled in such case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks. Maybe as a future optimization we could implicitly add the root-device to the context, if >1 of it's sub-devices are there already (such that we can save on 1+ module builds). If you agree, please consider adding a TODO comment.

Copy link
Contributor Author

@bader bader Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if SYCL spec allows implicitly adding devices to the context implicitly created by the runtime, but I think it's not allowed if the context is provided by the user.

The latest patch removes following comment, which I was considering as direction for future optimizations: e6ca4f9#diff-78dd7f7ba0b6120dece1ae4ab5a09c9936ff654a1de2c31ff2dbb1fc58d90393L509-L511
I think it would be great if Level Zero allows us to re-use the program built for any (sub-)device and not only a root device. I tested it on Intel GPU and it works already, but again it's not guaranteed by the spec wording. In this case we don't need implicitly add the root-device to optimize the build for sub-devices.

@bashbaug, does it make sense to pursue this direction? If so, I can recover the comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can think of some cases in theory at least where a program built for one sub-device wouldn't be valid for a sibling sub-device, so this is not a safe assumption in all cases. If we decide this is a direction we want to pursue we'd need to find a way to detect or request this behavior.

Copy link
Contributor

@smaslov-intel smaslov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address few comments

- Renamed PI_DEVICE_INFO_HOMOGENEOUS_ARCH to PI_DEVICE_INFO_BUILD_ON_SUBDEVICE
- Aligned OpenCL backend with Level Zero backend
smaslov-intel
smaslov-intel previously approved these changes Feb 15, 2022
Copy link
Contributor

@smaslov-intel smaslov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bader
Copy link
Contributor Author

bader commented Feb 17, 2022

@smaslov-intel, I pulled sycl branch to resolve merge conflicts and added one more code comment, which probably deserves a separate discussion - bf57926. We might want to discuss other side effects of exposing command queues as PiDevices (one of them is we build program for each "command queue" i.e. multiple times for the same device)

I don't think the issue mentioned in the comment should block merging this PR - current implementation solves the problem with multiple builds.

Copy link
Contributor

@smaslov-intel smaslov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rework the comment

@bader bader requested a review from smaslov-intel February 18, 2022 03:51
@bader bader merged commit 13a7455 into intel:sycl Feb 18, 2022
@bader bader deleted the optimize-build branch February 18, 2022 15:08
smaslov-intel pushed a commit to smaslov-intel/llvm that referenced this pull request Feb 19, 2022
Technically sub-devices are the same as their root device, so we can
build program for root device only and re-use the binary for sub-devices
to avoid "duplicate" builds.
alexbatashev added a commit to alexbatashev/llvm that referenced this pull request Feb 23, 2022
* upstream/sycl: (2757 commits)
  [SYCL][Doc] Fixing incorrect merge of community Readme.md with our version (intel#5636)
  [SYCL] Change USM pooling parameters. (intel#5457)
  [CI] Fix cache location on Windows (intel#5603)
  [SYCL][NFC] Fix a warning about uninitialized struct members (intel#5610)
  [Buildbot] Update Windows GPU version to 101.1340 (intel#5620)
  Fix SPIRV -> OCL barrier call argument attributes
  Move SPV_INTEL_memory_access_aliasing tokens from spirv_internal
  [SYCL][ESIMD] Add support for named barrier APIs (intel#5583)
  [SYCL][L0] Remove ZeModule when program build failed (intel#5541)
  [SYCL] Silence "unknown attribute" warning for `device_indirectly_callable` (intel#5591)
  [SYCL][DOC] Introductory material for extensions (intel#5605)
  [SYCL][DOC] Change extension names to lower case (intel#5607)
  [SYCL] Improve get_kernel_bundle performance (intel#5496)
  [SYCL] Do not build device code for sub-devices (intel#5240)
  [sycl-post-link] Fix a crash during spec-constant properties generation (intel#5538)
  [SYCL][DOC] Move SPIR-V and OpenCL extensions (intel#5578)
  [SYCL][ESIMD][EMU] Update memory intrinsics for ESIMD_EMU plugin (intel#4748)
  [CI] Allow stale issue bot to analyze more issues (intel#5602)
  [SYCL][L0] Honor property::queue::enable_profiling (intel#5543)
  [OpenMP] Properly save strings when doing LTO
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants