-
Notifications
You must be signed in to change notification settings - Fork 788
[SYCL] Do not build device code for sub-devices #5240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Apply LLVM's coding style rule - include as little as possible. https://llvm.org/docs/CodingStandards.html#include-as-little-as-possible
Technically sub-devices are the same as their root device, so we can build program for root device only and re-use the binary for sub-devices to avoid "duplicate" builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1. SubDevices unit test fails on CUDA systems with the following message terminate called after throwing an instance of 'cl::sycl::feature_not_supported' what(): SPIR-V online compilation is not supported in this context -59 (CL_INVALID_OPERATION) It looks like instead of using OpenCL CPU as "mock" plug-in, unit test framework uses "default" plugin. I applied short term solution and skip the test if CUDA or HIP back-ends are selected. 2. subdevice_pi from llmv-test-suite fails with: terminate called after throwing an instance of 'cl::LLVM::compile_program_error' what(): The program was built for 1 devices Build program log for 'Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz': -33 (CL_INVALID_DEVICE) It turned out that implementation re-uses a program built for a device associated with a different context. I fixed that problem, but still we can't optimize some cases from subdevice_pi test due to a strange behavior of Intel OpenCL CPU implementation. See code comments for more details. At this point I ran out of strength to fix all issues with unit test, so I temporary disable it. I'm going to extend subdevice_pi test with checks for build program optimizations. DPC++ runtime internal classes require refactoring to simplify unit testing.
Is this really true for any possible backend/device? Is SYCL standard claiming this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be moved to plugins? Or at least maybe add a PI device query that building on root-device is sufficient for running on all of its sub-devices?
It might be specific to device rather than back-end. @intel/dpcpp-specification-reviewers, could you clarify if SYCL standard allows such optimization "de jure". "De facto" is works on Intel's CPU and GPU devices.
New PI device query extension might be required if SYCL standard doesn't specify the implementation behavior in such cases. I don't there is a value of moving this logic to the plug-in as the same device (e.g. Intel GPU) can be exposed via multiple plug-ins (e.g. Level Zero and OpenCL). |
I'd be OK doing this optimization in SYCL RT rather than individual plugins if we guarantee it is legal to do (presumably by querying new device info from the plugin). |
This is mostly an implementation detail that is not exposed in the spec. Unless the application uses the The I think we should not add this to the spec because this might not work automatically on other backends. If an application wants to run the kernel also on sub-devices, it seems like the application can just add the sub-devices to the device list that is passed to |
Currently, only Level Zero returns true for a new query. Level Zero supports only Intel GPU devices at the moment and to my knowledge they all should be homogeneous. All other backends return false, which disables build optimizations.
I've added a new device info and query it from the runtime - 6e310b0. |
sycl/plugins/opencl/pi_opencl.cpp
Outdated
@@ -203,7 +203,13 @@ pi_result piDeviceGetInfo(pi_device device, pi_device_info paramName, | |||
std::memcpy(paramValue, &result, sizeof(cl_bool)); | |||
return PI_SUCCESS; | |||
} | |||
|
|||
case PI_DEVICE_INFO_HOMOGENEOUS_ARCH: { | |||
// FIXME: conservatively return false due to lack of low-level API exposing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we maybe return true for Intel GPU's already to get OpenCL backend parity with Level-Zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a check for GPU type, but w/o a vendor check. I'm not sure how many OpenCL implementations supports device partition, but I guess it's done for homogeneous GPU only. Let me know if you want to harden the check.
// To work around this case we optimize only one case: root device shares the | ||
// same context with its sub-device(s). We built for the root device and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is "the same context" checked? What if context just has no root-device in it, only all of its sub-devices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/intel/llvm/pull/5240/files#diff-78dd7f7ba0b6120dece1ae4ab5a09c9936ff654a1de2c31ff2dbb1fc58d90393R490. Put additional comment to emphasize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if context just has no root-device in it, only all of its sub-devices?
The optimization won't be enabled in such case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks. Maybe as a future optimization we could implicitly add the root-device to the context, if >1 of it's sub-devices are there already (such that we can save on 1+ module builds). If you agree, please consider adding a TODO comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if SYCL spec allows implicitly adding devices to the context implicitly created by the runtime, but I think it's not allowed if the context is provided by the user.
The latest patch removes following comment, which I was considering as direction for future optimizations: e6ca4f9#diff-78dd7f7ba0b6120dece1ae4ab5a09c9936ff654a1de2c31ff2dbb1fc58d90393L509-L511
I think it would be great if Level Zero allows us to re-use the program built for any (sub-)device and not only a root device. I tested it on Intel GPU and it works already, but again it's not guaranteed by the spec wording. In this case we don't need implicitly add the root-device to optimize the build for sub-devices.
@bashbaug, does it make sense to pursue this direction? If so, I can recover the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can think of some cases in theory at least where a program built for one sub-device wouldn't be valid for a sibling sub-device, so this is not a safe assumption in all cases. If we decide this is a direction we want to pursue we'd need to find a way to detect or request this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address few comments
- Renamed PI_DEVICE_INFO_HOMOGENEOUS_ARCH to PI_DEVICE_INFO_BUILD_ON_SUBDEVICE - Aligned OpenCL backend with Level Zero backend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@smaslov-intel, I pulled sycl branch to resolve merge conflicts and added one more code comment, which probably deserves a separate discussion - bf57926. We might want to discuss other side effects of exposing command queues as PiDevices (one of them is we build program for each "command queue" i.e. multiple times for the same device) I don't think the issue mentioned in the comment should block merging this PR - current implementation solves the problem with multiple builds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rework the comment
Technically sub-devices are the same as their root device, so we can build program for root device only and re-use the binary for sub-devices to avoid "duplicate" builds.
* upstream/sycl: (2757 commits) [SYCL][Doc] Fixing incorrect merge of community Readme.md with our version (intel#5636) [SYCL] Change USM pooling parameters. (intel#5457) [CI] Fix cache location on Windows (intel#5603) [SYCL][NFC] Fix a warning about uninitialized struct members (intel#5610) [Buildbot] Update Windows GPU version to 101.1340 (intel#5620) Fix SPIRV -> OCL barrier call argument attributes Move SPV_INTEL_memory_access_aliasing tokens from spirv_internal [SYCL][ESIMD] Add support for named barrier APIs (intel#5583) [SYCL][L0] Remove ZeModule when program build failed (intel#5541) [SYCL] Silence "unknown attribute" warning for `device_indirectly_callable` (intel#5591) [SYCL][DOC] Introductory material for extensions (intel#5605) [SYCL][DOC] Change extension names to lower case (intel#5607) [SYCL] Improve get_kernel_bundle performance (intel#5496) [SYCL] Do not build device code for sub-devices (intel#5240) [sycl-post-link] Fix a crash during spec-constant properties generation (intel#5538) [SYCL][DOC] Move SPIR-V and OpenCL extensions (intel#5578) [SYCL][ESIMD][EMU] Update memory intrinsics for ESIMD_EMU plugin (intel#4748) [CI] Allow stale issue bot to analyze more issues (intel#5602) [SYCL][L0] Honor property::queue::enable_profiling (intel#5543) [OpenMP] Properly save strings when doing LTO ...
Technically sub-devices are the same as their root device, so we can
build program for root device only and re-use the binary for sub-devices
to avoid "duplicate" builds.