[SYCL] Optimize memory transfers #6213

t4c1 · 2022-05-30T12:19:38Z

Optimize memory transfers by removing a redundant host-to-host copy when the data in a buffer is first copied from a user-supplied const pointer on the host to a device that does not support host-unified memory.

pvchupin · 2022-06-10T04:52:33Z

ping @steffenlarsen

steffenlarsen

It seems like this tries to revert #3105. I worry that we may have regressions for the backends mentioned in #3105 if we simply just scrap the changes. Could we potentially keep the removed path, but only take it on accelerators (and maybe OpenCL too?)

sycl/source/detail/scheduler/graph_builder.cpp

sergey-semenov · 2022-06-13T14:28:45Z

> It seems like this tries to revert #3105. I worry that we may have regressions for the backends mentioned in #3105 if we simply just scrap the changes. Could we potentially keep the removed path, but only take it on accelerators (and maybe OpenCL too?)

I'm going to echo Steffen's concern here. The original difference in behavior between creating host unified memory and other buffers (USE_HOST_PTR for the former, COPY_HOST_PTR when needed for the latter) was implemented because USE_HOST_PTR is essentially free in the first case and very expensive on discrete OpenCL/L0 devices. This was then modified in #3105 from creating a device buffer with COPY_HOST_PTR to creating a device buffer and then writing to it, which we originally assumed to be functionally identical, but this provided the FPGA backend with additional information that made it possible to get rid of a redundant memory copy on their side.
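To make the two policies described above easier to compare, here is a minimal sketch in plain C++. It is not the runtime's code: `HostPtrFlag`, `CreationPlan`, `planOriginal` and `planAfter3105` are hypothetical names invented for illustration, and the "COPY_HOST_PTR when needed" case is simplified to "always".

```cpp
// Hypothetical stand-ins for the memory-object creation flags named in the
// discussion; these are not the runtime's real types.
enum class HostPtrFlag { None, UseHostPtr, CopyHostPtr };

struct CreationPlan {
  HostPtrFlag Flag;   // flag passed when the device buffer is created
  bool ExplicitWrite; // true if a separate write to the buffer follows
};

// Original policy as described in the comment: USE_HOST_PTR on devices with
// host unified memory, COPY_HOST_PTR on all other devices.
CreationPlan planOriginal(bool HostUnifiedMemory) {
  if (HostUnifiedMemory)
    return {HostPtrFlag::UseHostPtr, false};
  return {HostPtrFlag::CopyHostPtr, false};
}

// Policy after #3105 as described in the comment: devices without host
// unified memory get a buffer created without a host pointer, followed by
// an explicit write.
CreationPlan planAfter3105(bool HostUnifiedMemory) {
  if (HostUnifiedMemory)
    return {HostPtrFlag::UseHostPtr, false};
  return {HostPtrFlag::None, true};
}
```

Under these assumptions, only the discrete-device branch differs between the two policies, which is the separation of creation and copy that the FPGA backend is said to rely on.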

It seems to me that this patch is effectively reverting both of those changes and we're back to square one with using USE_HOST_PTR for all cases. Could you elaborate on why the current sequence of PI calls is causing problems for the CUDA backend?

t4c1 · 2022-06-15T09:33:29Z

#3105 does multiple things. The ones important for this discussion are:

One of them is what is in its description. It removes COPY_HOST_PTR from the mem object creation flags. That effectively separates creation of allocation and copy. I am not changing that.
The second one is that for devices with host unified memory it creates an extra copy of the data on host before copying it to device. That is not documented in PR and looks redundant, so I am removing it.

sergey-semenov · 2022-06-22T16:04:12Z

> One of them is what is in its description. It removes COPY_HOST_PTR from the mem object creation flags. That effectively separates creation of allocation and copy. I am not changing that.

You are changing whether the host pointer is actually passed to the buffer during its creation though. #3105 removed the logic of COPY_HOST_PTR vs USE_HOST_PTR during buffer creation: if InitFromUserData is set, USE_HOST_PTR is used; otherwise we create the buffer without the pointer and then perform a write operation if needed. After removing the host unified memory condition from InitFromUserData in this patch, all cases will follow the USE_HOST_PTR workflow, so the original difference in behavior between devices with and without host unified memory is lost.
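The objection above can be sketched as a small model. Only the name `InitFromUserData` comes from the thread; the `Request` struct, the `Workflow` enum, and the exact form of both conditions are assumptions made for illustration and are not taken from graph_builder.cpp.

```cpp
// Which sequence the runtime follows when it creates the device buffer.
enum class Workflow { UseHostPtr, CreateThenWrite };

// Hypothetical description of an allocation request.
struct Request {
  bool HasUserPtr;        // the buffer was constructed from a user host pointer
  bool HostUnifiedMemory; // the target device shares memory with the host
};

// Assumed shape of the condition before the first revision of this patch.
bool initFromUserDataBefore(const Request &R) {
  return R.HasUserPtr && R.HostUnifiedMemory;
}

// Assumed shape after the first revision, which drops the host unified
// memory condition.
bool initFromUserDataPatched(const Request &R) { return R.HasUserPtr; }

// As stated in the comment: InitFromUserData selects the USE_HOST_PTR
// workflow; otherwise the buffer is created and then written.
Workflow select(bool InitFromUserData) {
  return InitFromUserData ? Workflow::UseHostPtr : Workflow::CreateThenWrite;
}
```

In this model a device without host unified memory moves from `CreateThenWrite` to `UseHostPtr` once the condition is dropped, while a device with host unified memory is unaffected.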

> The second one is that for devices with host unified memory it creates an extra copy of the data on host before copying it to device. That is not documented in PR and looks redundant, so I am removing it.

I'm curious as to where this overhead is coming from. Unless I'm missing something, creating a host allocation in the branch this patch deletes should do nothing in terms of memory copying (since we should be able to just reuse the pointer that the user passed to the buffer).

sergey-semenov

Marking this to request changes so this doesn't get accidentally merged while there's still an issue with reverting earlier performance-oriented changes.

t4c1 · 2022-07-05T14:43:44Z

> I'm curious as to where this overhead is coming from. Unless I'm missing something, creating a host allocation in the branch this patch deletes should do nothing in terms of memory copying (since we should be able to just reuse the pointer that the user passed to the buffer).

After some digging I figured out that indeed USE_HOST_PTR is used if the provided host pointer is non-const. However, in the case I am looking at the pointer is const, and the data is then copied.

Because of this, fixing my original implementation so that it also retains the optimization for FPGAs would require a large refactor of the runtime.

I think a simpler alternative is to have read-only allocas, so they can use a const host pointer without a copy. If the memory needs to be written, a new allocation will be created. I am force-pushing this implementation.
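The read-only alloca idea can be illustrated with a minimal standalone sketch. This is a model under stated assumptions, not the implementation in this PR: `HostAlloca`, `wrapConst` and `makeWritable` are hypothetical names, and the real runtime tracks allocations through its scheduler commands rather than a struct like this.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a host allocation record.
struct HostAlloca {
  const unsigned char *UserPtr = nullptr; // non-owning view of const user data
  std::vector<unsigned char> Owned;       // writable storage owned by the runtime

  // An allocation that aliases a const user pointer must not be written.
  bool readOnly() const { return UserPtr != nullptr; }

  const unsigned char *data() const {
    return UserPtr ? UserPtr : Owned.data();
  }
};

// Wrap a const user pointer without copying any bytes.
HostAlloca wrapConst(const unsigned char *UserData) {
  HostAlloca A;
  A.UserPtr = UserData;
  return A;
}

// Called only when the memory needs to be written: create a new, writable
// allocation initialized from the existing one. This is the only place
// where the data is copied.
HostAlloca makeWritable(const HostAlloca &A, std::size_t Bytes) {
  HostAlloca W;
  W.Owned.assign(A.data(), A.data() + Bytes);
  return W;
}
```

In this model, read accesses use the result of `wrapConst` directly, so a buffer that is only ever read never pays for a host-to-host copy; the copy is deferred until the first write.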

sergey-semenov

Sorry about the late review, just minor comments. Please update the branch to resolve merge conflicts.

sycl/source/detail/memory_manager.cpp

sycl/source/detail/scheduler/commands.hpp

sycl/source/detail/scheduler/graph_builder.cpp

sycl/source/detail/scheduler/scheduler.hpp

steffenlarsen

Overall looks good, just a few mostly stylistic comments. Would it be possible to test this, for example with SYCL_PI_TRACE?

sycl/source/detail/memory_manager.cpp

sycl/source/detail/scheduler/commands.hpp

sycl/source/detail/scheduler/graph_builder.cpp

t4c1 · 2022-08-10T07:15:03Z

> Would it be possible to test this, for example with SYCL_PI_TRACE?

I don't think so. The problem is that we would expect a different trace depending on whether the device supports host-unified memory, and I don't think we can make FileCheck conditional on a runtime variable.

steffenlarsen

LGTM!

pvchupin · 2022-08-11T02:26:33Z

@t4c1, looks like basic_tests/min_max_test.cpp fails on Windows after this change, can you take a look?
https://github.com/intel/llvm/runs/7778244379?check_suite_focus=true

t4c1 · 2022-08-11T10:59:13Z

I have some trouble with my windows build at the moment, so I can not reproduce it, but I am pretty sure the failure is unrelated to this PR. I think it was caused by #6541 and the fix will be something like https://github.com/intel/llvm/pull/6172/files.

steffenlarsen · 2022-08-11T11:04:32Z

> I have some trouble with my windows build at the moment, so I can not reproduce it, but I am pretty sure the failure is unrelated to this PR. I think it was caused by #6541 and the fix will be something like https://github.com/intel/llvm/pull/6172/files.

Agreed. Failure does seem to come from #6541. I am looking into it.

t4c1 requested a review from a team as a code owner May 30, 2022 12:19

t4c1 requested a review from steffenlarsen May 30, 2022 12:19

steffenlarsen reviewed Jun 10, 2022

sycl/source/detail/scheduler/graph_builder.cpp (outdated)

steffenlarsen requested a review from sergey-semenov June 13, 2022 10:52

sergey-semenov requested changes Jun 27, 2022

alternative solution to optimize memory transfers

f46f79d

t4c1 force-pushed the optimize_transfers branch from 9754892 to f46f79d Compare July 5, 2022 14:44

t4c1 added 3 commits July 5, 2022 15:47

format

2b8ea00

fix reorder warning

48a47c8

Merge branch 'sycl' into optimize_transfers

c857e84

AerialMantis requested review from sergey-semenov and steffenlarsen August 1, 2022 16:01

sergey-semenov reviewed Aug 2, 2022

t4c1 added 2 commits August 2, 2022 14:16

Merge branch 'sycl' into optimize_transfers

3adfc85

addressed review comments

3d56051

sergey-semenov previously approved these changes Aug 2, 2022

Merge branch 'sycl' into optimize_transfers

ba5a41e

steffenlarsen reviewed Aug 9, 2022

sycl/source/detail/memory_manager.cpp (outdated)

sycl/source/detail/scheduler/commands.hpp (outdated)

sycl/source/detail/scheduler/graph_builder.cpp (outdated)

Merge branch 'sycl' into optimize_transfers

738439d

addressed review comments

f4ab957

t4c1 dismissed sergey-semenov’s stale review via f4ab957 August 10, 2022 07:20

format

e1d9b8f

steffenlarsen approved these changes Aug 10, 2022

pvchupin merged commit 92d35cd into intel:sycl Aug 11, 2022
