Skip to content

[SYCL] Optimize memory transfers #6213

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Aug 11, 2022
Merged

[SYCL] Optimize memory transfers #6213

merged 10 commits into from
Aug 11, 2022

Conversation

t4c1
Copy link
Contributor

@t4c1 t4c1 commented May 30, 2022

Optimize memory transfers by removing a redundant host to host data transfer when the data in a buffer is copied the first time from a user supplied const pointer on the host to a device that does not support host-unified memory.

@t4c1 t4c1 requested a review from a team as a code owner May 30, 2022 12:19
@t4c1 t4c1 requested a review from steffenlarsen May 30, 2022 12:19
@pvchupin
Copy link
Contributor

ping @steffenlarsen

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this tries to revert #3105. I worry that we may have regressions for the backends mentioned in #3105 if we simply just scrap the changes. Could we potentially keep the removed path, but only take it on accelerators (and maybe OpenCL too?)

@sergey-semenov
Copy link
Contributor

It seems like this tries to revert #3105. I worry that we may have regressions for the backends mentioned in #3105 if we simply just scrap the changes. Could we potentially keep the removed path, but only take it on accelerators (and maybe OpenCL too?)

I'm going to echo Steffen's concern here. The original difference in behavior between creating host unified memory and other buffers (USE_HOST_PTR for the former, COPY_HOST_PTR when needed for the latter) was implemented because USE_HOST_PTR is essentially free in the first case and very expensive on discrete OpenCL/L0 devices. This was then modified in #3105 from creating a device buffer with COPY_HOST_PTR to creating a device buffer then writing to it, which we originally assumed to be functionally identical, but this provided FPGA backend with additional information that made it possible to get rid of a redundant memory copy on their side.

It seems to me that this patch is effectively reverting both of those changes and we're back to square one with using USE_HOST_PTR for all cases. Could you elaborate on why the current sequence of PI calls is causing problems for the CUDA backend?

@t4c1
Copy link
Contributor Author

t4c1 commented Jun 15, 2022

#3105 does multiple things. Ones important for this discussion are:

  • One of them is what is in its description. It removes COPY_HOST_PTR from the mem object creation flags. That effectively separates creation of allocation and copy. I am not changing that.
  • The second one is that for devices with host unified memory it creates an extra copy of the data on host before copying it to device. That is not documented in PR and looks redundant, so I am removing it.

@sergey-semenov
Copy link
Contributor

sergey-semenov commented Jun 22, 2022

  • One of them is what is in its description. It removes COPY_HOST_PTR from the mem object creation flags. That effectively separates creation of allocation and copy. I am not changing that.

You are changing whether the host pointer is actually passed to the buffer during its creation though. #3105 removed the logic of COPY_HOST_PTR vs USE_HOST_PTR during buffer creation: if InitFromUserData is set, USE_HOST_PTR is used, otherwise we create the buffer without the pointer to then perform a write operation if needed. After removing the host unified memory condition from InitFromUserData in this patch all cases will follow the USE_HOST_PTR workflow, so the original difference in behavior between devices with and without host unified memory is lost.

  • The second one is that for devices with host unified memory it creates an extra copy of the data on host before copying it to device. That is not documented in PR and looks redundant, so I am removing it.

I'm curious as to where this overhead is coming from. Unless I'm missing something, creating a host allocation in the branch this patch deletes should do nothing in terms of memory copying (since we should be able to just reuse the pointer that the user passed to the buffer).

Copy link
Contributor

@sergey-semenov sergey-semenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking this to request changes so this doesn't get accidentally merged while there's still an issue with reverting earlier performance oriented changes.

@t4c1
Copy link
Contributor Author

t4c1 commented Jul 5, 2022

I'm curious as to where this overhead is coming from. Unless I'm missing something, creating a host allocation in the branch this patch deletes should do nothing in terms of memory copying (since we should be able to just reuse the pointer that the user passed to the buffer).

After some digging I figured out that indeed USE_HOST_PTR is used if the provided host pointer is non-const. However, in the case I am looking at the pointer is const, and the data is then copied.

This would make fixing my original implementation, so that it also retains the optimization for FPGAs require a large refactor of runtime.

I think an alternative and simpler solution can be to have read-only allocas so thay can use const host pointer without a copy. If the memory needs to be written a new allocation will be created. I am force-pushing this implementation.

@t4c1 t4c1 force-pushed the optimize_transfers branch from 9754892 to f46f79d Compare July 5, 2022 14:44
Copy link
Contributor

@sergey-semenov sergey-semenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the late review, just minor comments. Please update the branch to resolve merge conflicts.

sergey-semenov
sergey-semenov previously approved these changes Aug 2, 2022
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, just a few mostly stylistic comments. Would it be possible to test this, for example with SYCL_PI_TRACE?

@t4c1
Copy link
Contributor Author

t4c1 commented Aug 10, 2022

Would it be possible to test this, for example with SYCL_PI_TRACE?

I don't think so. The problem is that we would expect a different trace depending on whether the device supports host-unified memory. And I don't think we can make FIleCheck conditional on runtime variable.

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pvchupin pvchupin merged commit 92d35cd into intel:sycl Aug 11, 2022
@pvchupin
Copy link
Contributor

@t4c1, looks like basic_tests/min_max_test.cpp fails on Windows after this change, can you take a look?
https://github.com/intel/llvm/runs/7778244379?check_suite_focus=true

@t4c1
Copy link
Contributor Author

t4c1 commented Aug 11, 2022

I have some trouble with my windows build at the moment, so I can not reproduce it, but I am pretty sure the failure is unrelated to this PR. I think it was caused by #6541 and the fix will be something like https://github.com/intel/llvm/pull/6172/files.

@steffenlarsen
Copy link
Contributor

I have some trouble with my windows build at the moment, so I can not reproduce it, but I am pretty sure the failure is unrelated to this PR. I think it was caused by #6541 and the fix will be something like https://github.com/intel/llvm/pull/6172/files.

Agreed. Failure does seem to come from #6541. I am looking into it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants