Skip to content

[SYCL] discard_write optimization #2854

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 7, 2020

Conversation

cperkinsintel
Copy link
Contributor

When using discard_write with a host accessor, we do not need to keep the buffer updated with the latest changes on the device. Skipping that operation results in a significant speed-up.

When we have host unified memory, then Map/Unmap operations are enqueued. As these must always match, it is up to the backend (or its plugin) to take advantage of any optimization. In the case of Level Zero, this can be be done in the SYCL plugin interface code. This PR includes a fix for LevelZero.
CUDA requires no change. OpenCL will need to be changed in the library itself.

When we are not using host unified memory, instead of Map/Unmap operations, we enqueue basic Read/Write ops. When using discard_write on the host, the Read op is unnecessary. This PR schedules an empty operation instead.

…used, we do not need to perform a memcpy. Can result in significant peformance improvement when using discard_write access mode flag

Signed-off-by: Chris Perkins <[email protected]>
…d write membuffer operations are enqueued instead of Map/Unmap. But the read is unecessary. This avoids the redundant op.
@cperkinsintel cperkinsintel changed the title [SYCL] discard_write optimization -- DRAFT [SYCL] discard_write optimization Dec 3, 2020
@cperkinsintel cperkinsintel marked this pull request as ready for review December 3, 2020 20:31
@cperkinsintel cperkinsintel requested review from smaslov-intel and a team as code owners December 3, 2020 20:31
Signed-off-by: Chris Perkins <[email protected]>
MQueue, MDstReq.MDims, MDstReq.MMemoryRange, MDstReq.MAccessRange,
MDstReq.MOffset, MDstReq.MElemSize, std::move(RawEvents), Event);
}
MemoryManager::copy(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CUDA requires no change. OpenCL will need to be changed in the library itself.

Does it mean that this unconditional copy is expected to slow down OpenCL execution currently?
And why CUDA does not need a change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe my commit comment could be clearer. There are two sides: unified memory and not.

This particular line of code you are commenting on is for when we do not have unified memory. In that case, the underlying backend doesn't really matter. SYCL itself is scheduling individual mem read/writes to maintain the coherency, and this optimization will avoid the needless mem read.

The comment about CUDA requires no change is in the context of when there is unified memory. Then we are scheduling paired map/unmap operations. Any optimization will have to be performed by the backend (or it's PI interface). In the case of Level Zero, this PR adds the optimization to its PI interface. In the case of CUDA, it looks like the PI interface is already performing the optimization (and microbenchmarks confirm that). For OpenCL, the CL_MAP_WRITE_INVALIDATE_REGION flag is being passed by the PI to the OpenCL plugin, but no optimization seems to be occurring (as tested by a simple benchmark).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of CUDA, it looks like the PI interface is already performing the optimization (and microbenchmarks confirm that)

Could you double check and spot it in the CUDA PI plugin sources?

For OpenCL, the CL_MAP_WRITE_INVALIDATE_REGION flag is being passed by the PI to the OpenCL plugin, but no optimization seems to be occurring (as tested by a simple benchmark).

This is strange that OpenCL doesn't optimize this, are you going to follow up with OpenCL team?

Copy link
Contributor Author

@cperkinsintel cperkinsintel Dec 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you double check and spot it in the CUDA PI plugin sources?

It's at

CL_MAP_WRITE_INVALIDATE_REGION))) {
. with a supporting comment, but I never step traced it. But the benchmark confirms the difference.

This is strange that OpenCL doesn't optimize this, are you going to follow up with OpenCL team?

Agreed. That is the plan.

Copy link
Contributor

@smaslov-intel smaslov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bader bader merged commit 6733c8b into intel:sycl Dec 7, 2020
@sergey-semenov
Copy link
Contributor

@cperkinsintel Please, add a test for the read/write part of the change as a separate PR.

jsji added a commit that referenced this pull request Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants