[SYCL] discard_write optimization #2854
Conversation
…used, we do not need to perform a memcpy. This can result in a significant performance improvement when using the discard_write access mode flag. Signed-off-by: Chris Perkins <[email protected]>
…d write membuffer operations are enqueued instead of Map/Unmap. But the read is unnecessary. This avoids the redundant op.
Signed-off-by: Chris Perkins <[email protected]>
MQueue, MDstReq.MDims, MDstReq.MMemoryRange, MDstReq.MAccessRange,
MDstReq.MOffset, MDstReq.MElemSize, std::move(RawEvents), Event);
}
MemoryManager::copy(
CUDA requires no change. OpenCL will need to be changed in the library itself.
Does it mean that this unconditional copy is expected to slow down OpenCL execution currently?
And why does CUDA not need a change?
Maybe my commit comment could be clearer. There are two sides: unified memory and not.
This particular line of code you are commenting on is for when we do not have unified memory. In that case, the underlying backend doesn't really matter. SYCL itself is scheduling individual mem read/writes to maintain the coherency, and this optimization will avoid the needless mem read.
The comment that CUDA requires no change applies when there is unified memory. In that case we are scheduling paired map/unmap operations, so any optimization has to be performed by the backend (or its PI interface). For Level Zero, this PR adds the optimization to its PI interface. For CUDA, it looks like the PI interface is already performing the optimization (and microbenchmarks confirm that). For OpenCL, the CL_MAP_WRITE_INVALIDATE_REGION flag is being passed by the PI to the OpenCL plugin, but no optimization seems to be occurring (as tested by a simple benchmark).
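To make the non-unified-memory case concrete, here is a minimal sketch of the decision being described (illustrative C++ only; `access_mode` and `needs_device_to_host_copy` are hypothetical names for this comment, not the actual DPC++ scheduler API):

```cpp
#include <cassert>

// Hypothetical names for illustration; not the actual DPC++ scheduler API.
enum class access_mode { read, write, read_write, discard_write, discard_read_write };

// A host accessor needs the buffer's current device contents only when the
// access mode can observe them. For the discard modes the previous contents
// are dead, so the device-to-host read can be skipped entirely.
bool needs_device_to_host_copy(access_mode m) {
  return m != access_mode::discard_write &&
         m != access_mode::discard_read_write;
}
```

The point is simply that the scheduler, not the backend, decides whether to enqueue the read in this path, so the optimization works the same on every backend.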
In the case of CUDA, it looks like the PI interface is already performing the optimization (and microbenchmarks confirm that)
Could you double check and spot it in the CUDA PI plugin sources?
For OpenCL, the CL_MAP_WRITE_INVALIDATE_REGION flag is being passed by the PI to the OpenCL plugin, but no optimization seems to be occurring (as tested by a simple benchmark).
It is strange that OpenCL doesn't optimize this; are you going to follow up with the OpenCL team?
Could you double check and spot it in the CUDA PI plugin sources?
It's at llvm/sycl/plugins/cuda/pi_cuda.cpp, line 4120 (commit 6733c8b):

CL_MAP_WRITE_INVALIDATE_REGION))) {
It is strange that OpenCL doesn't optimize this; are you going to follow up with the OpenCL team?
Agreed. That is the plan.
LGTM
@cperkinsintel Please add a test for the read/write part of the change as a separate PR.
When using discard_write with a host accessor, we do not need to keep the buffer updated with the latest changes on the device. Skipping that operation results in a significant speed-up.

When we have host unified memory, Map/Unmap operations are enqueued. As these must always match, it is up to the backend (or its plugin) to take advantage of any optimization. In the case of Level Zero, this can be done in the SYCL plugin interface code. This PR includes a fix for Level Zero.
CUDA requires no change. OpenCL will need to be changed in the library itself.
When we are not using host unified memory, instead of Map/Unmap operations, we enqueue basic Read/Write ops. When using discard_write on the host, the Read op is unnecessary. This PR schedules an empty operation instead.
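For the unified-memory path, a minimal sketch of how a runtime might pick map flags from the access mode (the flag values mirror the OpenCL cl_map_flags bitfield; `map_flags_for` is a hypothetical helper for illustration, not code from this PR):

```cpp
#include <cassert>
#include <cstdint>

// Flag values mirror the OpenCL cl_map_flags bitfield.
constexpr std::uint64_t MAP_READ                    = 1u << 0; // CL_MAP_READ
constexpr std::uint64_t MAP_WRITE                   = 1u << 1; // CL_MAP_WRITE
constexpr std::uint64_t MAP_WRITE_INVALIDATE_REGION = 1u << 2; // CL_MAP_WRITE_INVALIDATE_REGION

enum class access_mode { read, write, read_write, discard_write };

// Hypothetical helper: for discard_write the runtime can request
// WRITE_INVALIDATE_REGION, which tells the backend it may map the region
// without transferring the buffer's current contents to the host.
std::uint64_t map_flags_for(access_mode m) {
  switch (m) {
  case access_mode::read:          return MAP_READ;
  case access_mode::write:         // partial writes must preserve untouched bytes
  case access_mode::read_write:    return MAP_READ | MAP_WRITE;
  case access_mode::discard_write: return MAP_WRITE_INVALIDATE_REGION;
  }
  return 0;
}
```

Whether passing the invalidate flag actually avoids the transfer is then up to the backend, which is why Level Zero needed the plugin change while CUDA's PI already handled it.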