[amdgpu] Implement D2D memcpy as HSA call #69955


Closed
39 changes: 32 additions & 7 deletions openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
```diff
@@ -2250,14 +2250,37 @@ struct AMDGPUDeviceTy : public GenericDeviceTy, AMDGenericDeviceTy {
                                PinnedMemoryManager);
   }
 
-  /// Exchange data between two devices within the plugin. This function is not
-  /// supported in this plugin.
+  /// Exchange data between two devices within the plugin.
   Error dataExchangeImpl(const void *SrcPtr, GenericDeviceTy &DstGenericDevice,
                          void *DstPtr, int64_t Size,
                          AsyncInfoWrapperTy &AsyncInfoWrapper) override {
```
Contributor

We have AsyncInfoWrapper here; my understanding is that these already contain a "stream" which is tied to an HSA signal. Maybe @jdoerfert can expand on this, but my understanding would be that we'd call getStream or something.

Collaborator Author

Maybe. This summon-a-signal-from-scratch approach was copied from another memcpy call. It should be slower, simpler and more robust than splicing it into the stream machinery, so better to land this version, see that it goes through the testing, and then see whether integration with the stream works.

Member

The fact that AsyncInfoWrapper is unused is a strong indicator that something is off. The code below is blocking; it should not be blocking. Instead of creating the signal, use the stream in AsyncInfoWrapper, something like the pinned case above:

```cpp
AMDGPUStreamTy *Stream = nullptr;
void *PinnedPtr = nullptr;
// Use one-step asynchronous operation when host memory is already pinned.
if (void *PinnedPtr =
        PinnedAllocs.getDeviceAccessiblePtrFromPinnedBuffer(HstPtr)) {
  if (auto Err = getStream(AsyncInfoWrapper, Stream))
    return Err;
  return Stream->pushPinnedMemoryCopyAsync(PinnedPtr, TgtPtr, Size);
}
```

Collaborator Author

The memory copies are handwritten in each direction, so it's not a matter of pulling a signal from a stream and calling an existing function. It's implementable but not trivial.

I'd like to land this blocking version first in order to establish that the underlying HSA calls behave as they're supposed to via the internal testing infrastructure, before adding the additional complication of wiring it into the signals.

Especially so as I am far from convinced that signals work robustly between GPUs, given they're on different HSA queues.

Collaborator Author

The streams approach in #69977 seems to work equivalently to this one.

```diff
-    // This function should never be called because the function
-    // AMDGPUPluginTy::isDataExchangable() returns false.
-    return Plugin::error("dataExchangeImpl not supported");
+    AMDGPUDeviceTy &DstDevice = static_cast<AMDGPUDeviceTy &>(DstGenericDevice);
+
+    hsa_agent_t SrcAgent = getAgent();
+    hsa_agent_t DstAgent = DstDevice.getAgent();
+
+    AMDGPUSignalTy Signal;
+    if (auto Err = Signal.init())
+      return Err;
+
+    // The agents need to have access to the corresponding memory
+    // This is presently only true if the pointers were originally
+    // allocated by this runtime or the caller made the appropriate
+    // access calls.
+    hsa_status_t Status = hsa_amd_memory_async_copy(
+        DstPtr, DstAgent, SrcPtr, SrcAgent, (Size > 0) ? (size_t)Size : 0, 0,
+        nullptr, Signal.get());
+    if (auto Err =
+            Plugin::check(Status, "Error in D2D hsa_amd_memory_async_copy: %s"))
+      return Err;
+
+    if (auto Err = Signal.wait(getStreamBusyWaitMicroseconds()))
+      return Err;
+
+    if (auto Err = Signal.deinit())
+      return Err;
+
+    return Plugin::success();
   }
```

/// Initialize the async info for interoperability purposes.
@@ -2899,7 +2922,7 @@ struct AMDGPUPluginTy final : public GenericPluginTy {

```diff
   /// This plugin does not support exchanging data between two devices.
   bool isDataExchangable(int32_t SrcDeviceId, int32_t DstDeviceId) override {
-    return false;
+    return true;
   }
```

/// Get the host device instance.
@@ -3174,9 +3197,11 @@ void *AMDGPUDeviceTy::allocate(size_t Size, void *, TargetAllocTy Kind) {
```diff
     return nullptr;
   }
 
-  if (Alloc && (Kind == TARGET_ALLOC_HOST || Kind == TARGET_ALLOC_SHARED)) {
+  if (Alloc) {
     auto &KernelAgents = Plugin::get<AMDGPUPluginTy>().getKernelAgents();
 
+    // Inherently necessary for host or shared allocations
+    // Also enabled for device memory to allow device to device memcpy
```
Member

Is this costly if unused?

Collaborator Author

Unknown. I'd guess it's cheap if there's only one GPU in the system and has non-zero cost if there are multiple GPUs that do not perform D2D copies. However, the alternative is to work backwards from the void* to a corresponding memory pool using maps, and that's definitely non-zero cost as well.

```diff
     // Enable all kernel agents to access the buffer.
     if (auto Err = MemoryPool->enableAccess(Alloc, Size, KernelAgents)) {
       REPORT("%s\n", toString(std::move(Err)).data());
```