[OpenMP][OMPT] Add OMPT callback for device data exchange 'Device-to-Device' #81991

Merged
merged 1 commit into from
Feb 26, 2024

mhalk
Contributor

@mhalk mhalk commented Feb 16, 2024

Since the ompt_target_data_op_t enum has no ompt_target_data_transfer_tofrom_device value, or anything else that conveys the meaning of an inter-device data exchange, we decided to indicate a Device-to-Device transfer by using: optype == ompt_target_data_transfer_from_device (=3)

Hence, a device-to-device transfer may be identified, e.g., by checking for:
(optype == 3) &&
(src_device_num < omp_get_num_devices()) &&
(dest_device_num < omp_get_num_devices())
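
As an illustration only, the check described above could be wrapped in a small helper inside a tool. This is a minimal sketch, not code from this PR: the helper name `is_device_to_device` and the local constant are hypothetical, the constant merely mirrors the value 3 of ompt_target_data_transfer_from_device, and `num_devices` is assumed to be the result of omp_get_num_devices() passed in by the caller. The extra lower-bound check is a defensive addition that is not part of the condition stated above.

```c
#include <stdbool.h>

/* Hypothetical local mirror of ompt_target_data_transfer_from_device (=3). */
enum { DATA_TRANSFER_FROM_DEVICE = 3 };

/*
 * Returns true if a data-op callback describes a device-to-device transfer
 * under the convention of this PR. num_devices is assumed to be the value of
 * omp_get_num_devices(); the initial (host) device is assumed to have a
 * device number that is not below that value.
 */
bool is_device_to_device(int optype, int src_device_num, int dest_device_num,
                         int num_devices) {
    return optype == DATA_TRANSFER_FROM_DEVICE &&
           /* Defensive addition: reject negative device numbers such as -1. */
           src_device_num >= 0 && dest_device_num >= 0 &&
           src_device_num < num_devices &&
           dest_device_num < num_devices;
}
```

With a single GPU (omp_get_num_devices() == 1), a callback with optype=3, src_device_num=0 and dest_device_num=0 would be classified as device-to-device, while optype=3 with dest_device_num=1 (the initial device) would remain a plain device-to-host transfer.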

Fixes: #66478

@mhalk mhalk requested review from jplehr and dhruvachak February 16, 2024 13:38
@llvmbot llvmbot added the openmp:libomptarget OpenMP offload runtime label Feb 16, 2024
@mhalk mhalk force-pushed the fix/ompt_llvm_66478 branch 2 times, most recently from a4ed787 to 5fd8e50 Compare February 16, 2024 20:22
@mhalk
Contributor Author

mhalk commented Feb 16, 2024

Removed the assertions on target_task_data as well as src_addr and dest_addr, since these may take various values depending on the usage. For example, omp_target_memcpy will cause target_task_data=(nil) and both addresses to be (nil), which makes checking cumbersome and the tests brittle.
Hence, in coordination with @dhruvachak, we decided to remove the corresponding assertions.

@jplehr
Contributor

jplehr commented Feb 16, 2024

> Since there's no ompt_target_data_transfer_tofrom_device (within ompt_target_data_op_t enum) or something other that conveys the meaning of inter-device data exchange we decided to indicate a Device-to-Device transfer by using: optype == ompt_target_data_transfer_from_device (=3)
>
> Hence, a device transfer may be identified e.g. by checking for: (optype == 3) && (src_device_num < omp_get_num_devices()) && (dest_device_num < omp_get_num_devices())
>
> Fixes: #66478

Do we have a place to document this decision other than this PR / commit message?

@Thyre
Contributor

Thyre commented Feb 17, 2024

I can confirm that this pull request also works for NVIDIA GPUs, though I wasn't able to test it with multiple accelerators due to build issues on our HPC machines.

$ clang --version
clang version 19.0.0git (https://github.com/llvm/llvm-project.git 5fd8e50feff94dac7e741b07c956622b7c25bc6a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jreuter/Projects/Compilers/llvm-project/_build/_install/bin
$ clang -fopenmp --offload-arch=native reproducer.c
$ ./a.out
Callback Init: device_num=0 type=sm_75 device=0x55d75ee03a40 lookup=0x7fb3518ebb50 doc=(nil)
Allocating memory on device
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000001) src=(nil) src_device_num=1 dest=(nil) dest_device_num=0 bytes=4 code=0x55d75ce76853
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000001) src=(nil) src_device_num=1 dest=0x7fb325a00000 dest_device_num=0 bytes=4 code=0x55d75ce76853
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000002) src=(nil) src_device_num=1 dest=(nil) dest_device_num=0 bytes=4 code=0x55d75ce76864
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000002) src=(nil) src_device_num=1 dest=0x7fb325a00200 dest_device_num=0 bytes=4 code=0x55d75ce76864
Testing host to device
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000003) src=0x55d75f66b200 src_device_num=1 dest=0x7fb325a00000 dest_device_num=0 bytes=4 code=0x55d75ce768ca
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000003) src=0x55d75f66b200 src_device_num=1 dest=0x7fb325a00000 dest_device_num=0 bytes=4 code=0x55d75ce768ca
Testing device to device
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000004) src=0x7fb325a00000 src_device_num=0 dest=0x7fb325a00200 dest_device_num=0 bytes=4 code=0x55d75ce768fc
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000004) src=0x7fb325a00000 src_device_num=0 dest=0x7fb325a00200 dest_device_num=0 bytes=4 code=0x55d75ce768fc
Testing device to host
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000005) src=0x7fb325a00200 src_device_num=0 dest=0x55d75f66b200 dest_device_num=1 bytes=4 code=0x55d75ce76942
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000005) src=0x7fb325a00200 src_device_num=0 dest=0x55d75f66b200 dest_device_num=1 bytes=4 code=0x55d75ce76942
Checking correctness
Freeing memory on device
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000006) src=0x7fb325a00000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x55d75ce769a4
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000006) src=0x7fb325a00000 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x55d75ce769a4
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000007) src=0x7fb325a00200 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x55d75ce769b0
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=(nil) (0x0) target_data=0x7fb3516787d0 (0x0) host_op_id=0x7fb3516787c8 (0x8000000000000007) src=0x7fb325a00200 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x55d75ce769b0
Callback Fini: device_num=0

x86_64 still reports two transfers even though LIBOMPTARGET_DEBUG shows omptarget --> copy from device to device.

$ clang --version
clang version 19.0.0git (https://github.com/llvm/llvm-project.git 5fd8e50feff94dac7e741b07c956622b7c25bc6a)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jreuter/Projects/Compilers/llvm-project/_build/_install/bin
$ clang -fopenmp -fopenmp-targets=x86_64 reproducer.c
$ ./a.out
Callback Init: device_num=0 type=generic-64bit device=0x5644e0820950 lookup=0x7fb3d245bb50 doc=(nil)
Callback Init: device_num=1 type=generic-64bit device=0x5644e0821380 lookup=0x7fb3d245bb50 doc=(nil)
Callback Init: device_num=2 type=generic-64bit device=0x5644e08219a0 lookup=0x7fb3d245bb50 doc=(nil)
Callback Init: device_num=3 type=generic-64bit device=0x5644e08221d0 lookup=0x7fb3d245bb50 doc=(nil)
Allocating memory on device
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000001) src=(nil) src_device_num=4 dest=(nil) dest_device_num=0 bytes=4 code=0x5644dee0c853
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000001) src=(nil) src_device_num=4 dest=0x5644e0820790 dest_device_num=0 bytes=4 code=0x5644dee0c853
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000002) src=(nil) src_device_num=4 dest=(nil) dest_device_num=1 bytes=4 code=0x5644dee0c864
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000002) src=(nil) src_device_num=4 dest=0x5644e07fafb0 dest_device_num=1 bytes=4 code=0x5644dee0c864
Testing host to device
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000003) src=0x5644e081b990 src_device_num=4 dest=0x5644e0820790 dest_device_num=0 bytes=4 code=0x5644dee0c8ca
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000003) src=0x5644e081b990 src_device_num=4 dest=0x5644e0820790 dest_device_num=0 bytes=4 code=0x5644dee0c8ca
Testing device to device
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000004) src=0x5644e0820790 src_device_num=0 dest=0x5644e0820880 dest_device_num=4 bytes=4 code=0x5644dee0c8fc
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000004) src=0x5644e0820790 src_device_num=0 dest=0x5644e0820880 dest_device_num=4 bytes=4 code=0x5644dee0c8fc
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000005) src=0x5644e0820880 src_device_num=4 dest=0x5644e07fafb0 dest_device_num=1 bytes=4 code=0x5644dee0c8fc
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000005) src=0x5644e0820880 src_device_num=4 dest=0x5644e07fafb0 dest_device_num=1 bytes=4 code=0x5644dee0c8fc
Testing device to host
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000006) src=0x5644e07fafb0 src_device_num=1 dest=0x5644e081b990 dest_device_num=4 bytes=4 code=0x5644dee0c942
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000006) src=0x5644e07fafb0 src_device_num=1 dest=0x5644e081b990 dest_device_num=4 bytes=4 code=0x5644dee0c942
Checking correctness
Freeing memory on device
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000007) src=0x5644e0820790 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x5644dee0c9a4
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000007) src=0x5644e0820790 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x5644dee0c9a4
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000008) src=0x5644e07fafb0 src_device_num=1 dest=(nil) dest_device_num=-1 bytes=0 code=0x5644dee0c9b0
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=(nil) (0x0) target_data=0x7fb3d22367d0 (0x0) host_op_id=0x7fb3d22367c8 (0x8000000000000008) src=0x5644e07fafb0 src_device_num=1 dest=(nil) dest_device_num=-1 bytes=0 code=0x5644dee0c9b0
Callback Fini: device_num=0
Callback Fini: device_num=1
Callback Fini: device_num=2
Callback Fini: device_num=3
Output with LIBOMPTARGET_DEBUG:
omptarget --> Init offload library!
OMPT --> Entering connectLibrary (libomp)
OMPT --> OMPT: Trying to load library libomp.so
OMPT --> OMPT: Trying to get address of connection routine ompt_libomp_connect
OMPT --> OMPT: Library connection handle = 0x7efc564dca90
omptarget --> Call to omp_get_num_devices returning 0
OMPT --> Executing initializeLibrary (libomp)
OMPT --> initializeLibrary (libomp) bound lookupCallbackByCode=0x7efc564dd710
OMPT --> initializeLibrary (libomp) bound ompt_get_task_data_fn=0x7efc564de020
OMPT --> initializeLibrary (libomp) bound ompt_get_target_task_data_fn=0x7efc564de060
OMPT --> Exiting connectLibrary (libomp)
omptarget --> Loading RTLs...
omptarget --> Attempting to load library 'libomptarget.rtl.x86_64.so'...
omptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
OMPT --> OMPT: Entering connectLibrary (libomptarget)
OMPT --> OMPT: Trying to load library libomptarget.so
OMPT --> OMPT: Trying to get address of connection routine ompt_libomptarget_connect
OMPT --> OMPT: Library connection handle = 0x7efc5640cac0
OMPT --> Enter ompt_libomptarget_connect
OMPT --> OMPT: Executing initializeLibrary (libomptarget)
OMPT --> OMPT: initializeLibrary (libomptarget) bound lookupCallbackByCode=0x7efc564dd710
OMPT --> Leave ompt_libomptarget_connect
OMPT --> OMPT: Exiting connectLibrary (libomptarget)
omptarget --> Registered 'libomptarget.rtl.x86_64.so' with 4 plugin visible devices!
omptarget --> Attempting to load library 'libomptarget.rtl.cuda.so'...
omptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
TARGET CUDA RTL --> Implementing cuInit with dlsym(cuInit) -> 0x7efc4e8c1ec0
TARGET CUDA RTL --> Implementing cuCtxGetDevice with dlsym(cuCtxGetDevice) -> 0x7efc4e8c9b50
TARGET CUDA RTL --> Implementing cuDeviceGet with dlsym(cuDeviceGet) -> 0x7efc4e8c1f00
TARGET CUDA RTL --> Implementing cuDeviceGetAttribute with dlsym(cuDeviceGetAttribute) -> 0x7efc4e8c2000
TARGET CUDA RTL --> Implementing cuDeviceGetCount with dlsym(cuDeviceGetCount) -> 0x7efc4e8c1f20
TARGET CUDA RTL --> Implementing cuFuncGetAttribute with dlsym(cuFuncGetAttribute) -> 0x7efc4e8f6f90
TARGET CUDA RTL --> Implementing cuDeviceGetName with dlsym(cuDeviceGetName) -> 0x7efc4e8c1f40
TARGET CUDA RTL --> Implementing cuDeviceTotalMem with dlsym(cuDeviceTotalMem) -> 0x7efc4e918b80
TARGET CUDA RTL --> Implementing cuDriverGetVersion with dlsym(cuDriverGetVersion) -> 0x7efc4e8c1ee0
TARGET CUDA RTL --> Implementing cuGetErrorString with dlsym(cuGetErrorString) -> 0x7efc4e8c1e80
TARGET CUDA RTL --> Implementing cuLaunchKernel with dlsym(cuLaunchKernel) -> 0x7efc4e92f100
TARGET CUDA RTL --> Implementing cuMemAlloc with dlsym(cuMemAlloc_v2) -> 0x7efc4e8d4f80
TARGET CUDA RTL --> Implementing cuMemAllocHost with dlsym(cuMemAllocHost) -> 0x7efc4e918c80
TARGET CUDA RTL --> Implementing cuMemAllocManaged with dlsym(cuMemAllocManaged) -> 0x7efc4e8d50a0
TARGET CUDA RTL --> Implementing cuMemAllocAsync with dlsym(cuMemAllocAsync) -> 0x7efc4e93a9c0
TARGET CUDA RTL --> Implementing cuMemcpyDtoDAsync with dlsym(cuMemcpyDtoDAsync_v2) -> 0x7efc4e9241c0
TARGET CUDA RTL --> Implementing cuMemcpyDtoH with dlsym(cuMemcpyDtoH_v2) -> 0x7efc4e924000
TARGET CUDA RTL --> Implementing cuMemcpyDtoHAsync with dlsym(cuMemcpyDtoHAsync_v2) -> 0x7efc4e9241a0
TARGET CUDA RTL --> Implementing cuMemcpyHtoD with dlsym(cuMemcpyHtoD_v2) -> 0x7efc4e923fe0
TARGET CUDA RTL --> Implementing cuMemcpyHtoDAsync with dlsym(cuMemcpyHtoDAsync_v2) -> 0x7efc4e924180
TARGET CUDA RTL --> Implementing cuMemFree with dlsym(cuMemFree_v2) -> 0x7efc4e8d4fc0
TARGET CUDA RTL --> Implementing cuMemFreeHost with dlsym(cuMemFreeHost) -> 0x7efc4e8d5020
TARGET CUDA RTL --> Implementing cuMemFreeAsync with dlsym(cuMemFreeAsync) -> 0x7efc4e93a9a0
TARGET CUDA RTL --> Implementing cuModuleGetFunction with dlsym(cuModuleGetFunction) -> 0x7efc4e8c9e30
TARGET CUDA RTL --> Implementing cuModuleGetGlobal with dlsym(cuModuleGetGlobal_v2) -> 0x7efc4e8c9e50
TARGET CUDA RTL --> Implementing cuModuleUnload with dlsym(cuModuleUnload) -> 0x7efc4e8c9df0
TARGET CUDA RTL --> Implementing cuStreamCreate with dlsym(cuStreamCreate) -> 0x7efc4e8ebc60
TARGET CUDA RTL --> Implementing cuStreamDestroy with dlsym(cuStreamDestroy_v2) -> 0x7efc4e8ebe80
TARGET CUDA RTL --> Implementing cuStreamSynchronize with dlsym(cuStreamSynchronize) -> 0x7efc4e92f0a0
TARGET CUDA RTL --> Implementing cuStreamQuery with dlsym(cuStreamQuery) -> 0x7efc4e924560
TARGET CUDA RTL --> Implementing cuCtxSetCurrent with dlsym(cuCtxSetCurrent) -> 0x7efc4e8c9b10
TARGET CUDA RTL --> Implementing cuDevicePrimaryCtxRelease with dlsym(cuDevicePrimaryCtxRelease_v2) -> 0x7efc4e8c99f0
TARGET CUDA RTL --> Implementing cuDevicePrimaryCtxGetState with dlsym(cuDevicePrimaryCtxGetState) -> 0x7efc4e8c9a30
TARGET CUDA RTL --> Implementing cuDevicePrimaryCtxSetFlags with dlsym(cuDevicePrimaryCtxSetFlags_v2) -> 0x7efc4e8c9a10
TARGET CUDA RTL --> Implementing cuDevicePrimaryCtxRetain with dlsym(cuDevicePrimaryCtxRetain) -> 0x7efc4e8c99d0
TARGET CUDA RTL --> Implementing cuModuleLoadDataEx with dlsym(cuModuleLoadDataEx) -> 0x7efc4e8c9db0
TARGET CUDA RTL --> Implementing cuDeviceCanAccessPeer with dlsym(cuDeviceCanAccessPeer) -> 0x7efc4e90d8c0
TARGET CUDA RTL --> Implementing cuCtxEnablePeerAccess with dlsym(cuCtxEnablePeerAccess) -> 0x7efc4e90d8e0
TARGET CUDA RTL --> Implementing cuMemcpyPeerAsync with dlsym(cuMemcpyPeerAsync) -> 0x7efc4e924350
TARGET CUDA RTL --> Implementing cuCtxGetLimit with dlsym(cuCtxGetLimit) -> 0x7efc4e8c9c10
TARGET CUDA RTL --> Implementing cuCtxSetLimit with dlsym(cuCtxSetLimit) -> 0x7efc4e8c9bf0
TARGET CUDA RTL --> Implementing cuEventCreate with dlsym(cuEventCreate) -> 0x7efc4e8ebf00
TARGET CUDA RTL --> Implementing cuEventRecord with dlsym(cuEventRecord) -> 0x7efc4e92f0c0
TARGET CUDA RTL --> Implementing cuStreamWaitEvent with dlsym(cuStreamWaitEvent) -> 0x7efc4e924500
TARGET CUDA RTL --> Implementing cuEventSynchronize with dlsym(cuEventSynchronize) -> 0x7efc4e8ebf80
TARGET CUDA RTL --> Implementing cuEventDestroy with dlsym(cuEventDestroy) -> 0x7efc4e923f60
OMPT --> OMPT: Entering connectLibrary (libomptarget)
OMPT --> OMPT: Trying to load library libomptarget.so
OMPT --> OMPT: Trying to get address of connection routine ompt_libomptarget_connect
OMPT --> OMPT: Library connection handle = 0x7efc5640cac0
OMPT --> Enter ompt_libomptarget_connect
OMPT --> OMPT: Executing initializeLibrary (libomptarget)
OMPT --> OMPT: initializeLibrary (libomptarget) bound lookupCallbackByCode=0x7efc564dd710
OMPT --> Leave ompt_libomptarget_connect
OMPT --> OMPT: Exiting connectLibrary (libomptarget)
omptarget --> Registered 'libomptarget.rtl.cuda.so' with 1 plugin visible devices!
omptarget --> Attempting to load library 'libomptarget.rtl.amdgpu.so'...
omptarget --> Successfully loaded library 'libomptarget.rtl.amdgpu.so'!
TARGET AMDGPU RTL --> Unable to load library 'libhsa-runtime64.so': libhsa-runtime64.so: cannot open shared object file: No such file or directory!
TARGET AMDGPU RTL --> Failed to initialize AMDGPU's HSA library
omptarget --> No devices supported in this RTL
omptarget --> RTLs loaded!
omptarget --> Image 0x000055c1f4f146e0 is compatible with RTL libomptarget.rtl.x86_64.so!
PluginInterface --> OMPT: class bound ompt_callback_device_initialize=0x55c1f4f133d0
PluginInterface --> OMPT: class bound ompt_callback_device_finalize=0x55c1f4f13420
PluginInterface --> OMPT: class bound ompt_callback_device_load=0x55c1f4f13450
PluginInterface --> OMPT: class bound ompt_callback_device_unload=(nil)
Callback Init: device_num=0 type=generic-64bit device=0x55c1f5997a30 lookup=0x7efc564dcb50 doc=(nil)
PluginInterface --> OMPT: class bound ompt_callback_device_initialize=0x55c1f4f133d0
PluginInterface --> OMPT: class bound ompt_callback_device_finalize=0x55c1f4f13420
PluginInterface --> OMPT: class bound ompt_callback_device_load=0x55c1f4f13450
PluginInterface --> OMPT: class bound ompt_callback_device_unload=(nil)
Callback Init: device_num=1 type=generic-64bit device=0x55c1f5998460 lookup=0x7efc564dcb50 doc=(nil)
PluginInterface --> OMPT: class bound ompt_callback_device_initialize=0x55c1f4f133d0
PluginInterface --> OMPT: class bound ompt_callback_device_finalize=0x55c1f4f13420
PluginInterface --> OMPT: class bound ompt_callback_device_load=0x55c1f4f13450
PluginInterface --> OMPT: class bound ompt_callback_device_unload=(nil)
Callback Init: device_num=2 type=generic-64bit device=0x55c1f5998a80 lookup=0x7efc564dcb50 doc=(nil)
PluginInterface --> OMPT: class bound ompt_callback_device_initialize=0x55c1f4f133d0
PluginInterface --> OMPT: class bound ompt_callback_device_finalize=0x55c1f4f13420
PluginInterface --> OMPT: class bound ompt_callback_device_load=0x55c1f4f13450
PluginInterface --> OMPT: class bound ompt_callback_device_unload=(nil)
Callback Init: device_num=3 type=generic-64bit device=0x55c1f59992b0 lookup=0x7efc564dcb50 doc=(nil)
omptarget --> Plugin adaptor 0x000055c1f59720d0 has index 0, exposes 4 out of 4 devices!
omptarget --> Registering image 0x000055c1f4f146e0 with RTL libomptarget.rtl.x86_64.so!
omptarget --> Done registering entries!
omptarget --> Call to omp_get_num_devices returning 4
Allocating memory on device
omptarget --> Call to omp_target_alloc for device 0 requesting 4 bytes
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
OMPT --> in ompt_target_region_begin (TargetRegionId = 0)
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000001) src=(nil) src_device_num=4 dest=(nil) dest_device_num=0 bytes=4 code=0x55c1f4f13853
PluginInterface --> MemoryManagerTy::allocate: size 4 with host pointer 0x0000000000000000.
PluginInterface --> findBucket: Size 4 is floored to 4.
PluginInterface --> findBucket: Size 4 goes to bucket 0
PluginInterface --> Cannot find a node in the FreeLists. Allocate on device.
PluginInterface --> Node address 0x000055c1f5997910, target pointer 0x000055c1f59977a0, size 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000001) src=(nil) src_device_num=4 dest=0x55c1f59977a0 dest_device_num=0 bytes=4 code=0x55c1f4f13853
OMPT --> in ompt_target_region_end (TargetRegionId = 0)
omptarget --> omp_target_alloc returns device ptr 0x000055c1f59977a0
omptarget --> Call to omp_target_alloc for device 1 requesting 4 bytes
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
OMPT --> in ompt_target_region_begin (TargetRegionId = 0)
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
  Callback DataOp EMI: endpoint=1 optype=1 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000002) src=(nil) src_device_num=4 dest=(nil) dest_device_num=1 bytes=4 code=0x55c1f4f13864
PluginInterface --> MemoryManagerTy::allocate: size 4 with host pointer 0x0000000000000000.
PluginInterface --> findBucket: Size 4 is floored to 4.
PluginInterface --> findBucket: Size 4 goes to bucket 0
PluginInterface --> Cannot find a node in the FreeLists. Allocate on device.
PluginInterface --> Node address 0x000055c1f5997940, target pointer 0x000055c1f5971fb0, size 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
  Callback DataOp EMI: endpoint=2 optype=1 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000002) src=(nil) src_device_num=4 dest=0x55c1f5971fb0 dest_device_num=1 bytes=4 code=0x55c1f4f13864
OMPT --> in ompt_target_region_end (TargetRegionId = 0)
omptarget --> omp_target_alloc returns device ptr 0x000055c1f5971fb0
Testing host to device
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_target_memcpy, dst device 0, src device 4, dst addr 0x000055c1f59977a0, src addr 0x000055c1f5992980, dst offset 0, src offset 0, length 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
omptarget --> copy from host to device
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
OMPT --> in ompt_target_region_begin (TargetRegionId = 0)
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000003) src=0x55c1f5992980 src_device_num=4 dest=0x55c1f59977a0 dest_device_num=0 bytes=4 code=0x55c1f4f138ca
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000003) src=0x55c1f5992980 src_device_num=4 dest=0x55c1f59977a0 dest_device_num=0 bytes=4 code=0x55c1f4f138ca
OMPT --> in ompt_target_region_end (TargetRegionId = 0)
omptarget --> omp_target_memcpy returns 0
Testing device to device
omptarget --> Call to omp_target_memcpy, dst device 1, src device 0, dst addr 0x000055c1f5971fb0, src addr 0x000055c1f59977a0, dst offset 0, src offset 0, length 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
omptarget --> copy from device to device
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
OMPT --> in ompt_target_region_begin (TargetRegionId = 0)
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000004) src=0x55c1f59977a0 src_device_num=0 dest=0x55c1f5997890 dest_device_num=4 bytes=4 code=0x55c1f4f138fc
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000004) src=0x55c1f59977a0 src_device_num=0 dest=0x55c1f5997890 dest_device_num=4 bytes=4 code=0x55c1f4f138fc
OMPT --> in ompt_target_region_end (TargetRegionId = 0)
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
OMPT --> in ompt_target_region_begin (TargetRegionId = 0)
  Callback DataOp EMI: endpoint=1 optype=2 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000005) src=0x55c1f5997890 src_device_num=4 dest=0x55c1f5971fb0 dest_device_num=1 bytes=4 code=0x55c1f4f138fc
  Callback DataOp EMI: endpoint=2 optype=2 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000005) src=0x55c1f5997890 src_device_num=4 dest=0x55c1f5971fb0 dest_device_num=1 bytes=4 code=0x55c1f4f138fc
OMPT --> in ompt_target_region_end (TargetRegionId = 0)
omptarget --> omp_target_memcpy returns 0
Testing device to host
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_target_memcpy, dst device 4, src device 1, dst addr 0x000055c1f5992980, src addr 0x000055c1f5971fb0, dst offset 0, src offset 0, length 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
omptarget --> copy from device to host
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
OMPT --> in ompt_target_region_begin (TargetRegionId = 0)
  Callback DataOp EMI: endpoint=1 optype=3 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000006) src=0x55c1f5971fb0 src_device_num=1 dest=0x55c1f5992980 dest_device_num=4 bytes=4 code=0x55c1f4f13942
  Callback DataOp EMI: endpoint=2 optype=3 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000006) src=0x55c1f5971fb0 src_device_num=1 dest=0x55c1f5992980 dest_device_num=4 bytes=4 code=0x55c1f4f13942
OMPT --> in ompt_target_region_end (TargetRegionId = 0)
omptarget --> omp_target_memcpy returns 0
Checking correctness
Freeing memory on device
omptarget --> Call to omp_target_free for device 0 and address 0x000055c1f59977a0
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
OMPT --> in ompt_target_region_begin (TargetRegionId = 0)
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000007) src=0x55c1f59977a0 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x55c1f4f139a4
PluginInterface --> MemoryManagerTy::free: target memory 0x000055c1f59977a0.
PluginInterface --> findBucket: Size 4 is floored to 4.
PluginInterface --> findBucket: Size 4 goes to bucket 0
PluginInterface --> Found its node 0x000055c1f5997910. Insert it to bucket 0.
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000007) src=0x55c1f59977a0 src_device_num=0 dest=(nil) dest_device_num=-1 bytes=0 code=0x55c1f4f139a4
OMPT --> in ompt_target_region_end (TargetRegionId = 0)
omptarget --> omp_target_free deallocated device ptr
omptarget --> Call to omp_target_free for device 1 and address 0x000055c1f5971fb0
omptarget --> Call to omp_get_num_devices returning 4
omptarget --> Call to omp_get_initial_device returning 4
OMPT --> in ompt_target_region_begin (TargetRegionId = 0)
  Callback DataOp EMI: endpoint=1 optype=4 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000008) src=0x55c1f5971fb0 src_device_num=1 dest=(nil) dest_device_num=-1 bytes=0 code=0x55c1f4f139b0
PluginInterface --> MemoryManagerTy::free: target memory 0x000055c1f5971fb0.
PluginInterface --> findBucket: Size 4 is floored to 4.
PluginInterface --> findBucket: Size 4 goes to bucket 0
PluginInterface --> Found its node 0x000055c1f5997940. Insert it to bucket 0.
  Callback DataOp EMI: endpoint=2 optype=4 target_task_data=(nil) (0x0) target_data=0x7efc562697d0 (0x0) host_op_id=0x7efc562697c8 (0x8000000000000008) src=0x55c1f5971fb0 src_device_num=1 dest=(nil) dest_device_num=-1 bytes=0 code=0x55c1f4f139b0
OMPT --> in ompt_target_region_end (TargetRegionId = 0)
omptarget --> omp_target_free deallocated device ptr
omptarget --> Unloading target library!
omptarget --> Unregistered image 0x000055c1f4f146e0 from RTL 0x000055c1f59720d0!
omptarget --> Done unregistering images!
omptarget --> Removing translation table for descriptor 0x000055c1f4f14660
omptarget --> Done unregistering library!
omptarget --> Deinit offload library!
OMPT --> Executing finalizeLibrary (libomp)
OMPT --> OMPT: Executing finalizeLibrary (libomptarget)
OMPT --> OMPT: Executing finalizeLibrary (libomptarget)
Callback Fini: device_num=0
Callback Fini: device_num=1
Callback Fini: device_num=2
Callback Fini: device_num=3

@mhalk
Contributor Author

mhalk commented Feb 19, 2024

> Do we have a place to document this decision other than this PR / commit message?

No, currently not. (But now that I think of it, we should add it to another easily accessible doc.)
The first place that comes to my mind is the (TargetDataExchange)RAII instantiation: Here.

@mhalk mhalk force-pushed the fix/ompt_llvm_66478 branch from 5fd8e50 to 029ed0a Compare February 19, 2024 11:11
@mhalk
Contributor Author

mhalk commented Feb 19, 2024

x86_64 still reports two transfers even though LIBOMPTARGET_DEBUG shows omptarget --> copy from device to device.

Just reproduced this behavior within the testcase and I'll discuss this shortly, but my guess is that the x86_64 plugin simply handles the 'data exchange' differently (splitting it into a retrieve+submit pair via the initial device).
My take is: changing that particular behavior is beyond the scope of this PR and should be discussed beforehand.

Observing x86_64 callbacks seems like a very peculiar use-case to me TBH, especially since there is no accurate representation of the actually present / 'real' devices AFAIK. My assumption is that the latter is also the reason why we take the 'retrieve+submit' route via the 'real' (initial) device.

@Thyre
Contributor

Thyre commented Feb 19, 2024

x86_64 still reports two transfers even though LIBOMPTARGET_DEBUG shows omptarget --> copy from device to device.

Just reproduced this behavior within the testcase and I'll discuss this shortly, but my guess is that the x86_64 plugin simply handles the 'data exchange' differently (splitting it into a retrieve+submit pair via the initial device). My take is: changing that particular behavior is beyond the scope of this PR and should be discussed beforehand.

Observing x86_64 callbacks seems like a very peculiar use-case to me TBH, especially since there is no accurate representation of the actually present / 'real' devices AFAIK. My assumption is that the latter is also the reason why we take the 'retrieve+submit' route via the 'real' (initial) device.

Just wanted to bring it up, so that it is known 😄
I also agree that changing the behavior is outside of the PR scope and probably also doesn't have a high priority. I mean, if the runtime is splitting the transfer into two transfers, then it's correct that we see that with OMPT.

I also agree that observing x86_64 callbacks for target callbacks is only somewhat useful. We will see that there are events, but the existence of four devices for example may confuse people. In addition, we will see the actual host computation via other callbacks anyway.
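For tool authors who want to recognize the split case discussed above, here is a minimal sketch of a pairing heuristic. It assumes a tool stores each data-op callback in a record; the `data_op_event` struct, the `is_split_exchange` name, and the pairing rule itself are hypothetical and not part of libomptarget. The two constants are local stand-ins for the `ompt_target_data_op_t` values (`from_device == 3` as stated in this PR, `to_device == 2` per the OpenMP specification).

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the two ompt_target_data_op_t values used here. */
enum { XFER_TO_DEVICE = 2, XFER_FROM_DEVICE = 3 };

/* Hypothetical record of one data-op callback, as a tool might store it. */
typedef struct {
  int optype;
  int src_device_num;
  int dest_device_num;
  size_t bytes;
} data_op_event;

/* Heuristic (an assumption, not documented runtime behavior): two consecutive
 * events form one split exchange if the first retrieves from a device to the
 * initial device and the second submits the same number of bytes from the
 * initial device to a device. */
static int is_split_exchange(const data_op_event *first,
                             const data_op_event *second,
                             int initial_device) {
  assert(first != NULL && second != NULL);
  return first->optype == XFER_FROM_DEVICE &&
         first->src_device_num != initial_device &&
         first->dest_device_num == initial_device &&
         second->optype == XFER_TO_DEVICE &&
         second->src_device_num == initial_device &&
         second->dest_device_num != initial_device &&
         first->bytes == second->bytes;
}
```

With four devices (initial device number 4, as in the log above), a retrieve from device 0 followed by a submit of the same size to device 1 would match, while a single direct device-to-device event would not.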

} else if (ompt_callback_target_data_op_fn) {
// HostOpId is set by the runtime
HostOpId = createOpId();
// Invoke the tool supplied data op callback
ompt_callback_target_data_op_fn(
TargetData.value, HostOpId, ompt_target_data_transfer_from_device,
TgtPtrBegin, DeviceId, HstPtrBegin,
/*TgtDeviceNum=*/omp_get_initial_device(), Size, Code);
DstPtrBegin, DstDeviceId, SrcPtrBegin, SrcDeviceId, Size, Code);
Contributor

This seems reversed. Shouldn't it be
SrcPtrBegin, SrcDeviceId, DstPtrBegin, DstDeviceId
?

Contributor Author

@mhalk mhalk Feb 22, 2024

Very good catch! (I think this got reversed twice.)
edit: This also inspired me to add 'HOST' and 'DEVICE' captures to the related non-EMI test of yours.
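To illustrate the argument order in question, here is a small sketch of a tool-side handler that takes the source address and device first, then the destination address and device, matching the order of `ompt_callback_target_data_op_t` in the OpenMP specification. The real callback additionally receives `target_id`, `host_op_id` and `codeptr_ra`, which are omitted here; `transfer_record`, `last_transfer` and `on_data_op` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tool-side record; field names are illustrative only. */
typedef struct {
  const void *src_addr;
  int src_device_num;
  const void *dest_addr;
  int dest_device_num;
  size_t bytes;
} transfer_record;

static transfer_record last_transfer;

/* Simplified handler: source address and device come first, followed by the
 * destination address and device, then the byte count. */
static void on_data_op(int optype, void *src_addr, int src_device_num,
                       void *dest_addr, int dest_device_num, size_t bytes) {
  (void)optype; /* not needed for this illustration */
  last_transfer.src_addr = src_addr;
  last_transfer.src_device_num = src_device_num;
  last_transfer.dest_addr = dest_addr;
  last_transfer.dest_device_num = dest_device_num;
  last_transfer.bytes = bytes;
}
```

If the runtime passed destination before source, a tool recording events this way would silently swap the two endpoints, which is why a test that captures both sides is useful.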

OMPT_IF_BUILT(
InterfaceRAII TargetDataExchangeRAII(
RegionInterface.getCallbacks<ompt_target_data_transfer_from_device>(),
DeviceID, SrcPtr, DstDev.RTLDeviceID, DstPtr, Size,
Contributor

Use RTLDeviceID instead of DeviceID in the callback since that's what is used in the actual RTL->data_exchange invocation?

Contributor Author

Sure, might be less confusing.

Contributor

@jplehr jplehr left a comment

No more comments from my side on this one. I'll let @dhruvachak accept once he is happy with it.

int Host = omp_get_initial_device();

printf("Allocating Memory on Device\n");
int *DevPtr = (int *)omp_target_alloc(sizeof(int), Device);
Contributor

Check that DevPtr is not null.

Contributor Author

Done. Will add an assert().

*HstPtr = 42;

printf("Testing: Host to Device\n");
omp_target_memcpy(DevPtr, HstPtr, sizeof(int), 0, 0, Device, Host);
Contributor

Check the return value of all calls to omp_target_memcpy. Otherwise, if it fails, the host value could remain 42 but the program still failed.

Contributor Author

Will do! Let me know if asserts work for you (if not, let me know what you'd prefer).
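As a rough sketch of the two checks requested in this review (a null check after `omp_target_alloc` and a return-value check after each `omp_target_memcpy`), a round trip could look like the following. This is not the test from the PR; `roundtrip_int` is a hypothetical helper, and the `#else` branch contains host-only stand-ins (an assumption added purely so the sketch also builds without `-fopenmp`).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* ASSUMPTION: host-only stand-ins so the sketch builds without -fopenmp.
 * They only imitate the parts of the OpenMP API used below. */
static int omp_get_initial_device(void) { return 0; }
static int omp_get_default_device(void) { return 0; }
static void *omp_target_alloc(size_t size, int device_num) {
  (void)device_num;
  return malloc(size);
}
static void omp_target_free(void *ptr, int device_num) {
  (void)device_num;
  free(ptr);
}
static int omp_target_memcpy(void *dst, const void *src, size_t length,
                             size_t dst_offset, size_t src_offset,
                             int dst_device_num, int src_device_num) {
  (void)dst_device_num;
  (void)src_device_num;
  memcpy((char *)dst + dst_offset, (const char *)src + src_offset, length);
  return 0;
}
#endif

/* Copies Value to the default device and back.
 * Returns 0 on success, 1 if the allocation failed, 2 if a copy failed. */
static int roundtrip_int(int Value, int *Result) {
  assert(Result != NULL);
  int Device = omp_get_default_device();
  int Host = omp_get_initial_device();

  int *DevPtr = (int *)omp_target_alloc(sizeof(int), Device);
  if (DevPtr == NULL)
    return 1; /* allocation failed; nothing to free */

  int HstIn = Value;
  int HstOut = 0;
  /* Host to device; omp_target_memcpy returns zero on success. */
  int Rc = omp_target_memcpy(DevPtr, &HstIn, sizeof(int), 0, 0, Device, Host);
  if (Rc == 0) /* Device to host, only if the first copy succeeded. */
    Rc = omp_target_memcpy(&HstOut, DevPtr, sizeof(int), 0, 0, Host, Device);

  omp_target_free(DevPtr, Device);
  if (Rc != 0)
    return 2;
  *Result = HstOut;
  return 0;
}
```

Returning distinct error codes instead of asserting inside the helper lets the caller decide whether a failure should abort; a test could equally wrap each call in `assert()` as proposed above.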

@mhalk mhalk force-pushed the fix/ompt_llvm_66478 branch from 029ed0a to abeb1ae Compare February 22, 2024 13:32
@mhalk mhalk requested a review from dhruvachak February 22, 2024 13:33
…Device'

Since there's no `ompt_target_data_transfer_tofrom_device` (within
ompt_target_data_op_t enum) or something other that conveys the meaning of
inter-device data exchange we decided to indicate a Device-to-Device transfer
by using: optype == ompt_target_data_transfer_from_device (=3)

Hence, a device transfer may be identified e.g. by checking for:
(optype == 3) &&
(src_device_num < omp_get_num_devices()) &&
(dest_device_num < omp_get_num_devices())

Fixes: llvm#66478
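
A minimal sketch of the check described in the commit message, written as a standalone predicate. The device count is passed in as a parameter (a tool would typically obtain it from `omp_get_num_devices()`), and the lower-bound guards are an addition not in the commit message, included because the log in this thread shows `dest_device_num=-1` for non-transfer operations. `is_device_to_device` and `TRANSFER_FROM_DEVICE` are hypothetical names.

```c
#include <assert.h>

/* Local stand-in for ompt_target_data_transfer_from_device (= 3). */
enum { TRANSFER_FROM_DEVICE = 3 };

/* Returns nonzero if the event looks like a device-to-device transfer:
 * optype is "from device" and both endpoints are real device numbers,
 * i.e. neither is the initial (host) device. */
static int is_device_to_device(int optype, int src_device_num,
                               int dest_device_num, int num_devices) {
  assert(num_devices >= 0);
  return optype == TRANSFER_FROM_DEVICE &&
         src_device_num >= 0 && src_device_num < num_devices &&
         dest_device_num >= 0 && dest_device_num < num_devices;
}
```

With four devices, a transfer from device 0 to device 1 satisfies the predicate, while a retrieve to the initial device (number 4) does not.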
@mhalk mhalk force-pushed the fix/ompt_llvm_66478 branch from abeb1ae to 8d7ac0b Compare February 23, 2024 12:35
Contributor

@dhruvachak dhruvachak left a comment

LGTM.

@mhalk mhalk merged commit e521752 into llvm:main Feb 26, 2024
@mhalk mhalk deleted the fix/ompt_llvm_66478 branch April 2, 2024 16:15
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Apr 6, 2024
…Device' (llvm#81991)
Change-Id: I4c382ee61a05102c7ffc6de9b765e072f6386f11
Successfully merging this pull request may close these issues.

[OMPT] data_op[_emi] callback is not or incorrectly dispatched on device to device operations depending on vendor
5 participants