Fix segfault in cu_memory_provider_get_last_native_error() #1183

@ldorau ldorau commented Mar 12, 2025

Description

Fix segfault in cu_memory_provider_get_last_native_error() when it is called after a CUDA device was destroyed.

Checklist

  • Code compiles without errors locally
  • All tests pass locally
  • CI workflows execute properly

@ldorau ldorau requested a review from a team as a code owner March 12, 2025 12:53
ldorau commented Mar 12, 2025

@bratpiorka @lukaszstolarczuk it is required for the CUDA fix: intel/llvm#17411

        strncpy(TLS_last_native_error.msg_buff, error_name,
                TLS_MSG_BUF_LEN - 1);
    } else {
        strncpy(TLS_last_native_error.msg_buff, "cuGetErrorName() failed",
Contributor

nit: can we extend this error message, and/or add a comment, to say that this can happen if CUDA has already been destroyed?

The documentation for these functions says that they never return NULL, so this might be misleading.

@ldorau ldorau Mar 12, 2025

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__ERROR.html#group__CUDA__ERROR_1g2c4ac087113652bb3d1f95bf2513c468

If the error code is not recognized, CUDA_ERROR_INVALID_VALUE will be returned and *pStr will be set to the NULL address.
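
For illustration, the guard discussed in this thread can be sketched in plain C without a CUDA dependency. This is only a sketch under stated assumptions, not the code merged in this PR: `fake_get_error_name()` is a hypothetical stand-in that mimics the documented behavior quoted above (a nonzero status and a NULL string for an unrecognized code), and `MSG_BUF_LEN` and `store_error_name()` are illustrative names rather than identifiers from the repository.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer size; stands in for TLS_MSG_BUF_LEN. */
#define MSG_BUF_LEN 64

/* Hypothetical stand-in for cuGetErrorName(): returns 0 and sets *name on
 * success; returns nonzero and sets *name to NULL for an unrecognized code. */
static int fake_get_error_name(int code, const char **name) {
    if (code == 2) {
        *name = "CUDA_ERROR_OUT_OF_MEMORY";
        return 0;
    }
    *name = NULL;
    return 1;
}

/* Copy the error name into buf, falling back to a fixed message when the
 * lookup fails or yields NULL, so strncpy() never receives a NULL source. */
static void store_error_name(int code, char *buf, size_t len) {
    const char *name = NULL;
    if (fake_get_error_name(code, &name) == 0 && name != NULL) {
        strncpy(buf, name, len - 1);
    } else {
        strncpy(buf, "cuGetErrorName() failed", len - 1);
    }
    /* strncpy() does not terminate the string when the source is too long. */
    buf[len - 1] = '\0';
}
```

Checking both the return status and the pointer is deliberately redundant here: either check alone would avoid passing NULL to `strncpy()` in this sketch, but testing both keeps the fallback path safe even if the lookup reports success without setting the string.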

Contributor Author

Done

@ldorau ldorau force-pushed the Fix_segfault_in_cu_memory_provider_get_last_native_error branch from 6596da2 to 19aba45 Compare March 13, 2025 07:41
Fix segfault in cu_memory_provider_get_last_native_error()
when it is called after a CUDA device was destroyed.

Signed-off-by: Lukasz Dorau <[email protected]>
@ldorau ldorau force-pushed the Fix_segfault_in_cu_memory_provider_get_last_native_error branch from 19aba45 to 7173cc5 Compare March 13, 2025 07:42
@KFilipek KFilipek merged commit 998debe into oneapi-src:main Mar 13, 2025
159 of 160 checks passed
@ldorau ldorau deleted the Fix_segfault_in_cu_memory_provider_get_last_native_error branch March 13, 2025 09:03