Check if malloc_device return nullptr in matmul code #1493
Merged: ndgrigorian merged 1 commit into feature/matmul-vecdot-tensordot from check-memory-allocation on Jan 11, 2024.
Conversation
View rendered docs @ https://intelpython.github.io/dpctl/pulls/1493/index.html
Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_72 ran successfully.
Deleted rendered PR docs from intelpython.github.io/dpctl; latest should be updated shortly. 🤞
ndgrigorian added a commit that referenced this pull request on Jan 23, 2024:
…1490)

* Adds ThreeOffsets_CombinedIndexer. This enables strided data processing by gemm kernels.
* Remove unused `elementwise_functions.cpp`.
* Implements `matmul`, `vecdot`, and `tensordot`. These three functions are implemented through a common `py_dot` binding, which is also part of a new tensor submodule, `_tensor_linalg_impl`.
* Tweaks to `matmul` and gemm kernels. Fixes a missing indexer in the gemm functor with threading along the `nm` dimensions, and fixes `matmul` broadcasting, which was broadcasting in some unnecessary cases.
* Remove double-counting of the batch offset in gemm batch tree reduction.
* Fixes missing dependency in `vecdot`. When the first argument would not be cast and the second argument would be, the copy dependency was not appended to the list of dependencies, creating race conditions.
* Run test_matmul_simple2 on Windows before the full test suite. Part of triaging crashes on Windows.
* Test removing test_matmul_simple, leaving only test_matmul_simple2.
* Fix incorrect comments throughout gemm kernels. Comments incorrectly stated that the third argument to `scale_gemm_k_parameters` is modified by reference.
* Drastically reduced parameters used for gemm kernels which thread over k. Experimental change to see if this stabilizes CI.
* Test removal of the k-threading gemm kernel which writes to multiple outputs atomically.
* Refactors `gemm_tree_impl`. Now uses two smaller functions, `gemm_tree_k_impl` and `gemm_tree_nm_impl`, for greater readability.
* Reverse the order of numeric types passed to test_matmul_simple2. May improve stability on CPU.
* Refactors `gemm_contig_tree_impl`. `gemm_contig_tree_impl` now calls new functions `gemm_contig_tree_k_impl` and `gemm_contig_tree_nm_impl`.
* Refactoring `gemm_batch_tree` functions. Adds new functions for calling `nm`-threading and `k`-threading kernels to improve readability.
* Test reversing data types for `test_matmul_strided`.
* pre-commit fixes in `gemm.hpp`.
* Check if malloc_device returns nullptr (#1493).
* Add a step to the Linux conda package workflow to run `test_matmul_strided` under gdb. Part of triaging CPU crashes.
* Remove unnecessary comments.
* Adds a fast path for empty (k = 0) gemm kernels.
* Adds logic that avoids certain kernels on CPU that are known to be problematic. Specifically, always avoids paths which would call k-threaded functors on CPU with m_groups > 1.
* Access memory only if indices are in range. This prevents the out-of-bounds accesses that were responsible for crashes observed in CI (see the sketch after this list).
* Simplified computation of m_id/gr_id in kernels. No need to use both it.get_global_linear_id() and it.get_group_linear_id() to compute the batch id and group id.
* Change generic kernels to work for any value of m_groups, not just m_groups = 2.
* Remove work-arounds/special-casing for CPUs.
* Extended test_matmul_strided, reverted work-arounds.
* Revert remaining gemm work-arounds. This removes the remaining checks for whether a kernel is called on CPU, and reverts hyperparameters for gemm kernels to their original values.
* Revert tuning down of `gemm` kernel parameters.
* Removed logically dead code from _linear_algebra_functions.py.
* Added more tests to improve coverage of _linear_algebra_functions.
* Fixed "UnboundLocalError: local variable 'buf1_dt' referenced before assignment". Initialized buf1_dt and buf2_dt to None.
* More tests to improve coverage.
* Removed more dead branches in _linear_algebra_functions.py.
* `tensordot` now properly handles negative `axes`. As per the array API, negative axes are not permitted.
* Adds `test_tensordot_type_matrix` to `test_usm_ndarray_linalg.py`.
* Addresses flaws in gemm tree kernel logic. Previously, assertions for calling a full tree reduction with only a single work-group of elements could be tripped; the kernel logic has been changed so that this is no longer possible.
* Implements `__matmul__`, `__imatmul__`, and `__rmatmul__` operators for usm_ndarray.
* Makes usm_ndarray operator argument names consistent.
* Test changes for `tensordot`. Adds a test for axes errors in `tensordot` for negative axes, and incorporates a test for `tensordot` promotion of both inputs into `test_tensordot_type_promotion`.
* Reverts running certain `matmul` tests under gdb.
* Fix typo in `test_tensordot_promotion`.
* Removes unnecessary input type checks in `matmul`.
* More tests added to `test_usm_linalg.py`. Adds several tests for `matmul` and expands some `tensordot` and `vecdot` tests to improve coverage.
* Use result_type with tensors to take device capability into account.
* Use the order keyword in the test of type promotion for matmul.
* Make generic k-threaded kernels handle arbitrary m_groups. Also increases hyperparameters for k-threaded kernels to improve performance.
* Adjusted dispatch logic for gemm kernels. Now uses m_groups = 4 when m > 4, and otherwise m_groups = 1, to improve performance.

---------
Co-authored-by: Oleksandr Pavlyk <[email protected]>
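The in-range guard called out in the list above follows a standard SYCL pattern: when a kernel's global range is rounded up to a multiple of the work-group size, the trailing work-items must skip their loads and stores. The following is a minimal sketch of that pattern under stated assumptions, not the actual dpctl kernel; the function name, the `data` buffer, and `n` are illustrative:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical illustration: guard every access with an index check so that
// work-items in the rounded-up ND-range never touch out-of-bounds memory.
void scale_in_place(sycl::queue &q, float *data, size_t n)
{
    constexpr size_t wg_size = 256;
    // Round the global range up to a whole number of work-groups.
    const size_t n_groups = (n + wg_size - 1) / wg_size;

    q.submit([&](sycl::handler &cgh) {
         cgh.parallel_for(
             sycl::nd_range<1>{n_groups * wg_size, wg_size},
             [=](sycl::nd_item<1> it) {
                 const size_t i = it.get_global_linear_id();
                 if (i < n) { // access memory only if the index is in range
                     data[i] *= 2.0f;
                 }
             });
     }).wait();
}
```

Without the `i < n` check, up to `wg_size - 1` work-items per launch would read and write past the end of the buffer, which is exactly the kind of out-of-bounds access the commit above describes as the cause of the CI crashes.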
Check that `sycl::malloc_device` did not return `nullptr` to signal allocation failure.
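Unlike `operator new`, `sycl::malloc_device` reports USM allocation failure by returning `nullptr` rather than throwing, so the result has to be checked before the pointer is handed to a kernel. Below is a minimal sketch of this kind of guard; the function and variable names are illustrative assumptions, not the actual dpctl code:

```cpp
#include <sycl/sycl.hpp>
#include <new> // std::bad_alloc

// Hypothetical illustration of the guard this PR adds: sycl::malloc_device
// signals allocation failure by returning nullptr, which must be checked
// before the pointer is used in a kernel.
float *allocate_scratch(sycl::queue &q, size_t count)
{
    float *tmp = sycl::malloc_device<float>(count, q);
    if (tmp == nullptr) {
        // Surface the failure instead of letting kernels dereference nullptr.
        throw std::bad_alloc();
    }
    return tmp; // caller releases with sycl::free(tmp, q)
}
```

Throwing `std::bad_alloc` here is one reasonable policy; the essential point is that the `nullptr` result is detected at the allocation site rather than surfacing later as a crash inside a kernel.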