Releases: IntelPython/dpctl
v0.20.1
v0.20.0
This release achieves compliance of dpctl.tensor with the Python Array API 2024.12 standard.
The dpctl namespace has also received a number of new features, including new Python classes dpctl.LocalAccessor, dpctl.WorkGroupMemory, and dpctl.RawKernelArg to be used as kernel argument types, support for peer access between dpctl.SyclDevice instances, and support for composite Level Zero devices.
Added
- Added dpctl.WorkGroupMemory class representing sycl::ext::oneapi::experimental::work_group_memory, to be used as a kernel argument type gh-1984
- Added dpctl.LocalAccessor class representing sycl::local_accessor, to be used as a kernel argument type gh-1991
- Added dpctl.SyclPlatform.get_devices method for getting all dpctl.SyclDevice objects for the platform gh-1992
- Added support for the composite devices extension for Level Zero devices, usable with some devices when setting ZE_FLAT_DEVICE_HIERARCHY=COMBINED gh-1993
- Added out keyword to tensor.take gh-2010
- Added dpctl.RawKernelArg class representing sycl::ext::oneapi::experimental::raw_kernel_arg, to be used as a kernel argument type gh-2038
- Added dpctl.SyclDevice methods for querying, enabling, and disabling peer access between devices gh-2077, gh-2082
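As a rough illustration of the composite-device entry above: the environment variable is safest set before the SYCL runtime is loaded, which in practice means before dpctl is imported. The sketch below assumes dpctl is installed and that a Level Zero device supporting the extension is present; the import is guarded so the snippet still runs without dpctl. The helper name list_device_names is hypothetical.

```python
import os

# Set before importing dpctl so the Level Zero runtime sees the value.
os.environ["ZE_FLAT_DEVICE_HIERARCHY"] = "COMBINED"

try:
    import dpctl
except ImportError:  # dpctl is not installed; nothing to enumerate
    dpctl = None


def list_device_names():
    """Return the names of devices visible to dpctl (empty without dpctl)."""
    if dpctl is None:
        return []
    return [dev.name for dev in dpctl.get_devices()]


print(list_device_names())
```

Whether composite devices actually show up depends on the hardware and driver; the release notes only say the setting is usable with some devices.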
Changed
- Updated Level Zero loader detection to no longer rely on reading libur_adapter_level_zero.so for the loader filename gh-2025
- Updated integer array indexing to align with the 2024.12 array API specification gh-2032
- Support for Boolean data-type is added to dpctl.tensor.ceil, dpctl.tensor.floor, and dpctl.tensor.trunc gh-2033
- Changed implementation of DPCTLPlatform_GetDefaultContext from using deprecated ext_oneapi_get_default_context to khr_get_default_context gh-2042
- Updated supported array API specification version to 2024.12 gh-2047
- Implementation struct for tensor.imag now uses a static member value for the imaginary part of real-valued inputs gh-2063
- Updated repr to show the shape of the abbreviated arrays and show the shape and data type of zero-size arrays gh-2067
- Changed tensor.__array_namespace_info__().capabilities()["max dimensions"] to None gh-2071
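Code that consumes the capabilities dictionary may need to handle the new None value. The following is a minimal sketch, assuming None means that no fixed upper limit on the number of dimensions is advertised; the helper ndim_supported and the sample dictionaries are hypothetical and not part of dpctl.

```python
def ndim_supported(capabilities, ndim):
    """Check ndim against the "max dimensions" capability.

    Assumption: None means no fixed upper limit is advertised.
    """
    max_dims = capabilities.get("max dimensions")
    return max_dims is None or ndim <= max_dims


# Hypothetical capability dictionaries, for illustration only.
print(ndim_supported({"max dimensions": None}, 80))
print(ndim_supported({"max dimensions": 64}, 80))
```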
Fixed
- Refactored code common to accumulation operations (dpt.cumulative_sum, dpt.cumulative_prod, dpt.cumulative_logsumexp) and removed unnecessary event initialization gh-2011
- Fixed incorrect results for dpt.cumulative_sum and dpt.cumulative_prod when dtype=dpt.bool gh-2018
- Fixed a typo in dpctl.SyclPlatform repr gh-2035
- Fixed a bug in tensor.asarray where order="K" could fail to produce an array sufficient for the internal copy operation for some edge cases, including a contiguous array with permuted dimensions gh-2058
- Fixed a typo in dpctl.memory.USMAllocationError gh-2072
Maintenance
- Document dpctl.device_type, dpctl.backend_type, dpctl.event_status_type, and dpctl.global_mem_cache_type enums gh-2019
- Updated SYCL_INCLUDE_DIR_HINT in Conda recipe gh-2039
- Updated expected dtypes in element-wise function docstrings gh-2041, gh-2048
- Set ARRAY_API_TESTS_VERSION=2024.12 when running array API conformity job in CI gh-2046
- Install hwloc when running CI job for nightly SYCL compiler gh-2050
- Added cython-lint to pre-commit to improve style and readability of Cython code gh-2056
- Skip upload jobs when GitHub CI is called from a forked repo gh-2059
- Disable nightly tests run from forked repos gh-2060
- Fixed a typo in beginner's guide example gh-2061
- Updated bandit version gh-2075
- Updated Conda installation instructions gh-2080, gh-2081
- Fixed an incorrect link to changelog in package metadata gh-2085
- Miscellaneous changes to continuous integration/delivery (CI/CD) supporting scripts gh-2020, gh-2034, gh-2043, gh-2044, gh-2065, gh-2066, gh-2068, gh-2070
New Contributors
- @jharlow-intel made their first contribution in #2054
- @david-cortes-intel made their first contribution in #2080
v0.19.0
This release features official, out-of-the-box support for compiling dpctl for specified AMD GPU architectures, the addition of new function tensor.top_k, a radix-sort-based implementation of sorting functions, and improvements to interoperability with DLPack through tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice.
A number of adjustments were also made to improve performance of dpctl reductions (e.g., sum, min, max), accumulators (e.g., cumulative_sum, cumulative_logsumexp), and copy-and-cast operations.
Added
- Support for compiling dpctl for specified AMD GPU architecture with use of CodePlay oneAPI plug-in gh-1731
- Added tensor.top_k per Python Array API specification gh-1921
- Added functions tensor.dldevice_to_sycl_device and tensor.sycl_device_to_dldevice for converting between DLPack and sycl devices, and a method get_device_id to dpctl.SyclDevice to improve interoperability with DLPack protocol gh-1953
- Added DPCTL_OFFLOAD_COMPRESS cmake option (set to OFF by default) to toggle --offload-compress linker option when building dpctl gh-1961
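To make the top-k operation concrete, here is a minimal pure-Python sketch of what such a function returns for a 1-D input: the k extreme values together with their positions. It is not dpctl's implementation, and the function signature and the largest parameter are hypothetical.

```python
import heapq


def top_k(values, k, largest=True):
    """Return (values, indices) of the k largest (or smallest) items."""
    if not 0 <= k <= len(values):
        raise ValueError("k must be between 0 and len(values)")
    pick = heapq.nlargest if largest else heapq.nsmallest
    # Pair each value with its position so the indices survive selection.
    pairs = pick(k, enumerate(values), key=lambda p: p[1])
    indices = [i for i, _ in pairs]
    return [values[i] for i in indices], indices


vals, idx = top_k([3, 9, 1, 7, 5], 2)
print(vals, idx)
```

Returning indices alongside values lets a caller gather matching entries from other arrays without a second search.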
Changed
- Improved performance of copy-and-cast operations from numpy.ndarray to tensor.usm_ndarray for contiguous inputs gh-1829
- py_sort and py_argsort now throw py::value_error if inputs are not C-contiguous gh-1838
- Improved performance of copying operation to C-/F-contig array, with optimization for batch of square matrices gh-1850
- Improved performance of tensor.argsort function for all types gh-1859
- Improved performance of tensor.sort and tensor.argsort for short arrays in the range [16, 64] elements gh-1866
- Implemented radix sort algorithm to be used in dpt.sort and dpt.argsort gh-1867, gh-1883
- Extended dpctl.SyclTimer with device_timer keyword, implementing different methods of collecting device times gh-1872
- dpctl changed to see GPU devices out of the box in virtual environment on Windows gh-1922
- Improved performance of tensor.cumulative_sum, tensor.cumulative_prod, tensor.cumulative_logsumexp as well as performance of boolean indexing gh-1923, gh-1942
- Improved performance of tensor.min, tensor.max, tensor.logsumexp, tensor.reduce_hypot for floating point type arrays by at least 2x gh-1932, gh-1937
- Updated Cython examples to use scikit-build gh-1935
- Reduced binary size of _tensor_accumulation_impl by 13 MB gh-1957
- Extended tensor.asarray to support objects that implement __usm_ndarray__ property to be interpreted as usm_ndarray objects gh-1959
- tensor.usm_ndarray object disallows implicit conversions to NumPy array gh-1964
- stream arguments in tensor.usm_ndarray methods now raise an error if stream is not a tensor.SyclQueue gh-1969
- dpctl initialization sets subprocess to use SPAWN method on Linux to enable gdb-oneapi to debug kernels submitted from Python applications gh-1971
- Reduced binary size of _tensor_elementwise_impl gh-1976
- Allow dpctl.SyclQueue.memcpy to and from multi-dimensional buffers gh-1985
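For readers unfamiliar with the radix sort mentioned in the list above, the general technique can be sketched in a few lines of pure Python. This is a sequential least-significant-digit variant for non-negative integers, shown only to illustrate the algorithm; dpctl's device implementation is a parallel SYCL kernel and may differ in every detail. The parameter bits_per_pass is an illustrative choice.

```python
def radix_sort(keys, bits_per_pass=8):
    """Stable LSD radix sort of non-negative integers."""
    if any(k < 0 for k in keys):
        raise ValueError("this sketch handles non-negative integers only")
    out = list(keys)
    if not out:
        return out
    radix = 1 << bits_per_pass
    mask = radix - 1
    max_key = max(out)
    shift = 0
    while (max_key >> shift) > 0:
        # Bucket keys by the current digit; appending preserves the
        # relative order of equal digits, which keeps each pass stable.
        buckets = [[] for _ in range(radix)]
        for k in out:
            buckets[(k >> shift) & mask].append(k)
        out = [k for bucket in buckets for k in bucket]
        shift += bits_per_pass
    return out


print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```

The number of passes depends on the key width rather than on comparisons between keys, which is one reason the approach suits fixed-width numeric types.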
Fixed
- Fixed a bug in tensor.roll for very large values of shift gh-1869
- Fix for tensor.result_type when all inputs are Python built-in scalars gh-1877
- Improved error in constructors tensor.full and tensor.full_like when provided a non-numeric fill value gh-1878
- Added a check for pointer alignment when copying to C-contiguous memory gh-1890, gh-1891
- Fixed dpctl installed into virtual environment not finding DPC++ runtime libraries by adding DPCTL_WITH_REDIST cmake option (set to OFF by default) gh-1893
- Fixed incorrect result (issue gh-1901) in tensor.cumulative_sum and in advanced indexing gh-1902
- Fixed __setitem__() for tensor.usm_ndarray when passed an empty boolean mask gh-1915
- tensor.from_dlpack docstring now shows that return type can be NumPy array and stipulates when this will be the case gh-1919
- Fixed docstring in helper class in DLPack tests gh-1920
- Fixed a bug in tensor.astype where copy=False would not be respected for 1d arrays when order keyword is specified gh-1928
- Replaced deprecated CL/sycl.hpp with recommended sycl/sycl.hpp in examples gh-1933
- Fixed tensor.take_along_axis and tensor.put_along_axis raising an error for tensor.uint64 indices when given an array of dimension greater than 1 gh-1934
- Fixed unexpected results of tensor.sum with a requested output type of bool gh-1958
- Use std::move to avoid unnecessary copying of temporary in triul_ctor.cpp gh-1960
- Make stream a keyword-only argument in tensor.usm_ndarray.to_device per requirement by array API specification gh-1966
- Improve efficiency of copy implementation and avoid an unnecessary kernel invocation in tensor.argsort for 1d input gh-1967
- Corrected uses of NumPy constructors with tensor.usm_ndarray inputs in test suite gh-1968
- Fixed array API namespace inspection utilities showing complex128 as a valid dtype on devices without double precision and device keywords not working with dpctl.SyclQueue or filter strings gh-1979
- Fixed a bug in test_sycl_device_interface.cpp which would cause compilation to fail with Clang version 20.0 gh-1989
- Fixed memory leaks in smart-pointer-managed USM temporaries in synchronizing kernel calls gh-2002
- UsmNDArray_MakeSimpleFromPtr and UsmNDArray_MakeFromPtr now raise an error when provided an invalid typenum before attempting to create the array gh-2003
- Fixed typos in tensor.from_numpy and tensor.astype gh-2006
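The roll entry in the list above concerns very large shift values. A circular shift depends only on the shift modulo the axis length, which the pure-Python sketch below makes explicit for a 1-D sequence. It illustrates the expected semantics only and does not describe the actual change made in gh-1869.

```python
def roll(seq, shift):
    """Circularly shift a 1-D sequence by `shift` positions to the right."""
    n = len(seq)
    if n == 0:
        return list(seq)
    # Python's % returns a value in [0, n) even for negative shifts.
    s = shift % n
    return list(seq[n - s:]) + list(seq[:n - s])


x = [0, 1, 2, 3, 4]
print(roll(x, 2))
print(roll(x, 10**30 + 2))
```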
Maintenance
- Revert pinning of cmake to 3.26 on Windows gh-1823
- Update black version used in Python code style workflow gh-1828
- Fixed CI/CD workflow for building conda packages on Windows gh-1831
- Revert work-around in test_sycl_kernel_submit.py for problem in MKL 2024.2.0 gh-1836
- Do not use Mambaforge variant of miniforge as deprecated gh-1844
- Use pybind11=2.13.6 gh-1845
- Remove unnecessary include in C++ header file gh-1846
- Build translation unit "simplify_iteration_space.cpp" compiled multiple times as a static library gh-1847
- Add instructions for installing dpctl from Intel PyPi channel gh-1860
- Fix warnings when generating docs gh-1855, gh-1861
- Align conda recipe with conda-forge's {{ stdlib("c") }} migration gh-1868
- Add missing include of SYCL header to "math_utils.hpp" gh-1899
- Add support of CV-qualifiers in is_complex<T> helper gh-1900
- Tuning work for elementwise functions with modest performance gains (under 10%) gh-1889
- Reduce binary ...
v0.18.3
v0.18.2
This is a bug-fix release, see https://github.com/IntelPython/dpctl/milestone/15.
It backports fixes for
- tensor.result_type behavior for scalars (see gh-1874) and
- errors when using dpctl in virtual environment on Linux (gh-1892).
Changes from PR gh-1899 were also backported.
v0.18.1
This is an incremental release in which only the installation instructions in the README were updated to reflect the change in location of the index with Python packages built by Intel(R) relative to the 0.18.0 release.
v0.18.0
This release reaches an important milestone of making offloading fully asynchronous.
Calls to dpctl.tensor submit tasks for execution to the DPC++ runtime and return without waiting for execution of these tasks to finish.
The sequential semantics a user expects from execution of a Python script are nonetheless preserved.
The full list of changes that went into this release is:
Added
- Implement tensor.take_along_axis per Python Array API specification gh-1778
- Implement tensor.put_along_axis to complement tensor.take_along_axis gh-1798
- Support for 'device=tensor.kDLCPU' in tensor.from_dlpack function and tensor.usm_ndarray.__dlpack__ method gh-1781
- Support DLPack on Windows gh-1746
- Implement tensor.nextafter function per Python Array API specification gh-1730
- Implement tensor.count_nonzero and tensor.diff functions from Python array API specification gh-1732, gh-1780
- Add support for order="K" to *_like array creation functions, and change default order keyword value from 'C' to 'K' gh-1808
- Support for 'max dimensions' in Array API capabilities info data gh-1774
- Add support for device aspect 'emulated' gh-1691
- dpctl::tensor::usm_memory class defined in dpctl4pybind11.hpp adds constructor to create Python USM memory objects viewing into existing USM allocations, which can be made by an external library gh-1782
- Add support for COVERAGE build type in project's CMake script gh-1692
Changed
- Change ownership of USM allocation by dpctl.memory objects, make executions of dpctl.tensor operations asynchronous gh-1705
- Add support for Python scalars by tensor.where function gh-1719
- Optimize division by Python scalar in statistical functions tensor.mean, tensor.std, tensor.var gh-1820
- Use transcendental functions from sycl namespace instead of std namespace gh-1707
- Changes for compatibility with recent NumPy in runtime environment gh-1735, gh-1772, gh-1804
- Array creation function tensor.zeros to use asynchronous memset operation gh-1806
- The setter of tensor.usm_ndarray.shape property now supports Python scalar value gh-1786
- Use 'pyproject.toml' instead of 'setup.py' aligning with current packaging best practices gh-1660
- No longer set SOVERSION property in DPCTLSyclInterface library on Linux gh-1773
- Update version of 'pybind11' used gh-1758, gh-1812
- Handle possible exceptions by usm_host_allocator used with std::vector gh-1791
- Use dpctl::tensor::offset_utils::sycl_free_noexcept instead of sycl::free in host_task tasks associated with life-time management of temporary USM allocations gh-1797
- Add "same_kind"-style casting for in-place mathematical operators of tensor.usm_ndarray gh-1827, gh-1830
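The asynchronous execution introduced in this release (calls return before the work finishes, yet results appear in program order) can be pictured with an analogy in plain Python. The sketch below uses a single-worker thread pool as a stand-in for an in-order queue; it is an analogy only, since dpctl relies on SYCL queues and events rather than Python threads, and the submit helper is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# A single worker plays the role of an in-order queue: submitted tasks
# run one after another, in submission order.
queue = ThreadPoolExecutor(max_workers=1)


def submit(fn, *deps):
    """Schedule fn on the results of earlier tasks and return at once."""
    return queue.submit(lambda: fn(*(d.result() for d in deps)))


a = submit(lambda: [1, 2, 3])
b = submit(lambda v: [2 * t for t in v], a)  # depends on a
c = submit(lambda v: sum(v), b)              # depends on b

# Nothing above blocks the caller; waiting happens only when a value is needed.
print(c.result())
queue.shutdown()
```

The caller only pays for synchronization at the point where a concrete value is requested, which mirrors how sequential semantics can be kept while submission stays non-blocking.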
Fixed
- Fix setting of release variable in Sphinx config file gh-1685
- Handle possible NULL return value from device aspect queries DPCTLDevice_GetMaxWorkGroupSize1d and DPCTLDevice_GetMaxWorkGroupSize2d gh-1690
- Add license header to conda script files gh-1695
- Fix tensor.round behavior on CUDA devices gh-1700
- Add missing #include <sstream> gh-1701
- Fix for issue 1724 gh-1728
- Correct USM type for return array of tensor.extract function gh-1727
- Fix for tensor.unique_all and tensor.unique_inverse to always return index arrays with default indexing data type gh-1741
- Propagate read-only flag from __sycl_usm_array_interface__ in tensor.asarray function gh-1756
- tensor.clip to handle Python scalars which are out of bound for the data type of integral array gh-1759
- Avoid dead-locking by releasing GIL around blocking operations in libtensor gh-1753
- Element-wise tensor.divide and comparison operations allow greater range of Python integer and integer array combinations gh-1771
- Fix for unexpected behavior when using floating point types for array indexing gh-1792
- Enable pytest --pyargs dpctl.tests gh-1833
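The clip entry in the list above deals with Python scalar bounds that do not fit the integer data type of the array. One plausible way to handle this, sketched below for an 8-bit signed type, is to saturate the bounds to the representable range before clipping. This is an illustration under that assumption; the approach taken in gh-1759 may differ, and clip_int8 is a hypothetical helper.

```python
INT8_MIN, INT8_MAX = -128, 127


def clip_int8(values, lo, hi):
    """Clip int8-range values to [lo, hi], where lo and hi are any Python ints."""
    # Saturate the bounds first, so a bound such as 1000 is never cast
    # to int8, where it would wrap around to an unrelated value.
    lo = max(INT8_MIN, min(lo, INT8_MAX))
    hi = max(INT8_MIN, min(hi, INT8_MAX))
    return [min(max(v, lo), hi) for v in values]


print(clip_int8([-100, 0, 100], -1000, 50))
```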
Maintenance
- Improve performance of test_sort_complex_fp_nan gh-1704
- Improve exception wording raised by tensor.broadcast_arrays() gh-1720
- Remove template keyword in method call of sycl::kernel_bundle gh-1726
- Backport changelog edits from maintenance/0.17.x gh-1736
- Replace uses of 'intel' channels in docs and readme file gh-1737
- Update references to deprecated environment variable SYCL_DEVICE_FILTER gh-1740
- Correction for installation instruction steps gh-1754
- Fix for crash during testing with open source SYCL bundle by updating CPU RT library used gh-1762
- Add missing include to fix build break with newer LLVM gh-1776
- Add #include <utility> for definition of std::move used gh-1787
- Change to CMake script to accommodate DPC++ transition from PI to UR architecture gh-1788
- Document tensor._flags.Flags class gh-1794
- Fix for unreferenced unreleased bug in copy-and-cast code logic gh-1799
- Explicitly include headers used in C++ translation units implementing reduction operations gh-1802
- Clean-up uses of Strided1DIndexer class gh-1805
- Tweak to readability of C++ code implementing matrix-matrix multiplication gh-1810
- Do not add sycl::event associated with compute task to vector of events representing execution of host_task gh-1807
- Remove 'level-zero' conda package from run-time dependencies of 'dpctl' since Intel GPU driver stack now explicitly depends on libze1 package which provides Level-Zero loader library gh-1801, gh-1840
- Use dedicated type-support matrices for in-place element-wise binary operations gh-1816
- Remove recommendation to install wheels from Anaconda PyPI index gh-1819
- Removed use of post-link and pre-unlink conda scripts in dpctl gh-1821
- Pin compiler used to build 0.18.0 version to 2025.0.0 gh-1822
- A variety of changes to continuous integration/delivery (CI/CD) supporting scripts to keep CI running smoothly:
gh-1686, gh-1688, gh-1697, gh-1698, gh-1703, gh-1702, gh-1709, gh-1712, gh-1713, gh-1722, gh-1725, gh-1729, gh-1733, [gh-1721](https...
v0.17.0
This release features an updated documentation web page (https://intelpython.github.io/dpctl/latest/index.html), adds cumulative reductions, and complies with revision 2023.12 of the Python Array API specification.
Added
- Added pybind11 caster for sycl::half to map to/from Python float to "dpctl4pybind11.hpp" header: gh-1655
- Added support for DLPack data interchange per Python Array API 2023.12 specification: gh-1667
- Implemented tensor.cumulative_sum, tensor.cumulative_prod and tensor.cumulative_logsumexp: gh-1602
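Of the cumulative reductions listed above, cumulative_logsumexp is the least self-explanatory: element i of the result is the logarithm of the sum of exponentials of the first i + 1 inputs. The pure-Python sketch below computes it for a 1-D sequence of finite floats using the usual max-shift trick for numerical stability. It illustrates the mathematical definition and is not dpctl's kernel.

```python
import math


def cumulative_logsumexp(xs):
    """Running log(sum(exp(x))) over a 1-D sequence, computed stably."""
    out = []
    acc = -math.inf  # log(0), the identity of the empty sum
    for x in xs:
        m = max(acc, x)
        if m == -math.inf:
            acc = -math.inf
        else:
            # log(e^acc + e^x) = m + log(e^(acc - m) + e^(x - m))
            acc = m + math.log(math.exp(acc - m) + math.exp(x - m))
        out.append(acc)
    return out


print(cumulative_logsumexp([0.0, 0.0, 0.0]))
```

Subtracting the running maximum before exponentiating keeps large inputs such as 1000.0 from overflowing.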
Changed
- Expanded documentation for dpctl: gh-1619
- Expanded utils.intel_device_info functionality: gh-1656
- Improved performance of elementwise operations: gh-1651
- Efficiency improvement by avoiding unnecessary copying of sycl::queue: gh-1645
- dpctl uses pybind11 2.12.0: gh-1640
- Improved performance of tensor.reshape operation with order="F" when copying is needed, or requested: gh-1677
Fixed
- Fixed initialization of byte type constants in dpctl_capi Python/C API loader class in "dpctl4pybind11.hpp": gh-1665
- Fixed crash in tensor.sort reported for a CPU device and a CUDA device: gh-1676
- Fixed race condition in accumulation kernel for custom operations that caused test failures with AMD CPUs: gh-1624
- Fixed comparison operators for mixed signed and unsigned integral types: gh-1650
- Support use of index arrays of different integral types in indexing operations: gh-47
- Fixed source code to compile for NVidia(TM) GPUs with DPC++ 2024.1: gh-1630
- Corrected tensor.tile for scalar inputs and empty repetitions: gh-1628
- Fixed support for out keyword in tensor.matmul: gh-1610
- Fixed bug in basic slicing of empty arrays: gh-1680
- Fixed bug in tensor.bitwise_invert for boolean input array: gh-1681
- Fixed bug in tensor.repeat on zero-size input arrays: gh-1682
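Comparisons between signed and unsigned integers, mentioned in the list above, are a well-known source of errors when the signed operand is converted to the unsigned type first. The sketch below simulates 64-bit behaviour with Python integers to show the general pitfall and a sign-aware alternative. The release notes do not describe the specific bug, so this is an illustration of the class of problem only; both function names are hypothetical.

```python
UINT64_MOD = 1 << 64


def naive_less(i64, u64):
    """Compare after converting the signed value to uint64 (the pitfall)."""
    return (i64 % UINT64_MOD) < u64


def safe_less(i64, u64):
    """Sign-aware comparison: a negative value is below every unsigned one."""
    return i64 < 0 or i64 < u64


print(naive_less(-1, 1), safe_less(-1, 1))
```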
New Contributors
- @bdmoore1 made their first contribution in #1659
- @ekomarova made their first contribution in #1666
Full Changelog: https://github.com/IntelPython/dpctl/blob/master/CHANGELOG.md
v0.16.1
This release includes bug fixes and provides a change needed by the numba_dpex project to support dispatching kernels consuming instances of the sycl::local_accessor template type.
Changed
- Changed behavior of dpctl.tensor.usm_ndarray.__dlpack_device__ method to return device id of the parent unpartitioned device if array is allocated on a sub-device instead of raising an exception: #1604
- Array creation functions and the usm_ndarray constructor in dpctl.tensor submodule now use cached default-selected device to improve performance: #1606
- Changed treatment of axis keyword for dpctl.tensor.tensordot and dpctl.tensor.vecdot to align with Python Array API 2023.12 specification: #1608
- Changed implementation of DPCTLQueue_SubmitRange, DPCTLQueue_SubmitNDRange in DPCTLSyclInterface library to support sycl::local_accessor arguments needed by numba_dpex, and changed the enum DPCTLKernelArgType to correspond to C++ disjoint types: #1609, #1611, #1612
Fixed
- Fixed a crash on Windows platform during execution of getter of dpctl.SyclPlatform.default_context property: #1604
- Fixed kernel submission error on NVidia CUDA GPUs during dpctl.tensor.matmul operation: #1605
- Fixed corruption of context cache table entries: #1607
- Fixed incorrect result from dpctl.tensor.tensordot reported in issue #1570: #1608
- Fixed output of python -m dpctl --library to fix specified library name: #1615
v0.16.0
This release is virtually identical to 0.15.1 as far as features are concerned.
This release is meant to be built with DPC++ 2024.1.0, which no longer supports older integrated Gen9 Intel GPUs, such as those that came with Intel Core 10th generation and older.