Skip to content

Commit 0403a7c

Browse files
authored
Complete dpnp.cov implementation (#2303)
The PR proposes to rework implementation of `dpnp.cov` and to add support of all missing keywords, replace TODO with already supported functionality and to remove any fallback on numpy call. Also this PR update documentation and adds extra tests to improve the coverage.
1 parent 0bda96a commit 0403a7c

File tree

6 files changed

+477
-102
lines changed

6 files changed

+477
-102
lines changed

dpnp/dpnp_iface_statistics.py

Lines changed: 153 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -54,7 +54,7 @@
5454
to_supported_dtypes,
5555
)
5656

57-
from .dpnp_utils import call_origin, get_usm_allocations
57+
from .dpnp_utils import get_usm_allocations
5858
from .dpnp_utils.dpnp_utils_reduction import dpnp_wrap_reduction_call
5959
from .dpnp_utils.dpnp_utils_statistics import dpnp_cov, dpnp_median
6060

@@ -748,61 +748,175 @@ def cov(
748748
749749
For full documentation refer to :obj:`numpy.cov`.
750750
751+
Parameters
752+
----------
753+
m : {dpnp.ndarray, usm_ndarray}
754+
A 1-D or 2-D array containing multiple variables and observations.
755+
Each row of `m` represents a variable, and each column a single
756+
observation of all those variables. Also see `rowvar` below.
757+
y : {None, dpnp.ndarray, usm_ndarray}, optional
758+
An additional set of variables and observations. `y` has the same form
759+
as that of `m`.
760+
761+
Default: ``None``.
762+
rowvar : bool, optional
763+
If `rowvar` is ``True``, then each row represents a variable, with
764+
observations in the columns. Otherwise, the relationship is transposed:
765+
each column represents a variable, while the rows contain observations.
766+
767+
Default: ``True``.
768+
bias : bool, optional
769+
Default normalization is by ``(N - 1)``, where ``N`` is the number of
770+
observations given (unbiased estimate). If `bias` is ``True``, then
771+
normalization is by ``N``. These values can be overridden by using the
772+
keyword `ddof`.
773+
774+
Default: ``False``.
775+
ddof : {None, int}, optional
776+
If not ``None`` the default value implied by `bias` is overridden. Note
777+
that ``ddof=1`` will return the unbiased estimate, even if both
778+
`fweights` and `aweights` are specified, and ``ddof=0`` will return the
779+
simple average. See the notes for the details.
780+
781+
Default: ``None``.
782+
fweights : {None, dpnp.ndarray, usm_ndarray}, optional
783+
1-D array of integer frequency weights; the number of times each
784+
observation vector should be repeated.
785+
It is required that ``fweights >= 0``. However, the function will not
786+
raise an error when ``fweights < 0`` for performance reasons.
787+
788+
Default: ``None``.
789+
aweights : {None, dpnp.ndarray, usm_ndarray}, optional
790+
1-D array of observation vector weights. These relative weights are
791+
typically large for observations considered "important" and smaller for
792+
observations considered less "important". If ``ddof=0`` the array of
793+
weights can be used to assign probabilities to observation vectors.
794+
It is required that ``aweights >= 0``. However, the function will not
795+
error when ``aweights < 0`` for performance reasons.
796+
797+
Default: ``None``.
798+
dtype : {None, str, dtype object}, optional
799+
Data-type of the result. By default, the return data-type will have
800+
at least floating point type based on the capabilities of the device on
801+
which the input arrays reside.
802+
803+
Default: ``None``.
804+
751805
Returns
752806
-------
753807
out : dpnp.ndarray
754808
The covariance matrix of the variables.
755809
756-
Limitations
757-
-----------
758-
Input array ``m`` is supported as :obj:`dpnp.ndarray`.
759-
Dimension of input array ``m`` is limited by ``m.ndim <= 2``.
760-
Size and shape of input arrays are supported to be equal.
761-
Parameter `y` is supported only with default value ``None``.
762-
Parameter `bias` is supported only with default value ``False``.
763-
Parameter `ddof` is supported only with default value ``None``.
764-
Parameter `fweights` is supported only with default value ``None``.
765-
Parameter `aweights` is supported only with default value ``None``.
766-
Otherwise the function will be executed sequentially on CPU.
767-
Input array data types are limited by supported DPNP :ref:`Data types`.
768-
769810
See Also
770811
--------
771-
:obj:`dpnp.corrcoef` : Normalized covariance matrix
812+
:obj:`dpnp.corrcoef` : Normalized covariance matrix.
813+
814+
Notes
815+
-----
816+
Assume that the observations are in the columns of the observation array `m`
817+
and let ``f = fweights`` and ``a = aweights`` for brevity. The steps to
818+
compute the weighted covariance are as follows::
819+
820+
>>> import dpnp as np
821+
>>> m = np.arange(10, dtype=np.float32)
822+
>>> f = np.arange(10) * 2
823+
>>> a = np.arange(10) ** 2.0
824+
>>> ddof = 1
825+
>>> w = f * a
826+
>>> v1 = np.sum(w)
827+
>>> v2 = np.sum(w * a)
828+
>>> m -= np.sum(m * w, axis=None, keepdims=True) / v1
829+
>>> cov = np.dot(m * w, m.T) * v1 / (v1**2 - ddof * v2)
830+
831+
Note that when ``a == 1``, the normalization factor
832+
``v1 / (v1**2 - ddof * v2)`` goes over to ``1 / (np.sum(f) - ddof)``
833+
as it should.
772834
773835
Examples
774836
--------
775837
>>> import dpnp as np
776838
>>> x = np.array([[0, 2], [1, 1], [2, 0]]).T
777-
>>> x.shape
778-
(2, 3)
779-
>>> [i for i in x]
780-
[0, 1, 2, 2, 1, 0]
781-
>>> out = np.cov(x)
782-
>>> out.shape
783-
(2, 2)
784-
>>> [i for i in out]
785-
[1.0, -1.0, -1.0, 1.0]
839+
840+
Consider two variables, :math:`x_0` and :math:`x_1`, which correlate
841+
perfectly, but in opposite directions:
842+
843+
>>> x
844+
array([[0, 1, 2],
845+
[2, 1, 0]])
846+
847+
Note how :math:`x_0` increases while :math:`x_1` decreases. The covariance
848+
matrix shows this clearly:
849+
850+
>>> np.cov(x)
851+
array([[ 1., -1.],
852+
[-1., 1.]])
853+
854+
Note that element :math:`C_{0,1}`, which shows the correlation between
855+
:math:`x_0` and :math:`x_1`, is negative.
856+
857+
Further, note how `x` and `y` are combined:
858+
859+
>>> x = np.array([-2.1, -1, 4.3])
860+
>>> y = np.array([3, 1.1, 0.12])
861+
>>> X = np.stack((x, y), axis=0)
862+
>>> np.cov(X)
863+
array([[11.71 , -4.286 ], # may vary
864+
[-4.286 , 2.14413333]])
865+
>>> np.cov(x, y)
866+
array([[11.71 , -4.286 ], # may vary
867+
[-4.286 , 2.14413333]])
868+
>>> np.cov(x)
869+
array(11.71)
786870
787871
"""
788872

789-
if not dpnp.is_supported_array_type(m):
790-
pass
791-
elif m.ndim > 2:
792-
pass
793-
elif bias:
794-
pass
795-
elif ddof is not None:
796-
pass
797-
elif fweights is not None:
798-
pass
799-
elif aweights is not None:
800-
pass
873+
arrays = [m]
874+
if y is not None:
875+
arrays.append(y)
876+
dpnp.check_supported_arrays_type(*arrays)
877+
878+
if m.ndim > 2:
879+
raise ValueError("m has more than 2 dimensions")
880+
881+
if y is not None:
882+
if y.ndim > 2:
883+
raise ValueError("y has more than 2 dimensions")
884+
885+
if ddof is not None:
886+
if not isinstance(ddof, int):
887+
raise ValueError("ddof must be integer")
801888
else:
802-
return dpnp_cov(m, y=y, rowvar=rowvar, dtype=dtype)
889+
ddof = 0 if bias else 1
890+
891+
def_float = dpnp.default_float_type(m.sycl_queue)
892+
if dtype is None:
893+
dtype = dpnp.result_type(*arrays, def_float)
894+
895+
if fweights is not None:
896+
dpnp.check_supported_arrays_type(fweights)
897+
if not dpnp.issubdtype(fweights.dtype, numpy.integer):
898+
raise TypeError("fweights must be integer")
899+
900+
if fweights.ndim > 1:
901+
raise ValueError("cannot handle multidimensional fweights")
902+
903+
fweights = dpnp.astype(fweights, dtype=def_float)
904+
905+
if aweights is not None:
906+
dpnp.check_supported_arrays_type(aweights)
907+
if aweights.ndim > 1:
908+
raise ValueError("cannot handle multidimensional aweights")
909+
910+
aweights = dpnp.astype(aweights, dtype=def_float)
803911

804-
return call_origin(
805-
numpy.cov, m, y, rowvar, bias, ddof, fweights, aweights, dtype=dtype
912+
return dpnp_cov(
913+
m,
914+
y=y,
915+
rowvar=rowvar,
916+
ddof=ddof,
917+
dtype=dtype,
918+
fweights=fweights,
919+
aweights=aweights,
806920
)
807921

808922

dpnp/dpnp_utils/dpnp_utils_statistics.py

Lines changed: 59 additions & 49 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,6 @@
3232

3333
import dpnp
3434
from dpnp.dpnp_array import dpnp_array
35-
from dpnp.dpnp_utils import get_usm_allocations, map_dtype_to_device
3635

3736
__all__ = ["dpnp_cov", "dpnp_median"]
3837

@@ -119,66 +118,77 @@ def _flatten_array_along_axes(a, axes_to_flatten, overwrite_input):
119118
return a_flatten, overwrite_input
120119

121120

122-
def dpnp_cov(m, y=None, rowvar=True, dtype=None):
121+
def dpnp_cov(
122+
m, y=None, rowvar=True, ddof=1, dtype=None, fweights=None, aweights=None
123+
):
123124
"""
124-
dpnp_cov(m, y=None, rowvar=True, dtype=None)
125-
126125
Estimate a covariance matrix based on passed data.
127-
No support for given weights is provided now.
128126
129-
The implementation is done through existing dpnp and dpctl methods
130-
instead of separate function call of dpnp backend.
127+
The implementation is done through existing dpnp functions.
131128
132129
"""
133130

134-
def _get_2dmin_array(x, dtype):
135-
"""
136-
Transform an input array to a form required for building a covariance matrix.
137-
138-
If applicable, it reshapes the input array to have 2 dimensions or greater.
139-
If applicable, it transposes the input array when 'rowvar' is False.
140-
It casts to another dtype, if the input array differs from requested one.
141-
142-
"""
143-
if x.ndim == 0:
144-
x = x.reshape((1, 1))
145-
elif x.ndim == 1:
146-
x = x[dpnp.newaxis, :]
147-
148-
if not rowvar and x.ndim != 1:
149-
x = x.T
150-
151-
if x.dtype != dtype:
152-
x = dpnp.astype(x, dtype)
153-
return x
131+
# need to create a copy of input, since it will be modified in-place
132+
x = dpnp.array(m, ndmin=2, dtype=dtype)
133+
if not rowvar and m.ndim != 1:
134+
x = x.T
154135

155-
# input arrays must follow CFD paradigm
156-
_, queue = get_usm_allocations((m,) if y is None else (m, y))
136+
if x.shape[0] == 0:
137+
return dpnp.empty_like(
138+
x, shape=(0, 0), dtype=dpnp.default_float_type(m.sycl_queue)
139+
)
157140

158-
# calculate a type of result array if not passed explicitly
159-
if dtype is None:
160-
dtypes = [m.dtype, dpnp.default_float_type(sycl_queue=queue)]
161-
if y is not None:
162-
dtypes.append(y.dtype)
163-
dtype = dpnp.result_type(*dtypes)
164-
# TODO: remove when dpctl.result_type() is returned dtype based on fp64
165-
dtype = map_dtype_to_device(dtype, queue.sycl_device)
166-
167-
X = _get_2dmin_array(m, dtype)
168141
if y is not None:
169-
y = _get_2dmin_array(y, dtype)
170-
171-
X = dpnp.concatenate((X, y), axis=0)
172-
173-
avg = X.mean(axis=1)
142+
y_ndim = y.ndim
143+
y = dpnp.array(y, copy=None, ndmin=2, dtype=dtype)
144+
if not rowvar and y_ndim != 1:
145+
y = y.T
146+
x = dpnp.concatenate((x, y), axis=0)
147+
148+
# get the product of frequencies and weights
149+
w = None
150+
if fweights is not None:
151+
if fweights.shape[0] != x.shape[1]:
152+
raise ValueError("incompatible numbers of samples and fweights")
153+
154+
w = fweights
155+
156+
if aweights is not None:
157+
if aweights.shape[0] != x.shape[1]:
158+
raise ValueError("incompatible numbers of samples and aweights")
159+
160+
if w is None:
161+
w = aweights
162+
else:
163+
w *= aweights
164+
165+
avg, w_sum = dpnp.average(x, axis=1, weights=w, returned=True)
166+
w_sum = w_sum[0]
167+
168+
# determine the normalization
169+
if w is None:
170+
fact = x.shape[1] - ddof
171+
elif ddof == 0:
172+
fact = w_sum
173+
elif aweights is None:
174+
fact = w_sum - ddof
175+
else:
176+
fact = w_sum - ddof * dpnp.sum(w * aweights) / w_sum
174177

175-
fact = X.shape[1] - 1
176-
X -= avg[:, None]
178+
if fact <= 0:
179+
warnings.warn(
180+
"Degrees of freedom <= 0 for slice", RuntimeWarning, stacklevel=2
181+
)
182+
fact = 0.0
177183

178-
c = dpnp.dot(X, X.T.conj())
179-
c *= 1 / fact if fact != 0 else dpnp.nan
184+
x -= avg[:, None]
185+
if w is None:
186+
x_t = x.T
187+
else:
188+
x_t = (x * w).T
180189

181-
return dpnp.squeeze(c)
190+
c = dpnp.dot(x, x_t.conj()) / fact
191+
return c.squeeze()
182192

183193

184194
def dpnp_median(

0 commit comments

Comments
 (0)