BUG: ArrowExtensionArray.mode(dropna=False) not respecting NAs #50986

Merged: 25 commits, merged Feb 18, 2023.
Commits (the changes below are shown from 1 commit):
596d47d  BUG: ArrowExtensionArray.mode(dropna=False) not respecting NAs (mroeschke, Jan 26, 2023)
6b28efe  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Jan 28, 2023)
d8c16ba  Fix tests (mroeschke, Jan 28, 2023)
ad82e2f  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 3, 2023)
ca5ece9  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 8, 2023)
ad888bb  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 9, 2023)
7997994  remove parameterization (mroeschke, Feb 9, 2023)
a009e7e  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 9, 2023)
22d4ed4  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 10, 2023)
1c89b6d  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 11, 2023)
c791152  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 13, 2023)
fa1a345  Fix whatsnew (mroeschke, Feb 13, 2023)
8fe50b7  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 15, 2023)
b97db29  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 15, 2023)
9815179  Use value_counts (mroeschke, Feb 15, 2023)
ae97a15  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 16, 2023)
7acd4a5  Remove unneeded xfail (mroeschke, Feb 16, 2023)
0f2eea1  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 16, 2023)
1a680a8  Dropna after (mroeschke, Feb 16, 2023)
e8e6672  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 17, 2023)
7a5aff1  Revert "Dropna after" (mroeschke, Feb 17, 2023)
4ef9a2f  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 17, 2023)
5e5ed68  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 17, 2023)
b3959f8  Remove unused request (mroeschke, Feb 17, 2023)
30918fd  Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas (mroeschke, Feb 17, 2023)
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -1004,6 +1004,7 @@ Numeric
- Bug in DataFrame reduction methods (e.g. :meth:`DataFrame.sum`) with object dtype, ``axis=1`` and ``numeric_only=False`` would not be coerced to float (:issue:`49551`)
- Bug in :meth:`DataFrame.sem` and :meth:`Series.sem` where an erroneous ``TypeError`` would always raise when using data backed by an :class:`ArrowDtype` (:issue:`49759`)
- Bug in :meth:`Series.__add__` casting to object for list and masked :class:`Series` (:issue:`22962`)
- Bug in :meth:`~arrays.ArrowExtensionArray.mode` where ``dropna=False`` was not respected when there were ``NA`` values (:issue:`50982`)

Conversion
^^^^^^^^^^
11 changes: 8 additions & 3 deletions pandas/core/arrays/arrow/array.py
@@ -1269,7 +1269,6 @@ def _mode(self: ArrowExtensionArrayT, dropna: bool = True) -> ArrowExtensionArrayT:
----------
dropna : bool, default True
Don't consider counts of NA values.
Not implemented by pyarrow.

Returns
-------
@@ -1292,15 +1291,21 @@ def _mode(self: ArrowExtensionArrayT, dropna: bool = True) -> ArrowExtensionArrayT:
data = self._data

modes = pc.mode(data, pc.count_distinct(data).as_py())
values = modes.field(0)
counts = modes.field(1)
# counts sorted descending i.e counts[0] = max
if not dropna and self._data.null_count > counts[0].as_py():
Member:

I know you've wanted to keep as much of this "in pyarrow" as possible, but I find this hard to follow. Would it be that bad to just implement mode in terms of value_counts?

if not len(self):
    [...]
vcs = self.value_counts(dropna=dropna)
res_ser = vcs[vcs == vcs.max()].sort_index()
return res_ser.index._values

Member Author:

Locally, dispatching the mode implementation to value_counts results in a noticeable slowdown, so I would prefer to keep using the pc.mode implementation:

In [9]: %timeit arr._mode()   # current pc.mode-based implementation
147 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [4]: %timeit arr._mode()   # with _mode dispatched to value_counts
1.13 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Member:

makes sense, thanks for taking a look

Member:

Seeing this comment here, I find it surprising that there is such a big difference between the two, especially because you are doing a count_distinct to include all unique values in the result of mode, essentially making it equivalent to a value_counts. So if there is a big difference, that seems to indicate some performance issue in the mode implementation in Arrow.

Now, trying with the following, I see different results:

In [8]: arr = pa.array(np.random.randint(0, 1000, 1_000_000))

In [12]: %timeit pc.mode(arr, pc.count_distinct(arr).as_py())
12.7 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit pc.value_counts(arr)
8.96 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(It might depend a lot on the exact characteristics of the data, though: number of uniques vs. total number of values, etc.)

Member Author:

This was a quick benchmark replacing the implementation with one that uses ArrowExtensionArray.value_counts, not just a comparison of pc.mode vs pc.value_counts per se:

% ipython
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:53:40)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow as pa

In [2]: data = list(range(100_000)) + [None] * 100_000

In [3]: arr = pd.arrays.ArrowExtensionArray(pa.array(data))

In [4]: %timeit arr._mode()
15.9 ms ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

% git diff
+        vcs = self.value_counts(dropna=dropna)
+        res_ser = vcs[vcs == vcs.max()].sort_index()
+        return res_ser.index._values
+        # pa_type = self._data.type
+        # if pa.types.is_temporal(pa_type):
+        #     nbits = pa_type.bit_width
+        #     if nbits == 32:
+        #         data = self._data.cast(pa.int32())
+        #     elif nbits == 64:
+        #         data = self._data.cast(pa.int64())
+        #     else:
+        #         raise NotImplementedError(pa_type)
+        # else:
+        #     data = self._data
+        #
+        # modes = pc.mode(data, pc.count_distinct(data).as_py())
+        # counts = modes.field(1)
+        # # counts sorted descending i.e counts[0] = max
+        # if not dropna and self._data.null_count > counts[0].as_py():
+        #     return type(self)(pa.array([None], type=pa_type))
+        # mask = pc.equal(counts, counts[0])
+        # most_common = modes.field(0).filter(mask)
+        #
+        # if pa.types.is_temporal(pa_type):
+        #     most_common = most_common.cast(pa_type)
+        #
+        # if not dropna and self._data.null_count == counts[0].as_py():
+        #     most_common = pa.concat_arrays(
+        #         [most_common, pa.array([None], type=pa_type)]
+        #     )
+        #
+        # return type(self)(most_common)

% ipython
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:53:40)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow as pa

In [2]: data = list(range(100_000)) + [None] * 100_000

In [3]: arr = pd.arrays.ArrowExtensionArray(pa.array(data))

In [4]: %timeit arr._mode()
54.7 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Member:

Then I suppose the overhead is coming from the extra work the ArrowExtensionArray.value_counts is doing?

Doing it with pyarrow compute directly here in _mode might be faster and simpler (compared to current _mode)? Something like:

res = pc.value_counts(self._data)
most_common = res.field("values").filter(
    pc.equal(res.field("counts"), pc.max(res.field("counts")))
)

(this is still faster than calling pc.mode on your example data)

Member Author:

Nice! Looks like pc.value_counts also has some benefits:

  1. It works for string and binary types.
  2. The result maintains the order of the original values.

So it's good to switch to using value_counts here.
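
Putting the pieces of this thread together, here is a minimal standalone sketch of the value_counts-based approach being discussed (an editor's illustration, not the merged pandas implementation). mode_via_value_counts is a hypothetical helper name, and the sketch assumes pyarrow is installed and that pc.value_counts reports nulls as a regular entry in its result:

import pyarrow as pa
import pyarrow.compute as pc

def mode_via_value_counts(arr: pa.Array, dropna: bool = True) -> pa.Array:
    # Compute the mode(s) through pc.value_counts; with dropna=False, nulls
    # compete with (and can tie or beat) every other value.
    data = arr.drop_null() if dropna else arr
    vc = pc.value_counts(data)               # StructArray with "values" and "counts" fields
    counts = vc.field("counts")
    mask = pc.equal(counts, pc.max(counts))  # keep every value tied for the highest count
    return vc.field("values").filter(mask)

print(mode_via_value_counts(pa.array([1, 1, None, None, None]), dropna=False))  # [null]
print(mode_via_value_counts(pa.array([1, None]), dropna=False))                 # [1, null]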

return type(self)(pa.array([None], type=pa_type))
Member:

I just wanted to comment on the Arrow issue that you could do something like the above as a workaround, but you are already doing that ;)
Personally, I think it's a fine workaround on the pandas side (certainly in the short term). It should also be basically as performant as if pyarrow did this in mode itself, since null_count is cheap (and cached).
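
As a concrete illustration of that check (an editor's sketch with made-up data, using only the pc.mode and count_distinct calls already present in the diff): when the nulls outnumber the most common non-null value, the overall mode with dropna=False is NA alone.

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 1, 2, None, None, None])            # NA appears 3 times, 1 appears twice
modes = pc.mode(arr, pc.count_distinct(arr).as_py())   # pc.mode skips nulls
counts = modes.field(1)                                 # counts of non-null values, descending
print(arr.null_count, counts[0].as_py())                # 3 2 -> null_count exceeds the top count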

mask = pc.equal(counts, counts[0])
most_common = values.filter(mask)
most_common = modes.field(0).filter(mask)

if pa.types.is_temporal(pa_type):
most_common = most_common.cast(pa_type)

if not dropna and self._data.null_count == counts[0].as_py():
most_common = pa.concat_arrays(
[most_common, pa.array([None], type=pa_type)]
)

return type(self)(most_common)

def _maybe_convert_setitem_value(self, value):
21 changes: 21 additions & 0 deletions pandas/tests/extension/test_arrow.py
@@ -1385,6 +1385,27 @@ def test_mode(data_for_grouping, dropna, take_idx, exp_idx, request):
tm.assert_series_equal(result, expected)


def test_mode_dropna_false_mode_na(data, request):
# GH 50982
pa_dtype = data.dtype.pyarrow_dtype
if pa.types.is_string(pa_dtype) or pa.types.is_binary(pa_dtype):
request.node.add_marker(
pytest.mark.xfail(
raises=pa.ArrowNotImplementedError,
reason=f"mode not supported by pyarrow for {pa_dtype}",
)
)
more_nans = pd.Series([None, None, data[0]], dtype=data.dtype)
result = more_nans.mode(dropna=False)
expected = pd.Series([None], dtype=data.dtype)
tm.assert_series_equal(result, expected)

equal_nans = pd.Series([None, data[0]], dtype=data.dtype)
result = equal_nans.mode(dropna=False)
expected = pd.Series([data[0], None], dtype=data.dtype)
tm.assert_series_equal(result, expected)


def test_is_bool_dtype():
# GH 22667
data = ArrowExtensionArray(pa.array([True, False, True]))
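
For reference, a minimal user-facing reproduction mirroring the new test above (an editor's sketch assuming pandas 2.0 with pyarrow installed; the expected results follow the test's expectations):

import pandas as pd
import pyarrow as pa

more_nans = pd.Series([None, None, 1], dtype=pd.ArrowDtype(pa.int64()))
print(more_nans.mode(dropna=False))   # expected: a single <NA>, since NA is the most frequent value

equal_nans = pd.Series([None, 1], dtype=pd.ArrowDtype(pa.int64()))
print(equal_nans.mode(dropna=False))  # expected: 1 and <NA>, which tie for the mode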