Skip to content

BUG: ArrowExtensionArray.mode(dropna=False) not respecting NAs #50986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Feb 18, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
596d47d
BUG: ArrowExtensionArray.mode(dropna=False) not respecting NAs
mroeschke Jan 26, 2023
6b28efe
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Jan 28, 2023
d8c16ba
Fix tests
mroeschke Jan 28, 2023
ad82e2f
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 3, 2023
ca5ece9
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 8, 2023
ad888bb
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 9, 2023
7997994
remove parameterization
mroeschke Feb 9, 2023
a009e7e
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 9, 2023
22d4ed4
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 10, 2023
1c89b6d
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 11, 2023
c791152
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 13, 2023
fa1a345
Fix whatsnew
mroeschke Feb 13, 2023
8fe50b7
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 15, 2023
b97db29
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 15, 2023
9815179
Use value_counts
mroeschke Feb 15, 2023
ae97a15
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 16, 2023
7acd4a5
Remove unneeded xfail
mroeschke Feb 16, 2023
0f2eea1
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 16, 2023
1a680a8
Dropna after
mroeschke Feb 16, 2023
e8e6672
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 17, 2023
7a5aff1
Revert "Dropna after"
mroeschke Feb 17, 2023
4ef9a2f
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 17, 2023
5e5ed68
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 17, 2023
b3959f8
Remove unused request
mroeschke Feb 17, 2023
30918fd
Merge remote-tracking branch 'upstream/main' into bug/arrow/mode_nas
mroeschke Feb 17, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1229,6 +1229,7 @@ Numeric
- Bug in DataFrame reduction methods (e.g. :meth:`DataFrame.sum`) with object dtype, ``axis=1`` and ``numeric_only=False`` would not be coerced to float (:issue:`49551`)
- Bug in :meth:`DataFrame.sem` and :meth:`Series.sem` where an erroneous ``TypeError`` would always raise when using data backed by an :class:`ArrowDtype` (:issue:`49759`)
- Bug in :meth:`Series.__add__` casting to object for list and masked :class:`Series` (:issue:`22962`)
- Bug in :meth:`~arrays.ArrowExtensionArray.mode` where ``dropna=False`` was not respected when there was ``NA`` values (:issue:`50982`)
- Bug in :meth:`DataFrame.query` with ``engine="numexpr"`` and column names are ``min`` or ``max`` would raise a ``TypeError`` (:issue:`50937`)

Conversion
Expand Down
14 changes: 7 additions & 7 deletions pandas/core/arrays/arrow/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -1345,7 +1345,6 @@ def _mode(self: ArrowExtensionArrayT, dropna: bool = True) -> ArrowExtensionArra
----------
dropna : bool, default True
Don't consider counts of NA values.
Not implemented by pyarrow.

Returns
-------
Expand All @@ -1364,12 +1363,13 @@ def _mode(self: ArrowExtensionArrayT, dropna: bool = True) -> ArrowExtensionArra
else:
data = self._data

modes = pc.mode(data, pc.count_distinct(data).as_py())
values = modes.field(0)
counts = modes.field(1)
# counts sorted descending i.e counts[0] = max
mask = pc.equal(counts, counts[0])
most_common = values.filter(mask)
if dropna:
data = data.drop_null()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if you checked, but it might be more efficient to do this after the value_counts, so on res (assuming that res is a much shorter array, and so cheaper to filter)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [1]: import pyarrow as pa

In [2]: data = list(range(100_000)) + [None] * 100_000

In [3]:  arr = pd.arrays.ArrowExtensionArray(pa.array(data))

In [4]: %timeit arr._mode()
7.01 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <-- drop_null before
6.93 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <-- drop_null after

So might as well do this after as you suggested

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like filtering after gives incorrect result for multi-mode tests. If filtering were to occur after, I would have to drop the NAs in values and then filter the counts where the NA were in values. To keep things simpler for now I'll give leave this to filter before the value_counts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If filtering were to occur after, I would have to drop the NAs in values and then filter the counts where the NA were in values

That would be something like:

if dropna:
    res = res.filter(res.field("values").is_valid())

to drop values based on one field of the struct, before calculating most_common.

So that line is a bit more complicated as calling drop_null, but only slightly. Now, it also doesn't seem to matter that much ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks. This passes the tests but appears slower than dropping the NAs beforehand for this example, so I think we should just drop the NAs beforehand for now.

In [1]: import pyarrow as pa

In [2]: data = list(range(100_000)) + [None] * 100_000

In [3]: arr = pd.arrays.ArrowExtensionArray(pa.array(data))

In [4]: %timeit arr._mode()
6.72 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- drop_nulls before
7.24 ms ± 87 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- fllter, is_valild


res = pc.value_counts(data)
most_common = res.field("values").filter(
pc.equal(res.field("counts"), pc.max(res.field("counts")))
)

if pa.types.is_temporal(pa_type):
most_common = most_common.cast(pa_type)
Expand Down
37 changes: 15 additions & 22 deletions pandas/tests/extension/test_arrow.py
Original file line number Diff line number Diff line change
Expand Up @@ -1339,38 +1339,31 @@ def test_quantile(data, interpolation, quantile, request):
tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("dropna", [True, False])
@pytest.mark.parametrize(
"take_idx, exp_idx",
[[[0, 0, 2, 2, 4, 4], [4, 0]], [[0, 0, 0, 2, 4, 4], [0]]],
[[[0, 0, 2, 2, 4, 4], [0, 4]], [[0, 0, 0, 2, 4, 4], [0]]],
ids=["multi_mode", "single_mode"],
)
def test_mode(data_for_grouping, dropna, take_idx, exp_idx, request):
pa_dtype = data_for_grouping.dtype.pyarrow_dtype
if pa.types.is_string(pa_dtype) or pa.types.is_binary(pa_dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

request.node.add_marker(
pytest.mark.xfail(
raises=pa.ArrowNotImplementedError,
reason=f"mode not supported by pyarrow for {pa_dtype}",
)
)
elif (
pa.types.is_boolean(pa_dtype)
and "multi_mode" in request.node.nodeid
and pa_version_under9p0
):
request.node.add_marker(
pytest.mark.xfail(
reason="https://issues.apache.org/jira/browse/ARROW-17096",
)
)
def test_mode_dropna_true(data_for_grouping, take_idx, exp_idx):
data = data_for_grouping.take(take_idx)
ser = pd.Series(data)
result = ser.mode(dropna=dropna)
result = ser.mode(dropna=True)
expected = pd.Series(data_for_grouping.take(exp_idx))
tm.assert_series_equal(result, expected)


def test_mode_dropna_false_mode_na(data):
# GH 50982
more_nans = pd.Series([None, None, data[0]], dtype=data.dtype)
result = more_nans.mode(dropna=False)
expected = pd.Series([None], dtype=data.dtype)
tm.assert_series_equal(result, expected)

expected = pd.Series([None, data[0]], dtype=data.dtype)
result = expected.mode(dropna=False)
tm.assert_series_equal(result, expected)


def test_is_bool_dtype():
# GH 22667
data = ArrowExtensionArray(pa.array([True, False, True]))
Expand Down