
BUG: ArrowExtensionArray.mode(dropna=False) not respecting NAs #50986


Merged · 25 commits · Feb 18, 2023

Conversation

mroeschke
Member

@mroeschke mroeschke commented Jan 26, 2023

@mroeschke mroeschke added the Arrow pyarrow functionality label Jan 26, 2023
counts = modes.field(1)
# counts sorted descending i.e counts[0] = max
if not dropna and self._data.null_count > counts[0].as_py():
Member

i know you've wanted to keep as much of this "in pyarrow" as possible, but i find this hard to follow. would it be that bad to just implement mode in terms of value_counts?

if not len(self):[...]
vcs = self.value_counts(dropna=dropna)
res_ser = vcs[vcs == vcs.max()].sort_index()
return res_ser.index._values

Member Author

Locally, dispatching the mode implementation to value_counts results in a noticeable slowdown, so I would prefer to keep using the pc.mode implementation:

In [9]: %timeit arr._mode()  # current pc.mode-based implementation
147 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [4]: %timeit arr._mode()  # value_counts-based implementation
1.13 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Member

makes sense, thanks for taking a look

Member

Seeing this comment here, I find it surprising that there is such a big difference between the two. Especially because you are doing a count_distinct to include all unique values in the result of mode, you are essentially making it equivalent to a value_counts. So if there is a big difference, that seems to indicate a performance issue in the mode implementation in Arrow.

Now, trying with the following, I see different results:

In [8]: arr = pa.array(np.random.randint(0, 1000, 1_000_000))

In [12]: %timeit pc.mode(arr, pc.count_distinct(arr).as_py())
12.7 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit pc.value_counts(arr)
8.96 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(it might depend a lot on the exact characteristics of the data, though, number of uniques vs total number of values, etc)

Member Author

This was a quick benchmark from replacing the implementation to use ArrowExtensionArray.value_counts, not just comparing pc.mode vs pc.value_counts per se:

% ipython
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:53:40)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd, pyarrow as pa

In [2]: data = list(range(100_000)) + [None] * 100_000

In [3]: arr = pd.arrays.ArrowExtensionArray(pa.array(data))

In [4]: %timeit arr._mode()
15.9 ms ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

% git diff
+        vcs = self.value_counts(dropna=dropna)
+        res_ser = vcs[vcs == vcs.max()].sort_index()
+        return res_ser.index._values
+        # pa_type = self._data.type
+        # if pa.types.is_temporal(pa_type):
+        #     nbits = pa_type.bit_width
+        #     if nbits == 32:
+        #         data = self._data.cast(pa.int32())
+        #     elif nbits == 64:
+        #         data = self._data.cast(pa.int64())
+        #     else:
+        #         raise NotImplementedError(pa_type)
+        # else:
+        #     data = self._data
+        #
+        # modes = pc.mode(data, pc.count_distinct(data).as_py())
+        # counts = modes.field(1)
+        # # counts sorted descending i.e counts[0] = max
+        # if not dropna and self._data.null_count > counts[0].as_py():
+        #     return type(self)(pa.array([None], type=pa_type))
+        # mask = pc.equal(counts, counts[0])
+        # most_common = modes.field(0).filter(mask)
+        #
+        # if pa.types.is_temporal(pa_type):
+        #     most_common = most_common.cast(pa_type)
+        #
+        # if not dropna and self._data.null_count == counts[0].as_py():
+        #     most_common = pa.concat_arrays(
+        #         [most_common, pa.array([None], type=pa_type)]
+        #     )
+        #
+        # return type(self)(most_common)

% ipython
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:53:40)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd, pyarrow as pa

In [2]: data = list(range(100_000)) + [None] * 100_000

In [3]: arr = pd.arrays.ArrowExtensionArray(pa.array(data))

In [4]: %timeit arr._mode()
54.7 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Member

Then I suppose the overhead is coming from the extra work that ArrowExtensionArray.value_counts is doing?

Doing it with pyarrow compute directly here in _mode might be faster and simpler (compared to the current _mode)? Something like:

res = pc.value_counts(self._data)
most_common = res.field("values").filter(pc.equal(res.field("counts"), pc.max(res.field("counts"))))

(this is still faster than calling pc.mode on your example data)

Member Author

Nice! Looks like pc.value_counts also has some benefits

  1. Works for string and binary types
  2. The result maintains the order of the original values.

So it's good to switch to using value_counts here

@jbrockmendel
Member

looks like this breaks the existing mode tests

@mroeschke mroeschke added this to the 2.0 milestone Feb 1, 2023
@jbrockmendel
Member

Couple comments, neither deal-breakers. Otherwise LGTM

@jbrockmendel
Member

@jorisvandenbossche can you take a look? this calls a bunch of pyarrow stuff directly

Comment on lines 1347 to 1348
if not dropna and self._data.null_count > counts[0].as_py():
return type(self)(pa.array([None], type=pa_type))
Member

I just wanted to comment on the Arrow issue that you could do something like the above as a workaround, but you are already doing that ;)
Personally, I think it's a fine workaround on the pandas side (certainly on the short term). It should also be basically as performant compared to when pyarrow would do this in mode itself, since null_count is cheap (and cached).

ids=["multi_mode", "single_mode"],
)
def test_mode_dropna_true(data_for_grouping, take_idx, exp_idx, request):
pa_dtype = data_for_grouping.dtype.pyarrow_dtype
if pa.types.is_string(pa_dtype) or pa.types.is_binary(pa_dtype):
Member

nice!

mask = pc.equal(counts, counts[0])
most_common = values.filter(mask)
if dropna:
data = data.drop_null()
Member

I don't know if you checked, but it might be more efficient to do this after the value_counts, so on res (assuming that res is a much shorter array, and so cheaper to filter)

Member Author

In [1]: import pandas as pd, pyarrow as pa

In [2]: data = list(range(100_000)) + [None] * 100_000

In [3]:  arr = pd.arrays.ArrowExtensionArray(pa.array(data))

In [4]: %timeit arr._mode()
7.01 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <-- drop_null before
6.93 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <-- drop_null after

So might as well do this after as you suggested

Member Author

Looks like filtering after gives an incorrect result for the multi-mode tests. If filtering were to occur after, I would have to drop the NAs in values and then filter the counts where the NAs were in values. To keep things simpler for now, I'll leave this filtering before the value_counts.

Member

If filtering were to occur after, I would have to drop the NAs in values and then filter the counts where the NA were in values

That would be something like:

if dropna:
    res = res.filter(res.field("values").is_valid())

to drop values based on one field of the struct, before calculating most_common.

So that line is a bit more complicated than calling drop_null, but only slightly. Now, it also doesn't seem to matter that much ;)

Member Author

Ah thanks. This passes the tests but appears slower than dropping the NAs beforehand for this example, so I think we should just keep dropping them beforehand for now.

In [1]: import pandas as pd, pyarrow as pa

In [2]: data = list(range(100_000)) + [None] * 100_000

In [3]: arr = pd.arrays.ArrowExtensionArray(pa.array(data))

In [4]: %timeit arr._mode()
6.72 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- drop_null before
7.24 ms ± 87 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- filter, is_valid

@mroeschke
Member Author

Greenish

Member

@jbrockmendel jbrockmendel left a comment

LGTM thanks for taking this on

@mroeschke mroeschke merged commit 3fd020c into pandas-dev:main Feb 18, 2023
@mroeschke mroeschke deleted the bug/arrow/mode_nas branch February 18, 2023 01:00
Successfully merging this pull request may close these issues.

BUG: .mode with pyarrow dtypes ignores dropna