-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
BUG: ArrowExtensionArray.mode(dropna=False) not respecting NAs #50986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
596d47d
6b28efe
d8c16ba
ad82e2f
ca5ece9
ad888bb
7997994
a009e7e
22d4ed4
1c89b6d
c791152
fa1a345
8fe50b7
b97db29
9815179
ae97a15
7acd4a5
0f2eea1
1a680a8
e8e6672
7a5aff1
4ef9a2f
5e5ed68
b3959f8
30918fd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1269,7 +1269,6 @@ def _mode(self: ArrowExtensionArrayT, dropna: bool = True) -> ArrowExtensionArra | |
---------- | ||
dropna : bool, default True | ||
Don't consider counts of NA values. | ||
Not implemented by pyarrow. | ||
|
||
Returns | ||
------- | ||
|
@@ -1292,15 +1291,21 @@ def _mode(self: ArrowExtensionArrayT, dropna: bool = True) -> ArrowExtensionArra | |
data = self._data | ||
|
||
modes = pc.mode(data, pc.count_distinct(data).as_py()) | ||
values = modes.field(0) | ||
counts = modes.field(1) | ||
# counts sorted descending i.e counts[0] = max | ||
if not dropna and self._data.null_count > counts[0].as_py(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i know you've wanted to keep as much of this "in pyarrow" as possible, but i find this hard to follow. would it be that bad to just implement mode in terms of value_counts?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Locally with dispatching the mode implementation to value counts results in a noticeable slowdown, so I would prefer to keep using the
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. makes sense, thanks for taking a look There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seeing this comment here, I find it surprising that there is such a big difference between both. Especially because you are doing a count_distinct to include all unique values in the result of Now, trying with the following, I see different results:
(it might depend a lot on the exact characteristics of the data, though, number of uniques vs total number of values, etc) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This was a quick benchmark replacing the implementation to use
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Then I suppose the overhead is coming from the extra work the Doing it with pyarrow compute directly here in
(this is still faster than calling There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice! Looks like
So it's good to switch to using value_counts here |
||
return type(self)(pa.array([None], type=pa_type)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just wanted to comment this on the arrow issue that you could do something like the above as a workaround, but so you are already doing that;) |
||
mask = pc.equal(counts, counts[0]) | ||
most_common = values.filter(mask) | ||
most_common = modes.field(0).filter(mask) | ||
|
||
if pa.types.is_temporal(pa_type): | ||
most_common = most_common.cast(pa_type) | ||
|
||
if not dropna and self._data.null_count == counts[0].as_py(): | ||
most_common = pa.concat_arrays( | ||
[most_common, pa.array([None], type=pa_type)] | ||
) | ||
|
||
return type(self)(most_common) | ||
|
||
def _maybe_convert_setitem_value(self, value): | ||
|
Uh oh!
There was an error while loading. Please reload this page.