ENH: Implement more str accessor methods for ArrowDtype #52614
pandas/core/arrays/arrow/array.py
Outdated
None if val.as_py() is None else val.as_py().partition(sep)
for val in chunk
]
for chunk in self._pa_array.iterchunks()
it looks like this will maintain the chunking structure. is there a reason not to chain these together and end up with a single chunk?
I saw this related issue about ops not maintaining the underlying chunking structure and thought it best to try to keep it here: #42357
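To make the two options in this exchange concrete, here is a minimal pure-Python sketch. It is an illustration only: plain nested lists stand in for the chunks that the PR reads from `self._pa_array.iterchunks()`, and the function names `partition_keep_chunks` and `partition_single_chunk` are hypothetical, not part of pandas or pyarrow.

```python
from itertools import chain


def partition_keep_chunks(chunks, sep):
    """Apply str.partition per element, keeping one output list per input chunk."""
    return [
        [None if val is None else val.partition(sep) for val in chunk]
        for chunk in chunks
    ]


def partition_single_chunk(chunks, sep):
    """Same operation, but chain all chunks together into one flat list."""
    return [
        None if val is None else val.partition(sep)
        for val in chain.from_iterable(chunks)
    ]


# Nested lists are an assumed stand-in for pyarrow chunks.
chunks = [["a_b", None], ["c_d"]]
kept = partition_keep_chunks(chunks, "_")
flat = partition_single_chunk(chunks, "_")
```

Both variants produce the same values and propagate missing entries as `None`; they differ only in whether the result keeps the chunk boundaries of the input.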
pandas/core/arrays/arrow/array.py
Outdated
for val in chunk
]
for chunk in self._pa_array.iterchunks()
]
is it worth making a helper for this pattern so this can just be
def _str_index(...):
    predicate = lambda x: x.index(sub, start, end)
    return self._helper_whatever(predicate)
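One possible shape for such a helper is sketched below in pure Python. This is an assumption-laden illustration, not the helper that was merged: nested lists stand in for pyarrow chunks, and `apply_per_chunk` and `str_index` are hypothetical names chosen for this example.

```python
def apply_per_chunk(chunks, predicate):
    """Apply predicate to every non-missing element, preserving chunk structure."""
    return [
        [None if val is None else predicate(val) for val in chunk]
        for chunk in chunks
    ]


def str_index(chunks, sub, start=0, end=None):
    """Element-wise str.index built on the shared helper (hypothetical name)."""
    return apply_per_chunk(chunks, lambda x: x.index(sub, start, end))


# Nested lists are an assumed stand-in for pyarrow chunks.
result = str_index([["abc", None], ["bcd"]], "b")
```

With the null handling and the nested loop factored out, each string method reduces to the one-line predicate suggested in the comment above.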
ser = pd.Series(["abc", None], dtype=ArrowDtype(pa.string()))
with pytest.raises(
NotImplementedError, match=f"str.{method} not supported with pd.ArrowDtype"
NotImplementedError, match="str.extract not supported with pd.ArrowDtype"
out of curiosity why is this one left out?
str.extract was just a lot trickier to implement. I can try revisiting it in a followup
In general, it's often best to avoid iterating over pyarrow arrays (wrapping each array element in a pyarrow Scalar, and then converting each individually to a python object with
Wrapping in and converting the scalar has quite some overhead, while the conversion to numpy is better optimized and happens in one go at a lower level.
Going to merge this in, I can address any followups if needed.
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement.
…s for ArrowDtype
…2614)
* Add more str arrow functions
* Finish functions
* finish methods and add tests
* Finish implementing
* Fix >3.8 compat
* Create helper function
Most of these don't have corresponding pyarrow compute methods