ENH: Implement more str accessor methods for ArrowDtype #52614

mroeschke · 2023-04-12T00:50:12Z

Most of these don't have corresponding pyarrow compute methods


jbrockmendel · 2023-04-15T01:26:15Z

pandas/core/arrays/arrow/array.py

+                        None if val.as_py() is None else val.as_py().partition(sep)
+                        for val in chunk
+                    ]
+                    for chunk in self._pa_array.iterchunks()


It looks like this will maintain the chunking structure. Is there a reason not to chain these together and end up with a single chunk?

mroeschke (reply, Apr 17, 2023): I saw this related issue about ops not maintaining the underlying chunking structure and thought it best to try to keep it here: #42357

jbrockmendel · 2023-04-15T01:27:44Z

pandas/core/arrays/arrow/array.py

+                        for val in chunk
+                    ]
+                    for chunk in self._pa_array.iterchunks()
+                ]


Is it worth making a helper for this pattern, so this can just be

    def _str_index(...):
        predicate = lambda x: x.index(sub, start, end)
        return self._helper_whatever(predicate)

jbrockmendel · 2023-04-15T01:28:40Z

pandas/tests/extension/test_arrow.py

    ser = pd.Series(["abc", None], dtype=ArrowDtype(pa.string()))
    with pytest.raises(
-        NotImplementedError, match=f"str.{method} not supported with pd.ArrowDtype"
+        NotImplementedError, match="str.extract not supported with pd.ArrowDtype"


Out of curiosity, why is this one left out?

mroeschke (reply, Apr 17, 2023): str.extract was just a lot trickier to implement. I can try revisiting it in a follow-up.

jorisvandenbossche · 2023-04-16T17:31:36Z

In general, it's best to avoid iterating over pyarrow arrays directly: wrapping each element in a pyarrow Scalar and then converting each one to a Python object with as_py() adds considerable overhead. Converting the whole array to numpy first and iterating over the numpy array is typically more efficient:

import string, random

In [23]: arr = pa.array(["".join(random.choices(string.ascii_letters, k=5)) for _ in range(1_000_000)])

In [24]: %timeit [val.as_py() for val in arr]
517 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [25]: %timeit [val for val in arr.to_numpy(zero_copy_only=False)]
80.7 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Wrapping each element in a scalar and then converting it has considerable per-element overhead, while the conversion to numpy is better optimized and happens in one go at a lower level.


mroeschke · 2023-04-21T23:41:02Z

Going to merge this in, I can address any followups if needed.

lumberbot-app · 2023-04-21T23:41:13Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.0.x
git pull

Cherry pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 fbcbdaf70da354801d99a1cebb889153a3eca481

You will likely have some merge/cherry-pick conflicts here; fix them and commit:

git commit -am 'Backport PR #52614: ENH: Implement more str accessor methods for ArrowDtype'

Push to a named branch:

git push YOURFORK 2.0.x:auto-backport-of-pr-52614-on-2.0.x

Create a PR against branch 2.0.x, I would have named this PR:

"Backport PR #52614 on branch 2.0.x (ENH: Implement more str accessor methods for ArrowDtype)"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.


mroeschke added 7 commits April 7, 2023 16:06

Add more str arrow functions

345e812

Merge remote-tracking branch 'upstream/main' into enh/str/more_arrow_string

8ee20fa

Finish functions

a992ea5

finish methods and add tests

4fb0748

Merge remote-tracking branch 'upstream/main' into enh/str/more_arrow_string

30e441e

Merge remote-tracking branch 'upstream/main' into enh/str/more_arrow_string

85f4242

Finish implementing

449d23e

mroeschke added the Strings (String extension data type and string data) and Arrow (pyarrow functionality) labels Apr 12, 2023

mroeschke added 2 commits April 12, 2023 14:56

Merge remote-tracking branch 'upstream/main' into enh/str/more_arrow_string

d3f4752

Fix >3.8 compat

555b12f

mroeschke modified the milestones: 2.1, 2.0.1 Apr 12, 2023

Merge remote-tracking branch 'upstream/main' into enh/str/more_arrow_string

eeff7ed

jbrockmendel reviewed Apr 15, 2023


mroeschke added 2 commits April 17, 2023 11:24

Merge remote-tracking branch 'upstream/main' into enh/str/more_arrow_string

9eb232d

Create helper function

c168aa4

mroeschke merged commit fbcbdaf into pandas-dev:main Apr 21, 2023

mroeschke deleted the enh/str/more_arrow_string branch April 21, 2023 23:41

lumberbot-app bot added the Still Needs Manual Backport label Apr 21, 2023

mroeschke added a commit to mroeschke/pandas that referenced this pull request Apr 21, 2023

Backport PR pandas-dev#52614: ENH: Implement more str accessor methods for ArrowDtype

aafa1cf

mroeschke removed the Still Needs Manual Backport label Apr 21, 2023

phofl pushed a commit that referenced this pull request Apr 22, 2023

Backport PR #52614: ENH: Implement more str accessor methods for ArrowDtype (#52842)

fa94d3b
