BUG: documented usage of `str.split(...).str.get` fails on dtype `large_string[pyarrow]` #61431

SandroCasagrande · 2025-05-12T15:38:24Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.Series(["abc"], dtype="large_string[pyarrow]").str.split("b").str[0]


Traceback (most recent call last):
  File "<python-input-7>", line 1, in <module>
    a = pd.Series(["abc"], dtype="large_string[pyarrow]").str.split("b").str[0]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/generic.py", line 6127, in __getattr__
    return object.__getattribute__(self, name)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/accessor.py", line 228, in __get__
    return self._accessor(obj)
           ~~~~~~~~~~~~~~^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/strings/accessor.py", line 208, in __init__
    self._inferred_dtype = self._validate(data)
                           ~~~~~~~~~~~~~~^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/strings/accessor.py", line 262, in _validate
    raise AttributeError(
        f"Can only use .str accessor with string values, not {inferred_dtype}"
    )
AttributeError: Can only use .str accessor with string values, not unknown-array. Did you mean: 'std'?

Issue Description

The return dtype of split is very different when acting on large_string (results in pyarrow list) and string (results in object).

Interestingly, the list accessor works only on the large_string dtype:

>>> pd.Series(["abc"], dtype="large_string[pyarrow]").str.split("b").list[0]
0    a
dtype: large_string[pyarrow]

but not on the string dtype:

>>> pd.Series(["abc"], dtype="string[pyarrow]").str.split("b").list[0]
Traceback (most recent call last):
  File "<python-input-15>", line 1, in <module>
    pd.Series(["abc"], dtype="string[pyarrow]").str.split("b").list[0]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/generic.py", line 6127, in __getattr__
    return object.__getattribute__(self, name)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/accessor.py", line 228, in __get__
    return self._accessor(obj)
           ~~~~~~~~~~~~~~^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/arrays/arrow/accessors.py", line 73, in __init__
    super().__init__(
    ~~~~~~~~~~~~~~~~^
        data,
        ^^^^^
        validation_msg="Can only use the '.list' accessor with "
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        "'list[pyarrow]' dtype, not {dtype}.",
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/arrays/arrow/accessors.py", line 41, in __init__
    self._validate(data)
    ~~~~~~~~~~~~~~^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/arrays/arrow/accessors.py", line 51, in _validate
    raise AttributeError(self._validation_msg.format(dtype=dtype))
AttributeError: Can only use the '.list' accessor with 'list[pyarrow]' dtype, not object.. Did you mean: 'hist'?

From a user perspective this is unfortunate, as I have to know the underlying dtype in order to choose the correct accessor (or cast).
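
A minimal sketch of the kind of dtype-aware workaround this currently forces on users. The helper name `first_split_item` is hypothetical (not part of pandas), and it assumes that a `str.split` result with an `ArrowDtype` is always a pyarrow list that the `.list` accessor accepts, as in the output above:

```python
import pandas as pd


def first_split_item(s: pd.Series, sep: str) -> pd.Series:
    """Return the first piece of each split string (hypothetical helper)."""
    parts = s.str.split(sep)
    # Assumption: an ArrowDtype result of str.split is a pyarrow list,
    # so it needs the .list accessor instead of .str.
    if isinstance(parts.dtype, pd.ArrowDtype):
        return parts.list[0]
    # Object-dtype results (Python lists per row) go through .str as documented.
    return parts.str[0]
```

With pyarrow installed, the same call should then work for both `large_string[pyarrow]` and `string[pyarrow]` input, although the result dtype still differs between the two.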

Expected Behavior

Should work similarly to

>>> pd.Series(["abc"], dtype="string[pyarrow]").str.split("b").str[0]
0    a
dtype: object

since it is documented behavior in pandas/doc/source/user_guide/text.rst (line 229 at commit f496acf):

s2.str.split("_").str[1]

(dtype is debatable).

Installed Versions

INSTALLED VERSIONS

commit : f496acf
python : 3.13.2
python-bits : 64
OS : Darwin
OS-release : 24.4.0
Version : Darwin Kernel Version 24.4.0: Fri Apr 11 18:33:47 PDT 2025; root:xnu-11417.101.15~117/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+2100.gf496acffcc
numpy : 2.2.5
dateutil : 2.9.0.post0
pip : 25.1
Cython : 3.0.11
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : 20.0.0
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None


rhshadrach · 2025-05-18T10:59:39Z

Thanks for the report! Agreed on the inconsistency here.

print(pd.Series(["abc"], dtype="large_string[pyarrow]").str.split("b"))
# 0    ['a' 'c']
# dtype: list<item: large_string>[pyarrow]
print(pd.Series(["abc"], dtype="string[pyarrow]").str.split("b"))
# 0    [a, c]
# dtype: object

The behavior on string[pyarrow] was introduced in http://github.com/pandas-dev/pandas/pull/40708. cc @simonjayhawkins @jorisvandenbossche

While the current behavior of returning ArrowExtensionArray list dtype on large_string[pyarrow] seems preferable to object dtype in isolation, one benefit of returning object dtype on string[pyarrow] is that it does smooth the transition from object strings to PyArrow strings. But if we were to decide one day we do in fact want ArrowExtensionArray, this is a hard behavior to deprecate.

cc @WillAyd @mroeschke for any thoughts as well.

WillAyd · 2025-05-19T22:48:50Z

This is a general issue that I was hoping the logical type system proposal would clarify, as it gets pretty tough to cherry pick different code paths for different data types.

I think the best solution would return a list data type as a result of this operation. It is more in line with the intent of the user code, and more performant.

rhshadrach · 2025-05-20T02:38:25Z

> I think the best solution would return a list data type as a result of this operation.

@WillAyd - which operation?

WillAyd · 2025-05-20T03:33:46Z

str.split

rhshadrach · 2025-05-20T21:12:50Z

On both string[pyarrow] and large_string[pyarrow]? Certainly not object dtype I assume, nor Python-backed strings.

lilnecati · 2025-05-24T06:12:49Z

👍

simonjayhawkins · 2025-05-27T11:59:58Z

Thanks @SandroCasagrande for the report. I completely understand the confusion around pandas dtypes and why one could expect the behavior to be different or even lead one to expect consistency here.

Let's start by introducing a quirk of pandas and then expanding on that.

There's a dtype in pandas core called ArrowDtype. It is an experimental ExtensionDtype covering all PyArrow data types, so one can easily create a Series backed by, say, an Arrow string array.

import pandas as pd
import pyarrow as pa

pd.Series(["abc"], dtype=pd.ArrowDtype(pa.string()))
# 0    abc
# dtype: string[pyarrow]

We see this gives dtype: string[pyarrow]. This is the dtype's string alias, which is also accepted as input to the dtype parameter of the Series constructor. So let's do that instead.

pd.Series(["abc"], dtype="string[pyarrow]")
# 0    abc
# dtype: string

Oh. The string alias of the dtype is now just string! So let's do that instead.

pd.Series(["abc"], dtype="string")
# 0    abc
# dtype: string

The quirk is that all these Series are different! The last one is not even backed by PyArrow!

So what's going on?

pd.Series(["abc"], dtype=pd.ArrowDtype(pa.string())).dtype  # string[pyarrow]

type(_)  # pandas.core.dtypes.dtypes.ArrowDtype

pd.Series(["abc"], dtype="string[pyarrow]").dtype  # string[pyarrow]

type(_)  # pandas.core.arrays.string_.StringDtype

pd.Series(["abc"], dtype="string").dtype  # string[python]

type(_)  # pandas.core.arrays.string_.StringDtype

Basically, there's overlap in the dtype string aliases for ArrowDtype and StringDtype.

ArrowDtype is an experimental dtype and, being an extension array, follows the EA API. But there is no restriction on the return types of this EA, and hence no requirement that it follow the documented usage of the pandas dtypes. (Being an EA, it could have been shipped separately, and personally I don't know why this experimental EA was included in pandas core in the first place.)

So the basic problem here is that dtype="large_string[pyarrow]" and dtype="string[pyarrow]" are significantly different dtypes, associated with different extension array types: one that is experimental and always returns Arrow dtypes, and another that conforms to the documented pandas API.
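
As a rough sketch of how to tell the two families apart without relying on the overlapping string aliases, one can check the class of the dtype object. The helper name `dtype_family` is hypothetical and only illustrates the `isinstance` checks:

```python
import pandas as pd


def dtype_family(s: pd.Series) -> str:
    """Name the dtype class behind a Series (hypothetical helper)."""
    if isinstance(s.dtype, pd.ArrowDtype):
        # Experimental Arrow-backed extension dtype.
        return "ArrowDtype"
    if isinstance(s.dtype, pd.StringDtype):
        # pandas string dtype, whatever its storage.
        return "StringDtype"
    # Anything else: fall back to the class name of the dtype object.
    return type(s.dtype).__name__
```

Per the outputs above, with pyarrow installed a "large_string[pyarrow]" Series would fall in the first branch and a "string[pyarrow]" Series in the second.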

Hopefully this background will help the discussion in determining if this is indeed a bug and whether there should be consistency here.

jorisvandenbossche · 2025-05-28T17:58:34Z

> This is a general issue that I was hoping the logical type system proposal would clarify, as it gets pretty tough to cherry pick different code paths for different data types.

Indeed ..

> I think the best solution would return a list data type as a result of this operation. It is more in line with the intent of the user code, and more performant.

We should, eventually, indeed return a list data type, once we have a dedicated list data type. But again my position is that we should only do this for the default dtypes once we have a default list dtype. And so until we have a better logical dtype system, I think the default behaviour for the default string dtype for str.split() being object dtype is "correct".

(If the default string dtype, which uses NaN as its missing value indicator, returned an ArrowDtype(list) type, that would introduce NA-variants of dtypes into existing workflows of people who did not opt in to using pyarrow-NA-dtypes.)

simonjayhawkins · 2025-06-03T11:41:16Z

> We should, eventually, indeed return a list data type, once we have a dedicated list data type. But again my position is that we should only do this for the default dtypes once we have a default list dtype.

And to be clear, we have approval to implement this in PDEP-10. So no blockers here.

> (If the default string dtype, which uses NaN as its missing value indicator, returned an ArrowDtype(list) type, that would introduce NA-variants of dtypes into existing workflows of people who did not opt in to using pyarrow-NA-dtypes.)

Just like PDEP-14 introduced a numpy semantics nan-variant, we also require a numpy semantics variant of the nested dtype. (this perhaps requires a PDEP to mirror PDEP-14 but specific for the nested dtypes)

SandroCasagrande added the Bug and Needs Triage labels on May 12, 2025.

BryteLite mentioned this issue on May 18, 2025, in "BUG: Compiler Flag Drift May Affect Pandas ABI Stability via Memory Assumptions" #61452 (closed).

rhshadrach added the Strings and Needs Discussion labels and removed the Needs Triage label on May 18, 2025.

ClauPet mentioned this issue on May 30, 2025, in "BUG: Passing string[pyarrow] to the dtype parameter of e.g. csv_read() does produce a string type Series" #61496 (open).
