BUG: documented usage of str.split(...).str.get fails on dtype large_string[pyarrow] #61431

Open
3 tasks done
SandroCasagrande opened this issue May 12, 2025 · 9 comments
Labels
Bug Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data

Comments

@SandroCasagrande
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.Series(["abc"], dtype="large_string[pyarrow]").str.split("b").str[0]


Traceback (most recent call last):
  File "<python-input-7>", line 1, in <module>
    a = pd.Series(["abc"], dtype="large_string[pyarrow]").str.split("b").str[0]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/generic.py", line 6127, in __getattr__
    return object.__getattribute__(self, name)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/accessor.py", line 228, in __get__
    return self._accessor(obj)
           ~~~~~~~~~~~~~~^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/strings/accessor.py", line 208, in __init__
    self._inferred_dtype = self._validate(data)
                           ~~~~~~~~~~~~~~^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/strings/accessor.py", line 262, in _validate
    raise AttributeError(
        f"Can only use .str accessor with string values, not {inferred_dtype}"
    )
AttributeError: Can only use .str accessor with string values, not unknown-array. Did you mean: 'std'?

Issue Description

The return dtype of split is very different when acting on large_string (results in pyarrow list) and string (results in object).

Interestingly, using the list accessor works only on large_string dtype

>>> pd.Series(["abc"], dtype="large_string[pyarrow]").str.split("b").list[0]
0    a
dtype: large_string[pyarrow]

but not on string dtype

>>> pd.Series(["abc"], dtype="string[pyarrow]").str.split("b").list[0]
Traceback (most recent call last):
  File "<python-input-15>", line 1, in <module>
    pd.Series(["abc"], dtype="string[pyarrow]").str.split("b").list[0]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/generic.py", line 6127, in __getattr__
    return object.__getattribute__(self, name)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/accessor.py", line 228, in __get__
    return self._accessor(obj)
           ~~~~~~~~~~~~~~^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/arrays/arrow/accessors.py", line 73, in __init__
    super().__init__(
    ~~~~~~~~~~~~~~~~^
        data,
        ^^^^^
        validation_msg="Can only use the '.list' accessor with "
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        "'list[pyarrow]' dtype, not {dtype}.",
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/arrays/arrow/accessors.py", line 41, in __init__
    self._validate(data)
    ~~~~~~~~~~~~~~^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/pandas-main-string-test/lib/python3.13/site-packages/pandas/core/arrays/arrow/accessors.py", line 51, in _validate
    raise AttributeError(self._validation_msg.format(dtype=dtype))
AttributeError: Can only use the '.list' accessor with 'list[pyarrow]' dtype, not object.. Did you mean: 'hist'?

From a user perspective this is unfortunate, as I have to know the underlying dtype in order to choose the correct accessor (or cast).

Expected Behavior

This should work similarly to

>>> pd.Series(["abc"], dtype="string[pyarrow]").str.split("b").str[0]
0    a
dtype: object

since this usage is documented:

s2.str.split("_").str[1]

(The resulting dtype is debatable.)

Installed Versions

INSTALLED VERSIONS

commit : f496acf
python : 3.13.2
python-bits : 64
OS : Darwin
OS-release : 24.4.0
Version : Darwin Kernel Version 24.4.0: Fri Apr 11 18:33:47 PDT 2025; root:xnu-11417.101.15~117/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+2100.gf496acffcc
numpy : 2.2.5
dateutil : 2.9.0.post0
pip : 25.1
Cython : 3.0.11
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : 20.0.0
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None

@SandroCasagrande SandroCasagrande added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 12, 2025
@rhshadrach
Member

rhshadrach commented May 18, 2025

Thanks for the report! Agreed on the inconsistency here.

print(pd.Series(["abc"], dtype="large_string[pyarrow]").str.split("b"))
# 0    ['a' 'c']
# dtype: list<item: large_string>[pyarrow]
print(pd.Series(["abc"], dtype="string[pyarrow]").str.split("b"))
# 0    [a, c]
# dtype: object

The behavior on string[pyarrow] was introduced in http://github.com/pandas-dev/pandas/pull/40708. cc @simonjayhawkins @jorisvandenbossche

While the current behavior of returning ArrowExtensionArray list dtype on large_string[pyarrow] seems preferable to object dtype in isolation, one benefit of returning object dtype on string[pyarrow] is that it does smooth the transition from object strings to PyArrow strings. But if we were to decide one day we do in fact want ArrowExtensionArray, this is a hard behavior to deprecate.

cc @WillAyd @mroeschke for any thoughts as well.

@rhshadrach rhshadrach added Strings String extension data type and string data Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 18, 2025
@WillAyd
Member

WillAyd commented May 19, 2025

This is a general issue that I was hoping the logical type system proposal would clarify, as it gets pretty tough to cherry pick different code paths for different data types.

I think the best solution would return a list data type as a result of this operation. It is more in line with the intent of the user code, and more performant.

@rhshadrach
Member

I think the best solution would return a list data type as a result of this operation.

@WillAyd - which operation?

@WillAyd
Member

WillAyd commented May 20, 2025

str.split

@rhshadrach
Member

On both string[pyarrow] and large_string[pyarrow]? Certainly not object dtype I assume, nor Python-backed strings.

@lilnecati

👍

@simonjayhawkins
Member

Thanks @SandroCasagrande for the report. I completely understand the confusion around pandas dtypes and why one could expect the behavior to be different or even lead one to expect consistency here.

Let's start by introducing a quirk of pandas and then expanding on that.

There's a dtype in pandas core called ArrowDtype. This is an experimental ExtensionDtype for ALL PyArrow data types. But one can easily create a Series backed by, say, an Arrow string array.

pd.Series(["abc"], dtype=pd.ArrowDtype(pa.string()))
# 0    abc
# dtype: string[pyarrow]

We see this gives dtype: string[pyarrow]. This is the dtype string alias, which is also accepted as input to the dtype parameter of the Series constructor. So let's do that instead.

pd.Series(["abc"], dtype="string[pyarrow]")
# 0    abc
# dtype: string

Oh. The string alias of the dtype is now just string! So let's do that instead.

pd.Series(["abc"], dtype="string")
# 0    abc
# dtype: string

The quirk is that all these Series are different! The last one is not even backed by PyArrow!

So what's going on?

pd.Series(["abc"], dtype=pd.ArrowDtype(pa.string())).dtype  # string[pyarrow]

type(_)  # pandas.core.dtypes.dtypes.ArrowDtype

pd.Series(["abc"], dtype="string[pyarrow]").dtype  # string[pyarrow]

type(_)  # pandas.core.arrays.string_.StringDtype

pd.Series(["abc"], dtype="string").dtype  # string[python]

type(_)  # pandas.core.arrays.string_.StringDtype

Basically there's overlap in the dtype string aliases for the ArrowDtype and the StringDtype

ArrowDtype is an experimental dtype. Being an extension array, it follows the EA API, but there is no restriction on the return types of this EA, and hence no requirement that it follow the documented usage of the pandas dtypes. (Being an EA, it could have been shipped separately, and personally I don't know why this experimental EA was included in pandas core in the first place.)

So the basic problem here is that dtype="large_string[pyarrow]" and dtype="string[pyarrow]" are significantly different dtypes associated with different extension array types: one is experimental and always returns Arrow dtypes, while the other conforms to the documented pandas API.

Hopefully this background will help the discussion in determining if this is indeed a bug and whether there should be consistency here.

@jorisvandenbossche
Member

This is a general issue that I was hoping the logical type system proposal would clarify, as it gets pretty tough to cherry pick different code paths for different data types.

Indeed ..

I think the best solution would return a list data type as a result of this operation. It is more in line with the intent of the user code, and more performant.

We should, eventually, indeed return a list data type, once we have a dedicated list data type. But again my position is that we should only do this for the default dtypes once we have a default list dtype. And so until we have a better logical dtype system, I think the default behaviour for the default string dtype for str.split() being object dtype is "correct".

(If the default string dtype, which uses NaN as its missing value indicator, returned an ArrowDtype(list) type, that would introduce NA variants of dtypes into existing workflows of people who did not opt in to using pyarrow NA dtypes.)

@simonjayhawkins
Member

We should, eventually, indeed return a list data type, once we have a dedicated list data type. But again my position is that we should only do this for the default dtypes once we have a default list dtype.

And to be clear, we have approval to implement this in PDEP-10. So no blockers here.

(If the default string dtype, which uses NaN as its missing value indicator, returned an ArrowDtype(list) type, that would introduce NA variants of dtypes into existing workflows of people who did not opt in to using pyarrow NA dtypes.)

Just like PDEP-14 introduced a numpy semantics nan-variant, we also require a numpy semantics variant of the nested dtype. (this perhaps requires a PDEP to mirror PDEP-14 but specific for the nested dtypes)
