BUG: Series.str.isdigit with pyarrow dtype doesn't honor unicode superscripts #61466

GarrettWu · 2025-05-20T20:25:32Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
s = pd.Series(['23', '³', '⅕', ''], dtype=pd.StringDtype(storage="pyarrow"))
s.str.isdigit()


	0
0	True
1	False
2	False
3	False

dtype: boolean

Issue Description

Series.str.isdigit() with the pyarrow string dtype doesn't honor Unicode superscript/subscript digits, which diverges from the public docs: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.isdigit.html#pandas.Series.str.isdigit

The bug only occurs with the pyarrow string dtype; the Python string dtype behaves correctly.

Expected Behavior

import pandas as pd
s = pd.Series(['23', '³', '⅕', ''], dtype=pd.StringDtype(storage="pyarrow"))
s.str.isdigit()

	0
0	True
1	True
2	False
3	False

dtype: boolean
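
For reference, the expected values above match what the built-in str.isdigit returns for each element. A minimal stdlib-only sketch (no pandas or pyarrow involved; the variable names are just for illustration) of why '³' counts as a digit while '⅕' does not:

```python
import unicodedata

samples = ["23", "³", "⅕", ""]

# Built-in str.isdigit: True only for non-empty strings whose characters
# all carry a Unicode digit value (decimal digits, superscript digits, ...).
expected = [s.isdigit() for s in samples]
print(expected)

# '³' has a digit value (3); '⅕' only has a numeric value (0.2), so
# unicodedata.digit returns the supplied default (None) for it.
superscript_digit = unicodedata.digit("³", None)
fraction_digit = unicodedata.digit("⅕", None)
fraction_numeric = unicodedata.numeric("⅕")
print(superscript_digit, fraction_digit, fraction_numeric)
```

The empty string is False because str.isdigit requires at least one character.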

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.11.12
python-bits : 64
OS : Linux
OS-release : 6.1.123+
Version : #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.0.2
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 24.1.2
Cython : 3.0.12
sphinx : 8.2.3
IPython : 7.34.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.4
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.3.2
html5lib : 1.1
hypothesis : None
gcsfs : 2025.3.2
jinja2 : 3.1.6
lxml.etree : 5.4.0
matplotlib : 3.10.0
numba : 0.60.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : 0.28.1
psycopg2 : 2.9.10
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : 8.3.5
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.3
sqlalchemy : 2.0.40
tables : 3.10.2
tabulate : 0.9.0
xarray : 2025.3.1
xlrd : 2.0.1
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2025.2
qtpy : None
pyqt5 : None


rhshadrach · 2025-05-20T22:38:48Z

Thanks for the report, confirmed on main. Further investigations and PRs to fix are welcome!

iabhi4 · 2025-05-24T07:51:25Z

@rhshadrach The issue stems from pyarrow.compute.utf8_is_digit not recognizing non-ASCII Unicode digits (e.g., '³'). To align with str.isdigit()'s behavior and pandas docs, I propose replacing the Arrow compute call in _str_isdigit() with

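import numpy as np

def _str_isdigit(self):
    # Fall back to Python's str.isdigit, which recognizes Unicode digits such as '³'.
    values = self.to_numpy(na_value=None)
    data = []
    mask = []

    for val in values:
        if val is None:
            # Missing value: store a placeholder and flag it in the mask.
            data.append(False)
            mask.append(True)
        else:
            data.append(val.isdigit())
            mask.append(False)

    # Local import, as in the original proposal.
    from pandas.core.arrays.boolean import BooleanArray
    return BooleanArray(np.array(data, dtype=bool), np.array(mask, dtype=bool))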

While this isn’t vectorized, it correctly honors all Unicode digit categories, which aligns with user expectations. Let me know if this workaround is acceptable for now, or if you’d prefer keeping the current Arrow-based behavior and instead clarifying the limitation in the documentation.

Related upstream issue: I’ve confirmed that this is a pyarrow limitation and have raised an enhancement request in the Arrow repo to bring utf8_is_digit in line with str.isdigit().

Optionally, we could also explore reimplementing this in Cython using PyUnicode_READ and Py_UNICODE_ISDIGIT for performance while maintaining Unicode correctness.

Let me know which direction you'd prefer; happy to work on a patch either way.
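
The element-wise fallback proposed above can be checked without pandas. A rough stdlib-only sketch, where the helper name isdigit_with_missing is hypothetical (not part of pandas) and None stands in for a missing value:

```python
from typing import List, Optional, Sequence


def isdigit_with_missing(values: Sequence[Optional[str]]) -> List[Optional[bool]]:
    """Apply str.isdigit element-wise, propagating None for missing values."""
    return [None if v is None else v.isdigit() for v in values]


# Same inputs as the report, plus one missing value.
result = isdigit_with_missing(["23", "³", "⅕", "", None])
print(result)
```

This mirrors the data/mask split in the proposal: the None entries correspond to positions where the mask would be True.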

rhshadrach · 2025-05-30T18:34:23Z

Looks like this is getting fixed upstream (thanks!). Assuming that to be the case, my preference would be to leave pandas as-is.

cc @WillAyd @jorisvandenbossche for any thoughts.

WillAyd · 2025-05-30T19:02:47Z

Yes I agree - let's keep it as an upstream fix. Thanks for the thorough investigation and solution @iabhi4

GarrettWu · 2025-06-04T17:31:57Z

Thanks @iabhi4 for the upstream fix apache/arrow#46589. It solves the superscripts issue, but introduces another discrepancy:

// '¾' (vulgar fraction) is treated as a digit by utf8proc 'No'

Any chance we can fix it too? Otherwise str.isdigit is still different on python string and pyarrow string types.
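
To illustrate the remaining discrepancy: both '³' and '¾' have Unicode general category No, so a check based on category alone cannot tell them apart, whereas Python's str.isdigit only accepts characters that have a digit value. A stdlib-only sketch; the helper is_digit_by_category is a hypothetical stand-in for a category-based check and is not Arrow's actual implementation:

```python
import unicodedata


def is_digit_by_category(s: str) -> bool:
    # Hypothetical category-based check: accept Nd (decimal number) and
    # No (other number) characters; an empty string is never a digit string.
    return bool(s) and all(unicodedata.category(ch) in ("Nd", "No") for ch in s)


# Columns: input, Python's str.isdigit, category-based check.
for text in ["23", "³", "¾"]:
    print(repr(text), text.isdigit(), is_digit_by_category(text))
```

Only the '¾' row differs between the two columns, which is the mismatch described above.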

GarrettWu added the Bug and Needs Triage labels May 20, 2025

GarrettWu changed the title ~~BUG: Series.str.isdigit~~ BUG: Series.str.isdigit with pyarrow dtype doesn't honor unicode superscripts May 20, 2025

rhshadrach added the Strings and Arrow labels and removed the Needs Triage label May 20, 2025

rhshadrach added the Upstream issue and Needs Discussion labels May 30, 2025
