Skip to content

Fix error value_counts result with pyarrow categorical columns #60949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

chilin0525
Copy link
Contributor

@chilin0525 chilin0525 commented Feb 17, 2025

@chilin0525 chilin0525 changed the title Fix error value counts result with pyarrow categorical columns with nulls Fix error value_counts result with pyarrow categorical columns Feb 17, 2025
Comment on lines +450 to +455
from pandas import Index

if isinstance(values, Index):
arr = values._data._pa_array.combine_chunks()
else:
arr = values._pa_array.combine_chunks()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from pandas import Index
if isinstance(values, Index):
arr = values._data._pa_array.combine_chunks()
else:
arr = values._pa_array.combine_chunks()
arr = values.array._pa_array.combine_chunks()

Copy link
Contributor Author

@chilin0525 chilin0525 Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke Thanks for review. After applied this change got failed of 2 testcase in pandas/tests/extension/test_arrow.py 🤔 (testing in my local):

Following is error message from 2 fail testcase, values.array raise error when value is ArrowExtensionArray Object:

  • test_sort_values_dictionary:

        [1/1] Generating write_version_file with a custom command
    + /usr/local/bin/ninja
    ============================= test session starts ==============================
    platform linux -- Python 3.10.8, pytest-8.3.4, pluggy-1.5.0
    PyQt5 5.15.11 -- Qt runtime 5.15.16 -- Qt compiled 5.15.14
    rootdir: /home/pandas
    configfile: pyproject.toml
    plugins: xdist-3.6.1, hypothesis-6.125.3, qt-4.4.0, anyio-4.8.0, localserver-0.9.0.post0, cython-0.3.1, cov-6.0.0
    collected 26414 items / 26413 deselected / 1 selected
    
    pandas/tests/extension/test_arrow.py F
    
    =================================== FAILURES ===================================
    _________________________ test_sort_values_dictionary __________________________
    
        def test_sort_values_dictionary():
            df = pd.DataFrame(
                {
                    "a": pd.Series(
                        ["x", "y"], dtype=ArrowDtype(pa.dictionary(pa.int32(), pa.string()))
                    ),
                    "b": [1, 2],
                },
            )
            expected = df.copy()
    >       result = df.sort_values(by=["a", "b"])
    
    pandas/tests/extension/test_arrow.py:1699: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    pandas/core/frame.py:7077: in sort_values
        indexer = lexsort_indexer(
    pandas/core/sorting.py:350: in lexsort_indexer
        cat = Categorical(k, ordered=True)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <[AttributeError("'NoneType' object has no attribute 'categories'") raised in repr()] Categorical object at 0x7f5aa42d8a90>
    values = <ArrowExtensionArray>
    ['x', 'y']
    Length: 2, dtype: dictionary<values=string, indices=int32, ordered=0>[pyarrow]
    categories = None, ordered = True
    dtype = CategoricalDtype(categories=None, ordered=True, categories_dtype=None)
    copy = True
    
        def __init__(
            self,
            values,
            categories=None,
            ordered=None,
            dtype: Dtype | None = None,
            copy: bool = True,
        ) -> None:
            dtype = CategoricalDtype._from_values_or_dtype(
                values, categories, ordered, dtype
            )
            # At this point, dtype is always a CategoricalDtype, but
            # we may have dtype.categories be None, and we need to
            # infer categories in a factorization step further below
        
            if not is_list_like(values):
                # GH#38433
                raise TypeError("Categorical input must be list-like")
        
            # null_mask indicates missing values we want to exclude from inference.
            # This means: only missing values in list-likes (not arrays/ndframes).
            null_mask = np.array(False)
        
            # sanitize input
            vdtype = getattr(values, "dtype", None)
            if isinstance(vdtype, CategoricalDtype):
                if dtype.categories is None:
                    dtype = CategoricalDtype(values.categories, dtype.ordered)
            elif isinstance(values, range):
                from pandas.core.indexes.range import RangeIndex
        
                values = RangeIndex(values)
            elif not isinstance(values, (ABCIndex, ABCSeries, ExtensionArray)):
                values = com.convert_to_list_like(values)
                if isinstance(values, list) and len(values) == 0:
                    # By convention, empty lists result in object dtype:
                    values = np.array([], dtype=object)
                elif isinstance(values, np.ndarray):
                    if values.ndim > 1:
                        # preempt sanitize_array from raising ValueError
                        raise NotImplementedError(
                            "> 1 ndim Categorical are not supported at this time"
                        )
                    values = sanitize_array(values, None)
                else:
                    # i.e. must be a list
                    arr = sanitize_array(values, None)
                    null_mask = isna(arr)
                    if null_mask.any():
                        # We remove null values here, then below will re-insert
                        #  them, grep "full_codes"
                        arr_list = [values[idx] for idx in np.where(~null_mask)[0]]
        
                        # GH#44900 Do not cast to float if we have only missing values
                        if arr_list or arr.dtype == "object":
                            sanitize_dtype = None
                        else:
                            sanitize_dtype = arr.dtype
        
                        arr = sanitize_array(arr_list, None, dtype=sanitize_dtype)
                    values = arr
        
            if dtype.categories is None:
                if isinstance(values.dtype, ArrowDtype) and issubclass(
                    values.dtype.type, CategoricalDtypeType
                ):
                    # from pandas import Index
        
                    # if isinstance(values, Index):
                    #     arr = values._data._pa_array.combine_chunks()
                    # else:
                    #     arr = values._pa_array.combine_chunks()
    >               arr = values.array._pa_array.combine_chunks()
    E               AttributeError: 'ArrowExtensionArray' object has no attribute 'array'
    
    pandas/core/arrays/categorical.py:456: AttributeError
    ---------------- generated xml file: /home/pandas/test-data.xml ----------------
    ============================= slowest 30 durations =============================
    
    (3 durations < 0.005s hidden.  Use -vv to show these durations.)
    =========================== short test summary info ============================
    FAILED pandas/tests/extension/test_arrow.py::test_sort_values_dictionary - At...
    ===================== 1 failed, 26413 deselected in 1.61s ======================
    
  • test_dictionary_astype_categorical:

        [1/1] Generating write_version_file with a custom command
    + /usr/local/bin/ninja
    ============================= test session starts ==============================
    platform linux -- Python 3.10.8, pytest-8.3.4, pluggy-1.5.0
    PyQt5 5.15.11 -- Qt runtime 5.15.16 -- Qt compiled 5.15.14
    rootdir: /home/pandas
    configfile: pyproject.toml
    plugins: xdist-3.6.1, hypothesis-6.125.3, qt-4.4.0, anyio-4.8.0, localserver-0.9.0.post0, cython-0.3.1, cov-6.0.0
    collected 26414 items / 26413 deselected / 1 selected
    
    pandas/tests/extension/test_arrow.py F
    
    =================================== FAILURES ===================================
    ______________________ test_dictionary_astype_categorical ______________________
    
        def test_dictionary_astype_categorical():
            # GH#56672
            arrs = [
                pa.array(np.array(["a", "x", "c", "a"])).dictionary_encode(),
                pa.array(np.array(["a", "d", "c"])).dictionary_encode(),
            ]
            ser = pd.Series(ArrowExtensionArray(pa.chunked_array(arrs)))
    >       result = ser.astype("category")
    
    pandas/tests/extension/test_arrow.py:3339: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    pandas/core/generic.py:6435: in astype
        new_data = self._mgr.astype(dtype=dtype, errors=errors)
    pandas/core/internals/managers.py:588: in astype
        return self.apply("astype", dtype=dtype, errors=errors)
    pandas/core/internals/managers.py:438: in apply
        applied = getattr(b, f)(**kwargs)
    pandas/core/internals/blocks.py:610: in astype
        new_values = astype_array_safe(values, dtype, errors=errors)
    pandas/core/dtypes/astype.py:234: in astype_array_safe
        new_values = astype_array(values, dtype, copy=copy)
    pandas/core/dtypes/astype.py:176: in astype_array
        values = values.astype(dtype, copy=copy)
    pandas/core/arrays/base.py:769: in astype
        return cls._from_sequence(self, dtype=dtype, copy=copy)
    pandas/core/arrays/categorical.py:531: in _from_sequence
        return cls(scalars, dtype=dtype, copy=copy)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <[AttributeError("'NoneType' object has no attribute 'categories'") raised in repr()] Categorical object at 0x7feb7f621940>
    values = <ArrowExtensionArray>
    ['a', 'x', 'c', 'a', 'a', 'd', 'c']
    Length: 7, dtype: dictionary<values=string, indices=int32, ordered=0>[pyarrow]
    categories = None, ordered = None
    dtype = CategoricalDtype(categories=None, ordered=None, categories_dtype=None)
    copy = False
    
        def __init__(
            self,
            values,
            categories=None,
            ordered=None,
            dtype: Dtype | None = None,
            copy: bool = True,
        ) -> None:
            dtype = CategoricalDtype._from_values_or_dtype(
                values, categories, ordered, dtype
            )
            # At this point, dtype is always a CategoricalDtype, but
            # we may have dtype.categories be None, and we need to
            # infer categories in a factorization step further below
        
            if not is_list_like(values):
                # GH#38433
                raise TypeError("Categorical input must be list-like")
        
            # null_mask indicates missing values we want to exclude from inference.
            # This means: only missing values in list-likes (not arrays/ndframes).
            null_mask = np.array(False)
        
            # sanitize input
            vdtype = getattr(values, "dtype", None)
            if isinstance(vdtype, CategoricalDtype):
                if dtype.categories is None:
                    dtype = CategoricalDtype(values.categories, dtype.ordered)
            elif isinstance(values, range):
                from pandas.core.indexes.range import RangeIndex
        
                values = RangeIndex(values)
            elif not isinstance(values, (ABCIndex, ABCSeries, ExtensionArray)):
                values = com.convert_to_list_like(values)
                if isinstance(values, list) and len(values) == 0:
                    # By convention, empty lists result in object dtype:
                    values = np.array([], dtype=object)
                elif isinstance(values, np.ndarray):
                    if values.ndim > 1:
                        # preempt sanitize_array from raising ValueError
                        raise NotImplementedError(
                            "> 1 ndim Categorical are not supported at this time"
                        )
                    values = sanitize_array(values, None)
                else:
                    # i.e. must be a list
                    arr = sanitize_array(values, None)
                    null_mask = isna(arr)
                    if null_mask.any():
                        # We remove null values here, then below will re-insert
                        #  them, grep "full_codes"
                        arr_list = [values[idx] for idx in np.where(~null_mask)[0]]
        
                        # GH#44900 Do not cast to float if we have only missing values
                        if arr_list or arr.dtype == "object":
                            sanitize_dtype = None
                        else:
                            sanitize_dtype = arr.dtype
        
                        arr = sanitize_array(arr_list, None, dtype=sanitize_dtype)
                    values = arr
        
            if dtype.categories is None:
                if isinstance(values.dtype, ArrowDtype) and issubclass(
                    values.dtype.type, CategoricalDtypeType
                ):
                    # from pandas import Index
        
                    # if isinstance(values, Index):
                    #     arr = values._data._pa_array.combine_chunks()
                    # else:
                    #     arr = values._pa_array.combine_chunks()
    >               arr = values.array._pa_array.combine_chunks()
    E               AttributeError: 'ArrowExtensionArray' object has no attribute 'array'
    
    pandas/core/arrays/categorical.py:456: AttributeError
    ---------------- generated xml file: /home/pandas/test-data.xml ----------------
    ============================= slowest 30 durations =============================
    
    (3 durations < 0.005s hidden.  Use -vv to show these durations.)
    =========================== short test summary info ============================
    FAILED pandas/tests/extension/test_arrow.py::test_dictionary_astype_categorical
    ===================== 1 failed, 26413 deselected in 1.67s ======================
    

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. It appears something is passing in an ArrowExtensionArray here already. OK what you have here is fine then

@mroeschke mroeschke added Categorical Categorical Data Type Arrow pyarrow functionality labels Feb 18, 2025
@mroeschke mroeschke added this to the 3.0 milestone Feb 19, 2025
@mroeschke mroeschke merged commit e2e3791 into pandas-dev:main Feb 19, 2025
42 checks passed
@mroeschke
Copy link
Member

Thanks @chilin0525

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: value_counts() returns error/wrong result with PyArrow categorical columns with nulls
2 participants