BUG: read_parquet does not respect index for arrow dtype backend #51726
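
A minimal sketch of the reported behaviour (file name and data are illustrative; assumes a build where the mode.dtype_backend option exists, as used in the tests below):

import pandas as pd

df = pd.DataFrame({"A": [0, 1], "B": [1, 0]}).set_index("A")
df.to_parquet("test.parquet")

with pd.option_context("mode.dtype_backend", "pyarrow"):
    result = pd.read_parquet("test.parquet", engine="pyarrow")

# Expected: "A" restored as the index; before this PR, the index
# metadata was ignored and the data came back with a default index.
print(result.index)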

Closed · wants to merge 4 commits

41 changes: 33 additions & 8 deletions pandas/io/parquet.py
@@ -24,7 +24,9 @@

from pandas import (
DataFrame,
Index,
MultiIndex,
RangeIndex,
arrays,
get_option,
)
@@ -250,14 +252,37 @@ def read(
if dtype_backend == "pandas":
result = pa_table.to_pandas(**to_pandas_kwargs)
elif dtype_backend == "pyarrow":
result = DataFrame(
{
col_name: arrays.ArrowExtensionArray(pa_col)
for col_name, pa_col in zip(
pa_table.column_names, pa_table.itercolumns()
)
}
)
index_columns = pa_table.schema.pandas_metadata.get("index_columns", [])
result_dc = {
col_name: arrays.ArrowExtensionArray(pa_col)
Member

Is there a reason not to use types_mapper here (as is done for our own masked nullable arrays)?

All this handling of the index is already done by pyarrow, and using to_pandas(), as in the other code paths, would ensure it is done consistently.

Member Author


This is a good idea, but it looks like arrow is not applying the types mapper to the index either. Not sure why we didn't do it like this initially, though.

I tried:

types_mapper = lambda x: pd.ArrowDtype(x)
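
Sketched in context (assuming pa_table is the pyarrow.Table being converted in read()):

result = pa_table.to_pandas(types_mapper=lambda x: pd.ArrowDtype(x))
# The mapper is applied to the data columns, which come back as
# ArrowDtype, but the index is still rebuilt through pyarrow's default
# NumPy-backed path, so the mapper has no effect on it.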

Member


Yeah, that's a bug in pyarrow (well, when this keyword was implemented the Index did not yet support EAs, so at that point there was no need to consider the index as well). This was recently reported (apache/arrow#34283), and we can make sure to fix it for the next release in April.

Another reason to go the types_mapper way is that users can define a custom ExtensionArray with its own pyarrow->pandas conversion, and the current code here would ignore that.

Short term, an option to overcome the Index bug could be to convert the Index manually back to an Arrow-backed array. That's of course a bit wasteful if it was not a zero-copy conversion, but for people following the latest pyarrow releases it should only be a short-term issue.
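
A rough sketch of that workaround, assuming a single non-RangeIndex level (MultiIndex handling omitted); the round trip through pyarrow may copy:

import pyarrow as pa
import pandas as pd

result = pa_table.to_pandas(types_mapper=pd.ArrowDtype)
if not isinstance(result.index, pd.RangeIndex):
    # Re-wrap the (NumPy-backed) index values in an Arrow-backed array.
    arrow_values = pa.array(result.index)
    result.index = pd.Index(
        pd.arrays.ArrowExtensionArray(arrow_values),
        name=result.index.name,
    )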

Member Author (@phofl, Mar 3, 2023)

I switched to types_mapper=pd.ArrowDtype in #51766.

If fixing the Index conversion is on your agenda, I'd rather avoid implementing this ourselves for a couple of weeks. Thoughts?

If you agree, we can just close this PR.
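
Roughly, for this code path (pd.ArrowDtype is callable on a pyarrow type, so it can serve as the mapper directly; a sketch, not the exact diff):

elif dtype_backend == "pyarrow":
    result = pa_table.to_pandas(types_mapper=pd.ArrowDtype)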

Member


From my limited understanding of https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/arrow_to_pandas.cc, I used the manual conversion to avoid a conversion from numpy(?)

// Functions for pandas conversion via NumPy

Member Author


They are going through pandas_dtype.__from_arrow__(arr) which receives an arrow array, so we should be good?

Member


Yes, whenever pyarrow detects an extension dtype for a column (either in the metadata, in the pyarrow type itself, or via the types_mapper keyword), we don't actually convert to numpy but directly pass the pyarrow array to dtype.__from_arrow__.
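
A small illustration of that hook (a sketch, assuming pandas 2.x with ArrowDtype):

import pyarrow as pa
import pandas as pd

chunked = pa.chunked_array([pa.array([1, 2, 3])])
dtype = pd.ArrowDtype(pa.int64())
# pyarrow hands the Arrow data straight to the dtype; no NumPy detour.
arr = dtype.__from_arrow__(chunked)  # -> ArrowExtensionArray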

Member


Got it, thanks for the confirmation. I think #51766 should be sufficient then.

for col_name, pa_col in zip(
pa_table.column_names, pa_table.itercolumns()
)
}
idx: Index | None
if len(index_columns) == 0:
idx = None
elif len(index_columns) == 1 and isinstance(index_columns[0], dict):
params = index_columns[0]
idx = RangeIndex(
params.get("start"),
params.get("stop"),
params.get("step"),
name=params.get("name"),
)

else:
index_data = [
result_dc.pop(index_col) for index_col in index_columns
]
if len(index_data) == 1:
name = index_columns[0]
if isinstance(name, str) and name.startswith("__index_level_"):
name = None
idx = Index(index_data[0], name=name)
else:
idx = MultiIndex.from_arrays(index_data, names=index_columns)
result = DataFrame(result_dc, index=idx)
if manager == "array":
result = result._as_manager("array", copy=False)
return result
36 changes: 36 additions & 0 deletions pandas/tests/io/test_parquet.py
@@ -18,6 +18,7 @@
import pandas.util._test_decorators as td

import pandas as pd
from pandas import RangeIndex
import pandas._testing as tm
from pandas.util.version import Version

@@ -1225,3 +1226,38 @@ def test_bytes_file_name(self, engine):

result = read_parquet(path, engine=engine)
tm.assert_frame_equal(result, df)

@pytest.mark.parametrize("index", ["A", ["A", "B"]])
def test_pyarrow_backed_df_index(self, index, pa):
# GH#48944
obj = pd.DataFrame(data={"A": [0, 1], "B": [1, 0], "C": 1})
df = obj.set_index(index)
with tm.ensure_clean("test.parquet") as path:
with open(path.encode(), "wb") as f:
df.to_parquet(f)

with pd.option_context("mode.dtype_backend", "pyarrow"):
result = read_parquet(path, engine="pyarrow")
expected = obj.astype("int64[pyarrow]").set_index(index)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize("name", [None, "test"])
@pytest.mark.parametrize("index", [True, False, None])
def test_pyarrow_backed_df_range_index(self, pa, index, name):
# GH#48944
df = pd.DataFrame(
data={"A": [0, 1], "B": [1, 0]},
index=RangeIndex(start=100, stop=102, name=name),
)
with tm.ensure_clean("test.parquet") as path:
with open(path.encode(), "wb") as f:
df.to_parquet(f, index=index)

with pd.option_context("mode.dtype_backend", "pyarrow"):
result = read_parquet(path, engine="pyarrow")
expected = df.astype("int64[pyarrow]")
if index is False:
expected = expected.reset_index(drop=True)
elif index:
expected.index = pd.Index([100, 101], dtype="int64[pyarrow]", name=name)
tm.assert_frame_equal(result, expected)
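
To run just the new tests locally (standard pytest selection; requires pyarrow):

pytest pandas/tests/io/test_parquet.py -k "pyarrow_backed_df"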