API: New global option to set the default dtypes to use #61620

Open
datapythonista opened this issue Jun 10, 2025 · 3 comments
Labels: API Design, Needs Discussion (requires discussion from core team before further action)

Comments

@datapythonista (Member)

This was already implemented before 2.0 in #50748, but then removed before the release in #51853, as in too many cases the option wasn't being respected.

The idea is to have a global option to let pandas know which dtype kind to use when data is created (the exact option name needs to be discussed, but I'll use `use_arrow` to illustrate):

```python
pandas.options.mode.use_arrow = True

df = pandas.read_csv(...)   # the returned DataFrame will use PyArrow dtypes
df["foo"] = 1               # the added column will use PyArrow dtypes
df = pandas.DataFrame(...)  # the returned DataFrame will use PyArrow dtypes
...
```

I don't think adding the option is controversial, as it has no impact on users unless set, and it was already implemented without objections in the past.

I think the implementation requires a bit of discussion, as the exact behavior to implement is not immediately obvious, at least to me. The main points I can see:

  1. Should we have an option to set pyarrow as the default (since those should be the types we expect people to use in the future), or a more generic option to set dtype_backend to numpy|nullable|pyarrow?
  2. I think at least initially it makes sense that if a user is specific about the dtype they want to use (e.g. Series([1, 2], dtype="Int32")) we let them do it. But could it make sense to have a second option force_arrow or force_dtype_backend so any operation that would use another dtype kind would fail? I think this could be helpful for users that only want to live in the pyarrow world, and it would also be helpful to identify undesired casts for us.
  3. The exact namespace (mode vs future vs others) and name of the option, which will clearly depend on the previous points.
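For context on point 1: pandas already exposes a per-call `dtype_backend` parameter on its readers, and the proposed option would effectively set a global default for it. A minimal sketch of today's per-call behaviour, using the `numpy_nullable` backend so it runs without PyArrow installed:

```python
import io

import pandas as pd

csv = io.StringIO("a,b\n1,2.5\n3,4.5")

# Today the backend must be chosen per call; the proposed global option
# would make this the default for every constructor and reader.
df = pd.read_csv(csv, dtype_backend="numpy_nullable")
print(df.dtypes)  # a -> Int64, b -> Float64 (nullable extension dtypes)
```

With PyArrow installed, `dtype_backend="pyarrow"` produces `ArrowDtype`-backed columns instead.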
datapythonista added the Dtype Conversions, Needs Discussion, and pyarrow dtype retention labels on Jun 10, 2025
datapythonista changed the title from "ENH: New global option to set the default dtypes to use" to "API: New global option to set the default dtypes to use" on Jun 10, 2025
datapythonista added the API Design label and removed the Dtype Conversions and pyarrow dtype retention labels on Jun 10, 2025
@simonjayhawkins (Member)

> 2. I think at least initially it makes sense that if a user is specific about the dtype they want to use (e.g. Series([1, 2], dtype="Int32")) we let them do it. But could it make sense to have a second option force_arrow or force_dtype_backend so any operation that would use another dtype kind would fail? I think this could be helpful for users that only want to live in the pyarrow world, and it would also be helpful to identify undesired casts for us.

It would seem logical that, with the global option set, dtypes are silently mapped to Arrow types: the purpose of the option is to work only with Arrow types.

A secondary option to control that mapping would perhaps be desirable for some users.

But we definitely would not want to require any code changes. The idea of the option is to let users adopt PyArrow in existing code without modifications.

We could perhaps give consideration to logical types, as per PDEP-13 #58455, as a future direction, so that these silent dtype mappings do not occur; but that is definitely not a blocker to what you are proposing.
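The silent mapping described above already exists in pandas as a per-object operation: `convert_dtypes` accepts a `dtype_backend` argument. A small sketch, using the `numpy_nullable` backend so it runs without PyArrow:

```python
import pandas as pd

s = pd.Series([1, 2, 3])  # plain NumPy int64
mapped = s.convert_dtypes(dtype_backend="numpy_nullable")
print(mapped.dtype)  # Int64 (nullable extension dtype)
```

With PyArrow installed, `dtype_backend="pyarrow"` maps the same series to `int64[pyarrow]` instead.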

@arthurlw (Member)

> 1. Should we have an option to set pyarrow as the default (since those should be the types we expect people to use in the future), or a more generic option to set dtype_backend to numpy|nullable|pyarrow?

Not a maintainer, but personally I would prefer the latter: it feels more future-proof and flexible, especially if other backends are considered later on.

@datapythonista (Member, Author)

Thanks @arthurlw, this is good feedback. I see the appeal of the more generic option, but I prefer the first one, because I see the dtype backends not as a feature, but as something we had to do because we didn't get the backend we wanted initially.

Long term I think users should just think about float, int... and not how they are stored internally. In that sense maybe `pandas.options.mode.use_legacy_dtypes = True/False` would be even clearer, if others share my point of view.
