Commit fde84ea

word wrap
1 parent 2968157 commit fde84ea

web/pandas/pdeps/0018-nullable-object-dtype.md

Lines changed: 127 additions & 32 deletions
## Abstract

This proposal outlines the introduction of a nullable object dtype to the pandas library. The goal is to provide a dedicated dtype for handling arbitrary Python objects with consistent missing value semantics using `pd.NA`. Unlike the traditional `object` dtype, which lacks robust missing data handling, this new nullable dtype will add clarity and consistency in representing missing or undefined values within object arrays.

## Motivation

Currently, the `object` dtype in pandas is a catch-all for heterogeneous Python objects, but it does not enforce any particular missing-value semantics. As pandas has evolved to include extension types (like `string[python]`, `Int64`, or `boolean`), there is a clear benefit in extending these improvements to the `object` dtype. A nullable object dtype would help:

- **Consistency**: Enforce a uniform approach to managing missing values with `pd.NA` across all dtypes.
- **Interoperability**: Enable cleaner and more predictable behavior when performing operations on data previously stored as generic objects.
- **Clarity**: Help users distinguish between truly “object” data and data that is better represented by a nullable container supporting missing values.

This proposal is driven by frequent community discussions and development efforts that aim to unify missing value handling across pandas data types.

## Detailed Proposal
### Definition

The proposal introduces a new extension type, tentatively named `"object_nullable"`, that stores an underlying array of Python objects alongside a boolean mask that indicates missing (i.e., `pd.NA`) values. The API should mimic that of existing extension arrays, ensuring that missing value propagation, casting, and arithmetic comparisons (where applicable) behave consistently with other nullable types.

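To make the storage model described above concrete, here is a minimal sketch of the proposed layout, an object-dtype NumPy array plus a boolean mask. The variable names are illustrative only and do not correspond to any existing pandas API for the new dtype.

```python
import numpy as np
import pandas as pd

# Illustrative storage layout only: a NumPy object array holding the
# values, plus a boolean mask marking which positions are missing.
# This mirrors the layout of pandas' existing masked arrays.
values = np.array([{"a": 1}, "text", 3.5, None], dtype=object)
mask = np.array([False, False, False, True])

# Positions where the mask is True would be presented as pd.NA.
materialized = [pd.NA if is_missing else value
                for value, is_missing in zip(values, mask)]
print(materialized)  # [{'a': 1}, 'text', 3.5, <NA>]
```
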
### Key Features
1. **Consistent Missing Value Semantics**:
    - Missing entries will be represented by `pd.NA`, ensuring compatibility with the pandas nullable dtypes that use `pd.NA` as the missing value indicator, as well as with the experimental `ArrowDtype`.
    - Operations that encounter missing values will handle `pd.NA` uniformly, consistent with the other pandas nullable dtypes that use `pd.NA` as the missing value indicator.
2. **Underlying Data Storage**:
    - The core data structure will consist of a NumPy array of Python objects and an associated boolean mask (not so different from the current `object`-backed nullable string array variant that uses `pd.NA` as the missing value).
    - Consideration should be given to performance, ensuring that operations remain as vectorized as possible despite the inherent overhead of handling Python objects.
3. **API Integration**:
    - The new dtype will implement the ExtensionArray interface.
    - Methods such as `astype`, `isna`, `fillna`, and element-wise operations are already defined to respect missing values in the other pandas nullable dtypes (see the sketch after this list for the pattern being followed).
    - All operations on a nullable object array will return a pandas nullable array except where explicitly requested otherwise, such as via `astype`. Methods like `fillna` would still return a nullable object array even when there are no missing values, to avoid introducing mixed-propagation behavior.
    - Ensure compatibility with pandas functions, like groupby, concatenation, and merging, where the semantics of missing values are critical.
4. **Transition and Interoperability**:
    - Users should be able to convert from the legacy object dtype to `object_nullable` using a constructor or an explicit method (e.g., `pd.array(old_array, dtype="object_nullable")`) via the existing API.
    - Operations on existing pandas nullable dtypes that would normally produce an object dtype should be updated (or made configurable as a transition path) to yield `object_nullable` in all cases, even when missing values are not present, to avoid introducing mixed-propagation behavior.
    - `ArrowDtype` does not offer an `object` dtype for heterogeneous Python objects, and therefore a user requesting Arrow dtypes could be given `object_nullable` arrays where appropriate to avoid mixed `pd.NA`/`np.nan` semantics when using `dtype_backend="pyarrow"`.

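Because `"object_nullable"` does not exist yet, the API pattern it would follow can only be illustrated by analogy. The sketch below uses the existing nullable `boolean` dtype purely as a stand-in for the behavior described in points 1 and 3 above.

```python
import pandas as pd

# Existing masked dtypes already exhibit the behavior the proposed
# "object_nullable" array would mimic: NA-like inputs become pd.NA,
# and results stay within the nullable family unless an explicit
# cast is requested.
arr = pd.array([True, None, False], dtype="boolean")

print(arr.isna())          # [False  True False]; None became pd.NA
print(arr.fillna(False))   # still a nullable 'boolean' array
print(arr.astype(object))  # explicit request for a plain object ndarray
```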
### Implementation Considerations
1. **Performance**:
    - Handling arbitrary Python objects is inherently slower than operations on native numerical types (a rough illustration follows after this list).
    - Expanding the EA interface to 2D is outside the scope of this PDEP.
2. **Backward Compatibility**:
    - Existing code that uses the traditional object dtype should not break. (Making the pandas nullable object dtype the default is not part of this proposal and would be discussed in conjunction with making the other pandas nullable dtypes the default.)
    - Existing code that uses the pandas nullable dtypes should not break without warnings, even though they are considered experimental, as these dtypes have been available to users for a long time. The new dtype can be offered as an opt-in feature initially.
3. **Testing and Documentation**:
    - Extensive tests will be required to validate behavior against edge cases.
    - Updated documentation should explain the differences between the legacy object dtype and `object_nullable`, including examples and migration tips.
4. **Community Feedback**:
    - Continuous discussions on GitHub, mailing lists, and related channels will inform refinements. The nullable object dtype should be available as opt-in for at least two minor versions to allow sufficient time for feedback before the return types of the existing pandas nullable dtypes are changed.

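As a rough, machine-dependent illustration of the performance point above (not a benchmark of the proposed dtype itself), summing values stored as native int64 versus as boxed Python objects already shows the overhead:

```python
import time

import numpy as np
import pandas as pd

# Informal timing only: exact numbers vary by machine and pandas version.
native = pd.Series(np.arange(1_000_000, dtype="int64"))
boxed = native.astype(object)  # same values, boxed as Python ints

for label, series in [("int64", native), ("object", boxed)]:
    start = time.perf_counter()
    series.sum()
    print(f"{label}: {time.perf_counter() - start:.4f}s")
```
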
## Alternatives Considered
- Continuing with the Legacy Object Dtype:
    - Retaining the ambiguous missing value semantics of the legacy object dtype does not provide a robust and consistent solution aligned with the design of the other extension arrays.
    - Not having a nullable object dtype could be a blocker for a potential future nullable-by-default policy.

## Drawbacks and Future Directions
1. **Overhead Cost**:
   The additional memory required for a boolean mask and possible performance penalties in highly heterogeneous arrays are acknowledged trade-offs.
2. **Integration Complexity**:
   Ensuring seamless integration with the full suite of pandas functionality may reveal edge cases that require careful handling.
3. **Incompatibility**:
   The existing object array can hold any Python object, even `pd.NA` itself. The proposed nullable object array will be unable to hold `np.nan`, `None`, or `pd.NaT`, as these will be considered missing in the constructors and other conversions when following the existing API for the other nullable types. Users will not be able to round-trip between the legacy and nullable object dtypes (see the illustration after this list).

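As a minimal illustration of the round-trip limitation, and assuming the new dtype follows the conversion rules of the existing masked constructors, the nullable string dtype already shows how distinct NA-like sentinels collapse into `pd.NA` and cannot be recovered:

```python
import numpy as np
import pandas as pd

# The nullable string dtype is used here only as an analogue: existing
# masked constructors normalize NA-like sentinels to pd.NA.
legacy = np.array(["x", None, np.nan], dtype=object)
nullable = pd.array(legacy, dtype="string")
print(nullable)  # ['x', <NA>, <NA>]

# Converting back yields pd.NA, not the original None/np.nan sentinels.
round_tripped = np.asarray(nullable, dtype=object)
print(round_tripped)  # ['x' <NA> <NA>]
```
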
## Conclusion

Introducing a nullable object dtype in pandas will offer clearer semantics for missing values and align the behavior of object arrays with the other nullable types. This proposal is aimed at fostering discussion and soliciting community feedback to refine the design and implementation roadmap.