|
8 | 8 |
|
9 | 9 | ## Abstract
|
10 | 10 |
|
11 |
| -This proposal outlines the introduction of a nullable object dtype to the pandas library. The goal is to provide a dedicated dtype for handling arbitrary Python objects with consistent missing value semantics using `pd.NA`. Unlike the traditional `object` dtype which lacks robust missing data handling, this new nullable dtype will add clarity and consistency in representing missing or undefined values within object arrays. |
| 11 | +This proposal outlines the introduction of a nullable object |
| 12 | +dtype to the pandas library. The goal is to provide a |
| 13 | +dedicated dtype for handling arbitrary Python objects with |
| 14 | +consistent missing value semantics using `pd.NA`. Unlike the |
| 15 | +traditional `object` dtype which lacks robust missing data |
| 16 | +handling, this new nullable dtype will add clarity and |
| 17 | +consistency in representing missing or undefined values |
| 18 | +within object arrays. |
12 | 19 |
|
13 | 20 | ## Motivation
|
14 | 21 |
|
15 |
| -Currently, the `object` dtype in pandas is a catch-all for heterogeneous Python objects, but it does not enforce any particular missing-value semantics. As pandas has evolved to include extension types (like `string[python]`, `Int64`, or `boolean`), there is a clear benefit in extending these improvements to the object datatype. A nullable object dtype would help: |
16 |
| -- **Consistency**: Enforce a uniform approach to managing missing values with `pd.NA` across all dtypes. |
17 |
| -- **Interoperability**: Enable cleaner and more predictable behavior when performing operations on data previously stored as generic objects. |
18 |
| -- **Clarity**: Help users distinguish between truly “object” data and data that is better represented by a nullable container supporting missing values. |
19 |
| - |
20 |
| -This proposal is driven by frequent community discussions and development efforts that aim to unify missing value handling across pandas data types. |
| 22 | +Currently, the `object` dtype in pandas is a catch-all for |
| 23 | +heterogeneous Python objects, but it does not enforce any |
| 24 | +particular missing-value semantics. As pandas has evolved to |
| 25 | +include extension types (like `string[python]`, `Int64`, or |
| 26 | +`boolean`), there is a clear benefit in extending these |
| 27 | +improvements to the object datatype. A nullable object dtype |
| 28 | +would help: |
| 29 | +- **Consistency**: Enforce a uniform approach to managing |
| 30 | +missing values with `pd.NA` across all dtypes. |
| 31 | +- **Interoperability**: Enable cleaner and more predictable |
| 32 | +behavior when performing operations on data previously |
| 33 | +stored as generic objects. |
| 34 | +- **Clarity**: Help users distinguish between truly “object” |
| 35 | +data and data that is better represented by a nullable |
| 36 | +container supporting missing values. |
| 37 | + |
| 38 | +This proposal is driven by frequent community discussions |
| 39 | +and development efforts that aim to unify missing value |
| 40 | +handling across pandas data types. |
21 | 41 |
|
22 | 42 | ## Detailed Proposal
|
23 | 43 |
|
24 | 44 | ### Definition
|
25 | 45 |
|
26 |
| -The proposal introduces a new extension type, tentatively named `"object_nullable"`, that stores an underlying array of Python objects alongside a boolean mask that indicates missing (i.e., `pd.NA`) values. The API should mimic that of existing extension arrays, ensuring that missing value propagation, casting, and arithmetic comparisons (where applicable) behave consistently with other nullable types. |
| 46 | +The proposal introduces a new extension type, tentatively |
| 47 | +named `"object_nullable"`, that stores an underlying array |
| 48 | +of Python objects alongside a boolean mask that indicates |
| 49 | +missing (i.e., `pd.NA`) values. The API should mimic that of |
| 50 | +existing extension arrays, ensuring that missing value |
| 51 | +propagation, casting, and arithmetic comparisons (where |
| 52 | +applicable) behave consistently with other nullable types. |
27 | 53 |
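The storage layout described in the Definition can be sketched in plain Python. This is only an illustration under stated assumptions: `MaskedObjects` and the `NA` sentinel are hypothetical stand-ins (not pandas API), and Python lists take the place of the NumPy object array and boolean mask so that the example runs without pandas installed.

```python
class _NAType:
    """Stand-in for pd.NA so the sketch runs without pandas (assumption)."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


class MaskedObjects:
    """Hypothetical model: a list of objects plus a parallel boolean mask."""

    def __init__(self, values, mask):
        if len(values) != len(mask):
            raise ValueError("values and mask must have the same length")
        self._values = list(values)
        self._mask = list(mask)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        # The mask, not the stored object, decides whether a slot is missing.
        return NA if self._mask[i] else self._values[i]

    def isna(self):
        # Return a copy of the mask so callers cannot mutate internal state.
        return list(self._mask)


# The second slot is masked, so whatever object sits there is never exposed.
arr = MaskedObjects([{"a": 1}, None, (1, 2)], [False, True, False])
```

Keeping the mask separate means the missing-value check never has to inspect the stored objects themselves.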
|
28 | 54 | ### Key Features
|
29 | 55 | 1. **Consistent Missing Value Semantics**:
|
30 |
| - - Missing entries will be represented by `pd.NA`, ensuring compatibility with pandas nullable dtypes that use `pd.NA` as the missing value indicator as well as the experimental `ArrowDType`. |
31 |
| - - Operations that encounter missing values will handle `pd.NA` uniformly consistent with other pandas nullable dtypes that use `pd.NA` as the missing value indicator. |
| 56 | + - Missing entries will be represented by `pd.NA`, |
| 57 | + ensuring compatibility with pandas nullable dtypes that |
| 58 | + use `pd.NA` as the missing value indicator, as well as |
| 59 | + the experimental `ArrowDtype`. |
| 60 | + - Operations that encounter missing values will handle |
| 61 | + `pd.NA` uniformly, consistent with the other pandas |
| 62 | + nullable dtypes. |
32 | 63 | 2. **Underlying Data Storage**:
|
33 |
| - - The core data structure will consist of a NumPy array of Python objects and an associated boolean mask. (not so different from the current `object` backed nullable string array variant that uses `pd.NA` as the missing value.) |
34 |
| - - Consideration should be given to performance, ensuring that operations remain as vectorized as possible despite the inherent overhead of handling Python objects. |
| 64 | + - The core data structure will consist of a NumPy array |
| 65 | + of Python objects and an associated boolean mask (not |
| 66 | + so different from the current `object`-backed nullable |
| 67 | + string array variant that uses `pd.NA` as the missing |
| 68 | + value). |
| 69 | + - Consideration should be given to performance, ensuring |
| 70 | + that operations remain as vectorized as possible despite |
| 71 | + the inherent overhead of handling Python objects. |
35 | 72 | 3. **API Integration**:
|
36 |
| - - The new dtype will implement the ExtensionArray interface. |
37 |
| - - Methods such as `astype`, `isna`, `fillna`, and element-wise operations are already defined to respect missing values in the other pandas nullable dtypes. |
38 |
| - - All operations on a nullable object array will return a pandas nullable array except where requested, such as `astype`. Methods like `fillna` would still return a nullable object array even though there are no missing values to avoid introducing mixed-propagation behavior. |
39 |
| - - Ensure compatibility with pandas functions, like groupby, concatenation, and merging, where the semantics of missing values are critical. |
| 73 | + - The new dtype will implement the ExtensionArray |
| 74 | + interface. |
| 75 | + - Methods such as `astype`, `isna`, `fillna`, and |
| 76 | + element-wise operations are already defined to respect |
| 77 | + missing values in the other pandas nullable dtypes. |
| 78 | + - All operations on a nullable object array will return |
| 79 | + a pandas nullable array unless another dtype is explicitly |
| 80 | + requested, such as with `astype`. Methods like `fillna` would |
| 81 | + still return a nullable object array even when no missing |
| 82 | + values remain, to avoid introducing mixed-propagation behavior. |
| 83 | + - Ensure compatibility with pandas functions, like |
| 84 | + groupby, concatenation, and merging, where the semantics |
| 85 | + of missing values are critical. |
40 | 86 | 4. **Transition and Interoperability**:
|
41 |
| - - Users should be able to convert from the legacy object dtype to object_nullable using a constructor or an explicit method (e.g., `pd.array(old_array, dtype="object_nullable")`) using the existing api. |
42 |
| - - Operations on existing pandas nullable dtypes that would normally produce an object dtype should be updated (or made configurable as a transition path) to yield "object_nullable" in all cases even when missing values are not present to avoid introducing mixed-propagation behavior. |
43 |
| - - `ArrowDType` does not offer an `object` dtype for heterogeneous Python objects and therefore a user requesting arrow dtypes could be given "object_nullable" arrays where appropriate to avoid mixed `pd.NA`/`np.nan` semantics when using `dtype_backend="pyarrow"`. |
| 87 | + - Users should be able to convert from the legacy object |
| 88 | + dtype to `object_nullable` through the existing API, using |
| 89 | + a constructor or an explicit method (e.g., `pd.array(old_array, |
| 90 | + dtype="object_nullable")`). |
| 91 | + - Operations on existing pandas nullable dtypes that |
| 92 | + would normally produce an object dtype should be updated |
| 93 | + (or made configurable as a transition path) to yield |
| 94 | + `object_nullable` in all cases, even when missing values |
| 95 | + are not present, to avoid introducing mixed-propagation |
| 96 | + behavior. |
| 97 | + - `ArrowDtype` does not offer an `object` dtype for |
| 98 | + heterogeneous Python objects, so a user requesting |
| 99 | + Arrow dtypes could be given `object_nullable` arrays |
| 100 | + where appropriate to avoid mixed `pd.NA`/`np.nan` |
| 101 | + semantics when using `dtype_backend="pyarrow"`. |
44 | 102 |
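Two of the behaviors listed under Key Features, missing-value propagation in element-wise operations and `fillna` keeping the nullable container type, could look roughly like the following. This is a minimal sketch, not the proposed implementation: `MaskedObjects`, its `eq` and `to_list` methods, and the `NA` sentinel are hypothetical names, and plain lists stand in for the NumPy arrays.

```python
class _NAType:
    """Stand-in for pd.NA so the sketch runs without pandas (assumption)."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


class MaskedObjects:
    """Hypothetical values-plus-mask container used only for illustration."""

    def __init__(self, values, mask):
        self._values = list(values)
        self._mask = list(mask)

    def to_list(self):
        # Masked slots are reported as NA regardless of the stored object.
        return [NA if m else v for v, m in zip(self._values, self._mask)]

    def eq(self, other):
        # Element-wise equality: a missing slot yields NA rather than False.
        return [NA if m else v == other for v, m in zip(self._values, self._mask)]

    def fillna(self, value):
        # The result is still a MaskedObjects, even though nothing is missing.
        filled = [value if m else v for v, m in zip(self._values, self._mask)]
        return MaskedObjects(filled, [False] * len(filled))


arr = MaskedObjects(["x", None, 3], [False, True, False])
compared = arr.eq("x")           # missing slot propagates as NA
filled = arr.fillna("missing")   # same container type, all-False mask
```

Returning the same container from `fillna` is what keeps later operations on the result propagating missing values the same way.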
|
45 | 103 |
|
46 | 104 | ### Implementation Considerations
|
47 | 105 | 1. **Performance**:
|
48 |
| - - Handling arbitrary Python objects is inherently slower than operations on native numerical types. |
49 |
| - - Expanding the EA interface to 2D is outside the scope of this PDEP. |
| 106 | + - Handling arbitrary Python objects is inherently slower |
| 107 | + than operations on native numerical types. |
| 108 | + - Expanding the EA interface to 2D is outside the scope |
| 109 | + of this PDEP. |
50 | 110 |
|
51 | 111 | 2. **Backward Compatibility**:
|
52 |
| - - Existing code that uses the traditional object dtype should not break. (Making the pandas nullable object dtype the default is not part of this proposal and would be discussed in conjunction with moving the other pandas nullable dtypes to be default.) |
53 |
| - - Existing code that uses the pandas nullable dtypes should not break without warnings, even though they are considered experimental, as these dtypes have been available to users for a long time. The new dtype can be offered as an opt-in feature initially. |
| 112 | + - Existing code that uses the traditional object dtype |
| 113 | + should not break. (Making the pandas nullable object |
| 114 | + dtype the default is not part of this proposal and would |
| 115 | + be discussed in conjunction with moving the other pandas |
| 116 | + nullable dtypes to be default.) |
| 117 | + - Existing code that uses the pandas nullable dtypes |
| 118 | + should not break without warnings, even though they are |
| 119 | + considered experimental, as these dtypes have been |
| 120 | + available to users for a long time. The new dtype can be |
| 121 | + offered as an opt-in feature initially. |
54 | 122 | 3. **Testing and Documentation**:
|
55 |
| - - Extensive tests will be required to validate behavior against edge cases. |
56 |
| - - Updated documentation should explain differences between the legacy object dtype and object_nullable, including examples and migration tips. |
| 123 | + - Extensive tests will be required to validate behavior |
| 124 | + against edge cases. |
| 125 | + - Updated documentation should explain differences |
| 126 | + between the legacy object dtype and `object_nullable`, |
| 127 | + including examples and migration tips. |
57 | 128 | 4. **Community Feedback**:
|
58 |
| - - Continuous discussions on GitHub, mailing lists, and related channels will inform refinements. The nullable object dtype should be available as opt-in for at least 2 minor versions to allow sufficient time for feedback before the return types of the existing pandas nullable dtypes are changed. |
| 129 | + - Continuous discussions on GitHub, mailing lists, and |
| 130 | + related channels will inform refinements. The nullable |
| 131 | + object dtype should be available as opt-in for at least |
| 132 | + 2 minor versions to allow sufficient time for feedback |
| 133 | + before the return types of the existing pandas nullable |
| 134 | + dtypes are changed. |
59 | 135 |
|
60 | 136 | ## Alternatives Considered
|
61 | 137 | - Continuing with the Legacy Object Dtype:
|
62 |
| - - Retaining the ambiguous missing value semantics of the legacy object dtype does not provide a robust and consistent solution, aligning with the design of other extension arrays. |
63 |
| - - Not having a nullable object dtype could potentially be a blocker for a potential future nullable by default policy. |
| 138 | + - Retaining the ambiguous missing value semantics of the |
| 139 | + legacy object dtype does not provide a robust and |
| 140 | + consistent solution that aligns with the design of other |
| 141 | + extension arrays. |
| 142 | + - Not having a nullable object dtype could block a |
| 143 | + potential future policy of making nullable dtypes the |
| 144 | + default. |
64 | 145 |
|
65 | 146 | ## Drawbacks and Future Directions
|
66 | 147 | 1. **Overhead Cost**:
|
67 |
| -The additional memory required for a boolean mask and possible performance penalties in highly heterogeneous arrays are acknowledged trade-offs. |
| 148 | +The additional memory required for a boolean mask and |
| 149 | +possible performance penalties in highly heterogeneous |
| 150 | +arrays are acknowledged trade-offs. |
68 | 151 | 2. **Integration Complexity**:
|
69 |
| -Ensuring seamless integration with the full suite of pandas functionality may reveal edge cases that require careful handling. |
| 152 | +Ensuring seamless integration with the full suite of pandas |
| 153 | +functionality may reveal edge cases that require careful |
| 154 | +handling. |
70 | 155 | 3. **Incompatibility**:
|
71 |
| -The existing object array can hold any python object, even `pd.NA` itself. The proposed nullable object array will be unable to hold `np.nan`, `None` or `pd.NaT` as these will be considered missing in the constructors and other conversions when following the existing API for the other nullable types. Users will not be able to round-trip between the legacy and nullable object dtypes. |
| 156 | +The existing object array can hold any Python object, even |
| 157 | +`pd.NA` itself. The proposed nullable object array will be |
| 158 | +unable to hold `np.nan`, `None`, or `pd.NaT`, as these will be |
| 159 | +considered missing in the constructors and other conversions |
| 160 | +when following the existing API for the other nullable |
| 161 | +types. Users will therefore not be able to round-trip losslessly |
| 162 | +between the legacy and nullable object dtypes. |
72 | 163 |
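The round-trip limitation can be made concrete with a small sketch. It assumes a conversion rule in which `None` and float NaN are treated as missing (`pd.NaT` is left out so the example needs only the standard library). The helper names `is_missing`, `to_nullable`, and `to_legacy` are hypothetical, as is the `NA` stand-in for `pd.NA`.

```python
import math


class _NAType:
    """Stand-in for pd.NA so the sketch runs without pandas (assumption)."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def is_missing(obj):
    # Assumed rule: None and float NaN both count as missing on construction.
    return obj is None or (isinstance(obj, float) and math.isnan(obj))


def to_nullable(objects):
    # Build (values, mask); the original missing marker is not preserved.
    mask = [is_missing(o) for o in objects]
    values = [None if m else o for o, m in zip(objects, mask)]
    return values, mask


def to_legacy(values, mask):
    # Converting back can only emit a single marker for every masked slot.
    return [NA if m else v for v, m in zip(values, mask)]


legacy = ["a", None, float("nan")]
values, mask = to_nullable(legacy)
round_tripped = to_legacy(values, mask)
```

After the round trip, the `None` and NaN entries are indistinguishable: both come back as the single missing marker.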
|
73 | 164 | ## Conclusion
|
74 |
| -Introducing a nullable object dtype in pandas will offer a clearer semantic for missing values and align the behavior of object arrays with other nullable types. This proposal is aimed at fostering discussion and soliciting community feedback to refine the design and implementation roadmap. |
| 165 | +Introducing a nullable object dtype in pandas will offer |
| 166 | +clearer semantics for missing values and align the behavior |
| 167 | +of object arrays with other nullable types. This proposal is |
| 168 | +aimed at fostering discussion and soliciting community |
| 169 | +feedback to refine the design and implementation roadmap. |
75 | 170 |
|
76 | 171 |
|
77 | 172 |
|
|