|
8 | 8 | np.set_printoptions(precision=4, suppress=True)
|
9 | 9 | pd.options.display.max_rows = 100
|
10 | 10 |
|
11 |
| -=============================== |
12 |
| - Internal Architecture Changes |
13 |
| -=============================== |
| 11 | +=================================== |
| 12 | + Internals: Data structure changes |
| 13 | +=================================== |
14 | 14 |
|
15 | 15 | Logical types and Physical Storage Decoupling
|
16 | 16 | =============================================
|
@@ -203,6 +203,85 @@ we've chosen for pandas, and elsewhere we can invoke pandas-specific code.
|
203 | 203 | A major concern here based on these ideas is **preserving NumPy
|
204 | 204 | interoperability**, so I'll examine this topic in some detail next.
|
205 | 205 |
|
| 206 | +Correspondence between logical and physical types |
| 207 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 208 | + |
| 209 | +* **Floating point numbers** |
| 210 | + |
| 211 | + - Logical: ``Float16/32/64`` |
| 212 | + - Physical: ``numpy.float16/32/64``, with ``NaN`` for null (for backwards |
| 213 | + compatibility) |
| 214 | + |
| 215 | +* **Signed Integers** |
| 216 | + |
| 217 | + - Logical: ``Int8/16/32/64`` |
| 218 | + - Physical: ``numpy.int8/16/32/64`` array plus nullness bitmap |
| 219 | + |
| 220 | +* **Unsigned Integers** |
| 221 | + |
| 222 | + - Logical: ``UInt8/16/32/64`` |
| 223 | + - Physical: ``numpy.uint8/16/32/64`` array plus nullness bitmap |
| 224 | + |
| 225 | +* **Boolean** |
| 226 | + |
| 227 | + - Logical: ``Boolean`` |
| 228 | + - Physical: ``np.bool_`` (one byte per value, like ``np.uint8``) array plus |
| 229 | + nullness bitmap. We may also explore bit storage (versus bytes). |
| 230 | + |
| 231 | +* **Categorical** |
| 232 | + |
| 233 | + - Logical: ``Categorical[T]``, where ``T`` is any other logical type |
| 234 | + - Physical: this type is a composition of an ``Int8`` through ``Int64`` |
| 235 | + (depending on the cardinality of the categories) plus the categories |
| 236 | + array. These have the same physical representation as |
| 237 | + |
| 238 | +* **String and Binary** |
| 239 | + |
| 240 | + - Logical: ``String`` and ``Binary`` |
| 241 | + - Physical: Dictionary-encoded representation for UTF-8 and general binary |
| 242 | + data as described in the :ref:`string section <strings>`. |
| 243 | + |
| 244 | +* **Timestamp** |
| 245 | + |
| 246 | + - Logical: ``Timestamp[unit]``, where unit is the resolution. Nanoseconds can |
| 247 | + continue to be the default unit for now. |
| 248 | + - Physical: ``numpy.int64``, with ``INT64_MIN`` as the null value. |
| 249 | + |
| 250 | +* **Timedelta** |
| 251 | + |
| 252 | + - Logical: ``Timedelta[unit]``, where unit is the resolution |
| 253 | + - Physical: ``numpy.int64``, with ``INT64_MIN`` as the null value. |
| 254 | + |
| 255 | +* **Period** |
| 256 | + |
| 257 | + - Logical: ``Period[unit]``, where unit is the resolution |
| 258 | + - Physical: ``numpy.int64``, with ``INT64_MIN`` as the null value. |
| 259 | + |
| 260 | +* **Interval** |
| 261 | + |
| 262 | + - Logical: ``Interval`` |
| 263 | + - Physical: two arrays of ``Timestamp[U]`` -- these may need to be forced to |
| 264 | + both be the same resolution |
| 265 | + |
| 266 | +* **Python objects** (catch-all for other data types) |
| 267 | + |
| 268 | + - Logical: ``Object`` |
| 269 | + - Physical: ``numpy.object_`` array, with None for null values (perhaps with |
| 270 | + ``np.NaN`` also for backwards compatibility) |
| 271 | + |
| 272 | +* **Complex numbers** |
| 273 | + |
| 274 | + - Logical: ``Complex64/128`` |
| 275 | + - Physical: ``numpy.complex64/128``, with ``NaN`` for null (for backwards |
| 276 | + compatibility) |
| 277 | + |
| 278 | +Some notes on these: |
| 279 | + |
| 280 | +- While a pandas (logical) type may map onto one or more physical |
| 281 | + representations, in general NumPy types will map directly onto a pandas |
| 282 | + type. Thus, existing code involving ``numpy.dtype``-like objects (such as |
| 283 | + ``'f8'`` or ``numpy.float64``) will continue to work. |
| 284 | + |
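The correspondence above can be spot-checked with plain NumPy. The following is only a sketch: the ``LOGICAL_TO_PHYSICAL`` dictionary is a hypothetical illustration and not a pandas API. It shows that the common dtype spellings resolve to the same physical type, and that ``INT64_MIN`` is the value NumPy itself stores for ``NaT``.

```python
import numpy as np

# Hypothetical lookup table, for illustration only (not part of pandas).
LOGICAL_TO_PHYSICAL = {
    "Float64": np.dtype("float64"),
    "Int64": np.dtype("int64"),
    "UInt8": np.dtype("uint8"),
    "Boolean": np.dtype("bool"),
    "Timestamp[ns]": np.dtype("int64"),
}

# The usual NumPy spellings all resolve to the same dtype.
f8_matches = (
    np.dtype("f8") == np.dtype(np.float64)
    and np.dtype("f8") == LOGICAL_TO_PHYSICAL["Float64"]
)

# NumPy's own NaT is stored as INT64_MIN, the sentinel listed for Timestamp.
nat = np.array(["NaT"], dtype="datetime64[ns]")
nat_as_int = nat.view("int64")[0]

# A NumPy boolean occupies one byte per value.
bool_itemsize = np.dtype("bool").itemsize

print(f8_matches, nat_as_int == np.iinfo(np.int64).min, bool_itemsize)
```

Storing nulls in a separate bitmap, as proposed for the integer and boolean types, avoids reserving a sentinel value such as ``INT64_MIN`` from the value range.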
206 | 285 | Preserving NumPy interoperability
|
207 | 286 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
208 | 287 |
|
@@ -318,7 +397,7 @@ bitmap** (which the user never sees). This has numerous benefits:
|
318 | 397 | Notably, this is the way that PostgreSQL handles null values. For example, we
|
319 | 398 | might have:
|
320 | 399 |
|
321 |
| -.. code-block:: |
| 400 | +.. code-block:: text |
322 | 401 |
|
323 | 402 | [0, 1, 2, NA, NA, 5, 6, NA]
|
324 | 403 |
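As a rough sketch of how the array above could be stored as a values buffer plus a bitmap, consider the NumPy code below. The bit order (least significant bit first), the convention that a set bit means "valid", and the helper name ``is_null`` are assumptions made for this illustration; the text here does not fix any of them.

```python
import numpy as np

# Values buffer: the slots that are NA hold arbitrary filler (0 here).
values = np.array([0, 1, 2, 0, 0, 5, 6, 0], dtype=np.int64)

# One flag per element: True means the value is present (an assumed convention).
valid = np.array([1, 1, 1, 0, 0, 1, 1, 0], dtype=bool)

# Pack the flags into bytes, one bit per element, least significant bit first.
bitmap = np.packbits(valid, bitorder="little")


def is_null(bitmap, i):
    """Return True if element i is null according to the packed bitmap."""
    bit = (int(bitmap[i // 8]) >> (i % 8)) & 1
    return bit == 0


# A null-aware sum only looks at the valid slots.
total = int(values[valid].sum())

print(bitmap, [is_null(bitmap, i) for i in range(len(values))], total)
```

The eight validity flags fit into a single byte, compared with eight bytes for a boolean mask array of the same length.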
|
|