Commit cef38e2

Merge branch 'master' of https://github.com/pandas-dev/pandas into ref-liboffsets
2 parents ea559e2 + 6f5614b commit cef38e2

67 files changed, with 1549 additions and 1013 deletions

doc/source/user_guide/groupby.rst

Lines changed: 27 additions & 0 deletions
@@ -199,6 +199,33 @@ For example, the groups created by ``groupby()`` below are in the order they app
    df3.groupby(['X']).get_group('B')
 
 
+.. _groupby.dropna:
+
+.. versionadded:: 1.1.0
+
+GroupBy dropna
+^^^^^^^^^^^^^^
+
+By default, ``NA`` values are excluded from group keys during the ``groupby`` operation.
+To include ``NA`` values in group keys, pass ``dropna=False``.
+
+.. ipython:: python
+
+    df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
+    df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
+
+    df_dropna
+
+.. ipython:: python
+
+    # Default `dropna` is set to True, which will exclude NaNs in keys
+    df_dropna.groupby(by=["b"], dropna=True).sum()
+
+    # In order to allow NaN in keys, set `dropna` to False
+    df_dropna.groupby(by=["b"], dropna=False).sum()
+
+The default setting of the ``dropna`` argument is ``True``, which means ``NA`` values are not included in group keys.
+
 
 .. _groupby.attributes:
 
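As a further illustration of the section added above, the following sketch groups on a column that contains missing string keys. The frame and its column names (`key`, `value`) are invented for this example, and it assumes pandas 1.1.0 or later, where the `dropna` keyword is available.

```python
import pandas as pd

# Hypothetical data: two of the five rows have a missing group key.
df = pd.DataFrame(
    {"key": ["x", None, "x", None, "y"], "value": [1, 2, 3, 4, 5]}
)

# Default behaviour: rows whose key is missing are left out of the result.
dropped = df.groupby("key", dropna=True)["value"].sum()

# dropna=False keeps the missing key as a group of its own.
kept = df.groupby("key", dropna=False)["value"].sum()

print(dropped)
print(kept)
```

With `dropna=False` the result gains one extra row whose index label is missing and whose value is the sum over the rows that had no key.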

doc/source/whatsnew/v1.1.0.rst

Lines changed: 38 additions & 0 deletions
@@ -36,6 +36,37 @@ For example:
     ser["2014"]
     ser.loc["May 2015"]
 
+
+.. _whatsnew_110.groupby_key:
+
+Allow NA in groupby key
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+With :ref:`groupby <groupby.dropna>`, we've added a ``dropna`` keyword to :meth:`DataFrame.groupby` and :meth:`Series.groupby` in order to
+allow ``NA`` values in group keys. Users can set ``dropna`` to ``False`` if they want to include
+``NA`` values in groupby keys. ``dropna`` defaults to ``True`` to keep backwards
+compatibility (:issue:`3729`).
+
+.. ipython:: python
+
+    df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
+    df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
+
+    df_dropna
+
+.. ipython:: python
+
+    # Default `dropna` is set to True, which will exclude NaNs in keys
+    df_dropna.groupby(by=["b"], dropna=True).sum()
+
+    # In order to allow NaN in keys, set `dropna` to False
+    df_dropna.groupby(by=["b"], dropna=False).sum()
+
+The default setting of the ``dropna`` argument is ``True``, which means ``NA`` values are not included in group keys.
+
+.. versionadded:: 1.1.0
+
+
 .. _whatsnew_110.key_sorting:
 
 Sorting with keys
@@ -563,6 +594,7 @@ Datetimelike
 - Bug in :meth:`DatetimeIndex.intersection` losing ``freq`` and timezone in some cases (:issue:`33604`)
 - Bug in :class:`DatetimeIndex` addition and subtraction with some types of :class:`DateOffset` objects incorrectly retaining an invalid ``freq`` attribute (:issue:`33779`)
 - Bug in :class:`DatetimeIndex` where setting the ``freq`` attribute on an index could silently change the ``freq`` attribute on another index viewing the same data (:issue:`33552`)
+- Bug in :meth:`DataFrame.min`/:meth:`DataFrame.max` not returning consistent results with :meth:`Series.min`/:meth:`Series.max` when called on objects initialized with empty :func:`pd.to_datetime`
 - Bug in :meth:`DatetimeIndex.intersection` and :meth:`TimedeltaIndex.intersection` with results not having the correct ``name`` attribute (:issue:`33904`)
 - Bug in :meth:`DatetimeArray.__setitem__`, :meth:`TimedeltaArray.__setitem__`, :meth:`PeriodArray.__setitem__` incorrectly allowing values with ``int64`` dtype to be silently cast (:issue:`33717`)
 
@@ -574,6 +606,9 @@ Timedelta
 - Timedeltas now understand ``µs`` as identifier for microsecond (:issue:`32899`)
 - :class:`Timedelta` string representation now includes nanoseconds, when nanoseconds are non-zero (:issue:`9309`)
 - Bug in comparing a :class:`Timedelta` object against a ``np.ndarray`` with ``timedelta64`` dtype incorrectly viewing all entries as unequal (:issue:`33441`)
+- Bug in :func:`timedelta_range` that produced an extra point on an edge case (:issue:`30353`, :issue:`33498`)
+- Bug in :meth:`DataFrame.resample` that produced an extra point on an edge case (:issue:`30353`, :issue:`13022`, :issue:`33498`)
+- Bug in :meth:`DataFrame.resample` that ignored the ``loffset`` argument when dealing with timedelta (:issue:`7687`, :issue:`33498`)
 
 Timezones
 ^^^^^^^^^
@@ -717,6 +752,7 @@ Groupby/resample/rolling
 - Bug in :meth:`DataFrameGroupBy.transform` produces incorrect result with transformation functions (:issue:`30918`)
 - Bug in :meth:`GroupBy.count` causes segmentation fault when grouped-by column contains NaNs (:issue:`32841`)
 - Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` produces inconsistent type when aggregating Boolean series (:issue:`32894`)
+- Bug in :meth:`DataFrameGroupBy.sum` and :meth:`SeriesGroupBy.sum` where a large negative number would be returned when the number of non-null values was below ``min_count`` for nullable integer dtypes (:issue:`32861`)
 - Bug in :meth:`SeriesGroupBy.quantile` raising on nullable integers (:issue:`33136`)
 - Bug in :meth:`SeriesGroupBy.first`, :meth:`SeriesGroupBy.last`, :meth:`SeriesGroupBy.min`, and :meth:`SeriesGroupBy.max` returning floats when applied to nullable Booleans (:issue:`33071`)
 - Bug in :meth:`DataFrameGroupBy.agg` with dictionary input losing ``ExtensionArray`` dtypes (:issue:`32194`)
@@ -747,6 +783,7 @@ Reshaping
 - Bug in :meth:`DataFrame.unstack` when MultiIndexed columns and MultiIndexed rows were used (:issue:`32624`, :issue:`24729` and :issue:`28306`)
 - Bug in :func:`concat` was not allowing for concatenation of ``DataFrame`` and ``Series`` with duplicate keys (:issue:`33654`)
 - Bug in :func:`cut` raising an error when non-unique labels are passed (:issue:`33141`)
+- Bug in :meth:`DataFrame.replace` casting columns to ``object`` dtype if items in ``to_replace`` are not in values (:issue:`32988`)
 
 
 Sparse
@@ -763,6 +800,7 @@ ExtensionArray
 - Fixed bug that caused :meth:`Series.__repr__()` to crash for extension types whose elements are multidimensional arrays (:issue:`33770`).
 - Fixed bug where :meth:`Series.update` would raise a ``ValueError`` for ``ExtensionArray`` dtypes with missing values (:issue:`33980`)
 - Fixed bug where :meth:`StringArray.memory_usage` was not implemented (:issue:`33963`)
+- Fixed bug where ``DataFrame(columns=.., dtype='string')`` would fail (:issue:`27953`, :issue:`33623`)
 
 
 Other

pandas/_libs/lib.pyx

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ from pandas._libs.tslibs.nattype cimport (
 from pandas._libs.tslibs.conversion cimport convert_to_tsobject
 from pandas._libs.tslibs.timedeltas cimport convert_to_timedelta64
 from pandas._libs.tslibs.timezones cimport get_timezone, tz_compare
-from pandas._libs.tslibs.period cimport is_period_object
+from pandas._libs.tslibs.base cimport is_period_object
 
 from pandas._libs.missing cimport (
     checknull,

pandas/_libs/missing.pxd

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ cpdef ndarray[uint8_t] isnaobj(ndarray arr)
 
 cdef bint is_null_datetime64(v)
 cdef bint is_null_timedelta64(v)
+cdef bint checknull_with_nat_and_na(object obj)
 
 cdef class C_NAType:
     pass

pandas/_libs/missing.pyx

Lines changed: 5 additions & 0 deletions
@@ -279,6 +279,11 @@ cdef inline bint is_null_timedelta64(v):
     return False
 
 
+cdef bint checknull_with_nat_and_na(object obj):
+    # See GH#32214
+    return checknull_with_nat(obj) or obj is C_NA
+
+
 # -----------------------------------------------------------------------------
 # Implementation of NA singleton
 
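The new helper layers an `NA` check on top of the datetime-specific null check, while the `nattype.pyx` change in this commit drops `C_NA` from `checknull_with_nat` itself. A minimal pure-Python sketch of that layering follows; the `_Sentinel` class and the `NaT`/`NA` objects here are stand-ins invented for illustration, not the real pandas singletons.

```python
import math


class _Sentinel:
    """Minimal stand-in for a singleton such as NaT or NA (illustration only)."""

    def __init__(self, name):
        self._name = name

    def __repr__(self):
        return self._name


NaT = _Sentinel("NaT")  # stand-in for the datetime-like missing value
NA = _Sentinel("NA")    # stand-in for the generic missing value (C_NA in the diff)


def checknull_with_nat(val):
    """Narrow check: None, a float NaN, or NaT."""
    return val is None or (isinstance(val, float) and math.isnan(val)) or val is NaT


def checknull_with_nat_and_na(val):
    """Wider check that additionally treats NA as missing."""
    return checknull_with_nat(val) or val is NA


for candidate in (None, float("nan"), NaT, NA, 0, "2020-01-01"):
    print(repr(candidate), checknull_with_nat(candidate), checknull_with_nat_and_na(candidate))
```

Only the `NA` row differs between the two columns, which is the distinction the callers in `tslib.pyx` rely on when they switch to the wider helper.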

pandas/_libs/tslib.pyx

Lines changed: 9 additions & 7 deletions
@@ -51,8 +51,7 @@ from pandas._libs.tslibs.conversion cimport (
     get_datetime64_nanos)
 
 from pandas._libs.tslibs.nattype import nat_strings
-from pandas._libs.tslibs.nattype cimport (
-    checknull_with_nat, NPY_NAT, c_NaT as NaT)
+from pandas._libs.tslibs.nattype cimport NPY_NAT, c_NaT as NaT
 
 from pandas._libs.tslibs.offsets cimport to_offset
 
@@ -64,6 +63,9 @@ from pandas._libs.tslibs.tzconversion cimport (
     tz_convert_utc_to_tzlocal,
 )
 
+# Note: this is the only non-tslibs intra-pandas dependency here
+from pandas._libs.missing cimport checknull_with_nat_and_na
+
 
 cdef inline object create_datetime_from_ts(
         int64_t value,
@@ -438,7 +440,7 @@ def array_with_unit_to_datetime(
     for i in range(n):
         val = values[i]
 
-        if checknull_with_nat(val):
+        if checknull_with_nat_and_na(val):
             iresult[i] = NPY_NAT
 
         elif is_integer_object(val) or is_float_object(val):
@@ -505,7 +507,7 @@ def array_with_unit_to_datetime(
     for i in range(n):
         val = values[i]
 
-        if checknull_with_nat(val):
+        if checknull_with_nat_and_na(val):
             oresult[i] = <object>NaT
         elif is_integer_object(val) or is_float_object(val):
 
@@ -602,7 +604,7 @@ cpdef array_to_datetime(
         val = values[i]
 
         try:
-            if checknull_with_nat(val):
+            if checknull_with_nat_and_na(val):
                 iresult[i] = NPY_NAT
 
             elif PyDateTime_Check(val):
@@ -812,7 +814,7 @@ cdef ignore_errors_out_of_bounds_fallback(ndarray[object] values):
         val = values[i]
 
         # set as nan except if its a NaT
-        if checknull_with_nat(val):
+        if checknull_with_nat_and_na(val):
             if isinstance(val, float):
                 oresult[i] = np.nan
             else:
@@ -874,7 +876,7 @@ cdef array_to_datetime_object(
     # 2) datetime strings, which we return as datetime.datetime
     for i in range(n):
         val = values[i]
-        if checknull_with_nat(val) or PyDateTime_Check(val):
+        if checknull_with_nat_and_na(val) or PyDateTime_Check(val):
             # GH 25978. No need to parse NaT-like or datetime-like vals
             oresult[i] = val
         elif isinstance(val, str):

pandas/_libs/tslibs/conversion.pyx

Lines changed: 6 additions & 5 deletions
@@ -13,7 +13,7 @@ from cpython.datetime cimport (datetime, time, tzinfo,
                                PyDateTime_IMPORT)
 PyDateTime_IMPORT
 
-from pandas._libs.tslibs.base cimport ABCTimestamp
+from pandas._libs.tslibs.base cimport ABCTimestamp, is_period_object
 
 from pandas._libs.tslibs.np_datetime cimport (
     check_dts_bounds, npy_datetimestruct, pandas_datetime_to_datetimestruct,
@@ -37,10 +37,11 @@ from pandas._libs.tslibs.nattype import nat_strings
 from pandas._libs.tslibs.nattype cimport (
     NPY_NAT, checknull_with_nat, c_NaT as NaT)
 
-from pandas._libs.tslibs.tzconversion import (
-    tz_localize_to_utc, tz_convert_single)
+from pandas._libs.tslibs.tzconversion import tz_localize_to_utc
 from pandas._libs.tslibs.tzconversion cimport (
-    _tz_convert_tzlocal_utc, _tz_convert_tzlocal_fromutc)
+    _tz_convert_tzlocal_utc, _tz_convert_tzlocal_fromutc,
+    tz_convert_single
+)
 
 # ----------------------------------------------------------------------
 # Constants
@@ -286,7 +287,7 @@ cdef convert_to_tsobject(object ts, object tz, object unit,
         # Keep the converter same as PyDateTime's
         ts = datetime.combine(ts, time())
         return convert_datetime_to_tsobject(ts, tz)
-    elif getattr(ts, '_typ', None) == 'period':
+    elif is_period_object(ts):
         raise ValueError("Cannot convert Period to Timestamp "
                          "unambiguously. Use to_timestamp")
     else:
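Several hunks in this commit replace string comparisons on a `_typ` attribute with helper calls such as `is_period_object`. The diff does not show the helper's body, so the sketch below assumes it amounts to an `isinstance` check against a shared base class. `PeriodBase`, `Period` and `LooksLikePeriod` are hypothetical names used only to contrast the two styles.

```python
class PeriodBase:
    """Hypothetical base class standing in for the one the helper checks."""


class Period(PeriodBase):
    _typ = "period"


class LooksLikePeriod:
    """Unrelated class that merely carries the same marker attribute."""

    _typ = "period"


def is_period_by_marker(obj):
    # Old style: attribute lookup followed by a string comparison.
    return getattr(obj, "_typ", None) == "period"


def is_period_object(obj):
    # Assumed new style: a plain isinstance check against the base class.
    return isinstance(obj, PeriodBase)


print(is_period_by_marker(Period()), is_period_object(Period()))
print(is_period_by_marker(LooksLikePeriod()), is_period_object(LooksLikePeriod()))
```

Under this assumption the type-based check rejects objects that only imitate the marker attribute, and it avoids a string comparison on every call.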

pandas/_libs/tslibs/frequencies.pyx

Lines changed: 4 additions & 4 deletions
@@ -3,7 +3,7 @@ import re
 cimport numpy as cnp
 cnp.import_array()
 
-from pandas._libs.tslibs.util cimport is_integer_object
+from pandas._libs.tslibs.util cimport is_integer_object, is_offset_object
 
 from pandas._libs.tslibs.ccalendar import MONTH_NUMBERS
 
@@ -153,7 +153,7 @@ cpdef get_freq_code(freqstr):
     >>> get_freq_code(('D', 3))
     (6000, 3)
     """
-    if getattr(freqstr, '_typ', None) == 'dateoffset':
+    if is_offset_object(freqstr):
         freqstr = (freqstr.rule_code, freqstr.n)
 
     if isinstance(freqstr, tuple):
@@ -451,8 +451,8 @@ cdef str _maybe_coerce_freq(code):
     code : string
     """
     assert code is not None
-    if getattr(code, '_typ', None) == 'dateoffset':
-        # i.e. isinstance(code, ABCDateOffset):
+    if is_offset_object(code):
+        # i.e. isinstance(code, DateOffset):
         code = code.rule_code
     return code.upper()
 

pandas/_libs/tslibs/nattype.pyx

Lines changed: 5 additions & 7 deletions
@@ -15,11 +15,10 @@ from cpython.datetime cimport (
     datetime,
     timedelta,
 )
+PyDateTime_IMPORT
 
 from cpython.version cimport PY_MINOR_VERSION
 
-PyDateTime_IMPORT
-
 import numpy as np
 cimport numpy as cnp
 from numpy cimport int64_t
@@ -30,8 +29,7 @@ from pandas._libs.tslibs.np_datetime cimport (
     get_timedelta64_value,
 )
 cimport pandas._libs.tslibs.util as util
-
-from pandas._libs.missing cimport C_NA
+from pandas._libs.tslibs.base cimport is_period_object
 
 
 # ----------------------------------------------------------------------
@@ -150,7 +148,7 @@ cdef class _NaT(datetime):
         elif util.is_offset_object(other):
             return c_NaT
 
-        elif util.is_integer_object(other) or util.is_period_object(other):
+        elif util.is_integer_object(other) or is_period_object(other):
             # For Period compat
             # TODO: the integer behavior is deprecated, remove it
             return c_NaT
@@ -186,7 +184,7 @@ cdef class _NaT(datetime):
         elif util.is_offset_object(other):
             return c_NaT
 
-        elif util.is_integer_object(other) or util.is_period_object(other):
+        elif util.is_integer_object(other) or is_period_object(other):
             # For Period compat
             # TODO: the integer behavior is deprecated, remove it
             return c_NaT
@@ -809,7 +807,7 @@ cdef inline bint checknull_with_nat(object val):
     """
     Utility to check if a value is a nat or not.
     """
-    return val is None or util.is_nan(val) or val is c_NaT or val is C_NA
+    return val is None or util.is_nan(val) or val is c_NaT
 
 
 cpdef bint is_null_datetimelike(object val, bint inat_is_null=True):

pandas/_libs/tslibs/offsets.pyx

Lines changed: 8 additions & 15 deletions
@@ -128,7 +128,7 @@ def apply_index_wraps(func):
     # not play nicely with cython class methods
     def wrapper(self, other):
 
-        is_index = getattr(other, "_typ", "") == "datetimeindex"
+        is_index = not util.is_array(other._data)
 
         # operate on DatetimeArray
         arr = other._data if is_index else other
@@ -168,7 +168,8 @@ def apply_wraps(func):
         elif isinstance(other, (np.datetime64, datetime, date)):
             other = Timestamp(other)
         else:
-            raise TypeError(other)
+            # This will end up returning NotImplemented back in __add__
+            raise ApplyTypeError
 
         tz = other.tzinfo
         nano = other.nanosecond
@@ -474,11 +475,6 @@ class _BaseOffset:
         return type(self)(n=1, normalize=self.normalize, **self.kwds)
 
     def __add__(self, other):
-        if getattr(other, "_typ", None) in ["datetimeindex", "periodindex",
-                                            "datetimearray", "periodarray",
-                                            "series", "period", "dataframe"]:
-            # defer to the other class's implementation
-            return other + self
         try:
             return self.apply(other)
         except ApplyTypeError:
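The comment added in `apply_wraps` says that raising `ApplyTypeError` ends up returning `NotImplemented` from `__add__`, which lets Python fall back to the other operand's reflected method instead of checking a hard-coded list of `_typ` strings. A rough sketch of that flow is below; `DayOffset` and `Container` are hypothetical classes, and this local `ApplyTypeError` is a stand-in rather than the pandas one.

```python
class ApplyTypeError(TypeError):
    """Stand-in sentinel: apply() does not support this operand type."""


class DayOffset:
    """Hypothetical offset that only knows how to shift integer day counts."""

    def __init__(self, n=1):
        self.n = n

    def apply(self, other):
        if isinstance(other, int):
            return other + self.n
        raise ApplyTypeError

    def __add__(self, other):
        try:
            return self.apply(other)
        except ApplyTypeError:
            # Hand control back to Python so it can try other.__radd__.
            return NotImplemented


class Container:
    """Hypothetical array-like that implements the reflected addition."""

    def __init__(self, values):
        self.values = values

    def __radd__(self, offset):
        return Container([offset + v for v in self.values])


shifted = DayOffset(2) + Container([1, 10])
print(shifted.values)

try:
    DayOffset(2) + "text"
except TypeError as exc:
    print("TypeError:", exc)
```

The offset class no longer needs to know which container types exist. Any operand that defines `__radd__` gets its turn, and unsupported operands still end in a regular `TypeError`.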
@@ -497,12 +493,12 @@ class _BaseOffset:
         return self.apply(other)
 
     def __mul__(self, other):
-        if hasattr(other, "_typ"):
-            return NotImplemented
         if util.is_array(other):
             return np.array([self * x for x in other])
-        return type(self)(n=other * self.n, normalize=self.normalize,
-                          **self.kwds)
+        elif is_integer_object(other):
+            return type(self)(n=other * self.n, normalize=self.normalize,
+                              **self.kwds)
+        return NotImplemented
 
     def __neg__(self):
         # Note: we are deferring directly to __mul__ instead of __rmul__, as
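The rewritten `__mul__` accepts arrays and integers and returns `NotImplemented` for everything else, instead of rejecting only objects that carry a `_typ` attribute. The following sketch shows the same shape in plain Python; `Tick` here is a hypothetical class, and a list stands in for the array case.

```python
class Tick:
    """Hypothetical offset-like class holding an integer multiple ``n``."""

    def __init__(self, n=1):
        self.n = n

    def __mul__(self, other):
        if isinstance(other, list):
            # Element-wise multiplication, mirroring the array branch.
            return [self * x for x in other]
        elif isinstance(other, int):
            return type(self)(n=other * self.n)
        # Anything else: let Python try the reflected method or raise TypeError.
        return NotImplemented

    def __repr__(self):
        return f"Tick(n={self.n})"


print(Tick(2) * 3)
print(Tick(2) * [1, 2])

try:
    Tick(2) * 1.5
except TypeError as exc:
    print("TypeError:", exc)
```

With an explicit integer branch, a float multiplier is refused rather than silently producing an object with a non-integer `n`.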
@@ -705,10 +701,7 @@ class BaseOffset(_BaseOffset):
         return self.__add__(other)
 
     def __rsub__(self, other):
-        if getattr(other, '_typ', None) in ['datetimeindex', 'series']:
-            # i.e. isinstance(other, (ABCDatetimeIndex, ABCSeries))
-            return other - self
-        return -self + other
+        return (-self).__add__(other)
 
 
 cdef class _Tick(ABCTick):
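The new `__rsub__` computes `other - self` as `(-self).__add__(other)`. Calling the method directly, rather than writing `-self + other`, means a `NotImplemented` result is returned from `__rsub__` itself and the interpreter decides what happens next. The sketch below is a simplified model under that reading, with a hypothetical `Offset` class that only supports integers.

```python
class Offset:
    """Hypothetical offset supporting addition to integers only."""

    def __init__(self, n=1):
        self.n = n

    def __neg__(self):
        return Offset(-self.n)

    def __add__(self, other):
        if isinstance(other, int):
            return other + self.n
        return NotImplemented

    def __radd__(self, other):
        return self.__add__(other)

    def __rsub__(self, other):
        # other - self is rewritten as (-self) + other, without re-entering
        # the binary operator machinery.
        return (-self).__add__(other)


print(10 - Offset(3))
print(Offset(3).__rsub__("text") is NotImplemented)

try:
    "text" - Offset(3)
except TypeError as exc:
    print("TypeError:", exc)
```

Supported operands behave as before, while unsupported ones surface as the interpreter's own `TypeError` for the `-` operator.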
