
Commit 77fafd0

Merge pull request #161 from pandas-dev/master
Sync Fork from Upstream Repo
2 parents: 93d69e9 + 1367cac


68 files changed (+971, −538 lines)

.pre-commit-config.yaml
Lines changed: 1 addition & 1 deletion

@@ -36,7 +36,7 @@ repos:
     rev: 3.9.0
     hooks:
     - id: flake8
-      additional_dependencies: [flake8-comprehensions>=3.1.0]
+      additional_dependencies: [flake8-comprehensions>=3.1.0, flake8-bugbear>=21.3.2]
     - id: flake8
       name: flake8 (cython)
       types: [cython]
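For context, flake8-bugbear flags likely bugs that plain flake8 misses. A classic one is B006, the mutable default argument; a minimal illustration (hypothetical functions, not from this commit):

```python
# B006: a mutable default is created once at definition time and then
# shared across every call -- flake8-bugbear would flag this signature.
def append_bad(item, bucket=[]):
    bucket.append(item)
    return bucket

# The conventional fix: default to None and allocate a fresh list per call.
def append_good(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

Calling `append_bad` twice without a `bucket` argument silently accumulates state in the shared default list, which is exactly the class of latent bug the new hook is meant to catch.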

doc/source/ecosystem.rst
Lines changed: 12 additions & 11 deletions

@@ -475,7 +475,7 @@ arrays can be stored inside pandas' Series and DataFrame.

 `Pandas-Genomics`_
 ~~~~~~~~~~~~~~~~~~

-Pandas-Genomics provides extension types and extension arrays for working with genomics data
+Pandas-Genomics provides extension types, extension arrays, and extension accessors for working with genomics data

 `Pint-Pandas`_
 ~~~~~~~~~~~~~~

@@ -502,16 +502,17 @@ A directory of projects providing
 :ref:`extension accessors <extending.register-accessors>`. This is for users to
 discover new accessors and for library authors to coordinate on the namespace.

-=============== ============ ==================================== ===============================================================
-Library         Accessor     Classes                              Description
-=============== ============ ==================================== ===============================================================
-`cyberpandas`_  ``ip``       ``Series``                           Provides common operations for working with IP addresses.
-`pdvega`_       ``vgplot``   ``Series``, ``DataFrame``            Provides plotting functions from the Altair_ library.
-`pandas_path`_  ``path``     ``Index``, ``Series``                Provides `pathlib.Path`_ functions for Series.
-`pint-pandas`_  ``pint``     ``Series``, ``DataFrame``            Provides units support for numeric Series and DataFrames.
-`composeml`_    ``slice``    ``DataFrame``                        Provides a generator for enhanced data slicing.
-`datatest`_     ``validate`` ``Series``, ``DataFrame``, ``Index`` Provides validation, differences, and acceptance managers.
-=============== ============ ==================================== ===============================================================
+================== ============ ==================================== ===============================================================================
+Library            Accessor     Classes                              Description
+================== ============ ==================================== ===============================================================================
+`cyberpandas`_     ``ip``       ``Series``                           Provides common operations for working with IP addresses.
+`pdvega`_          ``vgplot``   ``Series``, ``DataFrame``            Provides plotting functions from the Altair_ library.
+`pandas-genomics`_ ``genomics`` ``Series``, ``DataFrame``            Provides common operations for quality control and analysis of genomics data
+`pandas_path`_     ``path``     ``Index``, ``Series``                Provides `pathlib.Path`_ functions for Series.
+`pint-pandas`_     ``pint``     ``Series``, ``DataFrame``            Provides units support for numeric Series and DataFrames.
+`composeml`_       ``slice``    ``DataFrame``                        Provides a generator for enhanced data slicing.
+`datatest`_        ``validate`` ``Series``, ``DataFrame``, ``Index`` Provides validation, differences, and acceptance managers.
+================== ============ ==================================== ===============================================================================

 .. _cyberpandas: https://cyberpandas.readthedocs.io/en/latest
 .. _pdvega: https://altair-viz.github.io/pdvega/
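The accessors in this table are registered through pandas' extension-accessor API. The pattern can be sketched in plain Python as a decorator that records a namespace class in a registry (a simplified stand-in for `pandas.api.extensions.register_series_accessor`, with hypothetical names):

```python
def register_accessor(name, registry):
    """Decorator: record an accessor class under ``name`` (sketch only)."""
    def decorator(cls):
        registry[name] = cls
        return cls
    return decorator

# Registry standing in for the namespace a Series would expose.
SERIES_ACCESSORS = {}

@register_accessor("path", SERIES_ACCESSORS)
class PathAccessor:
    """Toy accessor: pathlib-flavoured helpers over string data,
    loosely in the spirit of the pandas_path entry above."""
    def __init__(self, data):
        self._data = data

    def suffixes(self):
        # Last dot-separated component of each element
        return [s.rsplit(".", 1)[-1] for s in self._data]
```

With the real API, `@register_series_accessor("path")` would make the namespace available as `series.path`; here the registry lookup `SERIES_ACCESSORS["path"](data)` plays that role.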

doc/source/whatsnew/v1.2.4.rst
Lines changed: 1 addition & 0 deletions

@@ -17,6 +17,7 @@ Fixed regressions

 - Fixed regression in :meth:`DataFrame.sum` when ``min_count`` greater than the :class:`DataFrame` shape was passed resulted in a ``ValueError`` (:issue:`39738`)
 - Fixed regression in :meth:`DataFrame.to_json` raising ``AttributeError`` when run on PyPy (:issue:`39837`)
+- Fixed regression in (in)equality comparison of ``pd.NaT`` with a non-datetimelike numpy array returning a scalar instead of an array (:issue:`40722`)
 - Fixed regression in :meth:`DataFrame.where` not returning a copy in the case of an all True condition (:issue:`39595`)
 - Fixed regression in :meth:`DataFrame.replace` raising ``IndexError`` when ``regex`` was a multi-key dictionary (:issue:`39338`)
 -
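The ``pd.NaT`` entry above concerns comparison broadcasting: (in)equality against an array should yield an elementwise boolean result, not a single scalar. A pure-Python sketch of the intended behaviour (hypothetical class, not pandas' actual ``NaTType``):

```python
class NaTSketch:
    """Missing-timestamp singleton: never equal to any scalar, and
    comparisons against a sequence broadcast elementwise instead of
    collapsing to one scalar (the regression fixed above)."""
    def __eq__(self, other):
        if isinstance(other, (list, tuple)):
            return [self == item for item in other]  # elementwise
        return False  # NaT never equals a non-NaT scalar
    def __ne__(self, other):
        eq = self.__eq__(other)
        return [not v for v in eq] if isinstance(eq, list) else True

NaT = NaTSketch()
```

In pandas the broadcast target is a numpy array rather than a list, but the shape of the fix is the same: the comparison must return one boolean per element.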

doc/source/whatsnew/v1.3.0.rst
Lines changed: 1 addition & 0 deletions

@@ -561,6 +561,7 @@ Numeric

 - Bug in :func:`select_dtypes` different behavior between Windows and Linux with ``include="int"`` (:issue:`36569`)
 - Bug in :meth:`DataFrame.apply` and :meth:`DataFrame.agg` when passed argument ``func="size"`` would operate on the entire ``DataFrame`` instead of rows or columns (:issue:`39934`)
 - Bug in :meth:`DataFrame.transform` would raise ``SpecificationError`` when passed a dictionary and columns were missing; will now raise a ``KeyError`` instead (:issue:`40004`)
+- Bug in :meth:`DataFrameGroupBy.rank` giving incorrect results with ``pct=True`` and equal values between consecutive groups (:issue:`40518`)
 -

 Conversion
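The :meth:`DataFrameGroupBy.rank` fix above is about the percentile denominator being taken strictly per group, even when a value run straddles a group boundary. A pure-Python sketch of average-method percentile ranking within groups (illustrative only, not the Cython implementation):

```python
from collections import defaultdict

def group_pct_rank(values, labels):
    """Average rank of each value within its own group, divided by that
    group's size -- never by a neighbouring group's size."""
    groups = defaultdict(list)
    for idx, (val, lab) in enumerate(zip(values, labels)):
        groups[lab].append((val, idx))
    out = [0.0] * len(values)
    for members in groups.values():
        ordered = sorted(v for v, _ in members)
        n = len(ordered)
        for val, idx in members:
            first = ordered.index(val) + 1     # 1-based first position
            ties = ordered.count(val)
            avg_rank = first + (ties - 1) / 2  # average over the tie run
            out[idx] = avg_rank / n            # denominator is per-group
    return out
```

For example, `group_pct_rank([1, 1, 2, 2], [0, 0, 1, 1])` gives 0.75 for every element: each pair ties at average rank 1.5 out of a group of 2, and the equal runs in adjacent groups do not bleed into each other.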

environment.yml
Lines changed: 1 addition & 0 deletions

@@ -21,6 +21,7 @@ dependencies:
   - black=20.8b1
   - cpplint
   - flake8
+  - flake8-bugbear>=21.3.2  # used by flake8, find likely bugs
   - flake8-comprehensions>=3.1.0  # used by flake8, linting of unnecessary comprehensions
   - isort>=5.2.1  # check that imports are in the right order
   - mypy=0.812

pandas/_libs/algos.pyx
Lines changed: 47 additions & 27 deletions

@@ -947,12 +947,14 @@ def rank_1d(
         TiebreakEnumType tiebreak
         Py_ssize_t i, j, N, grp_start=0, dups=0, sum_ranks=0
         Py_ssize_t grp_vals_seen=1, grp_na_count=0
-        ndarray[int64_t, ndim=1] lexsort_indexer
-        ndarray[float64_t, ndim=1] grp_sizes, out
+        ndarray[int64_t, ndim=1] grp_sizes
+        ndarray[intp_t, ndim=1] lexsort_indexer
+        ndarray[float64_t, ndim=1] out
         ndarray[rank_t, ndim=1] masked_vals
         ndarray[uint8_t, ndim=1] mask
         bint keep_na, at_end, next_val_diff, check_labels, group_changed
         rank_t nan_fill_val
+        int64_t grp_size

     tiebreak = tiebreakers[ties_method]
     if tiebreak == TIEBREAK_FIRST:

@@ -965,7 +967,7 @@ def rank_1d(
     # TODO Cython 3.0: cast won't be necessary (#2992)
     assert <Py_ssize_t>len(labels) == N
     out = np.empty(N)
-    grp_sizes = np.ones(N)
+    grp_sizes = np.ones(N, dtype=np.int64)

     # If all 0 labels, can short-circuit later label
     # comparisons

@@ -1022,7 +1024,7 @@ def rank_1d(
     # each label corresponds to a different group value,
     # the mask helps you differentiate missing values before
     # performing sort on the actual values
-    lexsort_indexer = np.lexsort(order).astype(np.int64, copy=False)
+    lexsort_indexer = np.lexsort(order).astype(np.intp, copy=False)

     if not ascending:
         lexsort_indexer = lexsort_indexer[::-1]

@@ -1093,13 +1095,15 @@ def rank_1d(
                 for j in range(i - dups + 1, i + 1):
                     out[lexsort_indexer[j]] = grp_vals_seen

-                # Look forward to the next value (using the sorting in lexsort_indexer)
-                # if the value does not equal the current value then we need to
-                # reset the dups and sum_ranks, knowing that a new value is
-                # coming up. The conditional also needs to handle nan equality
-                # and the end of iteration
-                if next_val_diff or (mask[lexsort_indexer[i]]
-                                     ^ mask[lexsort_indexer[i+1]]):
+                # Look forward to the next value (using the sorting in
+                # lexsort_indexer). If the value does not equal the current
+                # value then we need to reset the dups and sum_ranks, knowing
+                # that a new value is coming up. The conditional also needs
+                # to handle nan equality and the end of iteration. If group
+                # changes we do not record seeing a new value in the group
+                if not group_changed and (next_val_diff or
+                                          (mask[lexsort_indexer[i]]
+                                           ^ mask[lexsort_indexer[i+1]])):
                     dups = sum_ranks = 0
                     grp_vals_seen += 1

@@ -1110,14 +1114,21 @@ def rank_1d(
                 # group encountered (used by pct calculations later). Also be
                 # sure to reset any of the items helping to calculate dups
                 if group_changed:
+
+                    # If not dense tiebreak, group size used to compute
+                    # percentile will be # of non-null elements in group
                     if tiebreak != TIEBREAK_DENSE:
-                        for j in range(grp_start, i + 1):
-                            grp_sizes[lexsort_indexer[j]] = \
-                                (i - grp_start + 1 - grp_na_count)
+                        grp_size = i - grp_start + 1 - grp_na_count
+
+                    # Otherwise, it will be the number of distinct values
+                    # in the group, subtracting 1 if NaNs are present
+                    # since that is a distinct value we shouldn't count
                     else:
-                        for j in range(grp_start, i + 1):
-                            grp_sizes[lexsort_indexer[j]] = \
-                                (grp_vals_seen - 1 - (grp_na_count > 0))
+                        grp_size = grp_vals_seen - (grp_na_count > 0)
+
+                    for j in range(grp_start, i + 1):
+                        grp_sizes[lexsort_indexer[j]] = grp_size

                     dups = sum_ranks = 0
                     grp_na_count = 0
                     grp_start = i + 1

@@ -1184,12 +1195,14 @@ def rank_1d(
                 out[lexsort_indexer[j]] = grp_vals_seen

                 # Look forward to the next value (using the sorting in
-                # lexsort_indexer) if the value does not equal the current
+                # lexsort_indexer). If the value does not equal the current
                 # value then we need to reset the dups and sum_ranks, knowing
                 # that a new value is coming up. The conditional also needs
-                # to handle nan equality and the end of iteration
-                if next_val_diff or (mask[lexsort_indexer[i]]
-                                     ^ mask[lexsort_indexer[i+1]]):
+                # to handle nan equality and the end of iteration. If group
+                # changes we do not record seeing a new value in the group
+                if not group_changed and (next_val_diff or
+                                          (mask[lexsort_indexer[i]]
+                                           ^ mask[lexsort_indexer[i+1]])):
                     dups = sum_ranks = 0
                     grp_vals_seen += 1

@@ -1200,14 +1213,21 @@ def rank_1d(
                 # group encountered (used by pct calculations later). Also be
                 # sure to reset any of the items helping to calculate dups
                 if group_changed:
+
+                    # If not dense tiebreak, group size used to compute
+                    # percentile will be # of non-null elements in group
                     if tiebreak != TIEBREAK_DENSE:
-                        for j in range(grp_start, i + 1):
-                            grp_sizes[lexsort_indexer[j]] = \
-                                (i - grp_start + 1 - grp_na_count)
+                        grp_size = i - grp_start + 1 - grp_na_count
+
+                    # Otherwise, it will be the number of distinct values
+                    # in the group, subtracting 1 if NaNs are present
+                    # since that is a distinct value we shouldn't count
                     else:
-                        for j in range(grp_start, i + 1):
-                            grp_sizes[lexsort_indexer[j]] = \
-                                (grp_vals_seen - 1 - (grp_na_count > 0))
+                        grp_size = grp_vals_seen - (grp_na_count > 0)
+
+                    for j in range(grp_start, i + 1):
+                        grp_sizes[lexsort_indexer[j]] = grp_size

                     dups = sum_ranks = 0
                     grp_na_count = 0
                     grp_start = i + 1
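The two branches above reduce to a single rule for the percentile denominator. A pure-Python restatement of that rule (sketch only, with descriptive names standing in for the Cython variables):

```python
def pct_denominator(n_in_group, n_na, n_distinct, dense_tiebreak):
    """Mirror of grp_size above: non-dense tiebreaks divide by the number
    of non-null elements in the group; dense divides by the number of
    distinct values, minus one when NaN is present, since NaN counts as
    a distinct value that should not be included."""
    if dense_tiebreak:
        return n_distinct - (1 if n_na > 0 else 0)
    return n_in_group - n_na
```

The refactor in the diff also hoists this computation out of the inner loop: the size is computed once per group and then written to every member, instead of being recomputed inside the assignment loop.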

pandas/_libs/groupby.pyi
Lines changed: 168 additions & 0 deletions (new file)

from typing import Literal

import numpy as np

def group_median_float64(
    out: np.ndarray,  # ndarray[float64_t, ndim=2]
    counts: np.ndarray,  # ndarray[int64_t]
    values: np.ndarray,  # ndarray[float64_t, ndim=2]
    labels: np.ndarray,  # ndarray[int64_t]
    min_count: int = ...,  # Py_ssize_t
) -> None: ...

def group_cumprod_float64(
    out: np.ndarray,  # float64_t[:, ::1]
    values: np.ndarray,  # const float64_t[:, :]
    labels: np.ndarray,  # const int64_t[:]
    ngroups: int,
    is_datetimelike: bool,
    skipna: bool = ...,
) -> None: ...

def group_cumsum(
    out: np.ndarray,  # numeric[:, ::1]
    values: np.ndarray,  # ndarray[numeric, ndim=2]
    labels: np.ndarray,  # const int64_t[:]
    ngroups: int,
    is_datetimelike: bool,
    skipna: bool = ...,
) -> None: ...

def group_shift_indexer(
    out: np.ndarray,  # int64_t[::1]
    labels: np.ndarray,  # const int64_t[:]
    ngroups: int,
    periods: int,
) -> None: ...

def group_fillna_indexer(
    out: np.ndarray,  # ndarray[int64_t]
    labels: np.ndarray,  # ndarray[int64_t]
    mask: np.ndarray,  # ndarray[uint8_t]
    direction: Literal["ffill", "bfill"],
    limit: int,  # int64_t
    dropna: bool,
) -> None: ...

def group_any_all(
    out: np.ndarray,  # uint8_t[::1]
    values: np.ndarray,  # const uint8_t[::1]
    labels: np.ndarray,  # const int64_t[:]
    mask: np.ndarray,  # const uint8_t[::1]
    val_test: Literal["any", "all"],
    skipna: bool,
) -> None: ...

def group_add(
    out: np.ndarray,  # complexfloating_t[:, ::1]
    counts: np.ndarray,  # int64_t[::1]
    values: np.ndarray,  # ndarray[complexfloating_t, ndim=2]
    labels: np.ndarray,  # const intp_t[:]
    min_count: int = ...,
) -> None: ...

def group_prod(
    out: np.ndarray,  # floating[:, ::1]
    counts: np.ndarray,  # int64_t[::1]
    values: np.ndarray,  # ndarray[floating, ndim=2]
    labels: np.ndarray,  # const intp_t[:]
    min_count: int = ...,
) -> None: ...

def group_var(
    out: np.ndarray,  # floating[:, ::1]
    counts: np.ndarray,  # int64_t[::1]
    values: np.ndarray,  # ndarray[floating, ndim=2]
    labels: np.ndarray,  # const intp_t[:]
    min_count: int = ...,  # Py_ssize_t
    ddof: int = ...,  # int64_t
) -> None: ...

def group_mean(
    out: np.ndarray,  # floating[:, ::1]
    counts: np.ndarray,  # int64_t[::1]
    values: np.ndarray,  # ndarray[floating, ndim=2]
    labels: np.ndarray,  # const intp_t[:]
    min_count: int = ...,
) -> None: ...

def group_ohlc(
    out: np.ndarray,  # floating[:, ::1]
    counts: np.ndarray,  # int64_t[::1]
    values: np.ndarray,  # ndarray[floating, ndim=2]
    labels: np.ndarray,  # const intp_t[:]
    min_count: int = ...,
) -> None: ...

def group_quantile(
    out: np.ndarray,  # ndarray[float64_t]
    values: np.ndarray,  # ndarray[numeric, ndim=1]
    labels: np.ndarray,  # ndarray[int64_t]
    mask: np.ndarray,  # ndarray[uint8_t]
    q: float,  # float64_t
    interpolation: Literal["linear", "lower", "higher", "nearest", "midpoint"],
) -> None: ...

def group_last(
    out: np.ndarray,  # rank_t[:, ::1]
    counts: np.ndarray,  # int64_t[::1]
    values: np.ndarray,  # ndarray[rank_t, ndim=2]
    labels: np.ndarray,  # const int64_t[:]
    min_count: int = ...,  # Py_ssize_t
) -> None: ...

def group_nth(
    out: np.ndarray,  # rank_t[:, ::1]
    counts: np.ndarray,  # int64_t[::1]
    values: np.ndarray,  # ndarray[rank_t, ndim=2]
    labels: np.ndarray,  # const int64_t[:]
    min_count: int = ...,  # int64_t
    rank: int = ...,  # int64_t
) -> None: ...

def group_rank(
    out: np.ndarray,  # float64_t[:, ::1]
    values: np.ndarray,  # ndarray[rank_t, ndim=2]
    labels: np.ndarray,  # const int64_t[:]
    ngroups: int,
    is_datetimelike: bool,
    ties_method: Literal["average", "min", "max", "first", "dense"] = ...,
    ascending: bool = ...,
    pct: bool = ...,
    na_option: Literal["keep", "top", "bottom"] = ...,
) -> None: ...

def group_max(
    out: np.ndarray,  # groupby_t[:, ::1]
    counts: np.ndarray,  # int64_t[::1]
    values: np.ndarray,  # ndarray[groupby_t, ndim=2]
    labels: np.ndarray,  # const int64_t[:]
    min_count: int = ...,
) -> None: ...

def group_min(
    out: np.ndarray,  # groupby_t[:, ::1]
    counts: np.ndarray,  # int64_t[::1]
    values: np.ndarray,  # ndarray[groupby_t, ndim=2]
    labels: np.ndarray,  # const int64_t[:]
    min_count: int = ...,
) -> None: ...

def group_cummin(
    out: np.ndarray,  # groupby_t[:, ::1]
    values: np.ndarray,  # ndarray[groupby_t, ndim=2]
    labels: np.ndarray,  # const int64_t[:]
    ngroups: int,
    is_datetimelike: bool,
) -> None: ...

def group_cummax(
    out: np.ndarray,  # groupby_t[:, ::1]
    values: np.ndarray,  # ndarray[groupby_t, ndim=2]
    labels: np.ndarray,  # const int64_t[:]
    ngroups: int,
    is_datetimelike: bool,
) -> None: ...
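These stubs lean on `typing.Literal` to constrain string options at type-check time (e.g. the `direction` and `ties_method` parameters). A small runnable illustration of the same pattern (hypothetical function, not part of the stub file):

```python
from typing import Literal

Direction = Literal["ffill", "bfill"]

def scan_step(direction: Direction) -> int:
    """Map a fill direction to a scan step: forward fill walks the data
    with step +1, backward fill with step -1. A type checker rejects any
    argument other than the two listed literals."""
    if direction == "ffill":
        return 1
    return -1
```

At runtime `scan_step("pad")` would silently return -1, but mypy flags it before the code ever runs, which is the whole point of spelling the options out as `Literal` in the `.pyi` file.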
