-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
DOC: Improve groupby().ngroup() explanation for missing groups #50049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
11ba535
9466d4d
66ae02e
e5d0075
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3212,6 +3212,9 @@ def ngroup(self, ascending: bool = True): | |
would be seen when iterating over the groupby object, not the | ||
order they are first observed. | ||
|
||
If a group would be excluded (due to null keys) then that | ||
group is labeled as np.nan. See examples below. | ||
|
||
Parameters | ||
---------- | ||
ascending : bool, default True | ||
|
@@ -3228,15 +3231,17 @@ def ngroup(self, ascending: bool = True): | |
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame({"A": list("aaabba")}) | ||
>>> df = pd.DataFrame() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: Could you combine the DataFrame construction in 1 line (construct the DataFrame from dictionary?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agree with Matt, also I think the examples can be simplified. I think having just a single column with the Also, unrelated to your changes, but I'd remove And also, I think it's better for users if the examples are a bit more meaningful than random letters. I think using something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with all of these suggestions. fixup PR incoming. |
||
>>> df["A"] = ["a", "a", "a", "b", "b", "a"] | ||
>>> df["B"] = ["a", None, "a", "b", "b", "a"] | ||
>>> df | ||
A | ||
0 a | ||
1 a | ||
2 a | ||
3 b | ||
4 b | ||
5 a | ||
A B | ||
0 a a | ||
1 a None | ||
2 a a | ||
3 b b | ||
4 b b | ||
5 a a | ||
>>> df.groupby('A').ngroup() | ||
0 0 | ||
1 0 | ||
|
@@ -3261,6 +3266,22 @@ def ngroup(self, ascending: bool = True): | |
4 2 | ||
5 0 | ||
dtype: int64 | ||
>>> df.groupby("B").ngroup() | ||
0 0.0 | ||
1 NaN | ||
2 0.0 | ||
3 1.0 | ||
4 1.0 | ||
5 0.0 | ||
dtype: float64 | ||
>>> df.groupby("B", dropna=False).ngroup() | ||
0 0 | ||
1 2 | ||
2 0 | ||
3 1 | ||
4 1 | ||
5 0 | ||
dtype: int64 | ||
""" | ||
with self._group_selection_context(): | ||
index = self._selected_obj.index | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could be simpler to understand if we just simply say something like
Groups where the key is missing are skipped from the count, and their value will be 'NaN'
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer NA as opposed to "missing". There are various reasons a value can be NA, missing is but one of them. Also recommend using NA over NaN or null as it matches
dropna
andpd.NA
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine by me. But in this case if we say
NA
it may sound likepandas.NA
, and if the value isNone
orNaN
it will also be skipped. So, missing seemed more appropriate to me.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe
not available
instead ofNA
is more generic than missing, and doesn't sound like we are talking aboutpandas.NA
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - that was poorly written. My main reason is to connect it with the
dropna
argument. I threw inpd.NA
as an indicator that we use NA with somewhat good consistency. E.g. dropna, pd.NA, isna, use_inf_as_na, fillna, na_filter. I think we should maintain this.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted to "Groups with missing keys (where
pd.isna()
is True) will be labeled withNaN
and will be skipped from the count." Definitely this is the right direction, I'm still just not sure about the "missing" languageI agree with
so I avoided saying simply "groups with NA keys...". I think my phrasing is maybe a little longer, but extremely precise and fairly easy to understand. Another option I would also be satisfied with is "Groups with NA keys (where
pd.isna()
is True)"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - this looks great to me.