DOC: Improve groupby().ngroup() explanation for missing groups #50049


Merged · 4 commits · Jan 3, 2023
37 changes: 29 additions & 8 deletions pandas/core/groupby/groupby.py
Expand Up @@ -3212,6 +3212,9 @@ def ngroup(self, ascending: bool = True):
would be seen when iterating over the groupby object, not the
order they are first observed.

If a group would be excluded (due to null keys) then that
group is labeled as np.nan. See examples below.
Member:

I think it would be simpler to understand if we just say something like: "Groups where the key is missing are skipped from the count, and their value will be NaN."

Member:

I'd prefer NA as opposed to "missing". There are various reasons a value can be NA; missing is only one of them. I'd also recommend using NA over NaN or null, as it matches dropna and pd.NA.

Member:

Fine by me. But in this case, if we say NA it may sound like pandas.NA, and if the value is None or NaN it will also be skipped. So "missing" seemed more appropriate to me.

Member:

Maybe "not available" instead of NA? It is more generic than "missing", and doesn't sound like we are talking about pandas.NA.

Member:

Sorry - that was poorly written. My main reason is to connect it with the dropna argument. I mentioned pd.NA as an indicator that we use the NA naming fairly consistently, e.g. dropna, pd.NA, isna, use_inf_as_na, fillna, na_filter. I think we should maintain that.

Contributor Author:

I adjusted it to "Groups with missing keys (where pd.isna() is True) will be labeled with NaN and will be skipped from the count." This is definitely the right direction; I'm still just not sure about the "missing" language.

I agree with

But in this case if we say NA it may sound like pandas.NA, and if the value is None or NaN it will also be skipped.

so I avoided saying simply "groups with NA keys...". I think my phrasing is maybe a little longer, but extremely precise and fairly easy to understand. Another option I would also be satisfied with is "Groups with NA keys (where pd.isna() is True)"
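For what it's worth, the "where pd.isna() is True" phrasing does cover all of the null-like key values mentioned in this thread; a quick sanity-check sketch:

```python
import numpy as np
import pandas as pd

# pd.isna() treats all of pandas' null-like values as missing, so
# "missing keys (where pd.isna() is True)" covers None, NaN, pd.NA,
# and pd.NaT alike -- any of them would make groupby skip the group.
for key in [None, np.nan, pd.NA, pd.NaT]:
    assert pd.isna(key)
```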

Member:

Thanks - this looks great to me.


Parameters
----------
ascending : bool, default True
Expand All @@ -3228,15 +3231,17 @@ def ngroup(self, ascending: bool = True):

Examples
--------
>>> df = pd.DataFrame({"A": list("aaabba")})
>>> df = pd.DataFrame()
Member:

Nit: could you combine the DataFrame construction into one line (construct the DataFrame from a dictionary)?

Member:

Agree with Matt; I also think the examples can be simplified. A single column containing the None value is enough - no need to show the example without None first and with it later.

Also, unrelated to your changes, but I'd remove df.groupby(["A", [1,1,2,3,2,1]]).ngroup(). This is a very specific case, not directly related to ngroup, and grouping with provided values is not even shown in the examples of groupby, so it makes less sense to show it here. I think it mostly adds confusion to this docstring.

And also, I think it's better for users if the examples are a bit more meaningful than random letters. Something like df = pd.DataFrame({'color': ['red', 'red', 'green', None, 'red', 'green']}) (or animals, jobs, whatever) would make the example a bit easier to understand. Just an idea - if you want to keep this PR focused only on the example you're trying to illustrate, that's surely fine.

Contributor Author:

I agree with all of these suggestions. fixup PR incoming.
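As a sketch of the reviewer's suggested simplification (the 'color' column is their hypothetical example; the labels below assume pandas' default sorted group ordering):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "red", "green", None, "red", "green"]})

# With the default dropna=True, the None key is skipped from the
# count and its rows are labeled NaN, which forces a float64 result.
print(df.groupby("color").ngroup())
# green -> 0, red -> 1, None rows -> NaN

# With dropna=False, the null group gets its own label (sorted last),
# and the result stays int64.
print(df.groupby("color", dropna=False).ngroup())
# green -> 0, red -> 1, null group -> 2
```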

>>> df["A"] = ["a", "a", "a", "b", "b", "a"]
>>> df["B"] = ["a", None, "a", "b", "b", "a"]
>>> df
A
0 a
1 a
2 a
3 b
4 b
5 a
A B
0 a a
1 a None
2 a a
3 b b
4 b b
5 a a
>>> df.groupby('A').ngroup()
0 0
1 0
Expand All @@ -3261,6 +3266,22 @@ def ngroup(self, ascending: bool = True):
4 2
5 0
dtype: int64
>>> df.groupby("B").ngroup()
0 0.0
1 NaN
2 0.0
3 1.0
4 1.0
5 0.0
dtype: float64
>>> df.groupby("B", dropna=False).ngroup()
0 0
1 2
2 0
3 1
4 1
5 0
dtype: int64
"""
with self._group_selection_context():
index = self._selected_obj.index
Expand Down