Implement DataFrame.value_counts #27350

Closed
wants to merge 20 commits into from
42 changes: 42 additions & 0 deletions pandas/core/frame.py
@@ -8387,6 +8387,48 @@ def isin(self, values):
self.columns,
)

    def value_counts(self):
        """
        The number of times each unique row appears in the DataFrame.

        Returns
        -------
        counts : Series

        See Also
        --------
        Series.value_counts: Equivalent method on Series.

Contributor

Series.value_counts has more options, dropna for example; these need to be implemented here as well.

Author

There's no option in groupby to keep rows containing a NaN; they are always dropped. How do I go about implementing that case?

Member

I would be OK with raising a NotImplementedError for that case

Author

Added. This changed the method pretty significantly. PTAL.

Author

The single-column case now works, but the code raises NotImplementedError for the multi-column case.
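
The behaviour discussed in this thread can be reproduced with a small sketch. The first part only uses existing pandas calls (Series.value_counts with dropna, and a default groupby). The helper frame_value_counts is a hypothetical stand-in written for illustration, not the code from this PR; it assumes the single-column case delegates to Series.value_counts and the multi-column case goes through groupby.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, np.nan])
counts_default = s.value_counts()               # NaN is excluded by default
counts_with_nan = s.value_counts(dropna=False)  # NaN gets its own entry

df = pd.DataFrame({"a": [1.0, 1.0, np.nan], "b": [2, 2, 3]})
# With default arguments, groupby drops the row whose key contains NaN.
group_sizes = df.groupby(["a", "b"]).size()


def frame_value_counts(frame, dropna=True):
    """Hypothetical helper (an assumption, not the PR's implementation)."""
    if frame.shape[1] == 1:
        # Single column: Series.value_counts already supports dropna.
        return frame.iloc[:, 0].value_counts(dropna=dropna)
    if not dropna:
        # Multi-column: keeping NaN rows is not handled in this sketch.
        raise NotImplementedError(
            "dropna=False is not supported for multiple columns"
        )
    sizes = frame.groupby(list(frame.columns)).size()
    return sizes.sort_values(ascending=False)
```

Raising for the unsupported combination keeps the single-column result consistent with Series.value_counts while making the gap explicit to callers.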


        Examples
        --------

        >>> df = pd.DataFrame({'num_legs': [2, 4, 4], 'num_wings': [2, 0, 0]},
        ...                   index=['falcon', 'dog', 'cat'])
        >>> df
                num_legs  num_wings
        falcon         2          2
        dog            4          0
        cat            4          0

        >>> df.value_counts()
        (4, 0)    2
        (2, 2)    1
        dtype: int64

        >>> df1col = df[['num_legs']]

Contributor

Is the second example showing how this works for a Series?

Author

>>> type(df[['num_legs']]) 
pandas.core.frame.DataFrame

        >>> df1col
                num_legs
        falcon         2
        dog            4
        cat            4

        >>> df1col.value_counts()
        (4,)    2
        (2,)    1
        dtype: int64
        """
        return self.apply(tuple, 1).value_counts()

    # ----------------------------------------------------------------------
    # Add plotting methods to DataFrame
    plot = CachedAccessor("plot", pandas.plotting.PlotAccessor)
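
For readers who want to try the one-line method body outside of pandas itself, here is a minimal standalone sketch. It applies the same expression to the docstring's example frame and, for comparison, counts rows with groupby as mentioned in the review thread. The as_dict helper is hypothetical and exists only to compare results without depending on the exact index type pandas returns.

```python
import pandas as pd

df = pd.DataFrame(
    {"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]},
    index=["falcon", "dog", "cat"],
)

# Same expression as the method body in the diff: turn each row into a
# tuple, then count how often each tuple occurs.
row_counts = df.apply(tuple, axis=1).value_counts()

# Alternative route: group by every column and take the group sizes.
group_counts = df.groupby(list(df.columns)).size()


def as_dict(counts):
    """Hypothetical helper: map each row tuple to its count."""
    return dict(zip(counts.index.tolist(), counts.tolist()))


print(as_dict(row_counts))
print(as_dict(group_counts))
```

Both routes count the same rows; the tuple-based one calls a Python function per row, so the groupby route is likely the faster of the two on large frames.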
10 changes: 10 additions & 0 deletions pandas/tests/indexes/test_base.py
@@ -2289,6 +2289,16 @@ def test_dt_conversion_preserves_name(self, dt_conv):
        index = pd.Index(["01:02:03", "01:02:04"], name="label")
        assert index.name == dt_conv(index).name

    def test_data_frame_value_counts(self):
        df = pd.DataFrame({'num_legs': [2, 4, 4], 'num_wings': [2, 0, 0]},
                          index=['falcon', 'dog', 'cat'])
        assert df.value_counts().equals(pd.Series(data=[2, 1],
                                                  index=[(4, 0), (2, 2)]))

        df_single_col = df[['num_legs']]
        assert df_single_col.value_counts().equals(pd.Series(
            data=[2, 1], index=[(4,), (2,)]))

    @pytest.mark.parametrize(
        "index,expected",
        [
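
The test above wraps Series.equals in a bare assert, which reports only True or False on failure. As a sketch of an alternative, the same check can be written with pandas.testing.assert_series_equal, which reports where two Series differ. The index normalisation to a MultiIndex is an assumption made here so the comparison does not depend on which index type value_counts returns; it is not part of the PR.

```python
import pandas as pd
import pandas.testing as tm

df = pd.DataFrame(
    {"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]},
    index=["falcon", "dog", "cat"],
)

# Count rows the same way the method body in this PR does.
result = df.apply(tuple, axis=1).value_counts()
# Normalise the index so both sides use the same index type (assumption).
result.index = pd.MultiIndex.from_tuples(result.index.tolist())

expected = pd.Series(
    [2, 1], index=pd.MultiIndex.from_tuples([(4, 0), (2, 2)])
)

# check_names=False ignores the Series and index names, which newer
# pandas versions may set on value_counts results.
tm.assert_series_equal(result, expected, check_names=False)
```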