DOC Trying to improve Group by split-apply-combine guide #51916
Changes from 2 commits
586a520
dea035c
4763d8f
ea29699
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,15 +6,15 @@ | |
Group by: split-apply-combine | ||
***************************** | ||
|
||
By "group by" we are referring to a process involving one or more of the following | ||
By "group by" we are referring to a process involving one or several of the following | ||
steps: | ||
|
||
* **Splitting** the data into groups based on some criteria. | ||
* **Applying** a function to each group independently. | ||
* **Combining** the results into a data structure. | ||
|
||
Out of these, the split step is the most straightforward. In fact, in many | ||
situations we may wish to split the data set into groups and do something with | ||
cases we may wish to split the data set into groups and do something with | ||
DeaMariaLeon marked this conversation as resolved.
|
||
those groups. In the apply step, we might wish to do one of the | ||
following: | ||
|
||
|
@@ -31,29 +31,29 @@ following: | |
* Filling NAs within groups with a value derived from each group. | ||
|
||
* **Filtration**: discard some groups, according to a group-wise computation | ||
that evaluates True or False. Some examples: | ||
that evaluates as True or False. Some examples: | ||
DeaMariaLeon marked this conversation as resolved.
|
||
|
||
* Discard data that belongs to groups with only a few members. | ||
* Discard data that belong to groups with only a few members. | ||
* Filter out data based on the group sum or mean. | ||
|
||
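As a rough illustration of the filtration bullet above, the following sketch discards groups with only a few members. The frame, the column names, and the threshold of two rows are made up for the example and do not come from the guide.

```python
import pandas as pd

# Hypothetical data: group "a" has three rows, group "b" has one.
df = pd.DataFrame({"A": ["a", "a", "a", "b"], "C": [1, 2, 3, 4]})

# Keep only the rows whose group has at least two members.
# The callable receives each group as a DataFrame and returns True or False.
kept = df.groupby("A").filter(lambda group: len(group) >= 2)

print(kept)
```

The result keeps the original rows of the surviving groups rather than one row per group, which is what distinguishes filtration from aggregation.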
Many of these operations are defined on GroupBy objects. These operations are similar | ||
to the :ref:`aggregating API <basics.aggregate>`, :ref:`window API <window.overview>`, | ||
and :ref:`resample API <timeseries.aggregate>`. | ||
to those of the :ref:`aggregating API <basics.aggregate>`, | ||
:ref:`window API <window.overview>`, and :ref:`resample API <timeseries.aggregate>`. | ||
|
||
It is possible that a given operation does not fall into one of these categories or | ||
is some combination of them. In such a case, it may be possible to compute the | ||
operation using GroupBy's ``apply`` method. This method will examine the results of the | ||
apply step and try to return a sensibly combined result if it doesn't fit into either | ||
of the above two categories. | ||
splitting step and try to return a sensibly combined result if it doesn't fit into either | ||
DeaMariaLeon marked this conversation as resolved.
|
||
of the above three categories. | ||
|
||
.. note:: | ||
|
||
An operation that is split into multiple steps using built-in GroupBy operations | ||
will be more efficient than using the ``apply`` method with a user-defined Python | ||
An operation that is split into multiple steps using built-in GroupBy operations, | ||
rhshadrach marked this conversation as resolved.
|
||
will be more efficient than one using the ``apply`` method with a user-defined Python | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is something incorrect with leaving "one" out? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the options are either: Or: "Splitting into multiple steps using built-in GroupBy operations, will be more efficient than using the apply method with a user-defined Python function." There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't notice the comma added here - I believe that is incorrect. These are not independent clauses. In your second option above, I believe you're missing a noun: "Splitting an operation into multiple groups...". I see no reason to prefer one version over the other and because of that I think this should be left as is - but let me know if you think there is. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK thanks |
||
function. | ||
|
||
|
||
Since the set of object instance methods on pandas data structures are generally | ||
Since the set of object instance methods on pandas data structures is generally | ||
rich and expressive, we often simply want to invoke, say, a DataFrame function | ||
on each group. The name GroupBy should be quite familiar to those who have used | ||
a SQL-based tool (or ``itertools``), in which you can write code like: | ||
|
@@ -65,7 +65,7 @@ a SQL-based tool (or ``itertools``), in which you can write code like: | |
GROUP BY Column1, Column2 | ||
|
||
We aim to make operations like this natural and easy to express using | ||
pandas. We'll address each area of GroupBy functionality then provide some | ||
pandas. We'll go over each area of GroupBy functionalities, then provide some | ||
rhshadrach marked this conversation as resolved.
|
||
non-trivial examples / use cases. | ||
|
||
See the :ref:`cookbook<cookbook.grouping>` for some advanced strategies. | ||
|
@@ -75,9 +75,9 @@ See the :ref:`cookbook<cookbook.grouping>` for some advanced strategies. | |
Splitting an object into groups | ||
------------------------------- | ||
|
||
pandas objects can be split on any of their axes. The abstract definition of | ||
grouping is to provide a mapping of labels to group names. To create a GroupBy | ||
object (more on what the GroupBy object is later), you may do the following: | ||
The abstract definition of grouping is to provide a mapping of labels to | ||
group names. To create a GroupBy object (more on what the GroupBy object is | ||
later), you may do the following: | ||
topper-123 marked this conversation as resolved.
|
||
|
||
.. ipython:: python | ||
|
||
|
@@ -99,12 +99,11 @@ object (more on what the GroupBy object is later), you may do the following: | |
|
||
The mapping can be specified many different ways: | ||
|
||
* A Python function, to be called on each of the axis labels. | ||
* A Python function, to be called on each of the index labels. | ||
topper-123 marked this conversation as resolved.
|
||
* A list or NumPy array of the same length as the index. | ||
* A dict or ``Series``, providing a ``label -> group name`` mapping. | ||
* For ``DataFrame`` objects, a string indicating either a column name or | ||
an index level name to be used to group. | ||
* ``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``. | ||
DeaMariaLeon marked this conversation as resolved.
|
||
* A list of any of the above things. | ||
|
||
Collectively we refer to the grouping objects as the **keys**. For example, | ||
|
@@ -136,16 +135,20 @@ We could naturally group by either the ``A`` or ``B`` columns, or both: | |
grouped = df.groupby("A") | ||
grouped = df.groupby(["A", "B"]) | ||
|
||
.. note:: | ||
|
||
``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``. | ||
|
||
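The "syntactic sugar" note can be checked with a small sketch. The data here is hypothetical and is not the frame used in the guide; only the two ``groupby`` spellings come from the note.

```python
import pandas as pd

# Hypothetical data, chosen only to make the comparison easy to read.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)

# Group by column name ...
by_name = df.groupby("A")["C"].sum()
# ... and by the column itself, passed as a Series.
by_series = df.groupby(df["A"])["C"].sum()

# Both spellings produce the same grouped result.
same = by_name.equals(by_series)
print(same)
```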
If we also have a MultiIndex on columns ``A`` and ``B``, we can group by all | ||
but the specified columns | ||
the columns except the one we specify: | ||
|
||
.. ipython:: python | ||
|
||
df2 = df.set_index(["A", "B"]) | ||
grouped = df2.groupby(level=df2.index.names.difference(["B"])) | ||
grouped.sum() | ||
|
||
These will split the DataFrame on its index (rows). To split by columns, first do | ||
GroupBy will split the DataFrame on its index (rows). To split by columns, first do | ||
DeaMariaLeon marked this conversation as resolved.
|
||
a tranpose: | ||
|
||
.. ipython:: | ||
|
@@ -184,8 +187,8 @@ only verifies that you've passed a valid mapping. | |
.. note:: | ||
|
||
Many kinds of complicated data manipulations can be expressed in terms of | ||
GroupBy operations (though can't be guaranteed to be the most | ||
efficient). You can get quite creative with the label mapping functions. | ||
GroupBy operations (it can't be guaranteed to be the most efficient implementation). | ||
DeaMariaLeon marked this conversation as resolved.
|
||
You can get quite creative with the label mapping functions. | ||
|
||
.. _groupby.sorting: | ||
|
||
|
@@ -245,8 +248,8 @@ The default setting of ``dropna`` argument is ``True`` which means ``NA`` are no | |
GroupBy object attributes | ||
~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
The ``groups`` attribute is a dict whose keys are the computed unique groups | ||
and corresponding values being the axis labels belonging to each group. In the | ||
The ``groups`` attribute is a dictionary whose keys are the computed unique groups | ||
and corresponding values are the axis labels belonging to each group. In the | ||
above example we have: | ||
|
||
.. ipython:: python | ||
|
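A minimal sketch of what the ``groups`` attribute holds, using a made-up frame rather than the one from the guide. The index objects are converted to plain lists only to make the mapping easier to read.

```python
import pandas as pd

# Hypothetical data with a default integer index.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]})

# Mapping of group key -> index labels of the rows in that group.
groups = df.groupby("A").groups

# Convert each Index of labels to a list for display.
as_lists = {key: list(labels) for key, labels in groups.items()}
print(as_lists)
```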
@@ -358,10 +361,12 @@ More on the ``sum`` function and aggregation later. | |
|
||
Grouping DataFrame with Index levels and columns | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
A DataFrame may be grouped by a combination of columns and index levels by | ||
specifying the column names as strings and the index levels as ``pd.Grouper`` | ||
A DataFrame may be grouped by a combination of columns and index levels. You | ||
need to specify the column names as strings, and the index levels as ``pd.Grouper`` | ||
objects. | ||
DeaMariaLeon marked this conversation as resolved.
|
||
|
||
Let's first create a DataFrame with a MultiIndex: | ||
|
||
.. ipython:: python | ||
|
||
arrays = [ | ||
|
@@ -375,8 +380,7 @@ objects. | |
|
||
df | ||
|
||
The following example groups ``df`` by the ``second`` index level and | ||
the ``A`` column. | ||
Then we group ``df`` by the ``second`` index level and the ``A`` column. | ||
|
||
.. ipython:: python | ||
|
||
|
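The combination of an index level and a column described in this hunk might look like the sketch below. The index values and column contents are invented for illustration; only the level name ``second``, the column name ``A``, and the use of ``pd.Grouper`` follow the text.

```python
import pandas as pd

# Hypothetical frame with a two-level MultiIndex.
index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]],
    names=["first", "second"],
)
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [10, 20, 30, 40]}, index=index)

# The index level is given as a pd.Grouper, the column as a plain string.
result = df.groupby([pd.Grouper(level="second"), "A"])["B"].sum()
print(result)
```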
@@ -398,8 +402,8 @@ DataFrame column selection in GroupBy | |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Once you have created the GroupBy object from a DataFrame, you might want to do | ||
something different for each of the columns. Thus, using ``[]`` similar to | ||
getting a column from a DataFrame, you can do: | ||
something different for each of the columns. Thus, by using ``[]`` on the GroupBy | ||
object in a similar way as the one used to get a column from a DataFrame, you can do: | ||
|
||
.. ipython:: python | ||
|
||
|
@@ -418,13 +422,13 @@ getting a column from a DataFrame, you can do: | |
grouped_C = grouped["C"] | ||
grouped_D = grouped["D"] | ||
|
||
This is mainly syntactic sugar for the alternative and much more verbose: | ||
This is mainly syntactic sugar for the alternative, which is much more verbose: | ||
|
||
.. ipython:: python | ||
|
||
df["C"].groupby(df["A"]) | ||
|
||
Additionally this method avoids recomputing the internal grouping information | ||
Additionally, this method avoids recomputing the internal grouping information | ||
derived from the passed key. | ||
|
||
.. _groupby.iterating-label: | ||
|
@@ -433,7 +437,7 @@ Iterating through groups | |
------------------------ | ||
|
||
With the GroupBy object in hand, iterating through the grouped data is very | ||
natural and functions similarly to :py:func:`itertools.groupby`: | ||
natural and works similarly to :py:func:`itertools.groupby`: | ||
DeaMariaLeon marked this conversation as resolved.
|
||
|
||
.. ipython:: | ||
|
||
|
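Iteration over a GroupBy object can be sketched as follows, assuming a small made-up frame. Each step yields the group name and the sub-frame for that group.

```python
import pandas as pd

# Hypothetical data, not the frame used in the guide.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]})

# Record the number of rows in each group while iterating.
sizes = {}
for name, group in df.groupby("A"):
    sizes[name] = len(group)

print(sizes)
```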
@@ -1195,8 +1199,8 @@ function. | |
|
||
.. note:: | ||
|
||
All of the examples in this section can be more reliably, and more efficiently, | ||
computed using other pandas functionality. | ||
All of the examples in this section can be more reliably, and more efficiently | ||
computed using other pandas functionalities. | ||
DeaMariaLeon marked this conversation as resolved.
|
||
|
||
.. ipython:: python | ||
|
||
|
@@ -1218,7 +1222,7 @@ The dimension of the returned result can also change: | |
|
||
grouped.apply(f) | ||
|
||
``apply`` on a Series can operate on a returned value from the applied function, | ||
``apply`` on a Series can operate on a returned value from the applied function | ||
that is itself a series, and possibly upcast the result to a DataFrame: | ||
|
||
.. ipython:: python | ||
|
@@ -1245,7 +1249,7 @@ Control grouped column(s) placement with ``group_keys`` | |
group keys added to the result index. Previous versions of pandas would add | ||
the group keys only when the result from the applied function had a different | ||
index than the input. If ``group_keys`` is not specified, the group keys will | ||
not be added for like-indexed outputs. In the future this behavior | ||
not be added for like-indexed outputs. In the future, this behavior | ||
DeaMariaLeon marked this conversation as resolved.
|
||
will change to always respect ``group_keys``, which defaults to ``True``. | ||
|
||
To control whether the grouped column(s) are included in the indices, you can use | ||
|
@@ -1293,7 +1297,7 @@ Again consider the example DataFrame we've been looking at: | |
|
||
df | ||
|
||
Suppose we wish to compute the standard deviation grouped by the ``A`` | ||
Suppose we need to compute the standard deviation grouped by the ``A`` | ||
DeaMariaLeon marked this conversation as resolved.
|
||
column. There is a slight problem, namely that we don't care about the data in | ||
column ``B`` because it is not numeric. We refer to these non-numeric columns as | ||
"nuisance" columns. You can avoid nuisance columns by specifying ``numeric_only=True``: | ||
|
@@ -1303,16 +1307,16 @@ column ``B`` because it is not numeric. We refer to these non-numeric columns as | |
df.groupby("A").std(numeric_only=True) | ||
|
||
Note that ``df.groupby('A').colname.std().`` is more efficient than | ||
``df.groupby('A').std().colname``, so if the result of an aggregation function | ||
is only interesting over one column (here ``colname``), it may be filtered | ||
``df.groupby('A').std().colname``. So if the result of an aggregation function | ||
is only needed over one column (here ``colname``), it may be filtered | ||
*before* applying the aggregation function. | ||
|
||
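To make the ordering point concrete, here is a sketch that selects the column before and after aggregating. The frame is hypothetical, with ``C`` standing in for ``colname``; the sketch only shows that both orders give the same values and does not measure the efficiency difference.

```python
import pandas as pd

# Hypothetical data: "B" plays the role of a non-numeric nuisance column.
df = pd.DataFrame(
    {
        "A": ["x", "x", "y", "y"],
        "B": ["p", "q", "r", "s"],
        "C": [1.0, 3.0, 2.0, 6.0],
    }
)

# Select the column first, then aggregate only that column.
before = df.groupby("A")["C"].std()

# Aggregate every numeric column, then select the one of interest.
after = df.groupby("A").std(numeric_only=True)["C"]

print(before)
print(after)
```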
.. note:: | ||
Any object column, also if it contains numerical values such as ``Decimal`` | ||
objects, is considered as a "nuisance" column. They are excluded from | ||
aggregate functions automatically in groupby. | ||
If an object column includes numerical values such as ``Decimal`` | ||
objects, it is considered a "nuisance" column. They are automatically | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed this can be phrased better, but I believe the change here is incorrect - it states that a nuisance column must contain numerical values. Any object column is consider a nuisance column. I'd suggest "Any object column, even if it contains..." There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This note in general is out of date - There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I will change to your suggestion and remove the out of date note. |
||
excluded from aggregate functions in groupby. | ||
|
||
If you do wish to include decimal or object columns in an aggregation with | ||
If you do want to include decimal or object columns in an aggregation with | ||
DeaMariaLeon marked this conversation as resolved.
|
||
other non-nuisance data types, you must do so explicitly. | ||
|
||
.. ipython:: python | ||
|
@@ -1435,7 +1439,7 @@ use the ``pd.Grouper`` to provide this local control. | |
|
||
df | ||
|
||
Groupby a specific column with the desired frequency. This is like resampling. | ||
Groupby a specific column with the wanted frequency. This is like resampling. | ||
rhshadrach marked this conversation as resolved.
|
||
|
||
.. ipython:: python | ||
|
||
|
@@ -1574,8 +1578,8 @@ Plotting | |
~~~~~~~~ | ||
|
||
Groupby also works with some plotting methods. For example, suppose we | ||
suspect that some features in a DataFrame may differ by group, in this case, | ||
the values in column 1 where the group is "B" are 3 higher on average. | ||
suspect that some features in a DataFrame may differ by group. In this case, | ||
in group "B", the values in column 1 are 3 times higher on average. | ||
DeaMariaLeon marked this conversation as resolved.
|
||
|
||
.. ipython:: python | ||
|
||
|
@@ -1657,7 +1661,7 @@ arbitrary function, for example: | |
|
||
df.groupby(["Store", "Product"]).pipe(mean) | ||
|
||
where ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity | ||
Where ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity | ||
DeaMariaLeon marked this conversation as resolved.
|
||
columns respectively for each Store-Product combination. The ``mean`` function can | ||
be any function that takes in a GroupBy object; the ``.pipe`` will pass the GroupBy | ||
object as a parameter into the function you specify. | ||
|
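A small sketch of the ``pipe`` pattern described in this hunk. The data and the body of ``mean`` are assumptions made for illustration; the column names ``Store``, ``Product``, ``Revenue`` and ``Quantity`` follow the text.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame(
    {
        "Store": ["S1", "S1", "S2", "S2"],
        "Product": ["P1", "P1", "P1", "P2"],
        "Revenue": [10.0, 20.0, 30.0, 40.0],
        "Quantity": [1, 3, 5, 7],
    }
)


def mean(groupby):
    # Receives the GroupBy object and averages the two value columns.
    return groupby[["Revenue", "Quantity"]].mean()


# pipe passes the GroupBy object as the first argument to mean ...
piped = df.groupby(["Store", "Product"]).pipe(mean)
# ... so it is equivalent to calling the function directly.
direct = mean(df.groupby(["Store", "Product"]))

print(piped)
```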
@@ -1709,11 +1713,16 @@ Groupby by indexer to 'resample' data | |
|
||
Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples. | ||
|
||
In order to resample to work on indices that are non-datetimelike, the following procedure can be utilized. | ||
In order for resample to work on indices that are non-datetimelike, the following procedure can be utilized. | ||
|
||
In the following examples, **df.index // 5** returns a binary array which is used to determine what gets selected for the groupby operation. | ||
|
||
.. note:: The below example shows how we can downsample by consolidation of samples into fewer samples. Here by using **df.index // 5**, we are aggregating the samples in bins. By applying **std()** function, we aggregate the information contained in many samples into a small subset of values which is their standard deviation thereby reducing the number of samples. | ||
.. note:: | ||
|
||
The example below shows how we can downsample by consolidation of samples into fewer ones. | ||
Here by using **df.index // 5**, we are aggregating the samples in bins. By applying **std()** | ||
function, we aggregate the information contained in many samples into a small subset of values | ||
which is their standard deviation. Thereby reducing the number of samples. | ||
rhshadrach marked this conversation as resolved.
|
||
|
||
.. ipython:: python | ||
|
||
|
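The binning step can be sketched as below, assuming a made-up frame with a default integer index of ten rows. In this sketch ``df.index // 5`` evaluates to integer bin labels (0 for rows 0-4, 1 for rows 5-9), and ``std()`` reduces each bin to a single row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten rows with a default 0..9 integer index.
df = pd.DataFrame({"value": np.arange(10, dtype=float)})

# Integer division assigns each row to a bin of five consecutive rows.
bins = df.index // 5

# One standard deviation per bin: ten samples are reduced to two.
result = df.groupby(bins).std()

print(list(bins))
print(result)
```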
@@ -1727,7 +1736,7 @@ Returning a Series to propagate names | |
|
||
Group DataFrame columns, compute a set of metrics and return a named Series. | ||
The Series name is used as the name for the column index. This is especially | ||
useful in conjunction with reshaping operations such as stacking in which the | ||
useful in conjunction with reshaping operations such as stacking, in which the | ||
column index name will be used as the name of the inserted column: | ||
|
||
.. ipython:: python | ||
|