Commit 462f66c

rhettinger authored and lisroach committed
bpo-36324: Apply review comments from Allen Downey (pythonGH-15693)
1 parent f15d8ff commit 462f66c

File tree

3 files changed: +83 -85 lines changed

Doc/library/statistics.rst

Lines changed: 65 additions & 64 deletions
@@ -26,10 +26,10 @@ numeric (:class:`Real`-valued) data.
Unless explicitly noted otherwise, these functions support :class:`int`,
:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
-currently unsupported. Mixed types are also undefined and
-implementation-dependent. If your input data consists of mixed types,
-you may be able to use :func:`map` to ensure a consistent result, e.g.
-``map(float, input_data)``.
+currently unsupported. Collections with a mix of types are also undefined
+and implementation-dependent. If your input data consists of mixed types,
+you may be able to use :func:`map` to ensure a consistent result, for
+example: ``map(float, input_data)``.

Averages and measures of central location
-----------------------------------------
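As an aside on the ``map(float, input_data)`` idiom mentioned above, a minimal illustrative sketch (not part of the commit; the mixed-type list is hypothetical):

>>> from statistics import mean
>>> from decimal import Decimal
>>> from fractions import Fraction
>>> mixed = [1, Fraction(3, 2), Decimal("2.5")]   # int, Fraction, and Decimal mixed together
>>> mean(map(float, mixed))                       # coerce to one consistent type first
1.6666666666666667
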
@@ -102,11 +102,9 @@ However, for reading convenience, most of the examples show sorted sequences.
.. note::

The mean is strongly affected by outliers and is not a robust estimator
-for central location: the mean is not necessarily a typical example of the
-data points. For more robust, although less efficient, measures of
-central location, see :func:`median` and :func:`mode`. (In this case,
-"efficient" refers to statistical efficiency rather than computational
-efficiency.)
+for central location: the mean is not necessarily a typical example of
+the data points. For more robust measures of central location, see
+:func:`median` and :func:`mode`.

The sample mean gives an unbiased estimate of the true population mean,
which means that, taken on average over all the possible samples,
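For context, a quick illustration of the robustness point above (not from the diff):

>>> from statistics import mean, median
>>> data = [1, 2, 3, 4, 100]     # one extreme outlier
>>> mean(data)                   # pulled far toward the outlier
22
>>> median(data)                 # barely affected
3
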
@@ -120,9 +118,8 @@ However, for reading convenience, most of the examples show sorted sequences.
Convert *data* to floats and compute the arithmetic mean.

This runs faster than the :func:`mean` function and it always returns a
-:class:`float`. The result is highly accurate but not as perfect as
-:func:`mean`. If the input dataset is empty, raises a
-:exc:`StatisticsError`.
+:class:`float`. The *data* may be a sequence or iterator. If the input
+dataset is empty, raises a :exc:`StatisticsError`.

.. doctest::

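The doctest elided by the hunk boundary is not shown; as a hedged sketch of the behavior the new wording documents (assuming iterator support is present in this version of the module):

>>> from statistics import fmean
>>> fmean(iter([3.5, 4.0, 5.25]))    # accepts an iterator and always returns a float
4.25
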
@@ -136,15 +133,20 @@ However, for reading convenience, most of the examples show sorted sequences.

Convert *data* to floats and compute the geometric mean.

+The geometric mean indicates the central tendency or typical value of the
+*data* using the product of the values (as opposed to the arithmetic mean
+which uses their sum).
+
Raises a :exc:`StatisticsError` if the input dataset is empty,
if it contains a zero, or if it contains a negative value.
+The *data* may be a sequence or iterator.

No special efforts are made to achieve exact results.
(However, this may change in the future.)

.. doctest::

->>> round(geometric_mean([54, 24, 36]), 9)
+>>> round(geometric_mean([54, 24, 36]), 1)
36.0

.. versionadded:: 3.8
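An extra illustrative sketch (not part of the commit) of the product-based averaging described in the added paragraph, using hypothetical growth factors:

>>> from statistics import geometric_mean
>>> round(geometric_mean([1.10, 1.20, 0.95]), 2)   # equivalent constant growth factor per period
1.08
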
@@ -174,7 +176,7 @@ However, for reading convenience, most of the examples show sorted sequences.
3.6

Using the arithmetic mean would give an average of about 5.167, which
-is too high.
+is well over the aggregate P/E ratio.

:exc:`StatisticsError` is raised if *data* is empty, or any element
is less than zero.
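For context, a sketch of the comparison the revised sentence refers to (the P/E values 2.5, 3 and 10 are assumed to be the ones used in the surrounding doc example):

>>> from statistics import harmonic_mean, mean
>>> harmonic_mean([2.5, 3, 10])     # the 3.6 aggregate P/E shown above
3.6
>>> round(mean([2.5, 3, 10]), 3)    # the arithmetic mean overstates it
5.167
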
@@ -312,10 +314,10 @@ However, for reading convenience, most of the examples show sorted sequences.
The mode (when it exists) is the most typical value and serves as a
measure of central location.

-If there are multiple modes, returns the first one encountered in the *data*.
-If the smallest or largest of multiple modes is desired instead, use
-``min(multimode(data))`` or ``max(multimode(data))``. If the input *data* is
-empty, :exc:`StatisticsError` is raised.
+If there are multiple modes with the same frequency, returns the first one
+encountered in the *data*. If the smallest or largest of those is
+desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
+If the input *data* is empty, :exc:`StatisticsError` is raised.

``mode`` assumes discrete data, and returns a single value. This is the
standard treatment of the mode as commonly taught in schools:
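Referring to the :func:`multimode` guidance above, a minimal illustrative sketch (not from the diff):

>>> from statistics import mode, multimode
>>> mode(['red', 'red', 'green', 'blue', 'blue'])       # first of two equally common modes
'red'
>>> min(multimode(['red', 'red', 'green', 'blue', 'blue']))
'blue'
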
@@ -325,8 +327,8 @@ However, for reading convenience, most of the examples show sorted sequences.
>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
3

-The mode is unique in that it is the only statistic which also applies
-to nominal (non-numeric) data:
+The mode is unique in that it is the only statistic in this package that
+also applies to nominal (non-numeric) data:

.. doctest::

@@ -368,15 +370,16 @@ However, for reading convenience, most of the examples show sorted sequences.

.. function:: pvariance(data, mu=None)

-Return the population variance of *data*, a non-empty iterable of real-valued
-numbers. Variance, or second moment about the mean, is a measure of the
-variability (spread or dispersion) of data. A large variance indicates that
-the data is spread out; a small variance indicates it is clustered closely
-around the mean.
+Return the population variance of *data*, a non-empty sequence or iterator
+of real-valued numbers. Variance, or second moment about the mean, is a
+measure of the variability (spread or dispersion) of data. A large
+variance indicates that the data is spread out; a small variance indicates
+it is clustered closely around the mean.

-If the optional second argument *mu* is given, it should be the mean of
-*data*. If it is missing or ``None`` (the default), the mean is
-automatically calculated.
+If the optional second argument *mu* is given, it is typically the mean of
+the *data*. It can also be used to compute the second moment around a
+point that is not the mean. If it is missing or ``None`` (the default),
+the arithmetic mean is automatically calculated.

Use this function to calculate the variance from the entire population. To
estimate the variance from a sample, the :func:`variance` function is usually
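A minimal sketch (illustrative, not from the diff) of passing a precomputed mean as *mu* to avoid a second pass over the data:

>>> from statistics import pvariance, mean
>>> data = [1, 2, 3, 4]
>>> mu = mean(data)          # precomputed mean, reused by pvariance()
>>> pvariance(data, mu)
1.25
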
@@ -401,10 +404,6 @@ However, for reading convenience, most of the examples show sorted sequences.
>>> pvariance(data, mu)
1.25

-This function does not attempt to verify that you have passed the actual mean
-as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
-results.
-
Decimals and Fractions are supported:

.. doctest::
@@ -423,11 +422,11 @@ However, for reading convenience, most of the examples show sorted sequences.
σ². When called on a sample instead, this is the biased sample variance
s², also known as variance with N degrees of freedom.

-If you somehow know the true population mean μ, you may use this function
-to calculate the variance of a sample, giving the known population mean as
-the second argument. Provided the data points are representative
-(e.g. independent and identically distributed), the result will be an
-unbiased estimate of the population variance.
+If you somehow know the true population mean μ, you may use this
+function to calculate the variance of a sample, giving the known
+population mean as the second argument. Provided the data points are a
+random sample of the population, the result will be an unbiased estimate
+of the population variance.


.. function:: stdev(data, xbar=None)
@@ -502,19 +501,19 @@ However, for reading convenience, most of the examples show sorted sequences.
:func:`pvariance` function as the *mu* parameter to get the variance of a
sample.

-.. function:: quantiles(dist, *, n=4, method='exclusive')
+.. function:: quantiles(data, *, n=4, method='exclusive')

-Divide *dist* into *n* continuous intervals with equal probability.
+Divide *data* into *n* continuous intervals with equal probability.
Returns a list of ``n - 1`` cut points separating the intervals.

Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
*n* to 100 for percentiles which gives the 99 cuts points that separate
-*dist* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
+*data* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
is not least 1.

-The *dist* can be any iterable containing sample data or it can be an
+The *data* can be any iterable containing sample data or it can be an
instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
-results, the number of data points in *dist* should be larger than *n*.
+results, the number of data points in *data* should be larger than *n*.
Raises :exc:`StatisticsError` if there are not at least two data points.

For sample data, the cut points are linearly interpolated from the
@@ -523,7 +522,7 @@ However, for reading convenience, most of the examples show sorted sequences.
cut-point will evaluate to ``104``.

The *method* for computing quantiles can be varied depending on
-whether the data in *dist* includes or excludes the lowest and
+whether the data in *data* includes or excludes the lowest and
highest possible values from the population.

The default *method* is "exclusive" and is used for data sampled from
@@ -535,14 +534,14 @@ However, for reading convenience, most of the examples show sorted sequences.

Setting the *method* to "inclusive" is used for describing population
data or for samples that are known to include the most extreme values
-from the population. The minimum value in *dist* is treated as the 0th
+from the population. The minimum value in *data* is treated as the 0th
percentile and the maximum value is treated as the 100th percentile.
The portion of the population falling below the *i-th* of *m* sorted
data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
values, the method sorts them and assigns the following percentiles:
0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.

-If *dist* is an instance of a class that defines an
+If *data* is an instance of a class that defines an
:meth:`~inv_cdf` method, setting *method* has no effect.

.. doctest::
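The elided doctest is not shown here; as an illustrative sketch (not part of the commit) of the two methods applied to ten evenly spaced values:

>>> from statistics import quantiles
>>> data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> quantiles(data)                        # quartiles with the default 'exclusive' method
[2.75, 5.5, 8.25]
>>> quantiles(data, method='inclusive')    # data already spans the population extremes
[3.25, 5.5, 7.75]
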
@@ -580,7 +579,7 @@ A single exception is defined:
:class:`NormalDist` is a tool for creating and manipulating normal
distributions of a `random variable
<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
-composite class that treats the mean and standard deviation of data
+class that treats the mean and standard deviation of data
measurements as a single entity.

Normal distributions arise from the `Central Limit Theorem
@@ -616,13 +615,14 @@ of applications in statistics.

.. classmethod:: NormalDist.from_samples(data)

-Makes a normal distribution instance computed from sample data. The
-*data* can be any :term:`iterable` and should consist of values that
-can be converted to type :class:`float`.
+Makes a normal distribution instance with *mu* and *sigma* parameters
+estimated from the *data* using :func:`fmean` and :func:`stdev`.

-If *data* does not contain at least two elements, raises
-:exc:`StatisticsError` because it takes at least one point to estimate
-a central value and at least two points to estimate dispersion.
+The *data* can be any :term:`iterable` and should consist of values
+that can be converted to type :class:`float`. If *data* does not
+contain at least two elements, raises :exc:`StatisticsError` because it
+takes at least one point to estimate a central value and at least two
+points to estimate dispersion.

.. method:: NormalDist.samples(n, *, seed=None)

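A minimal sketch (illustrative, not from the diff) of the estimate built from :func:`fmean` and :func:`stdev`:

>>> from statistics import NormalDist
>>> NormalDist.from_samples([1.0, 2.0, 3.0])   # mu = fmean(data), sigma = stdev(data)
NormalDist(mu=2.0, sigma=1.0)
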
@@ -636,10 +636,10 @@ of applications in statistics.
.. method:: NormalDist.pdf(x)

Using a `probability density function (pdf)
-<https://en.wikipedia.org/wiki/Probability_density_function>`_,
-compute the relative likelihood that a random variable *X* will be near
-the given value *x*. Mathematically, it is the ratio ``P(x <= X <
-x+dx) / dx``.
+<https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
+the relative likelihood that a random variable *X* will be near the
+given value *x*. Mathematically, it is the limit of the ratio ``P(x <=
+X < x+dx) / dx`` as *dx* approaches zero.

The relative likelihood is computed as the probability of a sample
occurring in a narrow range divided by the width of the range (hence
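For context, a small illustrative check (not part of the commit) of the density at the mean of a standard normal distribution:

>>> from statistics import NormalDist
>>> round(NormalDist().pdf(0), 4)    # peak density, 1/sqrt(2*pi)
0.3989
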
@@ -667,8 +667,10 @@ of applications in statistics.

.. method:: NormalDist.overlap(other)

-Returns a value between 0.0 and 1.0 giving the overlapping area for
-the two probability density functions.
+Measures the agreement between two normal probability distributions.
+Returns a value between 0.0 and 1.0 giving `the overlapping area for
+the two probability density functions
+<https://www.rasch.org/rmt/rmt101r.htm>`_.

Instances of :class:`NormalDist` support addition, subtraction,
multiplication and division by a constant. These operations
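An illustrative sketch (not from the diff): identical distributions overlap completely, while unit normals one standard deviation apart overlap by roughly 62%:

>>> from statistics import NormalDist
>>> N1 = NormalDist(0, 1)
>>> N1.overlap(N1)                          # identical distributions
1.0
>>> round(N1.overlap(NormalDist(1, 1)), 4)  # means one sigma apart
0.6171
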
@@ -740,12 +742,11 @@ Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
... return (3*x + 7*x*y - 5*y) / (11 * z)
...
>>> n = 100_000
->>> seed = 86753099035768
->>> X = NormalDist(10, 2.5).samples(n, seed=seed)
->>> Y = NormalDist(15, 1.75).samples(n, seed=seed)
->>> Z = NormalDist(50, 1.25).samples(n, seed=seed)
->>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
-NormalDist(mu=1.8661894803304777, sigma=0.65238717376862)
+>>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
+>>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
+>>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
+>>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
+[1.4591308524824727, 1.8035946855390597, 2.175091447274739]

Normal distributions commonly arise in machine learning problems.

Lib/statistics.py

Lines changed: 17 additions & 21 deletions
@@ -322,7 +322,6 @@ def fmean(data):
    """Convert data to floats and compute the arithmetic mean.

    This runs faster than the mean() function and it always returns a float.
-    The result is highly accurate but not as perfect as mean().
    If the input dataset is empty, it raises a StatisticsError.

    >>> fmean([3.5, 4.0, 5.25])
@@ -538,15 +537,16 @@ def mode(data):
``mode`` assumes discrete data, and returns a single value. This is the
standard treatment of the mode as commonly taught in schools:

->>> mode([1, 1, 2, 3, 3, 3, 3, 4])
-3
+>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
+3

This also works with nominal (non-numeric) data:

->>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
-'red'
+>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
+'red'

-If there are multiple modes, return the first one encountered.
+If there are multiple modes with same frequency, return the first one
+encountered:

>>> mode(['red', 'red', 'green', 'blue', 'blue'])
'red'
@@ -615,28 +615,28 @@ def multimode(data):
# position is that fewer options make for easier choices and that
# external packages can be used for anything more advanced.

-def quantiles(dist, /, *, n=4, method='exclusive'):
-    """Divide *dist* into *n* continuous intervals with equal probability.
+def quantiles(data, /, *, n=4, method='exclusive'):
+    """Divide *data* into *n* continuous intervals with equal probability.

    Returns a list of (n - 1) cut points separating the intervals.

    Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles.
    Set *n* to 100 for percentiles which gives the 99 cuts points that
-    separate *dist* in to 100 equal sized groups.
+    separate *data* in to 100 equal sized groups.

-    The *dist* can be any iterable containing sample data or it can be
+    The *data* can be any iterable containing sample data or it can be
    an instance of a class that defines an inv_cdf() method. For sample
    data, the cut points are linearly interpolated between data points.

-    If *method* is set to *inclusive*, *dist* is treated as population
+    If *method* is set to *inclusive*, *data* is treated as population
    data. The minimum value is treated as the 0th percentile and the
    maximum value is treated as the 100th percentile.
    """
    if n < 1:
        raise StatisticsError('n must be at least 1')
-    if hasattr(dist, 'inv_cdf'):
-        return [dist.inv_cdf(i / n) for i in range(1, n)]
-    data = sorted(dist)
+    if hasattr(data, 'inv_cdf'):
+        return [data.inv_cdf(i / n) for i in range(1, n)]
+    data = sorted(data)
    ld = len(data)
    if ld < 2:
        raise StatisticsError('must have at least two data points')
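For context (illustrative, not in the diff): when the argument provides an inv_cdf() method, the cut points come straight from the inverse CDF and the *method* setting is ignored:

>>> from statistics import NormalDist, quantiles
>>> [round(q, 4) for q in quantiles(NormalDist(), n=4)]
[-0.6745, 0.0, 0.6745]
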
@@ -745,7 +745,7 @@ def variance(data, xbar=None):
def pvariance(data, mu=None):
    """Return the population variance of ``data``.

-    data should be an iterable of Real-valued numbers, with at least one
+    data should be a sequence or iterator of Real-valued numbers, with at least one
    value. The optional argument mu, if given, should be the mean of
    the data. If it is missing or None, the mean is automatically calculated.

@@ -766,10 +766,6 @@ def pvariance(data, mu=None):
    >>> pvariance(data, mu)
    1.25

-    This function does not check that ``mu`` is actually the mean of ``data``.
-    Giving arbitrary values for ``mu`` may lead to invalid or impossible
-    results.
-
    Decimals and Fractions are supported:

    >>> from decimal import Decimal as D
@@ -913,8 +909,8 @@ def __init__(self, mu=0.0, sigma=1.0):
        "NormalDist where mu is the mean and sigma is the standard deviation."
        if sigma < 0.0:
            raise StatisticsError('sigma must be non-negative')
-        self._mu = mu
-        self._sigma = sigma
+        self._mu = float(mu)
+        self._sigma = float(sigma)

    @classmethod
    def from_samples(cls, data):
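A small illustrative check (not part of the commit) of the float coercion this hunk introduces:

>>> from statistics import NormalDist
>>> NormalDist(2, 1)          # integer arguments are now stored as floats
NormalDist(mu=2.0, sigma=1.0)
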

Misc/ACKS

Lines changed: 1 addition & 0 deletions
@@ -416,6 +416,7 @@ Dima Dorfman
Yves Dorfsman
Michael Dorman
Steve Dower
+Allen Downey
Cesar Douady
Dean Draayer
Fred L. Drake, Jr.
