Commit 11c7953

bpo-36018: Add the NormalDist class to the statistics module (GH-11973)
1 parent 64d6cc8 commit 11c7953

5 files changed

+556
-1
lines changed

Doc/library/statistics.rst

Lines changed: 195 additions & 0 deletions
@@ -467,6 +467,201 @@ A single exception is defined:
467467

468468
Subclass of :exc:`ValueError` for statistics-related exceptions.
469469

470+
471+
:class:`NormalDist` objects
472+
===========================
473+
474+
A :class:`NormalDist` is a composite class that treats the mean and standard
475+
deviation of data measurements as a single entity. It is a tool for creating
476+
and manipulating normal distributions of a random variable.
477+
478+
Normal distributions arise from the `Central Limit Theorem
479+
<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
480+
of applications in statistics, including simulations and hypothesis testing.
481+
482+
.. class:: NormalDist(mu=0.0, sigma=1.0)
483+
484+
Returns a new *NormalDist* object where *mu* represents the `arithmetic
485+
mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of the data and *sigma*
486+
represents the `standard deviation
487+
<https://en.wikipedia.org/wiki/Standard_deviation>`_ of the data.
488+
489+
If *sigma* is negative, raises :exc:`StatisticsError`.
490+
491+
.. attribute:: mu
492+
493+
The mean of a normal distribution.
494+
495+
.. attribute:: sigma
496+
497+
The standard deviation of a normal distribution.
498+
499+
.. attribute:: variance
500+
501+
A read-only property representing the `variance
502+
<https://en.wikipedia.org/wiki/Variance>`_ of a normal
503+
distribution. Equal to the square of the standard deviation.
504+
505+
.. classmethod:: NormalDist.from_samples(data)
506+
507+
Class method that makes a normal distribution instance
508+
from sample data. The *data* can be any :term:`iterable`
509+
and should consist of values that can be converted to type
510+
:class:`float`.
511+
512+
If *data* does not contain at least two elements, raises
513+
:exc:`StatisticsError` because it takes at least one point to estimate
514+
a central value and at least two points to estimate dispersion.
515+
516+
.. method:: NormalDist.samples(n, seed=None)
517+
518+
Generates *n* random samples for a given mean and standard deviation.
519+
Returns a :class:`list` of :class:`float` values.
520+
521+
If *seed* is given, creates a new instance of the underlying random
522+
number generator. This is useful for creating reproducible results,
523+
even in a multi-threading context.
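A small sketch of the reproducibility that *seed* provides. The seed value and distribution parameters are arbitrary choices for illustration:

```python
from statistics import NormalDist

dist = NormalDist(mu=50, sigma=10)

first = dist.samples(5, seed=12345)   # seed value is arbitrary
second = dist.samples(5, seed=12345)  # same seed again
unseeded = dist.samples(5)            # no seed given for this call

print(first == second)  # the two seeded runs produce identical lists
print(len(unseeded))
```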
524+
525+
.. method:: NormalDist.pdf(x)
526+
527+
Using a `probability density function (pdf)
528+
<https://en.wikipedia.org/wiki/Probability_density_function>`_,
529+
compute the relative likelihood that a random sample *X* will be near
530+
the given value *x*. Mathematically, it is the ratio ``P(x <= X <
531+
x+dx) / dx``.
532+
533+
Note the relative likelihood of *x* can be greater than ``1.0``. The
534+
probability for a specific point on a continuous distribution is ``0.0``,
535+
so the :func:`pdf` is used instead. It gives the probability of a
536+
sample occurring in a narrow range around *x* and then divides that
537+
probability by the width of the range (hence the word "density").
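To make the "density" idea concrete, the following sketch approximates ``P(x <= X < x+dx) / dx`` with a finite difference of the cdf and compares it with ``pdf(x)``. The distribution, *x*, and *dx* are assumptions picked for the example:

```python
from statistics import NormalDist

narrow = NormalDist(mu=0.0, sigma=0.1)  # illustrative, tightly concentrated

x = 0.05
dx = 1e-6
density = narrow.pdf(x)
# Finite-difference estimate of P(x <= X < x+dx) / dx using the cdf.
estimate = (narrow.cdf(x + dx) - narrow.cdf(x)) / dx

print(density)
print(estimate)
# With a small sigma, the density at the mean exceeds 1.0.
print(narrow.pdf(0.0) > 1.0)
```

The two printed values should agree to several decimal places. The density is not itself a probability, which is why it can exceed ``1.0``.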
538+
539+
.. method:: NormalDist.cdf(x)
540+
541+
Using a `cumulative distribution function (cdf)
542+
<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
543+
compute the probability that a random sample *X* will be less than or
544+
equal to *x*. Mathematically, it is written ``P(X <= x)``.
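A brief sketch of using the cdf to get the probability of an interval, here for the standard normal distribution:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu=0.0, sigma=1.0

# Half of the probability mass lies at or below the mean.
below_mean = z.cdf(0.0)
# P(-1 < X <= 1) is the difference of two cdf values.
within_one_sigma = z.cdf(1.0) - z.cdf(-1.0)

print(below_mean)
print(f'{within_one_sigma:.4f}')
```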
545+
546+
Instances of :class:`NormalDist` support addition, subtraction,
547+
multiplication and division by a constant. These operations
548+
are used for translation and scaling. For example:
549+
550+
.. doctest::
551+
552+
>>> temperature_february = NormalDist(5, 2.5) # Celsius
553+
>>> temperature_february * (9/5) + 32 # Fahrenheit
554+
NormalDist(mu=41.0, sigma=4.5)
555+
556+
Dividing a constant by an instance of :class:`NormalDist` is not supported.
557+
558+
Since normal distributions arise from additive effects of independent
559+
variables, it is possible to `add and subtract two normally distributed
560+
random variables
561+
<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
562+
represented as instances of :class:`NormalDist`. For example:
563+
564+
.. doctest::
565+
566+
>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
567+
>>> drug_effects = NormalDist(0.4, 0.15)
568+
>>> combined = birth_weights + drug_effects
569+
>>> f'mu={combined.mu :.1f} sigma={combined.sigma :.1f}'
570+
'mu=3.1 sigma=0.5'
571+
572+
.. versionadded:: 3.8
573+
574+
575+
:class:`NormalDist` Examples and Recipes
576+
----------------------------------------
577+
578+
A :class:`NormalDist` readily solves classic probability problems.
579+
580+
For example, given `historical data for SAT exams
581+
<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
582+
are normally distributed with a mean of 1060 and standard deviation of 195,
583+
determine the percentage of students with scores between 1100 and 1200:
584+
585+
.. doctest::
586+
587+
>>> sat = NormalDist(1060, 195)
588+
>>> fraction = sat.cdf(1200) - sat.cdf(1100)
589+
>>> f'{fraction * 100 :.1f}% score between 1100 and 1200'
590+
'18.2% score between 1100 and 1200'
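The same distribution can answer related questions. As a sketch, the fraction scoring above a threshold is the complement of the cdf at that threshold:

```python
from statistics import NormalDist

sat = NormalDist(1060, 195)

# P(X > 1200) is 1 - P(X <= 1200).
above_1200 = 1.0 - sat.cdf(1200)
print(f'{above_1200 * 100:.1f}% score above 1200')
```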
591+
592+
To estimate the distribution for a model that isn't easy to solve
593+
analytically, :class:`NormalDist` can generate input samples for a `Monte
594+
Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_ of the
595+
model:
596+
597+
.. doctest::
598+
599+
>>> n = 100_000
600+
>>> X = NormalDist(350, 15).samples(n)
601+
>>> Y = NormalDist(47, 17).samples(n)
602+
>>> Z = NormalDist(62, 6).samples(n)
603+
>>> model_simulation = [x * y / z for x, y, z in zip(X, Y, Z)]
604+
>>> NormalDist.from_samples(model_simulation) # doctest: +SKIP
605+
NormalDist(mu=267.6516398754636, sigma=101.357284306067)
606+
607+
Normal distributions commonly arise in machine learning problems.
608+
609+
Wikipedia has a `nice example with a Naive Bayesian Classifier
610+
<https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_. The challenge
611+
is to guess a person's gender from measurements of normally distributed
612+
features including height, weight, and foot size.
613+
614+
The `prior probability <https://en.wikipedia.org/wiki/Prior_probability>`_ of
615+
being male or female is 50%:
616+
617+
.. doctest::
618+
619+
>>> prior_male = 0.5
620+
>>> prior_female = 0.5
621+
622+
We also have a training dataset with measurements for eight people. These
623+
measurements are assumed to be normally distributed, so we summarize the data
624+
with :class:`NormalDist`:
625+
626+
.. doctest::
627+
628+
>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
629+
>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
630+
>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
631+
>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
632+
>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
633+
>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
634+
635+
We observe a new person whose feature measurements are known but whose gender
636+
is unknown:
637+
638+
.. doctest::
639+
640+
>>> ht = 6.0 # height
641+
>>> wt = 130 # weight
642+
>>> fs = 8 # foot size
643+
644+
The posterior is the product of the prior times each likelihood of a
645+
feature measurement given the gender:
646+
647+
.. doctest::
648+
649+
>>> posterior_male = (prior_male * height_male.pdf(ht) *
650+
... weight_male.pdf(wt) * foot_size_male.pdf(fs))
651+
652+
>>> posterior_female = (prior_female * height_female.pdf(ht) *
653+
... weight_female.pdf(wt) * foot_size_female.pdf(fs))
654+
655+
The final prediction is awarded to the largest posterior -- this is known as
656+
the `maximum a posteriori
657+
<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
658+
659+
.. doctest::
660+
661+
>>> 'male' if posterior_male > posterior_female else 'female'
662+
'female'
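The two products above are unnormalized, so they do not sum to one. If a probability is wanted, each can be divided by their sum. The sketch below repeats the training data so that it runs on its own. The ``score_*`` and ``prob_*`` names are illustrative, not part of the original example:

```python
from statistics import NormalDist

# Training summaries repeated from the example above.
height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
weight_male = NormalDist.from_samples([180, 190, 170, 165])
weight_female = NormalDist.from_samples([100, 150, 130, 150])
foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
foot_size_female = NormalDist.from_samples([6, 8, 7, 9])

ht, wt, fs = 6.0, 130, 8  # the new observation

# Prior (0.5 each) times the likelihood of each feature.
score_male = (0.5 * height_male.pdf(ht) *
              weight_male.pdf(wt) * foot_size_male.pdf(fs))
score_female = (0.5 * height_female.pdf(ht) *
                weight_female.pdf(wt) * foot_size_female.pdf(fs))

# Dividing by the total turns the scores into probabilities.
evidence = score_male + score_female
prob_male = score_male / evidence
prob_female = score_female / evidence
print(f'P(female) = {prob_female:.4f}')
```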
663+
664+
470665
..
471666
# This modelines must appear within the last ten lines of the file.
472667
kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;

Doc/whatsnew/3.8.rst

Lines changed: 26 additions & 0 deletions
@@ -278,6 +278,32 @@ Added :func:`statistics.fmean` as a faster, floating point variant of
278278
:func:`statistics.mean()`. (Contributed by Raymond Hettinger and
279279
Steven D'Aprano in :issue:`35904`.)
280280

281+
Added :class:`statistics.NormalDist`, a tool for creating
282+
and manipulating normal distributions of a random variable.
283+
(Contributed by Raymond Hettinger in :issue:`36018`.)
284+
285+
::
286+
287+
>>> temperature_feb = NormalDist.from_samples([4, 12, -3, 2, 7, 14])
288+
>>> temperature_feb
289+
NormalDist(mu=6.0, sigma=6.356099432828281)
290+
291+
>>> temperature_feb.cdf(3) # Chance of being under 3 degrees
292+
0.3184678262814532
293+
>>> # Relative chance of being 7 degrees versus 10 degrees
294+
>>> temperature_feb.pdf(7) / temperature_feb.pdf(10)
295+
1.2039930378537762
296+
297+
>>> el_nino = NormalDist(4, 2.5)
298+
>>> temperature_feb += el_nino # Add in a climate effect
299+
>>> temperature_feb
300+
NormalDist(mu=10.0, sigma=6.830080526611674)
301+
302+
>>> temperature_feb * (9/5) + 32 # Convert to Fahrenheit
303+
NormalDist(mu=50.0, sigma=12.294144947901014)
304+
>>> temperature_feb.samples(3) # Generate random samples
305+
[7.672102882379219, 12.000027119750287, 4.647488369766392]
306+
281307

282308
tokenize
283309
--------
