Commit b15457c

glemaitre and ogrisel authored and committed
DOC add warning regarding the load_boston function (scikit-learn#20729)
Co-authored-by: Olivier Grisel <[email protected]>
1 parent 6bb9fd4 commit b15457c

4 files changed: +148 -7 lines changed

doc/whats_new/v1.0.rst

Lines changed: 9 additions & 4 deletions
@@ -235,11 +235,11 @@ Changelog
235235

236236
- |API| Deprecates the following keys in `cv_results_`: `'mean_score'`,
237237
`'std_score'`, and `'split(k)_score'` in favor of `'mean_test_score'`,
238-
`'std_test_score'`, and `'split(k)_test_score``. :pr:`20583` by `Thomas Fan`_.
238+
`'std_test_score'`, and `'split(k)_test_score'`. :pr:`20583` by `Thomas Fan`_.
239239

240240
- |Fix| Adds arrays check to :func:`covariance.ledoit_wolf` and
241-
:func:`covariance.ledoit_wolf_shrinkage`
242-
:pr:`20416` by `Hugo Defois <defoishugo>`.
241+
:func:`covariance.ledoit_wolf_shrinkage`.
242+
:pr:`20416` by :user:`Hugo Defois <defoishugo>`.
243243

244244
:mod:`sklearn.datasets`
245245
.......................
@@ -260,7 +260,12 @@ Changelog
260260
with ``importlib.resources`` to avoid the assumption that these resource
261261
files (e.g. ``iris.csv``) already exist on a filesystem, and by extension
262262
to enable compatibility with tools such as ``PyOxidizer``.
263-
:pr:`20297` by :user:`Jack Liu <jackzyliu>`
263+
:pr:`20297` by :user:`Jack Liu <jackzyliu>`.
264+
265+
- |API| Deprecates :func:`datasets.load_boston` in 1.0 and it will be removed
266+
in 1.2. Alternative code snippets to load similar datasets are provided.
267+
Please refer to the docstring of the function for details.
268+
:pr:`20729` by `Guillaume Lemaitre`_.
264269

265270

266271
:mod:`sklearn.decomposition`

sklearn/datasets/_base.py

Lines changed: 109 additions & 3 deletions
@@ -18,6 +18,7 @@
1818
from ..utils import Bunch
1919
from ..utils import check_random_state
2020
from ..utils import check_pandas_support
21+
from ..utils.deprecation import deprecated
2122

2223
import numpy as np
2324

@@ -1109,8 +1110,45 @@ def load_linnerud(*, return_X_y=False, as_frame=False):
11091110
)
11101111

11111112

1113+
@deprecated(
1114+
r"""`load_boston` is deprecated in 1.0 and will be removed in 1.2.
1115+
1116+
The Boston housing prices dataset has an ethical problem. You can refer to
1117+
the documentation of this function for further details.
1118+
1119+
The scikit-learn maintainers therefore strongly discourage the use of this
1120+
dataset unless the purpose of the code is to study and educate about
1121+
ethical issues in data science and machine learning.
1122+
1123+
In this special case, you can fetch the dataset from the original
1124+
source::
1125+
1126+
import pandas as pd
1127+
import numpy as np
1128+
1129+
1130+
data_url = "http://lib.stat.cmu.edu/datasets/boston"
1131+
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
1132+
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
1133+
target = raw_df.values[1::2, 2]
1134+
1135+
Alternative datasets include the California housing dataset (i.e.
1136+
:func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
1137+
dataset. You can load the datasets as follows:
1138+
1139+
from sklearn.datasets import fetch_california_housing
1140+
housing = fetch_california_housing()
1141+
1142+
for the California housing dataset and:
1143+
1144+
from sklearn.datasets import fetch_openml
1145+
housing = fetch_openml(name="house_prices", as_frame=True)
1146+
1147+
for the Ames housing dataset.
1148+
"""
1149+
)
11121150
def load_boston(*, return_X_y=False):
1113-
"""Load and return the boston house-prices dataset (regression).
1151+
r"""Load and return the boston house-prices dataset (regression).
11141152
11151153
============== ==============
11161154
Samples total 506
@@ -1121,6 +1159,50 @@ def load_boston(*, return_X_y=False):
11211159
11221160
Read more in the :ref:`User Guide <boston_dataset>`.
11231161
1162+
.. deprecated:: 1.0
1163+
This function is deprecated in 1.0 and will be removed in 1.2. See the
1164+
warning message below for further details regarding the alternative
1165+
datasets.
1166+
1167+
.. warning::
1168+
The Boston housing prices dataset has an ethical problem: as
1169+
investigated in [1]_, the authors of this dataset engineered a
1170+
non-invertible variable "B" assuming that racial self-segregation had a
1171+
positive impact on house prices [2]_. Furthermore, the goal of the
1172+
research that led to the creation of this dataset was to study the
1173+
impact of air quality, but it did not give adequate demonstration of the
1174+
validity of this assumption.
1175+
1176+
The scikit-learn maintainers therefore strongly discourage the use of
1177+
this dataset unless the purpose of the code is to study and educate
1178+
about ethical issues in data science and machine learning.
1179+
1180+
In this special case, you can fetch the dataset from the original
1181+
source::
1182+
1183+
import pandas as pd # doctest: +SKIP
1184+
import numpy as np
1185+
1186+
1187+
data_url = "http://lib.stat.cmu.edu/datasets/boston"
1188+
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
1189+
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
1190+
target = raw_df.values[1::2, 2]
1191+
1192+
Alternative datasets include the California housing dataset [3]_
1193+
(i.e. :func:`~sklearn.datasets.fetch_california_housing`) and Ames
1194+
housing dataset [4]_. You can load the datasets as follows::
1195+
1196+
from sklearn.datasets import fetch_california_housing
1197+
housing = fetch_california_housing()
1198+
1199+
for the California housing dataset and::
1200+
1201+
from sklearn.datasets import fetch_openml
1202+
housing = fetch_openml(name="house_prices", as_frame=True) # noqa
1203+
1204+
for the Ames housing dataset.
1205+
11241206
Parameters
11251207
----------
11261208
return_X_y : bool, default=False
@@ -1136,7 +1218,7 @@ def load_boston(*, return_X_y=False):
11361218
11371219
data : ndarray of shape (506, 13)
11381220
The data matrix.
1139-
target : ndarray of shape (506, )
1221+
target : ndarray of shape (506,)
11401222
The regression target.
11411223
filename : str
11421224
The physical location of boston csv dataset.
@@ -1157,13 +1239,37 @@ def load_boston(*, return_X_y=False):
11571239
.. versionchanged:: 0.20
11581240
Fixed a wrong data point at [445, 0].
11591241
1242+
References
1243+
----------
1244+
.. [1] `Racist data destruction? M Carlisle,
1245+
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>`_
1246+
.. [2] `Harrison Jr, David, and Daniel L. Rubinfeld.
1247+
"Hedonic housing prices and the demand for clean air."
1248+
Journal of environmental economics and management 5.1 (1978): 81-102.
1249+
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>`_
1250+
.. [3] `California housing dataset
1251+
<https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset>`_
1252+
.. [4] `Ames housing dataset
1253+
<https://www.openml.org/d/42165>`_
1254+
11601255
Examples
11611256
--------
1257+
>>> import warnings
11621258
>>> from sklearn.datasets import load_boston
1163-
>>> X, y = load_boston(return_X_y=True)
1259+
>>> with warnings.catch_warnings():
1260+
... # You should probably not use this dataset.
1261+
... warnings.filterwarnings("ignore")
1262+
... X, y = load_boston(return_X_y=True)
11641263
>>> print(X.shape)
11651264
(506, 13)
11661265
"""
1266+
# TODO: once the deprecation period is over, implement a module level
1267+
# `__getattr__` function in `sklearn.datasets` to raise an exception with
1268+
# an informative error message at import time instead of just removing
1269+
# load_boston. The goal is to avoid having beginners that copy-paste code
1270+
# from numerous books and tutorials that use this dataset loader get
1271+
# a confusing ImportError when trying to learn scikit-learn.
1272+
# See: https://www.python.org/dev/peps/pep-0562/
11671273

11681274
descr_text = load_descr("boston_house_prices.rst")
11691275

sklearn/datasets/tests/test_base.py

Lines changed: 29 additions & 0 deletions
@@ -27,6 +27,7 @@
2727
load_gzip_compressed_csv_data,
2828
)
2929
from sklearn.utils import Bunch
30+
from sklearn.utils._testing import SkipTest
3031
from sklearn.datasets.tests.test_common import check_as_frame
3132

3233
from sklearn.externals._pilutil import pillow_installed
@@ -223,6 +224,7 @@ def test_load_missing_sample_image_error():
223224
warnings.warn("Could not load sample images, PIL is not available.")
224225

225226

227+
@pytest.mark.filterwarnings("ignore:Function load_boston is deprecated")
226228
@pytest.mark.parametrize(
227229
"loader_func, data_shape, target_shape, n_target, has_descr, filenames",
228230
[
@@ -318,3 +320,30 @@ def test_bunch_dir():
318320
# check that dir (important for autocomplete) shows attributes
319321
data = load_iris()
320322
assert "data" in dir(data)
323+
324+
325+
# FIXME: to be removed in 1.2
326+
def test_load_boston_warning():
327+
"""Check that we raise the ethical warning when loading `load_boston`."""
328+
warn_msg = "The Boston housing prices dataset has an ethical problem"
329+
with pytest.warns(FutureWarning, match=warn_msg):
330+
load_boston()
331+
332+
333+
@pytest.mark.filterwarnings("ignore:Function load_boston is deprecated")
334+
def test_load_boston_alternative():
335+
pd = pytest.importorskip("pandas")
336+
if os.environ.get("SKLEARN_SKIP_NETWORK_TESTS", "1") == "1":
337+
raise SkipTest(
338+
"This test requires an internet connection to fetch the dataset."
339+
)
340+
341+
boston_sklearn = load_boston()
342+
343+
data_url = "http://lib.stat.cmu.edu/datasets/boston"
344+
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
345+
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
346+
target = raw_df.values[1::2, 2]
347+
348+
np.testing.assert_allclose(data, boston_sklearn.data)
349+
np.testing.assert_allclose(target, boston_sklearn.target)
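
The slicing used above to rebuild `data` and `target` relies on each record of the raw file spanning two parsed rows: an even row with 11 values, followed by an odd row with 3 values and NaN padding. The sketch below reproduces that layout on a small synthetic array (the values are made up for illustration and are not the Boston data) to show how the two row types are stitched back together.

```python
import numpy as np

n_records = 3

# Synthetic stand-in for the parsed file: each record occupies two rows.
# Even rows carry 11 values; odd rows carry 3 values padded with NaN.
raw = np.full((2 * n_records, 11), np.nan)
for i in range(n_records):
    raw[2 * i, :] = 100 * i + np.arange(11)           # features 0..10
    raw[2 * i + 1, :3] = 100 * i + np.arange(11, 14)  # features 11, 12, then the target

# Even rows (11 columns) plus the first 2 columns of odd rows give 13 features.
data = np.hstack([raw[::2, :], raw[1::2, :2]])
# The third value of each odd row is the regression target.
target = raw[1::2, 2]
```

With this layout, `data` has shape `(n_records, 13)` and contains no NaN padding, and `target` has one value per record.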

sklearn/datasets/tests/test_common.py

Lines changed: 1 addition & 0 deletions
@@ -115,6 +115,7 @@ def _generate_func_supporting_param(param, dataset_type=("load", "fetch")):
115115
@pytest.mark.parametrize(
116116
"name, dataset_func", _generate_func_supporting_param("return_X_y")
117117
)
118+
@pytest.mark.filterwarnings("ignore:Function load_boston is deprecated")
118119
def test_common_check_return_X_y(name, dataset_func):
119120
bunch = dataset_func()
120121
check_return_X_y(bunch, dataset_func)
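
The tests in this commit match the deprecation warning in two different ways: the `filterwarnings` marks use the prefix "Function load_boston is deprecated", while `pytest.warns(..., match=...)` looks for the ethical-problem sentence. The sketch below illustrates the difference using only the standard library. The decorator `deprecated_sketch` is a hypothetical, simplified stand-in, and its message format is an assumption inferred from the filter string used in the tests.

```python
import functools
import re
import warnings


def deprecated_sketch(extra):
    """Hypothetical, simplified stand-in for a deprecation decorator."""

    def decorator(func):
        # Assumed message layout: a fixed prefix followed by the extra text.
        message = f"Function {func.__name__} is deprecated; {extra}"

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(message, category=FutureWarning)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated_sketch("The Boston housing prices dataset has an ethical problem.")
def load_boston():
    return "data"


# Record the warning instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_boston()

first_message = str(caught[0].message)

# A warning filter's message pattern must match the start of the message ...
prefix_matches = (
    re.match("Function load_boston is deprecated", first_message) is not None
)
# ... whereas a `match=` pattern is searched for anywhere in the message.
search_matches = re.search("ethical problem", first_message) is not None
```

This is why the filter string has to begin with the fixed prefix, while the `match=` argument can target text that appears later in the message.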
