Slep007 - feature names, their generation and the API #17
Changes from all commits
51fc476
28d7b84
3fcbefa
e13bc47
e9ca87b
c56dbe9
6cfe3c8
3e77b4f
ff5a991
9d380da
c5659bf
d0bb0e6
9b24545
081ed93
1baea78
51de1f7
50c6538
c37d9d6
17fe3d7
6b5533c
4ed249c
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,288 @@ | ||
.. _slep_007: | ||
|
||
=========================================== | ||
Feature names, their generation and the API | ||
=========================================== | ||
|
||
:Author: Adrin Jalali | ||
:Status: Under Review | ||
:Type: Standards Track | ||
:Created: 2019-04 | ||
|
||
Abstract | ||
######## | ||
|
||
This SLEP proposes the introduction of the ``feature_names_in_`` attribute for | ||
all estimators, and the ``feature_names_out_`` attribute for all transformers. | ||
Here we discuss the generation of these attributes and their propagation through | ||
pipelines. Since for most estimators there are multiple ways to generate | ||
feature names, this SLEP does not intend to define how exactly feature names | ||
are generated for all of them. | ||
|
||
Motivation | ||
########## | ||
|
||
``scikit-learn`` has been making it easier to build complex workflows with the | ||
``ColumnTransformer`` and it has been seeing widespread adoption. However, | ||
using it results in pipelines where it's not clear what the input features to | ||
the final predictor are, even more so than before. For example, after fitting | ||
the following pipeline, users should ideally be able to inspect the features | ||
going into the final predictor:: | ||
|
||
|
||
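from sklearn.compose import ColumnTransformer | ||
from sklearn.datasets import fetch_openml | ||
from sklearn.impute import SimpleImputer | ||
from sklearn.linear_model import LogisticRegression | ||
from sklearn.model_selection import train_test_split | ||
from sklearn.pipeline import Pipeline | ||
from sklearn.preprocessing import OneHotEncoder, StandardScaler | ||
|
||
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True) | ||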
|
||
# We will train our classifier with the following features: | ||
# Numeric Features: | ||
# - age: float. | ||
# - fare: float. | ||
# Categorical Features: | ||
# - embarked: categories encoded as strings {'C', 'S', 'Q'}. | ||
# - sex: categories encoded as strings {'female', 'male'}. | ||
# - pclass: ordinal integers {1, 2, 3}. | ||
|
||
# We create the preprocessing pipelines for both numeric and categorical data. | ||
numeric_features = ['age', 'fare'] | ||
numeric_transformer = Pipeline(steps=[ | ||
('imputer', SimpleImputer(strategy='median')), | ||
('scaler', StandardScaler())]) | ||
|
||
categorical_features = ['embarked', 'sex', 'pclass'] | ||
categorical_transformer = Pipeline(steps=[ | ||
('imputer', SimpleImputer(strategy='constant', fill_value='missing')), | ||
('onehot', OneHotEncoder(handle_unknown='ignore'))]) | ||
|
||
preprocessor = ColumnTransformer( | ||
transformers=[ | ||
('num', numeric_transformer, numeric_features), | ||
('cat', categorical_transformer, categorical_features)]) | ||
|
||
# Append classifier to preprocessing pipeline. | ||
# Now we have a full prediction pipeline. | ||
clf = Pipeline(steps=[('preprocessor', preprocessor), | ||
('classifier', LogisticRegression())]) | ||
|
||
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) | ||
|
||
clf.fit(X_train, y_train) | ||
|
||
|
||
However, it is nearly impossible to interpret or even sanity-check the | ||
``LogisticRegression`` instance produced in the example, because the pipeline | ||
gives no direct way to map its coefficients back to the input features. | ||
|
||
This proposal suggests adding two attributes to fitted estimators: | ||
``feature_names_in_`` and ``feature_names_out_``, such that in the example | ||
above ``clf[-1].feature_names_in_`` and ``clf[-2].feature_names_out_`` will | ||
both be:: | ||
|
||
['num__age', | ||
'num__fare', | ||
'cat__embarked_C', | ||
'cat__embarked_Q', | ||
'cat__embarked_S', | ||
'cat__embarked_missing', | ||
'cat__sex_female', | ||
'cat__sex_male', | ||
'cat__pclass_1', | ||
'cat__pclass_2', | ||
'cat__pclass_3'] | ||
|
||
Ideally, the generated feature names describe how a feature is generated at each | ||
stage of a pipeline. For instance, ``cat__sex_female`` shows that the feature | ||
has been through a categorical preprocessing pipeline, was originally the | ||
column ``sex``, and has been one-hot encoded, taking the value one where the | ||
original value was ``female``. However, this is not always possible or | ||
desirable, especially when a generated column is based on many columns, as in | ||
``PCA``, since the generated feature names would be too long. As a rule of | ||
thumb, the following types of transformers may generate feature names which | ||
correspond to the original features: | ||
|
||
- Leave columns unchanged, *e.g.* ``StandardScaler`` | ||
- Select a subset of columns, *e.g.* ``SelectKBest`` | ||
- Create new columns where each column depends on at most one input column, | ||
*e.g.* ``OneHotEncoder`` | ||
- Create combinations of a fixed number of features, *e.g.* | ||
``PolynomialFeatures``, as opposed to combining all of the features when | ||
there are many. Note that verbosity considerations and | ||
``verbose_feature_names``, as explained later, can apply here. | ||
|
||
This proposal talks about how feature names are generated and not how they are | ||
propagated. | ||
|
||
verbose_feature_names | ||
********************* | ||
|
||
``verbose_feature_names`` controls the verbosity of the generated feature names | ||
and it can be ``True`` or ``False``. Alternative solutions could include: | ||
|
||
- an integer, to fine-tune the verbosity of the generated feature names. | ||
- a ``callable``, which would give the user further flexibility to generate | ||
user-defined feature names. | ||
Review comments on lines +121 to +122:
- simpler than a callable would be to accept a format string with a specific language, i.e. users may pass and on our side we would do so that the final name would be A given string could actually be the default, to exemplify the use
- Which would change the parameter name to
- yes, we can talk about alternatives, but this SLEP says we're not implementing any of them for now anyway.
|
||
|
||
These alternatives may be discussed and implemented in the future if deemed | ||
necessary. | ||
|
||
Scope | ||
##### | ||
|
||
The API for input and output feature names includes a ``feature_names_in_`` | ||
attribute for all estimators, and a ``feature_names_out_`` attribute, exposing | ||
the generated feature names, for any estimator with a ``transform`` method. | ||
|
||
Note that this SLEP also applies to `resamplers | ||
<https://github.com/scikit-learn/enhancement_proposals/pull/15>`_ the same way | ||
as transformers. | ||
|
||
Input Feature Names | ||
################### | ||
|
||
The input feature names are stored in a fitted estimator in a | ||
``feature_names_in_`` attribute, and are taken from the given input data, for | ||
instance a ``pandas`` data frame. This attribute will be ``None`` if the input | ||
provides no feature names. | ||
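
The rule above can be sketched in a few lines. This is only an illustration, assuming that "provides feature names" means the input exposes a ``columns`` attribute, as a ``pandas`` data frame does. The helper ``infer_feature_names_in`` and the ``FrameLike`` class are hypothetical names and not part of scikit-learn.

```python
def infer_feature_names_in(X):
    """Return the column names carried by X, or None if it has none.

    Assumption: named inputs expose a ``columns`` attribute, the way a
    pandas data frame does. Anything else is treated as unnamed.
    """
    columns = getattr(X, "columns", None)
    if columns is None:
        return None
    return [str(name) for name in columns]


class FrameLike:
    """Tiny stand-in for a data frame, used only for this illustration."""

    def __init__(self, columns):
        self.columns = columns


print(infer_feature_names_in(FrameLike(["age", "fare"])))  # ['age', 'fare']
print(infer_feature_names_in([[22.0, 7.25]]))  # None
```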
|
||
Output Feature Names | ||
#################### | ||
|
||
A fitted estimator exposes the output feature names through the | ||
``feature_names_out_`` attribute. Here we discuss in more detail how these | ||
feature names are generated. Since for most estimators there are multiple ways | ||
to generate feature names, this SLEP does not intend to define how exactly | ||
feature names are generated for all of them. It is instead a guideline on how | ||
they could generally be generated. Furthermore, the specific behavior of a | ||
given estimator may be tuned via the ``verbose_feature_names`` parameter, as | ||
detailed below. | ||
|
||
As detailed below, some generated output feature names are the same as, or | ||
derived from, the input feature names. In such cases, if no input feature names | ||
are provided, ``x0`` to ``xn`` are assumed to be their names. | ||
|
||
Feature Selector Transformers | ||
***************************** | ||
|
||
This includes transformers which output a subset of the input features without | ||
changing them. For example, if a ``SelectKBest`` transformer selects the first | ||
and the third features, and no names are provided, the ``feature_names_out_`` | ||
will be ``[x0, x2]``. | ||
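
A minimal sketch of this behavior, assuming the selector's choice is available as a boolean mask over the input columns. Both function names are hypothetical and only illustrate the naming rule; they are not scikit-learn API.

```python
def resolve_input_names(feature_names_in, n_features):
    """Fall back to x0, x1, ... when the input carried no names."""
    if feature_names_in is None:
        return [f"x{i}" for i in range(n_features)]
    return list(feature_names_in)


def selected_feature_names(feature_names_in, support_mask):
    """Names of the columns kept by a selector, given a boolean mask."""
    names = resolve_input_names(feature_names_in, len(support_mask))
    return [name for name, keep in zip(names, support_mask) if keep]


# The first and third of three unnamed features are kept.
print(selected_feature_names(None, [True, False, True]))  # ['x0', 'x2']
```

When names are provided, the same mask simply filters them, so ``['a', 'b', 'c']`` with the mask above gives ``['a', 'c']``.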
|
||
Feature Generating Transformers | ||
******************************* | ||
|
||
The simplest transformers in this section are the ones which generate a | ||
column based on a single given column. The name of the generated output column | ||
in this case is a sensible transformation of the input feature name. For | ||
instance, a ``LogTransformer`` can do ``'age' -> 'log(age)'``, and a | ||
``OneHotEncoder`` could do ``'gender' -> 'gender_female', 'gender_fluid', | ||
...``. An alternative is to leave the feature names unchanged when each output | ||
feature corresponds to exactly one input feature. Whether or not to modify the | ||
feature name, *e.g.* ``log(x0)`` vs. ``x0``, may be controlled via the | ||
``verbose_feature_names`` constructor parameter. The default value of | ||
``verbose_feature_names`` can be different depending on the transformer. For | ||
instance, ``StandardScaler`` can have it as ``False``, whereas | ||
``LogTransformer`` could have it as ``True`` by default. | ||
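
The one-to-one case could look roughly like the sketch below. The helpers are hypothetical and stand in for what a ``LogTransformer`` or a ``OneHotEncoder`` might do internally. The ``_`` separator between column name and category is taken from the example in the text.

```python
def log_feature_names(feature_names_in, verbose_feature_names=True):
    """Wrap each name in log(...) when verbose, otherwise leave it unchanged."""
    if verbose_feature_names:
        return [f"log({name})" for name in feature_names_in]
    return list(feature_names_in)


def one_hot_feature_names(feature_name, categories):
    """One output name per category, derived from a single input column."""
    return [f"{feature_name}_{category}" for category in categories]


print(log_feature_names(["age"]))  # ['log(age)']
print(log_feature_names(["x0"], verbose_feature_names=False))  # ['x0']
print(one_hot_feature_names("gender", ["female", "fluid"]))
# ['gender_female', 'gender_fluid']
```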
|
||
Transformers where each output feature depends on a fixed number of input | ||
features may generate descriptive names as well. For instance, a | ||
``PolynomialFeatures`` transformer on a small subset of features can generate | ||
an output feature name such as ``x[0] * x[2] ** 3``. | ||
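
As an illustration, a name of that shape can be assembled from the exponent applied to each input feature. The function below is a hypothetical sketch and assumes one integer power per input column. It does not claim to reproduce how scikit-learn formats polynomial feature names.

```python
def polynomial_feature_name(feature_names_in, powers):
    """Build a descriptive name from the exponent of each input feature.

    Features with exponent 0 are omitted. An all-zero exponent vector is
    named "1" here (an assumption, standing for the bias term).
    """
    terms = []
    for name, power in zip(feature_names_in, powers):
        if power == 0:
            continue
        terms.append(name if power == 1 else f"{name} ** {power}")
    return " * ".join(terms) if terms else "1"


print(polynomial_feature_name(["x[0]", "x[1]", "x[2]"], (1, 0, 3)))
# x[0] * x[2] ** 3
```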
|
||
Finally, transformers where each output feature depends on many or all | ||
input features generate feature names of the form ``name0`` to | ||
``namen``, where ``name`` represents the transformer. For instance, a ``PCA`` | ||
transformer with ``n`` components will output ``[pca0, ..., pca{n-1}]``. | ||
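
A short sketch of this naming scheme, assuming zero-based numbering as in the examples later in this document. The helper name and the choice to pass the prefix as a plain string are illustrative only.

```python
def component_feature_names(transformer_name, n_components):
    """Generic names for outputs that mix many input features."""
    return [f"{transformer_name}{i}" for i in range(n_components)]


print(component_feature_names("pca", 3))  # ['pca0', 'pca1', 'pca2']
```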
|
||
Meta-Estimators | ||
*************** | ||
Review comment: Shouldn't we list all meta-estimators that are transformers? What about FeatureUnion and RFECV? I guess maybe we're talking about meta-estimators that are not feature selectors, because those are easy.
|
||
|
||
Meta-estimators can choose whether or not to prefix the output feature names | ||
given by the estimators they are wrapping. | ||
|
||
By default, ``Pipeline`` adds no prefix, *i.e.* its ``feature_names_out_`` is | ||
the same as the ``feature_names_out_`` of the last step, and ``None`` if the | ||
last step is not a transformer. | ||
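
The ``Pipeline`` rule can be sketched as follows. ``FittedStep`` is a hypothetical stand-in for a fitted estimator, and a step is assumed to count as a transformer when it has a ``transform`` method.

```python
class FittedStep:
    """Stand-in for a fitted estimator; not a scikit-learn class."""

    def __init__(self, feature_names_out=None, is_transformer=True):
        if is_transformer:
            self.feature_names_out_ = feature_names_out
            self.transform = lambda X: X  # marks the step as a transformer


def pipeline_feature_names_out(steps):
    """Names exposed by a pipeline: those of its last step, if it transforms."""
    last_step = steps[-1]
    if not hasattr(last_step, "transform"):
        return None
    return last_step.feature_names_out_


selector = FittedStep(["f2", "f17", "f42"])
pca = FittedStep(["pca0", "pca1", "pca2"])
classifier = FittedStep(is_transformer=False)

print(pipeline_feature_names_out([selector, pca]))  # ['pca0', 'pca1', 'pca2']
print(pipeline_feature_names_out([selector, classifier]))  # None
```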
|
||
``ColumnTransformer`` by default adds a prefix to the output feature names, | ||
indicating the name of the transformer applied to them. If a column is in the output | ||
as a part of ``passthrough``, it won't be prefixed since no operation has been | ||
applied on it. | ||
|
||
This is the default behavior, and it can be tuned by constructor parameters if | ||
the meta-estimator allows it. For instance, ``verbose_feature_names=False`` | ||
may indicate that a ``ColumnTransformer`` should not prefix the generated | ||
feature names with the name of the step. | ||
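
The prefixing rule for ``ColumnTransformer`` might be sketched like this. The function is hypothetical. It assumes the ``__`` separator used in the Motivation example, and it marks passthrough columns with a transformer name of ``None``, which is purely a convention of this sketch.

```python
def column_transformer_feature_names(named_outputs, verbose_feature_names=True):
    """Combine per-transformer output names, optionally prefixing them.

    named_outputs is a list of (transformer_name, feature_names_out) pairs.
    A transformer_name of None marks passthrough columns, which are never
    prefixed because no operation has been applied to them.
    """
    names = []
    for transformer_name, feature_names_out in named_outputs:
        for feature_name in feature_names_out:
            if verbose_feature_names and transformer_name is not None:
                names.append(f"{transformer_name}__{feature_name}")
            else:
                names.append(feature_name)
    return names


outputs = [
    ("ohe", ["model_100", "make_ABC"]),
    ("pca", ["pca0", "pca1"]),
    (None, ["year"]),  # passthrough column
]
print(column_transformer_feature_names(outputs))
# ['ohe__model_100', 'ohe__make_ABC', 'pca__pca0', 'pca__pca1', 'year']
print(column_transformer_feature_names(outputs, verbose_feature_names=False))
# ['model_100', 'make_ABC', 'pca0', 'pca1', 'year']
```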
|
||
Examples | ||
######## | ||
|
||
Here we include some examples to demonstrate the behavior of output feature | ||
names:: | ||
|
||
100 features (no names) -> PCA(n_components=3) | ||
feature_names_out_: [pca0, pca1, pca2] | ||
|
||
|
||
100 features (no names) -> SelectKBest(k=3) | ||
feature_names_out_: [x2, x17, x42] | ||
|
||
|
||
[f1, ..., f100] -> SelectKBest(k=3) | ||
feature_names_out_: [f2, f17, f42] | ||
|
||
|
||
[cat0] -> OneHotEncoder() | ||
feature_names_out_: [cat0_cat, cat0_dog, ...] | ||
|
||
|
||
[f1, ..., f100] -> Pipeline( | ||
[SelectKBest(k=30), | ||
PCA(n_components=3)] | ||
) | ||
feature_names_out_: [pca0, pca1, pca2] | ||
|
||
|
||
[model, make, numeric0, ..., numeric100] -> | ||
ColumnTransformer( | ||
[('cat', Pipeline(SimpleImputer(), OneHotEncoder()), | ||
['model', 'make']), | ||
('num', Pipeline(SimpleImputer(), PCA(n_components=3)), | ||
['numeric0', ..., 'numeric100'])] | ||
) | ||
feature_names_out_: ['cat_model_100', 'cat_model_200', ..., | ||
'cat_make_ABC', 'cat_make_XYZ', ..., | ||
'num_pca0', 'num_pca1', 'num_pca2'] | ||
|
||
However, the following example produces somewhat redundant feature names, | ||
hence the relevance of ``verbose_feature_names=False``:: | ||
|
||
[model, make, numeric0, ..., numeric100] -> | ||
ColumnTransformer([ | ||
('ohe', OneHotEncoder(), ['model', 'make']), | ||
('pca', PCA(n_components=3), ['numeric0', ..., 'numeric100']) | ||
]) | ||
feature_names_out_: ['ohe_model_100', 'ohe_model_200', ..., | ||
'ohe_make_ABC', 'ohe_make_XYZ', ..., | ||
'pca_pca0', 'pca_pca1', 'pca_pca2'] | ||
|
||
If desired, the user can remove the prefixes:: | ||
|
||
[model, make, numeric0, ..., numeric100] -> | ||
make_column_transformer( | ||
(OneHotEncoder(), ['model', 'make']), | ||
(PCA(n_components=3), ['numeric0', ..., 'numeric100']), | ||
verbose_feature_names=False | ||
) | ||
feature_names_out_: ['model_100', 'model_200', ..., | ||
'make_ABC', 'make_XYZ', ..., | ||
'pca0', 'pca1', 'pca2'] | ||
|
||
Backward Compatibility | ||
###################### | ||
|
||
All estimators should implement the ``feature_names_in_`` and | ||
``feature_names_out_`` API. This is checked in ``check_estimator``, and the | ||
transition is done with a ``FutureWarning`` for at least two versions to give | ||
time to third party developers to implement the API. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,11 @@ | ||
SLEPs under review | ||
================== | ||
|
||
-No SLEP is currently under review. | ||
+.. No SLEP is currently under review. | ||
|
||
.. Uncomment below when a SLEP is under review | ||
|
||
-.. .. toctree:: | ||
-.. :maxdepth: 1 | ||
+.. toctree:: | ||
+   :maxdepth: 1 | ||
|
||
-.. slepXXX/proposal | ||
+   slep007/proposal | ||
|