6.1. Pipelines and composite estimators¶
Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transform y). In contrast, Pipelines only transform the observed data (X).
6.1.1. Pipeline: chaining estimators¶
Pipeline
can be used to chain multiple estimators
into one. This is useful as there is often a fixed sequence
of steps in processing the data, for example feature selection, normalization
and classification. Pipeline
serves multiple purposes here:
- Convenience and encapsulation
You only have to call fit and predict once on your data to fit a whole sequence of estimators.
- Joint parameter selection
You can grid search over parameters of all estimators in the pipeline at once.
- Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
Note
Calling fit
on the pipeline is the same as calling fit
on
each estimator in turn, transform
the input and pass it on to the next step.
The pipeline has all the methods that the last estimator in the pipeline has,
i.e. if the last estimator is a classifier, the Pipeline
can be used
as a classifier. If the last estimator is a transformer, again, so is the
pipeline.
6.1.1.1. Usage¶
6.1.1.1.1. Build a pipeline¶
The Pipeline
is built using a list of (key, value)
pairs, where
the key
is a string containing the name you want to give this step and value
is an estimator object:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])
Shorthand version using :func:`make_pipeline`
Click for more details
The utility function make_pipeline
is a shorthand
for constructing pipelines;
it takes a variable number of estimators and returns a pipeline,
filling in the names automatically:
>>> from sklearn.pipeline import make_pipeline
>>> make_pipeline(PCA(), SVC())
Pipeline(steps=[('pca', PCA()), ('svc', SVC())])