3.3. Metrics and scoring: quantifying the quality of predictions

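There are 3 different APIs for evaluating the quality of a model’s predictions:

  • Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. This is not discussed on this page, but in each estimator’s documentation.

  • Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section The scoring parameter: defining model evaluation rules.

  • Metric functions: The sklearn.metrics module implements functions assessing prediction error for specific purposes.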

Finally, Dummy estimators are useful to get a baseline value of those metrics for random predictions.
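As a minimal sketch of such a baseline, the snippet below compares a DummyClassifier that guesses classes uniformly at random with an SVC on the iris data. The dataset, the "uniform" strategy, and the variable names are illustrative choices, not something this section prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Baseline: ignores X and guesses a class uniformly at random.
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline_scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")

# Real model, evaluated with the same cv setting and the same metric.
model_scores = cross_val_score(SVC(random_state=0), X, y, cv=5, scoring="accuracy")

print(f"baseline accuracy: {baseline_scores.mean():.2f}")
print(f"SVC accuracy:      {model_scores.mean():.2f}")
```

A model whose score is not clearly above the dummy baseline has probably not learned anything useful from the features.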

See also

For “pairwise” metrics, between samples and not estimators or predictions, see the Pairwise metrics, Affinities and Kernels section.

3.3.1. The scoring parameter: defining model evaluation rules

Model selection and evaluation tools, such as model_selection.GridSearchCV and model_selection.cross_val_score, take a scoring parameter that controls which metric they apply to the estimators they evaluate.

3.3.1.1. Common cases: predefined values

For the most common use cases, you can designate a scorer object with the scoring parameter; the table below shows all possible values. All scorer objects follow the convention that higher return values are better than lower return values. Thus metrics that measure the distance between the model and the data, like metrics.mean_squared_error, are available as neg_mean_squared_error, which returns the negated value of the metric.

Scoring                               Function                                 Comment

Classification
‘accuracy’                            metrics.accuracy_score
‘balanced_accuracy’                   metrics.balanced_accuracy_score
‘top_k_accuracy’                      metrics.top_k_accuracy_score
‘average_precision’                   metrics.average_precision_score
‘neg_brier_score’                     metrics.brier_score_loss
‘f1’                                  metrics.f1_score                         for binary targets
‘f1_micro’                            metrics.f1_score                         micro-averaged
‘f1_macro’                            metrics.f1_score                         macro-averaged
‘f1_weighted’                         metrics.f1_score                         weighted average
‘f1_samples’                          metrics.f1_score                         by multilabel sample
‘neg_log_loss’                        metrics.log_loss                         requires predict_proba support
‘precision’ etc.                      metrics.precision_score                  suffixes apply as with ‘f1’
‘recall’ etc.                         metrics.recall_score                     suffixes apply as with ‘f1’
‘jaccard’ etc.                        metrics.jaccard_score                    suffixes apply as with ‘f1’
‘roc_auc’                             metrics.roc_auc_score
‘roc_auc_ovr’                         metrics.roc_auc_score
‘roc_auc_ovo’                         metrics.roc_auc_score
‘roc_auc_ovr_weighted’                metrics.roc_auc_score
‘roc_auc_ovo_weighted’                metrics.roc_auc_score

Clustering
‘adjusted_mutual_info_score’          metrics.adjusted_mutual_info_score
‘adjusted_rand_score’                 metrics.adjusted_rand_score
‘completeness_score’                  metrics.completeness_score
‘fowlkes_mallows_score’               metrics.fowlkes_mallows_score
‘homogeneity_score’                   metrics.homogeneity_score
‘mutual_info_score’                   metrics.mutual_info_score
‘normalized_mutual_info_score’        metrics.normalized_mutual_info_score
‘rand_score’                          metrics.rand_score
‘v_measure_score’                     metrics.v_measure_score

Regression
‘explained_variance’                  metrics.explained_variance_score
‘max_error’                           metrics.max_error
‘neg_mean_absolute_error’             metrics.mean_absolute_error
‘neg_mean_squared_error’              metrics.mean_squared_error
‘neg_root_mean_squared_error’         metrics.mean_squared_error
‘neg_mean_squared_log_error’          metrics.mean_squared_log_error
‘neg_median_absolute_error’           metrics.median_absolute_error
‘r2’                                  metrics.r2_score
‘neg_mean_poisson_deviance’           metrics.mean_poisson_deviance
‘neg_mean_gamma_deviance’             metrics.mean_gamma_deviance
‘neg_mean_absolute_percentage_error’  metrics.mean_absolute_percentage_error
‘d2_absolute_error_score’             metrics.d2_absolute_error_score
‘d2_pinball_score’                    metrics.d2_pinball_score
‘d2_tweedie_score’                    metrics.d2_tweedie_score

Usage examples:

>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import cross_val_score
>>> X, y = datasets.load_iris(return_X_y=True)
>>> clf = svm.SVC(random_state=0)
>>> cross_val_score(clf, X, y, cv=5, scoring='recall_macro')
array([0.96..., 0.96..., 0.96..., 0.93..., 1.        ])
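The sign convention for error metrics can be illustrated with a short sketch. It assumes a synthetic regression problem from make_regression and a LinearRegression model, both chosen only for illustration. The 'neg_mean_squared_error' scorer returns non-positive values, and negating them recovers the usual mean squared error.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used here only to have something to score.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Scorers follow "higher is better", so the MSE comes back negated.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to report the metric in its familiar form.
mse = -neg_mse
print(neg_mse)
print(mse)
```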

Note

If a wrong scoring name is passed, an InvalidParameterError is raised. You can retrieve the names of all available scorers by calling get_scorer_names.

3.3.1.2. Defining your scoring strategy from metric functions

The module sklearn.metrics also exposes a set of simple functions measuring a prediction error given ground truth and prediction:

  • functions ending with _score return a value to maximize, the higher the better.

  • functions ending with _error or _loss return a value to minimize, the lower the better. When converting into a scorer object using make_scorer, set the greater_is_better parameter to False (True by default; see the parameter description below).
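The following sketch shows the effect of greater_is_better=False on an _error function. The toy data and the DummyRegressor are assumptions made for illustration; the regressor always predicts the training mean, which keeps the arithmetic easy to check by hand.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_error

# Toy data: the features are irrelevant because the dummy model ignores them.
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 6.0])

# Always predicts mean(y) = 3.0.
reg = DummyRegressor(strategy="mean").fit(X, y)

# Plain metric: mean of |1-3|, |2-3|, |3-3|, |6-3| = 6 / 4 = 1.5
mae = mean_absolute_error(y, reg.predict(X))

# Scorer built from a loss: the sign is flipped so that higher is better.
neg_mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
score = neg_mae_scorer(reg, X, y)

print(mae)    # 1.5
print(score)  # -1.5
```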

Metrics available for various machine learning tasks are detailed in sections below.

Many metrics are not given names to be used as scoring values, sometimes because they require additional parameters, such as fbeta_score. In such cases, you need to generate an appropriate scoring object. The simplest way to generate a callable object for scoring is by using make_scorer. That function converts metrics into callables that can be used for model evaluation.

One typical use case is to wrap an existing metric function from the library with non-default values for its parameters, such as the beta parameter for the fbeta_score function:

>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(dual="auto"), param_grid={'C': [1, 10]},
...                     scoring=ftwo_scorer, cv=5)
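To see such a scorer in action, the grid search has to be fitted. The sketch below is one way to do that, under a few assumptions: a synthetic binary problem from make_classification (fbeta_score defaults to binary averaging), and LinearSVC(dual=False) instead of dual="auto" so that the snippet also runs on releases that do not accept "auto".

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# F-beta with beta=2 weights recall more heavily than precision.
ftwo_scorer = make_scorer(fbeta_score, beta=2)

grid = GridSearchCV(
    LinearSVC(dual=False),  # assumption: dual=False rather than dual="auto"
    param_grid={"C": [1, 10]},
    scoring=ftwo_scorer,
    cv=5,
)
grid.fit(X, y)

# best_score_ is the mean cross-validated F2 of the best parameter setting.
best_c = grid.best_params_["C"]
best_f2 = grid.best_score_
print(best_c, best_f2)
```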

Custom scorer objects

The second use case is to build a completely custom scorer object from a simple Python function using make_scorer, which can take several parameters:

  • the Python function you want to use (my_custom_loss_func in the example below)

  • whether the Python function returns a score (greater_is_better=True, the default) or a loss (greater_is_better=False). If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

  • for classification metrics only: whether the Python function you provided requires continuous decision certainties (needs_threshold=True). The default value is False.

  • any additional parameters, such as beta or labels in f1_score.

Here is an example of building custom scorers, and of using the greater_is_better parameter:

>>> import numpy as np
>>> def my_custom_loss_func(y_true, y_pred):
...     diff = np.abs(y_true - y_pred).max()
...     return np.log1p(diff)
...
>>> # score will negate the return value of my_custom_loss_func,
>>> # which will be np.log(2), 0.693, given the values for X
>>> # and y defined below.
>>> score = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> X = [[1], [1]]
>>> y = [0, 1]
>>> from sklearn.dummy import DummyClassifier
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf = clf.fit(X, y)
>>> my_custom_loss_func(y, clf.predict(X))
0.69...
>>> score(clf, X, y)
-0.69...