1.4. Support Vector Machines
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
The advantages of support vector machines are:
Effective in high dimensional spaces.
Still effective in cases where number of dimensions is greater than the number of samples.
Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below, and the sketch after this list).
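As a minimal sketch of that probability machinery (the toy data here is purely illustrative; the internal cross-validation is what makes this option costly):

>>> from sklearn import svm
>>> X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]]
>>> y = [0, 0, 0, 0, 1, 1, 1, 1]
>>> prob_clf = svm.SVC(probability=True).fit(X, y)  # enables the internal cross-validation
>>> prob_clf.predict_proba([[2.5, 2.5]]).shape      # one row per sample, one column per class
(1, 2)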
The support vector machines in scikit-learn support both dense (numpy.ndarray and convertible to that by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. For optimal performance, use C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64.
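For illustration, a minimal sketch of fitting and predicting on sparse input (any scipy.sparse format accepted by the estimator behaves similarly; CSR with float64 is the layout recommended above):

>>> import numpy as np
>>> from scipy import sparse
>>> from sklearn import svm
>>> X_sparse = sparse.csr_matrix(np.array([[0., 0.], [1., 1.]], dtype=np.float64))
>>> clf_sparse = svm.SVC().fit(X_sparse, [0, 1])       # fit on sparse data ...
>>> clf_sparse.predict(sparse.csr_matrix([[2., 2.]]))  # ... so sparse predictions are supported
array([1])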
1.4.1. Classification
SVC, NuSVC and LinearSVC are classes capable of performing binary and multi-class classification on a dataset.

SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical formulations (see section Mathematical formulation). On the other hand, LinearSVC is another (faster) implementation of Support Vector Classification for the case of a linear kernel. It also lacks some of the attributes of SVC and NuSVC, like support_. LinearSVC uses the squared_hinge loss and, due to its implementation in liblinear, it also regularizes the intercept, if considered. This effect can, however, be reduced by carefully fine-tuning its intercept_scaling parameter, which allows the intercept term to have a different regularization behavior compared to the other features. The classification results and score can therefore differ from the other two classifiers.
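As a minimal sketch of that knob (the value 10 is purely illustrative, not a recommendation):

>>> from sklearn import svm
>>> X, y = [[0, 0], [1, 1]], [0, 1]
>>> lsvc = svm.LinearSVC(dual="auto", intercept_scaling=10).fit(X, y)  # larger values lessen intercept regularization
>>> lsvc.intercept_.shape
(1,)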
Like other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of shape (n_samples, n_features) holding the training samples, and an array y of class labels (strings or integers), of shape (n_samples):
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC()
After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])
The SVM decision function (detailed in the Mathematical formulation) depends on some subset of the training data, called the support vectors. Some properties of these support vectors can be found in the attributes support_vectors_, support_ and n_support_:
>>> # get support vectors
>>> clf.support_vectors_
array([[0., 0.],
       [1., 1.]])
>>> # get indices of support vectors
>>> clf.support_
array([0, 1]...)
>>> # get number of support vectors for each class
>>> clf.n_support_
array([1, 1]...)
1.4.1.1. Multi-class classification
SVC and NuSVC implement the “one-versus-one” approach for multi-class classification. In total, n_classes * (n_classes - 1) / 2 classifiers are constructed and each one trains data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option allows monotonically transforming the results of the “one-versus-one” classifiers into a “one-vs-rest” decision function of shape (n_samples, n_classes).
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(decision_function_shape='ovo')
>>> clf.fit(X, Y)
SVC(decision_function_shape='ovo')
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> clf.decision_function_shape = "ovr"
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes
4
On the other hand, LinearSVC implements the “one-vs-the-rest” multi-class strategy, thus training n_classes models.
>>> lin_clf = svm.LinearSVC(dual="auto")
>>> lin_clf.fit(X, Y)
LinearSVC(dual='auto')
>>> dec = lin_clf.decision_function([[1]])
>>> dec.shape[1]
4
See Mathematical formulation for a complete description of the decision function.
Details on multi-class strategies
Note that the LinearSVC also implements an alternative multi-class strategy, the so-called multi-class SVM formulated by Crammer and Singer [16], by using the option multi_class='crammer_singer'. In practice, one-vs-rest classification is usually preferred, since the results are mostly similar, but the runtime is significantly less.
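A minimal sketch of selecting that strategy on the toy problem above (output shapes only, since the learned values depend on solver details):

>>> from sklearn import svm
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> cs_clf = svm.LinearSVC(dual="auto", multi_class='crammer_singer').fit(X, Y)
>>> cs_clf.decision_function([[1]]).shape[1]  # one score per class, as with "ovr"
4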
For “one-vs-rest” LinearSVC the attributes coef_ and intercept_ have the shape (n_classes, n_features) and (n_classes,) respectively. Each row of the coefficients corresponds to one of the n_classes “one-vs-rest” classifiers, and similarly for the intercepts, in the order of the “one” class.
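For instance, a sketch with four classes and two features (shapes only; the coefficient values themselves depend on the data):

>>> from sklearn import svm
>>> X = [[0, 0], [1, 0], [0, 1], [1, 1]]
>>> y = [0, 1, 2, 3]
>>> ovr_clf = svm.LinearSVC(dual="auto").fit(X, y)
>>> ovr_clf.coef_.shape       # (n_classes, n_features)
(4, 2)
>>> ovr_clf.intercept_.shape  # (n_classes,)
(4,)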
In the case of “one-vs-one” SVC and NuSVC, the layout of the attributes is a little more involved. In the case of a linear kernel, the attributes coef_ and intercept_ have the shape (n_classes * (n_classes - 1) / 2, n_features) and (n_classes * (n_classes - 1) / 2) respectively. This is similar to the layout for LinearSVC described above, with each row now corresponding to a binary classifier. The order for classes 0 to n is “0 vs 1”, “0 vs 2”, …, “0 vs n”, “1 vs 2”, “1 vs 3”, …, “1 vs n”, …, “n-1 vs n”.
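The same kind of sketch for the one-vs-one case (a linear kernel is required for coef_ to be defined on SVC):

>>> from sklearn import svm
>>> X = [[0, 0], [1, 0], [0, 1], [1, 1]]
>>> y = [0, 1, 2, 3]
>>> ovo_clf = svm.SVC(kernel='linear').fit(X, y)
>>> ovo_clf.coef_.shape       # (n_classes * (n_classes - 1) / 2, n_features)
(6, 2)
>>> ovo_clf.intercept_.shape  # (n_classes * (n_classes - 1) / 2,)
(6,)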
The shape of dual_coef_ is (n_classes - 1, n_SV) with a somewhat hard to grasp layout. The columns correspond to the support vectors involved in any of the n_classes * (n_classes - 1) / 2 “one-vs-one” classifiers. Each support vector v has a dual coefficient in each of the n_classes - 1 classifiers comparing the class of v against another class. Note that some, but not all, of these dual coefficients may be zero. The n_classes - 1 entries in each column are these dual coefficients, ordered by the opposing class.

This might be clearer with an example: consider a three class problem with class 0 having three support vectors \(v^{0}_0, v^{1}_0, v^{2}_0\) and classes 1 and 2 having two support vectors \(v^{0}_1, v^{1}_1\) and \(v^{0}_2, v^{1}_2\) respectively. For each support vector \(v^{j}_i\), there are two dual coefficients. Let’s call the coefficient of support vector \(v^{j}_i\) in the classifier between classes \(i\) and \(k\) \(\alpha^{j}_{i,k}\). Then dual_coef_ looks like this:
\(\begin{array}{ccc|cc|cc}
\alpha^{0}_{0,1} & \alpha^{1}_{0,1} & \alpha^{2}_{0,1} & \alpha^{0}_{1,0} & \alpha^{1}_{1,0} & \alpha^{0}_{2,0} & \alpha^{1}_{2,0} \\
\alpha^{0}_{0,2} & \alpha^{1}_{0,2} & \alpha^{2}_{0,2} & \alpha^{0}_{1,2} & \alpha^{1}_{1,2} & \alpha^{0}_{2,1} & \alpha^{1}_{2,1}
\end{array}\)

The first three columns hold the coefficients for the support vectors of class 0, the next two columns those for the support vectors of class 1, and the last two columns those for the support vectors of class 2.
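To make the shape concrete, here is a hedged sketch on a small three-class problem (the number of columns equals the number of support vectors, which depends on the data):

>>> from sklearn import svm
>>> X = [[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 3], [5, 5]]
>>> y = [0, 0, 0, 1, 1, 2, 2]
>>> clf3 = svm.SVC().fit(X, y)
>>> clf3.dual_coef_.shape[0]  # n_classes - 1 rows
2
>>> clf3.dual_coef_.shape[1] == clf3.support_vectors_.shape[0]  # one column per support vector
True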