forked from root-project/root
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathSVM.tex
More file actions
272 lines (246 loc) · 13.7 KB
/
SVM.tex
File metadata and controls
272 lines (246 loc) · 13.7 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
\subsection{Support Vector Machine (SVM)}
\label{sec:SVM}
In the early 1960s a linear support vector method\index{Support vector machine, SVM}
has been developed for the construction of
separating hyperplanes for pattern recognition problems~\cite{Vapnik1963,Vapnik1964}.
It took 30 years before the method was generalised to nonlinear separating
functions~\cite{Vapnik1992,Vapnik1995a} and for estimating real-valued functions
(regression)~\cite{Vapnik1995b}. At that moment it became a general purpose algorithm,
performing classification and regression tasks which can compete with neural networks
and probability density estimators. Typical applications of SVMs include text
categorisation, character recognition, bio-informatics and face detection.
The main idea of the SVM approach to classification problems is to build a hyperplane
that separates signal and background {\em vectors} (events) using only a minimal subset
of all training vectors ({\em support vectors}). The position of the hyperplane is
obtained by maximizing the margin (distance) between it and the support vectors.
The extension to nonlinear SVMs is performed by mapping the input vectors onto
a higher dimensional feature space in which signal and background events can be
separated by a linear procedure using an optimally separating hyperplane. The use
of kernel functions eliminates thereby the explicit transformation to the feature
space and simplifies the computation.
The implementation of the newly introduced regression is similar to the approach in classification.
It also maps input data into higher dimensional space using previously chosen support
vectors. Instead of separating events of two types, it determines the hyperplane with
events of the same value (which is equal to the {\em mean} from all training events).
The final value is estimated based on the distance to the hyperplane which is computed by the selected kernel function.% Comment: currently we can only select Gaus, soon we will have most of the old kernel options back
\subsubsection{Booking options}
The SVM classifier is booked via the command:
\begin{codeexample}
\begin{tmvacode}
factory->BookMethod( TMVA::Types::kSVM, "SVM", "<options>" );
\end{tmvacode}
\caption[.]{\codeexampleCaptionSize Booking of the SVM classifier: the first
argument is a unique type enumerator, the second is a user-defined
name which must be unique among all booked classifiers, and the third argument
is the configuration option string. Individual options are separated by a ':'.
For options that are not set in the string default values are used.
See Sec.~\ref{sec:usingtmva:booking} for more information on the booking.}
\end{codeexample}
The configuration options for the SVM classifier are given in Option Table~\ref{opt:mva::svm}.
% ======= input option table ==========================================
\begin{option}[t]
\input optiontables/MVA__SVM.tex
\caption[.]{\optionCaptionSize
Configuration options reference for MVA method: {\em SVM}.
Values given are defaults. If predefined categories exist, the default category
is marked by a '$\star$'. The options in Option Table~\ref{opt:mva::methodbase} on
page~\pageref{opt:mva::methodbase} can also be configured.
% Definition of the kernel function: \code{Linear} is
% $K(\vec x, \vec y) = \vec x \cdot \vec y $ (no extra parameters),
% \code{Polynomial} is $K(\vec x, \vec y) = (\vec x \cdot \vec y +\theta)^d$,
% \code{Gauss} is
% $K(\vec x, \vec y) = \exp\big(-\left|\vec x - \vec y \right|^2 /2\sigma^2\big)$,
% and \code{Sigmoid} corresponds to
% $K(\vec x, \vec y) = \tanh\left(\kappa (\vec x \cdot \vec y) +\theta \right)$
}
\label{opt:mva::svm}
\end{option}
% =====================================================================
\subsubsection{Description and implementation}
A detailed description of the SVM formalism can be found, for example,
in Ref.~\cite{Burges}. Here only a brief introduction of the TMVA
implementation is given.
\subsubsection*{Linear SVM}
\index{Support vector machine, SVM!linear}
%
\begin{figure}[t]
\centering
\includegraphics[width=0.47\textwidth]{plots/SVM}
\caption{ Hyperplane classifier in two dimensions. The vectors (events) ${\bf x}_{1-4}$
define the hyperplane and margin, \ie, they are the support vectors.
}
\label{fig:classifier}
\end{figure}
Consider a simple two-class classifier with oriented hyperplanes. If the training
data is linearly separable, a vector-scalar pair $(\vec w, b)$ can be found that
satisfies the constraints
%
\beq
\label{eq:SVM1}
y_i (\vec {x_i}\cdot \vec w+ b)-1 \ge 0\,, \space \mbox{\hspace{0.5cm}} \forall_i\,,
\eeq
%
where $\vec x_i$ are the input vectors, $y_i$ the desired outputs ($y_i=\pm 1$),
and where the pair $(\vec w,b)$ defines a hyperplane. The decision function of
the classifier is
$f(\vec x_i)= {\rm sign} (\vec x_i\cdot \vec w+ b)$, which is $+1$ for all points on
one side of the hyperplane and $-1$ for the points on the other side.
Intuitively, the classifier with the largest margin will give better separation.
The margin for this linear classifier is just $2/|\vec w|$. Hence to maximise the
margin, one needs to minimise the {\em cost function} $W = |\vec w|^2/w$ with the
constraints from Eq.~(\ref{eq:SVM1}).
At this point it is beneficial to consider the significance of different input
vectors $\vec x_i$. The training events lying on the margins, which are called
the support vectors (SV), are the events that contribute to defining the decision boundary
(see Fig.~\ref{fig:classifier}). Hence if the other events are removed from the training
sample and the classifier is retrained on the remaining events, the training will result
in the same decision boundary. To solve the constrained quadratic optimisation problem,
we first reformulate it in terms of a Lagrangian
%
\beq
{\cal L}( \vec w, b, \vec \alpha) = \frac{1}{2}\left| \vec w \right|^2-
\sum_i \alpha_i \left( y_i \left( ( \vec x_i \cdot \vec w ) + b\right)-1\right)
\eeq
%
where $\alpha_i \ge 0$ and the condition from Eq.~(\ref{eq:SVM1}) must be fulfilled.
The Lagrangian ${\cal L}$ is minimised with respect to $\vec w$ and $b$ and
maximised with respect to $\vec \alpha$. The solution has an expansion in terms
of a subset of input vectors for which $\alpha_i \ne 0$ (the support vectors):
%
\beq
\vec w = \sum_i \alpha_i y_i \vec x_i\,,
\eeq
%
because $\partial {\cal L} / \partial b =0$ and $\partial {\cal L}/\partial\vec w=0$
hold at the extremum. The optimisation problem translates to finding the vector
$\vec \alpha$ which maximises
%
\beq
\label{eq:SVMdot}
{\cal L}( \vec \alpha )= \sum_i \alpha_i -
\frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \vec x_i \cdot \vec x_j \,.
\eeq
%
Both the optimisation problem and the final decision function depend only on scalar
products between input vectors, which is a crucial property for the generalisation
to the nonlinear case.
\subsubsection*{Nonseparable data}
\label{SVM:nonseparable}
\index{Nonseparable data}
The above algorithm can be extended to non-separable data. The classification
constraints in Eq.~(\ref{eq:SVM1}) are modified by adding a ``slack'' variable $\xi_i$ to
it ($\xi_i=0$ if the vector is properly classified, otherwise $\xi_i$ is the distance
to the decision hyperplane)
%
\beq
\label{eq:SVM2}
y_i (\vec {x_i}\cdot \vec w+ b)-1 +\xi_i \ge 0,\space
\mbox{\hspace{1cm}} \xi_i\ge 0\,, \mbox{\hspace{0.5cm}} \forall_i\,.
\eeq
%
This admits a certain amount of misclassification. The training algorithm thus
minimises the modified cost function
%
\beq
\label{eq:SVM3}
W = \frac{1}{2}\left| \vec w \right|^2 + C \sum_i \xi_i\,,
\eeq
describing a trade-off between margin and misclassification. The cost parameter \code{C}
sets the scale by how much misclassification increases the cost function (see Tab.~\ref{opt:mva::svm}).
\subsubsection*{Nonlinear SVM}
\index{Support vector machine, SVM!nonlinear}
The SVM formulation given above can be further extended to build a nonlinear
SVM which can classify nonlinearly separable data.
%
Consider a function $\Phi:\;{\rm R}^\Nvar\to\cal H$, which maps the training data
from ${\rm R}^\Nvar$, where $\Nvar$ is the number of discriminating input variables,
to some higher dimensional space $\cal H$. In the $\cal H$ space the signal and background
events can be linearly separated so that the linear SVM formulation can be applied. We have
seen in Eq.~(\ref{eq:SVMdot}) that event variables only appear in the form of
scalar products $\vec x_i\cdot \vec x_j$, which become $\Phi(\vec x_i)\cdot \Phi(\vec x_j)$
in the higher dimensional feature space $\cal H$. The latter scalar product can be approximated
by a kernel function
%
\beq
K(\vec x_i, \vec x_j)\approx\Phi(\vec x_i)\cdot \Phi(\vec x_j)\,,
\eeq
%
which avoids the explicit computation of the mapping function $\Phi(\vec x)$.
This is desirable because the exact form of $\Phi(\vec x)$ is hard to derive from
the training data. Most frequently used kernel functions are
%
\beq
\label{SMV:kernels}
\def\smallEq{\hspace{-0.15cm}=}
\begin{array}{rcll}
K(\vec x, \vec y) &\smallEq
&\hspace{-0.15cm} (\vec x \cdot \vec y +\theta)^d
&\hspace{0.2cm}\text{\em Polynomial}, \\
K(\vec x, \vec y) &\smallEq
&\hspace{-0.15cm} \exp\left(-\left|\vec x - \vec y \right|^2 /2\sigma^2\right)
&\hspace{0.2cm} \text{\em Gaussian}, \\
K(\vec x, \vec y) &\smallEq
&\hspace{-0.15cm} \tanh\left(\kappa (\vec x \cdot \vec y) +\theta\right)
&\hspace{0.2cm} \text{\em Sigmoidal},
\end{array}
\eeq
of which currently only the Gaussian Kernel is implemented in TMVA.
%
It was shown in Ref.~\cite{Vapnik1995b} that a suitable function kernel must fulfill
Mercer's condition
%
\beq
\int K(\vec x, \vec y)g(\vec x)g(\vec y)d\vec x d\vec y \ge 0\,,
\eeq
%
for any function $g$ such that $\int g^2(\vec x) d\vec x$ is finite.
While Gaussian and polynomial kernels are known to comply with Mercer's condition,
this is not strictly the case for sigmoidal kernels. To extend the linear
methodology to nonlinear problems one substitutes $\vec x_i \cdot \vec x_j$
by $K(\vec x_i, \vec x_j)$ in Eq.~(\ref{eq:SVMdot}).
Due to Mercer's conditions on the kernel, the corresponding optimisation problem
is a well defined convex quadratic programming problem with a global minimum.
This is an advantage of SVMs compared to neural networks where local minima occur.
For regression problems, the same algorithm is used as for classification with the exception that instead of dividing
events based on their type (signal/background), it separates them based on the value
(larger/smaller than average). In the end, it does not return the sigmoid of the distance
between the event and the hyperplane, but the distance itself -- increased by the average target value.
\subsubsection*{Implementation}
The TMVA implementation of the Support Vector Machine follows closely the description
given in the literature. It employs a sequential minimal optimisation (SMO)~\cite{Platt}
to solve the quadratic problem. Acceleration of the minimisation is achieved by dividing
a set of vectors into smaller subsets~\cite{Keerthi}. The number of training subsets is
controlled by option \code{NSubSets}. The SMO method drives the subset
selection to the extreme by selecting subsets of two vectors (for details see
Ref.~\cite{Burges}). The pairs of vectors are chosen, using heuristic rules, to achieve
the largest possible improvement (minimisation) per step. Because the working set is of
size two, it is straightforward to write down the analytical solution. The minimisation
procedure is repeated recursively until the minimum is found. The SMO algorithm has proven
to be significantly faster than other methods and has become the most common
minimisation method used in SVM implementations. The precision of the minimisation
is controlled by the tolerance parameter \code{Tol} (see Tab.~\ref{opt:mva::svm}). The
SVM training time can be reduced by increasing the tolerance. Most classification problems
should be solved with less then 1000 training iterations. Interrupting the SVM algorithm
using the option \code{MaxIter} may thus be helpful when optimising the SVM training
parameters. \code{MaxIter} can be released for the final classifier training.
\subsubsection{Variable ranking}
The present implementation of the SVM classifier does not provide a ranking
of the input variables.
\subsubsection{Performance}
\label{SVM:performance}
\index{Support vector machine, SVM!performance of}
%The TMVA SVM algorithm comes with linear, polynomial, Gaussian and sigmoidal kernel functions.
The TMVA SVM algorithm comes currently only with the Gaussian kernel function.
With sufficient training statistics, the Gaussian kernel allows to approximate
any separating function in the input space. It is crucial for the performance of the
SVM to appropriately tune the kernel parameters and the cost parameter \code{C}.
In case of a Gaussian, the kernel is tuned via option \code{Gamma} which is related to
the width $\sigma$ by $\Gamma=1/(2 \sigma^2)$.
The optimal tuning of these parameters is specific to the problem and must be done by the user.
The SVM training time scales with $n^2$, where $n$ is the number of vectors (events) in
the training data set. The user is therefore advised to restrict the sample size
during the first rough scan of the kernel parameters. Also increasing the minimisation
tolerance helps to speed up the training.
SVM is a nonlinear general purpose classification and regression algorithm with
a performance similar to neural networks (Sec.~\ref{sec:ann}) or to a multidimensional
likelihood estimator (Sec.~\ref{sec:pders}).