\subsection{Support Vector Machine (SVM)} \label{sec:SVM} In the early 1960s a linear support vector method\index{Support vector machine, SVM} has been developed for the construction of separating hyperplanes for pattern recognition problems~\cite{Vapnik1963,Vapnik1964}. It took 30 years before the method was generalised to nonlinear separating functions~\cite{Vapnik1992,Vapnik1995a} and for estimating real-valued functions (regression)~\cite{Vapnik1995b}. At that moment it became a general purpose algorithm, performing classification and regression tasks which can compete with neural networks and probability density estimators. Typical applications of SVMs include text categorisation, character recognition, bio-informatics and face detection. The main idea of the SVM approach to classification problems is to build a hyperplane that separates signal and background {\em vectors} (events) using only a minimal subset of all training vectors ({\em support vectors}). The position of the hyperplane is obtained by maximizing the margin (distance) between it and the support vectors. The extension to nonlinear SVMs is performed by mapping the input vectors onto a higher dimensional feature space in which signal and background events can be separated by a linear procedure using an optimally separating hyperplane. The use of kernel functions eliminates thereby the explicit transformation to the feature space and simplifies the computation. The implementation of the newly introduced regression is similar to the approach in classification. It also maps input data into higher dimensional space using previously chosen support vectors. Instead of separating events of two types, it determines the hyperplane with events of the same value (which is equal to the {\em mean} from all training events). The final value is estimated based on the distance to the hyperplane which is computed by the selected kernel function.% Comment: currently we can only select Gaus, soon we will have most of the old kernel options back \subsubsection{Booking options} The SVM classifier is booked via the command: \begin{codeexample} \begin{tmvacode} factory->BookMethod( TMVA::Types::kSVM, "SVM", "" ); \end{tmvacode} \caption[.]{\codeexampleCaptionSize Booking of the SVM classifier: the first argument is a unique type enumerator, the second is a user-defined name which must be unique among all booked classifiers, and the third argument is the configuration option string. Individual options are separated by a ':'. For options that are not set in the string default values are used. See Sec.~\ref{sec:usingtmva:booking} for more information on the booking.} \end{codeexample} The configuration options for the SVM classifier are given in Option Table~\ref{opt:mva::svm}. % ======= input option table ========================================== \begin{option}[t] \input optiontables/MVA__SVM.tex \caption[.]{\optionCaptionSize Configuration options reference for MVA method: {\em SVM}. Values given are defaults. If predefined categories exist, the default category is marked by a '$\star$'. The options in Option Table~\ref{opt:mva::methodbase} on page~\pageref{opt:mva::methodbase} can also be configured. % Definition of the kernel function: \code{Linear} is % $K(\vec x, \vec y) = \vec x \cdot \vec y $ (no extra parameters), % \code{Polynomial} is $K(\vec x, \vec y) = (\vec x \cdot \vec y +\theta)^d$, % \code{Gauss} is % $K(\vec x, \vec y) = \exp\big(-\left|\vec x - \vec y \right|^2 /2\sigma^2\big)$, % and \code{Sigmoid} corresponds to % $K(\vec x, \vec y) = \tanh\left(\kappa (\vec x \cdot \vec y) +\theta \right)$ } \label{opt:mva::svm} \end{option} % ===================================================================== \subsubsection{Description and implementation} A detailed description of the SVM formalism can be found, for example, in Ref.~\cite{Burges}. Here only a brief introduction of the TMVA implementation is given. \subsubsection*{Linear SVM} \index{Support vector machine, SVM!linear} % \begin{figure}[t] \centering \includegraphics[width=0.47\textwidth]{plots/SVM} \caption{ Hyperplane classifier in two dimensions. The vectors (events) ${\bf x}_{1-4}$ define the hyperplane and margin, \ie, they are the support vectors. } \label{fig:classifier} \end{figure} Consider a simple two-class classifier with oriented hyperplanes. If the training data is linearly separable, a vector-scalar pair $(\vec w, b)$ can be found that satisfies the constraints % \beq \label{eq:SVM1} y_i (\vec {x_i}\cdot \vec w+ b)-1 \ge 0\,, \space \mbox{\hspace{0.5cm}} \forall_i\,, \eeq % where $\vec x_i$ are the input vectors, $y_i$ the desired outputs ($y_i=\pm 1$), and where the pair $(\vec w,b)$ defines a hyperplane. The decision function of the classifier is $f(\vec x_i)= {\rm sign} (\vec x_i\cdot \vec w+ b)$, which is $+1$ for all points on one side of the hyperplane and $-1$ for the points on the other side. Intuitively, the classifier with the largest margin will give better separation. The margin for this linear classifier is just $2/|\vec w|$. Hence to maximise the margin, one needs to minimise the {\em cost function} $W = |\vec w|^2/w$ with the constraints from Eq.~(\ref{eq:SVM1}). At this point it is beneficial to consider the significance of different input vectors $\vec x_i$. The training events lying on the margins, which are called the support vectors (SV), are the events that contribute to defining the decision boundary (see Fig.~\ref{fig:classifier}). Hence if the other events are removed from the training sample and the classifier is retrained on the remaining events, the training will result in the same decision boundary. To solve the constrained quadratic optimisation problem, we first reformulate it in terms of a Lagrangian % \beq {\cal L}( \vec w, b, \vec \alpha) = \frac{1}{2}\left| \vec w \right|^2- \sum_i \alpha_i \left( y_i \left( ( \vec x_i \cdot \vec w ) + b\right)-1\right) \eeq % where $\alpha_i \ge 0$ and the condition from Eq.~(\ref{eq:SVM1}) must be fulfilled. The Lagrangian ${\cal L}$ is minimised with respect to $\vec w$ and $b$ and maximised with respect to $\vec \alpha$. The solution has an expansion in terms of a subset of input vectors for which $\alpha_i \ne 0$ (the support vectors): % \beq \vec w = \sum_i \alpha_i y_i \vec x_i\,, \eeq % because $\partial {\cal L} / \partial b =0$ and $\partial {\cal L}/\partial\vec w=0$ hold at the extremum. The optimisation problem translates to finding the vector $\vec \alpha$ which maximises % \beq \label{eq:SVMdot} {\cal L}( \vec \alpha )= \sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \vec x_i \cdot \vec x_j \,. \eeq % Both the optimisation problem and the final decision function depend only on scalar products between input vectors, which is a crucial property for the generalisation to the nonlinear case. \subsubsection*{Nonseparable data} \label{SVM:nonseparable} \index{Nonseparable data} The above algorithm can be extended to non-separable data. The classification constraints in Eq.~(\ref{eq:SVM1}) are modified by adding a ``slack'' variable $\xi_i$ to it ($\xi_i=0$ if the vector is properly classified, otherwise $\xi_i$ is the distance to the decision hyperplane) % \beq \label{eq:SVM2} y_i (\vec {x_i}\cdot \vec w+ b)-1 +\xi_i \ge 0,\space \mbox{\hspace{1cm}} \xi_i\ge 0\,, \mbox{\hspace{0.5cm}} \forall_i\,. \eeq % This admits a certain amount of misclassification. The training algorithm thus minimises the modified cost function % \beq \label{eq:SVM3} W = \frac{1}{2}\left| \vec w \right|^2 + C \sum_i \xi_i\,, \eeq describing a trade-off between margin and misclassification. The cost parameter \code{C} sets the scale by how much misclassification increases the cost function (see Tab.~\ref{opt:mva::svm}). \subsubsection*{Nonlinear SVM} \index{Support vector machine, SVM!nonlinear} The SVM formulation given above can be further extended to build a nonlinear SVM which can classify nonlinearly separable data. % Consider a function $\Phi:\;{\rm R}^\Nvar\to\cal H$, which maps the training data from ${\rm R}^\Nvar$, where $\Nvar$ is the number of discriminating input variables, to some higher dimensional space $\cal H$. In the $\cal H$ space the signal and background events can be linearly separated so that the linear SVM formulation can be applied. We have seen in Eq.~(\ref{eq:SVMdot}) that event variables only appear in the form of scalar products $\vec x_i\cdot \vec x_j$, which become $\Phi(\vec x_i)\cdot \Phi(\vec x_j)$ in the higher dimensional feature space $\cal H$. The latter scalar product can be approximated by a kernel function % \beq K(\vec x_i, \vec x_j)\approx\Phi(\vec x_i)\cdot \Phi(\vec x_j)\,, \eeq % which avoids the explicit computation of the mapping function $\Phi(\vec x)$. This is desirable because the exact form of $\Phi(\vec x)$ is hard to derive from the training data. Most frequently used kernel functions are % \beq \label{SMV:kernels} \def\smallEq{\hspace{-0.15cm}=} \begin{array}{rcll} K(\vec x, \vec y) &\smallEq &\hspace{-0.15cm} (\vec x \cdot \vec y +\theta)^d &\hspace{0.2cm}\text{\em Polynomial}, \\ K(\vec x, \vec y) &\smallEq &\hspace{-0.15cm} \exp\left(-\left|\vec x - \vec y \right|^2 /2\sigma^2\right) &\hspace{0.2cm} \text{\em Gaussian}, \\ K(\vec x, \vec y) &\smallEq &\hspace{-0.15cm} \tanh\left(\kappa (\vec x \cdot \vec y) +\theta\right) &\hspace{0.2cm} \text{\em Sigmoidal}, \end{array} \eeq of which currently only the Gaussian Kernel is implemented in TMVA. % It was shown in Ref.~\cite{Vapnik1995b} that a suitable function kernel must fulfill Mercer's condition % \beq \int K(\vec x, \vec y)g(\vec x)g(\vec y)d\vec x d\vec y \ge 0\,, \eeq % for any function $g$ such that $\int g^2(\vec x) d\vec x$ is finite. While Gaussian and polynomial kernels are known to comply with Mercer's condition, this is not strictly the case for sigmoidal kernels. To extend the linear methodology to nonlinear problems one substitutes $\vec x_i \cdot \vec x_j$ by $K(\vec x_i, \vec x_j)$ in Eq.~(\ref{eq:SVMdot}). Due to Mercer's conditions on the kernel, the corresponding optimisation problem is a well defined convex quadratic programming problem with a global minimum. This is an advantage of SVMs compared to neural networks where local minima occur. For regression problems, the same algorithm is used as for classification with the exception that instead of dividing events based on their type (signal/background), it separates them based on the value (larger/smaller than average). In the end, it does not return the sigmoid of the distance between the event and the hyperplane, but the distance itself -- increased by the average target value. \subsubsection*{Implementation} The TMVA implementation of the Support Vector Machine follows closely the description given in the literature. It employs a sequential minimal optimisation (SMO)~\cite{Platt} to solve the quadratic problem. Acceleration of the minimisation is achieved by dividing a set of vectors into smaller subsets~\cite{Keerthi}. The number of training subsets is controlled by option \code{NSubSets}. The SMO method drives the subset selection to the extreme by selecting subsets of two vectors (for details see Ref.~\cite{Burges}). The pairs of vectors are chosen, using heuristic rules, to achieve the largest possible improvement (minimisation) per step. Because the working set is of size two, it is straightforward to write down the analytical solution. The minimisation procedure is repeated recursively until the minimum is found. The SMO algorithm has proven to be significantly faster than other methods and has become the most common minimisation method used in SVM implementations. The precision of the minimisation is controlled by the tolerance parameter \code{Tol} (see Tab.~\ref{opt:mva::svm}). The SVM training time can be reduced by increasing the tolerance. Most classification problems should be solved with less then 1000 training iterations. Interrupting the SVM algorithm using the option \code{MaxIter} may thus be helpful when optimising the SVM training parameters. \code{MaxIter} can be released for the final classifier training. \subsubsection{Variable ranking} The present implementation of the SVM classifier does not provide a ranking of the input variables. \subsubsection{Performance} \label{SVM:performance} \index{Support vector machine, SVM!performance of} %The TMVA SVM algorithm comes with linear, polynomial, Gaussian and sigmoidal kernel functions. The TMVA SVM algorithm comes currently only with the Gaussian kernel function. With sufficient training statistics, the Gaussian kernel allows to approximate any separating function in the input space. It is crucial for the performance of the SVM to appropriately tune the kernel parameters and the cost parameter \code{C}. In case of a Gaussian, the kernel is tuned via option \code{Gamma} which is related to the width $\sigma$ by $\Gamma=1/(2 \sigma^2)$. The optimal tuning of these parameters is specific to the problem and must be done by the user. The SVM training time scales with $n^2$, where $n$ is the number of vectors (events) in the training data set. The user is therefore advised to restrict the sample size during the first rough scan of the kernel parameters. Also increasing the minimisation tolerance helps to speed up the training. SVM is a nonlinear general purpose classification and regression algorithm with a performance similar to neural networks (Sec.~\ref{sec:ann}) or to a multidimensional likelihood estimator (Sec.~\ref{sec:pders}).