@@ -18,29 +18,65 @@ many of the models in the Deep Learning Tutorials.
1818Learning a Classifier
1919+++++++++++++++++++++
2020
21- The models presented in these deep learning tutorials are as classifiers.
22- The objective in training a classifier is to minimize the number of errors (zero-one loss) on unseen examples.
21+ Zero-One Loss
22+ -------------
23+
24+ The models presented in these deep learning tutorials are mostly used
25+ for classification. The objective in training a classifier is to minimize the number
26+ of errors (zero-one loss) on unseen examples. If :math:`f: R^D \rightarrow
27+ \{0,...,L\}` is the prediction function, then this loss can be written as:
28+
29+ .. math::
30+
31+ \ell_{0,1} = \sum_{i=0}^{|\mathcal{D}| - 1} I_{f(x^{(i)}) \neq y^{(i)}}
32+
33+ where :math:`\mathcal{D}` is a set of unseen examples, i.e. :math:`\mathcal{D} \cap \mathcal{D}_{train} = \emptyset`,
34+ and :math:`I` is the indicator function defined as:
35+
36+ .. math::
37+
38+ I_x = \left\{\begin{array}{ll}
39+ 1&\mbox{ if } x \mbox{ is True} \\
40+ 0&\mbox{ otherwise}\end{array}\right.
41+
42+ In this tutorial, :math:`f` is defined as:
2343
2444.. math::
25- ***zero one loss ***
45+
46+ f(x) = \mathrm{argmax}_k P(Y=k | x, \theta)
47+
48+
49+ Negative Log-Likelihood Loss
50+ ----------------------------
51+
52+ Since the zero-one loss is not differentiable, optimizing it for large models
53+ (thousands or millions of parameters) is prohibitively expensive
54+ (computationally). We thus maximize the log-likelihood of our classifier given
55+ all the labels in a training set.
56+
57+ .. math::
58+
59+ \mathcal{L}(\theta, \mathcal{D}) =
60+ \sum_{i=0}^{|\mathcal{D}| - 1} \log P(Y=y^{(i)} | x^{(i)}, \theta)
61+
62+ The likelihood of the correct class for each input is not the same as the
63+ number of right predictions, but from the point of view of a randomly
64+ initialized classifier they are pretty similar. Later in training, once the
65+ probability of the right answers on a validation set starts to drop
66+ (indicating overfitting), the number of right predictions often decreases
67+ as well, but only a little.
2668
27- But the zero-one loss is not differentiable, so optimizing it for large
28- models (thousands or millions of parameters) is prohibitively computationally
29- expensive. Instead, we optimize the log-likelihood of our classifier given all the
30- labels in a training set.
69+ Since we usually speak in terms of minimizing a loss function, learning will
70+ attempt to **minimize** the **negative** log-likelihood (NLL), defined
71+ as:
3172
3273.. math::
33- *** log likelihood ***
3474
35- The likelihood of the right answers is not the same as the number of right
36- predictions, but from the point of view of a randomly initialized classifier they
37- are pretty similar. Later in training you can see that the number of right
38- predictions in a validation set often decreases a little even after the probability
39- of the right answers starts to drop (indicating overfitting), but not much.
75+ NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}| - 1} \log P(Y=y^{(i)} | x^{(i)}, \theta)
4076
41- The log-likelihood of our classifier is a differentiable surrogate for the
42- zero-one loss, and we use the gradient of this function over our training data
43- as a supervised learning signal for deep learning.
77+ The NLL of our classifier is a differentiable surrogate for the zero-one loss,
78+ and we use the gradient of this function over our training data as a
79+ supervised learning signal for deep learning.
4480
4581.. _opt_SGD:
4682
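The two losses introduced in this hunk can be compared on a toy example. The sketch below is not part of the commit and is not the tutorial's Theano code. It is a plain-Python illustration that assumes the classifier's outputs :math:`P(Y=k | x, \theta)` are already available as rows of class probabilities. The function names and the sample data are hypothetical.

```python
import math


def predict(probs):
    """Return argmax_k P(Y=k | x, theta) for one row of class probabilities."""
    return max(range(len(probs)), key=lambda k: probs[k])


def zero_one_loss(prob_rows, labels):
    """Count the examples whose predicted class differs from the label."""
    return sum(1 for p, y in zip(prob_rows, labels) if predict(p) != y)


def negative_log_likelihood(prob_rows, labels):
    """Sum of -log P(Y=y_i | x_i, theta) over all examples."""
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, labels))


# Hypothetical probabilities for three examples over three classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]
labels = [0, 2, 0]

errors = zero_one_loss(probs, labels)        # only the third example is misclassified
nll = negative_log_likelihood(probs, labels)
print(errors, nll)
```

The zero-one loss only changes when a prediction flips. The NLL changes with any change in the probability assigned to the correct class, which is why it can serve as a differentiable training signal.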