Commit aef8e8c

Dumi's time to shine :)
1 parent 71335dc commit aef8e8c

3 files changed

Lines changed: 76 additions & 25 deletions

doc/logreg.txt

Lines changed: 9 additions & 1 deletion
@@ -114,7 +114,7 @@ us first start by defining the likelihood :math:`\cal{L}` and loss
 .. math::
 
     \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) =
-      \sum_{i=0}^{|\mathcal{D}|} log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
+      \sum_{i=0}^{|\mathcal{D}|} \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
     \ell (\theta=\{W,b\}, \mathcal{D}) = - \mathcal{L} (\theta=\{W,b\}, \mathcal{D})
 
 While entire books are dedicated to the topic of minimization, gradient
@@ -306,6 +306,14 @@ The finished product is as follows.
 
 .. literalinclude:: ../code/logistic_sgd.py
 
+The user can classify MNIST digits to his heart's content, by typing, from
+within the DeepLearningTutorials folder:
+
+.. code-block:: bash
+
+    python code/logistic_sgd.py
+
+
 
 .. rubric:: Footnotes
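
To make the likelihood and loss touched by the first hunk of doc/logreg.txt concrete, here is a minimal sketch in plain Python. It assumes that P(Y=k | x, W, b) is the k-th entry of softmax(Wx + b) and that W stores one row of weights per class; neither detail is shown in the diff itself. The names `softmax`, `class_probabilities`, `log_likelihood` and `loss`, as well as the toy numbers, are illustrative only and are not taken from code/logistic_sgd.py.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(W, b, x):
    """P(Y=k | x, W, b) for every class k, ASSUMING a softmax over W x + b."""
    scores = [sum(w_kj * x_j for w_kj, x_j in zip(row, x)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(scores)


def log_likelihood(W, b, dataset):
    """L(theta, D): sum over examples of log P(Y=y_i | x_i, W, b)."""
    return sum(math.log(class_probabilities(W, b, x)[y]) for x, y in dataset)


def loss(W, b, dataset):
    """ell(theta, D) = -L(theta, D), the quantity that gets minimized."""
    return -log_likelihood(W, b, dataset)


# Hypothetical toy problem: 2 input dimensions, 2 classes.
W = [[1.0, -1.0], [-1.0, 1.0]]   # assumption: one row of weights per class
b = [0.0, 0.0]
dataset = [([2.0, 0.0], 0), ([0.0, 2.0], 1)]   # (input, label) pairs

print(loss(W, b, dataset))
```

Because every log-probability is at most zero, the loss is never negative, and it shrinks toward zero as the model puts more probability on the correct labels.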

doc/notation.txt

Lines changed: 15 additions & 8 deletions
@@ -15,21 +15,28 @@ use superscripts to distinguish training set examples. :math:`x^{(i)} \in
 :math:`y^{(i)} \in \{0, ..., L\}` is the i-th label assigned to input
 :math:`x^{(i)}`.
 
-List of Symbols
-+++++++++++++++
+Math Conventions
+++++++++++++++++
 
-* L is for the number of labels.
-* D is the number of input dimensions.
 * :math:`W` upper-case symbols refer to a matrix unless specified otherwise
 * :math:`W_{ij}` element at i-th row and j-th column of matrix :math:`W`
 * :math:`W_{i \cdot}, W_i` vector, i-th row of matrix :math:`W`
 * :math:`W_{\cdot j}` vector, j-th column of matrix :math:`W`
 * :math:`b` lower-case symbols refer to a vector unless specified otherwise
 * :math:`b_i` i-th element of vector :math:`b`
-* :math:`\theta` is the set of all parameters for a given model
-* :math:`\mathcal{L}(\theta, \cal{D})` likelihood of dataset :math:`\cal{D}`
+
+List of Symbols
++++++++++++++++
+
+* D: number of input dimensions.
+* :math:`f_{\theta}(x)`, :math:`f(x)`: prediction function of a model :math:`P(Y|x,\theta)`, defined as :math:`argmax_k P(Y=k|x,\theta)`.
+  Note that we will often drop the :math:`\theta` subscript.
+* L: number of labels.
+* :math:`\mathcal{L}(\theta, \cal{D})`: (log)likelihood of dataset :math:`\cal{D}`
   under the model defined by parameters :math:`\theta`.
-* :math:`\ell(\theta, \cal{D})` empirical loss of the prediction function
-  parameterized by :math:`\theta` on data set :math:`\cal{D}`. TODO: put it in terms of P(Y|X) ?
+* :math:`\ell(\theta, \cal{D})` empirical loss of the prediction function f
+  parameterized by :math:`\theta` on data set :math:`\cal{D}`.
+* NLL: negative log-likelihood
+* :math:`\theta`: set of all parameters for a given model
 

doc/optimization.txt

Lines changed: 52 additions & 16 deletions
@@ -18,29 +18,65 @@ many of the models in the Deep Learning Tutorials.
 Learning a Classifier
 +++++++++++++++++++++
 
-The models presented in these deep learning tutorials are as classifiers.
-The objective in training a classifier is to minimize the number of errors (zero-one loss) on unseen examples.
+Zero-One Loss
+-------------
+
+The models presented in these deep learning tutorials are mostly used
+for classification. The objective in training a classifier is to minimize the number
+of errors (zero-one loss) on unseen examples. If :math:`f: R^D \rightarrow
+\{0,...,L\}` is the prediction function, then this loss can be written as:
+
+.. math::
+
+    \ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}
+
+where :math:`\mathcal{D} \cap \mathcal{D}_{train} = \emptyset` and :math:`I` is the
+indicator function defined as:
+
+.. math::
+
+    I_x = \left\{\begin{array}{ccc}
+      1&\mbox{ if x is True} \\
+      0&\mbox{ otherwise}\end{array}\right.
+
+In this tutorial, :math:`f` is defined as:
 
 .. math::
-    ***zero one loss ***
+
+    f(x) = argmax_k P(Y=k | x, \theta)
+
+
+Negative Log-Likelihood Loss
+----------------------------
+
+Since the zero-one loss is not differentiable, optimizing it for large models
+(thousands or millions of parameters) is prohibitively expensive
+(computationally). We thus maximize the log-likelihood of our classifier given
+all the labels in a training set.
+
+.. math::
+
+    \mathcal{L}(\theta, \mathcal{D}) =
+      \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)
+
+The likelihood of the inputs for their respective class is not the same as the
+number of right predictions, but from the point of view of a randomly
+initialized classifier they are pretty similar. Later in training you can see
+that the number of right predictions in a validation set often decreases a
+little even after the probability of the right answers starts to drop
+(indicating overfitting), but not much.
 
-But the zero-one loss is not differentiable, so optimizing it for large
-models (thousands or millions of parameters) is prohibitively computationally
-expensive. Instead, we optimize the log-likelihood of our classifier given all the
-labels in a training set.
+Since we usually speak in terms of minimizing a loss function, learning will
+thus attempt to **minimize** the **negative** log-likelihood (NLL), defined
+as:
 
 .. math::
-    *** log likelihood ***
 
-The likelihood of the right answers is not the same as the number of right
-predictions, but from the point of view of a randomly initialized classifier they
-are pretty similar. Later in training you can see that the number of right
-predictions in a validation set often decreases a little even after the probability
-of the right answers starts to drop (indicating overfitting), but not much.
+    NLL(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)
 
-The log-likelihood of our classifier is a differentiable surrogate for the
-zero-one loss, and we use the gradient of this function over our training data
-as a supervised learning signal for deep learning.
+The NLL of our classifier is a differentiable surrogate for the zero-one loss,
+and we use the gradient of this function over our training data as a
+supervised learning signal for deep learning.
 
 .. _opt_SGD:
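
The zero-one loss and NLL definitions added in doc/optimization.txt can be compared on a small example. The sketch below is plain Python and takes the class probabilities P(Y=k | x, θ) as given lists rather than computing them from a model; the function names `predict`, `zero_one_loss` and `nll` and the toy probabilities are hypothetical. It is meant to illustrate why the NLL serves as a surrogate: a small change in the probabilities moves the NLL, while the zero-one loss stays put unless an argmax flips.

```python
import math


def predict(probs):
    """f(x) = argmax_k P(Y=k | x, theta), for one example's class probabilities."""
    return max(range(len(probs)), key=lambda k: probs[k])


def zero_one_loss(all_probs, labels):
    """Number of examples whose predicted class differs from the label."""
    return sum(1 for probs, y in zip(all_probs, labels) if predict(probs) != y)


def nll(all_probs, labels):
    """Negative log-likelihood: -sum_i log P(Y=y_i | x_i, theta)."""
    return -sum(math.log(probs[y]) for probs, y in zip(all_probs, labels))


# Hypothetical predicted probabilities for three examples and three classes.
labels = [0, 2, 1]
before = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1]]
# Slightly more probability on the right class of every example; no argmax flips.
after = [[0.8, 0.1, 0.1], [0.1, 0.2, 0.7], [0.5, 0.45, 0.05]]

print(zero_one_loss(before, labels), nll(before, labels))
print(zero_one_loss(after, labels), nll(after, labels))
```

In this toy case the third example is misclassified both times, so the zero-one loss gives no signal about the improvement, whereas the NLL decreases. That piecewise-constant behaviour is what makes the zero-one loss unsuitable for gradient-based training.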
