Dumi's time to shine :)

gdesjardins · gdesjardins · commit aef8e8cf714e · 2010-01-07T18:26:34.000-05:00
diff --git a/doc/logreg.txt b/doc/logreg.txt
@@ -114,7 +114,7 @@ us first start by defining the likelihood :math:`\cal{L}` and loss
 .. math::
 
    \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) = 
-     \sum_{i=0}^{|\mathcal{D}|} log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
+     \sum_{i=0}^{|\mathcal{D}|} \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\
    \ell (\theta=\{W,b\}, \mathcal{D}) = - \mathcal{L} (\theta=\{W,b\}, \mathcal{D})
 
 While entire books are dedicated to the topic of minimization, gradient
@@ -306,6 +306,14 @@ The finished product is as follows.
 
 .. literalinclude:: ../code/logistic_sgd.py
 
+To classify MNIST digits, run the following command from
+within the DeepLearningTutorials folder:
+
+.. code-block:: bash
+
+    python code/logistic_sgd.py
+
+
 
 .. rubric:: Footnotes
 
diff --git a/doc/notation.txt b/doc/notation.txt
@@ -15,21 +15,28 @@ use superscripts to distinguish training set examples. :math:`x^{(i)} \in
 :math:`y^{(i)} \in \{0, ..., L\}` is the i-th label assigned to input
 :math:`x^{(i)}`.
 
-List of Symbols
-+++++++++++++++
+Math Conventions
+++++++++++++++++
 
-* L is for the number of labels.
-* D is the number of input dimensions.
 * :math:`W` upper-case symbols refer to a matrix unless specified otherwise
 * :math:`W_{ij}` element at i-th row and j-th column of matrix :math:`W`
 * :math:`W_{i \cdot}, W_i` vector, i-th row of matrix :math:`W`
 * :math:`W_{\cdot j}` vector, j-th column of matrix :math:`W`
 * :math:`b` lower-case symbols refer to a vector unless specified otherwise
 * :math:`b_i` i-th element of vector :math:`b`
-* :math:`\theta` is the set of all parameters for a given model
-* :math:`\mathcal{L}(\theta, \cal{D})` likelihood of dataset :math:`\cal{D}`
+
+List of Symbols
++++++++++++++++
+
+* D: number of input dimensions.
+* :math:`f_{\theta}(x)`, :math:`f(x)`: prediction function of a model :math:`P(Y|x,\theta)`, defined as :math:`\mathrm{argmax}_k P(Y=k|x,\theta)`.
+  Note that we will often drop the :math:`\theta` subscript.
+* L: number of labels.
+* :math:`\mathcal{L}(\theta, \mathcal{D})`: log-likelihood of dataset :math:`\mathcal{D}`
   under the model defined by parameters :math:`\theta`.
-* :math:`\ell(\theta, \cal{D})` empirical loss of the prediction function
-  parameterized by :math:`\theta` on data set :math:`\cal{D}`. TODO: put it in terms of P(Y|X) ?
+* :math:`\ell(\theta, \mathcal{D})`: empirical loss of the prediction function :math:`f`
+  parameterized by :math:`\theta` on data set :math:`\mathcal{D}`.
+* NLL: negative log-likelihood
+* :math:`\theta`: set of all parameters for a given model
 
 
diff --git a/doc/optimization.txt b/doc/optimization.txt
@@ -18,29 +18,65 @@ many of the models in the Deep Learning Tutorials.
 Learning a Classifier
 +++++++++++++++++++++
 
-The models presented in these deep learning tutorials are as classifiers.
-The objective in training a classifier is to minimize the number of errors (zero-one loss) on unseen examples.
+Zero-One Loss
+-------------
+
+The models presented in these deep learning tutorials are mostly used
+for classification. The objective in training a classifier is to minimize the number
+of errors (zero-one loss) on unseen examples. If :math:`f: R^D \rightarrow
+\{0,...,L\}` is the prediction function, then this loss can be written as:
+
+.. math::
+    
+    \ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}
+    
+where :math:`\mathcal{D} \cap \mathcal{D}_{train} = \emptyset` and :math:`I` is the
+indicator function defined as:
+
+.. math::
+
+    I_x = \left\{\begin{array}{ll}
+          1 & \mbox{ if } x \mbox{ is True} \\
+          0 & \mbox{ otherwise}\end{array}\right.
+
+In this tutorial, :math:`f` is defined as:
 
 .. math::
-    ***zero one loss ***
+    
+    f(x) = \mathrm{argmax}_k P(Y=k | x, \theta)
+
+
+Negative Log-Likelihood Loss
+----------------------------
+
+Since the zero-one loss is not differentiable, optimizing it for large models
+(thousands or millions of parameters) is prohibitively expensive
+(computationally). We thus maximize the log-likelihood of our classifier given
+all the labels in a training set.
+
+.. math::
+
+    \mathcal{L}(\theta, \mathcal{D}) = 
+        \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)
+
+The likelihood of the correct class is not the same as the
+number of right predictions, but from the point of view of a randomly
+initialized classifier they are pretty similar.  Later in training you can see
+that the number of right predictions in a validation set often decreases a
+little even after the probability of the right answers starts to drop
+(indicating overfitting), but not much.
 
-But the zero-one loss is not differentiable, so optimizing it for large
-models (thousands or millions of parameters) is prohibitively computationally
-expensive.  Instead, we optimize the log-likelihood of our classifier given all the
-labels in a training set.
+Since we usually speak in terms of minimizing a loss function, learning will
+thus attempt to **minimize** the **negative** log-likelihood (NLL), defined
+as:
 
 .. math::
-    *** log likelihood ***
 
-The likelihood of the right answers is not the same as the number of right
-predictions, but from the point of view of a randomly initialized classifier they
-are pretty similar.  Later in training you can see that the number of right
-predictions in a validation set often decreases a little even after the probability
-of the right answers starts to drop (indicating overfitting), but not much.
+    \mathrm{NLL}(\theta, \mathcal{D}) = - \sum_{i=0}^{|\mathcal{D}|} \log P(Y=y^{(i)} | x^{(i)}, \theta)
 
-The log-likelihood of our classifier is a differentiable surrogate for the
-zero-one loss, and we use the gradient of this function over our training data
-as a supervised learning signal for deep learning.
+The NLL of our classifier is a differentiable surrogate for the zero-one loss,
+and we use the gradient of this function over our training data as a
+supervised learning signal for deep learning.
 
 .. _opt_SGD: