Commit 92f10e3

Author: Yoshua Bengio
Message: more material on auto-encoders
Parent: eb6e759

1 file changed: doc/SdA.txt (95 additions, 16 deletions)

@@ -9,7 +9,7 @@ Stacked Denoising Autoencoders (SdA)
 
 The Stacked Denoising Autoencoder (SdA) is an extension of the stacked
 autoencoder [Bengio07]_ and it was introduced in [Vincent08]_. We will start the
-tutorial with a short digression on :ref:`autoencoders`
+tutorial with a short discussion on :ref:`autoencoders`
 and then move on to how classical
 autoencoders are extended to denoising autoencoders (:ref:`dA`).
 Throughout the following subchapters we will stick as close as possible to
@@ -21,6 +21,7 @@ the original paper ( [Vincent08] ).
 Autoencoders
 +++++++++++++
 
+See section 4.6 of [Bengio09] for an overview of auto-encoders.
 An autoencoder takes an input :math:`\mathbf{x} \in [0,1]^d` and first
 maps it (with an *encoder*) to a hidden representation :math:`\mathbf{y} \in [0,1]^{d'}`
 through a deterministic mapping:
@@ -29,8 +30,8 @@ through a deterministic mapping:
 
     \mathbf{y} = s(\mathbf{W}\mathbf{x} + \mathbf{b})
 
-The latent representation :math:`\mathbf{y}` is then mapped back (with a *decoder*) into a
-"reconstructed" vector :math:`\mathbf{z}` of same shape as
+The latent representation :math:`\mathbf{y}`, or **code**, is then mapped back (with a *decoder*) into a
+**reconstruction** :math:`\mathbf{z}` of the same shape as
 :math:`\mathbf{x}` through a similar transformation, namely:
 
 .. math::
@@ -55,8 +56,23 @@ bit probabilities by the reconstruction *cross-entropy* defined as :
     L_{H} (\mathbf{x}, \mathbf{z}) = - \sum^d_{k=1}[\mathbf{x}_k \log
             \mathbf{z}_k + (1 - \mathbf{x}_k)\log(1 - \mathbf{z}_k)]
 
-
-We want to implement this behavior using Theano, in the form of a class,
+The hope is that the code :math:`\mathbf{y}` is a distributed representation
+that captures the coordinates along the main factors of variation in the data:
+because :math:`\mathbf{y}` is viewed as a lossy compression of :math:`\mathbf{x}`, it cannot
+be a good compression (with small loss) for all :math:`\mathbf{x}`, so learning
+drives it to be one that is a good compression in particular for training
+examples, and hopefully for others as well (and that is the sense
+in which an auto-encoder generalizes), but not for arbitrary inputs.
+
+If there is one linear hidden layer (the code) and
+the mean squared error criterion is used to train the network, then the :math:`k`
+hidden units learn to project the input onto the span of the first :math:`k`
+principal components of the data. If the hidden
+layer is non-linear, the auto-encoder behaves differently from PCA,
+with the ability to capture multi-modal aspects of the input
+distribution.
+
+We want to implement an auto-encoder using Theano, in the form of a class,
 that could be afterwards used in constructing a stacked autoencoder. The
 first step is to create shared variables for the parameters of the
 autoencoder ( :math:`\mathbf{W}`, :math:`\mathbf{b}` and
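
The encoder, decoder and reconstruction cross-entropy described in this hunk can be sketched in a few lines. This is a minimal illustration in plain Python (standard library only), not the Theano class the tutorial goes on to build. The helper names (``affine_sigmoid``, ``init_params``), the decoder parameter names ``W_prime`` and ``b_prime``, and the initialization range are assumptions made for the example.

```python
import math
import random


def sigmoid(a):
    """Logistic function s(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def affine_sigmoid(weights, bias, vector):
    """Return s(weights . vector + bias), one value per row of weights."""
    return [
        sigmoid(sum(w * v for w, v in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


def cross_entropy(x, z):
    """Reconstruction cross-entropy L_H(x, z), summed over the d inputs."""
    return -sum(
        xk * math.log(zk) + (1.0 - xk) * math.log(1.0 - zk)
        for xk, zk in zip(x, z)
    )


def init_params(d, d_hidden, seed=0):
    """Small random weights and zero biases (an illustrative choice)."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d_hidden)]
    W_prime = [[rng.uniform(-0.5, 0.5) for _ in range(d_hidden)]
               for _ in range(d)]
    return W, [0.0] * d_hidden, W_prime, [0.0] * d


x = [1.0, 0.0, 1.0, 1.0]
W, b, W_prime, b_prime = init_params(d=4, d_hidden=2)
y = affine_sigmoid(W, b, x)              # code, each entry in (0, 1)
z = affine_sigmoid(W_prime, b_prime, y)  # reconstruction, same length as x
loss = cross_entropy(x, z)
```

Training would then adjust the four parameters to reduce ``loss`` averaged over the training set.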
@@ -130,20 +146,70 @@ Note that for the stacked denoising autoencoder we will not use the
 the autoencoder would work. In [Bengio07] autoencoders are used to
 build deep networks.
 
+One serious potential issue with auto-encoders is that if there is no other
+constraint besides minimizing the reconstruction error,
+then an auto-encoder with :math:`n` inputs and an
+encoding of dimension at least :math:`n` could potentially just learn
+the identity function, for which many encodings would be useless (e.g.,
+just copying the input). Surprisingly, experiments reported
+in [Bengio07] suggest that in practice, when trained with
+stochastic gradient descent, non-linear auto-encoders with more hidden units
+than inputs (called overcomplete) yield useful representations
+(in the sense of classification error measured on a network taking this
+representation as input). A simple explanation is based on the
+observation that stochastic gradient
+descent with early stopping is similar to an L2 regularization of the
+parameters. To achieve perfect reconstruction of continuous
+inputs, a one-hidden layer auto-encoder with non-linear hidden units
+needs very small weights in the first layer (to bring the non-linearity of
+the hidden units into their linear regime) and very large weights in the
+second layer.
+With binary inputs, very large weights are
+also needed to completely minimize the reconstruction error. Since the
+implicit or explicit regularization makes it difficult to reach
+large-weight solutions, the optimization algorithm finds encodings which
+only work well for examples similar to those in the training set, which is
+what we want. It means that the representation is exploiting statistical
+regularities present in the training set, rather than learning to
+replicate the identity function.
+
+There are different ways that an auto-encoder with more hidden units
+than inputs could be prevented from learning the identity, and still
+capture something useful about the input in its hidden representation.
+One is the addition of sparsity (forcing many of the hidden units to
+be zero or near-zero), and it has been exploited very successfully
+by many. Another is to add randomness in the transformation from
+input to reconstruction. This is exploited in Restricted Boltzmann
+Machines (discussed later in this tutorial), as well as in
+Denoising Auto-Encoders, discussed below.
 
 Denoising Autoencoders (dA)
 +++++++++++++++++++++++++++
 
 The idea behind denoising autoencoders is simple. In order to force
 the hidden layer to discover more robust features we train the
 autoencoder to reconstruct the input from a corrupted version of it.
-This can be understood from different perspectives
+The denoising auto-encoder is a stochastic version of the auto-encoder.
+Intuitively, a denoising auto-encoder does two things: try to encode the
+input (preserve the information about the input), and try to undo the
+effect of a corruption process stochastically applied to the input of the
+auto-encoder. The latter can only be done by capturing the statistical
+dependencies between the inputs. The denoising
+auto-encoder can be understood from different perspectives
 ( the manifold learning perspective,
 stochastic operator perspective,
 bottom-up -- information theoretic perspective,
 top-down -- generative model perspective ), all of which are explained in
 [Vincent08].
+See also section 7.2 of [Bengio09] for an overview of auto-encoders.
 
+In [Vincent08], the stochastic corruption process
+consists of randomly setting some of the inputs (as many as half of them)
+to zero. Hence the denoising auto-encoder is trying to predict the missing
+values from the non-missing values, for randomly selected subsets of
+missing patterns. Note how being able to predict any subset of variables
+from the rest is a sufficient condition for completely capturing the
+joint distribution between a set of variables.
 
 To convert the autoencoder class into a denoising autoencoder one, all we
 need to do is to add a stochastic corruption step operating on the input. The input can be
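
The corruption step described in this hunk can be illustrated with a short sketch in plain Python. It assumes each input is zeroed independently with probability ``corruption_level``, which is one way to realize "randomly setting some of the inputs to zero". The function name ``corrupt`` and its parameters are hypothetical and are not taken from the tutorial's Theano code.

```python
import random


def corrupt(x, corruption_level, rng):
    """Zero each input independently with probability corruption_level.

    Assumption for this sketch: an independent mask per component; inputs
    that are not masked are passed through unchanged.
    """
    return [0.0 if rng.random() < corruption_level else xk for xk in x]


rng = random.Random(0)
x = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
x_tilde = corrupt(x, corruption_level=0.5, rng=rng)
```

The auto-encoder then reads ``x_tilde``, while the reconstruction loss is still measured against the uncorrupted ``x``.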
@@ -225,15 +291,25 @@ Stacked Autoencoders
 ++++++++++++++++++++
 
 The denoising autoencoders can now be stacked to form a deep network by
-feeding the latent representation of the dA found on the layer
-below as input to the current layer. The "pre-training" of such an
-architecture is done one layer at a time. Once the first :math:`k` layers
+feeding the latent representation (output code)
+of the denoising auto-encoder found on the layer
+below as input to the current layer. The **unsupervised pre-training** of such an
+architecture is done one layer at a time. Each layer is trained as
+a denoising auto-encoder by minimizing the reconstruction error of its input
+(which is the output code of the previous layer).
+Once the first :math:`k` layers
 are trained, we can train the :math:`k+1`-th layer because we can now
-compute the "correct" latent representation generated by the layer below.
+compute the code or latent representation from the layer below.
 Once all layers are pre-trained, the network goes through a second stage
-of training called fine-tuning. For this we first add a logistic regression
-layer on top and train the entire network as we would train a multilayer
-perceptron. This stage is supervised, since now we use the target during
+of training called **fine-tuning**. Here we consider **supervised fine-tuning**
+where we want to minimize prediction error on a supervised task.
+For this we first add a logistic regression
+layer on top of the network (more precisely on the output code of the
+output layer). We then
+train the entire network as we would train a multilayer
+perceptron. At this point, we only consider the encoding parts of
+each auto-encoder.
+This stage is supervised, since now we use the target during
 training (see the :ref:`mlp` for details on the multilayer perceptron).
 
 This can be easily implemented in Theano, using the class defined
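
The layer-wise pre-training loop described in this hunk can be sketched as follows, in plain Python rather than Theano. It is a minimal illustration under stated assumptions. The names ``train_da_layer``, ``pretrain_stack`` and ``encode`` are hypothetical. The decoder uses its own untied weights ``V`` and bias ``c``. Corruption zeroes each input independently. Training is per-example gradient descent on the cross-entropy. Each new layer is fed the codes computed from the uncorrupted input of the layer below. Supervised fine-tuning is not shown.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def encode(layer, x):
    """Code y = s(W x + b) for one pre-trained layer."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bj)
            for row, bj in zip(layer["W"], layer["b"])]


def train_da_layer(data, n_hidden, corruption_level, lr, epochs, rng):
    """Train one denoising auto-encoder on data; return only its encoder."""
    d = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    V = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(d)]
    c = [0.0] * d
    for _ in range(epochs):
        for x in data:
            # Corrupt the input, encode it, then decode.
            x_t = [0.0 if rng.random() < corruption_level else xk for xk in x]
            y = encode({"W": W, "b": b}, x_t)
            z = [sigmoid(sum(V[k][j] * y[j] for j in range(n_hidden)) + c[k])
                 for k in range(d)]
            # Cross-entropy gradient w.r.t. decoder pre-activations is z - x,
            # where x is the clean (uncorrupted) input.
            dz = [zk - xk for zk, xk in zip(z, x)]
            # Back-propagate to the hidden pre-activations before updating V.
            dy = [sum(V[k][j] * dz[k] for k in range(d)) * y[j] * (1.0 - y[j])
                  for j in range(n_hidden)]
            for k in range(d):
                for j in range(n_hidden):
                    V[k][j] -= lr * dz[k] * y[j]
                c[k] -= lr * dz[k]
            for j in range(n_hidden):
                for i in range(d):
                    W[j][i] -= lr * dy[j] * x_t[i]
                b[j] -= lr * dy[j]
    return {"W": W, "b": b}


def pretrain_stack(data, hidden_sizes, corruption_level=0.3, lr=0.1,
                   epochs=5, seed=0):
    """Greedy layer-wise pre-training: train layer k+1 on codes of layer k."""
    rng = random.Random(seed)
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        layer = train_da_layer(layer_input, n_hidden, corruption_level,
                               lr, epochs, rng)
        layers.append(layer)
        # The codes of the clean input become the next layer's training data.
        layer_input = [encode(layer, x) for x in layer_input]
    return layers


data = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
layers = pretrain_stack(data, hidden_sizes=[3, 2], epochs=2)
```

Only the encoders are returned, matching the remark that fine-tuning considers only the encoding part of each auto-encoder. A logistic regression layer would then be placed on the code of the last layer.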
@@ -439,11 +515,14 @@ TODO
 References
 ++++++++++
 
-.. [Vincent08] Vincent, P., Larochelle H., Bengio Y. and Manzagol P.A. `Extracting and Composing Robust Features with Denoising Autoencoders`_. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 1096 - 1103, ACM, 2008
+.. [Bengio07] Bengio Y., Lamblin P., Popovici D. and Larochelle H. `Greedy Layer-Wise Training of Deep Networks`_. Advances in Neural Information Processing Systems 19 (NIPS'06), pages 153-160, MIT Press 2007.
+
+.. [Vincent08] Vincent, P., Larochelle H., Bengio Y. and Manzagol P.A. `Extracting and Composing Robust Features with Denoising Autoencoders`_. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 1096 - 1103, ACM, 2008.
 
-.. [Bengio07] Bengio Y., Lamblin P., Popovici D. and Larochelle H. `Greedy Layer-Wise Training of Deep Networks`_. Advances in Neural Information Processing Systems 19 (NIPS'06), pages 153-160, MIT Press 2007
+.. [Bengio09] Bengio Y. `Learning deep architectures for AI`_, Foundations and Trends in Machine Learning 1(2) pages 1-127.
 
+.. _Greedy Layer-Wise Training of Deep Networks: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/190
 
 .. _Extracting and Composing Robust Features with Denoising Autoencoders: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/217
 
-.. _Greedy Layer-Wise Training of Deep Networks: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/190
+.. _Learning deep architectures for AI: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/239
