The hope is that the code :math:`\mathbf{y}` is a distributed representation
that captures the coordinates along the main factors of variation in the data.
Because :math:`\mathbf{y}` is viewed as a lossy compression of :math:`\mathbf{x}`,
it cannot be a good compression (with small loss) for all :math:`\mathbf{x}`,
so learning drives it to be a good compression for the training examples in
particular, and hopefully for other examples as well (that is the sense in
which an auto-encoder generalizes), but not for arbitrary inputs.

If there is one linear hidden layer (the code) and the mean squared error
criterion is used to train the network, then the :math:`k` hidden units learn
to project the input onto the span of the first :math:`k` principal components
of the data. If the hidden layer is non-linear, the auto-encoder behaves
differently from PCA, with the ability to capture multi-modal aspects of the
input distribution.
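
The linear case can be stated as a worked equation. The notation here is an
assumption made for illustration and is not taken from the text above: a linear
encoder :math:`\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}`, a linear decoder
:math:`\mathbf{z} = \mathbf{W}'\mathbf{y} + \mathbf{b}'`, :math:`N` training
examples :math:`\mathbf{x}^{(t)}` of dimension :math:`n` with mean
:math:`\boldsymbol{\mu}`, and
:math:`\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n` the eigenvalues of
their covariance matrix (normalized by :math:`1/N`). Under these assumptions,
the squared-error criterion and its minimum over all linear auto-encoders with
:math:`k \le n` hidden units are:

.. math::

   L = \sum_{t=1}^{N} \left\| \mathbf{x}^{(t)} - \mathbf{z}^{(t)} \right\|^2,
   \qquad
   \min_{\mathbf{W}, \mathbf{b}, \mathbf{W}', \mathbf{b}'} L
   = N \sum_{i=k+1}^{n} \lambda_i

The minimum is reached when :math:`\mathbf{z} - \boldsymbol{\mu}` is the
orthogonal projection of :math:`\mathbf{x} - \boldsymbol{\mu}` onto the span of
the :math:`k` leading eigenvectors, which is the same reconstruction that PCA
with :math:`k` components produces. The hidden units themselves need not equal
the principal components; they only need to span the same subspace.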

We want to implement an auto-encoder using Theano, in the form of a class,
that can later be used to construct a stacked autoencoder. The
first step is to create shared variables for the parameters of the
autoencoder ( :math:`\mathbf{W}`, :math:`\mathbf{b}` and
@@ -130,20 +146,70 @@ Note that for the stacked denoising autoencoder we will not use the
the autoencoder would work. In [Bengio07] autoencoders are used to
build deep networks.

One serious potential issue with auto-encoders is that if there is no other
constraint besides minimizing the reconstruction error, then an auto-encoder
with :math:`n` inputs and an encoding of dimension at least :math:`n` could
potentially just learn the identity function, for which many encodings would
be useless (e.g., just copying the input). Surprisingly, experiments reported
in [Bengio07] suggest that in practice, when trained with stochastic gradient
descent, non-linear auto-encoders with more hidden units than inputs (called
overcomplete) yield useful representations (in the sense of classification
error measured on a network taking this representation as input). A simple
explanation is based on the observation that stochastic gradient descent with
early stopping is similar to an L2 regularization of the parameters. To
achieve perfect reconstruction of continuous inputs, a one-hidden layer
auto-encoder with non-linear hidden units needs very small weights in the
first layer (to keep the hidden units in the linear regime of their
non-linearity) and very large weights in the second layer. With binary inputs,
very large weights are also needed to completely minimize the reconstruction
error. Since the implicit or explicit regularization makes it difficult to
reach large-weight solutions, the optimization algorithm finds encodings
which only work well for examples similar to those in the training set, which
is what we want. It means that the representation is exploiting statistical
regularities present in the training set, rather than learning to replicate
the identity function.
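
The small-weights/large-weights argument can be checked numerically. The
sketch below is only an illustration under stated assumptions: ``tanh`` hidden
units, a linear output layer, one hidden unit per input, a first-layer weight
``eps`` and a second-layer weight ``1 / eps``. The names ``reconstruct`` and
``max_error`` are hypothetical helpers, not part of the tutorial code.

.. code-block:: python

    import math


    def reconstruct(x, eps):
        """Encode with weight eps (tanh hidden unit), decode with weight 1/eps."""
        return [math.tanh(eps * xi) / eps for xi in x]


    def max_error(x, eps):
        """Largest absolute reconstruction error over the components of x."""
        return max(abs(xi - zi) for xi, zi in zip(x, reconstruct(x, eps)))


    x = [-2.0, -0.5, 0.3, 1.5]
    for eps in (1.0, 0.1, 0.01):
        # Smaller first-layer weights keep tanh closer to its linear regime,
        # at the price of a larger second-layer weight 1/eps.
        print(eps, 1.0 / eps, max_error(x, eps))

As ``eps`` shrinks the reconstruction approaches the identity, but only because
the second-layer weight grows as ``1 / eps``, which is the kind of solution that
weight regularization or early stopping makes hard to reach.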

There are different ways that an auto-encoder with more hidden units
than inputs could be prevented from learning the identity, and still
capture something useful about the input in its hidden representation.
One is the addition of sparsity (forcing many of the hidden units to
be zero or near-zero), an approach that many authors have exploited
successfully. Another is to add randomness in the transformation from
input to reconstruction. This is exploited in Restricted Boltzmann
Machines (discussed later in this tutorial), as well as in
Denoising Auto-Encoders, discussed below.

Denoising Autoencoders (dA)
+++++++++++++++++++++++++++

The idea behind denoising autoencoders is simple. In order to force
the hidden layer to discover more robust features we train the
autoencoder to reconstruct the input from a corrupted version of it.
The denoising auto-encoder is a stochastic version of the auto-encoder.
Intuitively, a denoising auto-encoder does two things: try to encode the
input (preserve the information about the input), and try to undo the
effect of a corruption process stochastically applied to the input of the
auto-encoder. The latter can only be done by capturing the statistical
dependencies between the inputs. The denoising
auto-encoder can be understood from different perspectives
( the manifold learning perspective,
stochastic operator perspective,
bottom-up -- information theoretic perspective,
top-down -- generative model perspective ), all of which are explained in
[Vincent08].
See also section 7.2 of [Bengio09] for an overview of auto-encoders.

In [Vincent08], the stochastic corruption process
consists in randomly setting some of the inputs (as many as half of them)
to zero. Hence the denoising auto-encoder is trying to predict the missing
values from the non-missing values, for randomly selected subsets of
missing patterns. Note how being able to predict any subset of variables
from the rest is a sufficient condition for completely capturing the
joint distribution between a set of variables.
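
The corruption process described above can be sketched in plain Python. This
is a minimal illustration, not the tutorial's Theano implementation: the
function name ``corrupt`` and the parameter ``corruption_level`` are
hypothetical, and the sketch assumes each input is zeroed independently with
the same probability, which is only one possible way of choosing the
components to destroy.

.. code-block:: python

    import random


    def corrupt(x, corruption_level, rng):
        """Return a copy of x with each entry zeroed with prob. corruption_level."""
        return [0.0 if rng.random() < corruption_level else xi for xi in x]


    rng = random.Random(123)          # fixed seed so the sketch is repeatable
    x = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]
    x_tilde = corrupt(x, 0.5, rng)    # on average half of the inputs are zeroed

    # The auto-encoder is fed x_tilde but is trained to reconstruct the clean x.
    print(x)
    print(x_tilde)

The reconstruction error is always measured against the uncorrupted ``x``, so
the network has to predict the zeroed entries from the ones that survived.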

To convert the autoencoder class into a denoising autoencoder, all we
need to do is to add a stochastic corruption step operating on the input. The input can be
@@ -225,15 +291,25 @@

Stacked Autoencoders
++++++++++++++++++++

The denoising autoencoders can now be stacked to form a deep network by
feeding the latent representation (output code)
of the denoising auto-encoder found on the layer
below as input to the current layer. The **unsupervised pre-training** of such an
architecture is done one layer at a time. Each layer is trained as
a denoising auto-encoder by minimizing the reconstruction error of its input
(which is the output code of the previous layer).
Once the first :math:`k` layers
are trained, we can train the :math:`k+1`-th layer because we can now
compute the code or latent representation from the layer below.
Once all layers are pre-trained, the network goes through a second stage
of training called **fine-tuning**. Here we consider **supervised fine-tuning**
where we want to minimize prediction error on a supervised task.
For this we first add a logistic regression
layer on top of the network (more precisely on the output code of the
output layer). We then
train the entire network as we would train a multilayer
perceptron. At this point, we only consider the encoding parts of
each auto-encoder.
This stage is supervised, since now we use the target during
training (see the :ref:`mlp` for details on the multilayer perceptron).
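
The greedy layer-wise pre-training loop can be sketched without Theano. The
code below is a rough illustration under several assumptions: the class
``TinyDA`` and the function ``pretrain_stack`` are hypothetical names, the
auto-encoder uses tied weights, sigmoid units and a cross-entropy
reconstruction cost, and training is plain per-example gradient descent. It
covers only the unsupervised stage; the supervised fine-tuning stage is not
shown.

.. code-block:: python

    import math
    import random


    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))


    class TinyDA:
        """Hypothetical tied-weight denoising auto-encoder (illustration only)."""

        def __init__(self, n_in, n_hidden, rng):
            self.rng = rng
            self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                      for _ in range(n_hidden)]
            self.b = [0.0] * n_hidden       # hidden (code) bias
            self.b_prime = [0.0] * n_in     # reconstruction bias

        def encode(self, x):
            return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
                    for row, bj in zip(self.W, self.b)]

        def decode(self, y):
            return [sigmoid(sum(self.W[j][i] * y[j] for j in range(len(y))) + bi)
                    for i, bi in enumerate(self.b_prime)]

        def train_step(self, x, corruption=0.3, lr=0.1):
            # Corrupt the input, but reconstruct the clean x.
            x_tilde = [0.0 if self.rng.random() < corruption else xi for xi in x]
            y = self.encode(x_tilde)
            z = self.decode(y)
            # Cross-entropy with sigmoid outputs: gradient at the output is z - x.
            dz = [zi - xi for zi, xi in zip(z, x)]
            # Back-propagate through the (tied) decoder weights to the code.
            dy = [sum(self.W[j][i] * dz[i] for i in range(len(x)))
                  * y[j] * (1.0 - y[j]) for j in range(len(y))]
            for j in range(len(y)):
                for i in range(len(x)):
                    # Tied weights: encoder and decoder both contribute.
                    self.W[j][i] -= lr * (dy[j] * x_tilde[i] + dz[i] * y[j])
                self.b[j] -= lr * dy[j]
            for i in range(len(x)):
                self.b_prime[i] -= lr * dz[i]


    def pretrain_stack(data, layer_sizes, epochs=5, seed=0):
        """Train one denoising auto-encoder per layer, bottom to top."""
        rng = random.Random(seed)
        layers, inputs = [], data
        for n_hidden in layer_sizes:
            da = TinyDA(len(inputs[0]), n_hidden, rng)
            for _ in range(epochs):
                for x in inputs:
                    da.train_step(x)
            layers.append(da)
            # The clean codes of this layer are the inputs of the next one.
            inputs = [da.encode(x) for x in inputs]
        return layers


    data = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0],
            [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
    layers = pretrain_stack(data, layer_sizes=[3, 2])
    code = data[0]
    for layer in layers:
        code = layer.encode(code)   # after pre-training only the encoders are used
    print(len(layers), len(code))

For supervised fine-tuning, a logistic regression layer would take ``code`` as
its input and the error gradient would be propagated through every ``encode``
step; the decoders play no further role.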

This can be easily implemented in Theano, using the class defined
@@ -439,11 +515,14 @@ TODO
References
++++++++++

.. [Bengio07] Bengio Y., Lamblin P., Popovici D. and Larochelle H. `Greedy Layer-Wise Training of Deep Networks`_. Advances in Neural Information Processing Systems 19 (NIPS'06), pages 153-160, MIT Press 2007.

.. [Vincent08] Vincent P., Larochelle H., Bengio Y. and Manzagol P.A. `Extracting and Composing Robust Features with Denoising Autoencoders`_. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pages 1096-1103, ACM, 2008.

.. [Bengio09] Bengio Y. `Learning deep architectures for AI`_, Foundations and Trends in Machine Learning 1(2) pages 1-127.

.. _Greedy Layer-Wise Training of Deep Networks: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/190

.. _Extracting and Composing Robust Features with Denoising Autoencoders: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/217

.. _Learning deep architectures for AI: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/239