Commit 7a61a86

authored
Update README.rst
1 parent b91e031 commit 7a61a86

1 file changed

Lines changed: 80 additions & 0 deletions

README.rst

@@ -315,17 +315,79 @@ Non-negative Matrix Factorization (NMF)
~~~~~~~~~~~~~~~~~
Random Projection
~~~~~~~~~~~~~~~~~

Random projection, also called random features, is a dimensionality reduction technique mostly used for very large datasets or very high-dimensional feature spaces. Text and document data, especially with weighted feature extraction, generate a huge number of features.
Many researchers have applied random projection to text data for text mining, text classification, and dimensionality reduction.
We start by reviewing some random projection techniques.

.. image:: docs/pic/Random%20Projection.png

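.. code:: python

    import numpy as np
    from sklearn import random_projection

    # 100 samples with 10,000 features each
    X = np.random.rand(100, 10000)

    # n_components defaults to 'auto', so the target dimension is derived
    # from the number of samples and the default eps=0.1
    transformer = random_projection.GaussianRandomProjection()
    X_new = transformer.fit_transform(X)
    print(X_new.shape)  # (100, 3947)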
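
The 3947 columns in the output above follow from the Johnson-Lindenstrauss lemma: when ``n_components`` is left at its default of ``'auto'``, scikit-learn derives the target dimension from the number of samples and the distortion tolerance ``eps`` (0.1 by default). The NumPy-only sketch below reproduces that bound and performs a Gaussian random projection by hand. The helper names ``jl_min_dim`` and ``gaussian_random_projection`` are illustrative and not part of scikit-learn, and the looser ``eps=0.5`` is an assumption chosen only to keep the projection matrix small.

```python
import numpy as np


def jl_min_dim(n_samples, eps=0.1):
    """Johnson-Lindenstrauss lower bound on the number of target dimensions."""
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * np.log(n_samples) / denominator)


def gaussian_random_projection(X, n_components, seed=0):
    """Project the rows of X onto n_components random Gaussian directions."""
    rng = np.random.default_rng(seed)
    # Entries are drawn from N(0, 1 / n_components), so squared distances
    # are preserved in expectation.
    R = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n_components),
                   size=(X.shape[1], n_components))
    return X @ R


rng = np.random.default_rng(42)
X = rng.random((100, 10000))

# With the default tolerance the bound matches the shape reported above.
k_default = jl_min_dim(n_samples=X.shape[0], eps=0.1)

# A looser tolerance (assumed here for illustration) needs far fewer dimensions.
k_loose = jl_min_dim(n_samples=X.shape[0], eps=0.5)
X_new = gaussian_random_projection(X, k_loose)

# Compare one pairwise distance before and after the projection.
ratio = np.linalg.norm(X_new[0] - X_new[1]) / np.linalg.norm(X[0] - X[1])
print(k_default, k_loose, X_new.shape, round(ratio, 3))
```

Note that the bound depends only on the number of samples and ``eps``, not on the original number of features, which is why random projection pays off mainly when the feature space is very wide.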

~~~~~~~~~~~
Autoencoder
~~~~~~~~~~~

An autoencoder is a neural network that is trained to copy its input to its output. Autoencoders have achieved great success as a dimensionality reduction method thanks to the representational power of neural networks. The main idea is that a hidden layer between the input and output layers has fewer units than either of them, so its activations can serve as a reduced-dimension representation of the feature space. Especially for texts, documents, and sequences that contain many features, an autoencoder can make downstream processing faster and more efficient.

.. image:: docs/pic/Autoencoder.png

.. code:: python

    from keras.layers import Input, Dense
    from keras.models import Model

    # number of input features; placeholder value, set it to the width of your data
    n = 10000

    # this is the size of our encoded representations
    encoding_dim = 1500

    # this is our input placeholder (named input_layer to avoid shadowing the built-in input)
    input_layer = Input(shape=(n,))
    # "encoded" is the encoded representation of the input
    encoded = Dense(encoding_dim, activation='relu')(input_layer)
    # "decoded" is the lossy reconstruction of the input
    decoded = Dense(n, activation='sigmoid')(encoded)

    # this model maps an input to its reconstruction
    autoencoder = Model(input_layer, decoded)

    # this model maps an input to its encoded representation
    encoder = Model(input_layer, encoded)

    # placeholder for an already encoded input
    encoded_input = Input(shape=(encoding_dim,))
    # retrieve the last layer of the autoencoder model
    decoder_layer = autoencoder.layers[-1]
    # create the decoder model
    decoder = Model(encoded_input, decoder_layer(encoded_input))

    # sigmoid outputs with binary cross-entropy assume inputs scaled to [0, 1]
    autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

Train the autoencoder (``x_train`` and ``x_test`` are assumed to be already loaded feature matrices with ``n`` columns, scaled to [0, 1]):

.. code:: python

    autoencoder.fit(x_train, x_train,
                    epochs=50,
                    batch_size=256,
                    shuffle=True,
                    validation_data=(x_test, x_test))
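
The bottleneck idea can also be sketched without a deep learning framework. The following is a minimal NumPy illustration, not the Keras model above: it assumes a purely linear encoder and decoder, synthetic data that lies on a 3-dimensional subspace, and a hand-picked learning rate and step count. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples with 20 features that actually
# lie on a 3-dimensional subspace, so 3 hidden units can represent them.
n_samples, n_features, encoding_dim = 200, 20, 3
latent = rng.normal(size=(n_samples, encoding_dim))
mixing = rng.normal(size=(encoding_dim, n_features)) / np.sqrt(n_features)
X = latent @ mixing

# Encoder and decoder weights of a linear autoencoder, small random init.
W_enc = rng.normal(scale=0.1, size=(n_features, encoding_dim))
W_dec = rng.normal(scale=0.1, size=(encoding_dim, n_features))


def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared error between the input and its reconstruction."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)


initial_loss = reconstruction_loss(X, W_enc, W_dec)

learning_rate = 0.1
for _ in range(2000):
    encoded = X @ W_enc        # reduced representation (the bottleneck)
    decoded = encoded @ W_dec  # reconstruction of the input
    error = decoded - X
    # Gradients of 0.5 * squared error, averaged over the samples.
    grad_dec = encoded.T @ error / n_samples
    grad_enc = X.T @ (error @ W_dec.T) / n_samples
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)

# The encoder alone maps each sample to its reduced feature vector.
X_reduced = X @ W_enc
print(X_reduced.shape, initial_loss, final_loss)
```

After training, only the encoder is needed for dimensionality reduction, which mirrors the role of the separate ``encoder`` model in the Keras code.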

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
T-distributed Stochastic Neighbor Embedding (T-SNE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -342,6 +404,24 @@ Text Classification Techniques
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rocchio classification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The first version of the Rocchio algorithm was introduced by Rocchio in 1971 to use relevance feedback when querying full-text databases. Since then, many researchers have adapted and extended this technique for text and document classification. The method uses TF-IDF weights for each informative word instead of a set of Boolean features. Using a training set of documents, Rocchio's algorithm builds a prototype vector for each class, which is the average of all training document vectors that belong to that class. It then assigns each test document to the class whose prototype vector is most similar to the test document.

When the nearest centroid classifier is applied to text represented as TF-IDF vectors, it is known as the Rocchio classifier.

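.. code:: python

    import numpy as np
    from sklearn.neighbors import NearestCentroid

    # toy 2-D data: two well-separated classes
    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    y = np.array([1, 1, 1, 2, 2, 2])

    # each class is represented by the centroid of its training points
    clf = NearestCentroid()
    clf.fit(X, y)

    # a new point is assigned to the class of the nearest centroid
    print(clf.predict([[-0.8, -1]]))  # [1]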
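
To make the link to text concrete, the prototype-vector procedure described above can be sketched directly in NumPy. The tiny corpus, the whitespace tokenizer, the smoothed IDF formula, and the helper names are all assumptions made for illustration; a real pipeline would more likely rely on a library vectorizer.

```python
import numpy as np

# Hypothetical toy corpus with two classes.
train_docs = [
    "the striker scored a late goal",
    "the goalkeeper saved the penalty",
    "the team won the football match",
    "the senate passed the budget bill",
    "the president vetoed the tax bill",
    "voters elected a new parliament",
]
train_labels = np.array(["sports", "sports", "sports",
                         "politics", "politics", "politics"])

vocabulary = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}


def term_counts(docs):
    """Raw term-frequency matrix; words unseen in training are ignored."""
    counts = np.zeros((len(docs), len(vocabulary)))
    for row, doc in enumerate(docs):
        for word in doc.split():
            if word in index:
                counts[row, index[word]] += 1
    return counts


def l2_normalize(M):
    """Scale each row to unit length, leaving all-zero rows untouched."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.where(norms == 0, 1.0, norms)


# Smoothed inverse document frequency estimated from the training set.
doc_freq = (term_counts(train_docs) > 0).sum(axis=0)
idf = np.log((1 + len(train_docs)) / (1 + doc_freq)) + 1.0


def tfidf(docs):
    return l2_normalize(term_counts(docs) * idf)


# One prototype per class: the mean TF-IDF vector of its training documents.
X_train = tfidf(train_docs)
classes = np.unique(train_labels)
prototypes = l2_normalize(
    np.vstack([X_train[train_labels == c].mean(axis=0) for c in classes])
)


def predict(docs):
    """Assign each document to the class with the most similar prototype."""
    similarities = tfidf(docs) @ prototypes.T  # cosine similarity
    return classes[similarities.argmax(axis=1)]


predictions = predict(["the goalkeeper scored a goal",
                       "parliament passed a tax bill"])
print(predictions)
```

Because both the documents and the prototypes are scaled to unit length, the dot product equals the cosine similarity, which is the similarity measure typically used with TF-IDF vectors.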
419+
420+
421+
422+
423+
424+
345425
346426
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
347427
Boosting and Bagging
