Random projection, or random features, is a dimensionality reduction technique used mostly for very large datasets or very high-dimensional feature spaces. Text and document data, especially with weighted feature extraction, generate a huge number of features.
Many researchers have applied random projection to text data for text mining, text classification, and dimensionality reduction.
We begin by reviewing some random projection techniques.

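
As a rough illustration of the idea, the sketch below multiplies the data by a random Gaussian matrix to map it into a lower-dimensional space. It uses only the Python standard library so that it runs anywhere; the helper names ``gaussian_random_projection`` and ``euclidean`` and the toy data are assumptions made for this example, not part of any library. For real workloads, scikit-learn's ``sklearn.random_projection`` module offers ready-made implementations.

.. code:: python

    import math
    import random


    def gaussian_random_projection(rows, n_components, seed=0):
        """Project dense rows (lists of floats) onto random directions."""
        rng = random.Random(seed)
        n_features = len(rows[0])
        # Entries are drawn i.i.d. from N(0, 1 / n_components), so squared
        # lengths are preserved in expectation.
        std = 1.0 / math.sqrt(n_components)
        matrix = [[rng.gauss(0.0, std) for _ in range(n_components)]
                  for _ in range(n_features)]
        return [
            [sum(x * matrix[i][j] for i, x in enumerate(row))
             for j in range(n_components)]
            for row in rows
        ]


    def euclidean(a, b):
        """Euclidean distance between two equal-length vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


    # Toy data (hypothetical): two 1000-dimensional vectors reduced to 200.
    data_rng = random.Random(1)
    docs = [[data_rng.random() for _ in range(1000)] for _ in range(2)]
    reduced = gaussian_random_projection(docs, n_components=200)

    original_distance = euclidean(docs[0], docs[1])
    reduced_distance = euclidean(reduced[0], reduced[1])

Comparing ``original_distance`` with ``reduced_distance`` shows that pairwise distances are roughly, though not exactly, preserved. The approximation generally tightens as ``n_components`` grows.
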
An autoencoder is a neural network that is trained to copy its input to its output. Autoencoders have achieved great success as dimensionality reduction methods thanks to the representational power of neural networks. The main idea is that a hidden layer between the input and output layers has fewer units than either, so its activations can serve as a reduced-dimension feature space. For texts, documents, and sequences that contain many features, an autoencoder can make data processing faster and more efficient.

.. image:: docs/pic/Autoencoder.png


.. code:: python

    from keras.layers import Input, Dense
    from keras.models import Model

    # this is the size of our encoded representations
    encoding_dim = 1500

    # this is our input placeholder
    input = Input(shape=(n,))
    # "encoded" is the encoded representation of the input

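
To make the bottleneck idea concrete without a deep-learning framework, here is a minimal sketch of a *linear* autoencoder trained with plain stochastic gradient descent on squared reconstruction error. It is a simplified stand-in, not the Keras model above: there are no biases or nonlinearities, and the function names, learning rate, epoch count, and toy data are all assumptions chosen for illustration.

.. code:: python

    import random


    def encode(x, w_enc):
        """Map an input vector to its low-dimensional code."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]


    def decode(h, w_dec):
        """Map a code back to the input space."""
        return [sum(w * hj for w, hj in zip(row, h)) for row in w_dec]


    def reconstruction_error(data, w_enc, w_dec):
        """Mean squared reconstruction error per sample."""
        total = 0.0
        for x in data:
            y = decode(encode(x, w_enc), w_dec)
            total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        return total / len(data)


    def train_linear_autoencoder(data, hidden_dim, epochs=300, lr=0.05, seed=0):
        rng = random.Random(seed)
        n = len(data[0])
        # Encoder is (hidden_dim x n); decoder is (n x hidden_dim).
        w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                 for _ in range(hidden_dim)]
        w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                 for _ in range(n)]
        for _ in range(epochs):
            for x in data:
                h = encode(x, w_enc)
                y = decode(h, w_dec)
                # Gradient of 0.5 * ||y - x||^2 with respect to the output.
                err = [yi - xi for yi, xi in zip(y, x)]
                # Backpropagate to the hidden layer before updating w_dec.
                grad_h = [sum(w_dec[i][j] * err[i] for i in range(n))
                          for j in range(hidden_dim)]
                for i in range(n):
                    for j in range(hidden_dim):
                        w_dec[i][j] -= lr * err[i] * h[j]
                for j in range(hidden_dim):
                    for i in range(n):
                        w_enc[j][i] -= lr * grad_h[j] * x[i]
        return w_enc, w_dec


    # Toy data (hypothetical): 4-dimensional points lying along a single
    # direction, so one hidden unit is enough to reconstruct them.
    direction = [0.5, -0.5, 0.5, 0.5]
    data = [[t * d for d in direction] for t in (-1.0, -0.5, 0.5, 1.0)]

    w_enc_init, w_dec_init = train_linear_autoencoder(data, 1, epochs=0)
    w_enc, w_dec = train_linear_autoencoder(data, hidden_dim=1)
    error_before = reconstruction_error(data, w_enc_init, w_dec_init)
    error_after = reconstruction_error(data, w_enc, w_dec)
    codes = [encode(x, w_enc) for x in data]  # reduced representation

Each entry of ``codes`` is the one-dimensional representation of the corresponding input. In a Keras model, the same roles would be played by a narrow ``Dense`` layer and the layer that follows it.
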

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rocchio classification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The first version of the Rocchio algorithm was introduced by Rocchio in 1971 to use relevance feedback when querying full-text databases. Since then, many researchers have adapted and developed this technique for text and document classification. The method uses TF-IDF weights for each informative word instead of a set of Boolean features. Using a training set of documents, Rocchio's algorithm builds a prototype vector for each class, which is the average of all training document vectors that belong to that class. It then assigns each test document to the class whose prototype vector is most similar to the test document.

When the nearest centroid classifier is applied to text represented as TF-IDF vectors, it is known as the Rocchio classifier.

.. code:: python

    from sklearn.neighbors import NearestCentroid

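
The prototype-vector procedure described above can also be sketched in plain Python, which makes each step explicit. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the smoothed IDF formula, whitespace tokenization, cosine similarity as the similarity measure, and the names ``fit_rocchio``, ``predict_rocchio``, ``vectorize``, and ``cosine`` are all choices made for this example. The toy documents are invented.

.. code:: python

    import math
    from collections import Counter, defaultdict


    def vectorize(text, idf):
        """TF-IDF vector as a dict; terms unseen in training are dropped."""
        counts = Counter(text.lower().split())
        return {t: c * idf[t] for t, c in counts.items() if t in idf}


    def cosine(a, b):
        """Cosine similarity between two sparse vectors stored as dicts."""
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


    def fit_rocchio(docs, labels):
        """Return IDF weights and one averaged prototype vector per class."""
        n_docs = len(docs)
        doc_freq = Counter(t for doc in docs for t in set(doc.lower().split()))
        # Smoothed IDF (an assumption; other weighting variants exist).
        idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
               for t, df in doc_freq.items()}
        sums = defaultdict(lambda: defaultdict(float))
        class_sizes = Counter(labels)
        for doc, label in zip(docs, labels):
            for term, weight in vectorize(doc, idf).items():
                sums[label][term] += weight
        prototypes = {
            label: {t: w / class_sizes[label] for t, w in vec.items()}
            for label, vec in sums.items()
        }
        return idf, prototypes


    def predict_rocchio(text, idf, prototypes):
        """Assign the class whose prototype is most similar to the text."""
        vec = vectorize(text, idf)
        return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))


    # Toy training set (hypothetical).
    train_docs = [
        "the team won the football match",
        "the striker scored a goal",
        "the election results were announced",
        "parliament passed the new law",
    ]
    train_labels = ["sports", "sports", "politics", "politics"]

    idf, prototypes = fit_rocchio(train_docs, train_labels)
    sports_guess = predict_rocchio("a late goal won the match", idf, prototypes)
    politics_guess = predict_rocchio("parliament announced the law", idf, prototypes)

Note that scikit-learn's ``NearestCentroid`` uses Euclidean distance by default, so its predictions may differ from this cosine-based sketch unless the TF-IDF vectors are length-normalized.
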