Update README.rst

kk7nc · web-flow · commit 1b3d7432b5f7 · 2019-05-16T20:20:36.000-04:00
diff --git a/README.rst b/README.rst
@@ -11,9 +11,7 @@ Text Classification Algorithms: A Survey
  
  Referenced paper : `Text Classification Algorithms: A Survey <https://arxiv.org/abs/1904.08067>`__
 
-      
-      
-      
+
 ##################
 Table of Contents
 ##################
@@ -435,6 +433,72 @@ Where N is number of documents and df(t) is the number of documents containing t
         X_test = vectorizer_x.transform(X_test).toarray()
         print("tf-idf with",str(np.array(X_train).shape[1]),"features")
         return (X_train,X_test)
+   
+   
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Compare Feature Extraction Techniques
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
+|                Model                |                                                                        Advantages                                                                        |                                                  Limitations                                                   |
++-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
+|            Weighted Words           |  * Easy to compute                                                                                                                                       |  * It does not capture the position in the text (syntactic)                                                    |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |  * Easy to compute the similarity between 2 documents using it                                                                                           |  * It does not capture meaning in the text (semantics)                                                         |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |  * Basic metric to extract the most descriptive terms in a document                                                                                      |                                                                                                                |
+|                                     |                                                                                                                                                          |  * Common words affect the results (e.g., “am”, “is”, etc.)                                                    |
+|                                     |  * Works with an unknown word (e.g., new words in languages)                                                                                             |                                                                                                                |
++-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
+|                TF-IDF               |  * Easy to compute                                                                                                                                       |  * It does not capture the position in the text (syntactic)                                                    |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |  * Easy to compute the similarity between 2 documents using it                                                                                           |  * It does not capture meaning in the text (semantics)                                                         |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |  * Basic metric to extract the most descriptive terms in a document                                                                                      |                                                                                                                |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |  * Common words do not affect the results due to IDF (e.g., “am”, “is”, etc.)                                                                            |                                                                                                                |
++-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
+|               Word2Vec              |  * It captures the position of the words in the text (syntactic)                                                                                         |  * It cannot capture the meaning of the word from the text (fails to capture polysemy)                         |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |  * It captures meaning in the words (semantics)                                                                                                          |  * It cannot capture out-of-vocabulary words from the corpus                                                   |
++-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
+|         GloVe (Pre-Trained)         |  * It captures the position of the words in the text (syntactic)                                                                                         |  * It cannot capture the meaning of the word from the text (fails to capture polysemy)                         |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |  * It captures meaning in the words (semantics)                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * Memory consumption for storage                                                                              |
+|                                     |  * Trained on huge corpus                                                                                                                                |                                                                                                                |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * It cannot capture out-of-vocabulary words from corpus                                                       |
++-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
+|           GloVe (Trained)           |  * Straightforward to make word vectors capture sub-linear relationships in the vector space (performs better than Word2Vec)                             |  * Memory consumption for storage                                                                              |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |  * Lower weight for highly frequent word pairs, so stop words like “am”, “is”, etc. do not dominate training progress                                    |  * Needs a huge corpus to learn                                                                                |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * It cannot capture out-of-vocabulary words from the corpus                                                   |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * It cannot capture the meaning of the word from  the text (fails to capture polysemy)                        |
++-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
+|               FastText              |  * Works for rare words (their character n-grams are still shared with other words)                                                                      |  * It cannot capture the meaning of the word from the text (fails to capture polysemy)                         |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * Memory consumption for storage                                                                              |
+|                                     |  * Handles out-of-vocabulary words with character-level n-grams                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * Computationally more expensive than GloVe and Word2Vec                                                      |
++-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
+| Contextualized Word Representations |  * It captures the meaning of the word from the text (incorporates context, handling polysemy)                                                           |  * Memory consumption for storage                                                                              |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * Improves performance notably on downstream tasks. Computationally is more expensive in comparison to others |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * Needs another word embedding for all LSTM and feedforward layers                                            |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * It cannot capture out-of-vocabulary words from a corpus                                                     |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |                                                                                                                |
+|                                     |                                                                                                                                                          |  * Works only sentence and document level (it cannot work for individual word level)                           |
++-------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
+
 
 ========================
 Dimensionality Reduction