Commit 1b3d743

Update README.rst
1 parent 289c9ab commit 1b3d743

1 file changed

Lines changed: 67 additions & 3 deletions

README.rst

@@ -11,9 +11,7 @@ Text Classification Algorithms: A Survey
Referenced paper : `Text Classification Algorithms: A Survey <https://arxiv.org/abs/1904.08067>`__

##################
Table of Contents
##################
@@ -435,6 +433,72 @@ Where N is number of documents and df(t) is the number of documents containing t
    X_test = vectorizer_x.transform(X_test).toarray()
    print("tf-idf with", str(np.array(X_train).shape[1]), "features")
    return (X_train, X_test)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compare Feature Extraction Techniques
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------------+
| Model                               | Advantages                                                         | Limitations                                                        |
+=====================================+====================================================================+====================================================================+
| Weighted Words                      | * Easy to compute                                                  | * It does not capture the position in the text (syntactic)         |
|                                     |                                                                    |                                                                    |
|                                     | * Easy to compute the similarity between two documents using it    | * It does not capture meaning in the text (semantics)              |
|                                     |                                                                    |                                                                    |
|                                     | * Basic metric to extract the most descriptive terms in a document | * Common words affect the results (e.g., “am”, “is”, etc.)         |
|                                     |                                                                    |                                                                    |
|                                     | * Works with unknown words (e.g., new words in a language)         |                                                                    |
+-------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------------+
| TF-IDF                              | * Easy to compute                                                  | * It does not capture the position in the text (syntactic)         |
|                                     |                                                                    |                                                                    |
|                                     | * Easy to compute the similarity between two documents using it    | * It does not capture meaning in the text (semantics)              |
|                                     |                                                                    |                                                                    |
|                                     | * Basic metric to extract the most descriptive terms in a document |                                                                    |
|                                     |                                                                    |                                                                    |
|                                     | * Common words do not affect the results due to IDF                |                                                                    |
|                                     |   (e.g., “am”, “is”, etc.)                                         |                                                                    |
+-------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------------+
| Word2Vec                            | * It captures the position of the words in the text (syntactic)    | * It cannot capture the meaning of the word from the text          |
|                                     |                                                                    |   (fails to capture polysemy)                                      |
|                                     | * It captures meaning in the words (semantics)                     |                                                                    |
|                                     |                                                                    | * It cannot capture out-of-vocabulary words from the corpus        |
+-------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------------+
| GloVe (Pre-Trained)                 | * It captures the position of the words in the text (syntactic)    | * It cannot capture the meaning of the word from the text          |
|                                     |                                                                    |   (fails to capture polysemy)                                      |
|                                     | * It captures meaning in the words (semantics)                     |                                                                    |
|                                     |                                                                    | * Memory consumption for storage                                   |
|                                     | * Trained on a huge corpus                                         |                                                                    |
|                                     |                                                                    | * It cannot capture out-of-vocabulary words from the corpus        |
+-------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------------+
| GloVe (Trained)                     | * It is very straightforward, e.g., to enforce the word vectors    | * Memory consumption for storage                                   |
|                                     |   to capture sub-linear relationships in the vector space          |                                                                    |
|                                     |   (performs better than Word2Vec)                                  | * Needs a huge corpus to learn                                     |
|                                     |                                                                    |                                                                    |
|                                     | * Lower weight for highly frequent word pairs, so stop words       | * It cannot capture out-of-vocabulary words from the corpus        |
|                                     |   such as “am” and “is” do not dominate training progress          |                                                                    |
|                                     |                                                                    | * It cannot capture the meaning of the word from the text          |
|                                     |                                                                    |   (fails to capture polysemy)                                      |
+-------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------------+
| FastText                            | * Works for rare words (their character n-grams are still shared   | * It cannot capture the meaning of the word from the text          |
|                                     |   with other words)                                                |   (fails to capture polysemy)                                      |
|                                     |                                                                    |                                                                    |
|                                     | * Solves out-of-vocabulary words with character-level n-grams      | * Memory consumption for storage                                   |
|                                     |                                                                    |                                                                    |
|                                     |                                                                    | * Computationally more expensive than GloVe and Word2Vec           |
+-------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------------+
| Contextualized Word Representations | * It captures the meaning of the word from the text                | * Memory consumption for storage                                   |
|                                     |   (incorporates context, handling polysemy)                        |                                                                    |
|                                     |                                                                    | * Improves performance notably on downstream tasks, but is         |
|                                     |                                                                    |   computationally more expensive than the other techniques         |
|                                     |                                                                    |                                                                    |
|                                     |                                                                    | * Needs another word embedding for all LSTM and feedforward layers |
|                                     |                                                                    |                                                                    |
|                                     |                                                                    | * It cannot capture out-of-vocabulary words from the corpus        |
|                                     |                                                                    |                                                                    |
|                                     |                                                                    | * Works only at sentence and document level                        |
|                                     |                                                                    |   (it cannot work at the individual word level)                    |
+-------------------------------------+--------------------------------------------------------------------+--------------------------------------------------------------------+
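
The difference between the Weighted Words and TF-IDF rows can be illustrated with a minimal, dependency-free sketch. It assumes the unsmoothed weighting TF(d, t) * log(N / df(t)), with N and df(t) as defined earlier in this README. The toy corpus and the helper names ``term_counts`` and ``tf_idf`` are invented for illustration; scikit-learn's ``TfidfVectorizer`` smooths the IDF term by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the cat is on the mat",
    "the dog is in the yard",
    "the stock market is volatile",
]


def term_counts(doc):
    """Weighted Words (bag of words): raw term frequency for one document."""
    return Counter(doc.split())


def tf_idf(docs):
    """Unsmoothed TF-IDF: TF(d, t) * log(N / df(t)) for every document."""
    counts = [term_counts(d) for d in docs]
    n_docs = len(docs)
    # df(t): number of documents that contain term t at least once.
    df = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]


counts = term_counts(docs[0])
weights = tf_idf(docs)[0]

# Raw counts rank the common word "the" highest in the first document.
print(counts.most_common(1))
# With IDF, words that occur in every document ("the", "is") get weight 0,
# while document-specific words ("cat", "mat") keep a positive weight.
print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

Under this weighting a term that appears in all N documents has log(N / N) = 0, which is why the TF-IDF row lists common words as a non-issue while the Weighted Words row lists them as a limitation.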
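
The FastText row's claim about out-of-vocabulary words rests on character n-grams. The sketch below is a simplified illustration, not the FastText implementation: it extracts n-grams from a word wrapped in the boundary markers ``<`` and ``>`` and checks how many n-grams of an unseen word already occur in a toy vocabulary. The helper ``char_ngrams``, the n-gram range of 3 to 5, and the vocabulary are assumptions chosen for illustration.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the set of character n-grams of `word`, with boundary markers."""
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


# Toy training vocabulary (illustrative); "classifiers" is deliberately absent.
vocabulary = ["classify", "classifier", "clustering"]
known_ngrams = set().union(*(char_ngrams(w) for w in vocabulary))

oov_word = "classifiers"
oov_ngrams = char_ngrams(oov_word)
shared = oov_ngrams & known_ngrams

# A word-level lookup table has no entry for the unseen word ...
print(oov_word in vocabulary)
# ... but most of its character n-grams occur in known words, so a subword
# model can compose a vector for it from the vectors of those n-grams.
print(f"{len(shared)} of {len(oov_ngrams)} n-grams are known")
```

Word2Vec and GloVe store one vector per vocabulary entry, so they have nothing to look up for ``classifiers`` here, which matches the out-of-vocabulary limitation listed in their rows.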

========================
Dimensionality Reduction
