Commit cc00a9a

authored
Update README.rst
1 parent 259d931 commit cc00a9a

1 file changed

Lines changed: 4 additions & 0 deletions

README.rst

@@ -256,6 +256,8 @@ Text lemmatization is process in NLP to replaces the suffix of a word with a dif
256256
Word Embedding
257257
~~~~~~~~~~~~~~
258258

259+
Different word embedding procedures have been proposed to translate these unigrams into input that machine learning algorithms can consume. One of the most basic methods is term frequency (TF), where each word is mapped to a number corresponding to the number of occurrences of that word in the whole corpus. Other term-frequency functions have also been used that represent word frequency as a Boolean or a logarithmically scaled number. As a result, each document is translated to a vector with one entry per term, containing the frequency of the words in that document. Although this approach is very intuitive, it suffers from the fact that words used very commonly in the language tend to dominate the representation.
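The three term-frequency variants mentioned above (raw count, Boolean, and logarithmically scaled) can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name ``term_frequency``, the ``scheme`` parameter, and whitespace tokenization are assumptions made for the example, and ``log(1 + count)`` is only one of several common log-scaling choices.

```python
import math
from collections import Counter


def term_frequency(tokens, scheme="raw"):
    """Return a {term: weight} mapping for one tokenized document.

    Hypothetical helper for illustration; `scheme` selects the weighting:
      "raw"     -> number of occurrences of the term
      "boolean" -> 1 if the term occurs at all
      "log"     -> log(1 + count), one common log-scaled variant
    """
    counts = Counter(tokens)
    if scheme == "raw":
        return dict(counts)
    if scheme == "boolean":
        return {term: 1 for term in counts}
    if scheme == "log":
        return {term: math.log(1 + count) for term, count in counts.items()}
    raise ValueError(f"unknown scheme: {scheme!r}")


# Example usage with naive whitespace tokenization (an assumption).
tokens = "the cat sat on the mat".split()
raw_tf = term_frequency(tokens, "raw")
bool_tf = term_frequency(tokens, "boolean")
log_tf = term_frequency(tokens, "log")
```

Frequent function words such as "the" receive the largest raw counts, which is the dominance problem described above; the Boolean and log-scaled variants dampen it but do not remove it.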
260+
259261

260262
.. image:: docs/pic/CBOW.png
261263

@@ -339,6 +341,8 @@ Weighted Words
339341
Term frequency
340342
--------------
341343

344+
Term frequency, also known as bag of words, is the simplest technique for text feature extraction. This method counts the number of occurrences of each word in each document and assigns those counts to the feature space.
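As a rough sketch of how counting words per document yields a feature space, the snippet below builds a small document-term matrix with the standard library only. The function name ``bag_of_words``, the lowercasing step, and whitespace tokenization are assumptions for illustration; a real pipeline would typically use the cleaning steps described earlier in the README.

```python
from collections import Counter


def bag_of_words(documents):
    """Build a vocabulary and a document-term count matrix.

    Hypothetical helper: each row corresponds to one document and each
    column to one vocabulary term (sorted alphabetically).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column = {term: i for i, term in enumerate(vocabulary)}

    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            row[column[term]] = count
        matrix.append(row)
    return vocabulary, matrix


# Example usage on two toy documents.
docs = ["the cat sat", "the cat saw the dog"]
vocabulary, matrix = bag_of_words(docs)
```

Every document is mapped to a vector of the same length (the vocabulary size), so the rows can be passed directly to a classifier that expects fixed-size numeric input.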
345+
342346

343347
-----------------------------------------
344348
Term Frequency-Inverse Document Frequency
