Commit 1a9eefb

Update README.rst
1 parent dfb4cab commit 1a9eefb

1 file changed

Lines changed: 50 additions & 0 deletions

README.rst

@@ -247,6 +247,56 @@ Word Embedding
Word2Vec
--------

Original from https://code.google.com/p/word2vec/

I’ve copied it to a GitHub project so I can apply and track community
patches for my needs (starting with support for Mac OS X
compilation).

- **The makefile and some source files have been modified for Mac OS X
  compilation.** See
  https://code.google.com/p/word2vec/issues/detail?id=1#c5
- **A memory patch for word2vec has been applied.** See
  https://code.google.com/p/word2vec/issues/detail?id=2
- The project file layout has been altered.

There seems to be a segfault in the compute-accuracy utility.

To get started:

::

   cd scripts && ./demo-word.sh

Original README text follows:

This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.

This code provides an implementation of the Continuous Bag-of-Words (CBOW) and
the Skip-gram (SG) models, as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in
the vocabulary using the Continuous Bag-of-Words or the Skip-gram neural
network architectures. The user should specify the following:

- desired vector dimensionality
- the size of the context window for either the Skip-gram or the
  Continuous Bag-of-Words model
- training algorithm: hierarchical softmax and/or negative sampling
- threshold for downsampling the frequent words
- number of threads to use
- the format of the output word vector file (text or binary)
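
To make the roles of the architecture and the context window more concrete, here is a minimal Python sketch of how training examples could be formed from a tokenized sentence. It is an illustration only, not the tool's C implementation: the function names are hypothetical, a fixed window is assumed, and details such as downsampling of frequent words are left out.

```python
def context_indices(n, i, window):
    """Indices within `window` positions of i (excluding i) in a sequence of length n."""
    lo = max(0, i - window)
    hi = min(n, i + window + 1)
    return [j for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window):
    """Skip-gram view: each center word is paired with each of its context words."""
    n = len(tokens)
    return [(tokens[i], tokens[j])
            for i in range(n)
            for j in context_indices(n, i, window)]


def cbow_examples(tokens, window):
    """CBOW view: the list of context words is paired with the center word."""
    n = len(tokens)
    return [([tokens[j] for j in context_indices(n, i, window)], tokens[i])
            for i in range(n)]


if __name__ == "__main__":
    # Toy sentence, made up for illustration.
    sentence = "the quick brown fox".split()
    print(skipgram_pairs(sentence, window=1))
    print(cbow_examples(sentence, window=1))
```

A larger window yields more (and more distant) context words per position, which is why the window size is one of the settings the user is asked to choose.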

Usually, the other hyper-parameters, such as the learning rate, do not
need to be tuned for different training sets.

The script demo-word.sh downloads a small (100 MB) text corpus from the
web and trains a small word vector model. After training is
finished, the user can interactively explore the similarity of the
words.
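
As a rough idea of what exploring word similarity involves, the following Python sketch ranks words by cosine similarity, a common way to compare word vectors. It is not the tool's own utility. It assumes vectors saved in the text format with a header line holding the vocabulary size and dimensionality, followed by one `word v1 v2 ...` line per word; the helper names and the toy vectors are made up for illustration.

```python
import math


def parse_text_vectors(lines):
    """Parse an assumed text layout: 'count dim' header, then 'word v1 v2 ...' rows."""
    rows = iter(lines)
    _count, dim = map(int, next(rows).split())
    vectors = {}
    for line in rows:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:1 + dim]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, word, k=3):
    """Return up to k other words, most similar to `word` first."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


if __name__ == "__main__":
    # Toy vectors, not real model output.
    sample = ["3 2", "king 1.0 0.1", "queen 0.9 0.2", "apple -0.2 1.0"]
    print(nearest(parse_text_vectors(sample), "king", k=2))
```

With a real model the same idea applies, only with a few hundred dimensions per word and a much larger vocabulary.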

More information about the scripts is provided at
https://code.google.com/p/word2vec/


----------------------------------------------
Global Vectors for Word Representation (GloVe)
----------------------------------------------
