Python package for interfacing with Stanford's C GloVe implementation.
Install glovpy from PyPI:

```bash
pip install glovpy
```

Additionally, the first time you import glovpy it will build GloVe from source on your system.
We highly recommend that you use a Unix-based system, preferably a variant of Debian.
The package needs git, make and a C compiler (clang or gcc) installed.
Otherwise the implementation is as barebones as it gets, only the standard library and gensim are being used (gensim only for producing KeyedVectors).
Here's a quick example of how to train GloVe on 20 Newsgroups using Gensim's tokenizer:

```python
from gensim.utils import tokenize
from sklearn.datasets import fetch_20newsgroups

from glovpy import GloVe

# Fetch and tokenize the corpus.
texts = fetch_20newsgroups().data
corpus = [list(tokenize(text, lowercase=True, deacc=True)) for text in texts]

# Train GloVe and inspect the nearest neighbours of "god".
model = GloVe(vector_size=25)
model.train(corpus)
for word, similarity in model.wv.most_similar("god"):
    print(f"{word}, sim: {similarity}")
```

| word | similarity |
|---|---|
| existence | 0.9156746864 |
| jesus | 0.8746870756 |
| lord | 0.8555182219 |
| christ | 0.8517201543 |
| bless | 0.8298447728 |
| faith | 0.8237065077 |
| saying | 0.8204566240 |
| therefore | 0.8177698255 |
| desires | 0.8094088435 |
| telling | 0.8083973527 |
`class glovpy.GloVe(vector_size, window_size, symmetric, distance_weighting, alpha, min_count, iter, initial_learning_rate, threads, memory)`

Wrapper around the original C implementation of GloVe.
| Parameter | Type | Description | Default |
|---|---|---|---|
| vector_size | int | Number of dimensions the trained word vectors should have. | 50 |
| window_size | int | Number of context words to the left (and to the right, if symmetric is True). | 15 |
| alpha | float | Exponent in the weighting function applied to cooccurrence counts. | 0.75 |
| symmetric | bool | If True, words both before and after the target word are used as context; otherwise only preceding words are used. | True |
| distance_weighting | bool | If True (default), cooccurrence counts are weighted by the inverse of the distance between the target word and the context word; if False, raw counts are used. | True |
| min_count | int | Minimum number of times a token has to appear to be kept in the vocabulary. | 5 |
| iter | int | Number of training iterations. | 25 |
| initial_learning_rate | float | Initial learning rate for training. | 0.05 |
| threads | int | Number of threads to use for training. | 8 |
| memory | float | Soft limit for memory consumption, in GB (based on a simple heuristic, so not exact). | 4.0 |
Attributes:

| Name | Type | Description |
|---|---|---|
| wv | KeyedVectors | Token embeddings in the form of Gensim keyed vectors. |
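To make the window and weighting parameters concrete, here is a minimal sketch (not glovpy's actual implementation, and the function name is hypothetical) of how distance-weighted cooccurrence counts can be accumulated:

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window_size=15, symmetric=True, distance_weighting=True):
    """Accumulate (target, context) cooccurrence weights over a tokenized corpus."""
    counts = defaultdict(float)
    for tokens in corpus:
        for i, target in enumerate(tokens):
            # Context words up to window_size positions before the target...
            start = max(0, i - window_size)
            # ...and, if symmetric, also after it.
            end = min(len(tokens), i + window_size + 1) if symmetric else i + 1
            for j in range(start, end):
                if j == i:
                    continue
                # Weight each pair by inverse distance, or count it as 1.
                weight = 1.0 / abs(i - j) if distance_weighting else 1.0
                counts[(target, tokens[j])] += weight
    return counts
```

With `distance_weighting=True`, a context word two positions away contributes only 0.5 to the pair's count; with `symmetric=False`, only words preceding the target are counted.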
`GloVe.train(tokens)`

Train the model on a stream of texts.
| Parameter | Type | Description |
|---|---|---|
| tokens | Iterable[list[str]] | Stream of documents in the form of lists of tokens. The stream has to be reusable, as the model needs at least two passes over the corpus. |
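A plain list of token lists satisfies the reusability requirement, but a bare generator does not: it is exhausted after the first pass. A quick illustration:

```python
def gen():
    yield ["hello", "world"]
    yield ["good", "morning"]

stream = gen()
first_pass = list(stream)   # consumes the generator
second_pass = list(stream)  # nothing left for a second pass

print(len(first_pass), len(second_pass))  # → 2 0
```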
`glovpy.utils.reusable(gen_func)`

Function decorator that turns your generator function into a reusable iterable, so that multiple passes can be made over the values it yields.
| Parameter | Type | Description |
|---|---|---|
| gen_func | Callable | Generator function that you want to be reusable. |
| Returns | Type | Description |
|---|---|---|
| _multigen | Callable | Iterable class wrapping the generator function. |
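glovpy's actual implementation may differ, but a decorator like this can be sketched in a few lines (`reusable_sketch` is a hypothetical stand-in for illustration): each iteration re-invokes the generator function, producing a fresh generator.

```python
def reusable_sketch(gen_func):
    """Decorator sketch: every iteration restarts the generator function."""
    class MultiGen:
        def __init__(self, *args, **kwargs):
            # Remember the arguments so the generator can be restarted.
            self.args, self.kwargs = args, kwargs

        def __iter__(self):
            # A fresh generator is created on every pass.
            return gen_func(*self.args, **self.kwargs)

    return MultiGen

@reusable_sketch
def count_up(n):
    for i in range(n):
        yield i

stream = count_up(3)
print(list(stream), list(stream))  # → [0, 1, 2] [0, 1, 2]
```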
Here's how to stream a very long file line by line in a reusable manner:

```python
from gensim.utils import tokenize

from glovpy import GloVe
from glovpy.utils import reusable

# Decorating the generator function makes the stream restartable,
# so the model can make multiple passes over the corpus.
@reusable
def stream_lines():
    with open("very_long_text_file.txt") as f:
        for line in f:
            yield list(tokenize(line))

model = GloVe()
model.train(stream_lines())
```