
models.doc2vec – Deep learning with paragraph2vec

Deep learning via the distributed memory and distributed bag of words models from [1], using either hierarchical softmax or negative sampling [2] [3].

Install Cython with `pip install cython` before installing gensim to enable the optimized (compiled) doc2vec training (70x speedup [blog]).

Initialize a model with e.g.:

>>> model = Doc2Vec(sentences, size=100, window=8, min_count=5, workers=4)

Persist a model to disk with:

>>> model.save(fname)
>>> model = Doc2Vec.load(fname)  # you can continue training with the loaded model!

The model can also be instantiated from an existing file on disk in the word2vec C format:

>>> model = Doc2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
>>> model = Doc2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format
[1]Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
[2]Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[3]Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[blog]Optimizing word2vec in gensim, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/
class gensim.models.doc2vec.Doc2Vec(sentences=None, size=300, alpha=0.025, window=8, min_count=5, sample=0, seed=1, workers=1, min_alpha=0.0001, dm=1, hs=1, negative=0, dm_mean=0, train_words=True, train_lbls=True, **kwargs)

Bases: gensim.models.word2vec.Word2Vec

Class for training, using and evaluating neural networks described in http://arxiv.org/pdf/1405.4053v2.pdf

Initialize the model from an iterable of sentences. Each sentence is a LabeledSentence object that will be used for training.

The sentences iterable can be simply a list of LabeledSentence elements, but for larger corpora, consider an iterable that streams the sentences directly from disk/network.
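For example, a streaming corpus can be a small iterable class that reads one document per line and wraps it in a LabeledSentence on the fly. This is only a sketch: the file name 'my_corpus.txt' and the 'SENT_%d' labelling scheme are illustrative assumptions, not part of this API.

from gensim.models.doc2vec import LabeledSentence

class MyCorpus(object):
    """Stream LabeledSentence objects from a whitespace-tokenized text file (one document per line)."""
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as fin:
            for lineno, line in enumerate(fin):
                # each line is assumed to be already preprocessed and tokenized
                yield LabeledSentence(words=line.split(), labels=['SENT_%d' % lineno])

sentences = MyCorpus('my_corpus.txt')  # nothing is loaded into RAM until iteration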

If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.

dm defines the training algorithm. By default (dm=1), distributed memory is used; otherwise, distributed bag of words (dbow) is employed.

size is the dimensionality of the feature vectors.

window is the maximum distance between the current and predicted word within a sentence.

alpha is the initial learning rate (will linearly drop to zero as training progresses).

seed = seed for the random number generator.

min_count = ignore all words with total frequency lower than this.

sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 0 (off), a useful value is 1e-5.

workers = use this many worker threads to train the model (=faster training with multicore machines).

hs = if 1 (default), hierarchical softmax will be used for model training (else set to 0).

negative = if > 0, negative sampling will be used; the int for negative specifies how many “noise words” should be drawn (usually between 5 and 20).

dm_mean = if 0 (default), use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used.
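As a rough sketch of how the dm, hs and negative parameters above combine (the parameter values here are arbitrary examples, not recommendations):

# distributed memory (dm=1, the default) with hierarchical softmax (hs=1, the default)
model_dm = Doc2Vec(sentences, size=100, window=8, min_count=5, workers=4)

# distributed bag of words (dm=0) with negative sampling instead of hierarchical softmax
model_dbow = Doc2Vec(sentences, dm=0, hs=0, negative=10, size=100, min_count=5, workers=4)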

accuracy(questions, restrict_vocab=30000, most_similar=<function most_similar>)

Compute accuracy of the model. questions is a filename where lines are 4-tuples of words, split into sections by ”: SECTION NAME” lines. See https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt for an example.

The accuracy is reported (=printed to log and returned as a list) for each section separately, plus there’s one aggregate summary at the end.

Use restrict_vocab to ignore all questions containing a word whose frequency is not in the top-N most frequent words (default top 30,000).

This method corresponds to the compute-accuracy script of the original C word2vec.
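A minimal usage sketch, assuming the questions-words.txt file linked above has been downloaded into the working directory; enabling logging makes the per-section report visible:

import logging
logging.basicConfig(level=logging.INFO)  # accuracy results are reported via the log

sections = model.accuracy('questions-words.txt', restrict_vocab=30000)  # also returned as a list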

build_vocab(sentences)

Build vocabulary from a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings.
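If the model was created without sentences, vocabulary building and training can be run as two explicit passes. MyCorpus below is the hypothetical streaming iterable sketched earlier; each pass constructs a fresh iterator, since a generator can only be consumed once:

model = Doc2Vec(size=100, window=8, min_count=5, workers=4)  # empty model, not trained yet
model.build_vocab(MyCorpus('my_corpus.txt'))  # first pass: collect the vocabulary
model.train(MyCorpus('my_corpus.txt'))        # second pass: train the neural weights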

create_binary_tree()

Create a binary Huffman tree using stored vocabulary word counts. Frequent words will have shorter binary codes. Called internally from build_vocab().

doesnt_match(words)

Which word from the given list doesn’t go with the others?

Example:

>>> trained_model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
init_sims(replace=False)

Precompute L2-normalized vectors.

If replace is set, forget the original vectors and only keep the normalized ones = saves lots of memory!

Note that you cannot continue training after doing a replace. The model becomes effectively read-only = you can call most_similar, similarity etc., but not train.
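For example (a sketch; only do this once training is completely finished):

model.init_sims(replace=True)   # discard raw vectors, keep only the L2-normalized ones
model.most_similar('night')     # similarity queries still work
# model.train(more_sentences)   # ...but further training is no longer possible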

classmethod load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.
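For example, assuming the model was previously persisted with model.save('/tmp/my_model.doc2vec') (the path is a placeholder):

model = Doc2Vec.load('/tmp/my_model.doc2vec', mmap='r')  # memory-map any large arrays, read-only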

classmethod load_word2vec_format(fname, fvocab=None, binary=False, norm_only=True)

Load the input-hidden weight matrix from the original C word2vec-tool format.

Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

binary is a boolean indicating whether the data is in binary word2vec format. norm_only is a boolean indicating whether to only store normalised word2vec vectors in memory. Word counts are read from fvocab filename, if set (this is the file generated by -save-vocab flag of the original C tool).

static log_accuracy(section)
make_table(table_size=100000000, power=0.75)

Create a table using stored vocabulary word counts for drawing random words in the negative sampling training routines.

Called internally from build_vocab().

most_similar(positive=[], negative=[], topn=10)

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words, and corresponds to the word-analogy and distance scripts in the original word2vec implementation.

Example:

>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
most_similar_cosmul(positive=[], negative=[], topn=10)

Find the top-N most similar words, using the multiplicative combination objective proposed by Omer Levy and Yoav Goldberg in [4]. Positive words still contribute positively towards the similarity, negative words negatively, but with less susceptibility to one large distance dominating the calculation.

In the common analogy-solving case, of two positive and one negative examples, this method is equivalent to the “3CosMul” objective (equation (4)) of Levy and Goldberg.

Additional positive or negative examples contribute to the numerator or denominator, respectively – a potentially sensible but untested extension of the method. (With a single positive example, rankings will be the same as in the default most_similar.)

Example:

>>> trained_model.most_similar_cosmul(positive=['baghdad', 'england'], negative=['london'])
[(u'iraq', 0.8488819003105164), ...]
[4]Omer Levy and Yoav Goldberg. Linguistic Regularities in Sparse and Explicit Word Representations, 2014.
n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

Example:

>>> trained_model.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
0.61540466561049689

>>> trained_model.n_similarity(['restaurant', 'japanese'], ['japanese', 'restaurant'])
1.0000000000000004

>>> trained_model.n_similarity(['sushi'], ['restaurant']) == trained_model.similarity('sushi', 'restaurant')
True
precalc_sampling()

Precalculate each vocabulary item’s threshold for sampling.

reset_weights()

Reset all projection weights to an initial (untrained) state, but keep the existing vocabulary.

save(*args, **kwargs)
save_word2vec_format(fname, fvocab=None, binary=False)

Store the input-hidden weight matrix in the same format used by the original C word2vec-tool, for compatibility.
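A short sketch (output paths are placeholders); the exported files can then be consumed by the original C tools such as distance or compute-accuracy:

model.save_word2vec_format('/tmp/vectors.bin', binary=True)                             # C binary format
model.save_word2vec_format('/tmp/vectors.txt', fvocab='/tmp/vocab.txt', binary=False)   # C text format + vocab counts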

similarity(w1, w2)

Compute cosine similarity between two words.

Example:

>>> trained_model.similarity('woman', 'man')
0.73723527

>>> trained_model.similarity('woman', 'woman')
1.0
train(sentences, total_words=None, word_count=0, chunksize=100)

Update the model’s neural weights from a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings.
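One common (but not mandatory) pattern with this API is to control the learning-rate decay manually across several passes. The sentences object must be re-iterable (e.g. a list, or the streaming class sketched earlier); the epoch count and alpha schedule below are illustrative assumptions only:

model = Doc2Vec(size=100, window=8, min_count=5, workers=4, alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
for epoch in range(10):
    model.train(sentences)
    model.alpha -= 0.002            # decrease the learning rate after each pass
    model.min_alpha = model.alpha   # keep the rate fixed within a single pass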

class gensim.models.doc2vec.LabeledBrownCorpus(dirname)

Bases: object

Iterate over sentences from the Brown corpus (part of NLTK data), yielding each sentence out as a LabeledSentence object.
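A usage sketch, where '/path/to/nltk_data/corpora/brown' stands in for the local directory of the NLTK Brown corpus:

from gensim.models.doc2vec import LabeledBrownCorpus, Doc2Vec

sentences = LabeledBrownCorpus('/path/to/nltk_data/corpora/brown')
model = Doc2Vec(sentences, size=100, window=8, min_count=5, workers=4)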

class gensim.models.doc2vec.LabeledLineSentence(source)

Bases: object

Simple format: one sentence = one line = one LabeledSentence object.

Words are expected to be already preprocessed and separated by whitespace, labels are constructed automatically from the sentence line number.

source can be either a string or a file object.

Example:

sentences = LabeledLineSentence('myfile.txt')

Or for compressed files:

sentences = LabeledLineSentence('compressed_text.txt.bz2')
sentences = LabeledLineSentence('compressed_text.txt.gz')
class gensim.models.doc2vec.LabeledSentence(words, labels)

Bases: object

A single labeled sentence = text item. Replaces “sentence as a list of words” from Word2Vec.

words is a list of tokens (unicode strings); labels is a list of text labels associated with this text.
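For example (the label name is arbitrary):

from gensim.models.doc2vec import LabeledSentence

sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])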