Automatically detect common phrases (multiword expressions) from a stream of sentences.
The phrases are collocations (frequently co-occurring tokens). See [1] for the exact formula.
For example, if your input stream (=an iterable, with each value a list of token strings) looks like:
>>> print(list(sentence_stream))
[[u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'],
[u'machine', u'learning', u'can', u'be', u'useful', u'sometimes'],
...,
]
you’d train the detector with:
>>> bigram = Phrases(sentence_stream)
and then transform any sentence (list of token strings) using the standard gensim syntax:
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']
(note new_york became a single token). As usual, you can also transform an entire sentence stream using:
>>> print(list(bigram[any_sentence_stream]))
[[u'the', u'mayor', u'of', u'new_york', u'was', u'there'],
[u'machine_learning', u'can', u'be', u'useful', u'sometimes'],
...,
]
You can also continue updating the collocation counts with new sentences:
>>> bigram.add_vocab(new_sentence_stream)
These phrase streams are meant to be used during text preprocessing, before converting the resulting tokens into vectors using `Dictionary`. See the gensim.models.word2vec module for an example application of using phrase detection.
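For instance, a minimal sketch of that pipeline (assuming the bigram model and sentence_stream from above; note the stream must be restartable, i.e. a list or an iterable object rather than a one-shot generator, since it is consumed more than once):
>>> from gensim.corpora import Dictionary
>>> dictionary = Dictionary(bigram[sentence_stream])
>>> bow_corpus = [dictionary.doc2bow(tokens) for tokens in bigram[sentence_stream]]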
The detection can also be run repeatedly, to get phrases longer than two tokens (e.g. new_york_times):
>>> trigram = Phrases(bigram[sentence_stream])
>>> sent = [u'the', u'new', u'york', u'times', u'is', u'a', u'newspaper']
>>> print(trigram[bigram[sent]])
[u'the', u'new_york_times', u'is', u'a', u'newspaper']
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
class gensim.models.phrases.Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter='_')
Bases: gensim.interfaces.TransformationABC
Detect phrases, based on collected collocation counts. Adjacent words that appear together more frequently than expected are joined together with the _ character.
It can be used to generate phrases on the fly, using the phrases[sentence] and phrases[corpus] syntax.
Initialize the model from an iterable of sentences. Each sentence must be a list of words (unicode strings) that will be used for training.
The sentences iterable can be simply a list, but for larger corpora, consider a generator that streams the sentences directly from disk/network, without storing everything in RAM. See BrownCorpus, Text8Corpus or LineSentence in the gensim.models.word2vec module for such examples.
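For instance, a minimal sketch of such a streaming iterable (the file name is hypothetical; one whitespace-tokenized sentence per line):
>>> class SentenceStream(object):
...     def __init__(self, fname):
...         self.fname = fname
...     def __iter__(self):
...         # open the file anew on each pass, so the stream is restartable
...         for line in open(self.fname):
...             yield line.lower().split()
>>> bigram = Phrases(SentenceStream('/tmp/corpus.txt'))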
min_count: ignore all words and bigrams with total collected count lower than this.
threshold: the score threshold for forming phrases (higher means fewer phrases). A phrase of words a and b is accepted if (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b)) > threshold, where N is the total vocabulary size.
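For a worked example with made-up counts, suppose cnt(new) = 100, cnt(york) = 80, cnt(new, york) = 60, min_count = 5 and a vocabulary of N = 10000 words:
>>> (60 - 5) * 10000 / (100 * 80.0)
68.75
Since 68.75 exceeds the default threshold of 10.0, new and york would be joined into new_york.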
max_vocab_size: the maximum size of the vocabulary, used to prune less common words and keep memory usage bounded. The default of 40M needs about 3.6GB of RAM; increase/decrease max_vocab_size depending on how much memory you have available.
delimiter: the glue character used to join collocation tokens.
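For instance, to glue tokens with a tilde instead (a hypothetical choice):
>>> bigram = Phrases(sentence_stream, delimiter='~')
so that detected phrases come out as e.g. u'new~york' rather than u'new_york'.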
add_vocab(sentences)
Merge counts collected from sentences into this phrase detector's vocabulary.
learn_vocab(sentences, max_vocab_size, delimiter='_')
Collect unigram and bigram counts from the sentences iterable.
classmethod load(fname, mmap=None)
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.
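For instance, to load a model saved earlier, memory-mapping its large arrays (the path is hypothetical):
>>> bigram = Phrases.load('/tmp/phrases_model.pkl', mmap='r')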
save(fname, separately=None, ignore=frozenset([]))
Save the object to file (also see load).
If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.
You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.
ignore is a set of attribute names not to serialize (file handles, caches, etc.). On subsequent load(), these attributes will be set to None.
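For instance (again with a hypothetical path):
>>> bigram.save('/tmp/phrases_model.pkl')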
prune_vocab(vocab, min_reduce)
Remove all entries from the vocab dictionary with count smaller than min_reduce. Modifies vocab in place.
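A toy illustration of the effect, with a made-up vocab and prune_vocab assumed to be in scope:
>>> vocab = {u'the': 10, u'york': 1}
>>> _ = prune_vocab(vocab, 2)  # drops every entry with count < 2
>>> vocab
{u'the': 10}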