models.ldamulticore – parallelized Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) in Python, using all CPU cores to parallelize and speed up model training.

The parallelization uses multiprocessing; if that does not work for you for some reason, try the gensim.models.ldamodel.LdaModel class, which is an equivalent but simpler, single-core implementation.
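
For illustration, a minimal sketch of both entry points, assuming corpus and id2word have already been prepared as in the constructor examples further below:

>>> from gensim.models.ldamodel import LdaModel
>>> from gensim.models.ldamulticore import LdaMulticore
>>>
>>> lda = LdaMulticore(corpus, id2word=id2word, num_topics=10)  # parallel, uses all-but-one core
>>> lda = LdaModel(corpus, id2word=id2word, num_topics=10)      # equivalent single-core fallback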

The training algorithm:

  • is streamed: training documents may come in sequentially, no random access is required,
  • runs in constant memory w.r.t. the number of documents: the size of the training corpus does not affect the memory footprint, so corpora larger than RAM can be processed (see the streamed-corpus sketch below).
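
Because training is streamed, corpus only needs to be a repeatable iterable that yields one bag-of-words document at a time. A minimal sketch, assuming a hypothetical file mycorpus.txt with one plain-text document per line:

>>> from gensim.corpora import Dictionary
>>> from gensim.models.ldamulticore import LdaMulticore
>>>
>>> class MyCorpus(object):
...     """Stream bag-of-words vectors from disk, one document (line) at a time."""
...     def __init__(self, path, dictionary):
...         self.path = path
...         self.dictionary = dictionary
...     def __iter__(self):
...         with open(self.path) as fin:
...             for line in fin:
...                 yield self.dictionary.doc2bow(line.lower().split())
...
>>> dictionary = Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>> corpus = MyCorpus('mycorpus.txt', dictionary)  # repeatable, never loads the full corpus into RAM
>>> lda = LdaMulticore(corpus, id2word=dictionary, num_topics=10)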

Wall-clock performance on the English Wikipedia (2G corpus positions, 3.5M documents, 100K features, 0.54G non-zero entries in the final bag-of-words matrix), requesting 100 topics:

algorithm                                              training time
=====================================================  =============
LdaMulticore(workers=1)                                2h30m
LdaMulticore(workers=2)                                1h24m
LdaMulticore(workers=3)                                1h6m
old LdaModel()                                         3h44m
simply iterating over input corpus (= I/O overhead)    20m

(Measured on an Intel Core i7 server with 4 physical cores, so the optimal setting is workers=3, one less than the number of cores.)

This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents. The model can also be updated with new documents for online training.

The core estimation code is based on the onlineldavb.py script by M. Hoffman [1], see Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010.

[1] http://www.cs.princeton.edu/~mdhoffma

class gensim.models.ldamulticore.LdaMulticore(corpus=None, num_topics=100, id2word=None, workers=None, chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001)

Bases: gensim.models.ldamodel.LdaModel

The constructor estimates Latent Dirichlet Allocation model parameters based on a training corpus:

>>> lda = LdaMulticore(corpus, num_topics=10)

You can then infer topic distributions on new, unseen documents, with

>>> doc_lda = lda[doc_bow]

The model can be updated (trained) with new documents via

>>> lda.update(other_corpus)

Model persistency is achieved through its load/save methods.
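
For example (the filename is arbitrary):

>>> lda.save('lda.model')
>>> lda = LdaMulticore.load('lda.model')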

If given, start training from the iterable corpus straight away. If not given, the model is left untrained (presumably because you want to call update() manually).
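
A sketch of the deferred-training path, assuming corpus and id2word as in the examples above:

>>> lda = LdaMulticore(id2word=id2word, num_topics=10)  # no corpus given: model left untrained
>>> lda.update(corpus)                                  # train later, once the data is available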

num_topics is the number of requested latent topics to be extracted from the training corpus.

id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.

workers is the number of extra processes to use for parallelization. Uses all available cores by default: workers=cpu_count()-1. Note: for hyper-threaded CPUs, cpu_count() returns a useless number – set workers directly to the number of your real cores (not hyperthreads) minus one, for optimal performance.
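
For example, on a machine with 4 physical cores (hyper-threaded or not), set workers explicitly:

>>> lda = LdaMulticore(corpus, id2word=id2word, num_topics=100, workers=3)  # 4 physical cores - 1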

If batch is not set, perform online training by updating the model once every workers * chunksize documents (online training). Otherwise, run batch LDA, updating model only once at the end of each full corpus pass.
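
A sketch of both modes, assuming corpus and id2word as above:

>>> # online (default): update the model every workers * chunksize documents
>>> lda_online = LdaMulticore(corpus, id2word=id2word, num_topics=100, chunksize=2000, workers=3)
>>>
>>> # batch: update the model only once per full pass over the corpus
>>> lda_batch = LdaMulticore(corpus, id2word=id2word, num_topics=100, batch=True, passes=10, workers=3)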

alpha and eta are hyperparameters that affect sparsity of the document-topic (theta) and topic-word (lambda) distributions. Both default to a symmetric 1.0/num_topics prior.

alpha can be set to an explicit array = prior of your choice. It also supports the special values ‘asymmetric’ and ‘auto’: the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.

eta can be a scalar for a symmetric prior over topic/word distributions, or a matrix of shape num_topics x num_words, which can be used to impose asymmetric priors over the word distribution on a per-topic basis. This may be useful if you want to seed certain topics with particular words by boosting the priors for those words.
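
A hypothetical sketch of such topic seeding, boosting the prior of a few hand-picked words in topic 0 (the seed words and boost factor are purely illustrative; id2word is assumed to be a gensim Dictionary):

>>> import numpy as np
>>>
>>> num_topics = 100
>>> num_words = len(id2word)
>>> eta = np.ones((num_topics, num_words)) * (1.0 / num_topics)   # symmetric baseline prior
>>>
>>> # boost the prior of a few hand-picked seed words in topic 0 (illustrative only)
>>> for word in ('genome', 'dna', 'rna'):
...     if word in id2word.token2id:
...         eta[0, id2word.token2id[word]] *= 100.0
...
>>> lda = LdaMulticore(corpus, id2word=id2word, num_topics=num_topics, eta=eta)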

eval_every controls how often a perplexity estimate is calculated and logged from the latest mini-batch: once every eval_every documents. Set it to None to disable perplexity estimation (faster), or to 0 to evaluate perplexity only once, at the end of each corpus pass.
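
For example, to skip perplexity evaluation entirely:

>>> lda = LdaMulticore(corpus, id2word=id2word, num_topics=100, eval_every=None, workers=3)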

decay and offset parameters are the same as Kappa and Tau_0 in Hoffman et al., respectively.

Example:

>>> lda = LdaMulticore(corpus, id2word=id2word, num_topics=100)  # train model
>>> print(lda[doc_bow]) # get topic probability distribution for a document
>>> lda.update(corpus2) # update the LDA model with additional documents
>>> print(lda[doc_bow])

bound(corpus, gamma=None, subsample_ratio=1.0)

Estimate the variational bound of documents from corpus: E_q[log p(corpus)] - E_q[log q(corpus)]

gamma are the variational parameters on topic weights for each corpus document (=2d matrix=what comes out of inference()). If not supplied, will be inferred from the model.
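
A sketch of scoring a held-out set with bound(), where heldout_corpus is a hypothetical in-memory list of bag-of-words documents:

>>> bound = lda.bound(heldout_corpus)                            # E_q[log p] - E_q[log q]
>>> token_count = sum(cnt for doc in heldout_corpus for _, cnt in doc)
>>> print(bound / token_count)                                   # per-word bound; higher is better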

clear()

Clear the model state to free up some memory. Used in the distributed algorithm.

do_estep(chunk, state=None)

Perform inference on a chunk of documents, and accumulate the collected sufficient statistics in state (or self.state if None).

do_mstep(rho, other)

M step: use linear interpolation between the existing topics and collected sufficient statistics in other to update the topics.

inference(chunk, collect_sstats=False)

Given a chunk of sparse document vectors, estimate gamma (parameters controlling the topic weights) for each document in the chunk.

This function does not modify the model (i.e. it is read-only, aka const). The whole input chunk of documents is assumed to fit in RAM; chunking of a large corpus must be done earlier in the pipeline.

If collect_sstats is True, also collect sufficient statistics needed to update the model’s topic-word distributions, and return a 2-tuple (gamma, sstats). Otherwise, return (gamma, None). gamma is of shape len(chunk) x self.num_topics.

Avoids computing the phi variational parameter directly using the optimization presented in Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001.
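
A sketch of calling inference() directly, where doc_bow1 and doc_bow2 are hypothetical bag-of-words documents:

>>> chunk = [doc_bow1, doc_bow2]                 # small, in-memory list of bag-of-words documents
>>> gamma, _ = lda.inference(chunk)              # gamma has shape (len(chunk), lda.num_topics)
>>> theta = gamma / gamma.sum(axis=1)[:, None]   # normalize rows into per-document topic proportions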

classmethod load(fname, *args, **kwargs)

Load a previously saved object from file (also see save).

Large arrays are mmap’ed back as read-only (shared memory).

log_perplexity(chunk, total_docs=None)
print_topic(topicid, topn=10)
print_topics(num_topics=10, num_words=10)
save(fname, *args, **kwargs)

Save the model to file.

Large internal arrays may be stored into separate files, with fname as prefix.

show_topic(topicid, topn=10)
show_topics(num_topics=10, num_words=10, log=False, formatted=True)

For num_topics topics, return the num_words most significant words (10 words per topic, by default).

The topics are returned as a list – a list of strings if formatted is True, or a list of (probability, word) 2-tuples if False.

If log is True, also output this result to log.

Unlike LSA, there is no natural ordering between the topics in LDA. The returned num_topics <= self.num_topics subset of all topics is therefore arbitrary and may change between two LDA training runs.
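
For example:

>>> for topic in lda.show_topics(num_topics=5, num_words=8, formatted=True):
...     print(topic)
...
>>> topics = lda.show_topics(num_topics=5, num_words=8, formatted=False)  # (probability, word) pairs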

sync_state()
update(corpus)

Train the model with new documents, by EM-iterating over corpus until the topics converge (or until the maximum number of allowed iterations is reached). corpus must be an iterable (a repeatable stream of documents).

The E-step is distributed across the worker processes.

This update also supports updating an already trained model (self) with new documents from corpus; the two models are then merged in proportion to the number of old vs. new documents. This feature is still experimental for non-stationary input streams.

For stationary input (no topic drift in new documents), on the other hand, this equals the online update of Hoffman et al. and is guaranteed to converge for any decay in (0.5, 1.0].

update_alpha(gammat, rho)

Update parameters for the Dirichlet prior on the per-document topic weights alpha given the last gammat.

Uses Newton’s method, described in Huang: Maximum Likelihood Estimation of Dirichlet Distribution Parameters. (http://www.stanford.edu/~jhuang11/research/dirichlet/dirichlet.pdf)