This PR fixes some issues in the LDA tutorial and adds some enhancements to make the results better.
## Fixes
- Change "log normal" to "logistic normal", because log-normal cannot be used to approximate Dirichlet. The reference also talked about logistic-normal, not log-normal.
- Add `bias=False` to `Decoder.beta` to match the discussion: $w_n \mid \beta, \theta \sim \mathrm{Categorical}(\sigma(\beta\theta))$. Otherwise, we should change the discussion text to $w_n \mid \beta, \theta \sim \mathrm{Categorical}(\sigma(\beta\theta + \mathrm{bias}))$. `bias=False` also matches the behavior of the original implementation of ProdLDA.
- Add the `total_count` argument and remove `to_event(1)` at the `Multinomial` likelihood. Using `to_event(1)` here will give us a wrong model (`Multinomial` already has `event_shape=1`). Empirically, in the tutorial `epoch_loss=1.12e+07`, while after the fix `epoch_loss=3.72e+05`.
- Use the name `logtheta`, rather than `theta`, for the sampled latent variable (see the sketch below).
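
This is not the exact notebook code, just a minimal sketch of how the model-side fixes fit together; the names `Decoder`, `model`, `docs`, and `num_topics` are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F
import pyro
import pyro.distributions as dist


class Decoder(nn.Module):
    def __init__(self, vocab_size, num_topics):
        super().__init__()
        # bias=False so the likelihood is Categorical(softmax(beta @ theta)),
        # matching the discussion text and the original ProdLDA implementation.
        self.beta = nn.Linear(num_topics, vocab_size, bias=False)

    def forward(self, theta):
        return F.softmax(self.beta(theta), dim=-1)


def model(docs, decoder, num_topics):
    # docs: a (num_docs, vocab_size) matrix of word counts.
    logtheta_loc = docs.new_zeros((docs.shape[0], num_topics))
    logtheta_scale = docs.new_ones((docs.shape[0], num_topics))
    with pyro.plate("documents", docs.shape[0]):
        # The latent drawn from the normal prior is log(theta), hence `logtheta`;
        # theta is recovered with a softmax (the logistic-normal construction).
        logtheta = pyro.sample(
            "logtheta", dist.Normal(logtheta_loc, logtheta_scale).to_event(1)
        )
        theta = F.softmax(logtheta, dim=-1)
        word_probs = decoder(theta)
        # Multinomial already treats the vocab dimension as an event dimension,
        # so no .to_event(1) here; total_count is the largest document length.
        total_count = int(docs.sum(-1).max())
        pyro.sample(
            "obs",
            dist.Multinomial(total_count=total_count, probs=word_probs),
            obs=docs,
        )
```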
## Enhancements
- Set `affine=False` in `BatchNorm1d`: I got no luck with `affine=True`. The inference seems to overfit with those extra parameters of `affine=True`, and the resulting topics do not make much sense.
- Adding `stop_words='english'` at cell 7 seems to help. The number of unique words is reduced from `12999` to `12722`, and the words his/he/was/... are removed, which is a nice preprocessing improvement IMO (see the sketch below).
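
Again only a sketch of the two settings, assuming the 20 newsgroups data used in the tutorial; the `max_df`/`min_df` values and variable names are placeholders, the relevant parts are `stop_words='english'` and `affine=False`:

```python
import torch.nn as nn
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Drop English stop words (his/he/was/...) while building the document-term
# matrix; this shrinks the vocabulary and removes uninformative words.
news = fetch_20newsgroups(subset="all")
vectorizer = CountVectorizer(max_df=0.5, min_df=20, stop_words="english")
docs = vectorizer.fit_transform(news.data)

# BatchNorm without the learnable affine parameters, which otherwise seem to
# overfit and hurt the quality of the recovered topics.
vocab_size = docs.shape[1]
batch_norm = nn.BatchNorm1d(vocab_size, affine=False)
```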
## Result

According to the notebook, the word cloud topics are more coherent than the current ones. IMO, the result is pretty good now. :) cc @ucals