Online NMF #2007

Merged
merged 161 commits into from
Jan 17, 2019

Contributor

@anotherbugmaster anotherbugmaster commented Mar 29, 2018

Online Robust NMF. Resolves #132. Based on this paper.

Contributor

@menshikh-iv menshikh-iv left a comment

Good start @anotherbugmaster 👍

Main things that you need to do now

  1. Benchmark (add notebook where you compare current implementation with others using different tasks)
  2. Support for BoW format (feel free to drop numpy dense matrices)
  3. API (should be very similar to Lda/Lsi)
  4. Tests
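
For the BoW item in the list above, here is a minimal sketch of what a bag-of-words corpus looks like: each document becomes a sparse list of (token_id, count) pairs instead of a dense row. The helpers `build_vocab` and `to_bow` are hypothetical names written in plain Python for illustration; they are not the code from this PR.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list into a sorted list of (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


docs = [["nmf", "topic", "model", "topic"], ["sparse", "matrix", "nmf"]]
vocab = build_vocab(docs)
corpus = [to_bow(doc, vocab) for doc in docs]
```

Only the tokens that actually occur in a document are stored, which is what makes this representation a good fit for large, sparse text corpora.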

@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Mar 29, 2018
@menshikh-iv menshikh-iv changed the title from "[WIP] Online NMF" to "Online NMF" on Jan 17, 2019
@menshikh-iv
Contributor

Time to merge, awesome work @anotherbugmaster 🚀💣🔥💣🚀

@menshikh-iv menshikh-iv merged commit 239856c into piskvorky:develop Jan 17, 2019
@piskvorky
Owner

piskvorky commented Jan 18, 2019

@anotherbugmaster can you share those TL;DR comparisons against other implementations (sklearn etc), as per my comment above (time, memory, quality)?

I'd like to include that in the release notes. Thanks!

@piskvorky
Owner

I found some numbers in the images at the bottom of the tutorial. Is the Gensim implementation really 6x slower than sklearn's?

@anotherbugmaster
Contributor Author

anotherbugmaster commented Jan 18, 2019

@anotherbugmaster can you share those TL;DR comparisons, as per my comment above (time, memory, quality)?

I'd like to include that in the release notes. Thanks!

Sure, here they are: https://github.com/anotherbugmaster/gensim/blob/e34b939e9a5f1f79f9582ef3d0618fd43bbd7be2/docs/notebooks/nmf_wikipedia.ipynb

image

I found some numbers in the images at the bottom of the tutorial. Is the Gensim implementation really 6x slower than sklearn's?

Only with certain hyperparameters. In most cases it's 2-3x faster than sklearn, and those configurations also have a better F1:

image

@piskvorky
Owner

piskvorky commented Jan 19, 2019

@anotherbugmaster thanks, but I don't know how to read any of these tables, what these NaNs mean (?), or what F1 is doing in an unsupervised method. It looks more like some internal benchmark notes -- what I'd like is some human-readable digestion and insights.

There's almost no text in the tutorial. The only part that was easy to interpret was the images at the end, which say Gensim is 6x slower than anything else :(

Can you please post a TL;DR comparison against sklearn on the same dataset (wiki? images?): memory, time, quality? Why should someone use our NMF implementation, instead of other implementations?

@anotherbugmaster
Contributor Author

@anotherbugmaster thanks, but I don't know how to read any of these tables, what these NaNs mean (?), or what F1 is doing in an unsupervised method. It looks more like some internal benchmark notes -- what I'd like is some human-readable digestion and insights.

There's almost no text in the tutorial. The only part that was easy to interpret was the images at the end, which say Gensim is 6x slower than anything else :(

Can you please post a TL;DR comparison against sklearn on the same dataset (wiki? images?): memory, time, quality? Why should someone use our NMF implementation, instead of other implementations?

Ok, Radim, how about the first table in the release notes?

https://github.com/RaRe-Technologies/gensim/releases

image

Also, here are the insights from the tutorial notebook:

  • Gensim NMF clearly beats sklearn implementation both in terms of speed and quality
  • LDA is still significantly better in terms of quality, though interpretabiliy of topics and speed are clearly worse then NMF's

Here is the RAM comparison on Wikipedia:

image

NaN means that the metric wasn't computed for that particular model (coherence for sklearn NMF, for example).

F1 is the quality of a model on the downstream task, 20-newsgroups classification.
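
To make the F1 column concrete, here is a small sketch of how an F1 score on a downstream classification task can be computed from true and predicted labels. It assumes macro-averaging over classes, which is a guess: the thread does not say which averaging the notebook used, and the labels below are made-up toy data.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score of a single class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Toy labels standing in for classifier output on topic-model features.
y_true = ["sci", "sci", "rec", "rec"]
y_pred = ["sci", "rec", "rec", "rec"]
score = macro_f1(y_true, y_pred)
```

In this setup the topic model is judged indirectly: its document-topic vectors are fed to a classifier, and a higher F1 means the topics were more useful as features.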

Our NMF is online (you can't just run sklearn on wikipedia, it won't fit in memory) and faster than sklearn NMF on sparse and large datasets (which is the case for Topic Modeling).

@piskvorky
Owner

piskvorky commented Jan 19, 2019

@anotherbugmaster I already saw all these tables and notebook multiple times. They are not what I am asking. Nobody but you knows how the numbers relate, what's important, or even which way is up.

I am asking for a short human summary of a head-to-head comparison of memory, time and quality with other NMF implementations (e.g. sklearn), on a concrete dataset.

"Gensim NMF clearly beats sklearn implementation both in terms of speed and quality" is substantiated where? The images show it's actually 6x slower. I don't know where "clearly beats in quality" is coming from -- the numbers seem either NaN or better?

Similarly for "LDA is still significantly better in terms of quality, though interpretabiliy of topics and speed are clearly worse then NMF's": ignoring the typos, what does this claim actually mean? The whole point of LDA is to produce interpretable topics.

I'm sure the code is fine if @menshikh-iv OKed it and merged. That's not the issue. The issue is the documentation, especially with regard to motivation and user guidance. As a user, I don't understand where this NMF implementation stands, how it compares to other implementations, when I should use it (or not use it), what the parameters mean and which ones I should change (or not touch).

I can help with the language once I understand it myself, but I need some insights, not a huge table full of some unexplained numbers and code cells without commentary.

@menshikh-iv do you understand what I'm asking? Can you help out here?

@piskvorky
Owner

piskvorky commented Jan 19, 2019

For clarity, here's an example of what I meant by "insights": something users can understand, that grounds them conceptually and guides their intuition about this implementation:

Gensim NMF should be used whenever you want to retrieve interpretable (non-negative factors) topics from a very large and sparse dataset. Its online incremental training allows you to update the NMF model in pieces, in constant memory. This is in stark contrast to other NMF implementations (such as in scikit-learn), where the entire dataset must be loaded into memory at once. It also allows resuming training with more data at a later time. Another application of this "online" architecture is joining NMF models built from partial data slices into a single model (e.g. individual NMF models from weekly time-slices combined into a single NMF model for the whole year).

In terms of memory, the Gensim NMF implementation scales linearly with the number of terms and topics. You also need to be able to load a partial chunk of documents into RAM at a time (the chunksize parameter). For example, on the English Wikipedia dataset, you'll need 2 GB RAM for 100 NMF topics and 100k vocabulary, updating the model with chunks of 10,000 documents at a time. See this notebook table for more details and benchmark numbers.

In terms of CPU, the runtime is dominated by the coordinate descent in each update iteration. You can control the CPU-accuracy tradeoff by tweaking the ABC parameter. The defaults are set to work well on standard English texts (sparsity <1%), but if your dataset is dense, you may want to change it to EFG.

In terms of model quality, the algorithm implemented in Gensim NMF follows this paper. It achieves the online training capability by calculating only approximate XYZ. On the English Wikipedia, this results in L2 reconstruction error of ABC (compared to sklearn's DEF). For more information, see the paper above or our benchmarks here.

If you want to use NMF, check out our official tutorial here for a step-by-step code guide. The API parameters are documented here.

(just an example, maybe the facts are wrong, or the implementation cannot do this -- I don't know. but this was our goal.)

@anotherbugmaster
Contributor Author

anotherbugmaster commented Jan 19, 2019

@anotherbugmaster I already saw all these tables multiple times. They are not what I am asking. Nobody but you knows how they relate, what's important, or even which way is up.

I am asking for a head-to-head comparison of memory, time and quality with other NMF implementations (e.g. sklearn), on a concrete dataset.

"Gensim NMF clearly beats sklearn implementation both in terms of speed and quality" is substantiated where? The images show it's actually 6x slower. I don't know where "clearly beats in quality" is coming from.

Similarly for "LDA is still significantly better in terms of quality, though interpretabiliy of topics and speed are clearly worse then NMF's": ignoring the typos, what does this claim actually mean? The whole point of LDA is to produce interpretable topics, that's its "quality".

I'm sure the code is fine if @menshikh-iv OKed it and merged. But I don't understand where this functionality stands when it comes to how it's different from other approaches, when I should use it (or not use it). Consequently, I don't know how to communicate it to users. I need some insights, not a huge table full of some unexplained numbers.

Radim, to be clear, the Olivetti faces decomposition was added just to show that it's possible to extract latent components. The model is optimized for the case of sparse corpora, not dense image matrices.

The main benchmark dataset is 20-newsgroups, and the huge table concerns this dataset.

As for the quality, I can't entirely agree, because:

  • Perplexity and coherence don't always correlate with human judgment
  • TMs are used as features in downstream tasks, and that is something we can measure precisely

I see what you mean by insights. I'll try to make something similar to your example.

@anotherbugmaster
Contributor Author

anotherbugmaster commented Jan 19, 2019

Gensim NMF should be used whenever you want to retrieve interpretable (non-negative factors) topics from a very large and sparse dataset. Its online incremental training allows you to update the NMF model in pieces, in constant memory. This is in stark contrast to other NMF implementations (such as in scikit-learn), where the entire dataset must be loaded into memory at once. It also allows resuming training at a later time.

In terms of memory, the Gensim NMF implementation scales linearly with the number of terms and topics. You also need to be able to load a partial chunk of documents into RAM at a time (the chunksize parameter). For example, on the English Wikipedia dataset, you'll need 150Mb RAM for 50 NMF topics and 100k vocabulary, updating the model with chunks of 2,000 documents at a time. See this notebook table for more details and benchmark numbers.
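
As a rough illustration of the "scales linearly with the number of terms and topics" claim, here is a back-of-the-envelope sketch. It assumes a single dense topic-term matrix of 8-byte floats; `topic_matrix_bytes` is a hypothetical helper, and the real implementation also holds the current chunk and auxiliary matrices, so this is a lower bound and not a reproduction of the figure quoted above.

```python
def topic_matrix_bytes(num_terms, num_topics, bytes_per_value=8):
    """Size of one dense num_terms x num_topics matrix (assumed float64)."""
    return num_terms * num_topics * bytes_per_value


size_50_topics = topic_matrix_bytes(100_000, 50)    # 100k vocabulary, 50 topics
size_100_topics = topic_matrix_bytes(100_000, 100)  # same vocabulary, twice the topics

print(size_50_topics / 1e6, "MB")   # 40.0 MB
print(size_100_topics / 1e6, "MB")  # 80.0 MB
```

Doubling either the vocabulary or the number of topics doubles this part of the footprint, while the number of documents does not enter the estimate at all.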

In terms of CPU, the runtime is dominated by the coordinate descent in each update iteration. You can control the CPU-accuracy tradeoff by tweaking the w_max_iter, w_stop_condition, h_r_max_iter, h_r_stop_condition and sparse_coef parameters. The defaults are set to work well on standard English texts (sparsity <1%), but if your dataset is dense, you may want to increase sparse_coef.

In terms of model quality, the algorithm implemented in Gensim NMF follows this paper. It achieves the online training capability by accumulating the document-topic matrices of each subsequent batch in a special way and then iteratively computing the topic-word matrix. For more information, see the paper above or our benchmarks here.

If you want to use NMF, check out our official tutorial here for a step-by-step code guide. The API parameters are documented here.

@piskvorky
Owner

piskvorky commented Jan 20, 2019

Thanks for your patience, but we need to improve the docs significantly before we really promote this exciting new model addition.

Still missing: clear numbers from a single benchmark (ideally Wikipedia, 3 numbers: RAM + time + reconstruction error/loss), and a TL;DR comparison to sklearn (same 3 calculated/estimated numbers, for a direct head-to-head).
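
One way to collect the three requested numbers for a single run is sketched below, using only the standard library. `benchmark` and `toy_fit` are hypothetical names; `toy_fit` is a placeholder that returns fixed factors, not an NMF solver, and `tracemalloc` only sees allocations made through Python's allocator, so treat the memory figure as indicative.

```python
import math
import time
import tracemalloc


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def l2_reconstruction_error(v, w, h):
    """Frobenius norm of V - W*H."""
    approx = matmul(w, h)
    squared = sum(
        (x - y) ** 2 for v_row, a_row in zip(v, approx) for x, y in zip(v_row, a_row)
    )
    return math.sqrt(squared)


def benchmark(fit, v):
    """Run fit(v) once and report wall time, peak traced memory and L2 error."""
    tracemalloc.start()
    started = time.perf_counter()
    w, h = fit(v)
    seconds = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "seconds": seconds,
        "peak_bytes": peak_bytes,
        "l2_error": l2_reconstruction_error(v, w, h),
    }


def toy_fit(v):
    # Hypothetical stand-in for a real model fit: fixed rank-1 factors.
    return [[1.0], [2.0]], [[1.0, 2.0]]


v = [[1.0, 2.0], [2.0, 5.0]]
report = benchmark(toy_fit, v)
```

Running the same `benchmark` wrapper around two implementations on the same matrix would give the head-to-head comparison asked for here.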

I don't know how else to say it, but we need a human-friendly TL;DR comparison of NMF implementation in Gensim and other NMF implementations. The current nondescript table full of numbers and NaNs, in a notebook without comments, is insufficient.

@anotherbugmaster Can you improve the parameter intuition too please? Enumerating the parameter names like h_r_max_iter tells me nothing. What are they for? What are their acceptable value ranges? When would I want to change them? How do they relate to each other? The API documentation under https://radimrehurek.com/gensim/models/nmf.html is similarly terse and frustrating (compare to sklearn NMF, Gensim SVD).

Try to see this from the user perspective please. Users are not going to decode academic papers or pore over the code, just to understand what this model is supposed to do and how it differs from their other options. We have to provide a basic overview and intuition.

Radim, to be clear, the Olivetti faces decomposition was added just to show that it's possible to extract latent components. The model is optimized for the case of sparse corpora, not dense image matrices.

That wasn't clear at all from the notebook. In fact, "Olivetti faces" is not even introduced / described anywhere. As a reader, I don't know what I'm looking at, why, or what I'm supposed to be seeing there.

I assume by 150 Mb you mean megabytes, right?

Does this model support merging partial models built from independent chunks or not? I see you removed this sentence from my example text which you used as a template (I completely made it up, are you sure the algo descriptions fit?), but then the rest of the text makes it sound like it does support such partial training.

@anotherbugmaster
Contributor Author

anotherbugmaster commented Jan 21, 2019

Thanks for your patience, but we need to improve the docs significantly before we really promote this exciting new model addition.

Still missing: clear numbers from a single benchmark (ideally Wikipedia, 3 numbers: RAM + time + reconstruction error/loss), and a TL;DR comparison to sklearn (same 3 calculated/estimated numbers, for a direct head-to-head).

Radim, as I wrote before, I can't run sklearn's NMF on Wikipedia (at least on my machine); it takes too much RAM. I can either run it on a smaller corpus (like 20-newsgroups) or compare NMF with some other model, LDA for example (though it wouldn't be completely fair to compare L2 here). Do you have any ideas on how I can implement the right benchmark?

I don't know how else to say it, but we need a human-friendly TL;DR comparison of NMF implementation in Gensim and other NMF implementations. The current nondescript table full of numbers and NaNs, in a notebook without comments, is insufficient.

Okay, I obviously need to revamp the notebooks and NMF's documentation. I'll try to do it this week.

@anotherbugmaster Can you improve the parameter intuition too please? Enumerating the parameter names like h_r_max_iter tells me nothing. What are they for? What are their acceptable value ranges? When would I want to change them? How do they relate to each other? The API documentation under radimrehurek.com/gensim/models/nmf.html is similarly terse and frustrating (compare to sklearn NMF, Gensim SVD).

Sure. I think I'll add more info to the module docstrings and describe what the W, h and r matrices mean and how exactly the algorithm works.

Those parameters are for the estimation and maximization steps of the algo. For example, h_r_max_iter is the maximum number of iterations for the estimation step, and h_r_stop_condition is the error value that is considered small enough to finish the step.

w_max_iter and w_stop_condition work the same way.
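
The interplay of a `*_max_iter` and a `*_stop_condition` pair can be sketched with a generic loop: keep updating until the error is small enough or the iteration budget runs out. This is an illustration under the description given above, with a toy one-dimensional non-negative fit; `run_step`, `update` and `error` are hypothetical names, and the exact stopping test in the merged code may differ.

```python
def run_step(update, error, state, max_iter, stop_condition):
    """Iterate until error(state) < stop_condition or max_iter updates are used."""
    iterations_used = 0
    for _ in range(max_iter):
        if error(state) < stop_condition:
            break
        state = update(state)
        iterations_used += 1
    return state, iterations_used


# Toy problem: find h >= 0 such that W * h is close to V (the answer is h = 3).
W, V = 2.0, 6.0


def error(h):
    return (W * h - V) ** 2


def update(h):
    # One projected gradient step: move against the gradient, then clip at zero.
    return max(0.0, h - 0.1 * 2 * W * (W * h - V))


h_loose, iters_loose = run_step(update, error, 0.0, max_iter=2, stop_condition=1e-3)
h_tight, iters_tight = run_step(update, error, 0.0, max_iter=50, stop_condition=1e-3)
```

A small iteration budget cuts the step short and leaves a larger error, while a generous budget lets the stop condition end the step; that is the CPU-accuracy tradeoff these parameters expose.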

Try to see this from the user perspective please. Users are not going to decode academic papers or pore over the code, just to understand what this model is supposed to do and how it differs from their other options. We have to provide a basic overview and intuition.

I see that a lot of things seem vague, I'll try to clear things up.

Radim, to be clear, the Olivetti faces decomposition was added just to show that it's possible to extract latent components. The model is optimized for the case of sparse corpora, not dense image matrices.

That wasn't clear at all from the notebook. In fact, "Olivetti faces" is not even introduced / described anywhere. As a reader, I don't know what I'm looking at, why, or what I'm supposed to be seeing there.

Fair enough. I can either elaborate more on this section or we can completely remove it to not confuse readers.

I assume by 150 Mb you mean megabytes, right?

Yep, that's right.

Does this model support merging partial models built from independent chunks or not? I see you removed this sentence from my example text which you used as a template (I completely made it up, are you sure the algo descriptions fit?), but then the rest of the text makes it sound like it does support such partial training.

No, the model doesn't support merging of partial chunks, and I have no idea how to implement that even in theory. Maybe "updating in pieces" is not a good description of the model's behavior. It's more that it updates iteratively, which means that we need to go through the corpus from top to bottom rather than build partial models and then merge them.

Yes, I get that you made the example up, but it's actually quite close to the truth; I fixed the parts where it wasn't.
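
The difference between iterating through a corpus in order and merging independently built models can be sketched as follows. `iter_chunks` and `RunningEstimate` are hypothetical stand-ins, not the NMF code: the toy "model" keeps a single running estimate of document length that depends on everything it has already seen, so each chunk has to be applied to the current state in sequence.

```python
from itertools import islice


def iter_chunks(corpus, chunksize):
    """Yield consecutive lists of at most chunksize documents from any iterable."""
    iterator = iter(corpus)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


class RunningEstimate:
    """Toy model whose state is nudged towards each new chunk's statistic."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.estimate = 0.0
        self.chunks_seen = 0

    def update(self, chunk):
        # Statistic of this chunk: mean document length (sum of BoW counts).
        lengths = [sum(count for _, count in doc) for doc in chunk]
        chunk_mean = sum(lengths) / len(lengths)
        self.estimate += self.learning_rate * (chunk_mean - self.estimate)
        self.chunks_seen += 1


corpus = [[(0, 2)], [(1, 1), (2, 1)], [(0, 4)], [(3, 2), (4, 2)], [(5, 8)]]
model = RunningEstimate()
for chunk in iter_chunks(corpus, chunksize=2):
    model.update(chunk)
```

Only one chunk is held in memory at a time, and training can resume later by calling `update` on further chunks, but there is no operation here that combines two separately trained states.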

@piskvorky
Owner

piskvorky commented Jan 21, 2019

I can't run sklearn's NMF on Wikipedia (at least on my machine), it takes too much RAM.

I understand, hence the word "estimated". Btw how much RAM would be needed? Perhaps we can run it on one of our machines (64 GB RAM).

I can either elaborate more on this section or we can completely remove it to not confuse readers.

I like that idea (showing a different non-text usecase/workflow), I'd prefer to keep it. Being visual always helps!

Expanding the high-level descriptions, "what am I looking at and why, how is it different from others" is really what is needed here, across the board.

We went over the API docs with Ivan today, and we'll need to:

  • Add a module docstring with overview as per above (currently the docstring is missing).
  • Clarify the parameters, their relationship, ranges, perf/quality implications. Things like
    sparse_coef (float, optional) – The more it is, the more sparse are matrices. Significantly increases performance. or
    lambda (float, optional) – Residuals regularizer coefficient. Increasing it helps prevent ovefitting. Has no effect if use_r is set to False. (what is use_r? not documented)
    normalize (bool, optional) – Whether to normalize results. Offers "kind-of-probabilistic" result.
    are too vague, not actionable. We need to build the user intuition more, not require users to study papers or code.

Thanks!

Labels
incubator project PR is RaRe incubator project interesting PR ⭐ Interesting PR topic, but not ready (need much work to finish)
4 participants