Use Bounter for approx frequency counting #1654

Open
piskvorky opened this issue Oct 25, 2017 · 7 comments
Labels: difficulty medium, feature, performance

Comments

@piskvorky
Owner

piskvorky commented Oct 25, 2017

Multiple models in gensim do a full corpus scan as their first step, to get the frequencies / counts of tokens, bigrams etc: word2vec, doc2vec, tfidf, phrases, make_wiki...

This step can require a lot of memory and be slow, because it's typically not parallelized.

Replace all such scanning by Bounter. Let users specify how much memory they want to dedicate in an optional parameter, with some sane default like "1 GB" or "0.25 * total RAM" or something.
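A minimal sketch of how such an optional memory parameter could be resolved. The helper names (`resolve_count_memory_mb`, `total_ram_mb`) are hypothetical and not part of gensim or Bounter; the 1 GB fallback and the 0.25 fraction are just the defaults floated above.

```python
import os

DEFAULT_BUDGET_MB = 1024  # the "1 GB" fallback suggested in this issue


def total_ram_mb():
    """Best-effort physical RAM in MB; None where it cannot be queried."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        num_pages = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are not available on every platform.
        return None
    if page_size <= 0 or num_pages <= 0:
        return None
    return page_size * num_pages // (1024 * 1024)


def resolve_count_memory_mb(requested_mb=None, ram_mb=None, fraction=0.25):
    """Pick the memory budget (in MB) for the counting pass.

    An explicit user value wins; otherwise use a fraction of total RAM,
    and fall back to DEFAULT_BUDGET_MB when RAM size is unknown.
    """
    if requested_mb is not None:
        if requested_mb <= 0:
            raise ValueError("memory budget must be positive")
        return int(requested_mb)
    if ram_mb is None:
        ram_mb = total_ram_mb()
    if ram_mb is None:
        return DEFAULT_BUDGET_MB
    return max(1, int(ram_mb * fraction))


print(resolve_count_memory_mb(512))          # user-specified budget
print(resolve_count_memory_mb(ram_mb=8000))  # 0.25 * total RAM
```

The resolved value would then be handed to the counter as its size limit.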

This is also a good place to revisit which algorithms need only the counts['abc'] lookup, versus full keys()/items() iteration. The current counting code in gensim is probably more demanding than it needs to be (it iterates over keys), because with dict or Counter there is no cost difference: both support both operations.

But there is a significant difference for Bounter: counts-only is more efficient than counts-and-iteration. So unless a counting algorithm in gensim really needs the keys, we should rewrite it using only bounter(need_iteration=False).
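To illustrate why counts-only can be cheaper, here is a toy Count-Min Sketch in plain Python. This is an illustrative stand-in, not Bounter's implementation: the class, its `width`/`depth` parameters, and the hashing scheme are assumptions made for the example. The point is that it answers `counts['abc']` from fixed-size tables without ever storing the keys, so it cannot offer keys()/items().

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Toy counts-only counter: fixed memory, never stores the keys."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One salted hash per row, so each row maps the key independently.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf8"), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def update(self, keys):
        for key in keys:
            for row, col in self._buckets(key):
                self.tables[row][col] += 1

    def __getitem__(self, key):
        # Collisions can only inflate a cell, so the minimum over rows
        # is an estimate that never undercounts.
        return min(self.tables[row][col] for row, col in self._buckets(key))


tokens = ["a", "few", "words", "a", "few", "times"]

exact = Counter(tokens)       # stores every key; supports iteration
sketch = CountMinSketch()     # stores no keys; lookups only
sketch.update(tokens)

print(exact["few"], sorted(exact.keys()))
print(sketch["few"])          # approximate count, at least the true count
```

An algorithm that only ever does lookups (for example, scoring a candidate bigram it already holds) works with the second structure; one that has to enumerate the vocabulary needs the first.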

@piskvorky piskvorky added the performance Issue related to performance (in HW meaning) label Oct 25, 2017
@menshikh-iv menshikh-iv added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 26, 2017
@aneesh-joshi
Contributor

Hey @menshikh-iv @piskvorky
I would like to take this up.
(I hope it doesn't require any GPU support, etc.)

Could you direct me to a good point to start?
I am familiar with word2vec.py

@menshikh-iv
Contributor

menshikh-iv commented Mar 6, 2018

Hello @aneesh-joshi,

  1. This doesn't require GPU support, don't worry
  2. Start with a simpler case: gensim.corpora.Dictionary looks like the simplest one. Look into its internal structures and check where Bounter can be useful.

@aneesh-joshi
Contributor

hey @menshikh-iv

I am trying to apply bounter to gensim.corpora.dictionary and had some doubts

https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/corpora/dictionary.py#L203

implies that it will return index, frequency

But the example shows
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/corpora/dictionary.py#L227

frequency as 1 although mama is there twice

Is this a documentation error or am I missing something?

Also, I cannot understand where I should implement bounter. Could you give me a specific idea?

Thanks!

@menshikh-iv
Contributor

@aneesh-joshi this is the frequency in the "current" document, i.e.

>>> dct.doc2bow(["this","is","máma"])  # word 'máma' mentioned only once
[(2, 1)]
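A rough plain-Python sketch of that per-document behaviour, not gensim's actual code. The function name `doc2bow_sketch` and the `token2id` mapping are made up for illustration; the ids are chosen so that 'máma' maps to 2, matching the output above.

```python
from collections import Counter

# Hypothetical vocabulary built earlier from the whole corpus.
token2id = {"maso": 0, "mele": 1, "máma": 2, "ema": 3, "má": 4}


def doc2bow_sketch(document, token2id):
    """Count known tokens in this one document only.

    Corpus-wide frequencies play no role here; unknown tokens are skipped.
    """
    counts = Counter(token for token in document if token in token2id)
    return sorted((token2id[token], freq) for token, freq in counts.items())


print(doc2bow_sketch(["this", "is", "máma"], token2id))   # [(2, 1)]
print(doc2bow_sketch(["máma", "ema", "máma"], token2id))  # [(2, 2), (3, 1)]
```

So the 1 in the docstring example reflects a single occurrence in the document passed to doc2bow, however often the word appears in the corpus the dictionary was built from.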

@menshikh-iv
Contributor

Important: for proper integration of Bounter, we need to build wheels for all platforms (in the same way as for gensim), because it will become a core dependency.

@aneesh-joshi
Contributor

I am still trying to find a proper place to apply Bounter.
Once that's done, will start work on core dependencies

@menshikh-iv
Contributor

@aneesh-joshi internal dictionaries in Dictionary, vocab in w2v maybe, etc


3 participants