Use Bounter for approx frequency counting #1654
Comments
Hey @menshikh-iv @piskvorky, could you direct me to a good point to start?
Hello @aneesh-joshi,
Hey @menshikh-iv, I am trying to apply Bounter to gensim.corpora.dictionary and had some doubts. The documentation implies that it will return (index, frequency), but the example shows the frequency as 1 although "mama" is there twice. Is this a documentation error or am I missing something? Also, I cannot understand where I should implement Bounter. Could you give me a specific idea? Thanks!
@aneesh-joshi this is the frequency in the "current" document, i.e.
>>> dct.doc2bow(["this", "is", "máma"])  # word 'máma' is mentioned only once
[(2, 1)]
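To make the per-document point concrete, here is a minimal standard-library stand-in for the counting that doc2bow performs. It is not gensim's implementation: the function name `doc2bow_sketch` and the `token2id` mapping (with its ids) are hypothetical, chosen only so that 'máma' has id 2 as in the output above.

```python
from collections import Counter


def doc2bow_sketch(token2id, document):
    """Count known tokens within ONE document, returning sorted (id, count) pairs."""
    # Tokens missing from the mapping ("this", "is") are simply skipped.
    counts = Counter(token for token in document if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


# Hypothetical mapping; only the id of 'máma' mirrors the example above.
token2id = {"ema": 0, "má": 1, "máma": 2}

once = doc2bow_sketch(token2id, ["this", "is", "máma"])
twice = doc2bow_sketch(token2id, ["máma", "má", "máma"])
print(once)
print(twice)
```

The count depends only on the document passed in, not on how often the word appeared in the documents used to build the mapping.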
Important information: for proper integration of Bounter, we need to create wheels for all platforms (in the same way as for gensim), because it will be a "core dependency".
I am still trying to find a proper place to apply Bounter.
@aneesh-joshi internal dictionaries in
Multiple models in gensim do a full corpus scan as their first step, to get the frequencies / counts of tokens, bigrams etc: word2vec, doc2vec, tfidf, phrases, make_wiki...
This step can require a lot of memory and be slow, because it's typically not parallelized.
Replace all such scanning by Bounter. Let users specify how much memory they want to dedicate in an optional parameter, with some sane default like "1 GB" or "0.25 * total RAM" or something.
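A rough sketch of what such a memory-capped scan could look like. The helper names `make_counter` and `scan_vocab` are hypothetical and not part of gensim; the only Bounter calls assumed are `bounter(size_mb=...)`, `update()` and item lookup. The fallback to `collections.Counter` is just there so the sketch runs where Bounter is not installed.

```python
from collections import Counter


def make_counter(size_mb=1024):
    """Return a Bounter counter capped at size_mb, or an exact Counter as a fallback."""
    try:
        from bounter import bounter
    except ImportError:
        # Fallback only: exact counts, but memory is NOT bounded.
        return Counter()
    return bounter(size_mb=size_mb)


def scan_vocab(corpus, size_mb=1024):
    """Single pass over an iterable of tokenized documents, counting tokens."""
    counts = make_counter(size_mb)
    for document in corpus:
        counts.update(document)
    return counts


counts = scan_vocab([["a", "few", "words"], ["a", "few", "times"]], size_mb=64)
print(counts["few"])
```

A model's vocabulary-building step could then take `size_mb` as an optional parameter and pass it through, which is the user-facing knob proposed above.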
This is also a good place to revisit which algorithms need only the `counts['abc']` functionality, versus full `keys()` / `items()` iteration. The current implementations of counting in gensim are probably unnecessarily demanding (use key iteration), because there's no difference for dict or Counter as they support both operations. But there is a significant difference for Bounter: counts-only is more efficient than counts-and-iteration. So unless a counting algorithm in gensim really needs the keys, we should rewrite it using only `bounter(need_iteration=False)`.
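To illustrate why counts-only can be cheaper, here is a toy count-min sketch in pure Python. This is an assumption-laden illustration, not Bounter's code: as far as I understand its README, the counts-only mode uses a structure of this family, but the class name `TinyCountMinSketch`, the table sizes and the hashing scheme below are all made up for the example.

```python
import hashlib


class TinyCountMinSketch:
    """Toy counts-only structure: fixed-size table, keys are never stored."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One salted hash per row, so each row maps the key independently.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf8"), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def update(self, tokens):
        for token in tokens:
            for row, col in self._buckets(token):
                self.table[row][col] += 1

    def __getitem__(self, key):
        # Collisions can only inflate a cell, so the minimum is the best estimate
        # and never falls below the true count.
        return min(self.table[row][col] for row, col in self._buckets(key))


sketch = TinyCountMinSketch()
sketch.update(["a", "b", "a", "c", "a", "b"])
print(sketch["a"])  # at least 3; never an underestimate
```

Memory is fixed by `width * depth` regardless of vocabulary size, and there is deliberately no `keys()` or `items()`: the tokens themselves are gone, which is exactly the trade-off between counts-only and counts-and-iteration.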