Use Bounter for approx frequency counting #1654

piskvorky · 2017-10-25T20:32:50Z

Multiple models in gensim do a full corpus scan as their first step, to get the frequencies / counts of tokens, bigrams etc: word2vec, doc2vec, tfidf, phrases, make_wiki...

This step can require a lot of memory and be slow, because it's typically not parallelized.

Replace all such scanning by Bounter. Let users specify how much memory they want to dedicate in an optional parameter, with some sane default like "1 GB" or "0.25 * total RAM" or something.
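As an illustration of the optional memory parameter, here is a minimal sketch of how such a default could be resolved. The helper name `resolve_count_budget_mb`, its arguments, and the 1024 MB fallback are hypothetical and are not part of gensim or Bounter. The total-RAM lookup relies on `os.sysconf`, which is only available on POSIX-like systems, so the sketch falls back to the fixed default elsewhere.

```python
import os


def resolve_count_budget_mb(size_mb=None, ram_fraction=0.25, fallback_mb=1024):
    """Return the memory budget (in MB) to dedicate to frequency counting.

    Hypothetical helper: an explicit `size_mb` wins; otherwise use a fraction
    of total RAM, or `fallback_mb` when total RAM cannot be determined.
    """
    if size_mb is not None:
        if size_mb <= 0:
            raise ValueError("size_mb must be positive")
        return int(size_mb)
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing (e.g. Windows) or the names are unsupported.
        return fallback_mb
    if page_size <= 0 or page_count <= 0:
        return fallback_mb
    total_mb = (page_size * page_count) // (1024 * 1024)
    return max(1, int(total_mb * ram_fraction))


explicit_budget = resolve_count_budget_mb(512)  # user-specified budget
default_budget = resolve_count_budget_mb()      # 0.25 * total RAM, or the fallback
```

The resulting number could then be handed to the counter as its size limit in megabytes.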

This is also a good place to revisit which algorithms need only the counts['abc'] functionality, versus full keys()/items() iteration. The current counting implementations in gensim are probably more demanding than necessary (they use key iteration), because with dict or Counter there is no cost difference: both support both operations.

But there is a significant difference for Bounter: counts-only is more efficient than counts-and-iteration. So unless a counting algorithm in gensim really needs the keys, we should rewrite it using only bounter(need_iteration=False).
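To make the counts-only versus counts-and-iteration distinction concrete, below is a toy count-min sketch in pure Python. It is only an illustration of why a counts-only structure can run in fixed memory: it never stores the keys, so `counts['abc']` works but `keys()`/`items()` cannot exist. The class name, the `width`/`depth` defaults, and the hashing scheme are assumptions for this sketch, not Bounter's actual implementation.

```python
import hashlib


class CountMinSketch:
    """Toy fixed-memory counter: supports lookups, but not key iteration."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        # depth x width table of counters; memory does not grow with vocabulary.
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        """Yield one (row, column) cell per row, using a differently salted hash."""
        data = key.encode("utf8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(2, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def update(self, tokens):
        """Increment the counters of every token in an iterable."""
        for token in tokens:
            for row, col in self._buckets(token):
                self.table[row][col] += 1

    def __getitem__(self, key):
        # Hash collisions can only inflate a cell, so the minimum across rows
        # is an estimate that is never below the true count.
        return min(self.table[row][col] for row, col in self._buckets(key))


counts = CountMinSketch()
counts.update("máma mele maso ema má máma".split())
estimate = counts["máma"]  # approximate count, at least the true value of 2
```

An algorithm that only ever reads `counts[token]` can switch to such a structure unchanged; one that loops over `keys()` or `items()` cannot, which is why it is worth auditing each counting step.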

aneesh-joshi · 2018-03-05T18:04:43Z

Hey @menshikh-iv @piskvorky
I would like to take this up.
(I hope it doesn't require any GPU support, etc.)

Could you direct me to a good point to start?
I am familiar with word2vec.py

menshikh-iv · 2018-03-06T03:26:39Z

Hello @aneesh-joshi,

This doesn't require GPU support, don't worry.
Start with a simpler case: gensim.corpora.Dictionary looks like the simplest one. Look into its internal structures and check where Bounter can be useful.

aneesh-joshi · 2018-03-07T07:47:29Z

hey @menshikh-iv

I am trying to apply bounter to gensim.corpora.dictionary and had some doubts

https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/corpora/dictionary.py#L203

implies that it will return (index, frequency) pairs

But the example shows
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/corpora/dictionary.py#L227

frequency as 1 although mama is there twice

Is this a documentation error or am I missing something?

Also, I cannot understand where I should implement bounter. Could you give me a specific idea?

Thanks!

menshikh-iv · 2018-03-07T07:50:32Z

@aneesh-joshi this is the frequency in the "current" document (the one passed to doc2bow), not in the whole corpus, i.e.

>>> dct.doc2bow(["this","is","máma"])  # word 'máma' mentioned only once
[(2, 1)]
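The per-document versus corpus-wide distinction can be reproduced with a simplified pure-Python stand-in. `TinyDictionary` is hypothetical and is not gensim's code. It assumes new tokens get ids in sorted order within each document, which is what makes 'máma' come out as id 2 here.

```python
from collections import Counter


class TinyDictionary:
    """Hypothetical, simplified stand-in for gensim.corpora.Dictionary."""

    def __init__(self, documents):
        self.token2id = {}
        for doc in documents:
            # Assumption: unseen tokens receive ids in sorted order per document.
            for token in sorted(set(doc)):
                self.token2id.setdefault(token, len(self.token2id))

    def doc2bow(self, document):
        """Return (token_id, count) pairs counted within `document` only."""
        counts = Counter(t for t in document if t in self.token2id)
        return sorted((self.token2id[t], n) for t, n in counts.items())


dct = TinyDictionary(["máma mele maso".split(), "ema má máma".split()])

# 'máma' occurs twice in the corpus, but only once in this document.
once = dct.doc2bow(["this", "is", "máma"])

# The count changes only when the word repeats inside the same document.
twice = dct.doc2bow(["máma", "máma"])
```

Unknown words such as "this" and "is" are simply skipped, so the result contains a single pair for 'máma'.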

menshikh-iv · 2018-03-07T08:51:24Z

Important: for proper integration of Bounter, we need to build wheels for all platforms (the same way as for gensim), because it will become a core dependency.

aneesh-joshi · 2018-03-07T09:41:35Z

I am still trying to find a proper place to apply Bounter.
Once that's done, will start work on core dependencies

menshikh-iv · 2018-03-07T10:20:17Z

@aneesh-joshi internal dictionaries in Dictionary, vocab in w2v maybe, etc

piskvorky added the "performance" label (Issue related to performance, in HW meaning) Oct 25, 2017

menshikh-iv added the "feature" (Issue described a new feature) and "difficulty medium" (Medium issue: required good gensim understanding & python skills) labels Oct 26, 2017

menshikh-iv mentioned this issue Nov 17, 2017

Phrases multiprocessing #1141

Closed

This was referenced Oct 10, 2020

Phrases keeps learned vocabs as bytestring #2140

Closed

[MRG] Refactor phrases #2976

Merged

piskvorky mentioned this issue Apr 17, 2022

Freezing Trigram Phrase models yields inconsistent results #3326

Open
