Phrases multiprocessing #1141
Comments
Phrases are quite slow now, because they're written in pure Python. As such, word2vec training consumes its input faster than Phrases can supply it (the likely reason for the empty queues), so the word2vec training is "starved for data" and slower than it could be. Rewriting Phrases in Cython/C for performance is pretty easy and high on our TODO list, a much needed feature. If you want to help out with optimizing Phrases, @tmylk can direct you to the ticket / assist with the PR.
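A practical workaround in the meantime, sketched below: apply the phrase model once, cache the transformed corpus to disk, and train word2vec from the cached file, so the pure-Python phrase detection is no longer inside the training loop. This is only a sketch assuming the Phrases/Phraser API; the file name and toy corpus are illustrative.

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence

# Illustrative toy corpus; in practice, any restartable iterable of token lists.
sentences = [["new", "york", "is", "big"], ["new", "york", "city"]]

bigram = Phrases(sentences, min_count=1, threshold=1)  # learn bigram statistics
phraser = Phraser(bigram)                              # smaller, faster frozen model

# Apply phrase detection once and cache the result, so word2vec's workers
# are not starved by pure-Python phrase detection on every epoch.
with open("phrased_corpus.txt", "w", encoding="utf8") as fout:
    for sentence in sentences:
        fout.write(" ".join(phraser[sentence]) + "\n")

model = Word2Vec(LineSentence("phrased_corpus.txt"), workers=4, min_count=1)
```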
Hey, any update on this? I took a quick glance over the code and I see a couple of easy wins. The first would be to replace the defaultdict with a bounter, use cytoolz instead of functools and itertools, and cythonize the scoring function. I would imagine that one of you has already thought about this and is working on something more C or "Cythony". @tmylk @piskvorky @AaronWang30
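For reference, a minimal sketch of the proposed counting swap, assuming bounter's dict-like update/lookup API; the memory budget, delimiter, and toy corpus are illustrative:

```python
from bounter import bounter

# Bounded-memory counter (~256 MB) in place of an unbounded defaultdict(int).
counts = bounter(size_mb=256)

sentences = [["new", "york", "is", "big"], ["new", "york", "city"]]  # illustrative
for sentence in sentences:
    counts.update(sentence)  # unigram counts
    counts.update("_".join(pair) for pair in zip(sentence, sentence[1:]))  # bigram counts

print(counts["new_york"])  # approximate count of the candidate phrase
```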
@patcollis34 Unfortunately, no progress for the moment (our plans didn't work out). We also have a more global issue (not only about Phrases), #1654 (bounter integration in all "counting" places).
Cythonizing + bounter ought to be enough, and seems easy to do too (just a few lines of code). I think this would be a great simple intro PR. Only one question: do we keep support for "pure Python", i.e. if a person doesn't have a compiler, can they still install and use gensim? Or do we require a compiler (and provide binary wheels where possible)?
@piskvorky As for me, we should support "pure Python" only if it works fine (stable, fast enough), and Phrases is OK performance-wise (not very fast, but it's possible to run it on a big corpus). Ideally, we'd have universal wheels for all platforms; then we could safely remove all the pure-Python parts that have compiled replacements.
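For context, gensim's word2vec already answers this question with an optional compiled extension: it tries to import the fast routines and falls back to pure Python when no compiler was available at install time. Below is a sketch of the same pattern for Phrases; the compiled module name is a placeholder, not an existing gensim API, and the fallback scorer mirrors (from memory) the default bigram score, so treat the formula as an assumption:

```python
try:
    # Hypothetical compiled extension, built only when a C compiler is available.
    from gensim.models.phrases_inner import original_scorer  # placeholder module name
    FAST_VERSION = 1
except ImportError:
    FAST_VERSION = -1  # signals that the slow pure-Python path is in use

    def original_scorer(worda_count, wordb_count, bigram_count,
                        len_vocab, min_count, corpus_word_count):
        # Pure-Python fallback with the same signature and behaviour.
        return (bigram_count - min_count) / worda_count / wordb_count * len_vocab
```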
Hey. I'd also be interested in speeding up the phraser.
I've checked the contribution guidelines, but any additional "handholding" in terms of moving this issue forward would be appreciated.
@mleue I think Cython is a great step. I wouldn't worry about parallelization (which is hard to do in Python), but see also #2048. Regarding Bounter: it helps with speed too (a little faster than dict), but yes, its lower memory footprint is the main advantage. Simply rewriting the slow parts of Phrases in Cython sounds like an awesome first step. Easy wins first :)
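To find the "slow parts" worth moving to Cython, a quick profile of vocabulary learning is usually enough; a rough sketch (toy corpus and parameters are illustrative):

```python
import cProfile
import pstats

from gensim.models.phrases import Phrases

# Illustrative corpus; in practice stream a real iterable of token lists.
sentences = [["machine", "learning", "is", "fun", "machine", "learning"]] * 100_000

profiler = cProfile.Profile()
profiler.enable()
Phrases(sentences, min_count=5, threshold=10.0)
profiler.disable()

# The functions dominating cumulative time are the Cython candidates.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```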
The reason I gave up on this is that it's waiting on #1477, the refactor of the base corpus structure, in order to allow better/more interesting sampling methods across cores and to solve the other bottleneck, .any2utf8. Do you agree with this, @piskvorky? But the bounter/Cython work should be helpful as well. @mleue
Curious about the status of this. I have an application that would greatly benefit from multithreaded Phrases.
Still open, and a welcome addition! I cleaned up the Phrases module in Gensim 4.0.0, but that wasn't related to performance optimizations. Seeing @patcollis34's comment above, I wouldn't worry about or wait on #1477. Implementing Cython / Bounter into Phrases would be beneficial on its own, assuming the resulting speed-up warrants the added complexity.
I can't find docs on how to get permission to push to a branch, which would make this suggestion/diff more readable, but I'm thinking you could delete all the .get calls in the _learn_vocab for loop, have the function return only the joined words, and then rely on the stdlib Counter to do the counting for you. Sorry for the noob screenshots. Thoughts? @piskvorky
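To make that suggestion concrete without the screenshots, here is roughly how it reads to me; a sketch only, not the actual _learn_vocab code, with the delimiter and min_count handling left out:

```python
from collections import Counter

def iter_unigrams_and_bigrams(sentences, delimiter="_"):
    """Yield every unigram and every joined candidate bigram, with no dict lookups."""
    for sentence in sentences:
        yield from sentence
        yield from (delimiter.join(pair) for pair in zip(sentence, sentence[1:]))

def learn_vocab(sentences):
    # The stdlib Counter (with its C-accelerated counting loop) replaces
    # the per-token .get() calls in the current Python loop.
    return Counter(iter_unigrams_and_bigrams(sentences))

vocab = learn_vocab([["new", "york", "is", "big"], ["new", "york", "city"]])
print(vocab["new_york"])  # 2
```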
Why, what would that give us?
Thanks, a 4x speed-up sounds promising. Can you open a PR to explore your ideas? I looked up the PR where I cleaned up the Phrases code for Gensim 4.0; it included some timings too: #2976 (comment). So let's try your PR on the same corpus (text8), for easy comparison.
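For anyone who wants to reproduce the comparison locally, a rough timing harness along these lines should do; it assumes the text8 dataset is available through gensim's downloader and simply times phrase-vocabulary learning end to end:

```python
import time

import gensim.downloader as api
from gensim.models.phrases import Phrases

text8 = list(api.load("text8"))  # ~17M tokens, already split into token lists

start = time.time()
phrases = Phrases(text8, min_count=5, threshold=10.0)
elapsed = time.time() - start

print(f"Phrases training took {elapsed:.1f}s ({len(phrases.vocab)} vocab entries)")
```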
Is that supported? When I run word2vec with Phrases, the speed is quite slow and I get log lines like
2017-02-10 22:26:42,404: INFO: PROGRESS: at 0.23% examples, 156666 words/s, in_qsize 0, out_qsize 0
where in_qsize and out_qsize are always 0 or 1. Is that normal?