Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of Python implementation of fastText, word2vec, doc2vec #2511

Closed
mpenkov opened this issue May 30, 2019 · 3 comments · Fixed by #2630
Closed

Get rid of Python implementation of fastText, word2vec, doc2vec #2511

mpenkov opened this issue May 30, 2019 · 3 comments · Fixed by #2630
Assignees
Labels
difficulty easy Easy issue: required small fix feature Issue described a new feature

Comments

@mpenkov
Copy link
Collaborator

mpenkov commented May 30, 2019

We have Cython implementations for the above models, so that we fall back to Python when Cython is not available.

This is more hassle than it's worth, because:

  1. The Python versions are too slow to be practical
  2. We have to expend additional effort to test the Python versions (leading to bugs like this: FastText SkipGram Implementation Broken since 3.7.2 #2508 (comment))
  3. The Cython implementations are available most of the time. For people who don't have a compiler, we offer wheels.
@piskvorky piskvorky added difficulty easy Easy issue: required small fix feature Issue described a new feature labels Jun 24, 2019
@mpenkov mpenkov self-assigned this Sep 28, 2019
mpenkov added a commit that referenced this issue Oct 25, 2019
* Remove native Python implementations of Cython extensions

Fix #2511

* remove print statement in tox.ini

* remove print statement in tox.ini

* fix flake8 issues

* fix missing imports

* adjust exception message

* bring back FAST_VERSION variable

* fixup: missing parens

* disable progress bar for tox

* respond to review comments

* remove C/C++ sources generated from Cython files

* update setup.py

* remove duplicate line in setup.py

* fix numpy bootstrapping

* update tox.ini

* handle cython dependency in setup.py

* fixup in setup.py: lowercase c

* more cython sourcery

* fix tox.ini

* Fix merge artifact in setup.py

* fix merge artifact

* disable pip progress bar under CircleCI
@gojomo
Copy link
Collaborator

gojomo commented Oct 30, 2019

Looking at just word2vec.py, there's still at least two functions that were only used by the pure-Python path remaining: train_cbow_pair() & train_sg_pair().

@gojomo gojomo reopened this Oct 30, 2019
@amirouche
Copy link

What about PyPy?

@gojomo
Copy link
Collaborator

gojomo commented Oct 31, 2019

@amirouche I doubt the old pure-Python code worked very well under PyPy, but would be interested to learn if you (or anyone) was using it there & getting acceptable performance. My understanding is the Cython optimized code can work with PyPy, if the C code is recompiled there.

So my conjecture would be any PyPy users getting acceptable performance from gensim's "*2Vec" code were already doing that, and can still do that after this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix feature Issue described a new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants