Restructure TextCorpus code to share multiprocessing and preprocessing logic. #1477

Open
macks22 opened this issue Jul 9, 2017 · 1 comment
Labels
difficulty hard (Hard issue: required deep gensim understanding & high python/cython skills)
feature (Issue described a new feature)

Comments

@macks22
Contributor

macks22 commented Jul 9, 2017

Description

This is a follow-up to a conversation that came up around the addition of the TextDirectoryCorpus (see #1387). In the discussion on that ticket, @piskvorky mentioned that there is no clear plan for the textcorpus code. It was originally meant to be just example code for others to work from. However, it seems many people (and gensim tutorial writers) need common text processing corpora. I did some analysis (discussed in #1387) and found several text corpus classes scattered throughout the code: TextCorpus, TextDirectoryCorpus, BrownCorpus, WikiCorpus, LineSentence, and Text8Corpus.

I propose combining the shared logic of these corpora in the textcorpus module. The WikiCorpus can continue to live in the wikicorpus module, since its preprocessing is so specific to the wiki markup. The others, however, should be moved to the textcorpus module. Once ... is completed, the BrownCorpus would be a good candidate to move to some sort of datasets subpackage. For now, though, textcorpus is a better home for it than the word2vec module where it currently resides.
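To make the proposal concrete, here is a rough sketch of what "shared logic" could look like: one base class owns the preprocessing, and each corpus only overrides how it reads raw text. This is not gensim's actual code. `BaseTextCorpus`, `LineCorpus`, `InMemoryCorpus`, and the `get_raw_texts` hook are hypothetical names, and the regex tokenizer is a stand-in for real preprocessing.

```python
import re
from typing import Iterable, Iterator, List


class BaseTextCorpus:
    """Hypothetical base class that owns the preprocessing shared by all corpora."""

    TOKEN_RE = re.compile(r"[a-z]+")

    def __init__(self, min_len: int = 2) -> None:
        self.min_len = min_len

    def get_raw_texts(self) -> Iterable[str]:
        """Subclasses override this to yield one raw document (a string) at a time."""
        raise NotImplementedError

    def preprocess(self, text: str) -> List[str]:
        """Shared step: lowercase, tokenize, and drop tokens shorter than min_len."""
        tokens = self.TOKEN_RE.findall(text.lower())
        return [t for t in tokens if len(t) >= self.min_len]

    def __iter__(self) -> Iterator[List[str]]:
        # Every subclass gets the same streaming + preprocessing behavior.
        for text in self.get_raw_texts():
            yield self.preprocess(text)


class LineCorpus(BaseTextCorpus):
    """Hypothetical corpus: one document per line of a text file."""

    def __init__(self, path: str, **kwargs) -> None:
        super().__init__(**kwargs)
        self.path = path

    def get_raw_texts(self) -> Iterable[str]:
        with open(self.path, encoding="utf-8") as fin:
            for line in fin:
                yield line


class InMemoryCorpus(BaseTextCorpus):
    """Hypothetical corpus: documents already held in a list of strings."""

    def __init__(self, docs: Iterable[str], **kwargs) -> None:
        super().__init__(**kwargs)
        self.docs = list(docs)

    def get_raw_texts(self) -> Iterable[str]:
        return iter(self.docs)


corpus = InMemoryCorpus(["The quick brown Fox", "A b cd"])
print(list(corpus))  # → [['the', 'quick', 'brown', 'fox'], ['cd']]
```

With this shape, a corpus with very specific preprocessing (as described for WikiCorpus) could still override `preprocess` while reusing the iteration logic.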

@piskvorky
Owner

Makes sense, thanks!

To me, it's more about sharing logic and making things modular and easier to discover than about where the implementation lives (which module). We should import all the classes in corpora.__init__ anyway (so that from gensim.corpora import WikiCorpus, LineSentence, BrownCorpus... works). The particular module that holds the actual implementation is not that important, except as a blueprint example to point people to for their own custom extensions.
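A small sketch of that re-export pattern, using a throwaway package built in a temp directory so it runs on its own. The package name `mycorpora` and its empty classes are hypothetical stand-ins, not gensim's real layout; the point is only that callers import from the package while the implementation lives in a submodule.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk ("mycorpora" is a hypothetical name).
pkg_root = Path(tempfile.mkdtemp())
pkg = pkg_root / "mycorpora"
pkg.mkdir()

# The implementation lives in a submodule...
(pkg / "textcorpus.py").write_text(
    "class LineSentence:\n"
    "    pass\n"
    "\n"
    "class BrownCorpus:\n"
    "    pass\n",
    encoding="utf-8",
)

# ...and the package __init__ re-exports the classes for discoverability.
(pkg / "__init__.py").write_text(
    "from .textcorpus import LineSentence, BrownCorpus\n",
    encoding="utf-8",
)

sys.path.insert(0, str(pkg_root))
importlib.invalidate_caches()

# Callers never need to know which submodule holds the class.
from mycorpora import LineSentence, BrownCorpus

print(LineSentence.__module__)  # → mycorpora.textcorpus
```

Moving a class to a different submodule later would then only require changing the import line in `__init__.py`, not user code.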

@menshikh-iv added the feature and difficulty hard labels on Oct 2, 2017