Restructure TextCorpus code to share multiprocessing and preprocessing logic. #1477
Labels
difficulty hard
feature
Description
This is a follow-up from a conversation that came up around the addition of the `TextDirectoryCorpus` (see #1387). As part of the discussion around that ticket, @piskvorky mentioned that there is no clear plan for the `textcorpus` code. It was originally meant to be just example code for others to work from. However, it seems many people (and gensim tutorial writers) need common text processing corpora. I did some analysis (discussed in #1387) and found that there are several text corpus classes scattered in various places throughout the code. These include `TextCorpus`, `TextDirectoryCorpus`, `BrownCorpus`, `WikiCorpus`, `LineSentence`, and `Text8Corpus`.

I propose combining the shared logic of these various corpora into the `textcorpus` module. The `WikiCorpus` can continue to live in the `wikicorpus` module, since its preprocessing is so specific to the wiki markup. However, the others should be moved to the `textcorpus` module. Once ... is completed, the `BrownCorpus` would be a good candidate to move to some sort of `datasets` subpackage. But at least for now, `textcorpus` is a better home for it than the `word2vec` module where it currently resides.