Restructure TextCorpus code to share multiprocessing and preprocessing logic. #1477

macks22 · 2017-07-09T20:33:09Z

Description

This is a follow-up to a conversation that came up around the addition of the TextDirectoryCorpus (see #1387). As part of the discussion on that ticket, @piskvorky mentioned that there is no clear plan for the textcorpus code. It was originally meant to be just example code for others to work from. However, it seems many people (and gensim tutorial writers) need common text-processing corpora. I did some analysis (discussed in #1387) and found several text corpus classes scattered throughout the code: TextCorpus, TextDirectoryCorpus, BrownCorpus, WikiCorpus, LineSentence, and Text8Corpus.

I propose combining the shared logic of these various corpora into the textcorpus module. The WikiCorpus can continue to live in the wikicorpus module, since its preprocessing is so specific to the wiki markup. However, the others should be moved to the textcorpus module. Once ... is completed, the BrownCorpus would be a good candidate to move to some sort of datasets subpackage. But at least for now, textcorpus is a better home for it than the word2vec module where it currently resides.
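To make the proposal concrete, here is a minimal sketch of what sharing the preprocessing logic could look like. The names BaseTextCorpus, LineCorpus, DirectoryCorpus, get_stream, and preprocess are hypothetical and chosen for illustration only; they are not gensim's actual API, and the tokenization is deliberately simplistic.

```python
import os


class BaseTextCorpus:
    """Hypothetical shared base class (an assumption, not gensim's TextCorpus)."""

    def __init__(self, source, min_len=2):
        self.source = source    # a file path or directory path, depending on subclass
        self.min_len = min_len  # tokens shorter than this are dropped

    def get_stream(self):
        """Yield raw text documents; each subclass decides where they come from."""
        raise NotImplementedError

    def preprocess(self, text):
        """Shared preprocessing: lowercase, split on whitespace, drop short tokens."""
        return [tok for tok in text.lower().split() if len(tok) >= self.min_len]

    def __iter__(self):
        # Every subclass gets the same preprocessing without reimplementing it.
        for text in self.get_stream():
            yield self.preprocess(text)


class LineCorpus(BaseTextCorpus):
    """One document per line of a single file."""

    def get_stream(self):
        with open(self.source, encoding="utf-8") as handle:
            for line in handle:
                yield line


class DirectoryCorpus(BaseTextCorpus):
    """One document per file in a directory (non-recursive)."""

    def get_stream(self):
        for name in sorted(os.listdir(self.source)):
            path = os.path.join(self.source, name)
            if os.path.isfile(path):
                with open(path, encoding="utf-8") as handle:
                    yield handle.read()
```

With this layout, a corpus with very specific parsing needs (such as the wiki markup handling) would only override get_stream or preprocess, and any shared multiprocessing logic could likewise live once in the base class.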


piskvorky · 2017-07-10T05:49:10Z

Makes sense, thanks!

To me, it's more about sharing logic and making things modular and easier to discover than about where the implementation lives (which module). We should import all the classes from corpora.__init__ anyway (so that from gensim.corpora import WikiCorpus, LineSentence, BrownCorpus... works). The particular module name holding the actual implementation is not that important, except as something to point to as a blueprint for people's own custom extensions.

macks22 mentioned this issue Jul 9, 2017

Restructure TextCorpus code to share multiprocessing and preprocessing logic. #1478

Closed

menshikh-iv added the feature and difficulty hard labels Oct 2, 2017

patcollis34 mentioned this issue May 21, 2018

Phrases multiprocessing #1141

Closed
