Restructure TextCorpus code to share multiprocessing and preprocessing logic. #1478
Conversation
@piskvorky @menshikh-iv I believe the build failures on this only have to do with importing in the main guard.
I tested out the WikiCorpus before and after on the full Wikipedia corpus. I did this by building the Dictionary. My goal was to ensure the speed is comparable now to what was implemented before.
The speed is comparable. This implementation also performs deaccenting and stopword removal, which I suspect is why it takes a few minutes longer. The difference in the number of documents comes from the removal of stopwords, which results in many more empty documents being pruned. I think the difference in the number of terms is due to the Dictionary pruning encountering different documents.
A quick shallow scan of coding style; I didn't have time to verify the actual logic.
self.init_state(state_kwargs)

def init_state(self, state_kwargs):
    for name, value in state_kwargs.items():
Why not simply self.__dict__.update?
good suggestion; done
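For illustration, a minimal sketch of the suggested simplification; the class and attribute names here are hypothetical, not gensim's actual code:

```python
class Worker:
    """Hypothetical stand-in for a pool worker holding preprocessing state."""

    def __init__(self, **state_kwargs):
        self.init_state(state_kwargs)

    def init_state(self, state_kwargs):
        # One call replaces the per-item setattr loop.
        self.__dict__.update(state_kwargs)


w = Worker(tokenizer=str.split, lowercase=True)
print(w.lowercase)  # True
```

Since `setattr(self, name, value)` and a `__dict__.update` are equivalent for plain instance attributes, the update form is both shorter and slightly faster.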
@@ -0,0 +1,160 @@
import multiprocessing as mp
Missing file header.
added file header
gensim/corpora/textcorpus.py
Outdated
# So just split the token sequence arbitrarily into sentences of length
# `max_sentence_length`.
sentence, rest = [], b''
with utils.smart_open(self.source) as fin:
Best to open files in binary mode (rb), and convert to text explicitly where needed.
The default is 'rb', but I updated to set mode explicitly to future-proof.
gensim/corpora/textcorpus.py
Outdated
break

last_token = text.rfind(b' ')  # last token may have been split in two... keep for next iteration
words, rest = (text[:last_token].split(),
No vertical indent -- please use hanging indent.
done
gensim/corpora/wikicorpus.py
Outdated
# no need to lowercase and unicode, because the tokenizer already does that.
character_filters = [textcorpus.deaccent, textcorpus.strip_multiple_whitespaces]
super(WikiCorpus, self).__init__(source, dictionary, metadata, character_filters, tokenizer,
                                 token_filters, processes)
No vertical indent.
done
gensim/test/test_corpora.py
Outdated
@@ -20,7 +19,7 @@
import numpy as np

from gensim.corpora import (bleicorpus, mmcorpus, lowcorpus, svmlightcorpus,
-                            ucicorpus, malletcorpus, textcorpus, indexedcorpus)
+                            ucicorpus, malletcorpus, indexedcorpus)
No vertical indent please.
done
gensim/test/test_textcorpus.py
Outdated
def test_texts_file():
    fpath = os.path.join(tempfile.gettempdir(), 'gensim_corpus.tst')
    with open(fpath, 'w') as f:
smart_open + binary mode please.
this function was actually not being used, so I just removed it
gensim/test/test_textcorpus.py
Outdated
def corpus_from_lines(self, lines):
    fpath = tempfile.mktemp()
    with codecs.open(fpath, 'w', encoding='utf8') as f:
Ditto.
changed mode to 'wb'
gensim/utils.py
Outdated
@@ -1263,3 +1269,45 @@ def _iter_windows(document, window_size, copy=False, ignore_below_size=True):
    else:
        for doc_window in doc_windows:
            yield doc_window.copy() if copy else doc_window


def walk_with_depth(top, topdown=True, onerror=None, followlinks=False, depth=0):
Is this really needed? The depth can be deduced easily from a normal walk(), by comparing the root directories.
This is a very good point; I replaced this with a wrapper on os.walk that just deduces the depth in the manner you suggested.
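A sketch of what such a wrapper could look like; the exact signature in the PR may differ:

```python
import os


def walk_with_depth(top, **kwargs):
    """Yield (depth, dirpath, dirnames, filenames) for each directory under `top`,
    deducing the depth by comparing each dirpath against the absolute root."""
    top = os.path.abspath(top)
    for dirpath, dirnames, filenames in os.walk(top, **kwargs):
        rel = os.path.relpath(dirpath, top)
        # the root itself is depth 0; each path separator below it adds one level
        depth = 0 if rel == '.' else rel.count(os.sep) + 1
        yield depth, dirpath, dirnames, filenames
```

Using os.path.relpath instead of string replacement also sidesteps the unnormalized-path concerns raised later in this thread.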
This looks like a massive PR; are the changes 100% backward compatible? If not, what is the upgrade plan for users, i.e. how do they modify their existing code so it continues to work?
util.debug('worker exiting after %d tasks' % completed)


class _PatchedPool(mp.pool.Pool):
What is this for, what is being patched (and why)?
Needs a clear comment.
I've added documentation throughout this module to clarify. Some added context: when I initially implemented this refactor, I was serializing the token_filters, tokenizer, and character_filters used in TextCorpus for text preprocessing. This pickling overhead was causing a significant slowdown. So I wanted to include them in each worker process at startup to speed it up. Doing so ruled out the use of the builtin multiprocessing.Pool class.
Rather than write a complicated custom pool, I decided that reuse via patching of the existing pool would be more robust and probably useful elsewhere in the code (for instance, in the text_analysis module used by the probability_estimation module). That is why this module came about.
Aha, thanks, that's interesting. @gojomo @menshikh-iv can you please review this extended multiprocessing logic?
I'm curious whether others do it this way too, since this seems a very common use-case.
""" | ||
for i in range(self._processes - len(self._pool)): | ||
w = self.Process(args=(self._inqueue, self._outqueue, | ||
self._initializer, |
No vertical indent in gensim.
done
util.debug('added worker')


class TextProcessingPool(object):
What is this wrapper for? Why not use the default Pool?
I'd prefer to stick to built-ins, unless absolutely necessary. And if absolutely necessary, we'll need better documentation describing the rationale.
Added documentation with rationale. See the comment on _PatchedPool above for additional context.
@piskvorky I've addressed your review comments; thank you for the quick feedback! If I can add anything else to make your review of the logic easier or otherwise clarify things, I will gladly do so.
gensim/utils.py
Outdated
path = os.path.abspath(top)
for dirpath, dirnames, filenames in os.walk(path, topdown, onerror, followlinks):
    sub_path = dirpath.replace(path, '')
    depth = sub_path.count(os.sep)
Is this safe even with unnormalized paths (/tmp vs /tmp/ etc.)? How does walk handle symlinks?
os.path.relpath / commonprefix may be safer, I'm not sure.
docs say this on symlinks:
By default, os.walk does not follow symbolic links to subdirectories on
systems that support them. In order to get this functionality, set the
optional argument 'followlinks' to true.
I did look at both of the functions you referenced (which I was not familiar with), but I believe the current code handles unnormalized paths correctly. I've added tests to verify this.
…logic from `WikiCorpus`.
…ove tests for text corpora classes to `test_textcorpus` module.
…that serializes all preprocessing functions once on initialization and then only passes the documents to the workers and the tokens back to the master.
…`textcorpus.walk` to `walk_with_depth` and move to `utils` module. Update tests and other referencing modules to adjust to the moves, resolving some circular references that arose in the process.
…r to provide multiprocessing and additional preprocessing options.
…agical ways. Also, adjust `LineSentence` default kwargs to use single process and allow other preprocessing options.
…om `TextDirectoryCorpus`.
…ove tests for text corpora classes to `test_textcorpus` module.
…that serializes all preprocessing functions once on initialization and then only passes the documents to the workers and the tokens back to the master.
…r to provide multiprocessing and additional preprocessing options.
6c12b54 to 2db0aaa
@piskvorky I believe this is fully backwards-compatible in terms of interfaces. The only thing I expect will be different is the default preprocessing. Also, I have updated the PR to address your most recent comments; thank you for your review. I believe you'd asked for thoughts from @gojomo and @menshikh-iv regarding the modified multiprocessing pool; I'm also curious to know if the approach I took here has been used elsewhere and if any alternative approaches might be more suitable for this problem. Thanks!
path = os.path.abspath(top)
for dirpath, dirnames, filenames in os.walk(path, topdown, onerror, followlinks):
    sub_path = dirpath.replace(path, '')
    depth = sub_path.count(os.sep)
This construct still makes me a little uneasy. Can we at least os.path.normpath, to get rid of any double/trailing/leading slashes? Or does os.walk normalize the dirpath somehow? Although in that case, we'd have to normalize path and dirpath in exactly the same way, so that the .replace() above works.
Ah, I see your concern. I think os.path.abspath (called as the first line of that function) handles the situation you're worried about:
In [4]: os.path.abspath('/test/path/')
Out[4]: '/test/path'
In [5]: os.path.abspath('/test/path')
Out[5]: '/test/path'
In [6]: os.path.abspath('/test/path//')
Out[6]: '/test/path'
What if walk hits a symlinked dir -- does it return dirpath as a canonical path (de-sym-linked), or is path still its prefix?
tree tmp
tmp
├── subdir
│ └── test
└── symlink -> subdir
2 directories, 1 file
In [56]: list(os.walk('tmp'))
Out[56]: [('tmp', ['subdir', 'symlink'], []), ('tmp/subdir', [], ['test'])]
In [58]: list(os.walk('tmp', followlinks=True))
Out[58]:
[('tmp', ['subdir', 'symlink'], []),
('tmp/subdir', [], ['test']),
('tmp/symlink', [], ['test'])]
@macks22 thanks!
Does the new code support custom tokenization / text normalization? That sounds really useful. Same defaults (backward compatibility), but allow injecting your own function to normalize and tokenize a text. We had a recent ticket where a Thai user complained our wiki processing returns rubbish. Which is 100% true -- not only do/did we not support custom text processing, we didn't even notice where our hardwired processing didn't make sense, and happily produced garbage output without any error/warning.
…dd the `tokenizer` argument to allow users to override the default lemmatizer/tokenizer functions.
@piskvorky I had mainly made those changes so the preprocessing defaults would be as close to the default for the
Nice! Should
@macks22 is there a way to reach you privately (email)? Please ping me at radim@rare-technologies.com.
…g within `TextCorpus`. Update docstring for `TextCorpus` for new parameters. Convert `PathLineSentences` to a `TextDirectoryCorpus` subclass and adjust the tests to account for this.
@piskvorky Updated to improve the docstrings.
gensim/corpora/textcorpus.py
Outdated
logger.debug("sorting filepaths") | ||
paths = list(paths) | ||
paths.sort(key=lambda path: os.path.basename(path)) | ||
logger.debug("found {} files: {}".format(len(paths), paths)) |
The rest of the code uses C-style formatting -- best keep it consistent.
The intention here was to get the auto-formatting for the list of paths, as opposed to having to do my own '[' + ', '.join(paths) + ']', which seemed much messier. Should I still change it to do this instead?
All these formatting alternatives should work identically (call str/repr on their arguments), so I'm not sure what you mean. Are you seeing a difference?
One advantage of a C-style format is that the argument types will be immediately apparent to the reader (%d and %s or %r in this case).
Unrelated: the arguments should be passed to logger.debug as arguments, to avoid formatting the string in case the message is not emitted by logging (doesn't pass the log level threshold etc). We want to leave the string formatting (which can sometimes be expensive) for the last moment possible.
Ah, I see; I simply wasn't aware of '%r' as an option to get the repr. Updated to use the C-style formatting.
gensim/corpora/textcorpus.py
Outdated
logging.info('files read into PathLineSentences:' + '\n'.join(self.input_files))
logger.debug("finished reading %d files", num_files)
Why not info?
token_filters (iterable of callable): each will be applied to the iterable of tokens
    in order, and should return another iterable of tokens. These filters can add,
    remove, or replace tokens, or do nothing at all. The default token filters
    remove tokens less than 3 characters long and remove stopwords using the list
    in `gensim.parsing.preprocessing.STOPWORDS`.
processes (int): number of processes to use for text preprocessing. The default is
    -1, which will use (number of virtual CPUs - 1) worker processes, in addition
What happens when number of virtual CPUs == 1?
No worker pool is used; all preprocessing occurs in the master process. I've updated the docstring to inform on this.
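A hypothetical sketch of how that resolution of the `processes` argument might look; the function name is made up and this is not the PR's exact code:

```python
import multiprocessing as mp


def resolve_worker_count(processes=-1):
    """Resolve the `processes` argument into a worker count.

    -1 means "number of virtual CPUs - 1" worker processes. A result of 0
    (i.e. a single-CPU machine) means no pool is created and all
    preprocessing runs in the master process.
    """
    if processes == -1:
        processes = mp.cpu_count() - 1
    return processes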
gensim/corpora/textcorpus.py
Outdated
For Python 2, the original text will not be unicode, so it may be useful to
convert to unicode as the first character filter. The default character filters
lowercase, convert to unicode (strict utf8), perform ASCII-folding, then collapse
For Python 2, the original text will not be unicode (unless you modify your
I propose dropping the (unless ... bracket. This is already very complicated as it is.
Also, why single out Python 2? Is the behaviour different between Python 2 vs Python 3? If so, I'd consider that a bug.
Let's keep the API as simple as possible: getstream returns unicode (no matter the Python version); all filters expect unicode.
Done; I've put the unicode conversion in the master process. For the sake of speed, it may make sense to have getstream return bytes in all versions, move the encoding parameters to the workers, and have them do the unicode conversion. Based on the ongoing Phrases refactor, that seems to be more of a bottleneck than I would've expected. Despite these considerations, I think it is sensible to do it in the master for the sake of simplicity for now.
gensim/corpora/textcorpus.py
Outdated
For Python 2, the original text will not be unicode (unless you modify your
`getstream` method to convert it to unicode), so it may be useful to convert to
unicode as the first character filter. The default character filters lowercase,
convert to unicode (strict utf8), perform ASCII-folding, then collapse
Why lowercase before converting to unicode? Could lead to bugs for non-ASCII capitals.
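The concern can be demonstrated directly: bytes.lower() only maps ASCII A-Z, so lowercasing raw UTF-8 before decoding silently misses non-ASCII capitals.

```python
raw = 'É'.encode('utf8')  # b'\xc3\x89'

# Lowercasing the raw bytes is a no-op for non-ASCII characters...
assert raw.lower() == raw

# ...while decoding to unicode first lowercases correctly.
assert raw.decode('utf8').lower() == 'é'
```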
gensim/corpora/textcorpus.py
Outdated
processes (int): number of processes to use for text preprocessing. The default is
    -1, which will use (number of virtual CPUs - 1) worker processes, in addition
    to the master process. If set to a number greater than the number of virtual
    CPUs available, the value will be reduced to (number of virtual CPUs - 1).
-1 on this: why override a user's explicit request?
You're right; this mistrust of users is not suitable for a Python code base! Modified to remove the upper bounding.
…ry filtering arguments to discard no tokens by default.
gensim/corpora/textcorpus.py
Outdated
@@ -552,7 +554,7 @@ def getstream(self):
    """
    for path in self.iter_filepaths():
        logging.debug("reading file: %s", path)
-        with utils.smart_open(path) as f:
+        with utils.smart_open(path, 'rt') as f:
This looks fragile. Best to always open files in binary mode rb, and convert to text (unicode) explicitly, with an explicit encoding, where needed.
changed to 'rb' followed by explicit unicode conversion
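A minimal sketch of the agreed pattern; plain open() here stands in for gensim's utils.smart_open, and the helper name is invented:

```python
def read_unicode_lines(path, encoding='utf-8'):
    """Open in binary mode and decode explicitly, rather than relying on a
    text-mode default that varies across platforms and Python versions."""
    with open(path, 'rb') as fin:
        for line in fin:
            yield line.decode(encoding)
```

Keeping the encoding a parameter also leaves the door open for the non-UTF-8 corpora mentioned elsewhere in this thread.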
…nd add unicode decoding arguments to `TextCorpus`. Also open source files in 'rb' mode. Lowercase after deaccenting to prevent deaccent confusion. Do not upper bound the number of processes the user passes to `TextCorpus` constructor.
@piskvorky thank you for your many reviews! I believe I have addressed all your comments and requests for changes. From what I can tell, the build check failures are only due to the imports in the main guard.
Thanks for all the fixes and good work :) I'll defer to @menshikh-iv for a final thorough review and decision (and fixing the unrelated build errors).
Beautiful work!
Please resolve merge conflict & fix small issues, this code LGTM.
gensim/corpora/stateful_pool.py
Outdated
self._pool.terminate()


if __name__ == "__main__":
This code isn't needed here (remove it, or better: refactor it and add it as a test).
moved to test class
gensim/corpora/textcorpus.py
Outdated
else:
    yield f.read().strip()
    num_texts += 1
# endclass TextDirectoryCorpus
No need for # endclass ..., please remove it.
done
@macks22 please pay attention to the AppVeyor problems; a lot of tests break, but all of it looks like one problem.
Ping @macks22, what's the status here?
Ping @macks22
@menshikh-iv I'm hoping to update this in the coming weeks. Having trouble finding time to put towards it on the weekends. I'm thinking to refactor it according to some discussion I had with @michaelwsherman in regards to #1506. He had proposed a decomposition of responsibilities.
@macks22 thanks for the clarification, good luck :)
@menshikh-iv hope all is well; I'm still working to find time to update this to fix the tests on Windows in the manner I described above. Hopefully next weekend.
Ping @macks22, do you have time to finish this now?
Ping @macks22, we are waiting for you :)
@menshikh-iv sorry for the latency in reply. I haven't had sufficient time to finish this. It's still on my TODO list, but TBH I may not have time again until the end-of-December holidays.
Ping @macks22, December has come, I remind you of us :)
Ping @macks22, I remind you about the PR :)
I'm sorry, but I'm closing this PR.
Sorry for the delay in responding; I have been busier than expected. I will try to re-open and finish when I can.
Implements #1477.