All preprocessing functions to receive as input TokenSeries #145
Comments
To keep in mind: one of the advantages of the "clean" section of pre-processing is the possibility to clean small strings (e.g. names, addresses, etc.) in a dataset in a uniform way. This, although minor when compared to the overall benefits of using the whole pipeline on big chunks of text, could be an interesting pre-step to string-matching operations. These are very common in some research contexts, where you have to merge different datasets based, for instance, on company names or the authors of scientific publications. Would moving "tokenize" earlier in the pipeline prevent this use of Texthero?
Very interesting observation. For the case you mentioned, we can
(out of the discussion) -> sooner or later we will have to think about how to add a universal
@henrifroese, would you mind helping us with that, as you are already familiar with the Series subject?
@jbesomi with what part do you need help? Or in general? I think that overall, as described in #131, the spaCy version without parallelization is too slow to be useful for Texthero. With the spaCy parallelization it's still a lot slower than the regex version, but usable, and with the parallelization from #162 it's pretty fast and usable. However, I'm not 100% convinced we should always tokenize first. I think the point mentioned by @Iota87 is correct: there are users who mainly use the cleaning functions, and for them it's a little annoying and counterintuitive to have to tokenize, clean, and then join again (see the sketch below). Additionally, this would of course be a pretty big development effort, since a lot of functionality in the preprocessing module and its tests would need to change, so I want to make sure this really is necessary 🥵
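For concreteness, a minimal sketch of that round trip, assuming the public `hero.tokenize` and plain pandas for the join; the token-level cleaning step is purely illustrative:

```python
import pandas as pd
import texthero as hero

s = pd.Series(["A first document!", "And a second, shorter one."])

# Proposed order: tokenize first ...
tokens = hero.tokenize(s)

# ... apply some token-level cleaning (illustrative filter, not a texthero function) ...
tokens = tokens.apply(lambda toks: [t.lower() for t in toks if t.isalpha()])

# ... and join back to strings for users who only wanted cleaned text.
cleaned = tokens.str.join(" ")
```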
The aim of this issue is to discuss and understand when `tokenize` should happen in the pipeline.

The current solution is to apply `tokenize` once the text has already been cleaned, either with `clean` or with a custom pipeline. In general, in the cleaning phase we also remove the punctuation symbols.

The problem with this approach is that, especially for non-Western languages (#18 and #128), the tokenization operation might actually need the punctuation to execute correctly.
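For reference, a minimal sketch of the current ordering, assuming the public `hero.clean` / `hero.tokenize` API:

```python
import pandas as pd
import texthero as hero

s = pd.Series(["Text of the first document.", "Text of the second document!"])

# Current order: clean first (the default pipeline removes punctuation),
# then tokenize the already-cleaned TextSeries.
cleaned = hero.clean(s)
tokens = hero.tokenize(cleaned)
```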
The natural question is: wouldn't it be better to have `tokenize` as the very first operation?

In this scenario, all preprocessing functions would receive a `TokenSeries` as input. As we care about performance, one question is whether we can develop a `remove_punctuation` that is efficient enough on a `TokenSeries`. The current version of the `tokenize` function is quite efficient as it makes use of regex. The first task would be to develop the new variant and benchmark it against the current one (see the sketch below). An advantage of the non-regex approach is that, as the input is a list of lists, we might be able to exploit parallelization.

Could we move `tokenize` to the very first step while keeping performance high? Which solution offers the fastest performance?

The other question is: is there a scenario where preprocessing functions should deal with a `TextSeries` rather than a `TokenSeries`?
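To make the benchmark concrete, here is a minimal sketch of what the two variants could look like; the function names and the use of `string.punctuation` are assumptions for illustration, not the actual texthero implementation:

```python
import re
import string
import pandas as pd

PUNCTUATION = set(string.punctuation)
PUNCT_PATTERN = rf"[{re.escape(string.punctuation)}]"

def remove_punctuation_text(s: pd.Series) -> pd.Series:
    # TextSeries variant: a single regex substitution over each raw string.
    return s.str.replace(PUNCT_PATTERN, " ", regex=True)

def remove_punctuation_tokens(s: pd.Series) -> pd.Series:
    # TokenSeries variant: drop tokens made up entirely of punctuation.
    # Each cell is an independent list, so this maps naturally onto
    # chunk-level parallelization (e.g. the approach from #162).
    return s.apply(
        lambda toks: [t for t in toks if not all(ch in PUNCTUATION for ch in t)]
    )
```

Timing both on a large corpus (e.g. with `%timeit` or `pytest-benchmark`) would answer the performance question directly.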
Extra crunch:

The current `tokenize` version uses a very naive approach based on regex that works only for Western languages. The main advantage is that it's quite fast compared to NLTK or other solutions. An alternative we should seriously consider is to replace the regex version with the spaCy tokenizer (#131). The question is: how can we `tokenize` with spaCy in a very efficient fashion?
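One possible direction, sketched here as an assumption rather than a finished design: use a blank spaCy pipeline so that only the tokenizer runs, and stream the Series through `nlp.pipe`:

```python
import pandas as pd
import spacy

def tokenize_spacy(s: pd.Series) -> pd.Series:
    # A blank pipeline has no tagger/parser/NER components,
    # so nlp.pipe effectively only runs the tokenizer.
    nlp = spacy.blank("en")
    # batch_size (and optionally n_process for multiprocessing) can be tuned.
    docs = nlp.pipe(s.astype(str).tolist(), batch_size=1000)
    return pd.Series([[token.text for token in doc] for doc in docs], index=s.index)
```

Whether this beats the regex version would still need to be measured; on larger corpora, multiprocessing via `n_process` may help, at the cost of process start-up overhead.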