Avoid downloading spaCy models when only using the preprocessing module #120
Comments
Hi @leotok, thank you for reaching out! I'm not sure it will be possible to find a solution right away, though we will probably improve this part of the setup very soon. Regarding "every time a new API instance is started": is that the same as saying that texthero gets installed every time a new instance starts? I'm open to new suggestions; how would you solve this? Regards,
Hi @jbesomi, thanks for the quick response! Currently, I'm only using these functions, which I believe shouldn't require any outside resources:

```python
custom_pipeline = [
    preprocessing.remove_whitespace,
    preprocessing.remove_punctuation,
    preprocessing.remove_diacritics,
    preprocessing.remove_html_tags,
    preprocessing.lowercase,
]
text = hero.clean(text, custom_pipeline)
```

I found a way to overcome this issue by adding all the package downloads to my Dockerfile.
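For illustration, this is the kind of pre-fetch step that could run once at image build time rather than at API startup; the exact resources shown (the `en_core_web_sm` spaCy model and the nltk `stopwords` corpus) are assumptions based on this thread, not a definitive list of what texthero needs:

```python
# download_resources.py -- hypothetical helper, run once during `docker build`
# (e.g. via `RUN python download_resources.py`) so API instances start with
# everything already in place.
import nltk
from spacy.cli import download

# The spaCy model that texthero fetches for its NLP functions (per this issue).
download("en_core_web_sm")

# An nltk corpus; `stopwords` is an assumption about which resource is needed.
nltk.download("stopwords")
```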
The texthero package was already installed in the docker image, but the spaCy and nltk resources were not; with this new approach that is no longer a problem. Maybe there could be a hint about this somewhere in the README.md, what do you think?

Also, now that I have solved the installation issue, I'm facing a new problem. Although I use just a small (but great!) fraction of the lib, my docker container got a lot bigger because of these "unused" models, and I had to increase the memory reserved for my instances inside Kubernetes. I understand that the NLP functions depend on these models, but maybe they could be downloaded lazily, only when a function that actually needs them is first called.
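A minimal sketch of what such lazy loading could look like; the helper below is hypothetical and not part of texthero's API, it only illustrates the idea of deferring the model download to first use:

```python
import spacy
from spacy.cli import download

_NLP = None

def get_spacy_model(name: str = "en_core_web_sm"):
    """Hypothetical helper: load the spaCy model on first use only."""
    global _NLP
    if _NLP is None:
        try:
            _NLP = spacy.load(name)
        except OSError:
            # Model not installed yet: fetch it now instead of at import time.
            download(name)
            _NLP = spacy.load(name)
    return _NLP
```

This way the model would only be fetched the first time a function that needs it runs, so preprocessing-only workloads would never trigger the download.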
Another suggestion is that these model downloads could be made optional, so that users who only need the preprocessing module don't pull them in at all.
What do you think about these suggestions? Thanks for helping out!
Hey Leo,

You are making great observations, thank you for sharing! I agree with you that it is sometimes inefficient (and also annoying) that the lib downloads resources that are not strictly necessary. Both of your solutions seem interesting to me; if you want to investigate this further I will be happy to review them. Some extra thoughts:

- nltk: we might want to get rid of it completely.
- spaCy models: we are realizing more and more that the simplest and most elegant way to actually preprocess text data is to first tokenize the input. Tokenization is therefore crucial (and one of the least considered parts of the preprocessing phase, imo). The current tokenization algo is based on a simple regular expression, but its accuracy is poor. We were seriously considering switching to a more robust solution with spaCy (see #131). We are also considering having all functions operate on already-tokenized input; in this case, downloading the spaCy model will be required in 90% of the cases.
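To make the tokenization point concrete, here is a rough sketch of what a spaCy-based tokenizer for a pandas Series could look like; this is not texthero's implementation, and whether a blank pipeline or a full model like `en_core_web_sm` would be used is exactly the open question above:

```python
import pandas as pd
import spacy

# spacy.blank("en") provides the rule-based English tokenizer without downloading
# a statistical model; a full model would still be needed for POS tags, lemmas, etc.
nlp = spacy.blank("en")

def tokenize(s: pd.Series) -> pd.Series:
    """Split each document into a list of token strings."""
    return s.apply(lambda text: [token.text for token in nlp(text)])

s = pd.Series(["Texthero is great, isn't it?"])
print(tokenize(s)[0])
# e.g. ['Texthero', 'is', 'great', ',', 'is', "n't", 'it', '?']
```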
Hi,
I want to use Texthero to preprocess text on my API, but every time a new instance of the API is started, it has to download `en-core-web-sm` from spaCy even though I'm only using the preprocessing module. Is there a way to avoid this download?
Thanks!
Ex:
Running the API: