data missing after download? #578

Closed
bdewilde opened this issue Oct 25, 2016 · 4 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@bdewilde

A week or so ago, I reported an issue where language models were giving nonsensical results, which you traced to missing models being improperly loaded. I've now run into the root cause once again: unexpectedly missing language models. I believe I've identified the reason. When installing models for either 'en' or 'de' with the --force option, all installed models are removed, including those for the other language. I would expect --force to overwrite (remove, then re-download) only the models for the language being installed.

Here's an example:

~$ python -m spacy.en.download all --force
Downloading parsing model
Downloading...
Downloaded 532.28MB 100.00% 3.75MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /Users/burtondewilde/.pyenv/versions/3.5.2/lib/python3.5/site-packages/spacy/data
Downloading GloVe vectors
Downloading...
Downloaded 708.08MB 100.00% 2.65MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /Users/burtondewilde/.pyenv/versions/3.5.2/lib/python3.5/site-packages/spacy/data
~$ ls /Users/burtondewilde/.pyenv/versions/3.5.2/lib/python3.5/site-packages/spacy/data
__cache__  cookies.txt  en-1.1.0  en_glove_cc_300_1m_vectors-1.0.0
~$ python -m spacy.de.download all --force
Downloading...
Downloaded 644.23MB 100.00% 9.51MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /Users/burtondewilde/.pyenv/versions/3.5.2/lib/python3.5/site-packages/spacy/data
~$ ls /Users/burtondewilde/.pyenv/versions/3.5.2/lib/python3.5/site-packages/spacy/data
__cache__  cookies.txt  de-1.0.0
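
The behaviour I'd expect can be sketched roughly as below. This is only an illustration, not spaCy's or sputnik's actual code: `remove_language_models` is a hypothetical helper, and it assumes model directories are named `<lang>-<version>` or `<lang>_<name>-<version>`, as in the listings above.

```python
import shutil
import tempfile
from pathlib import Path


def remove_language_models(data_dir, lang):
    """Delete only the model directories that belong to `lang`.

    Hypothetical helper: assumes directory names start with '<lang>-' or
    '<lang>_', matching the names seen in the listings above.
    """
    removed = []
    for path in sorted(Path(data_dir).iterdir()):
        if path.is_dir() and path.name.startswith((lang + "-", lang + "_")):
            shutil.rmtree(path)
            removed.append(path.name)
    return removed


# Demo on a throwaway directory that mimics the layout above.
demo_dir = Path(tempfile.mkdtemp())
for name in ("__cache__", "en-1.1.0", "en_glove_cc_300_1m_vectors-1.0.0", "de-1.0.0"):
    (demo_dir / name).mkdir()
(demo_dir / "cookies.txt").touch()

removed = remove_language_models(demo_dir, "de")
remaining = sorted(p.name for p in demo_dir.iterdir())
print(removed)    # only the 'de' model directory
print(remaining)  # 'en' models, cache and cookies are left alone
```

With a language-scoped removal like this, forcing a 'de' reinstall would leave the 'en' directories in place.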
@bdewilde
Author

On a related note: While it's good that models that don't exist aren't loaded and then incorrectly applied, it's still a bit surprising when the models in a pipeline are silently not applied. Maybe a one-time warning message indicating that the model was not found and thus could not be applied would clarify things?
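
A one-time warning of that kind might look something like the sketch below. It uses only the standard `warnings` module; `warn_model_missing` and the component name "tagger" are made up for illustration and are not part of spaCy.

```python
import warnings

# Components we have already warned about, so each one warns only once.
_warned = set()


def warn_model_missing(component):
    """Warn the first time `component` is skipped; return True if a warning was issued."""
    if component in _warned:
        return False
    _warned.add(component)
    warnings.warn(
        "No model data found for '%s'; this pipeline step will be skipped." % component,
        UserWarning,
        stacklevel=2,
    )
    return True


# Demo: the second call for the same component stays silent.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    first = warn_model_missing("tagger")
    second = warn_model_missing("tagger")

print(first, second, len(caught))
```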

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Oct 25, 2016
@honnibal
Member

spaCy should probably do a bit more logging, yes.

To clarify a bit about what's changed: previously, you had to download the data to even get the tokenizer. This was definitely unnecessary. The 1.0 release comes with a lot of language data packaged into code, so that you can get basic usage without the data download. The idea is to support people who just need a tokenizer etc.

The downside is that now the user can be in two states: data present or data absent. This bug you've highlighted makes the problem much much worse. Clearly we'll fix the bug. But what about the two states?

One solution is to log a warning, as you're suggesting. Probably that's best. Another solution is to make the user explicitly ask for the 'micro' state, and raise if they're trying to 'load' but can't. A related solution is to raise when an attribute that would be predicted by a model, e.g. a POS tag, is missing. Both of these seem to not work well if we want to assume that the pipeline is a flexible, user-defined thing. We can't map the attribute "tag" back to the action of nlp.tagger() if the pipeline is flexible/arbitrary.
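
To make the "explicitly ask for the 'micro' state" option concrete, here is a rough sketch. The names `resolve_model_dir`, `allow_micro` and `ModelNotFoundError` are hypothetical and do not exist in spaCy; the `<lang>-<version>` directory naming is assumed from the listings earlier in the thread.

```python
import tempfile
from pathlib import Path


class ModelNotFoundError(IOError):
    """Raised when model data is required but not installed (hypothetical)."""


def resolve_model_dir(data_dir, lang, allow_micro=False):
    """Return the model directory for `lang`, or None if the caller opted into micro mode.

    Assumes directories are named '<lang>-<version>'. The last match in
    lexicographic order is returned, which is not a real version comparison.
    """
    candidates = sorted(p for p in Path(data_dir).glob(lang + "-*") if p.is_dir())
    if candidates:
        return candidates[-1]
    if allow_micro:
        return None  # caller explicitly accepted running without models
    raise ModelNotFoundError("No '%s' model found in %s" % (lang, data_dir))


# Demo on an empty throwaway data directory.
data_dir = Path(tempfile.mkdtemp())
micro = resolve_model_dir(data_dir, "en", allow_micro=True)
try:
    resolve_model_dir(data_dir, "en")
    raised = False
except ModelNotFoundError:
    raised = True

# Once a model directory exists, the default call succeeds.
(data_dir / "en-1.1.0").mkdir()
found = resolve_model_dir(data_dir, "en")
print(micro, raised, found.name)
```

This only covers the "raise at load time" variant; it does nothing for the harder case of mapping a missing attribute back to a pipeline step.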

@honnibal
Member

honnibal commented Nov 4, 2016

This should be fixed now, although it's hard to test for, so I'll just cross my fingers...

This took a surprisingly long time to sort out, because I still find the sputnik codebase very difficult.

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018