Put spaCy data in a shared path #868

Closed
chelming opened this issue Mar 2, 2017 · 11 comments
Labels
enhancement Feature requests and improvements


chelming commented Mar 2, 2017

Would it be possible to have spaCy data work similarly to NLTK_data where it goes to a shared path, i.e., C:\nltk_data for Windows, /usr/local/share/nltk_data for macOS, or /usr/share/nltk_data for Unix (obviously substituting spacy_data for nltk_data)?

I understand that I can have it download to a custom location but it would be nice to have it look for it automatically rather than having to set spacy.util.set_data_path() before calling spacy.load(), or by passing a path argument to spacy.en.English.

My use case for this is deploying it in computer labs, where it'd be preferable for me to be able to package and deploy the data without each user having to download it individually. This matters especially when each user has an ~/anaconda folder, since the data downloads to ~/anaconda/lib/python3.5/site-packages/spacy/en/data for each user. It'd be (selfishly) easier for a user to be able to use spaCy without me telling them where the data is and without them filling up the hard drive.

If there's a reason that it's done the way it currently is, that's fine :)

Member

honnibal commented Mar 2, 2017

Hi,

I do understand the pain on this, because it's sometimes inconvenient to have a separate copy of the data in each of my development copies of spaCy.

I know it's becoming common for libraries to drop files in the home directory now, but I don't think it's a great pattern. I think it makes it much harder to reason about how the file-system state affects what's being executed. I use virtualenv etc because I want to isolate my projects from each other, so I'm often unimpressed when libraries go behind my back to set up shared state.

Some suggestions for your use case. If you're happy to have another install step, you could make a command that replaces the data directory with a symbolic link to your shared location. If you can't find a way to make this nice, or you want a really "1 click" procedure that just uses pip, you could make a library that does this, and put it on PyPI. I hadn't thought of this before, but I think it might be useful to others too. I could see myself using it in some situations, for instance.

Matt
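
The symlink suggestion above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not part of spaCy: the function name `link_data_dir` and both paths are hypothetical, and it assumes a platform where the current user may create symbolic links (on Windows this usually needs extra privileges).

```python
from pathlib import Path


def link_data_dir(data_dir, shared_dir):
    """Replace data_dir with a symbolic link pointing at shared_dir.

    Hypothetical helper: both paths are supplied by the caller and
    nothing here is taken from spaCy's own code.
    """
    data_dir, shared_dir = Path(data_dir), Path(shared_dir)
    if not shared_dir.is_dir():
        raise FileNotFoundError(f"shared data directory not found: {shared_dir}")
    if data_dir.is_symlink():
        # Drop a stale link left over from an earlier run.
        data_dir.unlink()
    elif data_dir.exists():
        # Keep the original directory as a backup rather than deleting it.
        data_dir.rename(data_dir.with_name(data_dir.name + ".bak"))
    data_dir.symlink_to(shared_dir, target_is_directory=True)
    return data_dir
```

A deployment package could call this once per environment, for example `link_data_dir(env_site_packages / "spacy" / "data", "/usr/local/share/spacy_data")`, where `env_site_packages` stands for whatever site-packages path the installer has located.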

Author

chelming commented Mar 3, 2017

I'm actually going to be creating a .pkg that would get deployed to multiple macOS machines. It's more desirable in my environment vs using pip because all traffic stays on my network between the deployment server and the workstation. Again, it's not that big of a deal in this situation, compared to nltk_data which was somewhere around 11 GB.

I completely agree though that the method of hiding data in a ~/.folder would be awful, which is why I suggested the /usr/local/spacy_data, /usr/local/share/spacy_data, c:\spacy_data method used by nltk.

Member

honnibal commented Mar 3, 2017

That starts to introduce permissions problems --- most systems don't give user accounts write access to those directories. And then if you require sudo, you drop the environment variables, and it's hard to execute the correct download command...

If you're creating the .pkg file, can you set up the symlink within it? That should meet your requirements quite well.

Author

chelming commented Mar 3, 2017

The problem with the symlink is that I don't necessarily know where a user will have anaconda installed nor would I know what their environment is called. If they create a new environment after the data package is installed, they'd have to download the data to their environment because the symlink wouldn't exist there.

I don't really see any issue with permissions; just have it use the first directory that exists and is writable. If a user runs sudo python let it go to /usr/share, /usr/local/share, and if they run it in their user context, let it go to pathlib.

I guess the main thing I'd want is automatic checking in /usr/share and /usr/local/share for spacy_data.
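
The lookup being requested here could look something like the sketch below. It is an assumption about how such a search might work, not spaCy behaviour: the candidate list and the function names `find_data_dir` and `find_writable_dir` are hypothetical, with the paths taken from this comment.

```python
import os
from pathlib import Path

# Hypothetical search order, based on the paths mentioned in this thread.
SHARED_CANDIDATES = [
    Path("/usr/share/spacy_data"),
    Path("/usr/local/share/spacy_data"),
]


def find_data_dir(candidates=SHARED_CANDIDATES, fallback=None):
    """Return the first candidate that is an existing directory.

    If none exists, return fallback (for example the per-install data
    directory), or None when no fallback was given.
    """
    for path in candidates:
        if Path(path).is_dir():
            return Path(path)
    return Path(fallback) if fallback is not None else None


def find_writable_dir(candidates=SHARED_CANDIDATES):
    """Return the first existing directory the current user can write to."""
    for path in candidates:
        if Path(path).is_dir() and os.access(path, os.W_OK):
            return Path(path)
    return None
```

Reading would use `find_data_dir`, while a download step would use `find_writable_dir` and fall back to the per-user location when it returns `None`.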

Member

honnibal commented Mar 7, 2017

> I don't really see any issue with permissions; just have it use the first directory that exists and is writable. If a user runs sudo python let it go to /usr/share, /usr/local/share, and if they run it in their user context, let it go to pathlib.

I guess this comes down to a matter of taste. I find that sort of behaviour really unappealing.

I'm not sure what the best solution is for you, given all your constraints. But I do think it'll be quite easy to point spaCy to save and load data to some path by default. You'll just need to decide which path to use.

honnibal added the enhancement (Feature requests and improvements) label Mar 7, 2017

jck commented Mar 17, 2017

The XDG standard has a directory for data (default: ~/.local/share/application-name). This is used by many apps, including pip.
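
For reference, resolving that directory could be sketched as below. The function name `xdg_data_dir` is hypothetical; the logic follows the XDG Base Directory rule that `XDG_DATA_HOME` applies when it is set to an absolute path, with `~/.local/share` as the default otherwise.

```python
import os
from pathlib import Path


def xdg_data_dir(app_name, environ=os.environ):
    """Return the per-user XDG data directory for app_name.

    Falls back to ~/.local/share when XDG_DATA_HOME is unset, empty,
    or not an absolute path.
    """
    base = environ.get("XDG_DATA_HOME", "")
    if not base or not os.path.isabs(base):
        base = Path.home() / ".local" / "share"
    return Path(base) / app_name
```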

Member

ines commented Mar 17, 2017

Just to give you a heads-up – this will be fixed in v1.7!

You'll be able to store your data wherever you want, and download and install models directly, or using the new spacy.download command. Models can be installed as Python packages via pip or loaded in manually. There'll also be a new command spacy.link that lets you set up symlinks for your models (local directory or installed Python package), so you can load them by name, e.g. spacy.load('my_cool_model'). This will also make it much easier to use your own models with spaCy.

We're just in the process of reuploading all the models (taking a bit longer than expected, because we've trained new models and decided to provide different options, i.e. with GloVe vectors and without). But as soon as they're up, we'll push the new release and docs 🎉

/ cc: @CWhits, @jck

Author

@ines: will there be a specific set of search paths built-in that wouldn't require the models to be manually loaded or linked?

Member

ines commented Mar 17, 2017

The data path spacy/data (or any custom one set via util.set_data_path()) will still be the directory spaCy uses to look for models. It's also where the symlinks will be created.

So spacy.load() will still work for any models placed in this directory, using the exact name of the model directory. (spaCy will now only check version compatibility if you download models via spacy.download and not make any other assumptions. So if you do things manually, you should be able to load whichever model you want.)
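
As a rough illustration of the lookup described here (a name maps to a directory or symlink of the same name inside the data path), consider the sketch below. It is not spaCy's implementation: `resolve_model` is a hypothetical helper and the directory layout is assumed.

```python
from pathlib import Path


def resolve_model(name, data_path):
    """Map a model name to a directory inside data_path.

    The entry may be a real directory or a symlink to one; is_dir()
    follows symlinks, so both cases are handled the same way.
    """
    path = Path(data_path) / name
    if not path.is_dir():
        raise FileNotFoundError(f"no model named {name!r} in {data_path}")
    # resolve() returns the real location behind any symlink.
    return path.resolve()
```

Under this scheme, a link named `my_cool_model` that points at a model directory elsewhere on disk resolves to that directory, which is what makes loading by a custom name possible.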

Note that v1.7 will include extensions to the list of hard-coded symbols for Universal Dependencies 2.0 compatibility. So the old models won't be compatible with 1.7 and you'll have to download the new models (either via the spaCy downloader, pip + URL or local path, or manually).

The new models are:

  • updated models for English (one with GloVe vectors, one without, one GloVe vectors only)
  • smaller English model + vectors (~50MB, 2% less accurate than larger model)
  • updated model for German

(Still training, uploading and testing – fingers crossed! Can't wait to publish the release.)

Update: Found a better solution to the way symbols are added to the vocab, so that vocabularies remain compatible across spaCy versions. This means the current models can still be used with the new code. We're also releasing a new smaller English model with vectors (~50MB, 2% less accurate than larger model). New larger models will then follow in v2.0.

ines added this to the v1.7.0 milestone Mar 18, 2017
Member

ines commented Mar 18, 2017

ines closed this as completed Mar 18, 2017

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
4 participants