Put spaCy data in a shared path #868

chelming · 2017-03-02T21:24:53Z

Would it be possible to have spaCy data work similarly to NLTK_data where it goes to a shared path, i.e., C:\nltk_data for Windows, /usr/local/share/nltk_data for macOS, or /usr/share/nltk_data for Unix (obviously substituting spacy_data for nltk_data)?

I understand that I can have it download to a custom location but it would be nice to have it look for it automatically rather than having to set spacy.util.set_data_path() before calling spacy.load(), or by passing a path argument to spacy.en.English.

My use case for this is deploying it in computer labs, where it'd be preferable for me to be able to package and deploy the data without each user having to download it individually, especially in cases where each user has an ~/anaconda folder, since the data downloads to ~/anaconda/lib/python3.5/site-packages/spacy/en/data for each user. It'd be (selfishly) easier for a user to be able to use spaCy without me telling them where the data is and without them filling up the HD.

If there's a reason that it's done the way it currently is, that's fine :)

honnibal · 2017-03-02T23:33:39Z

Hi,

I do understand the pain on this, because it's sometimes inconvenient to have the data duplicated across all my development copies of spaCy.

I know it's becoming common for libraries to drop files in the home directory now, but I don't think it's a great pattern. I think it makes it much harder to reason about how the file-system state affects what's being executed. I use virtualenv etc because I want to isolate my projects from each other, so I'm often unimpressed when libraries go behind my back to set up shared state.

Some suggestions for your use case. If you're happy to have another install step, you could make a command that replaces the data directory with a symbolic link to your shared location. If you can't find a way to make this nice, or you want a really "1 click" procedure that just uses pip, you could make a library that does this, and put it on PyPI. I hadn't thought of this before, but I think it might be useful to others too. I could see myself using it in some situations, for instance.
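A minimal sketch of the symlink step suggested above, using only the standard library. The function name `link_data_dir` and the `.bak` backup suffix are hypothetical and not part of spaCy; the sketch assumes a platform where the current user may create symbolic links and where no `.bak` copy exists yet.

```python
import os
from pathlib import Path


def link_data_dir(data_dir, shared_dir):
    """Replace data_dir with a symbolic link that points at shared_dir.

    Hypothetical helper, not a spaCy API.
    """
    data_dir = Path(data_dir)
    shared_dir = Path(shared_dir).resolve()  # absolute target for the link
    if not shared_dir.is_dir():
        raise FileNotFoundError(f"shared data directory not found: {shared_dir}")
    if data_dir.is_symlink():
        # Drop a stale link left over from an earlier run.
        data_dir.unlink()
    elif data_dir.exists():
        # Keep the per-install copy as a backup instead of deleting it.
        data_dir.rename(data_dir.with_name(data_dir.name + ".bak"))
    data_dir.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(str(shared_dir), str(data_dir), target_is_directory=True)
    return data_dir
```

Running this once per install (or per environment) would leave the package's data directory pointing at the shared copy. It does not cover environments created later, which is the limitation raised further down the thread.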

Matt

chelming · 2017-03-03T14:26:42Z

I'm actually going to be creating a .pkg that would get deployed to multiple macOS machines. It's more desirable in my environment vs using pip because all traffic stays on my network between the deployment server and the workstation. Again, it's not that big of a deal in this situation, compared to nltk_data which was somewhere around 11 GB.

I completely agree though that the method of hiding data in a ~/.folder would be awful, which is why I suggested the /usr/local/spacy_data, /usr/local/share/spacy_data, c:\spacy_data method used by nltk.

honnibal · 2017-03-03T14:33:37Z

That starts to introduce permissions problems --- most systems don't give user accounts write access to those directories. And then if you require sudo, you drop the environment variables, and it's hard to execute the correct download command...

If you're creating the .pkg file, can you set up the symlink within it? That should meet your requirements quite well.

chelming · 2017-03-03T15:07:22Z

The problem with the symlink is that I don't necessarily know where a user will have anaconda installed nor would I know what their environment is called. If they create a new environment after the data package is installed, they'd have to download the data to their environment because the symlink wouldn't exist there.

I don't really see any issue with permissions; just have it use the first directory that exists and is writable. If a user runs sudo python let it go to /usr/share, /usr/local/share, and if they run it in their user context, let it go to pathlib.

I guess the main thing I'd want is automatic checking in /usr/share and /usr/local/share for spacy_data.
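The lookup being requested here could be sketched roughly as follows. This is an illustration only, not spaCy behaviour: `find_data_dir` and `SHARED_CANDIDATES` are hypothetical names, and the candidate paths are simply the ones mentioned in this thread.

```python
import os
from pathlib import Path

# Hypothetical search order, taken from the directories named in this thread.
SHARED_CANDIDATES = ("/usr/local/share/spacy_data", "/usr/share/spacy_data")


def find_data_dir(default_dir, candidates=SHARED_CANDIDATES):
    """Return the first candidate that is an existing, readable directory.

    Falls back to default_dir when no candidate qualifies.
    """
    for candidate in candidates:
        path = Path(candidate)
        # R_OK | X_OK: the user can list the directory and open files in it.
        if path.is_dir() and os.access(str(path), os.R_OK | os.X_OK):
            return path
    return Path(default_dir)
```

A caller could hand the result to `spacy.util.set_data_path()` before `spacy.load()`, which is the manual step described in the first post. Choosing a download target would additionally need a writability check (`os.W_OK`), which this sketch leaves out.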

honnibal · 2017-03-07T14:43:03Z

> I don't really see any issue with permissions; just have it use the first directory that exists and is writable. If a user runs sudo python let it go to /usr/share, /usr/local/share, and if they run it in their user context, let it go to pathlib.

I guess this comes down to a matter of taste. I find that sort of behaviour really unappealing.

I'm not sure what the best solution is for you, given all your constraints. But I do think it'll be quite easy to point spaCy to save and load data to some path by default. You'll just need to decide which path to use.

jck · 2017-03-17T07:07:34Z

The XDG standard has a directory for data (default: ~/.local/share/application-name). This is used by many apps, including pip.
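For reference, resolving that directory might look like the sketch below. The helper name `xdg_data_dir` is hypothetical, and nothing here implies that spaCy reads `XDG_DATA_HOME`; the fallback and the handling of relative values follow the XDG Base Directory rules.

```python
import os
from pathlib import Path


def xdg_data_dir(app_name, environ=None):
    """Return the per-user XDG data directory for app_name.

    Hypothetical helper; environ defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    base = environ.get("XDG_DATA_HOME", "")
    # Unset, empty, or relative values are ignored in favour of the default.
    if not base or not os.path.isabs(base):
        base = os.path.join(os.path.expanduser("~"), ".local", "share")
    return Path(base) / app_name
```

Note that this is still a per-user location, so it would not by itself remove the duplicate downloads described in the original request.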

ines · 2017-03-17T11:29:21Z

Just to give you a heads-up – this will be fixed in v1.7!

You'll be able to store your data wherever you want, and download and install models directly, or using the new spacy.download command. Models can be installed as Python packages via pip or loaded in manually. There'll also be a new command spacy.link that lets you set up symlinks for your models (local directory or installed Python package), so you can load them by name, e.g. spacy.load('my_cool_model'). This will also make it much easier to use your own models with spaCy.

We're just in the process of reuploading all the models (taking a bit longer than expected, because we've trained new models and decided to provide different options, i.e. with GloVe vectors and without). But as soon as they're up, we'll push the new release and docs 🎉

/ cc: @CWhits, @jck

chelming · 2017-03-17T13:37:09Z

@ines: will there be a specific set of search paths built-in that wouldn't require the models to be manually loaded or linked?

ines · 2017-03-17T14:18:50Z

The data path spacy/data (or any custom one set via util.set_data_path()) will still be the directory spaCy uses to look for models. It's also where the symlinks will be created.

So spacy.load() will still work for any models placed in this directory, using the exact name of the model directory. (spaCy will now only check version compatibility if you download models via spacy.download and not make any other assumptions. So if you do things manually, you should be able to load whichever model you want.)

Note that v1.7 will include extensions to the list of hard-coded symbols for Universal Dependencies 2.0 compatibility. So the old models won't be compatible with 1.7 and you'll have to download the new models (either via the spaCy downloader, pip + URL or local path, or manually).

The new models are:

- updated models for English (one with GloVe vectors, one without, one GloVe vectors only)
- smaller English model + vectors (~50MB, 2% less accurate than larger model)
- updated model for German

~~(Still training, uploading and testing – fingers crossed! Can't wait to publish the release.)~~

Update: Found a better solution to the way symbols are added to the vocab, so that vocabularies remain compatible across spaCy versions. This means the current models can still be used with the new code. We're also releasing a new smaller English model with vectors (~50MB, 2% less accurate than larger model). New larger models will then follow in v2.0.

ines · 2017-03-18T19:51:36Z

Just pushed v1.7.0! 🎉

lock · 2018-05-09T01:39:05Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the enhancement Feature requests and improvements label Mar 7, 2017

ines added this to the v1.7.0 milestone Mar 18, 2017

ines closed this as completed Mar 18, 2017

lock bot locked as resolved and limited conversation to collaborators May 9, 2018
