Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a way to get the URL to download a pipeline to the CLI #11175

Merged
merged 23 commits into from
Sep 2, 2022

Conversation

polm
Copy link
Contributor

@polm polm commented Jul 21, 2022

Description

This adds a --url flag to the spacy info [model] command that prints the URL of the most recent compatible version of the named pipeline. This URL is printed without any formatting for use in scripts, like:

url=$(spacy info en_core_web_sm --url)
wget "$url"

Because this requires a compatibility check, it requires an Internet connection.

This also adds the download URL for installed packages to the output of spacy info [model] if available. The information is typically available for pipelines installed with pip, and is accessed through pkg_resources. This does not require an Internet connection.


Original description:

This is a simple implementation of a --dry-run flag for spacy download that makes it print the URL of the package to be installed without downloading anything.

I don't have strong opinions about the way this is done, but there are some questions:

  • would it be better to print just the URL?
  • should there be a method to get the URL for outside use?
  • should we be concerned about the flags overlapping with pip (they don't currently)?

I'm not sure how universal this is, but I used -n for the short flag, which I understand to come from noop.

Types of change

minor enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@polm polm added feat / cli Feature: Command-line interface enhancement Feature requests and improvements labels Jul 21, 2022
@adrianeboyd
Copy link
Contributor

I think there's too much danger for this to conflict with the pip flags. I noticed that this was just merged, maybe we can use this somehow instead? pypa/pip#10771

Or alternatively this should be in a separate command (not download) and without any interaction with pip flags.

@adrianeboyd
Copy link
Contributor

There's no way for pip to resolve things without downloading so pip install --dry-run does the download but then doesn't install in the end, which isn't great here.

@polm
Copy link
Contributor Author

polm commented Jul 24, 2022

OK, we definitely shouldn't use this if pip is adding a flag with the same name. Since pip has a lot of options it seems like the only safe option is adding another command. I'm not sure what it should be called, maybe download-url?

We could also provide more information and call it something like pipeline-info.

@adrianeboyd
Copy link
Contributor

I think a separate command sounds easier, but the naming is hard. download_url still sounds too much to me like it might be downloading something.

In general I think people would like to be able to do something like:

spacy whatever en_core_web_sm >> requirements.txt

@polm
Copy link
Contributor Author

polm commented Jul 25, 2022

What if we had commands like spacy pipelines url and spacy pipelines list or something? list giving the full list and url giving the url of the listed (maybe one or more?) pipelines?

@adrianeboyd
Copy link
Contributor

Let's see, I'd never noticed this but we already have spacy info model, which is only intended for installed models, but maybe we can add a bit of info about how to specify requirements, also for our own trained pipelines that aren't installed. It would require some level of failing gracefully without internet access.

@svlandeg
Copy link
Member

svlandeg commented Aug 9, 2022

I like this idea - could it be

spacy info en_core_web_sm --url

that would basically work regardless of whether the model is already installed or not?

@adrianeboyd
Copy link
Contributor

I was imagining something that could show a bit more about how to specify it as a dependency, so at least some option with more details like:

$ spacy info en_core_web_sm --requirements

setup.cfg:

install_requires = 
    en_core_web_sm @ URL

poetry:

   something with {url = "URL"} (I'd have to look it up)

Also a plain --url seems useful. Or maybe this should just point to the docs with more details, but something that makes this easier in general.

@polm
Copy link
Contributor Author

polm commented Aug 16, 2022

OK, I have updated this to support this syntax:

spacy info en_core_web_sm --url

Initially I had this work like other info commands and print a table by default, but consdering the comment about piping the output upthread, and that there's only one relevant field, I instead modified to print just the URL of the most recent compatible version of the pipeline when successful.

As far as failure modes:

  • if not online, the compatibility check will fail gracefully with a clear message (this is just in the compatibility function)
  • if the pipeline name doesn't exist, an error mentioning that will be raised (same as downloading a name that doesn't exist)
  • if you didn't pass a pipeline name there's an error for that

There are some unresolved questions. There are lots of things we could add but I think it's better to keep this a simple, minimal feature that can be integrated in other things, so I've kept things simple when in doubt about the below. (In particular, because this is overloading the existing info command, the flag interactions would get weird. If we want more sophisticated options I think it'd be better to have a new command or mode.)

  • this always provides the wheel URL rather than the .tar.gz URL. I don't think that's a problem but I'm not sure when one would be preferable over the other.
  • I didn't add a feature to directly specify the version rather than grabbing the latest compatible one.
  • we could also provide more info about models that aren't installed, but that feels like a different feature.

For the --requirements option, it wouldn't be difficult to add that, I'll look at doing it tomorrow.

spacy/cli/download.py Outdated Show resolved Hide resolved
@adrianeboyd
Copy link
Contributor

I think providing just the wheel URL is fine and just providing the same URL as you would get from spacy download is also fine.

The no-internet error is pretty ugly here, especially since it's not that clear to users that it's going to need internet access to provide the URL.

@adrianeboyd
Copy link
Contributor

adrianeboyd commented Aug 16, 2022

Actually if the model is installed locally, it might make more sense to provide that version, or possibly to provide both?

@polm
Copy link
Contributor Author

polm commented Aug 17, 2022

The no-internet error is pretty ugly here, especially since it's not that clear to users that it's going to need internet access to provide the URL.

I don't think it's that bad? It clearly states what the failure was (couldn't look up the compatibility table). It would be nice if we could check compatibility offline, but since we can't... The ideal solution would probably be a URL that could be generated locally that would redirect to the latest compatible version. (There is also the issue that looking up the compatibility table offline doesn't throw an exception, it exits the program, so it would need to be modified to handle it elegantly.)

Actually if the model is installed locally, it might make more sense to provide that version, or possibly to provide both?

I'm not sure I understand the use case? These are the uses I was imagining:

  1. Download a model in CI for some reason. You probably want the most recent compatible version, and if you don't you can cache the output somehow.
  2. You need the URL to download the model in a particular way due to a corporate firewall. So you just need the default download URL.

I guess for locally installed models, you might want the URL if you want to declare dependencies for a project you've been working on - is that what you had in mind? For that use case, would it be better to add the download URL to the spacy info model_name (without the --url option)?

spacy/cli/info.py Outdated Show resolved Hide resolved
spacy/cli/info.py Show resolved Hide resolved
polm and others added 2 commits August 18, 2022 14:56
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
@polm polm changed the title Add a dry run flag to download command Add a way to get the URL to download a pipeline to the CLI Aug 18, 2022
@polm
Copy link
Contributor Author

polm commented Aug 18, 2022

I just pushed a change to add output for spacy info [pipeline] --requirements. The code is a bit messy because there are two output formats here.

This is the one I prefer:

============== Requirements info for pipeline 'en_core_web_sm' ==============



    Release info page:

    https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.0

    setup.cfg:

    install_requires =
        en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl

    poetry:

    en_core_web_sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl" }

This is a little hacky though - it's using the same msg output as other pre-existing commands, but with an empty key and one value in order to preserve formatting. (This is probably why the markdown option uses print.)

This is output with one key per entry. This is the same way other commands work but I think it's too hard to read, mainly because of the newline for setup.cfg format.

============== Requirements info for pipeline 'en_core_web_sm' ==============

Release info page   https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.0
setup.cfg           install_requires =
        en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl
poetry              en_core_web_sm = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl" }

I also feel like at this point spacy info is getting a little overloaded, and it would be better to have subcommands (like debug data). But since we already support model names I guess we'd need to have a new keyword if we wanted to do that.

For this PR I think we should leave --requirements for later and figure out the formatting for the output. My preference is to provide just a URL for ease when scripting, but I think it would also make sense to provide both the download URL and the release info URL (example) in the typical table format used by other existing options.

This should make mypy happy
@polm
Copy link
Contributor Author

polm commented Aug 23, 2022

Just to clarify the state of this a bit, what I think we should do now is:

  1. leave --requirements for a separate PR
  2. output just the URL, without formatting, for use in scripting

About generating different URLs based on installed packages, I'm still not clear what the use case for that is for downloading (it makes sense for --requirements).

I have left the --requirements related suggestions in the PR to make it easier to check them, but would delete them before taking them out of draft if we do want to do that separately.

spacy/cli/info.py Outdated Show resolved Hide resolved
@polm polm added the v3.5 Related to v3.5 label Aug 24, 2022
@polm polm marked this pull request as ready for review August 24, 2022 05:29
@polm
Copy link
Contributor Author

polm commented Aug 24, 2022

I think this is OK to merge now?

@adrianeboyd
Copy link
Contributor

I'll take another look. Can you update the PR description?

@polm
Copy link
Contributor Author

polm commented Aug 24, 2022

Updated the PR description to reflect the current behavior.

spacy/cli/info.py Outdated Show resolved Hide resolved
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
@adrianeboyd
Copy link
Contributor

I think the functionality is looking good. Do we want to include the changes to the docs in this PR or a follow-up PR?

There are at least these sections:

https://spacy.io/usage/models#download-pip
https://spacy.io/usage/models#production

And maybe easier instructions on the models pages themselves? (I think having to dig through releases for the asset URL is the too-difficult part.)

@svlandeg
Copy link
Member

svlandeg commented Aug 29, 2022

I'd prefer having the docs changes with the PR!

Add a sidebar about finding download URLs, with some examples of the new
command.
@polm
Copy link
Contributor Author

polm commented Aug 30, 2022

OK, I updated the docs. In addition to adding mention of the new command to the places linked above, I put download links directly in the tables on the model pages. How's that?

@polm
Copy link
Contributor Author

polm commented Aug 30, 2022

Here's what it looks like on the model page:

2022-08-30T18-26-35

Copy link
Contributor

@adrianeboyd adrianeboyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the summary of this is that I don't like "Download URL" as the name for this. It's just a URL? Possibly a "(pipeline) (package) URL"?

website/docs/usage/models.md Outdated Show resolved Hide resolved
website/docs/usage/models.md Outdated Show resolved Hide resolved
website/docs/usage/models.md Outdated Show resolved Hide resolved
website/docs/usage/models.md Outdated Show resolved Hide resolved
website/docs/usage/models.md Outdated Show resolved Hide resolved
website/src/templates/models.js Outdated Show resolved Hide resolved
website/src/templates/models.js Show resolved Hide resolved
@adrianeboyd adrianeboyd merged commit 977dc33 into explosion:master Sep 2, 2022
polm added a commit to polm/spaCy that referenced this pull request Sep 6, 2022
…#11175)

* Add a dry run flag to download

* Remove --dry-run, add --url option to `spacy info` instead

* Make mypy happy

* Print only the URL, so it's easier to use in scripts

* Don't add the egg hash unless downloading an sdist

* Update spacy/cli/info.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add two implementations of requirements

* Clean up requirements sample slightly

This should make mypy happy

* Update URL help string

* Remove requirements option

* Add url option to docs

* Add URL to spacy info model output, when available

* Add types-setuptools to testing reqs

* Add types-setuptools to requirements

* Add "compatible", expand docstring

* Update spacy/cli/info.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Run prettier on CLI docs

* Update docs

Add a sidebar about finding download URLs, with some examples of the new
command.

* Add download URLs to table on model page

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Updates from review

* download url -> download link

* Update docs

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
@svlandeg svlandeg removed the v3.5 Related to v3.5 label Sep 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / cli Feature: Command-line interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants