Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite PackageFinder’s _sort_locations() and related parsing logic #5846

Closed
uranusjr opened this issue Oct 3, 2018 · 8 comments
Closed
Labels
auto-locked Outdated issues that have been locked by automation C: finder PackageFinder and index related code type: refactor Refactoring code

Comments

@uranusjr
Copy link
Member

uranusjr commented Oct 3, 2018

Related discussions:

#5800
#5827
#5836 (comment)
#5838 (comment)
#5838 (comment)

I am looking into how I can break apart find_all_candidates() and extract logic from it. The general logic in that function isn’t too difficult, but how inputs are grouped and interpreted is 🤯

The logic is scattered between _sort_locations() and during fetching the HTML page. There are so many edge cases filtered out in one place or another, I don’t know how to extract anything without breaking others.

I decided to take another approach. First, I list all possible input variants and possible outcomes, and compile a table of what happens with each kind of inputs. Then I’ll try to rewrite one function at a time to match the behaviour. Once inner functions are replaced, I can deal with find_all_candidates() with confidence.

First up is _sort_locations(). Here’s the compiled table. Dependency links are not included, but @pradyunsg said it will be removed “pretty soon”, and I don’t think I can finish this before that happens.

                        Index URL   Find Links

file: to non-HTML       Archive     Archive
file: to HTML           Page        Page
file: to directory      index.html  Archives Inside
path to non-HTML        Archive     Archive
path to HTML            Page        Page
path to directory       Dropped     Archives Inside
http(s): or ftp: URL    Page        Page

Other notes:

  • “File” means “any local entity that’s not a directory”.
  • Local HTML files are detected by mimetypes.guess_type().
  • Non-existent paths and file: URLs are dropped with a warning.
  • Remote URLs except http:, https:, and ftp: are dropped a warning. VCS
    schemes (e.g. git+https:) are allowed in _sort_locations(), but dropped
    later in _get_html_page().
  • HTTP(S) pages are dropped if the response’s content type is not text/html.
  • FTP pages are dropped if the URL looks like an archive.
@cjerdonek
Copy link
Member

Related to my last review comment, it might be good to have solid end-to-end test coverage prior to doing the refactor. If you have structured a good end-to-end test well, you could add test cases to the list as you are investigating various code paths and finding edge cases, without too much more work. That will build confidence as you go.

@cjerdonek
Copy link
Member

Also, for _sort_locations(), one of my suggestions from before was to separate the code that evaluates the URLs from the code that does the appending. You can also break up the for loop by making a function to evaluate a single URL.

On a separate, more general note (for the larger task), one of Donald's suggestions was to take a sans io approach. That would give you another axis of separation when deciding what and how to extract things out (e.g. pure IO functions vs. pure processing).

@cjerdonek cjerdonek added type: refactor Refactoring code C: finder PackageFinder and index related code labels Oct 3, 2018
@cjerdonek cjerdonek added this to the Internal Cleansing milestone Oct 3, 2018
@uranusjr
Copy link
Member Author

uranusjr commented Oct 3, 2018

There are actually some tests around this in test_finder.py. More fine-grained tests are still definitely a good idea, especially for those edge cases that “works” but probably unintended (e.g. I don’t think it makes sense to allow an index URL pointing to an archive).

@mhsmith
Copy link

mhsmith commented Oct 19, 2018

Further to #5827, surely dropping a local directory path passed to --index-url is not the right thing to do. What would be the problem with setting expand_dir to True for --index-url? I think that's a more useful behavior, and it's already implied by the documentation when it says --index-url "should point to a repository compliant with PEP 503 (the simple repository API) or a local directory laid out in the same format".

@uranusjr
Copy link
Member Author

uranusjr commented Oct 20, 2018

PEP 503 implies the endpoint should be specified by URLs. --index-url already supports PEP 503 behaviour (but not through expand_dir, which does a slightly different thing) if you pass it a directory as a file: URL. What is being dropped now is when you pass a path, which I am not sure is specified.

It is indeed more helpful to treat paths the same as file: URLs passed to --index-url (but not expand_dir=True, again because that is different), but I honestly don’t think it is a good idea to be too helpful here. pip is the de-facto reference implementation of Python packaging, and it is kind of treated above the standards by users. Adding unspecified features would force implementers of alternative PEP 503 clients to duplicate undocumented behaviour, and that can be unproductive IMO. I’d agree it is better to be helpful if PEP 503 is amended/clarified to explicitly explain how paths can be used to specify a simple API.

@mhsmith
Copy link

mhsmith commented Oct 20, 2018

PEP 503 only specifies an HTTP-based protocol. It says nothing about server-side filesystem layout. The words "directory" and "index.html" do not appear anywhere.

So what does the pip documentation mean by "a local directory laid out in the same format", when the PEP does not actually specify a format at all? The reasonable (and useful) interpretation is that pip will emulate the behavior of common web servers, namely:

@pradyunsg
Copy link
Member

pradyunsg commented Oct 7, 2019

@cjerdonek I think we can close this but I'll let you have the fun of clicking close on this. :)

@pradyunsg
Copy link
Member

pradyunsg commented Jan 21, 2020

Given that I've not seen much activity from @cjerdonek lately, I'll take the liberty of closing this issue. :)

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Feb 21, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Feb 21, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation C: finder PackageFinder and index related code type: refactor Refactoring code
Projects
None yet
Development

No branches or pull requests

4 participants