Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functions for parsing wheel and sdist filenames #387

Merged
merged 5 commits into from
Jan 20, 2021

Conversation

pfmoore
Copy link
Member

@pfmoore pfmoore commented Jan 19, 2021

I finally got fed up of writing the same code to parse wheel and sdist filenames, so I thought it might be worth having a "standard" version.

As written, the code only handles sdist filenames in the form described in the packaging standards (.tar.gz files with the form {name}-{version}.tar.gz). It could be extended to handle more of the cases in use on PyPI (e.g., .zip format sdists) but I chose in the first instance to stick with the stricter definition that's the current de facto standard.

@pfmoore
Copy link
Member Author

pfmoore commented Jan 19, 2021

The docs build is failing. It seems that the doctest step is checking against the released code, not against the in-development sources. I'm not sure what I'm doing wrong to have this happen. Advice would be appreciated 🙂

@pradyunsg
Copy link
Member

Try adding a session.install("-e", ".") go the docs function in the noxfile?

@pfmoore
Copy link
Member Author

pfmoore commented Jan 19, 2021

Thanks, that fixed it. I spent ages trying to work out why there was an older version of packaging present. (I'm still not clear why that's happening, or how the other doctests didn't fail when they were added, but it's working now, so 🤷

@pradyunsg
Copy link
Member

nox is like tox, each run happens in it's own isolated environment. However, unlike tox, nox doesn't install your package automatically - you need to do that explicitly.

We hadn't. :)

@pradyunsg
Copy link
Member

how the other doctests didn't fail when they were added

Pretty sure we weren't testing them earlier.

@uranusjr
Copy link
Member

I built a package for this functionality a while ago: https://github.com/uranusjr/packaging-dists

The interface proposed here is slightly lower level, but the parsing part should be the same. I’ll try to find some time to compare implementations and see if either is missing edge cases.

)

parts = filename.split("-", dashes - 2)
name = canonicalize_name(parts[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should instead check that name == canonical name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escaping rules for wheel filenames aren't the same in PEP 427 as in PEP 503. I'm more interested in parsing the data than validating, so I chose to return the canonicalized name rather than try to apply a check. I could be persuaded to not canonicalize at all, but I'm against trying to validate.

Validating the name component against PEP 427 would be "only alphanumeric and underscores, and underscores can't occur together". Which I don't think is that useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth documenting what these functions don't check compared to the PEPs? That way anyone using them who wants to be strict know what more they would need to do on top of what the functions return?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection, doing the check and raising an error if the name part isn't valid isn't that hard, and I'm fine with raising an error if it's invalid. The check is so minimal I don't expect anything to fail it in any practical situation...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we don't validate, then yes, I'm fine with documenting what we don't do 🙂

@pfmoore
Copy link
Member Author

pfmoore commented Jan 19, 2021

I’ll try to find some time to compare implementations and see if either is missing edge cases.

I had a quick look. The main differences I can see are that you allow more formats for sdists (technically only .tar.gz is allowed nowadays, thanks to PEP 517, but as I say, I'm happy to add more) and you don't assume PEP 440 versions for sdists, so you can't simply split on the last -. I'm not personally interested in legacy versions, and packaging is deprecating them, so I don't think this function should worry about them.

Let me know if I missed anything, though, or you disagree on any of the above.

@brettcannon
Copy link
Member

Thanks for doing this, @pfmoore (and @uranusjr for your version)! I also came close to doing this myself for mousebender (which I do hope to upstream here after we drop Python 2 support), so I definitely think there's demand for this. 😉

@sinoroc
Copy link
Contributor

sinoroc commented Jan 19, 2021

I definitely think there's demand for this

I can confirm, this is something I would find very useful.

@pradyunsg
Copy link
Member

I'll hold off on clicking the merge button, until Paul decides whether he wants to add the validation for the name. Regardless, once he's happy with this, I think this is ready for merging. :)

@pfmoore
Copy link
Member Author

pfmoore commented Jan 19, 2021

Let me add the validation. I'll do that tomorrow then this can be good to go :-)

@pfmoore
Copy link
Member Author

pfmoore commented Jan 20, 2021

OK, validation added. I tried to add a test for the Unicode case, but Python 2 made me cry, so I just dropped it. If anyone can tell me how I'd do that cleanly, I'm happy to add it back, but it's a bit of a corner case, so I'm not too worried. (My first thought was to slap a u"..." around the string, but I don't know if the type annotations would approve...)

@pfmoore
Copy link
Member Author

pfmoore commented Jan 20, 2021

@pradyunsg This is ready for merge now, IMO.

@pradyunsg pradyunsg merged commit da68002 into pypa:master Jan 20, 2021
@pradyunsg
Copy link
Member

Thanks @pfmoore. I've made an issue to follow up on the unicode test, once we drop Python 2. :)

@pfmoore pfmoore deleted the parse_filenames branch January 20, 2021 12:46
@pfmoore
Copy link
Member Author

pfmoore commented Jan 20, 2021

Excellent, thanks! I wasn't sure what the timescale was for dropping Python 2, but that sounds like the ideal solution.

@brettcannon
Copy link
Member

@pfmoore The plans are as soon as we do one more release (see #385 ).

@brettcannon
Copy link
Member

Quick question before we release this: according to PEP 427, the build tag should be sorted "as the empty string if unspecified". Should we then have parse_wheel_filename() return "" instead of None when there's no build tag?

BTW, I'm expecting that as part of upstreaming mousebender's simple index code there will be an improvement to group files together, which will mean some code to sort by build tag which has some interesting requirements:

Sort as the empty string if unspecified, else sort the initial digits as a number, and the remainder lexicographically.

@pfmoore
Copy link
Member Author

pfmoore commented Jan 23, 2021

Good point, yes that makes sense.

@brettcannon
Copy link
Member

Sorry about this, but I'm thinking through how to group wheel files from a Simple repository API perspective and it's causing me to realize things in stages. 😉

Maybe this function is where we should do the work to parse the build tag for proper sorting? Looking at the default used in pip it's actually the empty tuple. Then when there is a build tag it's a tuple of (int, str) to get the sorting that PEP 491 specifies. Maybe for typing purposes the default should be (0, '')? Or maybe we could use a type alias like we use for normalized package names so people don't need to care about the underlying tuple structure? I have no magical insight here, BTW, just ideas. 😄

Since the appropriate wheel can't be selected without sorting by build tag as shown by the sort key in pip (if I'm reading that sort key tuple appropriately), it seems this work can't be avoided and so it might as well be centralized here and done automatically.

And I will open an issue as appropriate for any of these ideas based on what we all agree on.

@pfmoore
Copy link
Member Author

pfmoore commented Jan 24, 2021

I'm fine with returning something more structured for the build tag. I've very rarely seen it used "in the wild", so I did the simplest thing that would work to start with, but I'm happy to expand on it.

I think a tuple is probably sufficient - I don't think there's any need to name the parts. I haven't done enough with type hints to know how I feel about using a type alias, so I'd go with what others feel is reasonable there.

BTW, I'm curious to know how you plan on using this in mousebender. I have a library, shadwell that implements the package selection part of pip's finder functionality - that will need this change - and from what I recall when we last spoke you didn't feel like that was a good fit for mousebender. Have you since changed your mind, or are you planning to use this for something different?

@brettcannon
Copy link
Member

@pfmoore the reason I bring all of this up in terms of mousebender is grouping results from a Simple repository API. Taking https://download.pytorch.org/whl/torch_stable.html as an example, I would assume the results from a Simple repository index isn't useful unless you group by:

  1. Project
  2. Version
  3. Sort by build tag
  4. (Frozen) set of wheel tags

Since the build tag seemingly has an innate sort order which is agnostic of what sort of wheel one is after, I figured having the build tag pulled together upfront made sense (i.e. build tag takes precedence over wheel tag matching).

Now if I have read the PEPs and pip source code wrong and the build tag doesn't play that much of an important part and is really more for breaking ties when matching by wheel tag, please let me know as that means we can skip the parsing upfront. That would save CPU cycles for the common case of having only one wheel tag match per version.

BTW, I'm going to open an issue to track all of this in a more appropriate location. 😄

@brettcannon
Copy link
Member

#389

@henryiii
Copy link
Contributor

henryiii commented Jan 26, 2021

FYI, parse_wheel_filename('spam-0.1.0-cp39-cp39-macosx_10_9_universal2.macosx_11_0_universal2.whl') produces ('spam', <Version('0.1.0')>, None, frozenset([<cp39-cp39-macosx_10_9_universal2 @ 4329376176>, <cp39-cp39-macosx_11_0_universal2 @ 4329376256>])). Why does it not give access to the python/abi/platform tags, out of curiosity?

Edit: partially answering my question, they are grouped together.

@pfmoore
Copy link
Member Author

pfmoore commented Jan 26, 2021

Why does it not give access to the python/abi/platform tags, out of curiosity?

It returns a frozenset of tag objects, which have interpreter, abi and platform attributes...

@henryiii
Copy link
Contributor

Ahh, excellent, that's useful, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants