-
Notifications
You must be signed in to change notification settings - Fork 250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add functions for parsing wheel and sdist filenames #387
Conversation
The docs build is failing. It seems that the doctest step is checking against the released code, not against the in-development sources. I'm not sure what I'm doing wrong to have this happen. Advice would be appreciated 🙂 |
6a8c53a
to
ad5b301
Compare
Try adding a |
Thanks, that fixed it. I spent ages trying to work out why there was an older version of packaging present. (I'm still not clear why that's happening, or how the other doctests didn't fail when they were added, but it's working now, so 🤷 |
nox is like tox, each run happens in it's own isolated environment. However, unlike tox, nox doesn't install your package automatically - you need to do that explicitly. We hadn't. :) |
Pretty sure we weren't testing them earlier. |
I built a package for this functionality a while ago: https://github.com/uranusjr/packaging-dists The interface proposed here is slightly lower level, but the parsing part should be the same. I’ll try to find some time to compare implementations and see if either is missing edge cases. |
packaging/utils.py
Outdated
) | ||
|
||
parts = filename.split("-", dashes - 2) | ||
name = canonicalize_name(parts[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should instead check that name == canonical name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The escaping rules for wheel filenames aren't the same in PEP 427 as in PEP 503. I'm more interested in parsing the data than validating, so I chose to return the canonicalized name rather than try to apply a check. I could be persuaded to not canonicalize at all, but I'm against trying to validate.
Validating the name component against PEP 427 would be "only alphanumeric and underscores, and underscores can't occur together". Which I don't think is that useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth documenting what these functions don't check compared to the PEPs? That way anyone using them who wants to be strict know what more they would need to do on top of what the functions return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On reflection, doing the check and raising an error if the name part isn't valid isn't that hard, and I'm fine with raising an error if it's invalid. The check is so minimal I don't expect anything to fail it in any practical situation...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we don't validate, then yes, I'm fine with documenting what we don't do 🙂
I had a quick look. The main differences I can see are that you allow more formats for sdists (technically only Let me know if I missed anything, though, or you disagree on any of the above. |
I can confirm, this is something I would find very useful. |
I'll hold off on clicking the merge button, until Paul decides whether he wants to add the validation for the name. Regardless, once he's happy with this, I think this is ready for merging. :) |
Let me add the validation. I'll do that tomorrow then this can be good to go :-) |
8c168e7
to
56b1dfd
Compare
OK, validation added. I tried to add a test for the Unicode case, but Python 2 made me cry, so I just dropped it. If anyone can tell me how I'd do that cleanly, I'm happy to add it back, but it's a bit of a corner case, so I'm not too worried. (My first thought was to slap a u"..." around the string, but I don't know if the type annotations would approve...) |
@pradyunsg This is ready for merge now, IMO. |
Thanks @pfmoore. I've made an issue to follow up on the unicode test, once we drop Python 2. :) |
Excellent, thanks! I wasn't sure what the timescale was for dropping Python 2, but that sounds like the ideal solution. |
Quick question before we release this: according to PEP 427, the build tag should be sorted "as the empty string if unspecified". Should we then have BTW, I'm expecting that as part of upstreaming mousebender's simple index code there will be an improvement to group files together, which will mean some code to sort by build tag which has some interesting requirements:
|
Good point, yes that makes sense. |
Sorry about this, but I'm thinking through how to group wheel files from a Simple repository API perspective and it's causing me to realize things in stages. 😉 Maybe this function is where we should do the work to parse the build tag for proper sorting? Looking at the default used in pip it's actually the empty tuple. Then when there is a build tag it's a tuple of Since the appropriate wheel can't be selected without sorting by build tag as shown by the sort key in pip (if I'm reading that sort key tuple appropriately), it seems this work can't be avoided and so it might as well be centralized here and done automatically. And I will open an issue as appropriate for any of these ideas based on what we all agree on. |
I'm fine with returning something more structured for the build tag. I've very rarely seen it used "in the wild", so I did the simplest thing that would work to start with, but I'm happy to expand on it. I think a tuple is probably sufficient - I don't think there's any need to name the parts. I haven't done enough with type hints to know how I feel about using a type alias, so I'd go with what others feel is reasonable there. BTW, I'm curious to know how you plan on using this in mousebender. I have a library, shadwell that implements the package selection part of pip's finder functionality - that will need this change - and from what I recall when we last spoke you didn't feel like that was a good fit for mousebender. Have you since changed your mind, or are you planning to use this for something different? |
@pfmoore the reason I bring all of this up in terms of mousebender is grouping results from a Simple repository API. Taking https://download.pytorch.org/whl/torch_stable.html as an example, I would assume the results from a Simple repository index isn't useful unless you group by:
Since the build tag seemingly has an innate sort order which is agnostic of what sort of wheel one is after, I figured having the build tag pulled together upfront made sense (i.e. build tag takes precedence over wheel tag matching). Now if I have read the PEPs and pip source code wrong and the build tag doesn't play that much of an important part and is really more for breaking ties when matching by wheel tag, please let me know as that means we can skip the parsing upfront. That would save CPU cycles for the common case of having only one wheel tag match per version. BTW, I'm going to open an issue to track all of this in a more appropriate location. 😄 |
FYI, Edit: partially answering my question, they are grouped together. |
It returns a frozenset of tag objects, which have |
Ahh, excellent, that's useful, thanks! |
I finally got fed up of writing the same code to parse wheel and sdist filenames, so I thought it might be worth having a "standard" version.
As written, the code only handles sdist filenames in the form described in the packaging standards (
.tar.gz
files with the form{name}-{version}.tar.gz
). It could be extended to handle more of the cases in use on PyPI (e.g.,.zip
format sdists) but I chose in the first instance to stick with the stricter definition that's the current de facto standard.