-
Notifications
You must be signed in to change notification settings - Fork 983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use PEP 503 rules to validate upload filename #10072
Conversation
202e90c
to
bb94e39
Compare
This fix looks good to me. 👍 |
I see that one test - |
Thanks for taking a look, TBH I totally forgot about this. I’ll work on it this weekend. |
c6e2a38
to
5ee21e6
Compare
I changed the test to use a valid version string, and also added a try-catch to |
782adaf
to
8ac3105
Compare
Are there tests that validate that a project name with a I attempted to test the change. I followed the getting started guide, but the steps didn't work. I can report that issue separately. This change looks right to me, especially if it's strictly more lenient. |
There were not tests that validate anything (in the project name) except alphabets. I can probably add some. |
I just confirmed that the change won't block uploading unmangled filenames: But I do worry that it's allowing uploading files with mangled filenames. Worse, it allows uploading of duplicate artifacts that vary only by the mangled name: At the very least, Warehouse should reject those duplicates. A period is a valid character in the filename and in my opinion the preferred character, because that's the character that's used to separate the python packages that the distribution represents and it's also the character used in the project name. My preference would be for warehouse to simply require that flit (or whomever) to use an unmangled name. |
When I tried downloading
I don't know if that's an arbitrary selection or based on the order that the two artifacts were uploaded. |
A period there is not valid, according to the current wheel spec:
This was the outcome of this previous discussion. The doc on packaging.python.org is the canonical version (and PEP 427 has had a note added to point to that). In which case, wheelhouse ought to reject Of course, we could change the written spec (again) to reflect what setuptools/wheel/warehouse already do. It's an easy enough change to Flit. But I think both of the possible changes have other drawbacks:
|
Looking back at the discussion that led to that change, most of the conversation was about the version part, not the name. @uranusjr made a PR which referenced the It is still nice to have the same rules for normalisation in wheel filenames, sdist filenames and |
IMO this would make duplication worse, because this means an identical wheel can have a myriad of names depending on how tools generate it. We’re focusing on the dot here, but this would also open doors for upper-case letters and continuous run of dashes and underscores, which setuptools does normalise right now. Its current normalisation (or mangling if you prefer) logic is entirely arbitrary, and there are absolutely no reason to use that as the standard behaviour, and I would advocate removing support for that entirely if we have not been doing it for like ten years. My preference would be for warehouse to simply require that setuptools (or whomever) to use a properly normalised name, instead of doing things half-way and force everyone else to play the same. But that’s not really possible, so the next best thing is to allow both the normalisation logic that makes sense (which this PR does), and tolarete what setuptools is currently doing. |
I agree - the specs and the implementations should agree. And there's a good argument to be made that a PEP 503 normalized name should be used anywhere an internal representation of the project name is needed. My concern is primarily with how this normalization is bleeding into the external experience. I'll open a new discussion to capture and discuss this concern. I started this discussion to capture my concerns and proposed ways forward. |
The reason this “bleeds into the external experience” is how setuptools has been implementing the specifications wrong. What do you propose instead? The only way we can avoid any of this “bleeding into” is to either make setuptools’s current behaviour the standard, or prohibit what setuptools is currently doing and break thousands of people’s workflows. Both are inviable, as I already explained, so this is my proposal to make things work. What do you propose instead? Because at this point it seems like you are holding things back without any viable alternatives, for reasons that nobody else seems to agree with. |
I propose the following:
Other actions I'd prefer to see:
|
8ac3105
to
ea0289c
Compare
@uranusjr Thanks for the PR. I reviewed and wanted to make some modifications, hope you don't mind me pushing them. Summary of changes here:
I'll leave this open for ~24 hours or so for folks in this thread to review, but I think this satisfies #10030 as well as other concerns raised here. |
@@ -758,7 +766,8 @@ def _is_duplicate_file(db_session, filename, hashes): | |||
|
|||
if file_ is not None: | |||
return ( | |||
file_.filename == filename | |||
# This has the effect of canonicalizing the project name and version | |||
_parse_filename(file_.filename) == _parse_filename(filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the intention here to ensure that I can't upload foo_bar-0.1.0.tar.gz
if foo.bar-0.1.0.tar.gz
already exists?
If so, I don't think this quite handles it. In the query above we filter on if the filenames OR hashes collide.
In the test for this change, we always collide hashes... and I'm pretty sure it would fail to catch the "collision" marked above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s also not quite easy to check for non-canonicalized file names from the database here, at least not without some schema shakeup. This is sort of a “best effort” to detect the user uploading the same file; duplicates are still possible, but they wouldn’t be that problematic and it’s not worth it to eliminate them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from concern with normalized filename collisions raised above, my only other request would be to add a section to the /help page similar to https://pypi.org/help/#file-name-reuse that we can link people to directly from the log ala https://github.com/pypa/warehouse/blob/c6c4033477feca74aee226d9f3b27f445d1aa964/warehouse/forklift/legacy.py#L1291-L1293.
Ideally it has some helpful context as to why their previously working upload pipeline started failing.
FWIW, it seems like the related Discourse discussion referred to above has been resolved in favor of adhering to the spec, and further explicitly specifying in pypa/packaging.python.org#1032 to also normalize uppercase to lowercase characters (which is the status quo in Warehouse as well as Setuptools, right?) while clarifying that implementations should tolerate the legacy behavior. I'm not sure something specific was decided about name collisions on upload, though (which seems to be a concern?) |
Is there any update on progress for this issue? I'm still having a problem when trying to upload a pep420 namespace package created with latest flit to testpypi. |
Here's my attempt at that; feel free to adapt it. I've mentioned that the rules used to be more lax, but implied that they're strictly enforced now, which is not actually true. I think that the complexity of trying to explain that probably outweighs strict accuracy. Why am I getting an "Invalid filename" error? Package files uploaded to PyPI need to have names in a set format - your packaging tools should create files with suitable names by default. The filenames consist of a number of parts, separated by hyphens (
Normalised project names are lowercase, with any runs of These rules have become stricter over time, so you may see existing packages with names which would no longer be allowed for new uploads. |
This is needed to guarantee parse_[sdist|wheel]_filename functions.
And link to it in the error message.
40ed5d6
to
b14eff7
Compare
I incorporated the documentation addition. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The situation about what file names are acceptable is extremely messy right now, see this discuss thread.
I would prefer if we didn't change Warehouse here until someone does the work to actually fix the specs as to what the requirements on file names are, since any change we make here has risks to make the situation even messier.
Closing this until we are more sure what to actually change in Warehouse. We can always reopen, although a new PR is probably preferred anyway due to conflicts. |
Fix #10030.