Use PEP 503 rules to validate upload filename #10072

Closed
wants to merge 7 commits into from

uranusjr
Contributor

Fix #10030.

@uranusjr uranusjr force-pushed the standard-dist-name-check branch 3 times, most recently from 202e90c to bb94e39 Compare September 24, 2021 21:14
@takluyver
Contributor

This fix looks good to me. 👍

@takluyver
Contributor

I see that one test, test_upload_fails_with_diff_filename_same_blake2, is failing, because it sends an sdist with a filename ending -fake.tar.gz, and parse_sdist_filename rejects fake as a version number. I would hope it's appropriate to just change the test.

@uranusjr
Contributor Author

uranusjr commented Nov 26, 2021

Thanks for taking a look, TBH I totally forgot about this. I’ll work on it this weekend.

@uranusjr uranusjr force-pushed the standard-dist-name-check branch 2 times, most recently from c6e2a38 to 5ee21e6 Compare November 26, 2021 11:07
@uranusjr
Contributor Author

I changed the test to use a valid version string, and also added a try-catch to _is_valid_filename so the endpoint does not crash with 500 when the input is invalid.
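
The shape of that fix can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: parse_sdist_filename_sketch and is_valid_filename are hypothetical names, and the version regex is a deliberately simplified stand-in for full PEP 440 parsing.

```python
import re

# Simplified stand-in for PEP 440: digits separated by dots, e.g. "1.0".
# Real version parsing accepts many more forms (pre-releases, post-releases, ...).
_SIMPLE_VERSION = re.compile(r"^\d+(\.\d+)*$")


def parse_sdist_filename_sketch(filename):
    """Split "<name>-<version>.tar.gz" into (name, version), or raise ValueError."""
    if not filename.endswith(".tar.gz"):
        raise ValueError(f"not a .tar.gz sdist: {filename!r}")
    stem = filename[: -len(".tar.gz")]
    name, sep, version = stem.rpartition("-")
    if not sep or not name:
        raise ValueError(f"missing name or version: {filename!r}")
    if not _SIMPLE_VERSION.match(version):
        raise ValueError(f"invalid version {version!r} in {filename!r}")
    return name, version


def is_valid_filename(filename):
    """Return False for unparsable input instead of letting the exception escape."""
    try:
        parse_sdist_filename_sketch(filename)
    except ValueError:
        return False
    return True
```

With this shape, a filename such as the -fake.tar.gz one from the failing test produces a plain validation failure that the endpoint can report to the client, rather than an unhandled exception.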

@uranusjr uranusjr force-pushed the standard-dist-name-check branch 3 times, most recently from 782adaf to 8ac3105 Compare November 26, 2021 12:24
@jaraco
Contributor

jaraco commented Dec 2, 2021

Are there tests that validate that a project name with a . in it can be uploaded?

I attempted to test the change. I followed the getting started guide, but the steps didn't work. I can report that issue separately.

This change looks right to me, especially if it's strictly more lenient.

@uranusjr
Contributor Author

uranusjr commented Dec 2, 2021

There were no tests that validate anything (in the project name) except alphabetic characters. I can probably add some.

@jaraco
Contributor

jaraco commented Dec 3, 2021

I just confirmed that the change won't block uploading unmangled filenames:

image

But I do worry that it's allowing uploading files with mangled filenames. Worse, it allows uploading of duplicate artifacts that vary only by the mangled name:

image

At the very least, Warehouse should reject those duplicates.

A period is a valid character in the filename and in my opinion the preferred character, because that's the character that's used to separate the python packages that the distribution represents and it's also the character used in the project name. My preference would be for warehouse to simply require that flit (or whomever) to use an unmangled name.

@jaraco
Contributor

jaraco commented Dec 3, 2021

When I tried downloading jaraco.develop and jaraco_develop, I got the same artifact:

jaraco.develop main $ pip-run -i http://localhost/pypi --no-deps jaraco.develop -- -c pass
Looking in indexes: http://localhost/pypi
Collecting jaraco.develop
  Downloading http://localhost:9001/packages/4f/a4/edf484a882d669833045a4eaa3aafeae150e2a44a2f22bc96c1a0d271f81/jaraco.develop-7.9.0-py3-none-any.whl (10 kB)
Installing collected packages: jaraco.develop
Successfully installed jaraco.develop-7.9.0
jaraco.develop main $ pip-run -i http://localhost/pypi --no-deps jaraco_develop -- -c pass
Looking in indexes: http://localhost/pypi
Collecting jaraco_develop
  Downloading http://localhost:9001/packages/4f/a4/edf484a882d669833045a4eaa3aafeae150e2a44a2f22bc96c1a0d271f81/jaraco.develop-7.9.0-py3-none-any.whl (10 kB)
Installing collected packages: jaraco-develop
Successfully installed jaraco-develop-7.9.0

I don't know if that's an arbitrary selection or based on the order that the two artifacts were uploaded.

@takluyver
Contributor

A period is a valid character in the filename and in my opinion the preferred character

A period there is not valid, according to the current wheel spec:

In distribution names, any run of -_. characters (HYPHEN-MINUS, LOW LINE and FULL STOP) should be replaced with _ (LOW LINE). This is equivalent to PEP 503 normalisation followed by replacing - with _.

This was the outcome of this previous discussion. The doc on packaging.python.org is the canonical version (and PEP 427 has had a note added to point to that).

In which case, Warehouse ought to reject jaraco.develop-7.9.0-py3-none-any.whl. Though I imagine we might want to be somewhat lax if that's what setuptools & wheel produce.
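
As a rough illustration of the rule quoted above, the following sketch normalises a distribution name and checks whether the name part of a wheel filename is already in that form. Both function names are made up for this example, and the lowercasing step comes from PEP 503 normalisation rather than from the quoted sentence itself.

```python
import re


def normalize_for_filename(name):
    """PEP 503 normalisation, then "-" replaced with "_" (the rule quoted above)."""
    pep503 = re.sub(r"[-_.]+", "-", name).lower()
    return pep503.replace("-", "_")


def name_part_is_normalized(wheel_filename):
    """Check only the distribution-name component of a wheel filename."""
    # Hyphens separate the components, so the name is everything before the first one.
    name_part = wheel_filename.split("-", 1)[0]
    return name_part == normalize_for_filename(name_part)
```

Under a strict reading, jaraco_develop-7.9.0-py3-none-any.whl passes this check and jaraco.develop-7.9.0-py3-none-any.whl does not.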

Of course, we could change the written spec (again) to reflect what setuptools/wheel/warehouse already do. It's an easy enough change to Flit. But I think both of the possible changes have other drawbacks:

  • If we allow . to stay, and just replace runs of -_ with _, then you get a different wheel filename depending on whether you start from the PEP 503 normalised name or from the un-normalised name. (Unless we change PEP 503 as well, which would also affect .dist-info directories).
  • If we normalise -_. to . instead of _, you avoid that kind of ambiguity, but retrospectively invalidate filenames of many existing wheels.
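
The first drawback can be made concrete with a small sketch. keep_dots_rule is a hypothetical implementation of the "allow . to stay" alternative, written only to show the ambiguity; it is not taken from any spec or tool.

```python
import re


def pep503_normalize(name):
    """PEP 503: lowercase, with runs of "-", "_" and "." collapsed to "-"."""
    return re.sub(r"[-_.]+", "-", name).lower()


def keep_dots_rule(name):
    """Hypothetical alternative: keep ".", replace only runs of "-" and "_" with "_"."""
    return re.sub(r"[-_]+", "_", name)


original = "jaraco.develop"
# Starting from the name as the author wrote it:
from_original = keep_dots_rule(original)
# Starting from the PEP 503 normalised name instead:
from_normalized = keep_dots_rule(pep503_normalize(original))
```

Here the two starting points give jaraco.develop and jaraco_develop respectively, so a tool could not derive the wheel filename from the normalised name alone.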

@takluyver
Contributor

Looking back at the discussion that led to that change, most of the conversation was about the version part, not the name. @uranusjr made a PR which referenced the .dist-info specification for both name & version parts, and then in my PR I copied the details from there. It looks like neither I nor anyone reviewing it particularly noticed that it was a sizeable change to the rules on the name part.

It is still nice to have the same rules for normalisation in wheel filenames, sdist filenames and .dist-info directory names, though.

@uranusjr
Contributor Author

uranusjr commented Dec 3, 2021

My preference would be for warehouse to simply require that flit (or whomever) to use an unmangled name.

IMO this would make duplication worse, because this means an identical wheel can have a myriad of names depending on how tools generate it. We’re focusing on the dot here, but this would also open doors for upper-case letters and continuous runs of dashes and underscores, which setuptools does normalise right now. Its current normalisation (or mangling if you prefer) logic is entirely arbitrary, and there is absolutely no reason to use that as the standard behaviour, and I would advocate removing support for that entirely if we had not been doing it for like ten years.

My preference would be for warehouse to simply require that setuptools (or whomever) to use a properly normalised name, instead of doing things half-way and forcing everyone else to play the same. But that’s not really possible, so the next best thing is to allow both the normalisation logic that makes sense (which this PR does), and tolerate what setuptools is currently doing.

@jaraco
Contributor

jaraco commented Dec 5, 2021

I agree - the specs and the implementations should agree. And there's a good argument to be made that a PEP 503 normalized name should be used anywhere an internal representation of the project name is needed. My concern is primarily with how this normalization is bleeding into the external experience. I'll open a new discussion to capture and discuss this concern.

I started this discussion to capture my concerns and proposed ways forward.

@uranusjr
Contributor Author

The reason this “bleeds into the external experience” is how setuptools has been implementing the specifications wrong. What do you propose instead? The only way we can avoid any of this “bleeding into” is to either make setuptools’s current behaviour the standard, or prohibit what setuptools is currently doing and break thousands of people’s workflows. Both are inviable, as I already explained, so this is my proposal to make things work. What do you propose instead? Because at this point it seems like you are holding things back without any viable alternatives, for reasons that nobody else seems to agree with.

@jaraco
Contributor

jaraco commented Dec 12, 2021

I propose the following:

  • Accept this change. I don't think this change is particularly harmful regardless of the outcome, except that it opens up the security vulnerability allowing a release's artifacts to be mutated.
  • Optionally, address the security vulnerability by disallowing artifact collisions based on the normalization rules.

Other actions I'd prefer to see:

  • PyPA should update (provisionally, for now) the sdist/wheel specs to soften the normalization standard for distribution artifacts and metadata filenames to match the Setuptools behavior. Why? There's nothing about PEP 503 that indicates that the normalization should apply to these targets and introducing the PEP 503 constraint of mangling names with . has problematic consequences.
  • flit (and other backends) should minimally mangle these names based on the softer normalization rules for distribution artifacts and metadata filenames.
  • I'll continue to follow up in the discussion thread. I've been busy and didn't realize how contentious this issue would be so I need to spend some more time on it. Perhaps the outcome is that the packaging ecosystem should mangle . characters in these artifacts, but that decision shouldn't be made lightly as an implied consequence of normalization within warehouse.

@di di force-pushed the standard-dist-name-check branch from 8ac3105 to ea0289c Compare December 14, 2021 01:51
@di
Member

di commented Dec 14, 2021

@uranusjr Thanks for the PR. I reviewed and wanted to make some modifications, hope you don't mind me pushing them. Summary of changes here:

  • 2fcf144 - I made the error messages a bit more clear/detailed, and to be more like the original error here when the prefix didn't match. I also added comments acknowledging that the current behavior technically violates the wheel spec.
  • 6a79492 - I added a test for the behavior which is violating the spec. I don't think we're in a place where PyPI can stop accepting filenames that don't conform to the spec right now, and want to ensure we don't accidentally break users that are relying on this until we have time to make a decision and update tooling if necessary.
  • ea0289c - I changed PyPI's behavior to consider canonical project names and versions in the filename when checking for duplicate filenames, rather than the explicit version in the filename. This should eliminate the undesired behavior that @jaraco noted above.

I'll leave this open for ~24 hours or so for folks in this thread to review, but I think this satisfies #10030 as well as other concerns raised here.

@@ -758,7 +766,8 @@ def _is_duplicate_file(db_session, filename, hashes):

     if file_ is not None:
         return (
-            file_.filename == filename
+            # This has the effect of canonicalizing the project name and version
+            _parse_filename(file_.filename) == _parse_filename(filename)
Member

Is the intention here to ensure that I can't upload foo_bar-0.1.0.tar.gz if foo.bar-0.1.0.tar.gz already exists?

If so, I don't think this quite handles it. In the query above we filter on if the filenames OR hashes collide.

In the test for this change, we always collide hashes... and I'm pretty sure it would fail to catch the "collision" marked above.
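
A simplified sketch of the gap described here, using an in-memory list in place of the database. The record layout, find_by_filename_or_hash, and is_duplicate_file are hypothetical stand-ins, not Warehouse's actual query or helper, and canonicalisation is reduced to PEP 503 applied to the part before the last hyphen.

```python
import re

# Hypothetical stand-in for already-uploaded files: (filename, hash) pairs.
EXISTING = [("foo.bar-0.1.0.tar.gz", "hash-a")]


def canonical(filename):
    """Canonicalise the project-name part of "<name>-<version>.tar.gz" (PEP 503)."""
    name, _, rest = filename.rpartition("-")
    return re.sub(r"[-_.]+", "-", name).lower() + "-" + rest


def find_by_filename_or_hash(filename, file_hash):
    """Mimics a lookup that matches on exact filename OR hash."""
    for existing_name, existing_hash in EXISTING:
        if existing_name == filename or existing_hash == file_hash:
            return existing_name
    return None


def is_duplicate_file(filename, file_hash):
    found = find_by_filename_or_hash(filename, file_hash)
    if found is None:
        # Nothing matched the lookup, so the canonical comparison never runs.
        return None
    return canonical(found) == canonical(filename)
```

In this sketch, uploading foo_bar-0.1.0.tar.gz with the same hash is detected as a duplicate, but the same name with different contents is not found by the lookup at all, which is the collision raised in this comment.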

Contributor Author

It’s also not quite easy to check for non-canonicalized file names from the database here, at least not without some schema shakeup. This is sort of a “best effort” to detect the user uploading the same file; duplicates are still possible, but they wouldn’t be that problematic and it’s not worth it to eliminate them.

Member

@ewdurbin ewdurbin left a comment

Aside from the concern with normalized filename collisions raised above, my only other request would be to add a section to the /help page similar to https://pypi.org/help/#file-name-reuse that we can link people to directly from the log, à la https://github.com/pypa/warehouse/blob/c6c4033477feca74aee226d9f3b27f445d1aa964/warehouse/forklift/legacy.py#L1291-L1293.

Ideally it has some helpful context as to why their previously working upload pipeline started failing.

@CAM-Gerlach

FWIW, it seems like the related Discourse discussion referred to above has been resolved in favor of adhering to the spec, and further explicitly specifying in pypa/packaging.python.org#1032 to also normalize uppercase to lowercase characters (which is the status quo in Warehouse as well as Setuptools, right?) while clarifying that implementations should tolerate the legacy behavior. I'm not sure something specific was decided about name collisions on upload, though (which seems to be a concern?)

@wwuck

wwuck commented Feb 8, 2022

Is there any update on progress for this issue? I'm still having a problem when trying to upload a PEP 420 namespace package created with the latest flit to testpypi.

@takluyver
Contributor

add a section to the /help page similar to https://pypi.org/help/#file-name-reuse

Here's my attempt at that; feel free to adapt it. I've mentioned that the rules used to be more lax, but implied that they're strictly enforced now, which is not actually true. I think that the complexity of trying to explain that probably outweighs strict accuracy.


Why am I getting an "Invalid filename" error?

Package files uploaded to PyPI need to have names in a set format - your packaging tools should create files with suitable names by default. The filenames consist of a number of parts, separated by hyphens (-):

  • For source distributions: normalised project name and version. E.g. importlib_metadata-4.10.1.tar.gz
  • For wheels: normalised project name, version, build tag (optional), Python tag, ABI tag, and platform tag. E.g. importlib_metadata-4.10.1-py3-none-any.whl

Normalised project names are lowercase, with any runs of _-. characters replaced with a single _. Version numbers should be normalised as described in PEP 440. All parts are expected to consist only of ASCII characters.

These rules have become stricter over time, so you may see existing packages with names which would no longer be allowed for new uploads.
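
To illustrate the wheel format described above, here is a minimal sketch that splits a wheel filename into its parts. split_wheel_filename is a hypothetical helper for this example; it only counts hyphen-separated parts and does not validate the name, version, or tags.

```python
def split_wheel_filename(filename):
    """Split a wheel filename into its hyphen-separated parts."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None  # the build tag is optional
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"expected 5 or 6 parts, got {len(parts)}: {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }
```

Because hyphens are the separators, a project name that still contains a hyphen shifts every later part by one position, which is one reason the name has to be normalised before it goes into the filename.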

This is needed to guarantee parse_[sdist|wheel]_filename functions.
And link to it in the error message.
@uranusjr uranusjr force-pushed the standard-dist-name-check branch from 40ed5d6 to b14eff7 Compare April 19, 2022 05:21
@uranusjr
Contributor Author

I incorporated the documentation addition. Thanks!

Member

@dstufft dstufft left a comment

The situation about what file names are acceptable is extremely messy right now, see this discuss thread.

I would prefer if we didn't change Warehouse here until someone does the work to actually fix the specs as to what the requirements on file names are, since any change we make here has risks to make the situation even messier.

@uranusjr
Copy link
Contributor Author

Closing this until we are more sure what to actually change in Warehouse. We can always reopen, although a new PR is probably preferred anyway due to conflicts.

Development

Successfully merging this pull request may close these issues.

PyPI does not accept wheel file name with . replaced with _
8 participants