Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use PEP 508 rules when setting deps from extras #724

Merged
merged 11 commits into from
Jun 21, 2022

Conversation

mattoberle
Copy link
Contributor

PR Checklist

Please check if your PR fulfills the following requirements:

  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)

PR Type

What kind of change does this PR introduce?

  • Bugfix
  • Feature (please, look at the "Scope of the project" section in the README.md file)
  • Code style update (formatting, local variables)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Other... Please describe:

What is the current behavior?

Issue Number: #720

What is the new behavior?

PEP 426 states that distribution
names are case-insensitive and "-" is interchangeable with "_".

The pip-compile command creates a lockfile where all package names are lowercase.
The tool may also modify interchangeable characters.

The following examples are all valid requirements.txt or requirements_lock.txt entries:

SQLAlchemy[postgresql_psycopg2binary]==1.4.36
sqlalchemy[postgresql_psycopg2binary]==1.4.36
sentry_sdk[flask]==1.5.8
sentry-sdk[flask]==1.5.8

A distribution's METADATA file contains the case and delimiters chosen by the package publisher.
By applying a "sanitise" function when building the extras dict and when performing lookups we can eliminate this difference as a concern.

Note: Unlike the package name, the content inside the extras [] is case-sensitive and - vs _ is significant.

Does this PR introduce a breaking change?

  • Yes
  • No

Other information

While I believe this PR covers the spots required to fix the issue, it's pretty clunky.
One has to remember to sanitize keys before looking them up in the extras dict.
We could get around that by sanitizing the wheel name parsed from the metadata but I didn't look into what consequences that might have.

Additionally, replacing "-" with "_" was an arbitrary decision.
It could easily be "_" to "-".

This commit addresses issue bazelbuild#720.

[PEP 426](https://peps.python.org/pep-0426/#name) states that distribution
names are case-insensitive and "-" is interchangeable with "_".

The `pip-compile` command creates a lockfile where all package names are lowercase.
The tool may also modify interchangeable characters.

The following examples are all valid `requirements.txt` or `requirements_lock.txt` entries:

```
SQLAlchemy[postgresql_psycopg2binary]==1.4.36
sqlalchemy[postgresql_psycopg2binary]==1.4.36
sentry_sdk[flask]==1.5.8
sentry-sdk[flask]==1.5.8
```

A distribution's `METADATA` file contains the stylization chosen by the publisher.
By applying a "sanitise" function when building the `extras` dict and when performing lookups
we can eliminate this difference as a concern.
Comment on lines 26 to 33
def sanitise_requirement(requirement: str) -> str:
"""Given a requirement string, returns a normalized version of the requirement name.
The case and use of "_" vs "-" in a requirements file may vary.
https://peps.python.org/pep-0426/#name
"""
return requirement.replace("-", "_").lower()


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more correct to run both sides (whl.name and the entries in extras) through the "official" packaging.utils.canonicalize_name.

Or, if we don't have that dependency and don't want to take it on, use the code from PEP 503:

import re

def normalize(name):
    return re.sub(r"[-_.]+", "-", name).lower()

More specifically, this code also normalizes periods and replaces all contiguous runs of [-_.] with a single -. So My-_._-Package and my-package are equivalent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging that up @jvolkman.
Taking the regex from PEP 503 seems like a good fit.

I should probably update the name of the PR to reflect the more accurate PEP.

Copy link
Collaborator

@groodt groodt Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, PEP-426 is Withdrawn. PEP-508 or PEP-503 is what we should be referencing here.

We are in full control of the code. So normalizing when creating the mapping between binary distributions (wheels)-> extras requested should be done when the requirementst.txt is parsed, and when installing such a wheel and setting up the relevant dependencies for the extras.

Copy link
Contributor Author

@mattoberle mattoberle Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked through pypa/installer and didn't find a method for normalizing the distribution name while parsing wheel metadata:

However, I did find that pypa/pkg_resources provides an interface for normalization:

I applied this to the .name property on Wheel.

I had to apply the normalization in _parse_requirement_for_extra because that's the closest spot where the distribution name is isolated from the extras.

pkg_resources.Requirement can also parse the extras section but that applies safe_extra to each entry.
That may be a suitable replacement for the current regex in _parse_requirement_for_extra, as long as "fixing" the extras doesn't break their inclusion.

Edit: It looks like pkg_resources.Requirement can do the job, but I'll have to change some tests that currently assert non-normalized extras. (The code that swaps the regex for the code from pkg_resources is currently committed). I tried to find the context around the initial implementation to make sure this road hasn't been explored in the past but GitHub links to a different PR and the commit message is brief.

That seem viable or should I revert the latest commit and leave the extras-parsing as-is?

Edit 2: Oh, I see now. That method doesn't really handle anything outside of index packages.

Copy link
Collaborator

@groodt groodt Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look at what piptools was doing. They use the vendored version of packaging to normalize the name https://github.com/jazzband/pip-tools/blob/9fb2d425d6d7abb9d9b9c51ebb711af420da25f4/piptools/utils.py#L30

`pypa/installer` is used to parse Wheel metadata, but does not currently
provide a method for normalizing distribution names:

- pypa/installer#97

`pypa/pkg_resources` provides `Requirement.parse` which returns an instance
of `Requirement` where `.key` is the canonical distribution name per PEP 503.

The `Requirement` class can also parse `extras`, but it returns a normalized
form that I believe could break the installation of the extras.
@mattoberle mattoberle changed the title Use PEP 426 rules when setting deps from extras Use PEP 508 rules when setting deps from extras Jun 7, 2022
@@ -1,6 +1,8 @@
import re
from typing import Dict, Optional, Set, Tuple

import pkg_resources
Copy link
Collaborator

@groodt groodt Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment doesn't intend to block this PR (I think we need this fix).

Mostly an fyi for future travellers: pkg_resources (we use it elsewhere, so shouldn't block this PR anyway) is on the slow path towards deprecation:

There are probably parts of the stdlib we could use instead: https://github.com/python/cpython/blob/9cc0afcb8782c22575fa91756c54029c7e774dfb/Lib/importlib/metadata/__init__.py#L867

Or just create our own function if that's easier.

@thundergolfer thundergolfer requested review from thundergolfer and removed request for brandjon and lberki June 15, 2022 01:05
Copy link
Collaborator

@thundergolfer thundergolfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we follow what's shown in pip-tools here –> #724 (comment)?

We already access pip._internal for a bunch of stuff.

This replaces `pkg_resources.Requirement.parse` with
`packaging.utils.canonicalize_name`. Doing this pulls in a vendored
requirement from `pip`, which may be undesirable.

The code we want is just:

```
re.sub(r"[-_.]+", "-", name).lower()
```

This commit also leaves a reference to `pkg_resources` in `wheel.py` which
does not canonicalize the name.
@mattoberle
Copy link
Contributor Author

Could we follow what's shown in pip-tools here –> #724 (comment)?

We already access pip._internal for a bunch of stuff.

I'm happy to push up either this approach or just write the function here.
Looking at the source in packaging it's just:

_canonicalize_regex = re.compile(r"[-_.]+")
_canonicalize_regex.sub("-", name).lower()

I pushed up a commit which demonstrates the pip._vendor approach.

It looks like pkg_resources.Requirement is still used here and a non-canonicalized name is added to the dependency set:
https://github.com/bazelbuild/rules_python/blob/main/python/pip_install/extract_wheels/lib/wheel.py#L80-L86

Pulling in the vendored canonicalize_name casts the return value to a custom type NormalizedName, which is probably going to upset the type checker:
https://packaging.pypa.io/en/latest/_modules/packaging/utils.html#canonicalize_name

Do we need to / should we canonicalize the name in the dependency_set as well?
It looks like they run through a bazel-specific sanitizer here:

if incremental:
sanitised_dependencies = [
sanitised_repo_library_label(d, repo_prefix=repo_prefix) for d in whl_deps
]
sanitised_wheel_file_dependencies = [
sanitised_repo_file_label(d, repo_prefix=repo_prefix) for d in whl_deps
]
else:
sanitised_dependencies = [
sanitised_library_label(d, prefix=repo_prefix) for d in whl_deps
]
sanitised_wheel_file_dependencies = [
sanitised_file_label(d, prefix=repo_prefix) for d in whl_deps
]

@thundergolfer
Copy link
Collaborator

@mattoberle, I keep forgetting that authors don't have perms to merge these even after approval, do they?

@mattoberle
Copy link
Contributor Author

@mattoberle, I keep forgetting that authors don't have perms to merge these even after approval, do they?

They do not, looks like merge is locked down to repository owners or another group.

@alexeagle alexeagle merged commit dae610b into bazelbuild:main Jun 21, 2022
mattem pushed a commit to mattem/rules_python that referenced this pull request Jul 7, 2022
* Use PEP 426 rules when setting deps from extras

This commit addresses issue bazelbuild#720.

[PEP 426](https://peps.python.org/pep-0426/#name) states that distribution
names are case-insensitive and "-" is interchangeable with "_".

The `pip-compile` command creates a lockfile where all package names are lowercase.
The tool may also modify interchangeable characters.

The following examples are all valid `requirements.txt` or `requirements_lock.txt` entries:

```
SQLAlchemy[postgresql_psycopg2binary]==1.4.36
sqlalchemy[postgresql_psycopg2binary]==1.4.36
sentry_sdk[flask]==1.5.8
sentry-sdk[flask]==1.5.8
```

A distribution's `METADATA` file contains the stylization chosen by the publisher.
By applying a "sanitise" function when building the `extras` dict and when performing lookups
we can eliminate this difference as a concern.

* Use PEP 503 rules when sanitising extras

* Normalize distribution name with pkg_resources

`pypa/installer` is used to parse Wheel metadata, but does not currently
provide a method for normalizing distribution names:

- pypa/installer#97

`pypa/pkg_resources` provides `Requirement.parse` which returns an instance
of `Requirement` where `.key` is the canonical distribution name per PEP 503.

The `Requirement` class can also parse `extras`, but it returns a normalized
form that I believe could break the installation of the extras.

* Use Requirement.parse to populate extra reqs

* Revert "Use Requirement.parse to populate extra reqs"

This reverts commit f0faa97.

* Test for distribution name normalization in extras

* Replace pkg_resources with packaging.utils

This replaces `pkg_resources.Requirement.parse` with
`packaging.utils.canonicalize_name`. Doing this pulls in a vendored
requirement from `pip`, which may be undesirable.

The code we want is just:

```
re.sub(r"[-_.]+", "-", name).lower()
```

This commit also leaves a reference to `pkg_resources` in `wheel.py` which
does not canonicalize the name.

Co-authored-by: Jonathon Belotti <jonathon@canva.com>
Co-authored-by: Alex Eagle <alex@aspect.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants