Use "named interpreter constraints" #12652

Open
Eric-Arellano opened this issue Aug 25, 2021 · 20 comments
Labels
backend: Python Python backend-related issues

Comments

@Eric-Arellano
Contributor

Right now, anyone can add an arbitrary interpreter constraint to any Python target, like interpreter_constraints=["==2.7.*"]. As explained in https://blog.pantsbuild.org/python-3-migrations/, it's powerful that setting interpreter constraints is decentralized. But it has downsides:

  1. It's too easy for a user to unintentionally use un-blessed interpreter constraints. For example, an org might be trying to wait to use Python 3.10, but a developer accidentally starts using it without the build team knowing.
  2. Performance. We have to scan the entire repository to determine all possible interpreter constraints. There are some performance improvements we can make if we know the repository only ever uses a single interpreter constraint, like no longer partitioning MyPy/Flake8/Bandit/Pylint by interpreter constraints. If the list of globally blessed constraints has only one element, we know we can short-circuit that.
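The short-circuit in point 2 could be sketched like this. This is a hypothetical helper, not a Pants API; the target shape (dicts with an "interpreter_constraints" key) is assumed for illustration:

```python
from collections import defaultdict

def partition_targets_by_ics(targets, blessed_ics):
    # Hypothetical sketch, not Pants code. `targets` is a list of dicts with
    # an "interpreter_constraints" key; `blessed_ics` is the globally blessed
    # mapping of name -> constraint list.
    if len(blessed_ics) == 1:
        # Single blessed constraint set: every target must share it, so tools
        # like MyPy/Flake8/Bandit/Pylint can run in one partition without
        # scanning any targets.
        (only,) = blessed_ics.values()
        return {tuple(only): list(targets)}
    # Otherwise, group targets by their constraints as before.
    partitions = defaultdict(list)
    for tgt in targets:
        partitions[tuple(tgt["interpreter_constraints"])].append(tgt)
    return dict(partitions)
```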

This is an improvement over #12495.

Modeling

There will be an option like this:

[python-setup]
interpreter_constraints = [
   "default: ['>=3.6,<3.10']",
   "py2: ['==2.7.*']",
]

Then targets will set:

python_library(interpreter_constraints="py2")

We also probably still want to support using only a single global interpreter constraint like this, in which case you would not be able to set the interpreter_constraints field:

[python-setup]
interpreter_constraints = ['>=3.6,<3.10']

It's not clear what the best way is to model the [python-setup] option. We want a mapping of names to constraints. In TOML, that suggests a nested table, but that does not work well with our options system, which requires using list or dict types. dict types are awkward in Pants, and using one would mean we can't keep allowing the simple case of setting a single global interpreter constraint:

[python-setup]
interpreter_constraints = "{'default': ['>=3.6,<3.10'], 'py2': ['==2.7.*']}"

I think a list[str] makes sense, where it can either be a single element w/ just the global constraints, or a list of key: value elements to create a mapping of blessed constraints. If you are doing the blessed constraints approach, we should probably special case the "default" key to be the default. (Is it an error to not set a default?)
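A minimal sketch of parsing that list[str] shape, assuming every named entry looks like "name: ['constraint', ...]". The helper name and exact error behavior are hypothetical:

```python
import ast

def parse_ics_option(values):
    # Hypothetical parser for the proposed list[str] option shape.
    # Case 1: plain constraint strings -> a single global constraint set.
    if not any(":" in v and "[" in v for v in values):
        return {"default": list(values)}
    # Case 2: "name: ['c1', 'c2']" entries -> a mapping of blessed names.
    mapping = {}
    for v in values:
        name, _, raw = v.partition(":")
        mapping[name.strip()] = ast.literal_eval(raw.strip())
    # Assumed policy: a "default" key is required (the issue leaves this open).
    if "default" not in mapping:
        raise ValueError("named interpreter constraints must include a 'default' key")
    return mapping
```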

How to handle deprecation

This is tricky because interpreter_constraints is already claimed and is a good name to use.

I propose we keep the same option and field names, and teach our code to handle both the old and new semantics. For example, if the interpreter_constraints field is set to a list[str] rather than a str, assume the old semantics and emit a deprecation warning. If it's a str, assume the new semantics and look up the named interpreter constraints.
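That dual handling might look roughly like this. A sketch with hypothetical names, not Pants code:

```python
def interpret_ics_field(value, named_ics):
    # Hypothetical sketch of the proposed dual semantics; not Pants code.
    # `named_ics` maps blessed names to constraint lists.
    if isinstance(value, (list, tuple)):
        # Old semantics: raw constraint strings set directly on the target.
        # Real code would emit a deprecation warning here.
        return list(value)
    # New semantics: a single string naming a blessed constraint set.
    if value not in named_ics:
        raise KeyError(f"unknown named interpreter constraints: {value!r}")
    return named_ics[value]
```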

@benjyw
Contributor

benjyw commented Aug 25, 2021

I like this!

Why isn't the option dict-valued?

@Eric-Arellano
Contributor Author

Eric-Arellano commented Aug 25, 2021

Why isn't the option dict-valued?

Because:

dict types are awkward in Pants and it would mean we can't keep allowing the simple case of setting a single global interpreter constraint:

[python-setup]
# Awkward string escaping
interpreter_constraints = "{'default': ['>=3.6,<3.10'], 'py2': ['==2.7.*']}"

[python-setup]
# Note how clunky it is to set a single default for the whole repo
interpreter_constraints = "{'default': ['>=3.6,<3.10']}"

It's not possible for this option to be sometimes a list (if you just have one global default) vs. sometimes a dict.

Wdyt?

@benjyw
Contributor

benjyw commented Aug 25, 2021

That doesn't seem particularly clunky to me. It's just a normal Python dict literal.

And avoiding dict-valued options because we haven't provided good support for them in pants.toml seems like a big smell...

@benjyw
Contributor

benjyw commented Aug 25, 2021

Probably we can start supporting toml nested tables, now that option subscopes are no longer a thing?

@Eric-Arellano
Contributor Author

Probably we can start supporting toml nested tables, now that option subscopes are no longer a thing?

That would be pretty sweet.

[python-setup]
interpreter_constraints = { default = [">3.6,<4"], py2 = ["==2.7.*"] }

aka

[python-setup.interpreter_constraints]
default = [">3.6,<4"]
py2 = ["==2.7.*"]

Much much more natural. And then if you want to use the CLI or env vars, you'd have to use the Python syntax. Same with the add/remove syntax.

Note that "named resolves" will be implemented this same way, so getting this right will be useful.

--

If we have to add back subscopes, we can redesign a more TOML-tolerant mechanism like using | as the separator. cc @stuhood , sg?

@benjyw
Contributor

benjyw commented Aug 26, 2021

I can take a look at the toml thing. Unless, Eric, you already know how to do it and it's easy.

@Eric-Arellano
Contributor Author

It's not super complex fwict and I wrote the original TOML parser, so I'd be happy to do it. But feel free if it sounds fun and you're looking for things to do! I wouldn't be able to get to it till next week with tool lockfiles being stable.

Before embarking on it, though: are we confident we won't add back subscopes that use . as the delimiter? Are we comfortable committing to some alternative like |?

@benjyw
Contributor

benjyw commented Aug 26, 2021

Subscopes were very complicated for both Pants developers and users, and I would be reluctant to add them back, especially as a general name-based mechanism. But yeah, if we do, we can use something other than . as the delimiter, since I guess that has special meaning in TOML.

@Eric-Arellano
Contributor Author

FYI a subscope still exists in 2.7: python-protobuf.mypy-plugin. We're so close to 2.8 that it may make sense to hold off on implementing this new TOML feature. Or, special case it.

@benjyw
Contributor

benjyw commented Aug 26, 2021

Yeah, good point. Let's wait until we can do that one deprecation.

@Eric-Arellano
Contributor Author

Native TOML dictionaries are now supported! Based on developments with named resolves, I'm now thinking this modeling for named interpreter constraints:

[python-setup]
possible_interpreter_constraints = { py2 = "==2.7.*", py3 = ">=3.6,<4" }
default_interpreter_constraints = "py3"

python_library(interpreter_constraints="py2")

(Or known_interpreter_constraints, valid_interpreter_constraints, etc)

See #12742 for the two competing proposals for how to model the default ICs. From there:

I think option 2 gives better modeling when an org wants to change their default to a new value. Rather than having to retcon default, they can simply point to the [interpreter constraints] they want...If you start with --default=py2, then you can update it to --default=py3. That's better than changing default to have a different value.

This is more boilerplate than the status quo when you only have a single interpreter constraint in your repository. That's a bummer, but I think it's minimal enough and worth it. Reminder that you usually only have to set these two options once, typically by a codebase admin.
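Resolution under this modeling might be sketched as follows. The helper is hypothetical; the option names follow the proposal above:

```python
def resolve_interpreter_constraints(target_field, options):
    # Hypothetical helper, not Pants code. `options` mirrors the proposed
    # [python-setup] options: a mapping of blessed names to constraint
    # strings, plus the name to use when a target doesn't set the field.
    possible = options["possible_interpreter_constraints"]
    name = target_field if target_field is not None else options["default_interpreter_constraints"]
    if name not in possible:
        raise KeyError(f"unknown interpreter constraints name: {name!r}")
    return possible[name]
```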

@stuhood
Member

stuhood commented Sep 2, 2021

@Eric-Arellano : Two completely wildcard ideas that have been floating around in my mind (and which I'm very uncertain about).

  1. There is the possibility that you would want to configure resolves and ICs at once. Then this might be something like:
class PythonConfiguration:
    name: str
    lockfile: str
    interpreter_constraints: str
  2. I wonder how naming resolves/ICs interplays with https://docs.google.com/document/d/1sr1O7W9517w8YbdzfE56xvDb72LafK94Fwxm0juIRFc/edit?disco=AAAAHXZOocI ... essentially, whether, if you could define the resolve/ICs for an entire subdirectory by globbing it with a target and having additive Fields (i.e., what we have previously called "multiple owning targets"), you would still want to name the resolve/ICs vs. defining it somewhere that globbed in the matching targets.

@Eric-Arellano
Contributor Author

I remain very concerned about the performance hit of calculating interpreter constraints in a single-IC repo, and our users have pointed it out, prompting changes like #15141.

I'm not sure how I feel about named interpreter constraints still. My concern is you have to create this custom naming mechanism and have all your internal users know what names to use. Consider this:

python_sources(
    interpreter_constraints="py35-39",
)

python_tests(
    interpreter_constraints=parametrize("py35", "py36", "py37", "py38", "py39"),
)

Is that confusing to users to discover those names?

The major benefit I see still is that admins have tight blessing over which ICs are valid in a repo. That's desirable.

--

Reminder that if we don't use named interpreter constraints, we can do something like #12495 to fix performance.

Thoughts @thejcannon @stuhood @benjyw ?

@stuhood
Member

stuhood commented May 6, 2022

For the purposes of resolving the performance issue, I feel very strongly that we should implement #15241. If we want to add the naming of interpreter constraints, then we should do it for a usability reason rather than a performance reason, because I'm confident that #15241 would resolve the performance concern.

@Eric-Arellano
Contributor Author

I'm confident that #15241 would resolve the performance concern.

We would still need to iterate over all targets, O(n) where n is the number of targets, vs. the O(1) of checking len(python.interpreter_constraints). So while I agree it's no longer as mission critical, named interpreter constraints do still have a performance benefit.

@Eric-Arellano
Contributor Author

@benjyw you spent a lot of time with interpreter constraints & parametrize when recently adopting Pants in a repo. Purely from UX perspective, what do you think about the constraints being named? (#12652 (comment) for example)

@benjyw
Contributor

benjyw commented May 11, 2022

Purely from a UX perspective, this would be excellent to have. Right now I'm writing macros to get around this.

Imagine a repo that builds a wheel for 5 different interpreter versions, and therefore also has to run tests on those versions, for example. Being able to DRY and then parametrize on the interpreter constraint name instead of the actual constraint string would be a great step.

@Eric-Arellano
Contributor Author

Awesome! So then there's sufficient motivation:

  • ergonomics Benjy talks about w/ DRY
  • tight-control from codebase admins on when new ICs are introduced
  • minor performance enhancements, can short-circuit

Now just to figure out how to deprecate...

@stuhood
Member

stuhood commented May 11, 2022

IMO, we should probably wait until we know what #13767 will look like. If the configuration idea pans out, then it is roughly equivalent to named ICs, but with a different UX and no deprecation cycle.

@benjyw
Contributor

benjyw commented May 11, 2022

Yeah, I don't know if I would single out interpreter constraints for special UX consideration over using named configurations in general.
