Plans for packaging.metadata
#570
@abravalheri @pfmoore when you say you want more lenient handling, what does that mean to you?

I think we are going to want to use exception groups so that all issues can be gathered and reported to the user instead of just the first error (@pradyunsg, should we take a dependency on the `exceptiongroup` backport for older Python versions?). My initial thinking is something like:

```python
@overload
def from_email_headers(..., raise_exc=True) -> Self: ...
@overload
def from_email_headers(..., raise_exc=False) -> tuple[Self, ExceptionGroup]: ...
```

In the end, the only code necessary to accommodate this is some branching at the end of the function:

```python
if raise_exc:
    if exception_group:
        raise exception_group
    else:
        return metadata_
else:
    return metadata_, exception_group
```

I contemplated only having the lenient return type, but I realized that's not a great API if you forget to raise that exception group when raising is what you truly want. As for what exceptions are returned within the exception group, I was thinking of all of them being the same kind of exception. BTW, I am side-stepping whether there would be such a thing as an unrecoverable error no matter what, like leaving out the metadata version, project name, etc. |
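A minimal sketch of how a caller might use the lenient mode of this proposal (purely illustrative; `from_email_headers`, its placement on `Metadata`, and `raise_exc` are the API being discussed, not something that exists):

```python
# Hypothetical usage of the API sketched above; `from_email_headers` and
# `raise_exc` are the proposal under discussion, not an existing function.
metadata_, exc_group = Metadata.from_email_headers(headers, raise_exc=False)
if exc_group:
    # A lenient consumer can report every problem and keep the partially
    # parsed metadata instead of failing on the first error.
    for exc in exc_group.exceptions:
        print(f"invalid field: {exc}")
```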
It means that I want to be able to parse metadata with (for example) legacy versions and not have an InvalidVersion error raised, but instead get the value as a string. Returning exceptions rather than raising them is of no value for my use cases. For me, there are two parts to parsing metadata (from email, for example):

1. Getting the raw data into a simple name/value structure, with values kept as plain strings.
2. Validating those values and converting them into richer, checked types.

I consider "lenient" parsing to be just doing step 1, so that all possible values can be processed, regardless of validity. Strict parsing involves doing step 2 as well, and rejecting invalid values.

I tend to think that "lenient" parsing is important, on the basis of the "be lenient in what you consume and strict in what you produce" principle. When reading a file, you want to do whatever you can to get from the raw data to at least a name-value structure, so you can do more detailed processing on a simple but structured representation. Maybe you'll want to reject invalid values, maybe you'll want to spot them and report them, maybe you'll want to analyze patterns in what sort of deviations from valid values are present. The point is, being able to get the raw data into an internal form is key here.

Before you write data, though, you want it to be in a valid form (usually - there are use cases for simply reading and re-writing data and allowing invalid values to pass through unmodified, but they are specialised). So you store the data in a validating internal data structure, which catches any dodgy values as early as possible, and then write it out knowing you're only writing valid data to the external form.

Being able to read data from an external form, and then convert it to the validating internal structure (with errors being flagged at conversion time) may well be the more common pattern when working with a single metadata file. But when working with metadata files in bulk, the pattern is (IMO) much more likely to be "read in bulk without validation, then validate in bulk". |
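A minimal sketch of this two-step split (the helper names here are illustrative, not an existing packaging API): step 1 only maps field names to string values, step 2 converts and validates them.

```python
from packaging.version import Version

def step1_to_name_value(raw_headers: list[tuple[str, str]]) -> dict[str, str]:
    # Step 1: lenient; keep whatever strings were in the file, valid or not.
    return {name.lower(): value for name, value in raw_headers}

def step2_validate(fields: dict[str, str]) -> dict[str, object]:
    # Step 2: strict; convert values to richer types and reject bad data.
    validated: dict[str, object] = dict(fields)
    if "version" in fields:
        # Raises packaging.version.InvalidVersion for legacy/invalid versions.
        validated["version"] = Version(fields["version"])
    return validated
```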
I agree with Paul's definition of not raising an error. It would make sense to parse whichever fields are possible on a best-effort basis and populate the parsed information into the structured object. A tuple with an exception group also makes sense, so a build backend, for example, could use it to produce its own warnings. In my mind, there are several examples of things that are not correct but can still be parsed on "a best effort basis".

We can also ignore non-standard fields. (I tried to capture some of this "permissiveness" in the implementation of #498. Probably not the best code in the universe, but please feel free to steal from it if you find something useful.) |
(Just to avoid confusion: |
To the list of "things that are not correct but may not be a problem", can I add
|
Fair enough, but the exceptions would still give you access to the invalid data via their raw strings. So I think this comes down to you wanting strings all the time, while I want to provide structured data where possible. I acknowledge both approaches have legitimate uses, and so I will stop bugging you about this. 🙂
👍
I honestly wasn't planning to strictly check non-standard field names anyway. The biggest chance of failure here is something |
Maybe. But my core use case looks something like this:
With a string-only ("phase 1 only") API, that "just works". Validation can happen in a later stage, when processing a particular file. How would you do that with an API that only does the full conversion to a validated structure, returning exceptions when needed, in such a way that I can later write the data out unmodified?

The sort of use case I have in mind is processing a bunch of wheels off PyPI (using HTTP range requests to only pull the METADATA file out of the wheel and not download the whole lot, running parallel downloads for speed), processing the metadata to report "interesting statistics", and then saving the structured data to a database (where it'll all be serialised back to strings) for further future analysis.

Anyway, like you say, both approaches have legitimate uses, so we can agree to differ. I'm mostly just a little sad because you will have to internally read the metadata into strings before converting to types like |
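Something like the following captures the "read in bulk without validation, then validate in bulk" pattern described above (a sketch only; `raw_records` stands in for phase-1 name/value dicts read from many METADATA files):

```python
from packaging.version import InvalidVersion, Version

def count_bad_versions(raw_records) -> int:
    # Validation happens here, long after the bulk read; invalid values are
    # counted and reported rather than aborting the whole run.
    bad = 0
    for fields in raw_records:
        try:
            Version(fields.get("version", ""))
        except InvalidVersion:
            bad += 1
    return bad
```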
We don't need to have a single object represent both "raw" and "validated + parsed" metadata, FWIW.
This would cover both use cases, at the cost of an additional layer of indirection which is necessary to deal with the two sets of user needs here. That's a design if we want to solve both use cases in a single module. I think it's worthwhile, but I won't make that call unilaterally of course. :)
Meh... It depends on flit_scm for its own build, which in turn uses setuptools_scm and flit_core. I'm ambivalent on this because it's gonna mean that redistributors feel more pain and, possibly, enough of it that they come complaining about it. :) |
That was my suggestion, with the "json-compatible dictionary" format specified in the metadata PEP as the "raw" form. |
So just somewhat of a one-off for email headers, or would we eventually want it for
I would assume we would have a When we were originally discussing dict vs. data class, I was thinking "one or the other," not "both". That does simplify the story and structure of the overall approach by normalizing to JSON as the low-level input for
Does that work for you as well, @abravalheri ? I would assume we would ditch the overrides on @pfmoore might actually get what he wants. 😉 |
It is hard to give an informed opinion before working with the API, so please take the following with a grain of salt: I don't mind having to do more steps to use the API, but I would say that for the use case I have in mind, I don't think a build backend fits in the category of consumers that want to copy the data unaltered (or even just to display it unaltered). We can consider that a backend wants to parse the So if the only way of not running into errors is to use a raw A minor comment about a previous topic of discussion:
If there are limitations on what By the way, if
|
OK, so you would appreciate the exception group with all the failed metadata still being returned. So given email headers as the input, we have:
Maybe I have been thinking about this the wrong way by trying to generalize everything such that every metadata format should be treated the same instead of just treating every format as unique (and to hell with the complexity of it all)? Thinking about the formats and reading/writing them, you have:
So maybe
It can parse `dep (>=1.2)` but not `dep (1.2)`:

```
>>> from packaging import requirements
>>> requirements.Requirement("dep (>=1.2)")
<Requirement('dep>=1.2')>
>>> requirements.Requirement("dep (1.2)")
Traceback (most recent call last):
  File "/tmp/temp-venv/lib/python3.9/site-packages/packaging/requirements.py", line 102, in __init__
    req = REQUIREMENT.parseString(requirement_string)
  File "/tmp/temp-venv/lib/python3.9/site-packages/pyparsing/core.py", line 1141, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: Expected string_end, found '(' (at char 4), (line:1, col:5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/temp-venv/lib/python3.9/site-packages/packaging/requirements.py", line 104, in __init__
    raise InvalidRequirement(
packaging.requirements.InvalidRequirement: Parse error at "'(1.2)'": Expected string_end
>>> requirements.Requirement("dep (==1.2)")
<Requirement('dep==1.2')>
```
|
PyPI would want strict validation FWIW.

The way I'd implement this is that you have a variety of functions that read a file and turn it into a "standard" shaped dictionary with primitive keys/values. No validation would be done at this point, so it's completely YOLO; the only difference between this and a hypothetical

So, they'd have a function signature that looked something like:

```python
from typing import TypedDict, Optional

class ParsedMetadata(TypedDict):
    name: Optional[str]
    # Everything is optional, everything is primitive types.

def parse_FORMAT(content: str) -> ParsedMetadata:
    pass
```

There are a few subtle things here that I would suggest:

Then a classmethod can be added to the existing class:

```python
class Metadata:
    # ...

    @classmethod
    def from_parsed(cls, parsed: ParsedMetadata) -> "Metadata":
        pass
```

This method consumes the parsed value, validates all of the metadata, then returns the validated object. I wouldn't actually have the

I think this has a lot of nice benefits:

For someone that wants the strict validation, they would ultimately call this like:

```python
meta = Metadata.from_parsed(parse_email(content))
```

We could offer some shortcut methods, that do:

```python
class Metadata:
    @classmethod
    def from_email(cls, content: str) -> "Metadata":
        return cls.from_parsed(parse_email(content))
```

Which then gives people the option for the best of both worlds.
|
Thanks @dstufft - this is exactly the model I've been arguing for all along (and I agree, conversion to metadata should always raise - which I think is what @brettcannon might mean when he says "strict", and I think he believes I object to...). I've been thinking some more about this, though, and I think that @brettcannon has a point when he says maybe the logic should depend on the input. From my point of view:

There's also another assumption that I've been making, which I think should be made explicit. So, for example, I believe the following code:

```python
m = Metadata("foo", Version("1.0"))
m.requires_python = 3.7
```

should raise an exception. I don't think there should be any choice here; the attribute should be validated on assignment. This is important because there will be tools that want to accept direct user input for metadata fields, not from a file format, but from things like command line options. I'd also point out (although it's somewhat off topic here) that I believe that
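A minimal sketch of the validate-on-assignment behaviour described here, using a property (purely illustrative; this is not how `packaging.metadata` is actually implemented):

```python
from packaging.specifiers import SpecifierSet

class StrictMetadata:
    """Illustrative only: reject bad values at assignment time."""

    @property
    def requires_python(self) -> SpecifierSet:
        return self._requires_python

    @requires_python.setter
    def requires_python(self, value: str) -> None:
        if not isinstance(value, str):
            # Catches mistakes like assigning the float 3.7.
            raise TypeError("requires_python must be a specifier string")
        # SpecifierSet raises InvalidSpecifier immediately for malformed input.
        self._requires_python = SpecifierSet(value)
```

With this shape, `m.requires_python = 3.7` fails at assignment rather than later, at write time.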
|
I think that even for cases like the hypothetical The benefit of that is most obvious for the email parser, because the email format isn't expressive enough to have all the primitive types we might use natively, so we need to do some extra parsing. However, the same could be true for a "good" serialization format like a hypothetical JSON. Let's say for instance that Another, maybe less contrived, example is that it's possible that we (purposely, or accidentally) end up with different key names in different serialization formats. The underlying thing is that if we try to say that we don't need

I also think that while obviously the biggest problem with "bad" metadata is going to be the email-based fields because of our history here, it seems silly to assume that tools aren't going to have bugs that generate bad metadata that 10 years from now we'll end up wanting a lenient parser for, for all the same reasons we have today. Particularly since the cost of doing so seems to be pretty minimal? And since we'll want it for email metadata anyway, if we don't implement it that way, we're forced to be inconsistent in how we implement different serialization types.

I'm personally kind of sketch on the idea of using these APIs for If we are going to combine

```python
def validate_as_pyproject(meta: Metadata) -> bool:
    ...

def validate_as_core_metadata(meta: Metadata) -> bool:
    ...
```

Which implements the specific logic for the two different rules... but really I think that An interesting thing about this all is that we can add a method to

Putting this all together, my suggestion would be:

```python
from __future__ import annotations

from typing import Any, TypedDict

class RawMetadata(TypedDict, total=False):
    ...  # Typed so that everything is optional, and only using primitive types.

class Metadata:
    ...  # This is typed so that required things are required, optional things are optional.

    @classmethod
    def from_raw(cls, raw: RawMetadata) -> Metadata:
        ...

    @classmethod
    def from_email(cls, data: bytes) -> Metadata:
        raw, _leftover = parse_email(data)
        return cls.from_raw(raw)

    def as_raw(self) -> RawMetadata:
        ...

    def as_email(self) -> bytes:
        return emit_email(self.as_raw())

def parse_email(data: bytes) -> tuple[RawMetadata, dict[str, Any]]:
    ...

def emit_email(raw: RawMetadata) -> bytes:
    ...
```

This gives us tons of flexibility, sensible layers to our implementation, and still easy-to-use functions for people who don't care about those layers.
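For illustration, the two kinds of consumers would then use different layers of this sketch (the names are the ones proposed above; `metadata_bytes` is just a placeholder for raw METADATA content):

```python
# Lenient consumer: work with the raw, primitive-typed data; nothing raises.
raw, leftover = parse_email(metadata_bytes)
print(raw.get("name"), list(leftover))

# Strict consumer: full validation with rich types; invalid data raises.
meta = Metadata.from_email(metadata_bytes)
```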
|
It's worth a lot! 🙂
That's what I was thinking for a time yesterday, and that's what I'm considering in the email header case. I don't want to make it universal, though, as other formats like
Yep, as that's how you get permissible/loose management of the data.
That does go against what's in PEP 566. What I'm more concerned about is any of the structured data in
Those keys would actually not even be typed as they wouldn't be represented by the
There would very likely be a validation step on output.
What I have in my head is email headers get converted to JSON, and then we work with JSON for
Essentially we use JSON as the lossy input format and work with |
I'm confused what you would be redoing if it's already structured appropriately? If the result of

```python
import json  # RawMetadata and KNOWN_FIELDS as sketched earlier in the thread

def parse_json(data: bytes) -> tuple[RawMetadata, dict[str, Any]]:
    raw: RawMetadata = {}
    leftover: dict[str, Any] = {}

    for key, value in json.loads(data).items():
        if key in KNOWN_FIELDS:
            raw[key] = value
        else:
            leftover[key] = value

    return raw, leftover
```

You're not redoing anything there, you're just splitting the keys up into raw or leftover, and if it's decided to just leave the unknown keys inside the

```python
def parse_json(data: bytes) -> RawMetadata:
    return json.loads(data)
```

Is it kind of silly to have this extra function call? Yeah, slightly, but it provides better typing and, more importantly, it keeps things consistent. There are no special cases here. 10 years from now, if we all decide that JSON is lame and we want to use FSON, but FSON doesn't have quite the same capabilities as JSON, we can easily just drop in a new parse function,

More importantly, I think there's a fundamental assumption here that the input to

For example, this is probably more realistically what that function would look like:

```python
def parse_json(data: bytes) -> tuple[RawMetadata, dict[Any, Any]]:
    raw: RawMetadata = {}
    leftover: dict[Any, Any] = {}

    parsed = json.loads(data)
    if not isinstance(parsed, dict):
        raise ValueError("data is not a json mapping")

    for key, value in parsed.items():
        if key in KNOWN_FIELDS and isinstance(value, KNOWN_FIELDS[key]):
            raw[key] = value
        else:
            leftover[key] = value

    return raw, leftover
```

That function will:
Well part of the idea here is that
As I mentioned above, I don't think That being said, For example Now if you end up agreeing that Which raises the questions:
|
Let's take author details as an example. In Now consider How do you reconcile those two differences of the data's structure in your
I think this is where I'm having a hard time understanding the balance of the complexity: I assume the raw structure would either need to be the greatest common denominator, or support every which way of specifying data from every format and so become a monster of an object (in terms of structure).

It's not the only bad one, just the worst one. 😉 I think the thing we can all agree on is that people have various uses for getting the data out of bytes and into a format that can be programmatically accessed without any data loss somehow. What Donald and Paul seem to be advocating for is an intermediate format, some dict or data class that holds the data in the most flexible way possible, to separate out the reading part from the parsing/processing part. I seem to be advocating for skipping the intermediate stage and going straight to |
I think I would reword this to "with minimal data loss". I think ultimately the scale of nonsense garbage that can get shoved in these files is sufficiently unbounded that there's no way to write a general-purpose parser that has no data loss.
Ultimately you just pick something and run with it, I think? The authors/maintainers field here is interesting because while If I were modeling that, I would represent it like this (completely untested):

```python
import tomllib
from email.headerregistry import AddressHeader
from email.parser import BytesParser
from typing import Any, Optional, TypedDict, Union


class RawPerson(TypedDict, total=False):
    name: Optional[str]
    email: Optional[str]


# Core metadata spec supports the RFC From header, which has support
# for named groups of email addresses, it's "great".
#
# Maybe it would make more sense to ditch ``RawGroup``, and just add
# an optional "group" field on RawPerson that's just a string of the group
# name, if there is one?
class RawGroup(TypedDict, total=False):
    group: Optional[str]
    persons: list[RawPerson]


class RawMetadata(TypedDict, total=False):
    authors: Optional[list[Union[RawPerson, RawGroup]]]


def parse_email(data: bytes) -> tuple[RawMetadata, dict[Any, Any]]:
    raw: RawMetadata = {}
    leftover: dict[Any, Any] = {}

    parsed = BytesParser().parsebytes(data)
    for key, value in parsed.items():
        if key == "Author":
            # This sucks, but the core metadata spec doesn't really give us any
            # format that this value is in, so even though the example gives an
            # email, we can't really do anything about it and just have to treat
            # it as a string.
            raw.setdefault("authors", []).append(RawPerson(name=value))
        elif key == "Author-Email":
            # This API is gross, maybe there's a better one somewhere in the stdlib?
            address = {}
            try:
                AddressHeader.parse(value, address)
            except Exception:  # No idea what errors it can return, seems like a bunch of random ones
                leftover[key] = value
            else:
                authors = raw.setdefault("authors", [])
                for group in address.get("groups", []):
                    if group.display_name is None:
                        for addr in group.addresses:
                            p: RawPerson = {}
                            if addr.display_name:
                                p["name"] = addr.display_name
                            if addr.addr_spec:
                                p["email"] = addr.addr_spec
                            authors.append(p)
                    else:
                        rg = RawGroup(group=group.display_name, persons=[])
                        for addr in group.addresses:
                            p: RawPerson = {}
                            if addr.display_name:
                                p["name"] = addr.display_name
                            if addr.addr_spec:
                                p["email"] = addr.addr_spec
                            rg["persons"].append(p)
                        authors.append(rg)

    return raw, leftover


# If we must...
def parse_pyproject(data: bytes) -> tuple[RawMetadata, dict[Any, Any]]:
    raw: RawMetadata = {}
    leftover: dict[Any, Any] = {}

    parsed = tomllib.loads(data.decode("utf-8"))

    # Going to arbitrarily decide that the rest of the toml file doesn't
    # matter, just the [project] table.
    if "project" not in parsed:
        raise ValueError("No metadata")

    for key, value in parsed["project"].items():
        if key == "authors":
            # An interesting question is whether a single entry in a list like this should fail
            # parsing the entire list, or should we be greedy and parse as much as possible?
            #
            # I'm going to fail parsing the entire key, because if there is bad data, we don't know
            # what else is broken.
            authors = []
            invalid = False
            for entry in value:
                if isinstance(entry, dict) and ("name" in entry or "email" in entry):
                    p = RawPerson()
                    if "name" in entry:
                        p["name"] = entry["name"]
                    if "email" in entry:
                        p["email"] = entry["email"]
                    authors.append(p)
                else:
                    invalid = True
                    break
            if invalid:
                leftover["authors"] = value
            else:
                raw["authors"] = authors

    return raw, leftover
```
|
I'm going to sketch up a PR here, and see how things turn out! |
Something that I noticed while implementing this: PEP 566 says this about converting metadata into json:
However, the packaging specifications say this:
but more importantly, it says this about the keywords field:
From what I can see, the specifications haven't explicitly altered how |
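For context on the keywords discrepancy (my reading of it: PEP 566's JSON transformation splits the value on whitespace, while the core metadata spec describes keywords as comma-separated; treat this as a summary of the conflict, not a quote of either document), the practical difference looks like:

```python
value = "web framework, async"

# Splitting on whitespace (the PEP 566 JSON-transformation reading):
print(value.split())                          # ['web', 'framework,', 'async']

# Splitting on commas (the comma-separated reading of the spec):
print([k.strip() for k in value.split(",")])  # ['web framework', 'async']
```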
Next fun fact! You can't actually use You can of course just decode the bytes yourself as utf8 and pass them into |
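A sketch of the workaround mentioned here, decoding the bytes yourself and handing the result to the stdlib's text parser (illustrative only; `parse_metadata_text` is a made-up helper, and this says nothing about which parser or policy packaging should actually use):

```python
import email.parser

def parse_metadata_text(data: bytes):
    # Decode the raw METADATA bytes as UTF-8 ourselves, then use the
    # str-based parser; headersonly=True leaves the body as an unparsed payload.
    text = data.decode("utf-8")
    return email.parser.Parser().parsestr(text, headersonly=True)
```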
How about |
(s/Police/Policy/) |
Nope. The message_from_bytes function is a thin wrapper around The BytesParser class is entirely implemented as a wrapper around |
Of course, if you don't care about being lenient, you can just do That will correctly parse the |
|
Passing in |
But is there any way to do it via some |
Uhh, I probably added that because something on PyPI made it happen when I was testing it against PyPI data... but I don't remember what. |
At least, I don't think I would have gone out of my way to do that on my own. |
Trying to serialize an object with a list payload, at least, doesn't work. It trips up in |
#671 covers |
Raw metadata parsing just landed in |
🎉 |
If PEP 639 gets accepted this becomes a standard field |
@nilskattenbeck-bosch correct, but that hasn't happened yet (I should know, I'm the PEP delegate 😁). Once the field exists we will bump the max version for metadata and then update the code accordingly. |
May I also suggest adding a small information box to the documentation as to why the function is called |
I would go even farther and say you can propose a PR to add such a note. 😁 |
Introduce the `packaging.metadata.Metadata` class which gives an object API similar to `packaging.metadata.RawMetadata` while returning "enriched" objects like `version.Version` where appropriate. It also supports validating the metadata. Part of #570
With my work for reading metadata now complete, I'm personally considering my work on this issue done. Hopefully someone else will feel motivated to do the "writing" bit, but as someone who has only written tools to consume metadata I don't think I'm in a good position to drive the writing part of this. |
Thank you very much @brettcannon for working on this.
There was at least one previous PR (#498, probably more) that tried to address the "writing" capabilities for metadata; however, these were closed, probably because they don't fit into the long-term vision that the maintainers would like for the API/implementation (which is a very good thing to have in a project). Would it be possible (for the sake of anyone that intends to contribute the "writing" part) to have a clear guideline on how we should go about it? (This question is targeted at all I am just concerned that throwing PRs at |
I think the first question is what are the requirements for the feature to write metadata? Do you need to minimize diffs for The other question is whether tools will build up the data in a There's also how much you want to worry about older metadata versions? Do you let the user specify the metadata version you're going to write out and thus need to do fancier checks for what's allowed? For me, I think consistency in output is more important than keeping pre-existing |
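As one possible reading of "consistency in output", a writer could always emit known fields in a fixed canonical order regardless of how the input happened to be ordered. The sketch below is purely hypothetical (names, field list, and behaviour are assumptions, and it ignores the multi-use-field and newline questions raised elsewhere in this thread):

```python
# Hypothetical emitter: reproducible output via one canonical field order.
_CANONICAL_ORDER = ("Metadata-Version", "Name", "Version", "Summary", "Requires-Python")

def emit_email_sketch(fields: dict[str, str]) -> bytes:
    lines = [f"{name}: {fields[name]}" for name in _CANONICAL_ORDER if name in fields]
    return ("\n".join(lines) + "\n").encode("utf-8")
```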
I'd also mention that everything I've done with
|
Oh ugh, emitting probably also has to answer the newlines in fields question, and possibly the other issues I raised with the spec earlier up thread. |
Just a note, I have a PR up now to Warehouse (pypi/warehouse#14718) that switches our metadata validation on upload from a custom validation routine implemented using wtforms to

The split between

I'm still giving it a thorough set of manual testing, but so far it looks like, other than a few bugs/issues that fell out (#733, #735), integrating it was relatively painless. Of course the real test will be when it goes live and that wide array of nonsense that people upload starts flowing through it.

We're not yet using the actual parsing of |
Do you think it might make sense to keep both the old and the new code paths running for a bit on Warehouse's end, with the results of the old code path being returned? i.e.

```python
def upload(...):
    try:
        new_result = new_upload(...)
    except Exception:
        new_result = ...

    old_result = old_upload_handler(...)
    if new_result != old_result:
        log.somehow(old_result, new_result)

    return old_result
```
|