Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plans for packaging.metadata #570

Open
2 of 3 tasks
brettcannon opened this issue Jul 8, 2022 · 97 comments
Open
2 of 3 tasks

Plans for packaging.metadata #570

brettcannon opened this issue Jul 8, 2022 · 97 comments

Comments

@brettcannon
Copy link
Member

brettcannon commented Jul 8, 2022

@brettcannon
Copy link
Member Author

@abravalheri @pfmoore when you say you want more lenient handling, what does that mean to you? I think we are going to want to use exception groups so that all issues can be gathered and reported to the user instead of just the first error (@pradyunsg should we take a dependency on exceptiongroup or have our own ExceptionGroup shim class?). That leads to a possible solution where there's an option of returning the exception group so the caller can choose how to handle the problems (with all the relevant data saved to the exceptions). What we could then do is something like:

@overload
def from_email_headers(..., raise_exc=True) -> Self: ...

@overload
def from_email_headers(..., raise_exc=False) -> (Self, ExceptionGroup): ...

In the end the only code necessary to accommodate this is some branching at the end of the function:

if raise_exc:
    if exception_group:
        raise exception_group
    else:
        return metadata_
else:
    return metadata_, exception_group

I contemplated only having the lenient return type, but I realized that's not a great API if you forget to raise that exception group if that's what you truly want compared to getting a TypeError that there are not enough values to unpack for metadata_, exc_group = Metadata.from_email_headers(..., raise_exc=True).

As for what exceptions are returned within the exception group, I was thinking of all of them being InvalidMetadata which stores the core metadata name, value, and the reason why it failed (either a string or setting __cause__ as appropriate, e.g. InvalidSpecifier).

BTW I am side-stepping if there would be such a thing as an unrecoverable error no matter what, like leaving out the metadata version, project name, etc.

@pfmoore
Copy link
Member

pfmoore commented Jul 10, 2022

It means that I want to be able to parse metadata with (for example) legacy versions and not have an InvalidVersion error raised, but instead get the value as a string. Returning exceptions rather than raising them is of no value for my use cases.

For me, there are two parts to parsing metadata (from email, for example):

  1. Parse the email format into name-value pairs for the various metadata fields. At this stage, the "value" is a simple string value (or list of such values).
  2. Validate the values which have a defined format, and convert them into structured types as appropriate.

I consider "lenient" parsing to be just doing step 1, so that all possible values can be processed, regardless of validity. Strict parsing involves doing step 2 as well, and rejecting invalid values.

I tend to think that "lenient" parsing is important, on the basis of the "be lenient in what you consume and strict in what you produce" principle. When reading a file, you want to do whatever you can to get from the raw data to at least a name-value structure, so you can do more detailed processing on a simple but structured representation. Maybe you'll want to reject invalid values, maybe you'll want to spot them and report them, maybe you'll want to analyze patterns in what sort of deviations from valid values are present. The point is, being able to get the raw data into an internal form is key here. Before you write data, though, you want it to be in a valid form (usually - there are use cases for simply reading and re-writing data and allowing invalid values to pass through unmodified, but they are specialised). So you store the data in a validating internal data structure, that catches any dodgy values as early as possible, and then write it out knowing you're only writing valid data to the external form.

Being able to read data from an external form, and then convert it to the validating internal structure (with errors being flagged at conversion time) may well be the more common pattern when working with a single metadata file. But when working with metadata files in bulk, the pattern is (IMO) much more likely to be "read in bulk without validation, then validate in bulk".

My pkg_metadata project is focused on only doing step 1, with the dictionary format being the "unvalidated key-value" data structure. I would still happily contribute it to packaging, with step 2 then being about converting the dictionary format to the Metadata class. And probably some convenience functions to do steps 1+2 together, for people who don't need the separation. But if you don't agree with my split into steps 1 and 2, I can keep pkg_metadata as a separate project for people who need that extra flexibility (I'll probably add something to do step 2, converting to the packaging Metadata class once it has stabilised, in that case).

@abravalheri
Copy link
Contributor

abravalheri commented Jul 11, 2022

@abravalheri @pfmoore when you say you want more lenient handling, what does that mean to you?

I agree with Paul's definition of not raising an error. It would make sense to parse whichever fields are possible on a best effort basis and populate the parsed information into the structured object. A tuple with an exception group also make sense, so a build-backend for example could use it to produce its own warnings.

In my mind, some examples of things that are not correct but can be parsed on "a best effort basis" are:

  1. "long text" fields (e.g. Description) when the indentation does not strictly follow the " " * 7 + "|" rule (e.g. when the indentation is " " * 8).
  2. "outdated" forms of a field's key or value, when the metadata version is inconsistent, e.g.:
    • PEP 314 => PEP 345 renaming, such as: Requires => Requires-dist
    • Requires-Dist using a parenthesised version without operator (e.g. dep (1.2))
  3. Multi-line Summary (we can simply join multiple lines when parsing the value).

We can also ignore non-standard fields (e.g. License-Files).

(I tried to capture some of this "permissiveness" on the implementation of #498. Probably not the best code in the universe, but please feel free to steal if you find something useful).

@merwok
Copy link

merwok commented Jul 11, 2022

(Just to avoid confusion: Requires from PEP 314 is not just an old name for Requires-Dist, because it expected an importable name for value, not a project name. So these are really two separate fields, one obsolete (can be discarded) and one current.)

@pfmoore
Copy link
Member

pfmoore commented Jul 11, 2022

To the list of "things that are not correct but may not be a problem", can I add

  • Invalid values (e.g., incorrectly formatted specifiers) in any field, when you're not interested in that value. Very few applications in my experience actually care about the actual value of (for example) the "Author" and "Author-Email" metadata read from an existing distribution, but wanting to copy that data unaltered (or even just to display it unaltered!) is a fairly reasonable thing to hope for.

@brettcannon
Copy link
Member Author

Returning exceptions rather than raising them is of no value for my use cases.

Fair enough, but the exceptions would still cause you to have access to the invalid data via their raw strings. So I think this is coming more down to you want strings all the time while I want to provide structured data where possible. I acknowledge both approaches have legitimate uses, and so I will stop bugging you about this. 🙂

A tuple with an exception group also make sense, so a build-backend for example could use it to produce its own warnings.

👍

We can also ignore non-standard fields (e.g. License-Files).

I honestly wasn't planning to strictly check non-standard field names anyway. The biggest chance of failure here is something packaging itself can't parse (e.g. the dep (1.2) example). But hopefully returning the incorrect strings for introspection for tools decide what extra effort they want to put in is flexible enough.

@pfmoore
Copy link
Member

pfmoore commented Jul 12, 2022

Fair enough, but the exceptions would still cause you to have access to the invalid data via their raw strings.

Maybe. But my core use case looks something like this:

structured_metadata = {filename: msg_to_metadata(get_metadata_bytes(filename) for filename in list_of_filenames}

With a string-only ("phase 1 only") API, that "just works". Validation can happen in a later stage, when processing a particular file. How would you do that with an API that only does the full conversion to a validated structure, returning exceptions when needed, in such a way that I can later write the data out unmodified?

The sort of use case I have in mind is processing a bunch of wheels off PyPI (using HTTP range requests to only pull the METADATA file out of the wheel and not download the whole lot, running parallel downloads for speed), processing the metadata to report "interesting statistics", and then saving the structured data to a database (where it'll all be serialised back to strings) for further future analysis. That structured_metadata dictionary is the in-memory data structure I'd use for that processing and then to serialise from.

Anyway, like you say, both approaches have legitimate uses, so we can agree to differ. I'm mostly just a little sad because you will have to internally read the metadata into strings before converting to types like Version and Requirement, and yet you prefer not to expose that intermediate state meaning that people wanting string-only values have to reimplement that code. But I have that reimplemetation in pkg_metadata, so there's no more extra work now anyway.

@pradyunsg
Copy link
Member

We don't need to have a single object represent both "raw" and "validated + parsed" metadata, FWIW.

  • Provide a "low level" parse function, that just does the email.FeedParser stuff, from the raw underlying string/bytes and provides a value (either the Message as is or a different class, StringValuedMetadata or something, that's aware of keys and does a lookup on attribute access mirroring the Metadata class's attributes).
  • Metadata will perform the validation during initialisation. It would also grow a classmethod to allow initialising through that in addition to being able to be initialised from the raw string/bytes as well. We'll keep the exception groups as well.

This would cover both use cases, at the cost of an additional layer of indirection which is necessary to deal with the two sets of user needs here.


That's a design if we want to solve both use cases in a single module. I think it's worthwhile, but I won't make that call unilaterally of course. :)

should we take a dependency on exceptiongroup

Meh... It depends on flit_scm for its own build, which in turn uses setuptools_scm and flit_core. I'm ambivalent on this because it's gonna mean that we have redistributors feel more pain and, possibly, enough of it that they come complaining about it. :)

@pfmoore
Copy link
Member

pfmoore commented Jul 13, 2022

This would cover both use cases, at the cost of an additional layer of indirection which is necessary to deal with the two sets of user needs here.

That was my suggestion, with the "json-compatible dictionary" format specified in the metadata PEP as the "raw" form.

@brettcannon
Copy link
Member Author

brettcannon commented Jul 13, 2022

  • that just does the email.FeedParser stuff

So just somewhat of a one-off for email headers, or would we eventually want it for pyproject.toml as well? I guess the code can guide us as to whether it's easier to normalize to JSON for everything and have Metadata take that in to do the conversion to the appropriately types and do validation.

  • either the Message as is or a different class, StringValuedMetadata or something

I would assume we would have a TypedDict defined for the JSON dict format.

When we were originally discussing dict vs. data class, I was thinking "one or the other," not "both". That does simplify the story and structure of the overall approach by normalizing to JSON as the low-level input for Metadata, but at the cost of more code to manage the same data, just in different formats. But since we don't exactly change the core metadata spec that often the duplication shouldn't be burdensome.

This would cover both use cases, at the cost of an additional layer of indirection which is necessary to deal with the two sets of user needs here.

That was my suggestion, with the "json-compatible dictionary" format specified in the metadata PEP as the "raw" form.

Does that work for you as well, @abravalheri ? I would assume we would ditch the overrides on Metadata.from_email_headers() in preference to this two-step process of getting from METADATA to Metadata.

@pfmoore might actually get what he wants. 😉

@abravalheri
Copy link
Contributor

Does that work for you as well, @abravalheri ? I would assume we would ditch the overrides on Metadata.from_email_headers() in preference to this two-step process of getting from METADATA to Metadata.

It is hard to give a informed opinion before working with the API, so please take the following with a grain of salt:

I don't mind having to do more steps to use the API, but I would say that for the use case I have in mind1, from_XXX(..., raise_exc=False) doing "best effort" parsing and returning a tuple with a (potentially incomplete) structured metadata object is what I am really looking for.

I don't think a build backend fits in the category of consumers that want to copy the data unaltered (or even just to display it unaltered). We can consider that a backend want to parse the PKG-INFO into an structured object, use this object for the build and then use the API to create a valid METADATA file...

So if the only way of not running into errors is to use the a raw TypedDict, the backend would have to re-implement the logic that converts the TypedDict into an structured metadata object...


A minor comment about a previous topic of discussion:

The biggest chance of failure here is something packaging itself can't parse (e.g. the dep (1.2) example). But hopefully returning the incorrect strings for introspection for tools decide what extra effort they want to put in is flexible enough.

If there are limitations on what packaging can parse, that is fine. An backend using this API can decide to ignore all the fields that are not essential for a --no-binary installation (e.g. author, summary) and try to do something about the fields that are essential (e.g. requires-dist) based on the raw value in the ExceptionGroup. The important part is that a (potentially incomplete) structured data object is returned, so the backend does not have to re-implement the entire conversion logic.

By the way, if packaging cannot parse Requires-dist: dep (1.2), this also means that this API will not be able to parse Metadata-Version <= 1.2, right?

Footnotes

  1. a build backend parsing a PKG-INFO file from an old sdist to produce a wheel to be used by the installer. I am assuming this is necessary to support pip install --no-binary or when no wheel is available in the package index for an specific operating system/processor architecture.

@brettcannon
Copy link
Member Author

I don't mind having to do more steps to use the API, but I would say that for the use case I have in mind1, from_XXX(..., raise_exc=False) doing "best effort" parsing and returning a tuple with a (potentially incomplete) structured metadata object is what I am really looking for.

OK, so you would appreciate the exception group with all the failed metadata still being returned.

So given email headers as the input, we have:

  1. Get me the data out of email headers, but don't process anything (Paul)
  2. Do your best to process the data into some structure, but give back the left-overs (Anderson)
  3. Be strict about the data (... no one?)

Maybe I have been thinking about this the wrong way by trying to generalize everything such that every metadata format should be treated the same instead of just treating every format as unique (and to hell with the complexity of it all)? Thinking about the formats and reading/writing them, you have:

  1. pyproject.toml (strict read, no write)
  2. PKG-INFO (loose read, strict write)
  3. METADATA (loose read, strict write)
  4. JSON/metadata.json (??? read, strict write)
  5. Direct object creation (strict read in the sense of typing will make sure you don't mess it up)

So maybe from_email_headers() should only return (Self, dict[str, str]) with as much processed data as possible in Self and a dictionary with everything that wasn't used/usable (with the keys all normalized to lower-case)? I don't know what we would want to do with data-related errors like something in Dynamic that actually shouldn't be there in the PKG-INFO case (let it through, return it in the dict?). But since the tools reading PKG-INFO/METADATA seem to want leniency, it might be best to just embrace it and let tools decide if they care whether the left-overs dict wasn't empty?

By the way, if packaging cannot parse Requires-dist: dep (1.2), this also means that this API will not be able to parse Metadata-Version <= 1.2, right?

It can parse dep (>=1.2) just fine; the lack of version specification is the issue.

>>> from packaging import requirements
>>> requirements.Requirement("dep (>=1.2)")
<Requirement('dep>=1.2')>
>>> requirements.Requirement("dep (1.2)")
Traceback (most recent call last):
  File "/tmp/temp-venv/lib/python3.9/site-packages/packaging/requirements.py", line 102, in __init__
    req = REQUIREMENT.parseString(requirement_string)
  File "/tmp/temp-venv/lib/python3.9/site-packages/pyparsing/core.py", line 1141, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: Expected string_end, found '('  (at char 4), (line:1, col:5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/temp-venv/lib/python3.9/site-packages/packaging/requirements.py", line 104, in __init__
    raise InvalidRequirement(
packaging.requirements.InvalidRequirement: Parse error at "'(1.2)'": Expected string_end
>>> requirements.Requirement("dep (==1.2)")
<Requirement('dep==1.2')>

@dstufft
Copy link
Member

dstufft commented Jul 15, 2022

PyPI would want strict validation FWIW


The way I'd implement this is that you have a variety of functions that read a file and turn it into a "standard" shaped dictionary with primitive keys / values. No validation would be done at this point so it's completely YOLO, the only different between this and a hypothetical json.loads() is that any differences in serialization are papered over so that all of the different metadata methods return a "standard" dictionary. Call these the parse_FORMAT function, so like parse_json, parse_email, parse_toml, etc.

So, they'd have a function signature that looked something like:

from typing import TypedDict, Optional

class ParsedMetadata(TypedDict):
    name: Optional[str]
    # Everything is optional, everything is primitive types.


def parse_FORMAT(content: str) -> ParsedMetadata:
    pass

There are a few subtle things here that I would suggest:

  • If a field is missing, or otherwise unable to be read from the raw format (see next point) they are just omitted from the returned dictionary rather than generating some kind of error.
  • Part of papering over different serializations will be parsing types, for instance if Project-URL is chosen to be a dictionary, then pase_email will need to convert ["Key, URL"] into {"Key": "URL"} 1.
  • Extra fields that we don't know about are omitted from the returned value 2.

Then a classmethod can be added to the existing Metadata class, like:

class Metadata:
    # ...

    @classmethod
    def from_parsed(cls, parsed: ParsedMetadata) -> Metadata:
        pass

This method consumes the parsed value, validates all of the metadata, then returns the validated Metadata object.

I wouldn't actually have the raises keyword, I would just have this method always raise, but attach the incomplete result to the raised exception. That satisfies the same use cases but avoids a boolean parameter and the need to deal with overloading return types.

I think this has a lot of nice benefits:

  • People who want something that will do as little parsing as possible, but still provide some semblance of a unified type, if you don't care about a unified type, then you're best just reading the files directly 3.
  • We cleanly separate concerns from parsing the file contents and massaging them into a unified shape from validating the primitive types and turning them into more complex, specific objects.
    • This has the added bonus that new formats can be implemented/tested outside of packaging or even custom formats.
  • We handle all 3 "levels" of validation that people have expressed an interest in.

For someone that wants the strict validation, they would ultimately call this like:

meta = Metadata.from_parsed(parse_email(content))

We could offer some shortcut methods, that do:

class Metadata:

    @classmethod
    def from_email(cls, content: str) -> Metadata:
        return cls.from_parsed(parse_email(content))

Which then gives people the option for the best of both worlds.

Footnotes

  1. This means that technically some malformed metadata fields will be unable to be read using these methods since they'll fail to parse and be omitted. We should strive to make our parsing as lenient as we reasonably can (for instance, the Project-URL example, make sure we work in cases where the URL or key is missing, the comma is missing, etc to produce something). If these lenient methods aren't lenient enough, then you'll have to implement it yourself (but the layering here means if your implementation emits a ParsedMetadata it can be used in the other methods).

  2. We could allow this; my biggest concern would be that we can't enforce any kind of typing in it which may be a footgun for consumers of this API. Maybe an option would be a _extra key that is just dict[str, Any] or have the parse_FORMAT return two values, ParsedMetadata and the left over items from the dict.

  3. This is because there is no such thing without massaging the data to have a return type that returns the same data for every serialization format. Formats like email are going to treat everything as strings, or list of strings, but a format like json or toml will support other primitive types, so they can't return str unless we cast them to string, which would be silly.

@pfmoore
Copy link
Member

pfmoore commented Jul 15, 2022

Thanks @dstufft - this is exactly the model I've been arguing for all along (and I agree, conversion to metadata should always raise - which I think is what @brettcannon might mean when he says "strict", and I think he believes I object to...)

I've been thinking some more about this, though, and I think that @brettcannon has a point when he says maybe the logic should depend on the input. From my point of view:

  1. The pyproject.toml format should always be validated in practice. I still think the "parse to unvalidated" -> "classmethod to construct with validation" separation is logically cleaner, but I don't see a use case for parsing pyproject.toml without validating, so in practice I don't see anyone using just one without the other. The pyproject format is modern, and anyone entering data in that form will be expecting a "modern" experience, including validation to the latests standards.
  2. The metadata.json format is a slightly odd one, because it's already designed for machine processing. In pkg_metadata I don't even need a parser for it, because my "unvalidated" format is the json-compatible dictionary. But if we go with @dstufft's idea of an unvalidated ParsedMetadata class, we'd need to convert. But either way, the only processing that adds value (IMO) is validating - the raw form loaded with json.load is sufficient for unvalidated processing. Which I think in @brettcannon's terms means that I support validating the JSON format...
  3. The killer is the email format. This is sufficiently tricky to process correctly that having a library that does it is immensely valuable. But it's been used for so long that there's a lot of data out there that is invalid1 by today's standards, and being able to read that is still important for backward compatibility if nothing else. So I think that separating raw parsing and validation is important for this case.

There's also another assumption that I've been making, which I think should be made explicit. The Metadata class itself should be validated. At the moment, it isn't - it has type hints, but there's no runtime validation on field content. And I don't think we should assume that everyone uses a type checker, or that it's OK to omit validation because fields have type annotations.

So, for example, I believe the following code:

m = Metadata("foo", Version("1.0"))
m.requires_python = 3.7

should raise an exception. I don't think there should be any choice here, the attribute should be validated on assignment. The same goes for m = Metadata("foo", "1.0", requires_python = 3.7) or m = Metadata("foo", "1.0", requires_python = "invalid_specifier).

This is important because there will be tools that want to accept direct user input for metadata fields, not from a file format, but from things like command line options (build_wheel --name foo --version 1.0 --requires_python 3.7). Those tools should not be able to construct incorrect Metadata objects by forgetting to validate their inputs (and as I said, we should not assume that all callers will run a type checker).

I'd also point out (although it's somewhat off topic here) that I believe that m = Metadata("foo", "1.0") should work. That is, I think that we should accept string values and convert them to the appropriate classes, using the constructor for that class. It's not just the convenience of omitting the constructor, it's the fact that you don't even need to import the class yourself. While I'd expect careful applications to do the conversion themselves (to control error handling), for adhoc scripts and use in the REPL, passing strings is just way more convenient2.

Footnotes

  1. One aspect of "unvalidated" here is that not all legacy email format input is encoded in UTF-8. So a parser should take care to accept arbitrary non-UTF-8 content in fields. This is impossible in full generality, but it is possible to support round-tripping, either by falling back to latin-1 (which can accept any byte value) or using the surrogateescape handler.

  2. This is somewhere I think type hints are lacking - what I want is to say "can be assigned anything that can be converted to a Version, but I'll do the conversion so the attribute itself is always a Version". Rust has the into and from traits to express this (if I understand them correctly) but Python can't do this easily.

@dstufft
Copy link
Member

dstufft commented Jul 15, 2022

I think that even for cases like the hypothetical parse_json function, there's still value in having that function over just json.loads, even if ultimately, it's a thin wrapper over json.loads. The biggest win I think we get from that is normalization into the ParsedMetadata dictionary.

The benefit of that is most obvious for the email parser, because the email format isn't sufficiently expressive enough to have all the primitive types we might use natively, so we need to do some extra parsing. However, the same could be true for a "good" serialization format like a hypothetical JSON.

Let's say for instance that ParsedMetadata decided that keywords should be set[str], JSON doesn't have a way to represent a set, so the parse_json function would translate the list[str] to set[str] (I'm assuming we'd make the sensible choice to represent a set as a list in json).

Another maybe less contrived example is that it's possible that we (purposely, or accidentally) end up with different key names in different serialization formats.

The underlying thing is that if we try to say that we don't need parse_json because json.loads() gets us something close to ParsedMetadata, we're basically saying that we expect every serialization format we might ever support to happen to emit things in exactly the same way, which I don't think is good assumption to make.

I also think that while obviously the biggest problem with "bad" metadata is going to be the email based fields because of our history here, it seems silly to assume that tools aren't going to have bugs that generate bad metadata that 10 years from now we'll end up wanting a lenient parser for, for all the same reasons we have today. Particularly since the cost of doing so seems to be pretty minimal? And since we'll want it for email metadata anyways, if we don't implement it that way, we're forced to be inconsistent in how we implement different serialization types.

I'm personally kind of sketch on the idea of using these APIs for pyproject.toml. My concern is that while some of the data there is the same, I think that conceptually pyproject.toml is a different (but related) thing all together, which I think is further supported by the fact that they're two different specifications on packaging.python.org and those specifications have different rules for validation 1.

If we are going to combine pyproject.toml and METADATA into a singular class, then I don't think that class can actually validate the metadata, because it needs to know what rules it's operating under in order to do that. Which I would probably expose two additional functions:

def validate_as_pyproject(meta: Metadata) -> bool:
    ...

def validate_as_core_metadata(meta: Metadata) -> bool:
   ...

Which implements the specific logic for the two different rules.... but really I think that pyproject.toml and the various METADATA flavors should just be sperate types.

An interesting thing about this all, is that we can add a method to Metadata that turns a Metadata instance into a ParsedMetadata instance, then have functions to emit that back into the underling serialization format. Doing that it may make more sense to rename ParsedMetadata to something like RawMetadata.

Putting this all together, my suggestion would be:

from typing import TypedDict

class RawMetadata:
    ...  # Typed so that everything is optional, and only using primitive types.

class Metadata:
    ...  # This is typed so that required things are required, optional things are optional.

    @classmethod
    def from_raw(cls, raw: RawMetadata) -> Metadata:
        ...

    @classmethod
    def from_email(cls, data: bytes) -> Metadata:
        raw, _leftover = parse_email(data)
        return cls.from_raw(raw)

    def as_raw(self) -> RawMetadata:
        ...

    def as_email(self) -> bytes:
        return emit_email(


def parse_email(data: bytes) -> (RawMetdata, dict[str, Any]):
    ...

def emit_email(raw: RawMetadata) -> bytes:
    ...

This gives us tons of flexibility, sensible layers to our implementation, and still easy to use functions for people to use when they don't care about those layers.

Footnotes

  1. For instance, pyproject.toml allows version to be specified as a dynamic value, but nowhere else is that actually allowed, and for practical reasons, if you've got a set of bytes that are a pyproject.toml, you don't have a good way to even know if it contains this metadata without first parsing it, since the [project] table is optional. So, my assertion here is that pyproject.toml is a different thing from METADATA and friends, one is the input, and one is the output, but they should re-use a lot of the same primitives (Version, etc).

@brettcannon
Copy link
Member Author

PyPI would want strict validation FWIW

It's worth a lot! 🙂

The way I'd implement this is that you have a variety of functions that read a file and turn it into a "standard" shaped dictionary with primitive keys / values.

That's what I was thinking for a time yesterday, and that's what I'm considering in the email header case. I don't want to make it universal, though, as other formats like pyproject.toml already have some things structured appropriately and having to redo it seems like a waste (e.g. author details).

  • If a field is missing, or otherwise unable to be read from the raw format (see next point) they are just omitted from the returned dictionary rather than generating some kind of error.

Yep, as that's how you get permissible/loose management of the data.

  • Part of papering over different serializations will be parsing types, for instance if Project-URL is chosen to be a dictionary, then pase_email will need to convert ["Key, URL"] into {"Key": "URL"}

That does go against what's in PEP 566.

What I'm more concerned about is any of the structured data in pyproject.toml which doesn't map out well to what METADATA would produce without parsing it. Is that clean information to be tossed out, converted to a string to be parsed again? Supported in multiple ways in the JSON dict and you isinstance() to know how it's structured?

  • Extra fields that we don't know about are omitted from the returned value
    2. We could allow this; my biggest concern would be that we can't enforce any kind of typing

Those keys would actually not even be typed as they wouldn't be represented by the TypedDict. As such, they can go in and you would have to actively work around the typing to get at them. Otherwise, as you said, they can get shoved somewhere that has very loose typing of str | list[str].

I don't think we should assume that everyone uses a type checker, or that it's OK to omit validation because fields have type annotations.

There would very likely be a validation step on output.

An interesting thing about this all, is that we can add a method to Metadata that turns a Metadata instance into a ParsedMetadata instance, then have functions to emit that back into the underling serialization format.

What I have in my head is email headers get converted to JSON, and then we work with JSON for Metadata. For pyproject.toml, it stays a separate concept for Metadata.

  • email headers -> JSON -> Metadata
  • JSON -> Metadata
  • pyproject.toml -> Metadata

Essentially we use JSON as the lossy input format and work with pyproject.toml directly as the strict input format.

@dstufft
Copy link
Member

dstufft commented Jul 15, 2022

That's what I was thinking for a time yesterday, and that's what I'm considering in the email header case. I don't want to make it universal, though, as other formats like pyproject.toml already have some things structured appropriately and having to redo it seems like a waste (e.g. author details).

I'm confused what you would be redoing if it's already structured appropriately?

If the result of json.loads() is already in the same format as RawMetadata, then your parse_json function basically just looks like this:

def parse_json(data: bytes) -> (RawMetadata, dict[str, Any]):
    raw: RawMetadata = {}
    leftover: dict[str, Any] = {}

    for key, value in json.loads(data).items():
        if key in KNOWN_FIELDS:
            raw[key] = value
        else:
            leftover[key] = value

    return raw, leftover

You're not redoing anything there, you're just spliting the keys up into raw or leftover, and if it's decided to just leave the unknown keys inside the RawMetadata instead of splitting them out, then your parse function is as simple as:

def parse_json(data: bytes) -> RawMetadata:
    return json.loads(data)

Is it kind of silly to have this extra function call? Yea slightly, but it provides better typing but more importantly it keeps things consistent. There are no special cases here. 10 years from now, if we all decide that JSON is lame and we want to use FSON, but FSON doesn't have quite the same capabilities as JSON, we can easily just drop in a new parse function, parse_fson, and the API still feels natural. We don't have to have this weird "convert everything to some python representation of deserialized json of what we thought would be easy to convert 20 year old data serialized as email headers into" lurking around forever.

More importantly, I think there's a fundamental assumption here that the input to json.loads() is "correct" because it's from a file called metadata.json (or that pyproject.toml is correct because it's called pyproject.toml), and in my example the parse_json functions would still be doing some very basic validation.

For example, this is probably more realistically what that function would look like:

 
def parse_json(data: bytes) -> (RawMetadata, dict[Any, Any]):
    raw: RawMetadata = {}
    leftover: dict[Any, Any] = {}

    parsed = json.loads(data)
    if not isinstance(parsed, dict):
        raise ValueError("data is not a json mapping")

    for key, value in parsed.items():
        if key in KNOWN_FIELDS and isinstance(value, KNOWN_FIELDS[key]):
            raw[key] = value
        else:
            leftover[key] = value

    return raw, leftover

That function will:

  • Give good error messages in the presence of things that aren't actually mappings being passed into it.
  • Will actually make sure that the RawMetadata returns is typed correctly, so that mypy will work better to prevent typing mismatches.
  • Will put any leftover data (including fields that weren't typed correctly) into a separate data structure that is also typed correctly (even though it's with Any, because we don't know anything else about it).

That does go against what's in PEP 566.

Well part of the idea here is that RawMetadata doesn't have to match what the on-disk format is. We're not writing something that you pass directly to json.loads(), it's our own format, and we can do what makes the most sense programmatically here.

RawMetadata isn't specified in a PEP after all, it's just an intermediate format that allows us to cleanly separate the concerns of getting data that is serialized in multiple possible formats (including future formats), from validation and constituting them as "full" objects.

What I'm more concerned about is any of the structured data in pyproject.toml which doesn't map out well to what METADATA would produce without parsing it. Is that clean information to be tossed out, converted to a string to be parsed again? Supported in multiple ways in the JSON dict and you isinstance() to know how it's structured?

As I mentioned above, I don't think pyproject.toml should be a concern here, because it's a different thing than METADATA and metadata.json, it has different validation rules and is just fundamentally an input that is passed into a build tool, that ultimately outputs a core metadata, and as part of that it's made up of pieces that are equivalent to core metadata, but it itself is a distinct thing.

That being said, pyproject.toml is a useful example for why I think "just use the JSON" is the wrong path to go here. It uses different data structures, different names, etc, from any of the other pieces of the core metadata spec, and as such you can't pass the result of toml.loads() and json.loads(), even if they are both 100% correct, into the same function without doing some massaging.

For example metadata.json has project_url: list[str], and depending on how you reference it, pyproject.toml either has a dict named project, with a key named urls: dict[str, str] or a key named urls: dict[str, str]. Either way, it's going to require massaging that data.

Now if you end up agreeing that pyproject.toml is a wholly different thing, that's great. But I still think this "parse into a RawMetadata" is something we need for every format, because there's nothing special about pyproject.toml in the above example (the reasons it should be a separate thing is higher level). It's just a serialization format that happened to map the underlying data in a slightly different way than the other serialization formats did, and unless we commit to the idea that every serialization format has to look exactly like metadata.json does today when naively deserialized 1, then every other serialization format is going to require massaging.

Which raises the questions:

  • Do we expose the intermediate format to end users so they can implement lenient parsing 2?
  • Why tie our intermediate format to metadata.json, just to special case JSON, and not even JSON, but specifically a "what's the minimal way we can cram email headers into JSON" format" when we can do better for basically no cost?
  • Why assume that the only bad data that we'll want lenient parsing for analyzation or to work in more scenarios we'll ever encounter is in the METADATA format? People are equally as good at producing bad JSON or bad TOML or bad any serialization format we pick.

Footnotes

  1. Which I think would be a shame, when defining how to serialize our metadata into a new serialization format, we should strive to produce an output that feels like it belongs in that serialization format. I think it's about as lame to define a JSON standard that feels like it's just email headers, but with different syntax, as it is to write Python, but like you're a Java programmer.

  2. For example, the pyproject.toml example that massaging still exists, it's just buried in the pyproject.toml -> Metadata function.

@brettcannon
Copy link
Member Author

I'm confused what you would be redoing if it's already structured appropriately?

Let's take author details as an example. In METADATA, it can come in via Author or Author-Email. Both are strings. Otherwise it's a bit messy how they will be structured. They can also only occur once.

Now consider authors from pyproject.toml. The name and email are already structurally separated. There can multiple entries.

How do you reconcile those two differences of the data's structure in your RawMetadata format that you would want to use as an intermediary format? Both can be put into RawMetadata without any potential failure in processing, but they are coming in from different structures where how they come into Metadata could be useful. If you lean into the greatest common denominator (GCD) approach, then I guess you could reformat authors into a format that makes sense for Author-Email, but that's obviously unfortunate if we have to go that way as Metadata structures author_emails into two-item tuples.

Now if you end up agreeing that pyproject.toml is a wholly different thing, that's great. But I still think this "parse into a RawMetadata" is something we need for every format, because there's nothing special about pyproject.toml in the above example (the reasons it should be a separate thing is higher level).

I think this is where I'm having a hard time understanding the balance of the complexity of what I would assume would need to be the greatest common denominator or have every which way to specify data from every format and just be a monster of an object (in terms of structure).

  • Why assume that the only bad data that we'll want lenient parsing for analyzation or to work in more scenarios we'll ever encounter is in the METADATA format? People are equally as good at producing bad JSON or bad TOML or bad any serialization format we pick.

It's not the only bad one, just the worst one. 😉

I think the thing we can all agree on is people have various uses for getting the data out of bytes and into a format that can be programmatically accessed without any data loss somehow.

What Donald and Paul seem to be advocating for is an intermediate format that seems to be some dict or data class that holds the data in the most flexible way possible to separate out the reading part from the parsing/processing part. I seem to be advocating for skipping the intermediate stage and going straight to Metadata with some way to let the caller know what data wasn't ultimately used. That seem like a fair assessment?

@dstufft
Copy link
Member

dstufft commented Jul 16, 2022

I think the thing we can all agree on is people have various uses for getting the data out of bytes and into a format that can be programmatically accessed without any data loss somehow.

I think I would reword this to with minimal data loss. I think ultimately the scale of nonsense garbage that can get shoved in these files is sufficiently unbounded that there's no way to write a general-purpose parser that has no data loss.

Let's take author details as an example. In METADATA, it can come in via Author or Author-Email. Both are strings. Otherwise it's a bit messy how they will be structured. They can also only occur once.

Now consider authors from pyproject.toml. The name and email are already structurally separated. There can multiple entries.

Ultimately you just pick something and run with it, I think?

The authors/maintainers field here is interesting because while pyproject.toml is an easier to understand data format, it's actually less powerful than what the core metadata allows.. which is fine for a restricted input format, but that means that pyproject.toml can't represent otherwise valid metadata 1.

If I were modeling that, I would represent it like this (completely untested) 2:

import tomllib

from email.parser import BytesParser
from email.headerregistry import AddressHeader
from typing import TypedDict, Optional, Union

class RawPerson:
    name: Optional[str]
    email: Optional[str]

# Core metadata spec supports the RFC From header, which has support
# for named groups of email addresses, it's "great".
#
# Maybe it would make more sense to ditch ``RawGroup``, and just add
# an optional "group" field on RawPerson that's just a string of the group
# name, if there is one?
class RawGroup:
    group: Optional[str]
    persons: list[RawPerson]

class RawMetadata:
    authors: Optional[list[Union[RawPerson, RawGroup]]]


def parse_email(data: bytes) -> tuple[RawMetadata, dict[Any, Any]]:
    raw: RawMetadata = {}
    leftover: dict[Any, Any] = {}

    parsed = BytesParser().parsebytes(data)

    for key, value in parsed.items():
        if key == "Author":
            # This sucks, but the core metadata spec doesn't really give us any
            # format that this value is in, so even though the example gives an
            # email, we can't really do anything about it and just have to treat
            # it as a string.
            raw.setdefault("authors", []).append(RawPerson({"name": value}))
        elif key = "Author-Email":
            # This API is gross, maybe there's a better one somewhere in the stdlib?
            address = {}
            try:
                AddressHeader.parse(value, address)
            except Exception:  # No idea what errors it can return, seems like a bunch of random ones
                leftover[key] = value
            else:
                authors = raw.setdefault("authors", [])
                for group in address.get("groups", []):
                    if group.display_name is None:
                        for addr in group.addresses:
                            p: RawPerson = {}
                            if addr.display_name:
                                p["name"] = addr.display_name
                            if addr.addr_spec:
                                p["email"] = addr.addr_spec
                            authors.append(p)
                    else:
                        rg = RawGroup({"group": group.display_name, "persons": [})
                        for addr in group.addresses:
                            p: RawPerson = {}
                            if addr.display_name:
                                p["name"] = addr.display_name
                            if addr.addr_spec:
                                p["email"] = addr.addr_spec
                            rg.persons.append(p)
                        authors.append(rg)

    return raw, leftover


# If we must...
def parse_pyproject(data: bytes) -> tuple[RawMetadata, dict[Any, Any]]:
    raw: RawMetadata = {}
    leftover: dict[Any, Any] = {}

    parsed = tomllib.loads(data)
    
    # Going to arbitrarily decide that the rest of the toml file doesn't
    # matter, just the projects key.
    if "projects" not in parsed:
        raise ValueError("No metadata")

    for key, value in parsed.items():
        if key == "authors":
            # An interesting question is whether a single entry in a list like this should fail
            # parsing the entire list, or should we be greedy and parse as much as possible?
            # 
            # I'm going to fail parsing the entire key, because if their is bad data, we don't know
            # what else is broken.
            authors = []
            invalid = False
            for entry in value:
                if isinstance(entry, dict) and ("name" in entry or "email" in entry):
                    p = RawPerson()
                    if "name" in entry:
                        p["name"] = entry["name"]
                    if "email" in entry:
                        p["email"] = entry["email"]
                    authors.append(p)
                else:
                    invalid = True
                    break

            if invalid:
                leftover["authors"] = value
            else:
                raw["authors"]

    return raw, leftover

Footnotes

  1. I swear I'm not trying to sound like a broken record here, but I really do think that we're going to save ourselves a lot of pain if we don't try to cram pypyroject.toml into the same box as the core metadata. It was intended for a different purpose than the core metadata, and it has different rules and it can't even fully represent everything that core metadata can (which is fine! It's designed as a more restrictive input format, not as a replacement for the core metadata).

  2. BTW, writing this in a GitHub comment instead of a real editor was a mistake. It's easy to forget how much a real editor does for you.

@dstufft
Copy link
Member

dstufft commented Jul 16, 2022

I'm going to sketch up a PR here, and see how things turn out!

@dstufft
Copy link
Member

dstufft commented Jul 16, 2022

Something that I noticed while implementing this:

PEP 566 says this about converting metadata into json:

It may be necessary to store metadata in a data structure which does not allow for multiple repeated keys, such as JSON.

The canonical method to transform metadata fields into such a data structure is as follows:

  1. The original key-value format should be read with email.parser.HeaderParser;
  2. All transformed keys should be reduced to lower case. Hyphens should be replaced with underscores, but otherwise should retain all other characters;
  3. The transformed value for any field marked with “(Multiple-use”) should be a single list containing all the original values for the given key;
  4. The Keywords field should be converted to a list by splitting the original value on whitespace characters;
  5. The message body, if present, should be set to the value of the description key.
  6. The result should be stored as a string-keyed dictionary.

However, the packaging specifications say this:

Although PEP 566 defined a way to transform metadata into a JSON-compatible dictionary, this is not yet used as a standard interchange format. The need for tools to work with years worth of existing packages makes it difficult to shift to a new format.

but more importantly, it says this about the keywords field:

A list of additional keywords, separated by commas, to be used to assist searching for the distribution in a larger catalog.

Note: The specification previously showed keywords separated by spaces, but distutils and setuptools implemented it with commas. These tools have been very widely used for many years, so it was easier to update the specification to match the de facto standard.

From what I can see, the specifications haven't explicitly altered how metadata.json is intended to be produced, so in theory the algorithm in PEP 566 is still canonical, but given keywords have been changed to be comma delimited, we should probably just ignore that part of the PEP and do what the specifications say to do for keywords.

@dstufft
Copy link
Member

dstufft commented Jul 16, 2022

Next fun fact!

You can't actually use BytesParser, BytesFeedParser, or message_from_bytes to parse METADATA files, because METADATA files are explicitly utf8, and all of those classes are implemented such that they actually just wrap their string based counter parts with a hard coded .decode("ascii", "surrogateescape") attached to the incoming bytes.

You can of course just decode the bytes yourself as utf8 and pass them into Parser etc, but then you lose leniency to be able to attempt to parse something out of those files.

@abravalheri
Copy link
Contributor

How about message_from_bytes(content, EmailMessage, EmailPolice(max_line_length=math.inf, utf8=True))?

@merwok
Copy link

merwok commented Jul 16, 2022

(s/Police/Policy/)

@dstufft
Copy link
Member

dstufft commented Jul 16, 2022

Nope.

The message_from_bytes function is a thin wrapper around ByteParser.parsebytes.

The ByteParser class is entirely implemented as a wrapper around FeedParser, with hard coded calls to decoding the bytes with ascii + surrogateescape.

@dstufft
Copy link
Member

dstufft commented Jul 16, 2022

Of course, if you don't care about being lenient, you can just do message_from_string(binary_content.decode("utf8"), policy=email.policy.compat32).

That will correctly parse the METADATA file as UTF8, but one of the things Paul mentioned wanting was the ability to not fail if the METADATA file is mojibaked, but the one header you need is in a valid encoding which you can't do if you're decoding the entire content at once.

@pfmoore
Copy link
Member

pfmoore commented Jul 16, 2022

pkg_metadata includes the dance you need to do to read bytes with arbitrarily encoded text, if you’re interested.

@pradyunsg
Copy link
Member

pradyunsg commented Jan 17, 2023

Passing in Message().attach(1234) should trigger that.

@brettcannon
Copy link
Member Author

Passing in Message().attach(1234) should trigger that.

But is there any way to do it via some METADATA format? If there isn't then maybe the code isn't necessary since the API only take strings or bytes?

@dstufft
Copy link
Member

dstufft commented Jan 17, 2023

Uhh, I probably added that because something on PyPI made it happen when I was testing it against PyPI data... but I don't remember what.

@dstufft
Copy link
Member

dstufft commented Jan 17, 2023

At least, I don't think I would have gone out of my way to do that on my own.

@FFY00
Copy link
Member

FFY00 commented Jan 18, 2023

Trying to serialize an object with a list payload, at least, doesn't work. It trips up in

https://github.com/python/cpython/blob/3ef9f6b508a8524f385cdc9fdd4b4afca0eac59b/Lib/email/generator.py#L238-L239

@brettcannon
Copy link
Member Author

#671 covers RawMetadata and parse_email().

@brettcannon
Copy link
Member Author

Raw metadata parsing just landed in main! Next is providing enriched/strict metadata.

@dstufft
Copy link
Member

dstufft commented Feb 1, 2023

🎉

@nilskattenbeck-bosch
Copy link

We can also ignore non-standard fields (e.g. License-Files).

If PEP 639 gets accepted this becomes a standard field

@brettcannon
Copy link
Member Author

@nilskattenbeck-bosch correct, but that hasn't happened yet (I should know, I'm the PEP delegate 😁). Once the field exists we will bump the max version for metadata and then update the code accordingly.

@nilskattenbeck-bosch
Copy link

May I also suggest adding a small information box to the documentation as to why the function is called parse_email. By now I read through the corresponding PEPs and specifications to understand that the metadata is serialized just like email headers and parsed using that module though at first it was really unnatural and felt like I was using the wrong function and that it would leak abstraction detail. If other parse_FORMAT will be introduced then this naming makes sense though having an assurance that this is the correct method and why it is named that way would be assuring.

@brettcannon
Copy link
Member Author

May I also suggest adding a small information box to the documentation as to why the function is called parse_email.

I would go even farther and say you can propose a PR to add such a note. 😁

brettcannon added a commit that referenced this issue Aug 18, 2023
Introduce the `packaging.metadata.Metadata` class which gives an object API similar to `packaging.metadata.RawMetadata` while returning "enriched" objects like `version.Version` where appropriate. It also supports validating the metadata.

Part of #570
@brettcannon
Copy link
Member Author

With my work for reading metadata now complete, I'm personally considering my work on this issue done. Hopefully someone else will feel motivated to do the "writing" bit, but as someone who has only written tools to consume metadata I don't think I'm in a good position to drive the writing part of this.

@abravalheri
Copy link
Contributor

abravalheri commented Aug 23, 2023

With my work for reading metadata now complete, I'm personally considering my work on this issue done.

Thank you very much @brettcannon for working on this.

Hopefully someone else will feel motivated to do the "writing" bit, but as someone who has only written tools to consume metadata I don't think I'm in a good position to drive the writing part of this.

There was at least one previous PR (#498, probably more) that tried to address the "writing" capabilities for metadata, however these were closed because probably they don't fit into the long term vision that the maintainers would like for the API/implementation (which is a very good thing to have in a project).

Would it be possible (for the sake of anyone that intends to contribute with the "writing" part) to have a clear guideline on how we should go about it? (This question is targeted at all packaging maintainers, not only Brett 😝).

I am just concerned that throwing PRs at packaging without a clear design goal/acceptance criteria will just result on PRs getting closed and work hours being lost.

@brettcannon
Copy link
Member Author

Would it be possible (for the sake of anyone that intends to contribute with the "writing" part) to have a clear guideline on how we should go about it?

I think the first question is what are the requirements for the feature to write metadata? Do you need to minimize diffs for METADATA output that you originally read from? Or should the output be consistent across tools (i.e., field order is hard-coded)?

The other question is whether tools will build up the data in a TypedDict via RawMetadata and then use Metadata to do the writing, or do you make it so you build up the metadata in Metadata itself? And if you do the latter, do you do validation as you go, or as a last step before creating the bytes? I will admit I totally punted on this one based on how I coded Metadata and it currently lends itself to building up via RawMetadata as the descriptor I used is a non-data descriptor and thus the caching does not lend itself to attribute assignment.

There's also how much you want to worry about older metadata versions? Do you let the user specify the metadata version you're going to write out and thus need to do fancier checks for what's allowed?

For me, I think consistency in output is more important than keeping pre-existing METADATA files from having a smaller diff (since long-term that will happen naturally). As for the API, I think that's up to people like you, @abravalheri , who author build back-ends since you will be the folks using any construction API. But what I will say is if you build up a RawMetadata dict then an API to generate the bytes is surprisingly straightforward to code up (define the descriptors in the order you want them written out, iterate over Metadata for all instances of _Validator, get the values, and then introspect on the results to know how to write out the format appropriately for strings vs lists vs dicts).

@dstufft
Copy link
Member

dstufft commented Sep 29, 2023

I'd also mention that everything I've done with METADATA has primarily been consuming them not emitting them, but my intuition is that it should have these properties:

  • Don't worry about diffs produced in round tripping, I suspect that use case is going to be extremely niche (but I could be wrong) and I wouldn't spend a lot of time worrying about it.
    • I think at the extreme edges of this, you basically need a special parser that preserves the original formatting, which the email parser doesn't have nor does any of the stdlib parsers, making it a much bigger deal to do.
  • I would have the write APIs mirror the same pattern as the read APIs, in that:
    • You can work with a Metadata object, and that object maintains the requirement that you can only get or set valid metadata to it. This Metadata object has a Metadata().to_raw() function that returns a RawMetadata.
    • You can work with a RawMetadata object, which is a dict at runtime or a TypedDict at type check time, and there is minimal validation here (basically just that primitive types match).
    • We have a write_email, function that takes a RawMetadata and returns the serialized bytes, probably also any leftover fields that it didn't know how to serialize.
    • We have Metadata().to_email() functions that act as a wrapper around Metadata().to_raw() and write_email.
  • I would let the user set a metadata version, but if they don't set one then use a default value.
    • Of course validating that the fields used exist in the given metadata version.
    • A good question is whether the default should be the latest version or if we should attempt to do something like the minimal version that contains the fields that are in use.
      • Personally I think we should do the latest version, but with the caveat that maybe whenever a brand new version is released, we should hold off on setting the default to that for some period of time to give tools time to catch up? But maybe we would have to implement version detection then incase someone uses a new version, so it might be simpler/better to just always emit the latest by default.

@dstufft
Copy link
Member

dstufft commented Sep 29, 2023

Oh ugh, emitting probably also has to answer the newlines in fields question, and possibly the other issues I raised with the spec earlier up thread.

@dstufft
Copy link
Member

dstufft commented Oct 9, 2023

Just a note, I have a PR up now to Warehouse (pypi/warehouse#14718) that switches our metadata validation on upload from a custom validation routine implemented using wtforms to packaging.metadata.

The split between Metadata and RawMetadata was super useful, since we (currently) have the metadata handed to Warehouse as a multipart form data on the request, I was able to make a custom shim to generate a RawMetadata from that form data, and then just pass that into the Metadata.from_raw().

I'm still giving it a through set of manual testing, but so far it looks like other than a few bugs/issues that fell out (#733, #735) integrating it was relatively painless. Of course the real test will be when it goes live and that wide array of nonsense that people upload starts flowing through it.

We're not yet using the actual parsing of METADATA files yet (though that is planned), so this is strictly just the validation aspect of the Metadata class that we're currently using.

@pradyunsg
Copy link
Member

pradyunsg commented Oct 9, 2023

Do you think it might make sense to keep both the old code and the new code paths running for a bit on warehouses' end, with the results of the old code path being returned?

i.e.

def upload(...):
    try:
        new_result = new_upload(...)
    except Exception:
        new_result = ...
    old_result = old_upload_handler(...)
    if new_result != old_result:
       log.somehow(old_result, new_result)
    return old_result

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants