Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADR0006: Where to implement model serialization #1270

Merged
merged 1 commit into from
Jul 8, 2021

Conversation

lukpueh
Copy link
Member

@lukpueh lukpueh commented Jan 29, 2021

Fixes #-
Related to #1139 (Implements Option 5 of this ADR)

Description of the changes being introduced by the pull request:
Add decision record about the design of de/serialization between TUF metadata class model and tuf wire line metadata formats.

Please verify and check that the pull request fulfills the following
requirements
:

  • The code follows the Code Style Guidelines
  • Tests have been added for the bug fix or new feature
  • Docs have been added for the bug fix or new feature

@MVrachev
Copy link
Collaborator

MVrachev commented Jan 29, 2021

I will leave some thoughts here for the different options:

  1. De/serialization on metadata classes:

About the first bad point: do we plan to add support for additional metadata types additional than JSON?
Is there a user that asked for that and do we consider switching?
For the second bad point I agree it's a bit complicated. When I wrote my first pr for the new api I had a couple of questions about the behavior of the de/serialization methods. Still, I am not sure how to simplify it...
With the current API we are just supporting many stuff (which all are meaningful?) :

  • the possibility to add support for additional file formats (by providing for example from_yaml which calls from_dict())
  • custom filenames for the files representing the roles, that's why we have to check the _type field in from_dict()
  • additional helpful methods to_json, to_json_file, from_json, from_json_file.

I agree with the good points. From an OOP perspective, it feels for me this is the best solution.
Sidenote: it seems from_json and from_json_file are the public methods to use for deserialization, then it's good to mark from_dict as a private (with _ for example), internal method. The same applies to other internal methods as well. If we chose that approach for the adr, I can create an issue/pr for this.

  1. De/serialization on metadata subclasses

About the bad point: this seems to add not trivial complecity to the model. If I understand the idea correctly, we would need to add the following classes JsonMetadata, JsonRoot, JsonTargets, etc.? This seems an overkill to support possible custom parsers...

  1. De/serialization separated

About the bad point: At first, this seemed like a reasonable option, but then I realized there are two cons:

  • the smaller one: as you mentioned "re-implement" the entire class hierarchy
  • what I am most concerned about: the procedural implementation we have to provide. I don't have any bad feelings about Trishank or tuf-on-plane, but at least for me, it feels harder to read, follow and understand what's going on. From an OOP perspective, this option seems not right to me.

Summary: Overall I feel like the first option we have now is the best. It's the most logical, fits in my mental model and the deserialization functions don't worry about types and finding out what kind of JSON they are deserializing. It's unclear to me how many of our users would want custom file formats or do we considering supporting other file formats and that's why it's not the top priority from my perspective.
What I would like is to discuss in a different issue is how we can simplify this model more? Is there a way? Do we want to be so extendable and support all of the above points I mentioned?

@trishankatdatadog
Copy link
Member

Clearly, I'm for the third option.

In my experience building tuf-on-a-plane, it was very helpful to separate the two concerns, as I could: (1) cleanly separate the concerns and reason about them (a bug in one did not have to do with the other), and (2) easily extend custom parsers/decoders/deserializers without changing anything about the intermediate Metadata representation.

The problem with the first option is that both the (de)serialization and Metadata code are all in one place, and it does not help to read or reason about them at the same time. I argue that they are both conceptually and practically separate.

I should also note that whatever we decide (e.g., global key databases instead of functionally local key lookup tables) does end up influencing other implementations. The easier we make it for other developers to reason about how TUF works, the better. So if that means a better separation of the concerns, then we should do that.

Anyway, that's just my 0.02 BTC (even though it's something I feel relatively strong about). What do others think?

@joshuagl
Copy link
Member

joshuagl commented Feb 1, 2021

I should also note that whatever we decide (e.g., global key databases instead of functionally local key lookup tables) does end up influencing other implementations. The easier we make it for other developers to reason about how TUF works, the better. So if that means a better separation of the concerns, then we should do that.

This is something we're seeing over and over as more implementations of TUF are being developed. Let's make "The easier we make it for other developers to reason about how TUF works, the better." our maintainer mantra.

@trishankatdatadog
Copy link
Member

This is something we're seeing over and over as more implementations of TUF are being developed. Let's make "The easier we make it for other developers to reason about how TUF works, the better." our maintainer mantra.

Thanks, Joshua. Another concern I'd like to us address is removing the optics that TUF is officially tied exclusively to canonical JSON as data transport, so if separating the two concerns (parsing vs models) helps us do that, then all the better.

@joshuagl
Copy link
Member

joshuagl commented Feb 2, 2021

I've been thinking about this a for a few days while reviewing the current tuf.api.metadata implementation. As @trishankatdatadog indicated, a primary goal for any metadata model in the reference implementation should be recognisability – that the model represents what is described by the specification, and shared on the wire.

I also feel like that the reference implementation has some other goals:

  • be easy to understand (and copy) – the reference implementation is a model for other implementations
  • concise, yet extensible – the reference implementation should be easily extensible, with a simple core that can easily be adapted to a variety of update systems

Given those two things, I feel like proposal 4 in the ADR is not the way to go. As I understand it, it would necessitate a separate serialisation implementation for formats that aren't part of the default implementation. This would mean that 1) it is harder for people to implement a new serialisation method – many will want to start by copying, then modifying, the default mechanism and that won't work if JSON serialisation is baked in, and 2) that there's a separate codepath for the non-default serialisation mechanism – which may lead to subtle issues around serialisation to non-default formats.

Given the goals proposed above, I am in favour of a generic and abstract serialisation/deserialisation interface which is separate from the metadata model, extensible without directly modifying the code of python-tuf, and shared by the upstream and any downstream serialisation/deserialisation implementations.

@lukpueh
Copy link
Member Author

lukpueh commented Feb 17, 2021

Thanks for your comments, everyone! Based on the discussion here I have implemented option 3 in #1279.

I do, however, have another option, or rather a variant of option 3, to offer. See option 5 in the just updated ADR, which I have generally fleshed out a bit.

Note that #1279 is already crafted in such a way that we can easily switch to option 5 by chopping off the last commit (b786bc0).

@MVrachev
Copy link
Collaborator

Thanks for your comments, everyone! Based on the discussion here I have implemented option 3 in #1279.

I do, however, have another option, or rather a variant of option 3, to offer. See option 5 in the just updated ADR, which I have generally fleshed out a bit.

Note that #1279 is already crafted in such a way that we can easily switch to option 5 by chopping off the last commit (b786bc0).

I had a look into option 5 and read the code without commit b786bc0.
I see the benefits of this approach and totally agree with the second one.
It makes sense to instantiate new objects inside a member
function instead of a separate function.

@jku
Copy link
Member

jku commented Feb 22, 2021

At first option 5 felt like a bit of smoke-and-mirrors to hide the fact that metadata still implements the json de/serialization... but on second look it doesn't bother me much. The potential issues might be:

  • maybe in the future the de/serialization is not actually possible using the public metadata API and we fail to notice because we do it inside the metadata classes
  • it's not really a 1:1 example for the hypothetical developer who is implementing another non-json de/serializer

but I think the advantages you listed are more important than those issues (let's not optimize too much for the hypothetical case)... So LGTM.


I did try to come up with a clever-er way of separating to_dict()/from_dict() from the actual metadata just a little bit. E.g. some "JsonMixin" classes that lived with the actual Metadata classes but were still clearly separate. I could not come up with anything cleaner than this version.

3. De/serialization separated
4. Compromise 1: Default json de/serialization on metadata classes,
non-default de/serialization separated
5. Compromise 2: Dictitionary conversion for default json de/serialization on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
5. Compromise 2: Dictitionary conversion for default json de/serialization on
5. Compromise 2: Dictionary conversion for default json de/serialization on

@lukpueh
Copy link
Member Author

lukpueh commented Feb 22, 2021

  • maybe in the future the de/serialization is not actually possible using the public metadata API and we fail to notice because we do it inside the metadata classes

I think that's only an issue in option 4. In option 5 we still serialize via the external serializer and thus don't have two different code paths. Or do I miss something?

  • it's not really a 1:1 example for the hypothetical developer who is implementing another non-json de/serializer

You are right, I even briefly had that listed as con for option 5, but removed it because it felt so weak. The inherit-from-abstract-serializer scaffolding can be copy-pasted from the default json de/serializer, and adopting the to/from_dict method stack seemed like a fair ask.

@trishankatdatadog
Copy link
Member

I'm not sure I'm convinced about Option 5. Can someone explain why it's conceptually easier to go from a high-level Metadata object to an intermediate-level dictionary object and then to a low-level serialization format such as JSON or XML? Why is the intermediate step necessary?

@lukpueh
Copy link
Member Author

lukpueh commented Feb 22, 2021

What about the arguments in the ADR, @trishankatdatadog?

@jku
Copy link
Member

jku commented Feb 22, 2021

  • maybe in the future the de/serialization is not actually possible using the public metadata API and we fail to notice because we do it inside the metadata classes

I think that's only an issue in option 4. In option 5 we still serialize via the external serializer and thus don't have two different code paths. Or do I miss something?

I just mean if we later add an internal function in Metadata and use it in to_dict()... this internal function is now really part of serialization code but not available to someone doing it outside of Metadata. I don't think this is something we should worry about here.

Maybe related point: The way metadata.py refers to JsonDicts in a lot of places does make it a little confusing: are they really json or just very specific dictionaries? I'm not sure if I can tell how separate metadata.py and json are at this point (maybe worth documenting things like Metadata.signature & sign() return value and Root.__init__() arguments at some point)

@trishankatdatadog
Copy link
Member

What about the arguments in the ADR, @trishankatdatadog?

Sorry, I'm not sure I fully follow. Why is it easier for third-party developers to write decoders from JSON/XML to Python dictionaries instead of more natural, higher-level Metadata objects? What are we saving here? I must be missing something.

@jku
Copy link
Member

jku commented Feb 22, 2021

Sorry, I'm not sure I fully follow. Why is it easier for third-party developers to write decoders from JSON/XML to Python dictionaries instead of more natural, higher-level Metadata objects? What are we saving here? I must be missing something.

I don't think the idea was for 3rd parties to to use to_dict/from_dict: they are just helpers for the default de/serializer -- it's code that could be in the de/serializer but becames a bit simpler when you can just include it in metadata. The status and purpose of these functions could maybe be better documented.

@lukpueh
Copy link
Member Author

lukpueh commented Feb 22, 2021

Sorry, I'm not sure I fully follow. Why is it easier for third-party developers to write decoders from JSON/XML to Python dictionaries instead of more natural, higher-level Metadata objects? What are we saving here? I must be missing something.

JSON to dictionary is really easy:

import json
json.loads(data)

For formats that don't have libraries with builtin dict de/serialization support it doesn't make sense to use a dictionary intermediary, but you also don't have to.

@lukpueh
Copy link
Member Author

lukpueh commented Feb 22, 2021

I just mean if we later add an internal function in Metadata and use it in to_dict()... this internal function is now really part of serialization code but not available to someone doing it outside of Metadata.

Hm. I if it were all implemented in JsonSerializer.serialize it wouldn't be available to a different serializer to start with.

The way metadata.py refers to JsonDicts in a lot of places does make it a little confusing: are they really json or just very specific dictionaries? I'm not sure if I can tell how separate metadata.py and json are at this point (maybe worth documenting things like Metadata.signature & sign() return value and Root.init() arguments at some point)

Agreed, we should remove references to JSON in metadata.py. Removing the usage of JsonDict as type hint is documented in #1140 (comment).

@trishankatdatadog
Copy link
Member

I think I'm a bit lost here. There are too many options, and I've got too many things going on to properly switch context. Would appreciate a quick 1/2 hour meeting to quickly discuss options. Especially w/o quick code to glance at, I can't immediately tell what's going on, sorry.

@lukpueh
Copy link
Member Author

lukpueh commented Feb 23, 2021

@trishankatdatadog, did you take a look at #1279, which implements your preferred option 3, as suggested above? Its PR description explains the alternative option 5 including pros and cons, and it even points to quick code to glance at that shows the diff between option 3 and 5 (see b786bc0).

I think reading the PR description and skimming the diff shouldn't take longer than half an hour and give you a good idea what's going on here. Let me know if you still need a meeting afterwards.

@trishankatdatadog
Copy link
Member

I think reading the PR description and skimming the diff shouldn't take longer than half an hour and give you a good idea what's going on here. Let me know if you still need a meeting afterwards.

It's good. I read the code, and I like how Option 3 is implemented, thanks. What still I don't understand is the difference in Option 5. Is there any good reason to expose those utility from/to dictionary methods in the base Metadata class? Would other (de)serializers benefit from it?

@lukpueh
Copy link
Member Author

lukpueh commented Feb 25, 2021

Thanks for the feedback, @trishankatdatadog! Regarding your question whether other de/serializers benefit from from/to_dict, yes they do. As a matter of fact, #1279 implements different De/Serializers for JSON and for canonical JSON, which are each just thin wrappers around from/to_dict. And I can very well imagine that for other formats there also exist libraries that make it easy to de/serialize from/to generic dictionaries.

Although it doesn't make a big difference whether the from/to_dict functions are available as methods on the metadata classes or in a separate public module, I do think that they are more easily found on the corresponding class, e.g. when using a Python interpreter with auto-completion, and also because users might expect classes to have such methods (see 3rd-party libraries such as attrs, pydantic, marshmallow, or built-in vars()/__dict__ or dict()/__iter__ properties and methods that provide default and customizable dict representation for custom objects.

So one argument is that it seems idiomatic to implement these functions as methods on the class, and the other argument is that the code seems a bit better structured when each method is scoped within the corresponding class, which in turn should make maintaining/modifying/extending the class model easier.

@trishankatdatadog
Copy link
Member

Do what you think is best, Lukas and friends. I trust you will make the right decision. From my point of view, what is critical is that developers can easily write new (de)serializers targeting the same abstract, high-level metadata interface, and not worry about dictionaries and JSON and so on.

lukpueh added a commit to lukpueh/tuf that referenced this pull request Mar 3, 2021
Revert an earlier commit that moved to/from_dict metadata class
model methods to a util module of the serialization sub-package.

We keep to/from_dict methods on the metadata classes because:
- It seems **idiomatic** (see e.g. 3rd-party libaries such as attrs,
pydantic, marshmallow, or built-ins that provide default or
customizable dict representation for higher-level objects).
The idiomatic choice should make usage more intuitive.
- It feels better **structured** when each method is encapsulated
within the corresponding class, which in turn should make
maintaining/modifying/extending the class model easier.
- It allows us to remove function-scope imports (see subsequent
commit).

Caveat:
Now that "the meat" of the sub-packaged JSON serializer is
implemented on the class, it might make it harder to create a
non-dict based serializer by copy-paste-amending the JSON
serializer.

However, the benefits from above seem to outweigh the disadvantage.

See option 5 of ADR0006 for further details (theupdateframework#1270).

Signed-off-by: Lukas Puehringer <lukas.puehringer@nyu.edu>
lukpueh added a commit to lukpueh/tuf that referenced this pull request Mar 4, 2021
Revert an earlier commit that moved to/from_dict metadata class
model methods to a util module of the serialization sub-package.

We keep to/from_dict methods on the metadata classes because:
- It seems **idiomatic** (see e.g. 3rd-party libaries such as attrs,
pydantic, marshmallow, or built-ins that provide default or
customizable dict representation for higher-level objects).
The idiomatic choice should make usage more intuitive.
- It feels better **structured** when each method is encapsulated
within the corresponding class, which in turn should make
maintaining/modifying/extending the class model easier.
- It allows us to remove function-scope imports (see subsequent
commit).

Caveat:
Now that "the meat" of the sub-packaged JSON serializer is
implemented on the class, it might make it harder to create a
non-dict based serializer by copy-paste-amending the JSON
serializer.

However, the benefits from above seem to outweigh the disadvantage.

See option 5 of ADR0006 for further details (theupdateframework#1270).

Signed-off-by: Lukas Puehringer <lukas.puehringer@nyu.edu>
Add decision record about the design of de/serialization between
TUF metadata class model and wire line metadata formats.

Chosen option: Serialization and class model are decoupled, but the
class model provides conversion helper methods.

Signed-off-by: Lukas Puehringer <lukas.puehringer@nyu.edu>
@lukpueh lukpueh changed the title ADR0006: Where to implement model serialization (WIP) ADR0006: Where to implement model serialization Mar 18, 2021
@lukpueh lukpueh marked this pull request as ready for review March 18, 2021 09:58
@MVrachev
Copy link
Collaborator

MVrachev commented Apr 14, 2021

I think the decision is already made - we have implemented model (de)serialization by merging #1279.
My opinion is that for now, it works fine, but if we want to change our approach we can update this ADR or create a new ADR overwriting this one. In order to merge this pr, we have to update it with our decision.

What do you think @jku?

Copy link
Member

@joshuagl joshuagl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks to @MVrachev for following up on this long-lived PR both here and offline. We have been moving ahead with option 5 (aka compromise 2) since the merge of #1279

Therefore it makes sense to merge this ADR documenting the approach and the other options we considered.

@joshuagl joshuagl merged commit 885fcac into theupdateframework:develop Jul 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants