Validation on serialization #615

Open
fungs opened this issue Dec 14, 2023 · 5 comments

Comments

fungs commented Dec 14, 2023

Description

I want to be sure that data which is serialized and transferred is really valid. Currently, the constraints are only checked when decoding. To work around this, one can of course decode the data on the sender side before sending it. While this works for small objects, it creates a computational burden for very large objects.

Wouldn't it be possible to run the very same constraint checks at serialization time (on demand)? In my understanding, it would create the same small overhead that it currently does on the receiver side. Otherwise, is there a way to manually call a validation on the data?

FHU-yezi commented Dec 14, 2023

Maybe related to #513?

We could validate the data when the struct object is created.

fungs commented Dec 15, 2023

Thanks @FHU-yezi for the linked issue. I've read through it, and it is definitely related.

Let me try to explain a little further for everyone to understand this request.

IMO there are architectural and practical differences depending on who performs the validation and when. The goal should be to guarantee that a data structure was validated and not modified before serialization.

Strategy 1: In-type validation (variant validate-frozen-on-construction)

I really like this concept because it merges the notions of type and constraint. The distinction between the two is, in my eyes, just an artifact of how computer systems commonly define and handle data types, mostly related to hardware architecture. However, to guarantee that the data is valid all the way until serialization, we must either write-protect it effectively (i.e. frozen objects) or revalidate after every possible modification. The former is difficult in Python due to its dynamic nature. The latter requires rewriting or wrapping a type together with all its write-enabled methods, and even its accessible members.

An example of this approach is Pydantic's NonNegativeInt type. If the type invariant says "I cannot be invalid", all is fine. I'd go for this approach in appropriate programming languages, but not in Python, where it would be really hard for anyone to write custom types.
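
To illustrate why write-protection is hard to guarantee in Python, here is a small standard-library sketch. The `NonNegative` class is a hypothetical stand-in and not how Pydantic implements `NonNegativeInt`. It validates on construction and is frozen, yet the invariant can still be bypassed.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class NonNegative:
    # Hypothetical validate-on-construction type.
    value: int

    def __post_init__(self) -> None:
        if not isinstance(self.value, int) or self.value < 0:
            raise ValueError(f"value must be a non-negative int, got {self.value!r}")


number = NonNegative(3)

# Ordinary assignment is blocked by the frozen dataclass...
try:
    number.value = -1
except FrozenInstanceError:
    print("assignment blocked")

# ...but the protection is only a convention and can be sidestepped.
object.__setattr__(number, "value", -1)
print(number.value)  # the invariant is now silently broken
```

So a frozen object only shows that nobody modified it by accident, not that it is still valid when it reaches the serializer.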

Strategy 2: Lazy validation (serialization)

If we cannot guarantee a validated state or safeguard the object from modification during processing, the logical option is to defer validation to the time of serialization, thus circumventing the problem. To me, this also makes sense because the serialization routine usually needs to touch and re-encode every single item in the data structure anyway, so validation would add only linear time. It's important to note that the validation needs to be type-informed, just like the serialization: both require deep knowledge about the semantics and structure of the type being processed.

In msgspec, validation is only applied for the back-transform (decoding). In this case, it doesn't really matter how it is done, because the full pipeline is implemented in msgspec itself. I assume that, for efficiency reasons, msgspec validates on deserialization in C code, once the final data type objects are constructed in the chain.
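
As a rough, library-agnostic sketch of strategy 2, the encoder below checks each field in the same pass in which it serializes it. Everything here is an assumption for illustration: the `Reading` model, the `"ge"` metadata key and `encode_validated` are made-up names, and only flat fields annotated with plain classes are handled. msgspec does not offer this on its encode path.

```python
import json
from dataclasses import dataclass, field, fields


@dataclass
class Reading:
    # Hypothetical mutable model; being valid at construction proves nothing later.
    sensor: str
    value: int = field(metadata={"ge": 0})


def encode_validated(obj) -> bytes:
    """Validate and serialize in a single pass over the fields."""
    out = {}
    for f in fields(obj):
        item = getattr(obj, f.name)
        # Type-informed check: the declared field type drives validation.
        if not isinstance(item, f.type):
            raise TypeError(f"{f.name}: expected {f.type.__name__}, got {type(item).__name__}")
        lower = f.metadata.get("ge")
        if lower is not None and item < lower:
            raise ValueError(f"{f.name}: expected >= {lower}, got {item!r}")
        out[f.name] = item
    return json.dumps(out).encode()


reading = Reading(sensor="a", value=1)
print(encode_validated(reading))

reading.value = -5  # modified somewhere between construction and serialization
try:
    encode_validated(reading)
except ValueError as exc:
    print(f"rejected at serialization time: {exc}")
```

Since the encoder has to visit every field anyway, the check costs one comparison per item rather than a second traversal.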

Architecture

So why don't we just validate on instantiation and protect the data by code ownership until serialization?

The answer is software architecture. The data types in these kinds of frameworks (see attrs, pydantic, dataclass etc.) serve two different purposes: defining data models and interfaces, and creating and working with objects easily and efficiently. So when building a standalone serialization layer for specific data, with a matching interface, the objects are constructed outside of it, maybe in a mutable version, maybe much earlier in the data processing pipeline, in custom code or in a different Python package, but relying on the very same interface definition. Thus, we cannot assume that all passed objects comply with the definition expected by the receiver.

That being said, if the struct constructor mentioned in #513 accepts an object of the same type with zero copy and can validate all the members, this would be equivalent to a simple validate(data) call to be run right before serialization (although probably less efficient than validation and serialization in the same procedure).

fungs commented Dec 15, 2023

#614 is inspired by the same architectural considerations.

@FHU-yezi

@fungs made a really good point.

For the strategy 2 he mentioned, there is another use case to consider: what if the struct is never serialized?

In my case, the struct object is used directly by the user's code, and it exists only for autocompletion and type checking. The user will never serialize it unless they want to store it somewhere else.

In that case, if we don't support validation on init, the struct definition may differ from the real data, which will lead to misunderstandings.

fungs commented Dec 21, 2023

This seems to be a well-structured approach to strategy 1: https://smarie.github.io/python-vtypes/

It might be compatible with msgspec, but I still need to test that.
