Pydantic-based type checking #179
Conversation
I did a first pass of the PR here. I will return to it after we discuss some lengthier comments I had.
Overall, I do think we should push forward with this PR as the csp typing system has been giving me some headaches lately as well. Due to the constraint of backwards compatibility, I don't think the pydantic-based system is any simpler or more maintainable. However, it is more performant and more widely recognized by external contributors, so those are important pros.
There are really two different areas of improvement with respect to the type system:
1. Making the type checking more standardized and performant. This PR achieves that.
2. Removing confusing custom type logic that exists in csp, i.e. Numpy array typing, non-standard container annotations, etc. This PR does not seek to change that, due to backwards compatibility.
I think the plan of attack should be to enable Pydantic checking so that when we eventually (possibly?) do a major version release (csp 1.0) we can also achieve 2) using Pydantic. Then the logic will be standardized, understandable, and performant.
```diff
@@ -53,7 +53,11 @@ def __new__(cls, *args, **kwargs):
     kwargs = {k: v if not isTsBasket(v) else OutputBasket(v) for k, v in kwargs.items()}

     # stash for convenience later
-    kwargs["__annotations__"] = kwargs
+    kwargs["__annotations__"] = kwargs.copy()
     try:
```
Shouldn't we only use pydantic if the env variable `USE_PYDANTIC` is `True`? Even if the user has `pydantic>2` installed in their environment, the first cut should have them opt in to it explicitly.
Both are possible, but this block doesn't change behavior for end users and is closer to the end state, i.e. where csp requires `pydantic>2` and this code is executed all the time. The advantage of having it like this is that all the existing unit tests will check the code in CI/CD, because pydantic is listed as a dev dependency. Making it depend on the env variable would require a whole separate set of tests, only for them to be deleted in the next step.
```python
            if issubclass(value_type, self._source_type):
                return value_type
        except TypeError:
            # So that List[float] validates as list
```
We could hit some weird cases with Numpy array typing here. For example, the origin type of `NumpyNDArray[str]` is `np.ndarray`, but we can't validate the former as the latter due to our (incorrect) default of `np.ndarray = NumpyNDArray[float]`. If we swap annotations beforehand though, so that `np.ndarray` is substituted, you can ignore this comment.
Yeah, agree. I was hoping to tackle all the weirdness around numpy array typing as a separate step/PR (possibly also including your suggestions around combining the 1D/ND stuff and adjusting the parquet adapters).
```python
    def revalidate(self, model):
        """Once tvars have been resolved, need to revalidate input values against resolved tvars"""
        # Determine the fields that need to be revalidated because of tvar resolution
        # At the moment, that's only int fields that need to be converted to float
```
A little confused by the complexity here: can't we just `cast_int_to_float` before passing them to the consumer? We don't want to change the original tstype anyway, as that caused issues (#181).
This is potentially more generic (though perhaps too generic), i.e. the old type checking logic assumes that the only upcasting you are doing is int to float, but there are other corner cases, and this is safer. For example, what about np.float64 or np.float32 to float?
There are really no promises about what the `UpcastRegistry` will do (and I didn't change it), so we shouldn't be using knowledge of its implementation in the pydantic type resolver. Ultimately, if the `UpcastRegistry` says A and B both upcast to C, then we need to revalidate A and B as C to be safe.
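The revalidation idea can be sketched with plain pydantic, independent of csp's `UpcastRegistry` (the scenario and names below are illustrative, not csp's actual API):

```python
from pydantic import TypeAdapter

# Suppose TVar resolution decided that a field annotated with TVar "T"
# resolved to float, while its value was originally validated as an int.
# Revalidating the value against the resolved type performs the upcast:
resolved = TypeAdapter(float).validate_python(1)
print(resolved, type(resolved).__name__)  # 1.0 float
```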
Please read the details below.
I have run all unit tests locally with and without pydantic type checking enabled, so the changes are fully compatible (though exception messages have changed).
Open Issues
The main open issues are
Motivation
Pydantic is the most widely used data validation library for Python. I wanted to leverage it to do the type checking that csp had custom implementations for, in order to, e.g., use the `validate_call` decorator for validation, etc.

In the end, I've probably added as much code as could be removed, but in my opinion it's more compartmentalized and easier to extend. Furthermore, performance is slightly-to-significantly better (baskets in particular are notably improved), and other ways of building graphs with type checking are supported as intended.
Challenges
The main challenge is csp's handling of "template variables", which is pretty unique, i.e. having the type var/forward ref resolved based on the input arguments at runtime. This is handled by introducing a custom validation context (`TVarValidationContext`) and leveraging some of the existing code to resolve conflicts.

Examples
Need to run this before running any of the examples:
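(The snippet itself was not captured here; based on the `CSP_PYDANTIC` environment variable described under Implementation Details, it is presumably along the lines of the following sketch — the exact value to set is an assumption.)

```shell
# Opt in to the pydantic-based type checking before importing csp.
# CSP_PYDANTIC is the env variable this PR introduces; "1" as its value is an assumption.
export CSP_PYDANTIC=1
```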
Graphs (but not nodes) can now take baskets of baskets as inputs (which was not possible before).
Graphs can take custom pydantic models that include edge types as attributes (useful for grouping together time series of different underlying types)
Graphs can now also take Union of ts types as input (not yet as outputs)
Additional types can now be validated as static arguments, i.e. `Callable` (though pydantic only performs a simple check that the argument is callable; no validation of arguments, their types, or the return type is performed). See also https://docs.pydantic.dev/latest/api/standard_library_types/
The pydantic validation decorator can be applied to csp types if only type validation is required:
Future work
Note that items 2)-4) would be breaking changes.
Implementation Details
The implementation consists of the following pieces:
- `TsType`, `Outputs`, `OutputBasket`, etc. were extended with `__get_pydantic_core_schema__` implementations so that pydantic validation can apply to them. This is enough to enable the use of pydantic models and the pydantic validator with ts types. The complex validation logic for `TsType` is delegated to `TsTypeValidator` (which is a combination of glorified "is subtype" logic and handling of TVars - see below).
- New annotation types `CspTypeVar` and `CspTypeVarType` are introduced, as the TVars are neither ForwardRefs nor TypeVars.
- An `adjust_annotations` function is implemented to adjust the "standard" csp type annotations into fully compliant pydantic annotations.
- When the `CSP_PYDANTIC` env variable is set, input checking in signature.py and output checking in graph.py use the input/output model for validation, instead of the existing logic.
- A custom validation context (`TVarValidationContext`) is introduced, which tracks the different resolutions and resolves conflicts, using logic that is nearly identical to the existing implementation (but more generic where possible). This context can maintain state between the validation calls to the different arguments (and sub-arguments for nested structures). In particular, the validation of `CspTypeVar` and `CspTypeVarType` interacts specifically with this validation context. This context is instantiated and passed to the model validation step above. The final step in model validation is to resolve all the detected TVars and to revalidate any fields that have changed type as a result (i.e. int->float).

Un-scientific profiling