
Pydantic-based type checking #179

Open — wants to merge 3 commits into base: main

Conversation
@ptomecek (Collaborator) commented Apr 1, 2024

Please read the notes below.
I have run all unit tests locally with and without pydantic type checking enabled, so the changes are fully compatible (though exception messages have changed).

Open Issues
The main open issues are

  1. Need to agree on how to enable it and roll it out. It's currently done through an env variable, so that the existing behavior is the default, and pydantic type checking (and the import dependency) is opt-in. Here's what I suggest:
  • Merge these changes (with no pydantic runtime dependency)
  • Bump version to 0.8 (higher than internal version)
  • Warn users that we will soon have a pydantic 2 dependency (in case they are still on pydantic v1)
  • After some time (TBD), enable pydantic type checking by default if pydantic is importable.
  • After some time (TBD), add the pydantic 2 runtime dependency, and make it the default
  • After some time (TBD), remove the option to do type checking the old way and delete the old code
  2. Graph instantiation passes the original (unvalidated) arguments to the underlying function call, rather than the validated ones. While this doesn't cause any existing tests to fail, I want to change this to more fully take advantage of the validation that Pydantic can provide.

Motivation
Pydantic is the most widely used data validation library for Python. I wanted to leverage it to do the type checking that csp had custom implementations for, in order to

  1. Reduce the amount of custom code while improving extensibility and modularization
  2. Improve performance
  3. Allow for other ways of building graphs (e.g. pydantic models that contain edges, use of pydantic's validate_call decorator for validation, etc.)
  4. Fix existing issues (such as #181: "Casting of ts[int] to ts[float] as part of type checking will mutate input baskets in-place, which can lead to downstream errors")

In the end, I've probably added as much code as could be removed, but in my opinion it's more compartmentalized and easier to extend. Furthermore, performance is slightly to significantly better (baskets in particular are notably improved), and other ways of building graphs with type checking are supported as intended.

Challenges
The main challenge is csp's handling of "template variables", which is pretty unique, i.e. to have the type var/forward ref resolved based on the input arguments at runtime. This is handled by introducing a custom validation context (TVarValidationContext) and leveraging some of the existing code to resolve conflicts.
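To illustrate the idea (this is a hypothetical sketch, not csp's implementation): each argument that references a template variable contributes a candidate type resolved from its runtime value, and conflicting candidates are reconciled by upcasting.

```python
# Hypothetical sketch of TVar resolution (not csp's code): each argument
# annotated with a template variable like "T" contributes a candidate
# type; conflicts are resolved through a tiny upcast rule.

def resolve_tvars(annotations, args):
    """Resolve template-variable names (e.g. 'T') to concrete types."""

    def upcast(a, b):
        # Minimal upcast table: int and float jointly resolve to float
        if a is b:
            return a
        if {a, b} == {int, float}:
            return float
        raise TypeError(f"conflicting resolutions for TVar: {a} vs {b}")

    resolved = {}
    for name, annotation in annotations.items():
        if isinstance(annotation, str):  # a template variable like "T"
            candidate = type(args[name])
            if annotation in resolved:
                resolved[annotation] = upcast(resolved[annotation], candidate)
            else:
                resolved[annotation] = candidate
    return resolved

# 'T' is seen as int for x and float for y, so it resolves to float
print(resolve_tvars({"x": "T", "y": "T"}, {"x": 1, "y": 2.0}))
```

The real implementation additionally has to track these resolutions across nested containers and revalidate values after resolution, which is what the TVarValidationContext described below is for.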

Examples
Run this before any of the examples:

```python
import os

os.environ["CSP_PYDANTIC"] = "1"

import csp
from csp import ts
from typing import Callable, Dict, Union
```

Graphs (but not nodes) can now take baskets of baskets as inputs (which was not possible before).

```python
@csp.graph
def foo(x: Dict[str, Dict[str, ts[int]]]) -> ts[bool]:
    return csp.const(True)

foo({"x": {"Y": csp.const(0)}})
```

Graphs can take custom pydantic models that include edge types as attributes (useful for grouping together time series of different underlying types):

```python
from pydantic import BaseModel

class MyBundle(BaseModel):
    x: ts[str]
    y: ts[float]
    z: str = ""

@csp.graph
def f(bundle: MyBundle) -> ts[str]:
    return csp.sample(bundle.y, bundle.x)

f(MyBundle(x=csp.const("foo"), y=csp.const(1.0)))
```

Graphs can now also take a Union of ts types as input (not yet as output):

```python
@csp.graph
def foo(x: Union[ts[float], ts[str]]):
    pass

foo(csp.const("x"))
```

Additional types, such as Callable, can now be validated as static arguments (pydantic only performs a simple check that the argument is callable; it does not validate the argument types or the return type):

```python
from typing import Callable

@csp.graph
def foo(f: Callable[[float], float], x: ts[float]) -> ts[float]:
    return csp.apply(x, f, float)

foo(lambda x: x, csp.const(1.0))
```

See also https://docs.pydantic.dev/latest/api/standard_library_types/

The pydantic validation decorator can be applied to csp types if only type validation is required:

```python
from pydantic import validate_call

@validate_call(validate_return=True)
def foo(a: str, b: ts[float], c: Dict[str, ts[int]]) -> csp.Outputs(x=ts[float], y=Dict[str, ts[int]]):
    return {"x": b, "y": c}

foo("x", csp.const(0.0), {"A": csp.const(1)})
```

Future work

  1. Make the USE_PYDANTIC flag the default if pydantic>2 is found in the environment
  2. Do not allow ts to be None by default - enforce use of Optional
  3. Remove support for type hints that are not standard python (e.g. [int] instead of list[int] or List[int])
  4. Force dynamic baskets to be declared through DynamicBasket[K,V] type rather than Dict[ts[K], ts[V]] type
  5. Better support for return of Union outputs, especially for csp.stats.
  6. Make csp structs more pydantic compatible (by adding validators/serializers within the pydantic framework, without changing the internal representation).

Note that items 2–4 would be breaking changes.
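The non-standard shorthand mentioned in item 3 could, for illustration, be normalized into standard generic aliases along these lines (a hypothetical helper, not csp code; the dict form shown is an assumption rather than csp's exact shorthand):

```python
# Hypothetical normalizer for bracket shorthand such as [int] meaning
# list[int]; illustrative only.

def normalize_hint(hint):
    """Rewrite bracket shorthand into standard generic aliases."""
    if isinstance(hint, list) and len(hint) == 1:
        return list[normalize_hint(hint[0])]
    if isinstance(hint, dict) and len(hint) == 1:
        (key, value), = hint.items()
        return dict[normalize_hint(key), normalize_hint(value)]
    return hint

print(normalize_hint([int]))           # list[int]
print(normalize_hint({str: [float]}))  # dict[str, list[float]]
```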

Implementation Details
The implementation consists of the following pieces:

  1. Existing types (TsType, Outputs, OutputBasket, etc.) were extended with __get_pydantic_core_schema__ implementations so that pydantic validation can apply to them. This is enough to enable the use of pydantic models and the pydantic validator with ts types. The complex validation logic for TsType is delegated to TsTypeValidator (which is a combination of glorified "is subtype" logic and handling of TVars; see below).
  2. To support the csp TVar logic, new Pydantic types with special handling are introduced: CspTypeVar and CspTypeVarType, as the TVars are neither ForwardRefs nor TypeVars.
  3. To support the existing csp type checking behavior, dynamic baskets, and TVar resolution, an adjust_annotations function is implemented to adjust the "standard" csp type annotations into fully compliant pydantic annotations.
  4. The signature of each node is extended to dynamically create a pydantic model for the inputs and outputs based on the adjusted annotations.
  5. If the CSP_PYDANTIC env variable is set, input checking in signature.py and output checking in graph.py use the input/output model for validation, instead of the existing logic.
  6. To fully handle the TVar logic, a custom validation context (TVarValidationContext) is introduced, which tracks the different resolutions and resolves conflicts using logic that is nearly identical to the existing implementation (but more generic where possible). This context maintains state between the validation calls for the different arguments (and sub-arguments for nested structures); in particular, the validation of CspTypeVar and CspTypeVarType interacts with this context. The context is instantiated and passed to the model validation step above. The final step in model validation is to resolve all the detected TVars and to revalidate any fields whose type changed as a result (i.e. int -> float).
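As a rough illustration of the annotation-adjustment step in item 3 (all names here are hypothetical; csp's real adjust_annotations handles many more cases, such as Callable signatures and dynamic baskets), one can walk an annotation tree and replace template-variable strings with marker objects that a validator can recognize later:

```python
# Rough sketch of annotation adjustment; string annotations stand in for
# TVars, and builtin generic aliases (dict[...], list[...]) are used
# because they keep string arguments as-is.
from typing import get_args, get_origin


class CspTypeVarMarker:
    """Stand-in for csp's CspTypeVar: records the template-variable name."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"CspTypeVarMarker({self.name!r})"


def adjust_annotation(annotation):
    """Recursively rewrite 'T'-style strings into marker objects."""
    if isinstance(annotation, str):
        return CspTypeVarMarker(annotation)
    origin, args = get_origin(annotation), get_args(annotation)
    if origin is None or not args:
        return annotation
    # Rebuild the generic alias with adjusted arguments
    return origin[tuple(adjust_annotation(a) for a in args)]


print(adjust_annotation(dict[str, dict[str, "T"]]))
```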

Un-scientific profiling

```python
import os

os.environ["CSP_PYDANTIC"] = "1"

from typing import Dict, List

import csp
from csp import ts

@csp.graph
def bar(x: Dict[str, List[float]]) -> ts[bool]:
    return csp.const(True)

inp_bar = {f"sym_{i}": list(range(100)) for i in range(1000)}

@csp.graph
def baz(x: Dict[str, ts[int]]) -> ts[bool]:
    return csp.const(True)

inp_baz = {f"key{i}": csp.const(i) for i in range(1000)}

@csp.graph
def qux(x: Dict[str, ts[List[float]]]) -> ts[bool]:
    return csp.const(True)

inp_qux = {f"key{i}": csp.const.using(T=List[float])([]) for i in range(1000)}
```

[Profiling results were attached as images in the original PR.]
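The snippet above defines the graphs and inputs but not the timing loop; a crude harness along these lines (illustrative, not from the PR) could be used to produce comparable numbers, e.g. `time_call(bar, inp_bar)` once csp is installed:

```python
# Un-scientific timing harness: average wall-clock seconds per call for
# any callable. A stand-in workload is used so the sketch runs without csp.
import timeit


def time_call(fn, *args, number=100):
    """Return the average seconds per call over `number` invocations."""
    total = timeit.timeit(lambda: fn(*args), number=number)
    return total / number


def square(x):  # stand-in workload
    return x * x


avg = time_call(square, 12)
print(f"{avg:.2e} s/call")
```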

@ptomecek changed the title from "[WIP] First attempt at implementing pydantic-based type checking" to "[WIP] Pydantic-based type checking" on Apr 2, 2024
@timkpaine added the "tag: wip" label (PRs that are a work in progress - converted to drafts) on Apr 5, 2024
@ptomecek force-pushed the pit/pydantic_validation branch 2 times, most recently from 1ff6a1f to 216dc7a on April 13, 2024
@ptomecek force-pushed the pit/pydantic_validation branch 4 times, most recently from abeb644 to 28553fa on May 13, 2024
@ptomecek marked this pull request as ready for review on May 13, 2024
@ptomecek changed the title from "[WIP] Pydantic-based type checking" to "Pydantic-based type checking" on May 13, 2024
@ptomecek added the "type: enhancement" and "lang: python" labels and removed the "tag: wip" label on May 13, 2024
@ptomecek force-pushed the pit/pydantic_validation branch 9 times, most recently from 3e214c6 to 0bb9f25 on May 21, 2024
…CSP_PYDANTIC environment variable.

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@AdamGlustein (Collaborator) left a comment:
I did a first pass of the PR here. I will return to it after we discuss some lengthier comments I had.

Overall, I do think we should push forward with this PR as the csp typing system has been giving me some headaches lately as well. Due to the constraint of backwards compatibility, I don't think the pydantic-based system is any simpler or more maintainable. However, it is more performant and more widely recognized by external contributors, so those are important pros.

There are really two different areas of improvement with respect to the type system:

  1. Making the type checking more standardized and performant. This PR achieves that.
  2. Removing confusing custom type logic that exists in csp, e.g. Numpy array typing, non-standard container annotations, etc. This PR does not seek to change that, due to backwards compatibility.

I think the plan of attack should be to enable Pydantic checking so that when we eventually (possibly?) do a major version release (csp 1.0) we can also achieve 2) using Pydantic. Then the logic will be standardized, understandable and performant.

```diff
@@ -53,7 +53,11 @@ def __new__(cls, *args, **kwargs):
         kwargs = {k: v if not isTsBasket(v) else OutputBasket(v) for k, v in kwargs.items()}

         # stash for convenience later
-        kwargs["__annotations__"] = kwargs
+        kwargs["__annotations__"] = kwargs.copy()
         try:
```
A collaborator commented:
Shouldn't we only use pydantic if the env variable USE_PYDANTIC is True?
Even if the user has pydantic>2 installed in their environment the first cut should have them opt-in to it explicitly.

@ptomecek (Author) replied:
Both are possible, but this block doesn't change behavior for end users and is closer to the end state, i.e. where csp requires pydantic>2 and this code is executed all the time. The advantage of having it like this is that all the existing unit tests will check the code in CI/CD, because pydantic is listed as a dev dependency. By making it depend on the env variable, a whole separate set of tests is needed, only for them to be deleted in the next step.

(Resolved review threads on csp/impl/types/common_definitions.py, csp/impl/types/pydantic_type_resolver.py, and csp/impl/types/typing_utils.py.)
```python
        if issubclass(value_type, self._source_type):
            return value_type
    except TypeError:
        # So that List[float] validates as list
```
A collaborator commented:
We could hit some weird cases with Numpy array typing here. For example, the origin type of NumpyNDArray[str] is np.ndarray but we can't validate the former as the latter due to our (incorrect) default of np.ndarray = NumpyNDArray[float]. If we swap annotations beforehand though so np.ndarray is substituted you can ignore this comment.

@ptomecek (Author) replied:
Yeah, agree. I was hoping to tackle all the weirdness around numpy array typing as a separate step/PR (possibly also including your suggestions around combining the 1D/ND stuff and adjusting the parquet adapters).

```python
    def revalidate(self, model):
        """Once tvars have been resolved, need to revalidate input values against resolved tvars"""
        # Determine the fields that need to be revalidated because of tvar resolution
        # At the moment, that's only int fields that need to be converted to float
```
A collaborator commented:
A little confused by the complexity here: can't we just cast_int_to_float before passing them to the consumer? We don't want to change the original tstype anyway, as that caused issue #181.

@ptomecek (Author) replied:
This is potentially more generic (though perhaps too generic): the old type checking logic assumes that the only upcasting you are doing is int to float, but there are other corner cases, and this is safer. For example, what about np.float64 or np.float32 to float?
There are no real promises about what the UpcastRegistry will do (and I didn't change it), so we shouldn't be using knowledge of its implementation in the pydantic type resolver. Ultimately, if the UpcastRegistry says A and B both upcast to C, then we need to revalidate A and B as C to be safe.
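The rule described here ("if the UpcastRegistry says A and B both upcast to C, revalidate both as C") can be sketched generically. The registry shape below is a hypothetical pairwise table, not csp's UpcastRegistry API:

```python
# Generic sketch of the revalidation rule: fold the resolved types
# through a pairwise upcast table to find the common type C.

def common_upcast(types, registry):
    """Return the single type that all entries upcast to, or raise."""
    result = None
    for t in types:
        if result is None or result is t:
            result = t
            continue
        upcast = registry.get((result, t)) or registry.get((t, result))
        if upcast is None:
            raise TypeError(f"no common upcast for {result} and {t}")
        result = upcast
    return result

registry = {(int, float): float}
print(common_upcast([int, float, int], registry))  # <class 'float'>
```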

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
Labels
lang: python Issues and PRs related to the Python codebase type: enhancement Issues and PRs related to improvements to existing features
3 participants