
Pydantic-based type checking #179

Open — wants to merge 3 commits into base: main

Conversation
@ptomecek (Collaborator) commented Apr 1, 2024

Please read the notes below.
I have run all unit tests locally with and without pydantic type checking enabled, so the changes are fully compatible (though exception messages have changed).

Open Issues
The main open issues are

  1. Need to agree on how to enable it and roll it out. It's currently done through an env variable, so that the existing behavior is the default, and pydantic type checking (and the import dependency) is opt-in. Here's what I suggest:
  • Merge these changes (with no pydantic runtime dependency)
  • Bump version to 0.8 (higher than internal version)
  • Warn users that we will soon have a pydantic 2 dependency (in case they are still on pydantic v1)
  • After some time (TBD), enable pydantic type checking by default if pydantic is importable.
  • After some time (TBD), add the pydantic 2 runtime dependency, and make it the default
  • After some time (TBD), remove the option to do type checking the old way and delete the old code
  2. Graph instantiation passes the original (unvalidated) arguments to the underlying function call, rather than the validated ones. While this doesn't cause any existing tests to fail, I want to change this to more fully take advantage of the validation that Pydantic can provide.

Motivation
Pydantic is the most widely used data validation library for Python. I wanted to leverage it to do the type checking that csp had custom implementations for, in order to

  1. Reduce the amount of custom code while improving extensibility and modularization
  2. Improve performance
  3. Allow for other ways of building graphs (e.g. pydantic models that contain edges, use of pydantic's validate_call decorator for validation, etc.)
  4. Fix existing issues (such as #181: "Casting of ts[int] to ts[float] as part of type checking will mutate input baskets in-place, which can lead to downstream errors")

In the end, I've probably added as much code as could be removed, but in my opinion it's more compartmentalized and easier to extend. Furthermore, performance is slightly to significantly better (baskets in particular are notably improved), and other ways of building graphs with type checking are supported as intended.

Challenges
The main challenge is csp's handling of "template variables", which is pretty unique, i.e. to have the type var/forward ref resolved based on the input arguments at runtime. This is handled by introducing a custom validation context (TVarValidationContext) and leveraging some of the existing code to resolve conflicts.
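To illustrate the idea (this is a hypothetical sketch, not csp's implementation): each argument that references a template variable contributes a candidate type resolved from its runtime value, and conflicting candidates are reconciled by upcasting.

```python
# Hypothetical sketch of TVar resolution (not csp's code): each argument
# annotated with a template variable like "T" contributes a candidate
# type; conflicts are resolved through a tiny upcast rule.

def resolve_tvars(annotations, args):
    """Resolve template-variable names (e.g. 'T') to concrete types."""

    def upcast(a, b):
        # Minimal upcast table: int and float jointly resolve to float
        if a is b:
            return a
        if {a, b} == {int, float}:
            return float
        raise TypeError(f"conflicting resolutions for TVar: {a} vs {b}")

    resolved = {}
    for name, annotation in annotations.items():
        if isinstance(annotation, str):  # a template variable like "T"
            candidate = type(args[name])
            if annotation in resolved:
                resolved[annotation] = upcast(resolved[annotation], candidate)
            else:
                resolved[annotation] = candidate
    return resolved

# 'T' is seen as int for x and float for y, so it resolves to float
print(resolve_tvars({"x": "T", "y": "T"}, {"x": 1, "y": 2.0}))
```

The real implementation additionally has to track these resolutions across nested containers and revalidate values after resolution, which is what the TVarValidationContext described below is for.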

Examples
Run this before any of the examples:

```python
import os

os.environ["CSP_PYDANTIC"] = "1"

import csp
from csp import ts
from typing import Callable, Dict, Union
```

Graphs (but not nodes) can now take baskets of baskets as inputs (which was not possible before).

```python
@csp.graph
def foo(x: Dict[str, Dict[str, ts[int]]]) -> ts[bool]:
    return csp.const(True)

foo({"x": {"Y": csp.const(0)}})
```

Graphs can take custom pydantic models that include edge types as attributes (useful for grouping together time series of different underlying types):

```python
from pydantic import BaseModel

class MyBundle(BaseModel):
    x: ts[str]
    y: ts[float]
    z: str = ""

@csp.graph
def f(bundle: MyBundle) -> ts[str]:
    return csp.sample(bundle.y, bundle.x)

f(MyBundle(x=csp.const("foo"), y=csp.const(1.0)))
```

Graphs can now also take a Union of ts types as input (not yet as output):

```python
@csp.graph
def foo(x: Union[ts[float], ts[str]]):
    pass

foo(csp.const("x"))
```

Additional types, such as Callable, can now be validated as static arguments (pydantic only performs a simple check that the argument is callable; it does not validate the argument types or the return type):

```python
from typing import Callable

@csp.graph
def foo(f: Callable[[float], float], x: ts[float]) -> ts[float]:
    return csp.apply(x, f, float)

foo(lambda x: x, csp.const(1.0))
```

See also https://docs.pydantic.dev/latest/api/standard_library_types/

The pydantic validation decorator can be applied to csp types if only type validation is required:

```python
from pydantic import validate_call

@validate_call(validate_return=True)
def foo(a: str, b: ts[float], c: Dict[str, ts[int]]) -> csp.Outputs(x=ts[float], y=Dict[str, ts[int]]):
    return {"x": b, "y": c}

foo("x", csp.const(0.0), {"A": csp.const(1)})
```

Future work

  1. Make the USE_PYDANTIC flag the default if pydantic>2 is found in the environment
  2. Do not allow ts to be None by default - enforce use of Optional
  3. Remove support for type hints that are not standard python (e.g. [int] instead of list[int] or List[int])
  4. Force dynamic baskets to be declared through DynamicBasket[K,V] type rather than Dict[ts[K], ts[V]] type
  5. Better support for return of Union outputs, especially for csp.stats.
  6. Make csp structs more pydantic compatible (by adding validators/serializers within the pydantic framework, without changing the internal representation).

Note that items 2–4 would be breaking changes.
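The non-standard shorthand mentioned in item 3 could, for illustration, be normalized into standard generic aliases along these lines (a hypothetical helper, not csp code; the dict form shown is an assumption rather than csp's exact shorthand):

```python
# Hypothetical normalizer for bracket shorthand such as [int] meaning
# list[int]; illustrative only.

def normalize_hint(hint):
    """Rewrite bracket shorthand into standard generic aliases."""
    if isinstance(hint, list) and len(hint) == 1:
        return list[normalize_hint(hint[0])]
    if isinstance(hint, dict) and len(hint) == 1:
        (key, value), = hint.items()
        return dict[normalize_hint(key), normalize_hint(value)]
    return hint

print(normalize_hint([int]))           # list[int]
print(normalize_hint({str: [float]}))  # dict[str, list[float]]
```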

Implementation Details
The implementation consists of the following pieces:

  1. Existing types (TsType, Outputs, OutputBasket, etc.) were extended with __get_pydantic_core_schema__ implementations so that pydantic validation can apply to them. This is enough to enable the use of pydantic models and the pydantic validator with ts types. The complex validation logic for TsType is delegated to TsTypeValidator (which is a combination of glorified "is subtype" logic and handling of TVars; see below).
  2. To support the csp TVar logic, new Pydantic types with special handling are introduced: CspTypeVar and CspTypeVarType, as the TVars are neither ForwardRefs nor TypeVars.
  3. To support the existing csp type checking behavior, dynamic baskets, and TVar resolution, an adjust_annotations function is implemented to adjust the "standard" csp type annotations into fully compliant pydantic annotations.
  4. The signature of each node is extended to dynamically create a pydantic model for the inputs and outputs based on the adjusted annotations.
  5. If the CSP_PYDANTIC env variable is set, input checking in signature.py and output checking in graph.py use the input/output model for validation, instead of the existing logic.
  6. To fully handle the TVar logic, a custom validation context (TVarValidationContext) is introduced, which tracks the different resolutions and resolves conflicts using logic that is nearly identical to the existing implementation (but more generic where possible). This context maintains state between the validation calls for the different arguments (and sub-arguments for nested structures); in particular, the validation of CspTypeVar and CspTypeVarType interacts with this context. The context is instantiated and passed to the model validation step above. The final step in model validation is to resolve all the detected TVars and to revalidate any fields whose type changed as a result (i.e. int -> float).
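As a rough illustration of the annotation-adjustment step in item 3 (all names here are hypothetical; csp's real adjust_annotations handles many more cases, such as Callable signatures and dynamic baskets), one can walk an annotation tree and replace template-variable strings with marker objects that a validator can recognize later:

```python
# Rough sketch of annotation adjustment; string annotations stand in for
# TVars, and builtin generic aliases (dict[...], list[...]) are used
# because they keep string arguments as-is.
from typing import get_args, get_origin


class CspTypeVarMarker:
    """Stand-in for csp's CspTypeVar: records the template-variable name."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"CspTypeVarMarker({self.name!r})"


def adjust_annotation(annotation):
    """Recursively rewrite 'T'-style strings into marker objects."""
    if isinstance(annotation, str):
        return CspTypeVarMarker(annotation)
    origin, args = get_origin(annotation), get_args(annotation)
    if origin is None or not args:
        return annotation
    # Rebuild the generic alias with adjusted arguments
    return origin[tuple(adjust_annotation(a) for a in args)]


print(adjust_annotation(dict[str, dict[str, "T"]]))
```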

Un-scientific profiling

```python
import os

os.environ["CSP_PYDANTIC"] = "1"

from typing import Dict, List

import csp
from csp import ts

@csp.graph
def bar(x: Dict[str, List[float]]) -> ts[bool]:
    return csp.const(True)

inp_bar = {f"sym_{i}": list(range(100)) for i in range(1000)}

@csp.graph
def baz(x: Dict[str, ts[int]]) -> ts[bool]:
    return csp.const(True)

inp_baz = {f"key{i}": csp.const(i) for i in range(1000)}

@csp.graph
def qux(x: Dict[str, ts[List[float]]]) -> ts[bool]:
    return csp.const(True)

inp_qux = {f"key{i}": csp.const.using(T=List[float])([]) for i in range(1000)}
```

[Profiling results were attached as images in the original PR.]
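The snippet above defines the graphs and inputs but not the timing loop; a crude harness along these lines (illustrative, not from the PR) could be used to produce comparable numbers, e.g. `time_call(bar, inp_bar)` once csp is installed:

```python
# Un-scientific timing harness: average wall-clock seconds per call for
# any callable. A stand-in workload is used so the sketch runs without csp.
import timeit


def time_call(fn, *args, number=100):
    """Return the average seconds per call over `number` invocations."""
    total = timeit.timeit(lambda: fn(*args), number=number)
    return total / number


def square(x):  # stand-in workload
    return x * x


avg = time_call(square, 12)
print(f"{avg:.2e} s/call")
```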

@ptomecek changed the title from "[WIP] First attempt at implementing pydantic-based type checking" to "[WIP] Pydantic-based type checking" on Apr 2, 2024
@timkpaine added the "tag: wip" label (PRs that are a work in progress - converted to drafts) on Apr 5, 2024
@ptomecek force-pushed the pit/pydantic_validation branch 2 times, most recently from 1ff6a1f to 216dc7a on April 13, 2024
@ptomecek force-pushed the pit/pydantic_validation branch 4 times, most recently from abeb644 to 28553fa on May 13, 2024
@ptomecek marked this pull request as ready for review on May 13, 2024
@ptomecek changed the title from "[WIP] Pydantic-based type checking" to "Pydantic-based type checking" on May 13, 2024
@ptomecek added the "type: enhancement" and "lang: python" labels and removed the "tag: wip" label on May 13, 2024
@ptomecek force-pushed the pit/pydantic_validation branch 9 times, most recently from 3e214c6 to 0bb9f25 on May 21, 2024
…CSP_PYDANTIC environment variable.

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@AdamGlustein (Collaborator) left a comment:
I did a first pass of the PR here. I will return to it after we discuss some lengthier comments I had.

Overall, I do think we should push forward with this PR as the csp typing system has been giving me some headaches lately as well. Due to the constraint of backwards compatibility, I don't think the pydantic-based system is any simpler or more maintainable. However, it is more performant and more widely recognized by external contributors, so those are important pros.

There are really two different areas of improvement with respect to the type system:

  1. Making the type checking more standardized and performant. This PR achieves that.
  2. Removing confusing custom type logic that exists in csp, e.g. Numpy array typing, non-standard container annotations, etc. This PR does not seek to change that, due to backwards compatibility.

I think the plan of attack should be to enable Pydantic checking so that when we eventually (possibly?) do a major version release (csp 1.0) we can also achieve 2) using Pydantic. Then the logic will be standardized, understandable and performant.

```diff
@@ -53,7 +53,11 @@ def __new__(cls, *args, **kwargs):
         kwargs = {k: v if not isTsBasket(v) else OutputBasket(v) for k, v in kwargs.items()}

         # stash for convenience later
-        kwargs["__annotations__"] = kwargs
+        kwargs["__annotations__"] = kwargs.copy()
         try:
```
A collaborator commented:
Shouldn't we only use pydantic if the env variable USE_PYDANTIC is True?
Even if the user has pydantic>2 installed in their environment the first cut should have them opt-in to it explicitly.

@ptomecek (Author) replied:
Both are possible, but this block doesn't change behavior for end users and is closer to the end state, i.e. where csp requires pydantic>2 and this code is executed all the time. The advantage of having it like this is that all the existing unit tests will check the code in CI/CD, because pydantic is listed as a dev dependency. By making it depend on the env variable, a whole separate set of tests is needed, only for them to be deleted in the next step.

(Resolved review threads on csp/impl/types/common_definitions.py, csp/impl/types/pydantic_type_resolver.py, and csp/impl/types/typing_utils.py.)
```python
        if issubclass(value_type, self._source_type):
            return value_type
    except TypeError:
        # So that List[float] validates as list
```
A collaborator commented:
We could hit some weird cases with Numpy array typing here. For example, the origin type of NumpyNDArray[str] is np.ndarray but we can't validate the former as the latter due to our (incorrect) default of np.ndarray = NumpyNDArray[float]. If we swap annotations beforehand though so np.ndarray is substituted you can ignore this comment.

@ptomecek (Author) replied:
Yeah, agree. I was hoping to tackle all the weirdness around numpy array typing as a separate step/PR (possibly also including your suggestions around combining the 1D/ND stuff and adjusting the parquet adapters).

```python
    def revalidate(self, model):
        """Once tvars have been resolved, need to revalidate input values against resolved tvars"""
        # Determine the fields that need to be revalidated because of tvar resolution
        # At the moment, that's only int fields that need to be converted to float
```
A collaborator commented:
A little confused by the complexity here: can't we just cast_int_to_float before passing them to the consumer? We don't want to change the original tstype anyway, as that caused issue #181.

@ptomecek (Author) replied:
This is potentially more generic (though perhaps too generic): the old type checking logic assumes that the only upcasting you are doing is int to float, but there are other corner cases, and this is safer. For example, what about np.float64 or np.float32 to float?
There are no real promises about what the UpcastRegistry will do (and I didn't change it), so we shouldn't be using knowledge of its implementation in the pydantic type resolver. Ultimately, if the UpcastRegistry says A and B both upcast to C, then we need to revalidate A and B as C to be safe.
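The rule described here ("if the UpcastRegistry says A and B both upcast to C, revalidate both as C") can be sketched generically. The registry shape below is a hypothetical pairwise table, not csp's UpcastRegistry API:

```python
# Generic sketch of the revalidation rule: fold the resolved types
# through a pairwise upcast table to find the common type C.

def common_upcast(types, registry):
    """Return the single type that all entries upcast to, or raise."""
    result = None
    for t in types:
        if result is None or result is t:
            result = t
            continue
        upcast = registry.get((result, t)) or registry.get((t, result))
        if upcast is None:
            raise TypeError(f"no common upcast for {result} and {t}")
        result = upcast
    return result

registry = {(int, float): float}
print(common_upcast([int, float, int], registry))  # <class 'float'>
```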

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
Labels
lang: python Issues and PRs related to the Python codebase type: enhancement Issues and PRs related to improvements to existing features
3 participants