🔧 fix typing issues from checking untyped defs, fixes #509 #510

Open · wants to merge 21 commits into main from 509-untyped-defs

Conversation

kevinsantana11
Contributor

One thing we should enable is the option to disallow untyped definitions. It was previously turned off; I've enabled it and fixed all of the issues it surfaced (100+).

I still want to add some more changes here: standardizing on the array types we want to support, and leveraging bounded generics to express some of the library's more dynamic behavior.

This also follows mypy's guidance to turn this feature on as early as possible: article
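For context, mypy's disallow-untyped-defs option rejects any function definition that lacks annotations; it can be switched on via disallow_untyped_defs = true in the project's mypy configuration. A minimal illustration of what it catches (hypothetical functions, not code from this PR):

import numpy as np


def unscaled(x):  # error under disallow_untyped_defs: the def has no annotations
    return x * 2


def scaled(x: np.ndarray, factor: float = 2.0) -> np.ndarray:  # accepted: fully annotated
    return x * factor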

gamma: float,
beta: float,
radian_frequency: np.ndarray,
gamma: float | np.ndarray,
Member

Gamma and beta parameters cannot be np.ndarray the way the function is written

@@ -660,13 +664,12 @@ def cartesian_to_spherical(
return lon, lat


T = TypeVar("T", bound=float | np.ndarray)
Member

I am not finding this alias very readable. Why does this help?

Contributor Author

This isn't an alias, it's a generic type variable: https://mypy.readthedocs.io/en/stable/generics.html
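For illustration, here is a minimal sketch of the difference between a plain alias and a type variable (hypothetical function, not code from this PR):

from typing import TypeVar

import numpy as np

# An alias just renames the union; it says nothing about how inputs and outputs relate.
ScalarOrArray = float | np.ndarray

# A type variable is solved per call site: whatever type goes in (within the
# bound) is the type that comes back out.
T = TypeVar("T", bound=float | np.ndarray)


def passthrough(value: T) -> T:
    return value

With this, passthrough(1.0) is inferred as float and passthrough(np.zeros(3)) as np.ndarray, whereas an alias-annotated signature would report float | np.ndarray for both calls.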

Member

Sorry, the whole typing thing is confusing to me

Contributor
@philippemiron, Oct 17, 2024

I read more about TypeVar and actually like it. If I understand correctly, it assigns the type to T dynamically, correct?

@kevinsantana11 force-pushed the 509-untyped-defs branch 2 times, most recently from 68148f2 to 37113e7 on August 24, 2024 03:52
@kevinsantana11
Contributor Author

This code also fixes a bug surfaced by the new type strictness.

The bug was that the download logic never checked whether the upstream source was newer than, or identical to, the resource in the local data-store, so files were always re-downloaded.

Our test code should have caught this, but the lack of type checking let the oversight slip through.

Increasing the type strictness should prevent this kind of code, where we call functions with the wrong types.

Had this been type checked, I would have updated the call to use the right type and realized the function no longer checked the local modification time, since it only handled buffers.
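As a contrived sketch of the kind of mismatch described above (hypothetical names, not the actual clouddrift adapter code): the function below only accepts an open buffer, so it has no path whose modification time it could check, and an untyped caller passing a path string goes unnoticed until strict checking is turned on.

from io import BufferedIOBase


def download(url: str, output: BufferedIOBase) -> None:
    """Write the remote resource into an already-open buffer.

    Because it never sees a file path, it cannot compare the local
    modification time against the upstream source.
    """
    ...


def update_datastore() -> None:
    # With untyped defs this call slipped through silently; once annotated,
    # mypy reports roughly: Argument 2 to "download" has incompatible type
    # "str"; expected "BufferedIOBase"
    download("https://example.org/drifters.nc", "local/drifters.nc")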

Member
@selipot left a comment

Here are my comments. You will see that I am confused in many places; we need to discuss.

@@ -11,17 +12,19 @@
import pandas as pd
import xarray as xr

_Out = t.TypeVar("_Out", bound=tuple[np.ndarray, np.ndarray] | np.ndarray)
Member

As a user, I look at the function's input and output arguments to understand how to use it. When I see t.Any and _Out, I have no idea what they mean. In the same way, what is typing.TypeVar("_Out", bound=tuple[np.ndarray, np.ndarray] | np.ndarray)? Are we making things harder for the user? Look at the scipy library. Do they have any of this?

Contributor Author

I understand that the type _Out makes reading the function signature more difficult. I wish we could use the direct syntax support for generics that was released in Python 3.12, but given that we currently support Python 3.10+, we'll have to wait a bit before we can leverage that syntax.

The purpose of this generic type variable is to annotate that the output type of the func passed to apply_ragged and the output type of apply_ragged itself are the same. As an example, if func returns a tuple[ndarray, ndarray], then apply_ragged will also return a tuple[ndarray, ndarray].

Maybe I'm misunderstanding how this function works, but the output type of the function passed to apply_ragged (func) and that of apply_ragged itself should match, right?
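A trimmed sketch of the idea (simplified signature, not the full apply_ragged definition): because the same type variable appears in both the callback annotation and the return annotation, the checker infers the wrapper's return type from whatever callback it receives.

import typing as t

import numpy as np

_Out = t.TypeVar("_Out", bound=tuple[np.ndarray, np.ndarray] | np.ndarray)


def apply_ragged_sketch(
    func: t.Callable[..., _Out],
    arrays: list[np.ndarray],
    rowsize: list[int],
) -> _Out:
    # Simplified to a single row; the real function iterates over every ragged
    # row and aggregates the per-row results.
    return func(arrays[0][: rowsize[0]])

If func returns tuple[np.ndarray, np.ndarray], the whole call is inferred as a tuple; if it returns a single np.ndarray, the call is inferred as a single array.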

Regarding t.Any: I used it in places where that was already the implicit type and no more specific type (like np.int or something similar) could be determined. Examples are annotations such as a bare list, which the type checker implicitly interprets as list[Any]. Where I could determine a more specific type I used it, but in the cases where I couldn't I annotated it with t.Any.

My hope with the type annotations and stricter type checking is never to make users' lives harder but to improve the library's robustness to change.

We could take a different approach to typing, similar to numpy, where .pyi stub files are used to define types like this.

A quick code search shows that the usage of TypeVar differs per project; please see:

Looking at the above, I see moderate usage across the different projects: some use it to a fair extent, others very little.

rows: int | Iterable[int] = None,
ragged_array: np.ndarray | xr.DataArray,
rowsize: np.ndarray[int] | xr.DataArray,
rows: int | np.int_ | Iterable[int] | None = None,
Member

is it necessary to have both int and np.int_?

Contributor Author

I can check on this

@@ -10,7 +10,7 @@ def analytic_signal(
x: np.ndarray | xr.DataArray,
boundary: str = "mirror",
time_axis: int = -1,
) -> np.ndarray | tuple[np.ndarray, np.ndarray]:
) -> np.ndarray:
Member

The output CAN be a tuple depending on the input. Why remove it?

Contributor Author

Got it, will update it back.

import numpy as np
from scipy.special import gamma as _gamma
from scipy.special import gammaln as _lgamma


@t.overload
Member

What is this for? @t.overload then a repeat of the definition?

Contributor Author

These likely fit better in a stub file, but the purpose is twofold:

  1. Document that the output type is directly dependent on the value of the complex parameter.
  2. Help the type checker understand that the function's output type depends on the complex parameter's value.

Without this, the type checker will think any call to morse_wavelet_transform can return either type (a tuple or a single ndarray) and will warn us to write code that checks the type and handles the two scenarios separately.
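A shortened sketch of the overload pattern (using only the arguments visible in the diff below; the real function may take more):

import typing as t

import numpy as np


@t.overload
def morse_wavelet_transform(
    x: np.ndarray,
    gamma: float,
    beta: float,
    radian_frequency: np.ndarray,
    complex: t.Literal[False] = ...,
) -> np.ndarray: ...


@t.overload
def morse_wavelet_transform(
    x: np.ndarray,
    gamma: float,
    beta: float,
    radian_frequency: np.ndarray,
    complex: t.Literal[True],
) -> tuple[np.ndarray, np.ndarray]: ...


def morse_wavelet_transform(
    x: np.ndarray,
    gamma: float,
    beta: float,
    radian_frequency: np.ndarray,
    complex: bool = False,
) -> np.ndarray | tuple[np.ndarray, np.ndarray]:
    # Placeholder body; the real implementation is in clouddrift and omitted here.
    ...

With these overloads, a call with complex=True is typed as a tuple, and a call with complex=False (or omitting it) as a single array.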

def morse_wavelet_transform(
x: np.ndarray,
gamma: float,
beta: float,
radian_frequency: np.ndarray,
complex: bool = False,
complex: bool,
Member

a boolean is False by default? Why remove the default?

Contributor Author

I can revert this change.

)

return wtx
return wavelet_transform(x, wavelet, boundary=boundary, time_axis=time_axis)
Member

Ok, I understand that it is the same logic, but to me it is less clear.

@@ -346,7 +367,7 @@ def morse_wavelet(
length: int,
gamma: float,
beta: float,
radian_frequency: np.ndarray,
radian_frequency: float | np.ndarray,
Member

I am pretty sure the code needs the radian frequency to be an array, not a float, even with a single element.

Contributor Author

Got it, I will correct this.

@@ -22,7 +22,7 @@ def test_morse_wavelet_transform_real(self):
length = 1023
radian_frequency = 2 * np.pi / np.logspace(np.log10(10), np.log10(100), 50)
x = np.random.random(length)
wtx = morse_wavelet_transform(x, 3, 10, radian_frequency)
wtx = morse_wavelet_transform(x, 3, 10, radian_frequency, False)
Member

Why can't we have complex=False as a default for this function? You seem to have changed it to a compulsory argument.

Contributor Author

I will revert this and put the default back.

Contributor

I think this was forgotten.

@selipot requested a review from milancurcic on September 5, 2024 17:43
@milancurcic
Member

This is generally not necessary, but can be helpful for developers (e.g. if you want to require mypy to pass with certain settings on).

I didn't look at the full diff, but from the description it seems that the changes are only to the type hints. Python ignores type hints, so the code itself is unchanged in functionality.

So, two questions to ask are:

  1. Are these hints helpful for the contributors?
  2. Are these hints helpful for the end user?

IMO:

  1. They can be, in moderation, if I can understand the types. But if you develop a type system only to satisfy a strict type-checking tool, e.g. mypy, it can easily escalate into a type-system nightmare. There are some type definitions here that I don't understand at first look. That doesn't mean I couldn't understand them if I studied them, but they're complex enough that they're not obvious.
  2. Probably not; the users will mostly read the docstrings.


codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 95.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
clouddrift/adapters/gdp/gdpsource.py 90.00% 1 Missing ⚠️
clouddrift/adapters/utils.py 83.33% 1 Missing ⚠️
clouddrift/typing.py 92.30% 1 Missing ⚠️

@kevinsantana11
Contributor Author

@selipot this is ready for re-review

@philippemiron self-requested a review on October 17, 2024 14:01
@philippemiron
Contributor

philippemiron commented Oct 17, 2024

What's the advantage of using from numpy.typing import NDArray instead of np.ndarray ?

And if it's really required, I would suggest doing it uniformly. I see some places you have:

from numpy.typing import NDArray
import numpy.typing as np_typing

and I think we are only using NDArray in the code from this module.

@philippemiron
Contributor

and I would suggest also doing:

from clouddrift.typing import ArrayTypes

to remove some verbosity in the docstring.

longitude: float | np.ndarray,
latitude: float | np.ndarray,
) -> tuple[float, float] | tuple[np.ndarray, np.ndarray]:
u: T, v: T, w: T, longitude: V, latitude: V
Contributor
@philippemiron, Oct 17, 2024

But then I'm confused: what is the purpose of ArrayTypes from clouddrift.typing if we use TypeVar here?

Contributor Author

Here we define bounds on the type variable, so while the type is "variable" it is still constrained to the bound.
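For illustration, a minimal sketch of bounded type variables (hypothetical function name, not the clouddrift signature):

from typing import TypeVar

import numpy as np

# Each type variable can only be solved to float or np.ndarray (the bound),
# and T and V are solved independently of one another.
T = TypeVar("T", bound=float | np.ndarray)
V = TypeVar("V", bound=float | np.ndarray)


def rotate_sketch(u: T, v: T, longitude: V, latitude: V) -> tuple[T, T]:
    # u and v must share one type and longitude/latitude another; passing,
    # say, strings for any of them is rejected because of the bound.
    return u, v

So float velocities combined with array coordinates type-check, while anything outside the bound does not.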

Collaborator

I am curious as to why we have both T and V here when they seem to have the same definition.

@kevinsantana11
Contributor Author

kevinsantana11 commented Oct 18, 2024

What's the advantage of using from numpy.typing import NDArray instead of np.ndarray ?

And if it's really required, I would suggest doing it uniformly. I see some places you have:

from numpy.typing import NDArray
import numpy.typing as np_typing

and I think we are only using NDArray in the code from this module.

The main reason I swapped the ndarray class out for NDArray is that it makes the type annotations less verbose. Using np.ndarray would be preferable if not for the extra verbosity required when using it as an annotation: it's one less import and more familiar to users. What made me consider NDArray is that it only requires the dtype as a type variable. To provide a fully concrete type with the ndarray class, a type for the shape must also be provided, and the dtype (np.int64, np.float64, etc.) must be wrapped in np.dtype. An example of what I mean in code can be found below:

Example

Using np.ndarray

import numpy as np
from typing import Any

def foo() -> np.ndarray[Any, np.dtype[np.int64]]:
    ...

Using np.typing.NDArray

import numpy as np
from numpy.typing import NDArray

def foo() -> NDArray[np.int64]:
    ...

@@ -725,12 +729,12 @@ def cartesian_to_tangentplane(
return u_projected, v_projected


T = TypeVar("T", bound=float | np.ndarray)
Collaborator

Why is this defined a second time? I see it is also defined in line 667.

func: Callable[..., _ArrayOutput],
arrays: ArrayTypes,
rowsize: ArrayTypes,
*args: typing.Any,
Collaborator

It might help to keep things consistent to also update the docstrings here.
