Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TYP: pandas/core/missing.py #38339

Closed
wants to merge 31 commits into from
Closed

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Dec 7, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

)
if isna(order):
raise ValueError(f"order needs to be specified; got order: {order}")
assert isinstance(order, int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not use asserts to please mypy if there is any risk for it to bubble up to the user (without it being a panda bug). I didn't look in enough detail to be sure here, but it looks like a user can pass eg a float here, and it should not raise an AssertionError.

xref #37350

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is done because isna won't narrow the type. If order is typed correct, order is None should be used instead of isna: order must either be a python integer or None. This eliminates the need for the assert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is done because isna won't narrow the type.

Yes that's the reason for this

If order is typed correct, order is None should be used instead of isna: order must either be a python integer or None. This eliminates the need for the assert.

Ok!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not use asserts to please mypy if there is any risk for it to bubble up to the user (without it being a panda bug). I didn't look in enough detail to be sure here, but it looks like a user can pass eg a float here, and it should not raise an AssertionError.

xref #37350

Fair point, noted

method = alt_methods[method]
new_y = method(x, y, new_x, **kwargs)

assert isinstance(method, str)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here: is there any guarantee we already checked this is a string by here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure (don't think so) but if it's not then we'll raise with/without the assert because in the line below method is a key to a dict where all keys are strings

# ignores some kwargs that could be passed along.
alt_methods = {
"barycentric": interpolate.barycentric_interpolate,
"krogh": interpolate.krogh_interpolate,
"from_derivatives": _from_derivatives,
"piecewise_polynomial": _from_derivatives,
}

I can add a more informative error message here, though

xi: ArrayLike,
yi: ArrayLike,
x: Union[ArrayLike, Scalar],
axis: Optional[int] = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is axis optional? (it has a default of 0 already, so it's always specified?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK Optional means it could be None.

That said, the docstring lists this arg as optional so I'll fix that while I'm here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional[T] in type-hinting is not the same as optional in docstrings. The former is an alias for Union[T, None] while the latter means doesn't need to be specified.

):
x,
y,
new_x,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know anything about x, y, new_x?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - added

@@ -61,7 +63,7 @@ def mask_missing(arr: ArrayLike, values_to_mask) -> np.ndarray:
return mask


def clean_fill_method(method, allow_nearest: bool = False):
def clean_fill_method(method: str, allow_nearest: bool = False) -> Optional[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like method can be None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good match (mypy missed this)

start_nans = set() if start_nan_idx is None else set(range(start_nan_idx))

end_nan_idx = find_valid_index(yvalues, "last")
end_nans = set() if end_nan_idx is None else set(range(1 + end_nan_idx, len(valid)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this fixing a bug in the case where end_nan_idx is None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so because we special case all nans/no nans at the top of the the method

@@ -324,7 +338,7 @@ def _interpolate_scipy_wrapper(
elif method == "cubicspline":
alt_methods["cubicspline"] = _cubicspline_interpolate

interp1d_methods = [
interp1d_methods: List[str] = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy cant figure this out on its own?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe I needed this at some point but yes mypy gets this without the annotation with the file as is

yi: ArrayLike,
x: Union[Scalar, ArrayLike],
der: Optional[int] = 0,
axis: Axis = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think just int; Axis includes "index" and "columns" which dont get resolved here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, changed to int

def _akima_interpolate(xi, yi, x, der=0, axis=0):
def _akima_interpolate(
xi: ArrayLike,
yi: ArrayLike,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docstring says ArrayLike, but im guessing this wouldnt work on EAs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not since this is a wrapper for scipy.interpolate.Akima1DInterpolator (and mypy works with both of these annotated asnp.ndarray)

xi: ArrayLike,
yi: ArrayLike,
x: Union[Scalar, ArrayLike],
der: Optional[int] = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is None really allowed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, don't think so. Changed to plain int

@@ -572,12 +612,12 @@ def _interpolate_with_limit_area(


def interpolate_2d(
values,
values: np.ndarray,
method: str = "pad",
axis: Axis = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guessing just int

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed this to int

@@ -447,7 +480,14 @@ def _akima_interpolate(xi, yi, x, der=0, axis=0):
return P(x, nu=der)


def _cubicspline_interpolate(xi, yi, x, axis=0, bc_type="not-a-knot", extrapolate=None):
def _cubicspline_interpolate(
xi: ArrayLike,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be np.ndarray here as well, I think?

@@ -23,7 +23,9 @@
from pandas import Index


def mask_missing(arr: ArrayLike, values_to_mask) -> np.ndarray:
def mask_missing(
arr: ArrayLike, values_to_mask: Union[List, Tuple, Scalar]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values_to_mask is passed to infer_dtype_from_array

so the type of values_to_mask should be the same as the arr argument of infer_dtype_from_array

the arr argument of infer_dtype_from_array is not yet typed. (sometimes better to type the lower level functions first, but while np.ndarray resolves to Any, mypy won't report any inconsistencies)

looking quickly through the body of infer_dtype_from_array, Series and EA are handled, so passing an array-like to values_to_mask should be fine.

To handle Series, could either add to Union, or could maybe (if no perf implications) also update infer_dtype_from_array to handle Index as well, and use AnyArrayLike instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the Index check to infer_dtype_from_array

diff --git a/pandas/core/dtypes/cast.py b/pandas/core/dtypes/cast.py
index 165e63e23d..676ab572a0 100644
--- a/pandas/core/dtypes/cast.py
+++ b/pandas/core/dtypes/cast.py
@@ -83,6 +83,7 @@ from pandas.core.dtypes.generic import (
     ABCDatetimeArray,
     ABCDatetimeIndex,
     ABCExtensionArray,
+    ABCIndexClass,
     ABCPeriodArray,
     ABCPeriodIndex,
     ABCSeries,
@@ -877,7 +878,7 @@ def infer_dtype_from_array(
     if pandas_dtype and is_extension_array_dtype(arr):
         return arr.dtype, arr
 
-    elif isinstance(arr, ABCSeries):
+    elif isinstance(arr, ABCSeries) or isinstance(arr, ABCIndexClass):
         return arr.dtype, np.asarray(arr)
 
     # don't force numpy coerce with nan's

seems to degrade some of the benchmarks that hit this

asv continuous -f 1.1 upstream/master HEAD -b ^algorithms -b ^replace

       before           after         ratio
     [d0db0098]       [9b685722]
     <master~1>       <master>  
+     3.41±0.04ms       5.40±0.7ms     1.59  algorithms.Factorize.time_factorize(False, False, 'boolean')
+     4.50±0.06ms       6.98±0.9ms     1.55  algorithms.Factorize.time_factorize(False, True, 'boolean')
+     3.27±0.01μs       3.89±0.5μs     1.19  algorithms.Duplicated.time_duplicated(True, 'first', 'float')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

so I guess a Union with Series is the way to go unless there's a more efficient patch for infer_dtype_from_array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this. Looking again, I see that Index is already handled and falls through the the final return after np.asarray(arr). so could probably use Union[AnyArrayLike, Sequence[Any], Scalar] after all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - done

@arw2019 arw2019 added the Typing type annotations, mypy/pyright type checking label Dec 11, 2020
@@ -815,8 +822,8 @@ def dict_compat(d: Dict[Scalar, Scalar]) -> Dict[Scalar, Scalar]:


def infer_dtype_from_array(
arr, pandas_dtype: bool = False
) -> Tuple[DtypeObj, ArrayLike]:
arr: "Union[ArrayLike, Series, PandasScalar]", pandas_dtype: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for these just use AnyArrayLike

@@ -23,7 +33,9 @@
from pandas import Index


def mask_missing(arr: ArrayLike, values_to_mask) -> np.ndarray:
def mask_missing(
arr: ArrayLike, values_to_mask: Union[AnyArrayLike, Scalar, Sequence[Any]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnyArrayLike

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

How many derivatives to extract; None for all potentially nonzero
derivatives (that is a number equal to the number of points), or a
list of derivatives to extract. This number includes the function
value as 0th derivative.
extrapolate : bool, optional
extrapolate : bool, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the alignmnet here is off

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

def _akima_interpolate(
xi: np.ndarray,
yi: np.ndarray,
x: Union[Scalar, ArrayLike],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think these are just np.ndarray right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and similar in almost all places that you are using Union[Scalar, ArrayLike] unless you can reveal_type that this is not the case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, they're always ndarray except in _interpolate_scipy_wrapper. It's because in _interpolate_scipy_wrapper we cast any scalar input to ndarray and these functions always get called from there:

new_x = np.asarray(new_x)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed these, and docstrings too

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

@arw2019 this is somewhat old, so maybe simpler to start a new PR, or merge master and update; as always on typing if can do the unambiguous ones first

@arw2019
Copy link
Member Author

arw2019 commented Feb 11, 2021

@arw2019 this is somewhat old, so maybe simpler to start a new PR, or merge master and update; as always on typing if can do the unambiguous ones first

I'm thinking I'll factor out the easy ones into a separate PR so this is easier to look at and maybe get that merged first

(FWIW this got a lot of review so i'm confident the hard ones here are mostly right at this point)

@@ -39,7 +44,9 @@
from pandas import Index


def mask_missing(arr: ArrayLike, values_to_mask) -> np.ndarray:
def mask_missing(
arr: AnyArrayLike, values_to_mask: Union[AnyArrayLike, Scalar, Sequence[Any]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ArrayLike for arr was correct here?
(or if not, the docstring needs to be updated as well)

@simonjayhawkins
Copy link
Member

@arw2019 can you merge master and fixup mypy failures

@simonjayhawkins
Copy link
Member

@arw2019 this is somewhat old, so maybe simpler to start a new PR, or merge master and update; as always on typing if can do the unambiguous ones first

@arw2019 closing this. can you break this up and open several new smaller PRs

@simonjayhawkins
Copy link
Member

@arw2019 if you add an ignore, cast or assert, or change/delete an existing annotation or change any actual code, put those changes a minimally scoped PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants