
Updated setup.cfg mypy flags and resolved related errors. #703

Merged
merged 31 commits, Nov 9, 2022
Commits
a74ef96
updated setup.cfg mypy flags and resolved errors
Sanketh7 Nov 1, 2022
8b1d198
Merge branch 'main' into mypy-flags
Sanketh7 Nov 1, 2022
cb58600
Merge branch 'main' into mypy-flags
Sanketh7 Nov 2, 2022
1eb64ae
updated return type of _is_each_row_float
Sanketh7 Nov 2, 2022
0ad2a38
updated mypy hook to include numpy as a dependency and fixed relevant…
Sanketh7 Nov 2, 2022
0313deb
updated _is_each_row_float
Sanketh7 Nov 2, 2022
2d5307f
updated _get_data_as_records
Sanketh7 Nov 2, 2022
82cbf42
Merge branch 'main' into mypy-flags
taylorfturner Nov 3, 2022
81f9f36
clean up
taylorfturner Nov 3, 2022
c1b2bd1
Merge branch 'main' into mypy-flags
taylorfturner Nov 3, 2022
3c9aaa3
updated evaluate_accuracy types
Sanketh7 Nov 3, 2022
1de9958
removed float cast in biased_skew
Sanketh7 Nov 3, 2022
cc8224e
typed self.match_count
Sanketh7 Nov 4, 2022
2538ea3
updated estimate_stats_from_histogram type
Sanketh7 Nov 4, 2022
b737249
Merge branch 'main' into mypy-flags
Sanketh7 Nov 4, 2022
60a2202
update biased skewness to Union[float, np.float64]
Sanketh7 Nov 4, 2022
fde3a97
updated _correct_bias_skewness to return Union[float, np.float64]
Sanketh7 Nov 4, 2022
481a8ec
updated biased kurtosis to be Union[float, np.float64]
Sanketh7 Nov 4, 2022
9a261e1
added generics to AutoSubRegistrationMeta
Sanketh7 Nov 4, 2022
06762be
Merge branch 'main' into mypy-flags
Sanketh7 Nov 7, 2022
25094c2
update np_type_to_type return type
Sanketh7 Nov 7, 2022
1ee80eb
changed float to float64 where needed
Sanketh7 Nov 7, 2022
a9d097b
updated _estimate_mode_from_histogram to not use Union
Sanketh7 Nov 8, 2022
d7ee476
Merge branch 'main' into mypy-flags
Sanketh7 Nov 8, 2022
841adf5
return 0.0 instead of 0
Sanketh7 Nov 8, 2022
9c2d636
Merge branch 'main' into mypy-flags
Sanketh7 Nov 9, 2022
78f57fa
revert AutoSubRegistrationMeta changes
Sanketh7 Nov 9, 2022
ffbcb43
Update base_model.py
taylorfturner Nov 9, 2022
e417189
Update base_model.py
taylorfturner Nov 9, 2022
ec2745b
isort fix
taylorfturner Nov 9, 2022
e6399be
DS_Store fix
taylorfturner Nov 9, 2022
2 changes: 1 addition & 1 deletion dataprofiler/data_readers/avro_data.py
@@ -94,7 +94,7 @@ def is_match(
if data_utils.is_stream_buffer(file_path):
starting_location = file_path.tell()

is_valid_avro = fastavro.is_avro(file_path)
is_valid_avro: bool = fastavro.is_avro(file_path)

# return to original position in stream
if data_utils.is_stream_buffer(file_path):
15 changes: 9 additions & 6 deletions dataprofiler/data_readers/json_data.py
@@ -9,6 +9,7 @@
import pandas as pd
from six import StringIO

from .._typing import JSONType
from . import data_utils
from .base_data import BaseData
from .filepath_or_buffer import FileOrBufferHandler
@@ -236,36 +237,38 @@ def _get_data_as_flattened_dataframe(self, json_lines):

return data

def _load_data_from_str(self, data_as_str: str) -> List:
def _load_data_from_str(self, data_as_str: str) -> JSONType:
"""
Load the data from a string.

:param data_as_str: data in string format.
:type data_as_str: str
:return: dict
:return: JSONType
"""
data: JSONType
try:
data = json.loads(data_as_str)
except json.JSONDecodeError:
data = data_utils.data_generator(data_as_str.splitlines())
data_generator = data_utils.data_generator(data_as_str.splitlines())
data = data_utils.read_json(
data_generator=data,
data_generator=data_generator,
selected_columns=self.selected_keys,
read_in_string=False,
)
return data

def _load_data_from_file(self, input_file_path: str) -> List:
def _load_data_from_file(self, input_file_path: str) -> JSONType:
"""
Load the data from a file.

:param input_file_path: file path to file being loaded.
:type input_file_path: str
:return:
:return: JSONType
"""
with FileOrBufferHandler(
input_file_path, "r", encoding=self.file_encoding
) as input_file:
data: JSONType
try:
data = json.load(input_file)
except (json.JSONDecodeError, UnicodeDecodeError):
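The diff above imports JSONType from dataprofiler's _typing module but does not show its definition. A minimal sketch of such an alias and how it pins the result of json.loads (which mypy otherwise sees as Any) might look like this; the project's real JSONType may be defined differently:

```python
import json
from typing import Any, Dict, List, Union

# Hypothetical JSONType alias; dataprofiler's real _typing.JSONType may differ.
JSONType = Union[str, int, float, bool, None, List[Any], Dict[str, Any]]


def load_data_from_str(data_as_str: str) -> JSONType:
    # json.loads returns Any, so the annotation on `data` pins the type
    # for mypy, mirroring the `data: JSONType` declaration in the diff.
    data: JSONType = json.loads(data_as_str)
    return data


print(load_data_from_str('{"a": [1, 2]}'))
```

Annotating the local variable up front is what lets both the try and except branches assign to it without a type conflict.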
4 changes: 2 additions & 2 deletions dataprofiler/data_readers/structured_mixins.py
@@ -72,12 +72,12 @@ def _get_data_as_df(self, data: pd.DataFrame) -> pd.DataFrame:
def _get_data_as_records(self, data: Any) -> List[str]:
"""Return data records."""
records_per_line = min(len(data), self.SAMPLES_PER_LINE_DEFAULT)
data = [
data_ = [
str(
"\n".join(data[i * records_per_line : (i + 1) * records_per_line])
.encode("UTF-8")
.decode()
)
for i in range((len(data) + records_per_line - 1) // records_per_line)
]
return data
return data_
6 changes: 3 additions & 3 deletions dataprofiler/labelers/base_data_labeler.py
@@ -463,7 +463,7 @@ def _load_parameters(dirpath: str, load_options: Dict = None) -> Dict[str, Dict]
load_options = {}

with open(os.path.join(dirpath, "data_labeler_parameters.json")) as fp:
params = json.load(fp)
params: Dict[str, Dict] = json.load(fp)

if "model_class" in load_options:
model_class = load_options.get("model_class")
@@ -677,7 +677,7 @@ def load_with_components(
data_labeler.set_preprocessor(preprocessor)
data_labeler.set_model(model)
data_labeler.set_postprocessor(postprocessor)
return data_labeler
return cast(BaseDataLabeler, data_labeler)

def _save_model(self, dirpath: str) -> None:
"""
@@ -914,4 +914,4 @@ def load_with_components(
data_labeler.set_preprocessor(preprocessor)
data_labeler.set_model(model)
data_labeler.set_postprocessor(postprocessor)
return data_labeler
return cast(TrainableDataLabeler, data_labeler)
2 changes: 1 addition & 1 deletion dataprofiler/labelers/base_model.py
@@ -21,7 +21,7 @@ def __new__(
cls, clsname, bases, attrs
)
new_class._register_subclass()
return new_class
return cast(AutoSubRegistrationMeta, new_class)
Contributor:

This might be better as a T generic, because the returned class should really keep the original input type. Otherwise, would this cast convert anything using this metaclass to AutoSubRegistrationMeta later in the code?

Is that why loading has the issue as well?

Contributor Author:

The issue is that new_class needs to be typed as Any initially because otherwise mypy complains that ._register_subclass() doesn't exist. I'm thinking we could also do:

new_class: AutoSubRegistrationMeta = super(AutoSubRegistrationMeta, cls).__new__(
    cls, clsname, bases, attrs
)
new_class._register_subclass() # type: ignore
return new_class

which does pass mypy.

Contributor:

I agree that new_class: AutoSubRegistrationMeta = ... would not work either. However, I was wondering if there's a generic approach that could fix this, allowing us to do new_class: T = ... where T is the cls or something of the like. If it is too complex, fair. Either way, cast(AutoSubRegistrationMeta, new_class) doesn't feel right to me; shouldn't it be some cast(T, new_class), since the output is not literally an AutoSubRegistrationMeta?

Contributor Author:

T = TypeVar("T", bound="AutoSubRegistrationMeta")


class AutoSubRegistrationMeta(abc.ABCMeta):
    """For registering subclasses."""

    def __new__(
        cls: Type[T], clsname: str, bases: Tuple[type, ...], attrs: Dict[str, object]
    ) -> T:
        """Create auto registration object and return new class."""
        new_class: T = super(AutoSubRegistrationMeta, cls).__new__(
            cls, clsname, bases, attrs
        )
        new_class._register_subclass()  # type: ignore
        return new_class

Was able to get it working with this change (which I just committed). Couldn't get around the ._register_subclass() line because that's done dynamically.
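The pattern above can be exercised end to end. In this runnable stand-in, a plain registry dict replaces the project's dynamic _register_subclass hook, so it is a sketch of the TypeVar-bound metaclass technique rather than dataprofiler's actual code:

```python
import abc
from typing import Dict, Tuple, Type, TypeVar

T = TypeVar("T", bound="AutoSubRegistrationMeta")


class AutoSubRegistrationMeta(abc.ABCMeta):
    """For registering subclasses."""

    # Stand-in for the dynamic _register_subclass machinery.
    registry: Dict[str, type] = {}

    def __new__(
        cls: Type[T], clsname: str, bases: Tuple[type, ...], attrs: Dict[str, object]
    ) -> T:
        """Create auto registration object and return new class."""
        new_class: T = super().__new__(cls, clsname, bases, attrs)  # type: ignore
        AutoSubRegistrationMeta.registry[clsname] = new_class
        return new_class


class BaseModel(metaclass=AutoSubRegistrationMeta):
    pass


class CharModel(BaseModel):
    pass


# Every class built with the metaclass is recorded automatically.
print(sorted(AutoSubRegistrationMeta.registry))
```

Binding T to the metaclass means each call site gets back the concrete cls it passed in, which is what the cast(AutoSubRegistrationMeta, ...) version lost.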



class BaseModel(object, metaclass=abc.ABCMeta):
3 changes: 2 additions & 1 deletion dataprofiler/labelers/labeler_utils.py
@@ -79,7 +79,7 @@ def evaluate_accuracy(
predicted_entities_in_index: List[List[int]],
true_entities_in_index: List[List[int]],
num_labels: int,
entity_rev_dict: Dict,
entity_rev_dict: Dict[int, Any],
Contributor:

I believe the Any is required to be str

Contributor Author:

Makes sense. Made the change.

verbose: bool = True,
omitted_labels: Tuple[str, ...] = ("PAD", "UNKNOWN"),
confusion_matrix_file: str = None,
@@ -125,6 +125,7 @@ def evaluate_accuracy(
true_labels_flatten = np.hstack(true_labels_padded) # type: ignore
predicted_labels_flatten = np.hstack(predicted_entities_in_index)

all_labels: List[str] = []
if entity_rev_dict:
all_labels = [entity_rev_dict[key] for key in sorted(entity_rev_dict.keys())]

1 change: 1 addition & 0 deletions dataprofiler/profilers/base_column_profilers.py
@@ -267,6 +267,7 @@ def __init__(self, name: Optional[str]) -> None:
# Number of values that match the column type. eg. how many floats match
# in the float column
self.match_count: int = 0
self.sample_size: int # inherited from BaseColumnProfiler

def _update_column_base_properties(self, profile: Dict) -> None:
"""
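The self.sample_size: int line above is an annotation-only declaration: it tells mypy the attribute's type without assigning a value, which is useful when the attribute is populated elsewhere (here, by the inherited BaseColumnProfiler). A small sketch of the runtime behavior:

```python
class Profile:
    def __init__(self) -> None:
        self.match_count: int = 0  # annotated AND assigned
        self.sample_size: int      # annotation only: no attribute is created


p = Profile()
# The annotation-only line does not create the attribute at runtime;
# it only informs the type checker.
print(p.match_count, hasattr(p, "sample_size"))
```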
6 changes: 3 additions & 3 deletions dataprofiler/profilers/float_column_profile.py
@@ -3,7 +3,7 @@

import copy
import re
from typing import Dict, List, Optional
from typing import Dict, Optional

import numpy as np
import pandas as pd
@@ -285,7 +285,7 @@ def _get_float_precision(
return subset_precision

@classmethod
def _is_each_row_float(cls, df_series: pd.Series) -> List[bool]:
def _is_each_row_float(cls, df_series: pd.Series) -> pd.Series[bool]:
"""
Determine if each value in a dataframe is a float.

@@ -297,7 +297,7 @@ def _is_each_row_float(cls, df_series: pd.Series) -> List[bool]:
:param df_series: series of values to evaluate
:type df_series: pandas.core.series.Series
:return: is_float_col
:rtype: list
:rtype: pandas.Series[bool]
"""
if len(df_series) == 0:
return list()
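One caveat with the pd.Series[bool] annotation above: subscripting pd.Series is not guaranteed to work at runtime on older pandas versions, so deferring annotation evaluation keeps it safe. A hedged sketch (the per-value check here is simplified; the profiler's actual float test is more involved):

```python
from __future__ import annotations  # defer evaluation of pd.Series[bool]

import pandas as pd


def is_each_row_float(df_series: pd.Series) -> pd.Series[bool]:
    """Return a boolean Series marking which values parse as floats."""

    def is_float(x: object) -> bool:
        # Simplified stand-in for the profiler's float-matching logic.
        try:
            float(str(x))
            return True
        except ValueError:
            return False

    return df_series.apply(is_float)


print(is_each_row_float(pd.Series(["1.5", "abc", "2"])).tolist())
```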
8 changes: 4 additions & 4 deletions dataprofiler/profilers/graph_profiler.py
@@ -4,7 +4,7 @@
import pickle
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Optional, Tuple, Union
from typing import Dict, List, Optional, Tuple, Union, cast

import networkx as nx
import numpy as np
@@ -330,12 +330,12 @@ def _update_categorical_distribution(
@BaseColumnProfiler._timeit(name="num_nodes")
def _get_num_nodes(self, graph: nx.Graph) -> int:
"""Compute the number of nodes."""
return graph.number_of_nodes()
return cast(int, graph.number_of_nodes())

@BaseColumnProfiler._timeit(name="num_edges")
def _get_num_edges(self, graph: nx.Graph) -> int:
"""Compute the number of edges."""
return graph.number_of_edges()
return cast(int, graph.number_of_edges())

@BaseColumnProfiler._timeit(name="categorical_attributes")
def _get_categorical_attributes(self, graph: nx.Graph) -> List[str]:
@@ -362,7 +362,7 @@ def _get_global_max_component_size(self, graph: nx.Graph) -> int:
nx.connected_components(graph), key=len, reverse=True
)
largest_component: nx.Graph = graph.subgraph(graph_connected_components[0])
return largest_component.size()
return cast(int, largest_component.size())

@BaseColumnProfiler._timeit(name="continuous_distribution")
def _get_continuous_distribution(
7 changes: 2 additions & 5 deletions dataprofiler/profilers/histogram_utils.py
@@ -7,11 +7,8 @@
from typing import List, Optional, Tuple, Union

import numpy as np
from numpy.lib.histograms import ( # type: ignore
_get_outer_edges,
_hist_bin_selectors,
_unsigned_subtract,
)
from numpy.lib.histograms import _get_outer_edges # type: ignore
from numpy.lib.histograms import _hist_bin_selectors, _unsigned_subtract


def _get_bin_edges(
46 changes: 23 additions & 23 deletions dataprofiler/profilers/numerical_column_stats.py
@@ -6,7 +6,7 @@
import copy
import itertools
import warnings
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
from typing import Any, Callable, Dict, List, Optional, Tuple, Union, cast

import numpy as np
import pandas as pd
@@ -399,7 +399,7 @@ def mean(self) -> float:
"""Return mean value."""
if self.match_count == 0:
return 0
return float(self.sum) / self.match_count
return float(self.sum) / float(self.match_count)
Contributor:

mypy complains if self.match_count isn't a float?

Contributor Author:

Actually, now that I look at the code again, I don't see self.match_count getting initialized anywhere. This would explain why it's being typed as Any (thus requiring some sort of cast).

Contributor Author:

I fixed it by manually adding a type for it in the constructor. I assume self.match_count is populated in some dynamic way so mypy isn't able to pick up its type automatically.


@property
def mode(self) -> List[float]:
@@ -422,7 +422,7 @@ def median(self) -> float:
:rtype: float
"""
if not self._has_histogram or not self._median_is_enabled:
return np.nan
return cast(float, np.nan)
return self._get_percentile([50])[0]

@property
@@ -438,8 +438,8 @@ def variance(self) -> float:
def stddev(self) -> float:
"""Return stddev value."""
if self.match_count == 0:
return np.nan
return np.sqrt(self.variance)
return cast(float, np.nan)
return cast(float, np.sqrt(self.variance))
Contributor:

I think we convert later to a float because this could be an np.float64. Probably good to say this could be np.float64 so that we are returning the real type.

Contributor Author:

I see. This would still require a cast though because numpy's type annotations for sqrt on a scalar has a return type of Any.

Contributor Author:

I'm trying to cast it to np.float64 but the issue becomes that there's not really a compatible return type annotation because np.nan is of type float. Using something like np.float_ causes mypy to complain about the np.nan case and sticking with float causes mypy to complain about the np.sqrt case.

Contributor Author:

Changing it to a Union causes issues with line 385:

            "stddev": utils.find_diff_of_numbers(self.stddev, other_profile.stddev),
dataprofiler/profilers/numerical_column_stats.py:385: error: Value of type variable "T" of "find_diff_of_numbers" cannot be "object"

Contributor:

So def stddev(self) -> Union[float, np.float64]:
with the cast(np.float64, np.sqrt(self.variance)) doesn't work?

Contributor Author:

Yes. It seems to cause issues with other functions.

Contributor:

Can you elaborate? We shouldn't be forcing the cast just to satisfy mypy; the typing should be valid. In this case, casting an np.float64 to float isn't valid. This is problematic because code that later applies json.dumps to this value would fail if it wasn't converted into a float first.

Contributor:

I would assume def stddev(self) -> Union[float, np.float64]: even without the cast should be functional.

But if this is erroring elsewhere, then we should fix that once we do this correctly.


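The resolution the thread converges on can be shown in isolation: np.nan is a plain Python float, while np.sqrt on a scalar yields an np.float64, so an honest signature returns the Union. A minimal sketch, with hypothetical free-function arguments standing in for the profiler's attributes:

```python
from typing import Union

import numpy as np


def stddev(variance: float, match_count: int) -> Union[float, np.float64]:
    if match_count == 0:
        return np.nan  # np.nan is a plain Python float
    return np.sqrt(variance)  # np.sqrt on a scalar returns np.float64


# The two branches really do produce different concrete types.
print(type(stddev(1.0, 0)).__name__, type(stddev(4.0, 10)).__name__)
```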
@property
def skewness(self) -> float:
@@ -563,7 +563,7 @@ def _merge_biased_variance(
elif match_count2 < 1:
return biased_variance1
elif np.isnan(biased_variance1) or np.isnan(biased_variance2):
return np.nan
return cast(float, np.nan)

curr_count = match_count1
delta = mean2 - mean1
@@ -586,7 +586,7 @@ def _correct_bias_variance(match_count: int, biased_variance: float) -> float:
"False in ProfilerOptions.",
RuntimeWarning,
)
return np.nan
return cast(float, np.nan)

variance = match_count / (match_count - 1) * biased_variance
return variance
@@ -621,7 +621,7 @@ def _merge_biased_skewness(
elif match_count2 < 1:
return biased_skewness1
elif np.isnan(biased_skewness1) or np.isnan(biased_skewness2):
return np.nan
return cast(float, np.nan)

delta = mean2 - mean1
N = match_count1 + match_count2
@@ -645,7 +645,7 @@ def _merge_biased_skewness(
third_term = 3 * delta * (match_count1 * M2_2 - match_count2 * M2_1) / N
M3 = first_term + second_term + third_term

biased_skewness = np.sqrt(N) * M3 / np.sqrt(M2**3)
biased_skewness: float = np.sqrt(N) * M3 / np.sqrt(M2**3)
Contributor:

is this a float or is it a np.float64?

Contributor Author:

This does seem to be a np.float64. Because biased_skewness could be either np.nan or some np.float64, I've updated any usage of biased skewness to be a Union[float, np.float64].

return biased_skewness

@staticmethod
@@ -665,9 +665,9 @@ def _correct_bias_skewness(match_count: int, biased_skewness: float) -> float:
"False in ProfilerOptions.",
RuntimeWarning,
)
return np.nan
return cast(float, np.nan)

skewness = (
skewness: float = (
Contributor:

as above

Contributor Author:

Updated _correct_bias_skewness to return Union[float, np.float64]

np.sqrt(match_count * (match_count - 1))
* biased_skewness
/ (match_count - 2)
@@ -708,7 +708,7 @@ def _merge_biased_kurtosis(
elif match_count2 < 1:
return biased_kurtosis1
elif np.isnan(biased_kurtosis1) or np.isnan(biased_kurtosis2):
return np.nan
return cast(float, np.nan)

delta = mean2 - mean1
N = match_count1 + match_count2
@@ -742,7 +742,7 @@ def _merge_biased_kurtosis(
fourth_term = 4 * delta * (match_count1 * M3_2 - match_count2 * M3_1) / N
M4 = first_term + second_term + third_term + fourth_term

biased_kurtosis = N * M4 / M2**2 - 3
biased_kurtosis: float = N * M4 / M2**2 - 3
Contributor:

as above

Contributor Author:

Updated _merge_biased_kurtosis to return Union[float, np.float64]

return biased_kurtosis

@staticmethod
@@ -762,7 +762,7 @@ def _correct_bias_kurtosis(match_count: int, biased_kurtosis: float) -> float:
"False in ProfilerOptions.",
RuntimeWarning,
)
return np.nan
return cast(float, np.nan)

kurtosis = (
(match_count - 1)
@@ -803,15 +803,15 @@ def _estimate_mode_from_histogram(self) -> List[float]:
mode = (
bin_edges[highest_idxs] + bin_edges[highest_idxs + 1] # type: ignore
) / 2
return mode.tolist()
return cast(List[float], mode.tolist())
Contributor:

is this a list of float or np.float64?

Contributor Author:

bin_edges doesn't necessarily need to contain numpy float values.

Contributor:

The issue is if we prescribe it to contain something that isn't true. I'd rather say it could contain either, if that is the case, because someone relying on static typing to ensure the function's behavior could otherwise infer incorrectly.

Contributor Author:

Done

Contributor Author:

I did some more investigating and numpy functions end up converting the values to float64 so this should only be List[np.float64]

Contributor Author:

Actually it would be List[float] because .tolist() converts values to the compatible python type.


def _estimate_stats_from_histogram(self) -> float:
# test estimated mean and var
bin_counts = self._stored_histogram["histogram"]["bin_counts"]
bin_edges = self._stored_histogram["histogram"]["bin_edges"]
mids = 0.5 * (bin_edges[1:] + bin_edges[:-1])
mean = np.average(mids, weights=bin_counts)
var = np.average((mids - mean) ** 2, weights=bin_counts)
var: float = np.average((mids - mean) ** 2, weights=bin_counts)
Contributor:

as above

Contributor Author:

This will end up being a np.float64. I changed the type annotation and method return type to reflect this.

return var

def _total_histogram_bin_variance(
@@ -858,7 +858,7 @@ def _histogram_bin_error(self, input_array: Union[np.ndarray, pd.Series]) -> float:
# reset the edge
bin_edges[-1] = temp_last_edge

sum_error = sum(
sum_error: float = sum(
(input_array - (bin_edges[inds] + bin_edges[inds - 1]) / 2) ** 2
)

@@ -1180,7 +1180,7 @@ def _get_best_histogram_for_profile(self) -> Dict:
self.histogram_selection = method
best_hist_loss = hist_loss

return self.histogram_methods[self.histogram_selection]["histogram"]
return cast(Dict, self.histogram_methods[self.histogram_selection]["histogram"])

def _get_percentile(
self, percentiles: Union[np.ndarray, List[float]]
@@ -1220,7 +1220,7 @@ def _get_percentile(
)
if median_value:
quantiles[percentiles == 50] = median_value
return quantiles.tolist()
return cast(List[float], quantiles.tolist())

@staticmethod
def _fold_histogram(
@@ -1295,7 +1295,7 @@ def median_abs_deviation(self) -> float:
:return: median absolute deviation
"""
if not self._has_histogram or not self._median_abs_dev_is_enabled:
return np.nan
return cast(float, np.nan)

bin_counts = self._stored_histogram["histogram"]["bin_counts"]
bin_edges = self._stored_histogram["histogram"]["bin_edges"]
@@ -1344,9 +1344,9 @@

median_inds = np.abs(bin_counts_impose - 0.5) < 1e-10
if np.sum(median_inds) > 1:
return np.mean(bin_edges_impose[median_inds])
return cast(float, np.mean(bin_edges_impose[median_inds]))
Contributor:

I think these would also be np.float64

Contributor Author:

Casting this to np.float64 and changing the return type to Union[float, np.float64] gives me a similar issue as one of the other methods:

"median_absolute_deviation": utils.find_diff_of_numbers(
                self.median_abs_deviation, other_profile.median_abs_deviation
            ),
dataprofiler/profilers/numerical_column_stats.py:381: error: Value of type variable "T" of "find_diff_of_numbers" cannot be "object"

Contributor:

Did googling provide any response to this?

Contributor:

So it has something to do with the return type of these numpy funcs being Any, maybe?

Contributor Author:

Should be fixed. See #703 (comment)

Contributor:

This would be a float64.

Contributor Author:

I'm getting a similar issue as before:

"median_absolute_deviation": utils.find_diff_of_numbers(
         self.median_abs_deviation, other_profile.median_abs_deviation
 ),
dataprofiler/profilers/numerical_column_stats.py:381: error: Value of type variable "T" of "find_diff_of_numbers" cannot be "object"

Looking into it more, I believe it's because we're trying to assign a Union to a TypeVar with bound=Subtractable and the Union itself does not follow the Subtractable protocol. So far the only reasonable solution I can think of is doing:

"median_absolute_deviation": utils.find_diff_of_numbers(
         cast(float, self.median_abs_deviation), cast(float, other_profile.median_abs_deviation)
 ),

which isn't very ideal. I'll look into mypy generics more to see if there's a workaround.

Contributor Author:

Fixed. See #703 (comment)
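The TypeVar failure described in this thread can be reproduced in isolation. The sketch below mirrors the names from the discussion (the project's actual utils.find_diff_of_numbers may differ): mypy joins mixed Union arguments to object, which lacks __sub__ and so cannot satisfy the Subtractable bound, even though the call works fine at runtime.

```python
from typing import Any, Protocol, TypeVar, Union

import numpy as np


class Subtractable(Protocol):
    def __sub__(self, other: Any) -> Any: ...


T = TypeVar("T", bound=Subtractable)


def find_diff_of_numbers(a: T, b: T) -> Any:
    return a - b


x: Union[float, np.float64] = np.float64(3.5)
# mypy would report on the next call:
#   Value of type variable "T" of "find_diff_of_numbers" cannot be "object"
# because the Union is joined to `object`, which is not Subtractable.
# At runtime, however, the subtraction is perfectly fine:
print(find_diff_of_numbers(x, 1.5))
```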


return np.interp(0.5, bin_counts_impose, bin_edges_impose)
return cast(float, np.interp(0.5, bin_counts_impose, bin_edges_impose))

def _get_quantiles(self) -> None:
"""
@@ -1670,7 +1670,7 @@ def is_int(x: str) -> bool:
return a == b

@staticmethod
def np_type_to_type(val: Any) -> Union[int, float]:
def np_type_to_type(val: Any) -> Union[int, float, Any]:
Contributor:

If this can be Any, we technically don't need the Union, right?

Contributor:

However, will this error then because we are using Any?

Contributor Author:

I guess the Union was redundant. Changing to Any actually doesn't cause issues with our mypy flag warn_return_any = True because that only complains when we return Any and the function return type is not Any.

Contributor:

nice

"""
Convert numpy variables to base python type variables.

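The diff does not show np_type_to_type's body, so here is a hedged sketch of what such a converter can look like (the project's actual implementation may differ): numpy scalars all derive from np.generic and expose .item(), which returns the corresponding built-in Python type, while any other value passes through unchanged, which is why the effective return type is Any.

```python
from typing import Any

import numpy as np


def np_type_to_type(val: Any) -> Any:
    """Convert numpy scalar variables to base Python type variables."""
    # np.generic covers every numpy scalar type (np.int64, np.float64, ...).
    if isinstance(val, np.generic):
        return val.item()
    return val


print(type(np_type_to_type(np.int64(3))).__name__)
print(type(np_type_to_type("s")).__name__)
```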