fix: numeric operations on numeric values only #381
@@ -27,7 +27,7 @@ def calc(self, df: pd.DataFrame) -> int:

 class Sum(UnaryOperator, BaseMetric):
     def calc(self, df: pd.DataFrame) -> float:
-        return cast(float, df.loc[:, self.operand].sum())
+        return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").sum())


 class VectorSum(UnaryOperator, VectorOperator, ZeroInitialValue, BaseMetric):
@@ -40,7 +40,7 @@ def calc(self, df: pd.DataFrame) -> Union[float, npt.NDArray[np.float64]]:

 class Mean(UnaryOperator, BaseMetric):
     def calc(self, df: pd.DataFrame) -> float:
-        return df.loc[:, self.operand].mean()
+        return pd.to_numeric(df.loc[:, self.operand], errors="coerce").mean()


 class VectorMean(UnaryOperator, VectorOperator, BaseMetric):
@@ -55,12 +55,12 @@ def calc(self, df: pd.DataFrame) -> Union[float, npt.NDArray[np.float64]]:

 class Min(UnaryOperator, BaseMetric):
     def calc(self, df: pd.DataFrame) -> float:
-        return cast(float, df.loc[:, self.operand].min())
+        return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").min())


 class Max(UnaryOperator, BaseMetric):
     def calc(self, df: pd.DataFrame) -> float:
-        return cast(float, df.loc[:, self.operand].max())
+        return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").max())
I don't want a crazy number of unit tests either, but it might be good to start scaffolding a small set that covers the edge cases (numeric, string, vector, timestamp), so that we have assurance about these cases going forward and the expected outputs are captured; e.g., the tests can act as a form of documentation. Nothing really robust, just trying to get us headed in the right direction. Maybe you were planning on some of this as part of the benchmarking effort?

Yeah, determining column type in pandas is tricky because it's intentionally very lenient (unlike Arrow or NumPy). Let me scope this out some more, and I will add more tests in the future.
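As a rough idea of what such a scaffold could look like, here is a minimal sketch covering only the numeric and string cases. The helpers `coerced_sum` and `coerced_mean` are hypothetical stand-ins that mirror the bodies of `Sum.calc` and `Mean.calc` after this change; they are not part of the project. The vector and timestamp cases are deliberately left out because their expected outputs are not established in this thread.

```python
import math

import pandas as pd


def coerced_sum(df: pd.DataFrame, column: str) -> float:
    # Hypothetical helper mirroring the new body of Sum.calc.
    return float(pd.to_numeric(df.loc[:, column], errors="coerce").sum())


def coerced_mean(df: pd.DataFrame, column: str) -> float:
    # Hypothetical helper mirroring the new body of Mean.calc.
    return float(pd.to_numeric(df.loc[:, column], errors="coerce").mean())


def test_numeric_column() -> None:
    df = pd.DataFrame({"x": [1, 2, 3]})
    assert coerced_sum(df, "x") == 6.0
    assert coerced_mean(df, "x") == 2.0


def test_numeric_strings_are_parsed() -> None:
    # Object-dtype column holding numbers stored as strings.
    df = pd.DataFrame({"x": pd.Series(["1", "2", "3"], dtype=object)})
    assert coerced_sum(df, "x") == 6.0
    assert coerced_mean(df, "x") == 2.0


def test_non_numeric_strings_become_nan() -> None:
    # Unparseable strings are coerced to NaN, which sum() and mean() skip.
    df = pd.DataFrame({"x": pd.Series(["a", "b"], dtype=object)})
    assert coerced_sum(df, "x") == 0.0
    assert math.isnan(coerced_mean(df, "x"))
```

The functions use plain `assert` statements so that a runner such as pytest can collect them, but they can also be called directly.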
 class Cardinality(UnaryOperator, BaseMetric):
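To make the effect of the change concrete, here is a small sketch of how `pd.to_numeric(..., errors="coerce")` behaves on a mixed object-dtype column. The sample values are made up for illustration. Strings that parse as numbers are converted, while anything unparseable becomes `NaN` and is then skipped by the default `skipna=True` reductions.

```python
import pandas as pd

# Illustrative data: an object-dtype column mixing numbers, numeric strings,
# a non-numeric string, and a missing value.
mixed = pd.Series([1, "2", "3.5", "abc", None], dtype=object)

coerced = pd.to_numeric(mixed, errors="coerce")

# "2" and "3.5" are parsed; "abc" and None become NaN.
n_missing = int(coerced.isna().sum())

# Reductions skip NaN by default, so only 1, 2 and 3.5 contribute.
total = float(coerced.sum())
smallest = float(coerced.min())
largest = float(coerced.max())

print(coerced.tolist())
print(n_missing, total, smallest, largest)
```

One consequence worth noting is that a column with no parseable values sums to `0.0` rather than raising, so a non-numeric operand fails silently instead of loudly.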
Slightly confused about this fix. Does this coerce strings to ASCII-based numbers, or something else?
Yes, it does. It's a bit of a trade-off: this approach is guaranteed to work, whereas, at least for the time being, there's no sure-fire way in pandas alone (without the help of Arrow or Parquet) to determine a priori whether a column is numeric. Below is one such example of an integer column.