fix: numeric operations on numeric values only #381

Merged · 1 commit · Mar 16, 2023
8 changes: 4 additions & 4 deletions src/phoenix/metrics/metrics.py
@@ -27,7 +27,7 @@ def calc(self, df: pd.DataFrame) -> int:

class Sum(UnaryOperator, BaseMetric):
def calc(self, df: pd.DataFrame) -> float:
-        return cast(float, df.loc[:, self.operand].sum())
+        return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").sum())
Contributor:
Slightly confused about this fix. Does this coerce strings to ASCII-based numbers, or something else?

Contributor Author @RogerHYang, Mar 16, 2023:

Yes, it does. It's a bit of a trade-off: this approach is guaranteed to work, while, at least for the time being, there's no sure-fire way in pandas alone (without help from Arrow or Parquet) to determine a priori whether a column is numeric. Below is one such example with an integer column.

>>> import pandas as pd
>>> from pandas.api.types import is_numeric_dtype
>>> s = pd.Series([1,2,None], dtype=object)
>>> s
0       1
1       2
2    None
dtype: object
>>> s.max()
2
>>> is_numeric_dtype(s)
False
>>> pd.to_numeric(s)
0    1.0
1    2.0
2    NaN
dtype: float64
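To the string question above: `errors="coerce"` does not do any character-code conversion. Values that parse as numbers (including numeric strings like `"2"`) are converted; everything else becomes `NaN` and is skipped by the aggregation. A minimal sketch of that behavior:

```python
import pandas as pd

# Mixed object column: parseable values convert, everything else becomes NaN.
s = pd.Series([1, "2", "abc", None], dtype=object)
coerced = pd.to_numeric(s, errors="coerce")
print(coerced.tolist())  # [1.0, 2.0, nan, nan]
print(coerced.sum())     # 3.0 -- NaN entries are skipped by default
```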



class VectorSum(UnaryOperator, VectorOperator, ZeroInitialValue, BaseMetric):
@@ -40,7 +40,7 @@ def calc(self, df: pd.DataFrame) -> Union[float, npt.NDArray[np.float64]]:

class Mean(UnaryOperator, BaseMetric):
def calc(self, df: pd.DataFrame) -> float:
-        return df.loc[:, self.operand].mean()
+        return pd.to_numeric(df.loc[:, self.operand], errors="coerce").mean()


class VectorMean(UnaryOperator, VectorOperator, BaseMetric):
@@ -55,12 +55,12 @@ def calc(self, df: pd.DataFrame) -> Union[float, npt.NDArray[np.float64]]:

class Min(UnaryOperator, BaseMetric):
def calc(self, df: pd.DataFrame) -> float:
-        return cast(float, df.loc[:, self.operand].min())
+        return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").min())


class Max(UnaryOperator, BaseMetric):
def calc(self, df: pd.DataFrame) -> float:
-        return cast(float, df.loc[:, self.operand].max())
+        return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").max())
Contributor:

I don't want a huge number of unit tests either, but it might be good to start scaffolding a small set that covers the edge cases (numeric, string, vector, timestamp). That way we have assurance about these cases going forward, and the expected outputs are captured: the tests can act as a form of documentation.

Nothing really robust, just trying to get us in the right direction. Maybe you were planning on some of this as part of the benchmarking effort?
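A sketch of what such a scaffold could look like. The `coerced_sum` helper below is a local stand-in that mirrors the fix in this PR, not the actual `Sum` metric class; column names and expected values are illustrative:

```python
import pandas as pd

# Hypothetical edge-case fixtures: a clean numeric column, an all-string
# column, and a mixed object column.
df = pd.DataFrame(
    {
        "numeric": [1.0, 2.0, 3.0],
        "stringy": ["a", "b", "c"],
        "mixed": [1, "2", None],
    }
)

def coerced_sum(df: pd.DataFrame, column: str) -> float:
    # Mirrors the fix: coerce before aggregating so non-numeric values
    # become NaN and are skipped by the sum.
    return pd.to_numeric(df.loc[:, column], errors="coerce").sum()

assert coerced_sum(df, "numeric") == 6.0
assert coerced_sum(df, "stringy") == 0.0  # all NaN after coercion; sum is 0.0
assert coerced_sum(df, "mixed") == 3.0    # 1 + 2.0, None skipped as NaN
```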

Contributor Author:

Yeah, determining column type in pandas is tricky because it's intentionally very lenient (unlike Arrow or NumPy). Let me scope this out some more; I'll add more tests in the future.
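One pragmatic workaround for the detection problem (an assumption on my part, not what this PR implements) is to probe convertibility directly: a column counts as numeric if every non-null value survives `pd.to_numeric` without turning into `NaN`:

```python
import pandas as pd

def is_numeric_convertible(s: pd.Series) -> bool:
    # Hypothetical heuristic: compare null masks before and after coercion.
    # If coercion introduced no new NaNs, every non-null value was numeric.
    coerced = pd.to_numeric(s, errors="coerce")
    return bool(coerced.notna().eq(s.notna()).all())

print(is_numeric_convertible(pd.Series([1, 2, None], dtype=object)))  # True
print(is_numeric_convertible(pd.Series(["a", 1], dtype=object)))      # False
```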



class Cardinality(UnaryOperator, BaseMetric):