-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: numeric operations on numeric values only #381
Conversation
@@ -27,7 +27,7 @@ def calc(self, df: pd.DataFrame) -> int: | |||
|
|||
class Sum(UnaryOperator, BaseMetric): | |||
def calc(self, df: pd.DataFrame) -> float: | |||
return cast(float, df.loc[:, self.operand].sum()) | |||
return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").sum()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly confused about this fix. Does this coerce strings to ascii based numbers or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does. It's a bit of trade-off in the sense that this is guaranteed to work, because the flip-side is that, at least for the time being, there's no sure-fire way--in pandas exclusively (without the help of Arrows or Parquet)--to determine if a column is numeric or not a priori. Below is one such example of an integer column.
>>> import pandas as pd
>>> from pandas.api.types import is_numeric_dtype
>>> s = pd.Series([1,2,None], dtype=object)
>>> s
0 1
1 2
2 None
dtype: object
>>> s.max()
2
>>> is_numeric_dtype(s)
False
>>> pd.to_numeric(s)
0 1.0
1 2.0
2 NaN
dtype: float64
|
||
|
||
class Max(UnaryOperator, BaseMetric): | ||
def calc(self, df: pd.DataFrame) -> float: | ||
return cast(float, df.loc[:, self.operand].max()) | ||
return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").max()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want a crazy number of unit tests either but it might be good to start scaffolding a small set so that we just start testing the edge cases (numeric, string, vector, timestamp) just so that we have assurance about these cases going forward and so that the expected outputs are captured - e.g. the tests can act as a form of documentation.
Nothing really robust, just trying to get us in the right direction. Maybe you were planning on some of this as part of the benchmarking effort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, determining column type in pandas is tricky because it's intentionally very lenient (unlike arrows or numpy). let me scope this out some more and will add more tests in the future
* feat: clean-up tooltip styles (#382) * feat: add links to app / docs on launch (#383) * v0.0.2rc5 * convert to numeric (#381) * feat: add descriptions to use-cases (#387) * feat: add descriptions to use-cases * fix ruff styling * Fix prettier * Remove readme * pr comments * pr comments * rewrap * v0.0.2rc6 * docs: add module level docstring (#391) * docs: add module level docstring * docs: add module level docstring * Update README.md --------- Co-authored-by: Roger Yang <80478925+RogerHYang@users.noreply.github.com>
* docs: No subject (GITBOOK-34) * docs: sync 3-16-2023 (#390) * feat: clean-up tooltip styles (#382) * feat: add links to app / docs on launch (#383) * v0.0.2rc5 * convert to numeric (#381) * feat: add descriptions to use-cases (#387) * feat: add descriptions to use-cases * fix ruff styling * Fix prettier * Remove readme * pr comments * pr comments * rewrap * v0.0.2rc6 * docs: add module level docstring (#391) * docs: add module level docstring * docs: add module level docstring * Update README.md --------- Co-authored-by: Roger Yang <80478925+RogerHYang@users.noreply.github.com> * docs: serve local image files (GITBOOK-33) * remove "target_score" column from how-to example * docs: define actuals in schema how-to (GITBOOK-35) * docs: unnest api section (GITBOOK-36) * docs: small note to define actuals in quickstart (GITBOOK-37) * docs: change embedding vector column name in schema how-to example (GITBOOK-38) * rearrange columns in text embedding example * docs: clarify language in quickstart (GITBOOK-39) * docs: No subject (GITBOOK-43) * docs: remove tutorial sub-pages and add notebooks sub-page (GITBOOK-40) * docs: add explanation to warning in how-to schema (GITBOOK-44) * docs: using example datasets how-to guide (GITBOOK-47) * docs: fix broken link on phoenix basics page (GITBOOK-48) * docs: remove session.view() from quickstart (GITBOOK-49) * docs: update default parameters to Session.view (GITBOOK-50) * docs: sentiment classification notebook link (GITBOOK-51) * docs: remove note (GITBOOK-52) * docs: added regression schema example (GITBOOK-53) * docs: added ranking model schema examples (GITBOOK-54) * docs: What is Phoenix section (GITBOOK-55) * docs: updated images (GITBOOK-56) * fix merge confict in main * chore: clear output of quickstart.ipnb --------- Co-authored-by: Jason Lopatecki <jason@arize.com> Co-authored-by: Roger Yang <80478925+RogerHYang@users.noreply.github.com> Co-authored-by: Xander Song <xsong@arize.com> Co-authored-by: Alexander Song <axiomofjoy@gmail.com> Co-authored-by: Amit Goren <amit@arize.com>
resolves #380