Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: numeric operations on numeric values only #381

Merged
merged 1 commit into from
Mar 16, 2023
Merged

Conversation

RogerHYang
Copy link
Contributor

resolves #380

@@ -27,7 +27,7 @@ def calc(self, df: pd.DataFrame) -> int:

class Sum(UnaryOperator, BaseMetric):
def calc(self, df: pd.DataFrame) -> float:
return cast(float, df.loc[:, self.operand].sum())
return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").sum())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly confused about this fix. Does this coerce strings to ascii based numbers or?

Copy link
Contributor Author

@RogerHYang RogerHYang Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does. It's a bit of trade-off in the sense that this is guaranteed to work, because the flip-side is that, at least for the time being, there's no sure-fire way--in pandas exclusively (without the help of Arrows or Parquet)--to determine if a column is numeric or not a priori. Below is one such example of an integer column.

>>> import pandas as pd
>>> from pandas.api.types import is_numeric_dtype
>>> s = pd.Series([1,2,None], dtype=object)
>>> s
0       1
1       2
2    None
dtype: object
>>> s.max()
2
>>> is_numeric_dtype(s)
False
>>> pd.to_numeric(s)
0    1.0
1    2.0
2    NaN
dtype: float64



class Max(UnaryOperator, BaseMetric):
def calc(self, df: pd.DataFrame) -> float:
return cast(float, df.loc[:, self.operand].max())
return cast(float, pd.to_numeric(df.loc[:, self.operand], errors="coerce").max())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want a crazy number of unit tests either but it might be good to start scaffolding a small set so that we just start testing the edge cases (numeric, string, vector, timestamp) just so that we have assurance about these cases going forward and so that the expected outputs are captured - e.g. the tests can act as a form of documentation.

Nothing really robust, just trying to get us in the right direction. Maybe you were planning on some of this as part of the benchmarking effort?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, determining column type in pandas is tricky because it's intentionally very lenient (unlike arrows or numpy). let me scope this out some more and will add more tests in the future

@RogerHYang RogerHYang merged commit be2e2d5 into main Mar 16, 2023
@RogerHYang RogerHYang deleted the numeric-operations branch March 16, 2023 19:07
mikeldking added a commit that referenced this pull request Mar 17, 2023
* feat: clean-up tooltip styles (#382)

* feat: add links to app / docs on launch (#383)

* v0.0.2rc5

* convert to numeric (#381)

* feat: add descriptions to use-cases (#387)

* feat: add descriptions to use-cases

* fix ruff styling

* Fix prettier

* Remove readme

* pr comments

* pr comments

* rewrap

* v0.0.2rc6

* docs: add module level docstring (#391)

* docs: add module level docstring

* docs: add module level docstring

* Update README.md

---------

Co-authored-by: Roger Yang <80478925+RogerHYang@users.noreply.github.com>
mikeldking added a commit that referenced this pull request Mar 23, 2023
* docs: No subject (GITBOOK-34)

* docs: sync 3-16-2023 (#390)

* feat: clean-up tooltip styles (#382)

* feat: add links to app / docs on launch (#383)

* v0.0.2rc5

* convert to numeric (#381)

* feat: add descriptions to use-cases (#387)

* feat: add descriptions to use-cases

* fix ruff styling

* Fix prettier

* Remove readme

* pr comments

* pr comments

* rewrap

* v0.0.2rc6

* docs: add module level docstring (#391)

* docs: add module level docstring

* docs: add module level docstring

* Update README.md

---------

Co-authored-by: Roger Yang <80478925+RogerHYang@users.noreply.github.com>

* docs: serve local image files (GITBOOK-33)

* remove "target_score" column from how-to example

* docs: define actuals in schema how-to (GITBOOK-35)

* docs: unnest api section (GITBOOK-36)

* docs: small note to define actuals in quickstart (GITBOOK-37)

* docs: change embedding vector column name in schema how-to example (GITBOOK-38)

* rearrange columns in text embedding example

* docs: clarify language in quickstart (GITBOOK-39)

* docs: No subject (GITBOOK-43)

* docs: remove tutorial sub-pages and add notebooks sub-page (GITBOOK-40)

* docs: add explanation to warning in how-to schema (GITBOOK-44)

* docs: using example datasets how-to guide (GITBOOK-47)

* docs: fix broken link on phoenix basics page (GITBOOK-48)

* docs: remove session.view() from quickstart (GITBOOK-49)

* docs: update default parameters to Session.view (GITBOOK-50)

* docs: sentiment classification notebook link (GITBOOK-51)

* docs: remove note (GITBOOK-52)

* docs: added regression schema example  (GITBOOK-53)

* docs: added ranking model schema examples (GITBOOK-54)

* docs: What is Phoenix section (GITBOOK-55)

* docs: updated images (GITBOOK-56)

* fix merge confict in main

* chore: clear output of quickstart.ipnb

---------

Co-authored-by: Jason Lopatecki <jason@arize.com>
Co-authored-by: Roger Yang <80478925+RogerHYang@users.noreply.github.com>
Co-authored-by: Xander Song <xsong@arize.com>
Co-authored-by: Alexander Song <axiomofjoy@gmail.com>
Co-authored-by: Amit Goren <amit@arize.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

🐛 [gql] min / max queries fail because it attempts to calculate the value on categorical
2 participants