Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Oct 7, 2022

What changes were proposed in this pull request?

add a dedicated expression for product:

  1. for integral inputs, directly use LongType to avoid the rounding error:
  2. when ignoreNA is true, skip following values when meet a zero;
  3. when ignoreNA is false, skip following values when meet a zero or null;

Why are the changes needed?

  1. existing computation logic is too complex in the PySpark side, with a dedicated expression, we can simplify the PySpark side and apply it in more cases.
  2. existing computation of product is likely to introduce rounding error for integral inputs, for example 55108 x 55108 x 55108 x 55108 in the following case:

before:

In [14]: df = pd.DataFrame({"a": [55108, 55108, 55108, 55108], "b": [55108.0, 55108.0, 55108.0, 55108.0], "c": [1, 2, 3, 4]})

In [15]: df.a.prod()
Out[15]: 9222710978872688896

In [16]: type(df.a.prod())
Out[16]: numpy.int64

In [17]: df.b.prod()
Out[17]: 9.222710978872689e+18

In [18]: type(df.b.prod())
Out[18]: numpy.float64

In [19]: 

In [19]: psdf = ps.from_pandas(df)

In [20]: psdf.a.prod()
Out[20]: 9222710978872658944

In [21]: type(psdf.a.prod())
Out[21]: int

In [22]: psdf.b.prod()
Out[22]: 9.222710978872659e+18

In [23]: type(psdf.b.prod())
Out[23]: float


In [24]: df.a.prod() - psdf.a.prod()
Out[24]: 29952

after:

In [1]: import pyspark.pandas as ps

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"a": [55108, 55108, 55108, 55108], "b": [55108.0, 55108.0, 55108.0, 55108.0], "c": [1, 2, 3, 4]})

In [4]: df.a.prod()
Out[4]: 9222710978872688896

In [5]: psdf = ps.from_pandas(df)

In [6]: psdf.a.prod()
Out[6]: 9222710978872688896                                                     

In [7]: df.a.prod() - psdf.a.prod()
Out[7]: 0

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing UT & added UT

init

init

init

fix update

fix update

fix update

init

init

fix equal

refine product

fix prod

nit
self.assert_eq(pdf.b.kurtosis(), psdf.b.kurtosis())
self.assert_eq(pdf.c.kurtosis(), psdf.c.kurtosis())

def test_prod_precision(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test will fail in original implementation due to precision loss

the new implementation will provide the precise result for integral inputs, when the product in [Long.MinValue, Long.MaxValue]

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me

@zhengruifeng
Copy link
Contributor Author

also cc @HyukjinKwon @itholic @Yikun

@HyukjinKwon
Copy link
Member

Merged to master.

@zhengruifeng
Copy link
Contributor Author

thanks @srowen @HyukjinKwon for reivews

@zhengruifeng zhengruifeng deleted the ps_new_prod branch October 11, 2022 02:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants