Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

Document non-determinism of max_by and min_by

Why are the changes needed?

I have been confused by this non-determinism twice, it occurred like a correctness bug to me.
So I think we need to document it

Does this PR introduce any user-facing change?

doc change only

How was this patch tested?

ci

Was this patch authored or co-authored using generative AI tooling?

no

Copy link
Contributor

@JoshRosen JoshRosen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 towards improving the documentation here.

I think the non-determinism might be clear to anyone who is already familiar with the non-determinism of first, last, etc., but it doesn't hurt to call it out explicitly.

I see that we have made similar documentation improvements in the past, e.g. in #27099. In that PR, it looks like we updated multiple documentation sources, including Python, R (where applicable), the Scala docs for the expression itself (which also feed into the SQL function catalog documentation), the functions.scala docs, etc. Maybe we should also update those other sources as part of this PR? Due to Spark Connect, it looks like we might now have two copies of functions.scala that might need updates.

Those existing docs have fewer examples than the PySpark docs, so in that setting it's probably fine to just add the one-sentence note about non-determinism.

@zhengruifeng
Copy link
Contributor Author

+1 towards improving the documentation here.

I think the non-determinism might be clear to anyone who is already familiar with the non-determinism of first, last, etc., but it doesn't hurt to call it out explicitly.

I see that we have made similar documentation improvements in the past, e.g. in #27099. In that PR, it looks like we updated multiple documentation sources, including Python, R (where applicable), the Scala docs for the expression itself (which also feed into the SQL function catalog documentation), the functions.scala docs, etc. Maybe we should also update those other sources as part of this PR? Due to Spark Connect, it looks like we might now have two copies of functions.scala that might need updates.

Those existing docs have fewer examples than the PySpark docs, so in that setting it's probably fine to just add the one-sentence note about non-determinism.

make sense, let me only add a note and also update other documents

@zhengruifeng zhengruifeng changed the title [SPARK-48842][PYTHON][DOCS] Document non-determinism of max_by and min_by [SPARK-48842][DOCS] Document non-determinism of max_by and min_by Jul 10, 2024
Comment on lines 1276 to 1277
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit hard to understand. Is there a simpler example we can use to demonstrate this idea?

Copy link
Member

@HyukjinKwon HyukjinKwon Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just say that The function is non-deterministic so the output order can be different for those associated the same values of `col`

@zhengruifeng
Copy link
Contributor Author

thanks all, merged to master

@zhengruifeng zhengruifeng deleted the py_doc_max_by branch July 12, 2024 04:41
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
### What changes were proposed in this pull request?
Document non-determinism of max_by and min_by

### Why are the changes needed?
I have been confused by this non-determinism twice, it occurred like a correctness bug to me.
So I think we need to document it

### Does this PR introduce _any_ user-facing change?
doc change only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47266 from zhengruifeng/py_doc_max_by.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants