[SPARK-48842][DOCS] Document non-determinism of max_by and min_by #47266
JoshRosen
left a comment
+1 towards improving the documentation here.
I think the non-determinism might be clear to anyone who is already familiar with the non-determinism of first, last, etc., but it doesn't hurt to call it out explicitly.
I see that we have made similar documentation improvements in the past, e.g. in #27099. In that PR, it looks like we updated multiple documentation sources, including Python, R (where applicable), the Scala docs for the expression itself (which also feed into the SQL function catalog documentation), the functions.scala docs, etc. Maybe we should also update those other sources as part of this PR? Due to Spark Connect, it looks like we might now have two copies of functions.scala that might need updates.
Those existing docs have fewer examples than the PySpark docs, so in that setting it's probably fine to just add the one-sentence note about non-determinism.
Makes sense. Let me add only a note, and also update the other documents.
I think this is a bit hard to understand. Is there a simpler example we can use to demonstrate this idea?
I think we can just say that the function is non-deterministic, so the output can differ for rows associated with the same values of `col`.
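To make the tie behaviour concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: the `max_by` helper and the sample rows are hypothetical, and the list order stands in for the order in which rows happen to reach the aggregate, which Spark does not guarantee.

```python
def max_by(rows):
    """Return the value paired with the largest ordering key.

    Hypothetical stand-in for the aggregate: the first row seen wins a tie,
    so the result depends on the order in which rows arrive.
    """
    best_value, best_key = None, None
    for value, key in rows:
        if best_key is None or key > best_key:
            best_value, best_key = value, key
    return best_value


# Two rows share the maximum ordering key (2013).
rows = [("Java", 2012), ("dotNET", 2013), ("Scala", 2013)]

result_a = max_by(rows)                  # rows arrive in list order
result_b = max_by(list(reversed(rows)))  # same rows, opposite arrival order

print(result_a, result_b)
```

The two calls see the same data but return different values, because nothing other than arrival order decides between the tied rows.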
Thanks all, merged to master.
### What changes were proposed in this pull request?
Document non-determinism of max_by and min_by

### Why are the changes needed?
I have been confused by this non-determinism twice; it looked like a correctness bug to me, so I think we need to document it.

### Does this PR introduce _any_ user-facing change?
Doc change only.

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47266 from zhengruifeng/py_doc_max_by.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>