[SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs. #5709
Conversation
|
Test build #30959 has finished for PR 5709 at commit
|
|
Could it be confusing to users that the ID associated with each record might differ across stage or task retries? The fact that ordering within a partition is not deterministic has caused people some concern in the past, and I wonder whether this could lead to more confusion, since it implies some ordering semantics. |
|
The partition ID doesn't change between retries, does it? |
python/pyspark/sql/functions.py
Outdated
guaranteed TO be
|
No, but the ordering of records in a partition can change, so you might have different identifiers for the same record across retries (unless this is only used for already sorted data... is it?). |
|
Those could change in a shuffle, I guess, but I don't think this creates more confusion. What we care about here is not the record ordering but that the output of this expression is monotonically increasing, and that will always be true. This is very similar to the row ID that many databases have. Records in database tables also have no ordering unless they are sorted. |
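A minimal sketch of the property described in this comment, in plain Python rather than Spark. It assumes the bit layout that the Spark documentation gives for `monotonically_increasing_id` (partition index in the upper 31 bits, per-partition record counter in the lower 33 bits). The helper `assign_ids` and the sample partitions are hypothetical and are not code from this PR.

```python
from typing import List, Sequence, Tuple

# Assumed layout: the lower 33 bits hold the record counter within a
# partition, and the partition index sits in the bits above them.
RECORD_BITS = 33


def assign_ids(partitions: Sequence[Sequence[str]]) -> List[List[Tuple[int, str]]]:
    """Tag every record with (id, record); hypothetical helper for illustration."""
    tagged = []
    for partition_index, records in enumerate(partitions):
        base = partition_index << RECORD_BITS
        tagged.append([(base + offset, record) for offset, record in enumerate(records)])
    return tagged


# Three partitions of unequal size.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
tagged = assign_ids(partitions)

# Flatten the IDs in partition order to inspect them.
flat_ids = [row_id for part in tagged for row_id, _ in part]
print(flat_ids)
```

Under these assumptions the IDs are unique and increasing in partition order but not consecutive: each partition starts at a multiple of 2^33, so gaps appear between partitions.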
|
@pwendell you raised a very good point about ordering of records within RDDs and DataFrames. I think we should document those more clearly in the javadoc for these. |
|
@rxin yeah, I just mean that if I'm in a database and I run the same query twice, I will get the same row ID for the same record. Because of nondeterminism in the shuffle, that's not true here. |
|
(That's not always true: somebody could have dropped an index, turning the index scan into a sequential scan and changing the record ordering.) |
|
Oh I see - I guess it doesn't matter then. |
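The retry concern raised in this thread can be illustrated with a small plain-Python simulation. It is only a sketch: it assumes IDs are built as the partition index shifted left by 33 bits plus the record's position within the partition, and the helper `ids_by_record` and the sample data are hypothetical.

```python
from typing import Dict, Sequence

RECORD_BITS = 33  # assumed width of the per-partition record counter


def ids_by_record(partitions: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each record to the ID it would get for a given arrival order."""
    return {
        record: (partition_index << RECORD_BITS) + offset
        for partition_index, records in enumerate(partitions)
        for offset, record in enumerate(records)
    }


# Same records in the same partitions, but the first partition's records
# arrive in a different order on the retry (for example, after a shuffle).
first_attempt = ids_by_record([["a", "b", "c"], ["d", "e"]])
retry = ids_by_record([["c", "a", "b"], ["d", "e"]])

# Records whose ID differs between the two attempts.
changed = sorted(r for r in first_attempt if first_attempt[r] != retry[r])
print(changed)
```

In both attempts the set of IDs is identical and still increasing within each partition. Only the pairing between record and ID changes, which is the distinction the thread draws between monotonicity and stable row identity.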
|
Test build #30969 has started for PR 5709 at commit |
|
Jenkins, retest this please. |
|
LGTM |
|
Test build #31112 has finished for PR 5709 at commit
|
|
Test build #31116 has finished for PR 5709 at commit
|
Author: Reynold Xin <rxin@databricks.com>

Closes apache#5709 from rxin/inc-id and squashes the following commits:

7853611 [Reynold Xin] private sql.
a9fda0d [Reynold Xin] Missed a few numbers.
343d896 [Reynold Xin] Self review feedback.
a7136cb [Reynold Xin] [SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs.
No description provided.