@rxin
Contributor

rxin commented Apr 26, 2015

No description provided.

@SparkQA

SparkQA commented Apr 26, 2015

Test build #30959 has finished for PR 5709 at commit a7136cb.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@pwendell
Contributor

Could it be confusing to users that the ID associated with each record might be different on stage or task retries? The fact that ordering within a partition is not deterministic has caused people some concern in the past, and I wonder if this could lead to more confusion, since it implies some kind of ordering semantics.

@rxin
Contributor Author

rxin commented Apr 27, 2015

The partition ID doesn't change between retries, does it?

Contributor Author

guaranteed TO be

@pwendell
Contributor

No, but the ordering of records in a partition can change, so you might have different identifiers for the same record across retries (unless this is only used for already sorted data... is it?).
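
To make the concern concrete, here is a minimal sketch in plain Python (no Spark involved). It assumes the ID is built from the partition ID plus the record's position within its partition. The `assign_ids` helper and the 33-bit shift are hypothetical choices for illustration, not code taken from this patch. The sketch shows the same partition computed twice with its records arriving in a different order, and then the already-sorted case.

```python
# Hypothetical model: id = (partition_id << RECORD_BITS) + position in partition.
RECORD_BITS = 33  # assumed width of the per-partition record counter


def assign_ids(partition_id, records):
    """Map each record to an ID derived from its position in the partition."""
    base = partition_id << RECORD_BITS
    return {record: base + position for position, record in enumerate(records)}


# The same partition computed twice; a retry may see its input in another order.
first_attempt = assign_ids(1, ["a", "b", "c"])
retry = assign_ids(1, ["c", "a", "b"])

# Both attempts hand out the same set of IDs...
assert set(first_attempt.values()) == set(retry.values())
# ...but record "a" ends up with a different ID on the retry.
assert first_attempt["a"] != retry["a"]

# If the records are sorted before IDs are assigned, the mapping is repeatable.
assert assign_ids(1, sorted(["a", "b", "c"])) == assign_ids(1, sorted(["c", "a", "b"]))
```

Under these assumptions the partition ID contributes the same high bits on every attempt, so only the within-partition position can move a record to a different ID.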

@rxin
Contributor Author

rxin commented Apr 27, 2015

Those could change in a shuffle, I guess, but I don't think this creates more confusion. What we care about here is not the record ordering, but that the output of this expression is monotonically increasing. That will always be true.

This is very similar to the row ID idea that a lot of databases have. Records in database tables also don't have an ordering unless they are sorted.
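
The monotonicity claim can be checked with a small stand-alone sketch in plain Python. It assumes, for illustration only, that an ID is the partition ID shifted left by 33 bits plus the record's position within the partition. The `monotonic_ids` helper and that bit width are hypothetical and are not taken from this patch.

```python
RECORD_BITS = 33  # assumed width of the per-partition record counter


def monotonic_ids(partitions):
    """Yield one ID per record, walking the partitions in order."""
    for partition_id, records in enumerate(partitions):
        for position, _ in enumerate(records):
            yield (partition_id << RECORD_BITS) + position


ids = list(monotonic_ids([["x", "y"], ["z"], ["u", "v", "w"]]))

# Strictly increasing, and therefore unique, whatever order the records are in.
assert all(a < b for a, b in zip(ids, ids[1:]))

# The IDs are not consecutive: there is a gap at each partition boundary.
print(ids[:3])  # prints [0, 1, 8589934592]
```

In this model the guarantee depends only on positions, never on record contents, which is why reordering records inside a partition cannot break it.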

@rxin
Contributor Author

rxin commented Apr 27, 2015

@pwendell you raised a very good point about the ordering of records within RDDs and DataFrames. I think we should document that more clearly in the Javadoc for these.

@pwendell
Contributor

@rxin Yeah, I just mean that if I'm in a database and I run the same query twice, I will get the same row ID for the same record. Because of nondeterminism in the shuffle, that's not true here.

@rxin
Contributor Author

rxin commented Apr 27, 2015

(That's not always true: somebody could have dropped an index, which turns the index scan into a sequential scan and changes the record ordering.)

@pwendell
Contributor

Oh I see - I guess it doesn't matter then.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #30969 has started for PR 5709 at commit a9fda0d.

@rxin
Contributor Author

rxin commented Apr 28, 2015

Jenkins, retest this please.

@yhuai
Contributor

yhuai commented Apr 28, 2015

LGTM

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31112 has finished for PR 5709 at commit a9fda0d.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class MonotonicallyIncreasingID() extends LeafExpression
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31116 has finished for PR 5709 at commit 7853611.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

asfgit closed this in d94cd1a Apr 28, 2015

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
Author: Reynold Xin <rxin@databricks.com>

Closes apache#5709 from rxin/inc-id and squashes the following commits:

7853611 [Reynold Xin] private sql.
a9fda0d [Reynold Xin] Missed a few numbers.
343d896 [Reynold Xin] Self review feedback.
a7136cb [Reynold Xin] [SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs.
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015