[SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs. #5709

rxin · 2015-04-26T19:29:38Z

No description provided.

SparkQA · 2015-04-26T20:35:36Z

Test build #30959 has finished for PR 5709 at commit a7136cb.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

pwendell · 2015-04-26T21:22:25Z

Could it be confusing to users that the ID associated with each record might be different on stage or task retries? The fact that ordering within a partition is not deterministic has caused people some concern in the past, and I wonder if this could lead to more confusion, since you are implying some sort of ordering semantics.

rxin · 2015-04-27T00:33:20Z

The partition ID doesn't change between retries, does it?

rxin · 2015-04-27T00:33:50Z

python/pyspark/sql/functions.py

guaranteed TO be

pwendell · 2015-04-27T01:54:56Z

No, but the ordering of records in a partition can change, so you might have different identifiers for the same record across retries (unless this is only used for already sorted data... is it?).

rxin · 2015-04-27T04:11:50Z

Those could change in a shuffle, I guess, but I don't think this creates more confusion. What we care about here is not the record ordering, but that the output of this expression is monotonically increasing. That will always be true.

This is very similar to the row id idea a lot of databases have. Records in database tables also don't have ordering, unless they are sorted.
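To make the distinction in this exchange concrete, here is a minimal plain-Python sketch (not Spark code). It assumes an ID built from the partition index in the high bits plus a per-partition record counter in the low bits; the 33-bit split and the helper names `assign_ids` and `run` are assumptions for illustration and are not stated anywhere in this thread.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption for illustration: the low 33 bits hold the record counter
# within a partition, and the partition index sits above them.
RECORD_BITS = 33


def assign_ids(partition_index: int, records: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (id, record) pairs for one partition, in arrival order."""
    base = partition_index << RECORD_BITS
    for offset, record in enumerate(records):
        yield base + offset, record


def run(partitions: List[List[str]]) -> List[Tuple[int, str]]:
    """Simulate one attempt: process partitions in index order."""
    rows: List[Tuple[int, str]] = []
    for index, partition in enumerate(partitions):
        rows.extend(assign_ids(index, partition))
    return rows


# First attempt, then a retry in which the same records reach the same
# partitions but arrive in a different order inside partition 0.
first_attempt = run([["a", "b", "c"], ["d", "e"]])
retry = run([["c", "a", "b"], ["d", "e"]])

first_ids = [row_id for row_id, _ in first_attempt]
retry_ids = [row_id for row_id, _ in retry]

# Record -> ID lookup for each attempt.
first_by_record: Dict[str, int] = {rec: row_id for row_id, rec in first_attempt}
retry_by_record: Dict[str, int] = {rec: row_id for row_id, rec in retry}
```

In this sketch the sequence of IDs is identical and strictly increasing in both attempts, which is the guarantee being described, while record `"a"` receives a different ID on the retry, which is the concern being raised.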

rxin · 2015-04-27T04:28:23Z

@pwendell you raised a very good point about ordering of records within RDDs and DataFrames. I think we should document those more clearly in the javadoc for these.

pwendell · 2015-04-27T04:45:27Z

@rxin yeah, I just mean that if I'm in a database and I run the same query twice, I will get the same row ID for the same record. Because of nondeterminism in the shuffle, that's not true here.

rxin · 2015-04-27T04:46:47Z

(That's not always true -- somebody could've deleted an index and then the scan gets turned from index scan to sequential scan, and then record ordering changed)

pwendell · 2015-04-27T04:48:47Z

Oh I see - I guess it doesn't matter then.

SparkQA · 2015-04-27T18:19:03Z

Test build #30969 has started for PR 5709 at commit a9fda0d.

rxin · 2015-04-28T04:33:40Z

Jenkins, retest this please.

yhuai · 2015-04-28T04:42:09Z

LGTM

SparkQA · 2015-04-28T04:49:28Z

Test build #31112 has finished for PR 5709 at commit a9fda0d.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class MonotonicallyIncreasingID() extends LeafExpression
This patch does not change any dependencies.

SparkQA · 2015-04-28T07:28:19Z

Test build #31116 has finished for PR 5709 at commit 7853611.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

Author: Reynold Xin <rxin@databricks.com>

Closes apache#5709 from rxin/inc-id and squashes the following commits:

7853611 [Reynold Xin] private sql.
a9fda0d [Reynold Xin] Missed a few numbers.
343d896 [Reynold Xin] Self review feedback.
a7136cb [Reynold Xin] [SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs.

[SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs.

a7136cb

rxin added 2 commits April 26, 2015 21:24

Self review feedback.

343d896

Missed a few numbers.

a9fda0d

private sql.

7853611

asfgit closed this in d94cd1a Apr 28, 2015

4 participants