Skip to content

Conversation

@ggupta81
Copy link

@ggupta81 ggupta81 commented Aug 4, 2015

JIRA: SPARK-9592
In current implementation, First and Last aggregates were calculating the values for entire DataFrame partition and then the same value was returned for all GroupedData in the partition.
Fixed the First and Last aggregates to compute first and last value per GroupedData instead of entire DataFrame.

…ion and not on entire DataFrame partition.
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@CodingCat
Copy link
Contributor

@ggupta81 ggupta81 changed the title Fixed First and Last aggregates to calculate on GroupedData partition instead of entire dataFrame [SPARK-9592] [SQL] Fixed First and Last aggregates to calculate on GroupedData partition instead of entire dataFrame Aug 4, 2015
@rxin
Copy link
Contributor

rxin commented Aug 4, 2015

cc @yhuai

@yhuai
Copy link
Contributor

yhuai commented Aug 4, 2015

@ggupta81 Can you attach a test case that generates wrong result?

@ggupta81
Copy link
Author

@yhuai How to attach a testcase. I could not find the instructions on:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-PullRequest

Here is the command line script to test:

import sqlContext.implicits._
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
val schemaString = "name age"
val schema = StructType( schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rdd = sc.parallelize(List( Row("Micheal","31"), Row("Micheal", "30"), Row("Micheal", "28"), Row("Micheal", "32"), Row("Andy", "30"), Row("Andy", "31") , Row("Andy", "27"), Row("Justin", "19")), 1)
val df = sqlContext.createDataFrame(rdd, schema)
df.cache
df.registerTempTable("df")
sqlContext.sql("SELECT name,last(age) FROM df GROUP BY name").show()

The result is:
+-------+--+
| name|c1|
+-------+--+
| Andy|19|
| Justin|19|
|Micheal|19|
+-------+--+

whereas it should be:
+-------+--+
| name|c1|
+-------+--+
| Andy|27|
| Justin|19|
|Micheal|32|
+-------+--+

@yhuai
Copy link
Contributor

yhuai commented Aug 11, 2015

@ggupta81 oh, I meant adding a comment with your test case. Actually, I tried it with master. I got the correct result.

@yhuai
Copy link
Contributor

yhuai commented Aug 11, 2015

I believe it has been fixed. Can you close this pr?

@ggupta81
Copy link
Author

Shouldn't we be fixing the same in 1.4?
My original patch request was for 1.4 as i am working on the same revision.
Let me know and i will again add a pull request on 1.4
On Aug 11, 2015 22:10, "Yin Huai" notifications@github.com wrote:

I believe it has been fixed.


Reply to this email directly or view it on GitHub
#7928 (comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggupta81 I see the problem. I think we only need to change this line to fix it. Basically, the input row can be a mutable row. So, we need to eagerly evaluate the expression instead of just pointing result to the input.

@yhuai
Copy link
Contributor

yhuai commented Aug 12, 2015

@ggupta81 We are having a new implementation of aggregate functions in 1.5.0. The last function having problem in your case is our old implementation. Since I will make changes to first and last function in 1.5 and master branch, I can fix it in these two branches. Can you submit a pr to fix the 1.4 branch?

I think only last has this problem. We do not need to change first.

@ggupta81
Copy link
Author

correct. I will make the changes you have suggested and update the pull
request.

On Wed, Aug 12, 2015 at 8:38 AM, Yin Huai notifications@github.com wrote:

@ggupta81 https://github.com/ggupta81 We are having a new
implementation of aggregate functions in 1.5.0. The last function having
problem in your case is our old implementation. Let's fix it and I will
merge it to master, 1.5 branch and 1.4 branch.

I think only last has this problem. We do not need to change first.


Reply to this email directly or view it on GitHub
#7928 (comment).

_Gaurav Gupta_Engineering Manager @ Adobe
9971666780

@ggupta81
Copy link
Author

Should this be closed now as we have another pull request #8113
#8113 for fixing the same?

On Wed, Aug 12, 2015 at 9:01 AM, gaurav gupta gupta.gaurav81@gmail.com
wrote:

correct. I will make the changes you have suggested and update the pull
request.

On Wed, Aug 12, 2015 at 8:38 AM, Yin Huai notifications@github.com
wrote:

@ggupta81 https://github.com/ggupta81 We are having a new
implementation of aggregate functions in 1.5.0. The last function having
problem in your case is our old implementation. Let's fix it and I will
merge it to master, 1.5 branch and 1.4 branch.

I think only last has this problem. We do not need to change first.


Reply to this email directly or view it on GitHub
#7928 (comment).

_Gaurav Gupta_Engineering Manager @ Adobe
9971666780

_Gaurav Gupta_Engineering Manager @ Adobe
9971666780

@ggupta81
Copy link
Author

Fixed via #8113

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants