[SPARK-9592] [SQL] Fixed First and Last aggregates to calculate on GroupedData partition instead of entire dataFrame #7928
…ion and not on entire DataFrame partition.
|
Can one of the admins verify this patch? |
|
Hi, @ggupta81 , please add [SPARK-9592] to the PR title |
|
cc @yhuai |
|
@ggupta81 Can you attach a test case that generates wrong result? |
|
@yhuai How do I attach a test case? I could not find the instructions on: Here is the command line script to test: import sqlContext.implicits._ The result is: whereas it should be: |
|
@ggupta81 oh, I meant adding a comment with your test case. Actually, I tried it with master. I got the correct result. |
|
I believe it has been fixed. Can you close this pr? |
|
Shouldn't we be fixing the same in 1.4?
|
@ggupta81 I see the problem. I think we only need to change this line to fix it. Basically, the input row can be a mutable row. So, we need to eagerly evaluate the expression instead of just pointing result to the input.
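The aliasing problem described in this comment can be reproduced outside Spark. The following is a minimal sketch, not the actual Spark code: `MutableRow`, `LastByReference`, and `LastByValue` are hypothetical stand-ins for a reused input row and for the two ways a `last` aggregate could store its result. It assumes the iterator hands every aggregate the same row object and overwrites it in place.

```scala
// Hypothetical stand-in for a row object that the iterator reuses and overwrites in place.
final class MutableRow(var key: String, var value: Int)

// Buggy variant: keeps a reference to the input row, so the result
// silently changes whenever the shared row is overwritten.
final class LastByReference {
  private var result: MutableRow = null
  def update(input: MutableRow): Unit = { result = input }
  def eval(): Int = result.value
}

// Fixed variant: evaluates the expression eagerly and stores the value itself.
final class LastByValue {
  private var result: Int = 0
  def update(input: MutableRow): Unit = { result = input.value }
  def eval(): Int = result
}

object MutableRowDemo {
  // Returns (results of the buggy variant, results of the fixed variant), keyed by group.
  def run(): (Map[String, Int], Map[String, Int]) = {
    val data = Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4))
    val shared = new MutableRow("", 0) // one row object reused for every record
    val byRef = scala.collection.mutable.LinkedHashMap.empty[String, LastByReference]
    val byVal = scala.collection.mutable.LinkedHashMap.empty[String, LastByValue]
    for ((k, v) <- data) {
      shared.key = k
      shared.value = v
      byRef.getOrElseUpdate(k, new LastByReference).update(shared)
      byVal.getOrElseUpdate(k, new LastByValue).update(shared)
    }
    val buggy = byRef.map { case (k, agg) => k -> agg.eval() }.toMap
    val fixed = byVal.map { case (k, agg) => k -> agg.eval() }.toMap
    (buggy, fixed)
  }

  def main(args: Array[String]): Unit = {
    val (buggy, fixed) = run()
    println(s"by reference: $buggy") // every group reports the last row seen overall
    println(s"by value:     $fixed") // each group reports its own last value
  }
}
```

In this sketch the by-reference variant reports 4 for both groups, because both aggregates point at the same row object, while the by-value variant reports 2 for `a` and 4 for `b`.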
|
@ggupta81 We have a new implementation of aggregate functions in 1.5.0. The last function that has the problem in your case is our old implementation. Since I will be making changes to the first and last functions in the 1.5 and master branches, I can fix it in those two branches. Can you submit a PR to fix the 1.4 branch? I think only |
|
Correct. I will make the changes you have suggested and update the pull request. |
Syncing from head
|
Should this be closed now, as we have another pull request, #8113? |
|
Fixed via #8113 |
JIRA: SPARK-9592
In the current implementation, the First and Last aggregates calculated their values over the entire DataFrame partition, and the same value was then returned for every GroupedData in that partition.
This change fixes the First and Last aggregates so that they compute the first and last value per GroupedData instead of over the entire DataFrame.
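The intended per-group semantics can be modelled with plain Scala collections. This is an illustrative sketch only: `FirstLastPerGroup` and `firstAndLast` are hypothetical names, not part of Spark, and the sketch assumes rows arrive in a fixed order within each group.

```scala
object FirstLastPerGroup {
  // Returns (first, last) for each key, following input order within the key.
  def firstAndLast[K, V](rows: Seq[(K, V)]): Map[K, (V, V)] =
    rows.groupBy(_._1).map { case (k, group) =>
      k -> (group.head._2, group.last._2)
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4))
    // Each key gets its own first and last value, not a value shared across the partition.
    println(firstAndLast(rows))
  }
}
```

For the sample rows this gives `(1, 2)` for `a` and `(3, 4)` for `b`, rather than one value repeated for every group.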