support GpuConcat on ArrayType #2379

Merged
merged 12 commits into from
May 20, 2021

Conversation

Collaborator

@sperlingxx sperlingxx commented May 10, 2021

This PR adds support for GpuConcat on ArrayType, as requested in issue #2013.
It also refines the implementation of string concatenation.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@jlowe
Member

jlowe commented May 10, 2021

build

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx
Collaborator Author

While running tests on arrays of strings locally, I found another small problem in the cuDF implementation, and I've put up a PR to fix it.

jlowe
jlowe previously approved these changes May 11, 2021
@jlowe
Member

jlowe commented May 11, 2021

build

@sameerz sameerz linked an issue May 11, 2021 that may be closed by this pull request
@sameerz sameerz added the feature request New feature or request label May 12, 2021
@sameerz sameerz added this to the May 10 - May 21 milestone May 12, 2021
@sperlingxx
Collaborator Author

build

gerashegalov
gerashegalov previously approved these changes May 13, 2021
Collaborator

@gerashegalov gerashegalov left a comment

LGTM

private def stringConcat(batch: ColumnarBatch): GpuColumnVector = {
  val rows = batch.numRows()

  withResource(ArrayBuffer[ColumnVector]()) { buffer =>
Collaborator

nit (here and later): use ArrayBuffer.empty[ColumnVector] for readability.

Collaborator Author

Done.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx sperlingxx dismissed stale reviews from gerashegalov and jlowe via 52bfccd May 13, 2021 03:14
@sperlingxx
Collaborator Author

build

sperlingxx and others added 2 commits May 13, 2021 22:23
…erations.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…erations.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@sperlingxx
Collaborator Author

build

…erations.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
jlowe
jlowe previously approved these changes May 13, 2021
@jlowe
Member

jlowe commented May 13, 2021

build

Collaborator

@revans2 revans2 left a comment

Mostly nits, but there are some corner cases that string concat already missed and that this change misses as well.

I tested the string concat code and it works if asserts are disabled; with asserts enabled it fails. I don't know about list concat. There is probably also an issue when no columns are passed in. I am fine with letting this slide for now, because you have to go out of your way to disable expression folding for it to be an issue, but it might be good to simply fall back to the CPU if there are no children. That way we never have to worry about it.

@sperlingxx
Collaborator Author

build

@sperlingxx
Collaborator Author

Still would like to see us prevent against the empty concat when folding is disabled, but it is such a rare corner case it is not important enough to hold this up over it.

I added the missing support for empty concat. I also found that the cuDF concat call can simply be bypassed when concatenating a single column.
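A minimal sketch of that dispatch, written in plain Python rather than the plugin's Scala and cuDF calls. Columns are modeled as lists with one value per row, `None` stands for a SQL NULL, and `concat_columns` is a hypothetical name. Returning an empty string per row for the zero-input case is an assumption based on Spark's CPU behaviour; this thread does not state it.

```python
from typing import List, Optional, Sequence

Row = Optional[str]   # None models a SQL NULL
Column = List[Row]    # one value per row


def concat_columns(columns: Sequence[Column], num_rows: int) -> Column:
    """Row-wise string concat with shortcuts for zero and one input."""
    if len(columns) == 0:
        # Empty concat: assumed to yield an empty string for every row.
        return [""] * num_rows
    if len(columns) == 1:
        # Single input: nothing to join, so the column is returned unchanged
        # (copied here) without going through the general path.
        return list(columns[0])
    # General case: a stand-in for the real cuDF concatenate call.
    result: Column = []
    for values in zip(*columns):
        if any(v is None for v in values):
            result.append(None)          # any NULL input makes the row NULL
        else:
            result.append("".join(values))
    return result
```

Handling the zero and one input cases before the general path means the general path can assume at least two inputs.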

@sperlingxx sperlingxx requested a review from revans2 May 18, 2021 01:03
jlowe
jlowe previously approved these changes May 18, 2021
@pxLi pxLi changed the base branch from branch-0.6 to branch-21.06 May 19, 2021 01:11
@pxLi pxLi dismissed jlowe’s stale review May 19, 2021 01:11

The base branch was changed.

@sperlingxx
Collaborator Author

build

@sperlingxx sperlingxx requested a review from jlowe May 19, 2021 03:13
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx
Collaborator Author

build

@sperlingxx sperlingxx requested a review from revans2 May 20, 2021 09:57
Collaborator

@revans2 revans2 left a comment

Looks good

f.concat(s1, f.col('b')),
f.concat(f.col('a'), s2),
f.concat(f.lit(None).cast('string'), f.col('b')),
f.concat(f.col('a'), f.lit(None).cast('string')),
Collaborator

Do we have an all-nulls test case? I guess this line might hit it if column a can generate nulls.

Collaborator

Similarly for arrays: do we want an all-nulls test case?

Collaborator Author

@sperlingxx sperlingxx May 20, 2021

To be frank, I am not sure whether the test data for concat_array contains any rows made up entirely of null fields under the default random seed.

Collaborator Author

But I think the all-null case may not be that special, since the output row is already null as soon as any one input field is null.
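To illustrate why the all-null row takes the same path as a row with a single null, here is a small plain-Python model of row-wise array concat. It is a sketch of the null rule only, not the plugin or cuDF code; `concat_array_row` is a hypothetical name, `None` stands for a SQL NULL, and keeping null elements inside a non-null array is assumed to match Spark's behaviour.

```python
from typing import Any, List, Optional


def concat_array_row(*fields: Optional[List[Any]]) -> Optional[List[Any]]:
    """Concatenate the array fields of one row; NULL if any field is NULL."""
    if any(field is None for field in fields):
        # One null field and all-null fields hit the same branch.
        return None
    out: List[Any] = []
    for field in fields:
        out.extend(field)  # null elements inside an array are kept as-is
    return out


one_null = concat_array_row([1, 2], None)
all_null = concat_array_row(None, None)
no_null = concat_array_row([1, None], [2])
```

Here `one_null` and `all_null` are both `None`, while `no_null` keeps its inner null element.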

@sperlingxx sperlingxx merged commit c694fdc into NVIDIA:branch-21.06 May 20, 2021
abellina pushed a commit to abellina/spark-rapids that referenced this pull request May 24, 2021
Closes NVIDIA#2013

Support GpuConcat on ArrayType. And introduce some refinement for GpuConcat on StringType.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Closes NVIDIA#2013

Support GpuConcat on ArrayType. And introduce some refinement for GpuConcat on StringType.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx sperlingxx deleted the concat_array branch December 2, 2021 02:41
Development

Successfully merging this pull request may close these issues.

[FEA] Support concatenating ArrayType columns
7 participants