Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support concat with separator on GPU #2479

Merged
merged 31 commits into from
May 27, 2021

Conversation

tgravescs
Copy link
Collaborator

@tgravescs tgravescs commented May 21, 2021

This depends on CUDF PR: rapidsai/cudf#8289

fixes #63

Note the issue has a bunch of Spark behavior for concat_ws as well as some examples if you want to see them. (#63 (comment)).
You can pay attention to the null handling as that is what we found some incompatibilities with in CUDF.

At a high level we add a new GpuConcatWs expression that supports taking Strings or Arrays or Strings and returns a String column. If only a separator is specified we fall back to the CPU, otherwise as long as 1 other column with the separator we process it. The code handles 2 types of separators, scalar and the spark sql api allows a column of separators. We use the corresponding CUDF api based on the separator passed in. concat_ws also allows an Array of Strings, so the strings within a single Array[String] column have to be concatenated first using a different CUDF api and then those are combined with any other columns to get the final output.

I ran a test on larger customer dataset and it helped mostly due to not going off gpu and then coming back. 80million rows, concatenating 2 sting columns, gpuconcatws took around 200ms.

tgravescs and others added 25 commits May 19, 2021 08:29
Signed-off-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
@tgravescs tgravescs added the feature request New feature or request label May 21, 2021
@tgravescs tgravescs added this to the May 10 - May 21 milestone May 21, 2021
@tgravescs tgravescs self-assigned this May 21, 2021
@tgravescs
Copy link
Collaborator Author

build

@tgravescs
Copy link
Collaborator Author

CUDF changes all merged in now.

@tgravescs
Copy link
Collaborator Author

build

@jlowe jlowe changed the title [WIP] Support concat with separator on GPU Support concat with separator on GPU May 27, 2021
Signed-off-by: Thomas Graves <tgraves@apache.org>
@tgravescs
Copy link
Collaborator Author

build

@tgravescs
Copy link
Collaborator Author

oops, still need to remove one thing

@tgravescs
Copy link
Collaborator Author

ok, I think I addressed everyone's comments, if I missed something let me know.

revans2
revans2 previously approved these changes May 27, 2021
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two very small nits

jlowe
jlowe previously approved these changes May 27, 2021
@tgravescs tgravescs dismissed stale reviews from jlowe and revans2 via 30a2430 May 27, 2021 15:23
@tgravescs
Copy link
Collaborator Author

build

…nctions.scala

Co-authored-by: Andy Grove <andygrove73@gmail.com>
@tgravescs
Copy link
Collaborator Author

build

@tgravescs tgravescs merged commit 8bc4f79 into NVIDIA:branch-21.06 May 27, 2021
@tgravescs tgravescs deleted the concatws2106ScalarSep branch May 27, 2021 18:10
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* start concat_ws

Signed-off-by: Thomas Graves <tgraves@apache.org>

* more stuff

* call into stringConcateWs

Signed-off-by: Thomas Graves <tgraves@apache.org>

* Fixes and more functionality for concat_ws

* Test and debug 1 column

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Handle arrays and add more tests

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Fix type checks

* TEsts working and running on gpu

* Fix empty str scalar and debug

* debug

* Update to latest cudf changes, tests all passing

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Try using scalar separator but cudf doesn't handle null scalars the same
when its the separator

* more changes

* handle 1 column with scalar sep

* All working with no leaks

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* updates

* Fix line lengths

* reogranize

* refactor code to be more readable, all tests pass, no leaks

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* cleanup

* refactor the separator handling

* Update docs

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Add another test for all null col separator

* Change to use resource if allowed and clean up scalar separator

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Review comments

* rework to remove finally and review comments

Signed-off-by: Thomas Graves <tgraves@apache.org>

* replace nullable with _ and simplify case

* Update sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

Co-authored-by: Andy Grove <andygrove73@gmail.com>

Co-authored-by: Andy Grove <andygrove73@gmail.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* start concat_ws

Signed-off-by: Thomas Graves <tgraves@apache.org>

* more stuff

* call into stringConcateWs

Signed-off-by: Thomas Graves <tgraves@apache.org>

* Fixes and more functionality for concat_ws

* Test and debug 1 column

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Handle arrays and add more tests

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Fix type checks

* TEsts working and running on gpu

* Fix empty str scalar and debug

* debug

* Update to latest cudf changes, tests all passing

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Try using scalar separator but cudf doesn't handle null scalars the same
when its the separator

* more changes

* handle 1 column with scalar sep

* All working with no leaks

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* updates

* Fix line lengths

* reogranize

* refactor code to be more readable, all tests pass, no leaks

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* cleanup

* refactor the separator handling

* Update docs

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Add another test for all null col separator

* Change to use resource if allowed and clean up scalar separator

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* Review comments

* rework to remove finally and review comments

Signed-off-by: Thomas Graves <tgraves@apache.org>

* replace nullable with _ and simplify case

* Update sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala

Co-authored-by: Andy Grove <andygrove73@gmail.com>

Co-authored-by: Andy Grove <andygrove73@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] support ConcatWs sql function
5 participants