WIP: Regexp Unicode support #5662

Closed

andygrove
Contributor

@andygrove andygrove commented May 26, 2022

Part of #5549

Changes in this PR:

  • New regexp fuzz tests that generate Unicode input data
  • Regular expressions will fall back to the CPU if LANG is not en_US.UTF-8
  • An exception will occur if any executor has a different LANG setting from the driver

Driver message when LANG is not en_US.UTF-8

      !Expression <RegExpReplace> regexp_replace(name#4, [ae], _, 1) cannot run on GPU because regular expression support is disabled because environment variable 'LANG' is not set to 'en_US.UTF-8' and this would result in incorrect handling of Unicode data
        @Expression <AttributeReference> name#4 could run on GPU

Driver & Executor LANG mismatch

java.lang.RuntimeException: Driver and executor LANG mismatch. Driver LANG is 'en_US.UTF-8' and executor LANG is 'cockney'.
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:218)
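
The two checks above could be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: the `LangCheck` object and its method names are hypothetical, and the real plugin wires these checks into its own tagging and plugin-init logic.

```scala
// Hypothetical sketch; names are illustrative and not the plugin's actual API.
object LangCheck {
  // The only LANG value for which GPU regexp is assumed to handle Unicode correctly.
  val RequiredLang = "en_US.UTF-8"

  /** Returns Some(reason) when regular expressions should fall back to the CPU. */
  def gpuRegexpFallbackReason(lang: Option[String]): Option[String] =
    lang match {
      case Some(RequiredLang) => None
      case other =>
        Some(s"environment variable 'LANG' is not set to '$RequiredLang' " +
          s"(found '${other.getOrElse("")}')")
    }

  /** Throws if the executor's LANG differs from the driver's. */
  def checkExecutorLang(driverLang: String, executorLang: String): Unit =
    if (driverLang != executorLang) {
      throw new RuntimeException(
        s"Driver and executor LANG mismatch. Driver LANG is '$driverLang' " +
          s"and executor LANG is '$executorLang'.")
    }

  def main(args: Array[String]): Unit = {
    // Read LANG from the current process environment.
    val lang = sys.env.get("LANG")
    gpuRegexpFallbackReason(lang) match {
      case Some(reason) => println(s"Falling back to CPU: $reason")
      case None => println("LANG is OK for GPU regexp")
    }
  }
}
```

In this sketch the driver-side check only reports a reason to fall back, while the executor-side check fails fast, mirroring the two behaviors shown above.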

@sameerz sameerz added the feature request New feature or request label May 27, 2022
@sameerz sameerz added this to the May 23 - Jun 3 milestone May 27, 2022
@andygrove andygrove changed the title from "WIP: Regexp unicode support" to "Regexp Unicode support" May 27, 2022
@andygrove andygrove marked this pull request as ready for review May 27, 2022 14:25
@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@andygrove andygrove changed the title from "Regexp Unicode support" to "WIP: Regexp Unicode support" May 27, 2022
@andygrove andygrove marked this pull request as draft May 27, 2022 22:37
@andygrove andygrove changed the base branch from branch-22.06 to branch-22.08 June 3, 2022 21:02
@andygrove
Contributor Author

build

@andygrove andygrove self-assigned this Jun 3, 2022
@andygrove
Contributor Author

build

@sameerz sameerz removed this from the Jun 6 - Jun 17 milestone Jun 18, 2022
@andygrove
Contributor Author

Closed in favor of #5776

@andygrove andygrove closed this Jun 21, 2022
@andygrove andygrove deleted the regexp-unicode-support branch May 17, 2023 17:53
Labels
feature request New feature or request
2 participants