Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12946: [C++] String swap case kernel #10855

Closed
wants to merge 9 commits into from
Closed

ARROW-12946: [C++] String swap case kernel #10855

wants to merge 9 commits into from

Conversation

Christian8491
Copy link
Contributor

This PR adds swapcase compute kernel for string. It is similar to Python str.swapcase()

@github-actions
Copy link

github-actions bot commented Aug 2, 2021

@Christian8491 Christian8491 marked this pull request as ready for review August 2, 2021 21:19
@edponce
Copy link
Contributor

edponce commented Aug 2, 2021

LGTM. Need to add implementation for UTF8.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! Just a couple comments.

@Christian8491
Copy link
Contributor Author

@lidavidm please enable CI

@lidavidm
Copy link
Member

lidavidm commented Aug 3, 2021

For CI, it appears that Ubuntu 20.04 provides utf8proc 2.5, but the isupper/islower functions are not provided until 2.6: https://juliastrings.github.io/utf8proc/releases/

Meanwhile, RTools35 is using utf8proc 2.4; it builds utf8proc from source due to some other issue, but I believe it's using the system headers anyways looking at the compiler command since the system headers come before the self-built utf8proc ones.

I think bumping the minimum utf8proc version and fiddling with the build flags so that the utf8proc headers take precedence will be needed.

@edponce
Copy link
Contributor

edponce commented Aug 3, 2021

An alternative solution is to use helper functions already available, refer to https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L1391-L1408

Related notes in Arrow w.r.t. to lower/upper casing of UTF8 can be found in:

@nealrichardson
Copy link
Member

Meanwhile, RTools35 is using utf8proc 2.4; it builds utf8proc from source due to some other issue, but I believe it's using the system headers anyways looking at the compiler command since the system headers come before the self-built utf8proc ones.

Yeah the logs say

-- Could NOT find utf8proc (missing: utf8proc_LIB) (found suitable version "2.4.0", minimum required is "2.2.0")

which doesn't make sense.

@Christian8491
Copy link
Contributor Author

As @edponce suggested, I replaced some utf8proc functions with helper functions. After all the CI passes. This PR is ready for review.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you @Christian8491

@pitrou pitrou closed this in e4ba2f2 Aug 4, 2021
@edponce
Copy link
Contributor

edponce commented Aug 5, 2021

Some tests invoke the incorrect kernel (ASCII test uses utf8_capitalize and/or viceversa). These "typos" are fixed in #10869

michalursa pushed a commit to michalursa/arrow that referenced this pull request Aug 17, 2021
This PR adds `swapcase` compute kernel for string.  It is similar to  `Python str.swapcase()`

Closes apache#10855 from Christian8491/ARROW-12946-String-swap-case-kernel

Authored-by: christian <ccce91@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants