Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] add support for non-ASCII feature names (fixes #2983) #3647

Merged
merged 6 commits into from
Jan 3, 2021

Conversation

jameslamb
Copy link
Collaborator

I started working on adding support for non-ASCII features in the R package tonight, and I think this might have already been fixed! I'm not sure how, but my bet is @AlbertoEAF 's work in #3405 , since that touched model reading / writing code.

I tested with R Hub and this is even now working on Solaris!

Notes for Reviewers

The change in this PR is just to stop skipping a test that is no longer broken. However, I gave this the label feature because I think it makes sense for "you can now use non-ASCII features names" to show up in the "New Features" part of the changelog for the next release, even if this PR really didn't do the work to add them and @AlbertoEAF actually did all the hard work 😀

@jameslamb
Copy link
Collaborator Author

ok it looks like this is still failing on Windows 😕

I'll change this to work in progress and work on it tonight.

@jameslamb jameslamb changed the title [R-package] add support for non-ASCII feature names (fixes #2983) WIP: [R-package] add support for non-ASCII feature names (fixes #2983) Dec 15, 2020
@AlbertoEAF
Copy link
Contributor

Hi James :) I deserve no credits here, this came in the v3.0 release, I think it was this PR from @henry0312 https://github.com/microsoft/LightGBM/pull/2976/files.

@jameslamb jameslamb changed the title WIP: [R-package] add support for non-ASCII feature names (fixes #2983) [R-package] add support for non-ASCII feature names (fixes #2983) Dec 30, 2020
@jameslamb
Copy link
Collaborator Author

Ok I think this is ready for review!

I learned a lot from "UTF-8 Support on Windows" and "Windows/UTF-8 Build of R and CRAN Packages" by @kalibera.

"R internally allows strings to be represented in the current native encoding, in UTF-8 and in Latin 1. When interacting with the operating system or external libraries, all these representations have to be converted to native encoding. On Linux and macOS today this is not a problem, because the native encoding is UTF-8, so all Unicode characters are supported. On Windows, the native encoding cannot be UTF-8 nor any other that could represent all Unicode characters."

I think that on Windows, we cannot assume that non-ASCII feature names will survive the round trip between C++ and R ( specifically with Booster$dump_model(), which relies on json11 code that calls encode_utf8()).

Before this PR, the test on non-ASCII feature names was skipped completely on all operating systems.

Now, that test runs on all operating systems, but on Windows we expect the feature names in Booster$dump_model() to be changed.

I re-submitted on R Hub and can confirm these tests pass on 32-bit Windows and Solaris.

# UTF-8 strings are not well-supported on Windows
# * https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/
# * https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
if (!ON_WINDOWS) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/gha run r-valgrind

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valgrind is happy

==1913== LEAK SUMMARY:
==1913==    definitely lost: 0 bytes in 0 blocks
==1913==    indirectly lost: 0 bytes in 0 blocks
==1913==      possibly lost: 336 bytes in 1 blocks
==1913==    still reachable: 309,428,667 bytes in 55,367 blocks
==1913==                       of which reachable via heuristic:
==1913==                         newarray           : 4,264 bytes in 1 blocks
==1913==         suppressed: 0 bytes in 0 blocks
==1913== Reachable blocks (those to which a pointer was found) are not shown.
==1913== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==1913== 
==1913== For lists of detected and suppressed errors, rerun with: -s
==1913== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)

@jameslamb jameslamb merged commit aae4fe4 into microsoft:master Jan 3, 2021
@jameslamb jameslamb deleted the fix/non-ascii-feature-names branch January 3, 2021 19:17
@StrikerRUS StrikerRUS mentioned this pull request Mar 24, 2021
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants