Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concerns with String.toLowerCase() in default Locale #9276

Open
dimas-b opened this issue Dec 11, 2023 · 2 comments
Open

Concerns with String.toLowerCase() in default Locale #9276

dimas-b opened this issue Dec 11, 2023 · 2 comments

Comments

@dimas-b
Copy link
Contributor

dimas-b commented Dec 11, 2023

Query engine

ALL

Question

This stemmed from a discussion thread on PR 8909:

I did a quick scan of the calls to String.toLowerCase() in Iceberg codebase, and I do a few places where it may be a concern:

  1. VectorizedSupport seems like it may be a problem, but I do not really know whether the lower case data is exposed and how.

  2. IcebergRecordObjectInspector converts field names to lower case, and that seems to be affected by the locale problem as various case names are accepted as input parameters to its methods.

  3. jQuery code uses a lot of toLowerCase(), but I do not really know how it is supposed to be used.

To the best of my knowledge this kind of case conversion can be problematic only in German and Turkish locales. The German locale affects only proper German language words (so it is less of a problem), but the Turkish locale can cause English words to be converted in unexpected ways.

For example, this assertion fails: assertThat("VIEW".toLowerCase(new Locale("TR"))).isEqualTo("view");

Does Iceberg support using its libraries in user-defined locales?

@dimas-b dimas-b changed the title Concerns with String.toLoweCase() in default Locale Concerns with String.toLowerCase() in default Locale Dec 11, 2023
@findepi
Copy link
Member

findepi commented Jun 18, 2024

Does Iceberg support using its libraries in user-defined locales?

Not for me to decide this, but i believe that we have basically these options

  1. make the code independent of JVM locale (Palantir's error-prone check DefaultLocale can be helpful, but seems insufficient)
  2. test the code on various locales including Turkish
  3. add a runtime check enforcing JVM locale

Of these I think (3) is least awesome, because Iceberg is a library and should support being embedded in various contesxts. I think we should do (1).

@findepi
Copy link
Member

findepi commented Jun 18, 2024

I did a quick scan of the calls to String.toLowerCase() in Iceberg codebase

The toLowerCase & toUpperCase calls are being fixed in #10521.
There may be other Locale-dependent APIs though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants