@lw-lin
Contributor

@lw-lin lw-lin commented Jan 17, 2016

A String is being converted to upper or lowercase using the platform's default locale. This may result in improper conversions when used with international characters.

For instance, "TITLE".toLowerCase() in a Turkish locale returns "tıtle", where 'ı' (without a dot) is the LATIN SMALL LETTER DOTLESS I character. To obtain correct results for locale-insensitive strings, we should use toLowerCase(Locale.ENGLISH).

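For more information on this, please see:
- http://stackoverflow.com/questions/11063102/using-locales-with-javas-tolowercase-and-touppercase
- http://lotusnotus.com/lotusnotus_en.nsf/dx/dotless-i-tolowercase-and-touppercase-functions-use-responsibly.htm
- http://java.sys-con.com/node/46241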

This PR changes our use of String.toUpperCase()/toLowerCase() to String.toUpperCase(Locale.ENGLISH)/toLowerCase(Locale.ENGLISH)
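
To make the Turkish case concrete, here is a minimal sketch of the behavior described above. The class name `TurkishCaseDemo` and the helper `lower` are hypothetical names used only for illustration; the Turkish locale is passed explicitly so the result does not depend on the machine's default locale.

```java
import java.util.Locale;

class TurkishCaseDemo {
    // Lowercases with an explicit locale, so the result does not depend on the JVM default.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // Locale.forLanguageTag("tr") selects Turkish casing rules.
        Locale turkish = Locale.forLanguageTag("tr");

        // Under Turkish rules, 'I' lowercases to the dotless 'ı' (U+0131).
        System.out.println(lower("TITLE", turkish));        // prints "tıtle"

        // With Locale.ENGLISH, 'I' lowercases to the ASCII 'i'.
        System.out.println(lower("TITLE", Locale.ENGLISH)); // prints "title"
    }
}
```

Calling `toLowerCase()` with no argument behaves like the first call whenever the JVM's default locale is Turkish, which is why the PR passes a fixed locale everywhere.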

@lw-lin
Contributor Author

lw-lin commented Jan 17, 2016

@rdblue @liancheng Would you please take a look at this? Cheers.

@liancheng
Contributor

LGTM although I'm not quite sure whether it's absolutely necessary...

@lw-lin
Contributor Author

lw-lin commented Jan 18, 2016

@liancheng
Some of the changes are not strictly necessary, but I think we should at least apply the following three:

  • PrimitiveTypeName.valueOf(t.toUpperCase(Locale.ENGLISH))
  • Repetition.valueOf(t.toUpperCase(Locale.ENGLISH))
  • CompressionCodecName.toUpperCase(Locale.ENGLISH))

For instance, the following snippet reproduces the exception that is thrown when the default locale is tr (the Turkish locale is passed explicitly here to simulate that default):

String upper = "int32".toUpperCase(new Locale("tr"));
PrimitiveType.PrimitiveTypeName.valueOf(upper);

Exception is:

Exception in thread "main" java.lang.IllegalArgumentException: No enum constant org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.İNT32
at java.lang.Enum.valueOf(Enum.java:236)
at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName.valueOf(PrimitiveType.java:65)
at ...
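
The failure and the fix can also be sketched without a Parquet dependency. In the example below, the enum `TypeName` is a hypothetical stand-in for `PrimitiveTypeName`, and `parse` and `failsUnderTurkish` are illustrative helpers, not part of the Parquet API. `Locale.forLanguageTag("tr")` is used here and selects the same Turkish casing rules as `new Locale("tr")` above.

```java
import java.util.Locale;

class EnumLookupDemo {
    // Hypothetical stand-in for an enum such as PrimitiveTypeName.
    enum TypeName { INT32, INT64, BOOLEAN }

    // Locale-independent lookup: uppercases with a fixed locale before valueOf.
    static TypeName parse(String name) {
        return TypeName.valueOf(name.toUpperCase(Locale.ENGLISH));
    }

    // Returns true if the lookup fails when Turkish casing rules are applied.
    static boolean failsUnderTurkish(String name) {
        try {
            TypeName.valueOf(name.toUpperCase(Locale.forLanguageTag("tr")));
            return false;
        } catch (IllegalArgumentException e) {
            // 'i' was uppercased to 'İ' (U+0130), so no enum constant matches.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("int32"));               // prints "INT32"
        System.out.println(failsUnderTurkish("int32"));   // prints "true"
        System.out.println(failsUnderTurkish("boolean")); // prints "false" (no 'i' in the name)
    }
}
```

Only names containing the letter 'i' are affected, which is why the problem can go unnoticed until a type such as int32 is parsed on a machine with a Turkish default locale.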

Thanks.

@liancheng
Contributor

Agree, +1.

@rdblue Would you mind also taking a look?

@julienledem
Member

+1

@asfgit asfgit closed this in 6c9ca4d Feb 16, 2016
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016
…pperCase()/toLowerCase

A String is being converted to upper or lowercase using the platform's default locale. This may result in improper conversions when used with international characters.

For instance, "TITLE".toLowerCase() in a Turkish locale returns "tıtle", where 'ı' (without a dot) is the LATIN SMALL LETTER DOTLESS I character. To obtain correct results for locale-insensitive strings, we should use toLowerCase(Locale.ENGLISH).

For more information on this, please see:
- http://stackoverflow.com/questions/11063102/using-locales-with-javas-tolowercase-and-touppercase
- http://lotusnotus.com/lotusnotus_en.nsf/dx/dotless-i-tolowercase-and-touppercase-functions-use-responsibly.htm
- http://java.sys-con.com/node/46241

This PR changes our use of String.toUpperCase()/toLowerCase() to String.toUpperCase(Locale.ENGLISH)/toLowerCase(Locale.ENGLISH)

Author: proflin <proflin.me@gmail.com>

Closes apache#312 from proflin/PARQUET-430 and squashes the following commits:

ed55822 [proflin] PARQUET-430
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jul 13, 2016
…pperCase()/toLowerCase
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
…pperCase()/toLowerCase
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017
…pperCase()/toLowerCase