Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect parsing of CSV file. Locale is ignored for some columns #607

Closed
antonarhipov opened this issue Feb 28, 2024 · 4 comments · Fixed by #935
Closed

Incorrect parsing of CSV file. Locale is ignored for some columns #607

antonarhipov opened this issue Feb 28, 2024 · 4 comments · Fixed by #935
Assignees
Labels
bug Something isn't working csv CSV / delim related issues
Milestone

Comments

@antonarhipov
Copy link

I have encountered a funny bug when parsing a CSV file. The floating point numbers in that file are specified with a comma , instead of dot ., and this should be a subject of specifying the correct locale. However, it seems that Locale is respected for one column and ignored for the other. For instance, the comma is ignored for the number 68,83, and the result is 6883.

image

CSV file: https://github.com/antonarhipov/kotlin-sandbox/blob/master/energy-consumption.csv
Notebook: https://github.com/antonarhipov/kotlin-sandbox/blob/master/demo.ipynb

@koperagen
Copy link
Collaborator

koperagen commented Feb 28, 2024

Ok, this is interesting. Here's the culprit:

val nf = NumberFormat.getInstance(Locale("et", "EE"))
nf.parse("-0,08")

This code throws an exception. Because in Estonian locale a different minus sign is used.
So DataFrame tries to parse this column using locale specific rules and fails. Next step of parsing is locale agnostic and it creates wrong values.
We need to improve UX in this part. The fact that locale parameter defines not only , and . is surprising. Maybe it shouldn't.. Suggested workaround: use ParserOptions(Locale.FRANCE)

@koperagen koperagen added the invalid This issue/PR doesn't seem right label Feb 28, 2024
@koperagen koperagen added this to the Backlog milestone Feb 28, 2024
@koperagen
Copy link
Collaborator

Note for future: let's check how it's handled in R dplyr and maybe use it as a reference. I think they might be good at interpretation of all kinds of data

@zaleslaw zaleslaw added bug Something isn't working and removed invalid This issue/PR doesn't seem right labels Apr 8, 2024
@Jolanrensen Jolanrensen added the csv CSV / delim related issues label Aug 20, 2024
@Jolanrensen Jolanrensen self-assigned this Aug 20, 2024
@Jolanrensen Jolanrensen mentioned this issue Aug 20, 2024
27 tasks
@Jolanrensen
Copy link
Collaborator

Ok, this is interesting. Here's the culprit:

val nf = NumberFormat.getInstance(Locale("et", "EE"))
nf.parse("-0,08")

This code throws an exception. Because in Estonian locale a different minus sign is used. So DataFrame tries to parse this column using locale specific rules and fails. Next step of parsing is locale agnostic and it creates wrong values. We need to improve UX in this part. The fact that locale parameter defines not only , and . is surprising. Maybe it shouldn't.. Suggested workaround: use ParserOptions(Locale.FRANCE)

Actually it makes sense that the locale parameter also looks at minus signs etc. Just have a look at Java's DecimalFormatSymbols, it includes the zero digit, grouping separator, decimal separator, per mille sign, percent char, NaN, minus sign, monetary decimal separator, exponent separator, etc... The decimal character is just the tip of the iceberg really.
So, if you read a csv with an estonian locale it will do so :).
If you have mixed locales within one file, I'd suggest reading the erroneous columns as String and parsing them manually with the correct locale in the step afterwards.

@Jolanrensen
Copy link
Collaborator

fixed with ParserOptions(useFastDoubleParser = true). Eventually this new parser will become the default :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working csv CSV / delim related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants