Reading CSV with custom nullString impossible #921

Closed
Jolanrensen opened this issue Oct 14, 2024 · 0 comments · Fixed by #903
Labels: bug (Something isn't working), csv (CSV / delim related issues)

Jolanrensen (Collaborator) commented Oct 14, 2024

Umbrella'd under #827

Reported on slack: https://kotlinlang.slack.com/archives/C4W52CFEZ/p1728885330465379

The CSV https://kotlinlang.slack.com/files/U16CM33AB/F07R98VJ7AT/msleep.csv contains several columns of Double values and "NA"s, representing null. This causes some curious cases:

| Expected | Actual |
| --- | --- |
| `DataFrame.readCSV()` should recognise that "NA" means null and parse the column as `Double?`. | The column `brainwt` is parsed as `BigDecimal` because it doesn't recognize "3e-04" as `Double` and doesn't handle "NA" well. |
| `DataFrame.readCSV("NA" in nullStrings)` should help with recognizing "NA" as null. | Recognizes "NA" as null, but the result is still `BigDecimal?`. |
| `"NA" in nullStrings` and `colTypes = "brainwt" to ColType.Double` should work for sure. | "java.lang.IllegalStateException: Couldn't parse 'NA' into type kotlin.Double". Apparently giving a `colType` grabs the `Double` parser directly and does not take `nullStrings` into account. Plus, if the result is null, it's assumed the parsing failed. We need to give `ColType.String` and call `parse` or `convert` afterwards manually. |
| `parse()` and `convert().toDouble()` should behave the same. | `parse()` uses `NumberFormat` with a locale and doesn't recognize "3e-04". `convert` uses `Double.parseDouble()` without a locale and can parse it. |
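The third case comes down to telling "this cell is a null string" apart from "this cell could not be parsed". A minimal sketch of that distinction, using only the Kotlin standard library, might look like the following. `CellResult`, `parseDoubleCell` and `parseDoubleColumn` are hypothetical names made up for illustration; this is not the DataFrame implementation.

```kotlin
// Hypothetical result type (not part of DataFrame): keeps "null cell" and
// "unparseable cell" apart, so a null result is never mistaken for a failure.
sealed interface CellResult {
    data class Value(val value: Double) : CellResult
    object Null : CellResult
    data class Failure(val raw: String) : CellResult
}

// Check nullStrings first, and only then hand the text to the Double parser.
fun parseDoubleCell(raw: String, nullStrings: Set<String>): CellResult =
    when {
        raw in nullStrings -> CellResult.Null
        else -> raw.toDoubleOrNull()?.let { CellResult.Value(it) } ?: CellResult.Failure(raw)
    }

// Parse a whole column into Double?; only genuine parse failures throw.
fun parseDoubleColumn(cells: List<String>, nullStrings: Set<String>): List<Double?> =
    cells.map { raw ->
        when (val result = parseDoubleCell(raw, nullStrings)) {
            is CellResult.Value -> result.value
            CellResult.Null -> null
            is CellResult.Failure ->
                throw IllegalStateException("Couldn't parse '${result.raw}' into type kotlin.Double")
        }
    }
```

With this shape, `parseDoubleColumn(listOf("3e-04", "NA", "1.5"), setOf("NA"))` yields a nullable Double column, while the same call with an empty `nullStrings` set throws on "NA".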

Most of the issues here are solved by the new CSV implementation under the umbrella issue: #827. The case for "3e-04" requires a different Double parser, which is solved by #935.
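The "3e-04" discrepancy can be reproduced with the JDK alone. The sketch below contrasts a strict, locale-aware `NumberFormat` parse with Kotlin's `toDoubleOrNull()`, which delegates to the JDK's locale-independent `Double` parsing on the JVM. `parseWithNumberFormat` and `parseWithJavaDouble` are hypothetical helpers, and the "whole string must be consumed" check is an assumption about how a strict parser would behave, not a copy of DataFrame's code.

```kotlin
import java.text.NumberFormat
import java.text.ParsePosition
import java.util.Locale

// Hypothetical helper: locale-aware parsing that only succeeds when the
// entire input string was consumed by NumberFormat.
fun parseWithNumberFormat(raw: String, locale: Locale): Double? {
    val format = NumberFormat.getInstance(locale)
    val position = ParsePosition(0)
    val number = format.parse(raw, position)
    return if (number != null && position.index == raw.length) number.toDouble() else null
}

// Hypothetical helper: locale-independent parsing via the standard library.
fun parseWithJavaDouble(raw: String): Double? = raw.toDoubleOrNull()
```

Here `parseWithNumberFormat("3e-04", Locale.US)` returns null because the default format stops reading at the lowercase "e", whereas `parseWithJavaDouble("3e-04")` returns a `Double`.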

Jolanrensen added the bug and csv labels Oct 14, 2024
Jolanrensen self-assigned this Oct 14, 2024
Jolanrensen mentioned this issue Oct 14, 2024 (28 tasks)
Jolanrensen mentioned this issue Nov 1, 2024 (2 tasks)
Jolanrensen added a commit that referenced this issue Nov 20, 2024
…pted both csv implementations to use convertTo. Addition of DataColumn<String>.convertTo overloads to allow for ParserOptions (for nullStrings etc.) Moved DataColumn<String>.convertToDouble to impl. Fixed nullstrings support for it, cleaned the parsers. Added tests for Issue #921