Support Continue on Error in CSV Reader #4809
Comments
The CSV reader is not currently designed to be used after an error, which should definitely be documented more clearly. This was brought up a while back, but I can't now find the issue to link to 😅 The major challenge is that CSV supports unquoted newlines, so it is unclear how to proceed on error.
Suggestion 1: if the reader encounters an unrecoverable error, it should be marked internally as being in an unrecoverable state, and any attempt to continue reading should panic with an explicit message. Diagnosing the panic would be easier than trying to work out where a silent out-of-memory error originates.

Suggestion 2: on encountering a schema mismatch, the reader should seek forward to the next unquoted delimiter and attempt to continue reading from there. If the delimiter is a newline character, the unquoted newline is only likely to be a problem if it occurs in a record which is already a schema mismatch; the reader would attempt to read forward from a non-terminal newline and the subsequent fields would also be mismatched to the schema, at which point it's probably OK to fall back on the behaviour in suggestion 1.

Would that be workable?
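To make suggestion 1 concrete, here is a minimal sketch of a wrapper that remembers the first error and panics with an explicit message if it is polled again. `FusedOnError` and `has_failed` are hypothetical names for illustration; this is not the arrow crate's API, and it works over any iterator of `Result` values rather than the real CSV reader.

```rust
/// Wraps any fallible iterator and refuses to be polled after the first error.
/// `FusedOnError` is a hypothetical name, not part of the arrow crate.
pub struct FusedOnError<I> {
    inner: I,
    failed: bool,
}

impl<I> FusedOnError<I> {
    pub fn new(inner: I) -> Self {
        Self { inner, failed: false }
    }

    /// True once the wrapped iterator has yielded an error.
    pub fn has_failed(&self) -> bool {
        self.failed
    }
}

impl<I, T, E> Iterator for FusedOnError<I>
where
    I: Iterator<Item = Result<T, E>>,
{
    type Item = Result<T, E>;

    fn next(&mut self) -> Option<Self::Item> {
        // Fail loudly instead of reading on from an undefined internal state.
        assert!(!self.failed, "reader used after an unrecoverable error");
        let item = self.inner.next();
        if matches!(item, Some(Err(_))) {
            // Remember the failure; the error itself is still returned to the caller.
            self.failed = true;
        }
        item
    }
}
```

The first error is still handed back to the caller as a normal `Err`; only a later call to `next` panics, so code that stops at the first error is unaffected.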
I think I am missing something here: the reader returns an error and should then no longer be used. I don't see how this is silent?
Well, it appeared silent to me. :) I first encountered the OOM error after upgrading from an earlier version of the Arrow crate where the CSV reader would continue after encountering an error. I assumed the reader would continue to work in the same way after the upgrade, and found that my code would spin its wheels for a while and then exit with a "failed to allocate" memory error. The application gives no indication of where the error has arisen; hence I'd call that a silent error. Yes, you could argue that I should have read the documentation more carefully. :) But a panic when I tried to use the failed reader would have made it explicit that I was trying to do something unwise.
Describe the bug
Attempting to read malformed CSV records where the number of columns doesn't match the schema leads to an out-of-memory error.
To Reproduce
See minimal example Rust code below.
arrow_bug.txt
Expected behavior
The CSV reader should resume reading from the first newline/delimiter after the malformed row.
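As a rough illustration of that expected behaviour, the sketch below drops rows whose field count does not match the schema and carries on from the next newline. It is plain standard-library Rust, not the arrow CSV reader, and `read_skipping_malformed` is a hypothetical helper. It assumes there is no quoting at all, so every newline ends a record and every comma separates fields; that is exactly the assumption that breaks down in real CSV and makes recovery hard.

```rust
/// Splits `input` into rows and keeps only those with `expected_columns` fields.
/// Returns the kept rows plus the 1-based line numbers of the skipped rows.
/// Assumption: no quoting, so every '\n' ends a record and every ',' separates fields.
pub fn read_skipping_malformed(
    input: &str,
    expected_columns: usize,
) -> (Vec<Vec<String>>, Vec<usize>) {
    let mut rows = Vec::new();
    let mut skipped = Vec::new();

    for (index, line) in input.lines().enumerate() {
        if line.is_empty() {
            // Blank lines are ignored rather than counted as malformed.
            continue;
        }
        let fields: Vec<String> = line.split(',').map(str::to_string).collect();
        if fields.len() == expected_columns {
            rows.push(fields);
        } else {
            // Column count does not match the schema: record the line and move on.
            skipped.push(index + 1);
        }
    }

    (rows, skipped)
}
```

Returning the skipped line numbers alongside the data keeps the recovery visible to the caller, so malformed input is not dropped silently.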