Support Continue on Error in CSV Reader #4809

Open
GCHQDeveloper61637 opened this issue Sep 11, 2023 · 4 comments
Labels
enhancement: Any new improvement worthy of an entry in the changelog

Comments

@GCHQDeveloper61637

Describe the bug
Attempting to read malformed CSV records where the number of columns doesn't match the schema leads to an out-of-memory error.

To Reproduce
See minimal example Rust code below.

arrow_bug.txt

Expected behavior
The CSV reader should resume reading from the first newline/delimiter after the malformed row is encountered.

@tustvold changed the title from "OOM Error in Arrow CSV reader" to "Support Continue on Error in CSV Reader" on Sep 11, 2023
@tustvold added the enhancement label (Any new improvement worthy of an entry in the changelog) and removed the bug label on Sep 11, 2023
@tustvold
Contributor

The CSV reader is not currently designed to be used after an error, which should definitely be more clearly documented. This was brought up a while back, but I can't now find the issue to link to 😅

The major challenge is that CSV supports unquoted newlines, so it is unclear how to proceed on error.

@GCHQDeveloper61637
Author

GCHQDeveloper61637 commented Nov 4, 2023

Suggestion 1: if the reader encounters an unrecoverable error, it should be marked internally as being in an unrecoverable state and any attempts to continue reading should panic with an explicit message. Diagnosis of the panic would be easier than trying to work out where a silent out-of-memory error originates.
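
For illustration, the "mark as unrecoverable and panic" idea in suggestion 1 could look roughly like the wrapper below. This is only a sketch under assumptions: `PoisonOnError` and `is_poisoned` are hypothetical names, not part of the arrow crate, and the wrapper is written against any iterator of `Result` items rather than the real CSV reader type.

```rust
/// Hypothetical wrapper (not part of the arrow crate): remembers that the
/// wrapped reader has returned an error and panics if it is polled again.
pub struct PoisonOnError<I> {
    inner: I,
    poisoned: bool,
}

impl<I> PoisonOnError<I> {
    pub fn new(inner: I) -> Self {
        Self { inner, poisoned: false }
    }

    /// True once the wrapped reader has yielded an `Err`.
    pub fn is_poisoned(&self) -> bool {
        self.poisoned
    }
}

impl<I, T, E> Iterator for PoisonOnError<I>
where
    I: Iterator<Item = Result<T, E>>,
{
    type Item = Result<T, E>;

    fn next(&mut self) -> Option<Self::Item> {
        // Fail loudly instead of letting the caller keep polling a broken reader.
        assert!(
            !self.poisoned,
            "reader polled again after returning an error; it is in an unrecoverable state"
        );
        let item = self.inner.next();
        if matches!(item, Some(Err(_))) {
            self.poisoned = true;
        }
        item
    }
}
```

With something like this, the first `Err` is still returned to the caller as usual; only a later call to `next()` panics, which points directly at the misuse.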

Suggestion 2: on encountering a schema mismatch, the reader should seek forward to the next unquoted delimiter and attempt to continue reading from there. If the delimiter is a newline character, the unquoted newline is only likely to be a problem if it occurs in a record which is already a schema mismatch: the reader would attempt to read forward from a non-terminal newline, and the subsequent fields would also be mismatched to the schema, at which point it's probably OK to fall back on the behaviour in suggestion 1. Would that be workable?
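
A rough sketch of the resynchronisation policy in suggestion 2, using only the standard library. Everything here is an assumption made for illustration: `count_fields`, `LenientRead` and `read_lenient` are hypothetical names, the delimiter is fixed to `,`, records are taken to be newline-terminated, and "fall back" is modelled as returning an `Err` once two consecutive lines fail to match the expected field count. It is not how the arrow CSV reader works.

```rust
/// Count the fields in one line, treating commas inside double quotes as data.
fn count_fields(line: &str) -> usize {
    let mut in_quotes = false;
    let mut fields = 1;
    for c in line.chars() {
        match c {
            '"' => in_quotes = !in_quotes,
            ',' if !in_quotes => fields += 1,
            _ => {}
        }
    }
    fields
}

/// Accepted rows plus the 1-based line numbers of the rows that were skipped.
#[derive(Debug, PartialEq)]
pub struct LenientRead<'a> {
    pub rows: Vec<&'a str>,
    pub skipped: Vec<usize>,
}

/// Keep lines with `expected` fields, skip an isolated malformed line, and
/// give up when two malformed lines occur in a row.
pub fn read_lenient(input: &str, expected: usize) -> Result<LenientRead<'_>, String> {
    let mut out = LenientRead { rows: Vec::new(), skipped: Vec::new() };
    let mut previous_bad = false;
    for (idx, line) in input.lines().enumerate() {
        let line_no = idx + 1;
        if count_fields(line) == expected {
            out.rows.push(line);
            previous_bad = false;
        } else if previous_bad {
            // Resynchronising at the next newline did not help: stop here.
            return Err(format!("could not resynchronise at line {line_no}"));
        } else {
            out.skipped.push(line_no);
            previous_bad = true;
        }
    }
    Ok(out)
}
```

Because `str::lines` splits on every newline, a record containing a newline inside a quoted field is cut in two; in many cases both halves then fail the field-count check and the function returns an error, which is the fallback described above. That is not guaranteed for every input, which is the ambiguity this thread is about.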

@tustvold
Contributor

tustvold commented Nov 4, 2023

> silent out-of-memory error

I think I am missing something here: the reader returns an error and should then no longer be used. I don't see how this is silent?
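
As a sketch of the usage this implies on the caller's side, the helper below drains a reader and stops at the first error. `read_until_error` is a hypothetical name, and the function is generic over any iterator of `Result` items rather than tied to the arrow CSV reader.

```rust
/// Drain a reader that yields `Result` items, stopping at the first error.
/// `read_until_error` is a hypothetical helper, not an arrow API.
pub fn read_until_error<I, T, E>(reader: I) -> Result<Vec<T>, E>
where
    I: Iterator<Item = Result<T, E>>,
{
    let mut batches = Vec::new();
    for batch in reader {
        // `?` returns on the first `Err`, so the reader is never polled again.
        batches.push(batch?);
    }
    Ok(batches)
}
```

By contrast, a loop that discards errors and keeps iterating (for example via `filter_map(Result::ok)`) continues to poll the reader after it has failed.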

@GCHQDeveloper61637
Author

Well, it appeared silent to me. :) I first encountered the OOM error after upgrading from an earlier version of the Arrow crate where the CSV reader would continue after encountering an error. I assumed the reader would continue to work in the same way after the upgrade, and found that my code would spin its wheels for a while and then exit with a "failed to allocate" memory error. The application gives no indication of where the error has arisen; hence I'd call that a silent error.

Yes, you could argue that I should have read the documentation more carefully. :) But a panic when I tried to use the failed reader would have made it explicit that I was trying to do something unwise.
