
Sep does not gracefully handle improperly-escaped double quotes #150

Open
bryanboettcher opened this issue Jul 24, 2024 · 3 comments

@bryanboettcher

@nietras (sorry, I didn't have a more graceful way to tag you)

First off, kudos to Sep for the absolutely blistering speed you've achieved with it. I understand that in many cases this speed was gained by intentionally relaxed (or omitted) checks around the correctness of a CSV. At the same time, there are a few existing cases where Sep's correctness can be tuned (such as DisableColumnCountCheck).

A vendor has a process that occasionally generates a malformed CSV. We will get a row that is otherwise properly escaped, but it contains an improperly backslash-escaped double quote in a string field, like so: ,"XXX\" YYYY",. Whenever Sep hits this data, we soon after get the 16MB buffer overrun exception.

Our vendor is aware, but no ETA for fixing their process. Is there a mechanism now or in the future for Sep to either blindly remove this character, or more gracefully handle it? Using DisableQuotesParsing is insufficient in this case, because many of our fields are (properly) quoted and escaped.

I have a repro: https://github.com/bryanboettcher/SepCrash

@nietras
Owner

nietras commented Aug 4, 2024

@bryanboettcher thank you for trying out Sep and filing a detailed issue, including, not least, a full repro.

this speed was gained by intentional (or omitted) checks around the correctness of a CSV.

Actually, Sep rather assumes the CSV is valid and, as you observe, does not try to handle invalid CSV, such as incorrect quoting or custom escaping of quotes, in any way. Handling \" in some way would require custom processing that would either require more code or impact performance, so I would not be inclined to support such a thing. I am not even sure how other CSV libraries like CsvHelper or Sylvan would handle this. Have you tried and how do they handle it?

@bryanboettcher
Author

bryanboettcher commented Aug 4, 2024

@bryanboettcher thank you for trying out Sep and filing a detailed issue, including, not least, a full repro.

The absolute bonkers speed and anemic memory usage of Sep is 100% worth it to me to work around whatever you don't choose to natively support. It's amazing work.

Actually, Sep rather assumes the CSV is valid

I'm not not saying I hoped the library maintainer's statements could help me push back against the vendor's formatting ...

I am not even sure how other CSV libraries like CsvHelper or Sylvan would handle this. Have you tried and how do they handle it?

The previous implementation of this importer used CsvHelper, and ran at about 10% of the speed. When CsvHelper encountered this malformed line, it invoked an InvalidCsvData (or similarly named) action configurable in the initial options. However, I am too monke-brained to understand what CsvHelper is doing, if not something like a "last known good data" pointer versus a "suspected next line" pointer. Once the "next line" pointer has errored in some manner, anything between the "last known good data" pointer and the most recent non-triple-quoted newline would be considered an invalid line and skipped.
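For reference, the configurable bad-data action mentioned above is, in recent CsvHelper versions, the BadDataFound callback on CsvConfiguration. The exact name and delegate signature have varied between CsvHelper versions, so treat this as a sketch rather than a definitive usage:

```csharp
using System;
using System.Globalization;
using System.IO;
using CsvHelper;
using CsvHelper.Configuration;

var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    // Invoked when a field contains a quote in an unexpected position,
    // such as the stray \" described in this issue. Logging here
    // (instead of throwing) lets the reader move past the malformed row.
    BadDataFound = args => Console.Error.WriteLine($"Bad data: {args.RawRecord}"),
};

using var reader = new StreamReader("vendor.csv"); // illustrative file name
using var csv = new CsvReader(reader, config);
while (csv.Read())
{
    // process fields via csv.GetField(...) as usual
}
```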

In the meantime, this was worked around via an intermediate Stream that would blindly replace any instance of \" with "" while reading the input file. StackOverflow question and answer: https://stackoverflow.com/questions/78790505/whats-a-performant-way-to-remove-data-from-a-stream-or-textwriter
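A minimal sketch of that workaround is below. For brevity it buffers the whole file; the StackOverflow answer linked above does the replacement in a streaming fashion to avoid holding everything in memory. The names here are illustrative and not part of Sep's API:

```csharp
using System.IO;

static class VendorCsvFixup
{
    // Replaces the vendor's invalid backslash-escape (\") with the
    // CSV-standard doubled quote ("") before the text reaches Sep.
    // A wrapping Stream that rewrites bytes as they are read avoids
    // buffering the full file, at the cost of more code.
    public static TextReader Sanitize(Stream input)
    {
        using var reader = new StreamReader(input);
        string fixedText = reader.ReadToEnd().Replace("\\\"", "\"\"");
        return new StringReader(fixedText);
    }
}
```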

@nietras
Owner

nietras commented Sep 4, 2024

@bryanboettcher sorry I did not reply to this sooner. I think your workaround is fine and should not be that much less performant. I guess some form of configurable callback action (delegate), invoked when the internal buffer grows, with the option to manipulate the internal data buffer, would be a possibility. I probably don't like it, since it adds a rather complicated option to the usage... but it should be possible with minimal perf impact.

I'm not not saying I hoped the library maintainer's statements could help me push back against the vendor's formatting ...

Haha, I hope the vendor folded and fixed their formatting then 😉
