
Sep does not gracefully handle improperly-escaped double quotes #150

Open
bryanboettcher opened this issue Jul 24, 2024 · 3 comments

@bryanboettcher

@nietras (sorry, I didn't have a more graceful way to tag you)

First off, kudos to Sep for the absolutely blistering speed you've achieved with it. I understand that in many cases this speed was gained by intentionally relaxed (or omitted) checks around the correctness of a CSV. At the same time, there are a few existing cases where Sep's correctness can be tuned (such as DisableColumnCountCheck).

A vendor has a process that occasionally generates a malformed CSV. We will get a row that is otherwise properly escaped, but it contains an improperly backslash-escaped double quote in a string field, like so: ,"XXX\" YYYY",. Whenever Sep hits this data, we soon after get the 16MB buffer overrun exception.

Our vendor is aware, but no ETA for fixing their process. Is there a mechanism now or in the future for Sep to either blindly remove this character, or more gracefully handle it? Using DisableQuotesParsing is insufficient in this case, because many of our fields are (properly) quoted and escaped.

I have a repro: https://github.com/bryanboettcher/SepCrash

@nietras
Owner

nietras commented Aug 4, 2024

@bryanboettcher thank you for trying out Sep and filing a detailed issue, including, not least, a full repro.

this speed was gained by intentional (or omitted) checks around the correctness of a CSV.

Actually, Sep rather assumes the CSV is valid and, as you observe, does not try to handle invalid CSV, such as incorrect quoting or custom escaping of quotes, in any way. Handling \" in some way would require custom processing that would either require more code or impact performance, so I would not be inclined to support such a thing. I am not even sure how other CSV libraries like CsvHelper or Sylvan would handle this. Have you tried and how do they handle it?

@bryanboettcher
Author

bryanboettcher commented Aug 4, 2024

@bryanboettcher thank you for trying out Sep and filing a detailed issue, including, not least, a full repro.

The absolute bonkers speed and anemic memory usage of Sep is 100% worth it to me to work around whatever you don't choose to natively support. It's amazing work.

Actually, Sep rather assumes the CSV is valid

I'm not not saying I hoped the library maintainer's statements could help me push back against the vendor's formatting ...

I am not even sure how other CSV libraries like CsvHelper or Sylvan would handle this. Have you tried and how do they handle it?

The previous implementation of this importer used CsvHelper, and ran at about 10% of the speed. When CsvHelper encountered this malformed line, it invoked an InvalidCsvData (or similarly named) action configurable in the initial options. However, I am too monke-brained to understand what CsvHelper is doing, if not something like a "last known good data" pointer versus a "suspected next line" pointer. Once the "next line" pointer has errored in some manner, anything between the "last known good data" pointer and the most recent non-triple-quoted newline would be considered an invalid line and skipped.
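For reference, the configurable bad-data action mentioned above is, in recent CsvHelper versions, the BadDataFound callback on CsvConfiguration. The exact name and delegate signature have varied between CsvHelper versions, so treat this as a sketch rather than a definitive usage:

```csharp
using System;
using System.Globalization;
using System.IO;
using CsvHelper;
using CsvHelper.Configuration;

var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
    // Invoked when a field contains a quote in an unexpected position,
    // such as the stray \" described in this issue. Logging here
    // (instead of throwing) lets the reader move past the malformed row.
    BadDataFound = args => Console.Error.WriteLine($"Bad data: {args.RawRecord}"),
};

using var reader = new StreamReader("vendor.csv"); // illustrative file name
using var csv = new CsvReader(reader, config);
while (csv.Read())
{
    // process fields via csv.GetField(...) as usual
}
```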

In the meantime, this was worked around via an intermediate Stream that would blindly replace any instance of \" with "" while reading the input file. StackOverflow question and answer: https://stackoverflow.com/questions/78790505/whats-a-performant-way-to-remove-data-from-a-stream-or-textwriter
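A minimal sketch of that workaround is below. For brevity it buffers the whole file; the StackOverflow answer linked above does the replacement in a streaming fashion to avoid holding everything in memory. The names here are illustrative and not part of Sep's API:

```csharp
using System.IO;

static class VendorCsvFixup
{
    // Replaces the vendor's invalid backslash-escape (\") with the
    // CSV-standard doubled quote ("") before the text reaches Sep.
    // A wrapping Stream that rewrites bytes as they are read avoids
    // buffering the full file, at the cost of more code.
    public static TextReader Sanitize(Stream input)
    {
        using var reader = new StreamReader(input);
        string fixedText = reader.ReadToEnd().Replace("\\\"", "\"\"");
        return new StringReader(fixedText);
    }
}
```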

@nietras
Owner

nietras commented Sep 4, 2024

@bryanboettcher sorry I did not reply to this sooner. I think your workaround is fine and should not be that much less performant. I guess some form of configurable callback action (delegate), invoked when the internal buffer grows, with the option to manipulate the internal data buffer, would be a possibility. I probably don't like it, since it adds a rather complicated option to the usage... but it should be possible with minimal perf impact.

I'm not not saying I hoped the library maintainer's statements could help me push back against the vendor's formatting ...

Haha, I hope the vendor folded and fixed their formatting then 😉
