Problem with CSV dialect sniffing #68

b2m · 2022-04-29T09:50:06Z

I tested the new behaviour for CSV dialect sniffing introduced for #41 in #42 and discovered the following problems:

The csv.Sniffer().sniff() method raises a csv.Error ("Could not determine delimiter") when it cannot settle on a delimiter.
I cannot override this behaviour via the CSVKWARGS configuration because that configuration is applied only after sniffing.

Line 76 in bc723c7

dialect = csv.Sniffer().sniff(csvfile.read(1024))

The reason csv.Sniffer() is unsure about the delimiter is that a fixed-size chunk of the CSV file may end in the middle of a line, so the number of delimiters in that last line is off.

Working example with whole file:

import csv
csv.Sniffer().sniff("a,b,c\n1,2,3")

Problematic example with only part of the file (raises csv.Error):

import csv
csv.Sniffer().sniff("a,b,c\n1,2")

So I would recommend using dialect sniffing only (or additionally?) when the user has not given explicit instructions on the dialect via CSVKWARGS, and using csvfile.readline() so that no line is cut off partway.
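A minimal sketch of that recommendation, not a patch against the project: `open_reader` and `sample_lines` are hypothetical names, and treating a configured `delimiter` as the "explicit instruction" is an assumption about how CSVKWARGS would be used.

```python
import csv
import io


def open_reader(csvfile, csvkwargs=None, sample_lines=2):
    """Return a csv.reader, sniffing only when no delimiter was configured."""
    csvkwargs = dict(csvkwargs or {})
    if "delimiter" in csvkwargs:
        # The user was explicit, so skip sniffing entirely.
        return csv.reader(csvfile, **csvkwargs)
    # Feed the sniffer whole lines so the sample never ends mid-line.
    sample = "".join(csvfile.readline() for _ in range(sample_lines))
    csvfile.seek(0)
    dialect = csv.Sniffer().sniff(sample)
    return csv.reader(csvfile, dialect, **csvkwargs)


# Sniffed from the first two complete lines.
sniffed_rows = list(open_reader(io.StringIO("a;b;c\n1;2;3\n4;5;6\n")))

# Explicit configuration: the sniffer is never called.
configured_rows = list(open_reader(io.StringIO("a|b\n1|2\n"), {"delimiter": "|"}))
```

Sniffing can still fail on unusual input, so this would normally be combined with a fallback.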


gitonthescene · 2022-04-29T09:57:43Z

I could have sworn I tested exactly this case. Thanks for opening this issue and the example. I’ll try to work through it and will post back if I have questions.

gitonthescene · 2022-04-29T10:03:52Z

Actually, did you use the query.zip file referenced in issue #41? If not, would you mind posting the file you used to reproduce the error and as much detail about how you ran it as you can?

[EDIT] Oh, I see. This raises an error. I agree I should be more defensive especially since the sniffing is effectively optional.
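A sketch of what such a defensive guard could look like. This is not the code from the actual fix; `sniff_or_default` is a hypothetical helper, and falling back to `csv.excel` is an assumption.

```python
import csv
import io


def sniff_or_default(sample, default=csv.excel):
    """Return the sniffed dialect, or `default` if sniffing fails."""
    try:
        return csv.Sniffer().sniff(sample)
    except csv.Error:
        # Raised with "Could not determine delimiter" when the sniffer is unsure.
        return default


# The truncated sample from the report no longer aborts the load.
dialect = sniff_or_default("a,b,c\n1,2")

# Explicit overrides still apply because they are passed as format parameters.
rows = list(csv.reader(io.StringIO("a;b\n1;2\n"), dialect, delimiter=";"))
```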

gitonthescene · 2022-04-29T10:34:43Z

@b2m Would you mind testing develop before I push this out in a patch? I still don't have a good feel for how likely people are to run into this issue.

b2m · 2022-04-29T12:31:52Z

I ran several scenarios on the develop branch. Overriding via configuration works 🥂.

What I discovered is that the code example from the Python documentation has a problem with CSV files that have unquoted fields and many columns, because the chunk size is limited to 1024 characters.

csv-reconcile/csv_reconcile/initdb.py

Lines 75 to 78 in bc723c7

    
    with open(csvfilenm, newline='', **enckwarg) as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024))
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect, **csvkwargs)

In the case of unquoted fields, a frequency approach is used to determine the CSV delimiter. This fails when the sample contains only a few lines and one of them is chopped off by the 1024-character limit.

https://github.com/python/cpython/blob/11652ceccf1dbf9dd332ad52ac9bd41b4adff274/Lib/csv.py#L280-L297
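To illustrate the idea, here is a simplified stand-in for the per-line counting, not the CPython code itself. `delimiter_counts` is a hypothetical helper that only looks at a single candidate character.

```python
def delimiter_counts(sample, delimiter=","):
    """Count occurrences of `delimiter` on each non-empty line of `sample`."""
    return [line.count(delimiter) for line in sample.split("\n") if line]


# Whole file: every line agrees on two commas.
whole = delimiter_counts("a,b,c\n1,2,3")

# Truncated chunk: the cut-off last line disagrees with the first.
truncated = delimiter_counts("a,b,c\n1,2")
```

With only two lines in the sample, one disagreeing line is enough to leave no count in the clear majority.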

So, for files with many columns, either increasing the chunk size or feeding only one or two complete lines to the sniffer helped to determine the delimiter correctly in my experiments.
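One possible way to combine the two ideas is to keep a chunk-sized read but finish the line it stops in. This is only a sketch: `read_sample` is a hypothetical helper, and it assumes no quoted field contains an embedded newline.

```python
import csv
import io


def read_sample(csvfile, chunk_size=1024):
    """Read about `chunk_size` characters, extended to the end of the line."""
    # readline() picks up wherever read() stopped, completing a cut-off line.
    sample = csvfile.read(chunk_size) + csvfile.readline()
    csvfile.seek(0)
    return sample


# A tiny chunk size stands in for a 1024-character cut on a wide file.
csvfile = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
sample = read_sample(csvfile, chunk_size=9)
dialect = csv.Sniffer().sniff(sample)
```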

gitonthescene · 2022-04-29T13:13:11Z

Awesome. Thanks. I’ll try to get this out tomorrow. I’ll make a separate issue of making the sniffer more clever.

gitonthescene · 2022-04-30T05:54:07Z

Fix pushed in new release.

gitonthescene pushed a commit that referenced this issue Apr 29, 2022

Fixes #68. Protect against sniffing failing. Fallback to overrides

3406af3

gitonthescene closed this as completed in b239c30 Apr 30, 2022
