Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with CSV dialect sniffing #68

Closed
b2m opened this issue Apr 29, 2022 · 6 comments
Closed

Problem with CSV dialect sniffing #68

b2m opened this issue Apr 29, 2022 · 6 comments

Comments

@b2m
Copy link
Contributor

b2m commented Apr 29, 2022

I tested the new behaviour for CSV dialect sniffing introduced for #41 in #42 and discovered the following problems:

  1. The csv.Sniffer().sniff() method will throw a "Could not determine delimiter" error in case it is unsure about a delimiter.
  2. I can not overwrite this behaviour via the CSVKWARGS configuration because it is applied later.

dialect = csv.Sniffer().sniff(csvfile.read(1024))

The reason for csv.Sniffer() being unsure about the delimiter is that while reading a fixed chunk of the csv file this chunk might end in the middle of a csv line and therefore the number of delimiters in this line is off.

Working example with whole file:

import csv
csv.Sniffer().sniff("a,b,c\n1,2,3")

Problematic example with only part of the file (throws error):

import csv
csv.Sniffer().sniff("a,b,c\n1,2")

So I would recommend to use dialect sniffing only (or additionaly?) when the user has not given explicit instructions on the dialect via CSVKWARGS and to use csvfile.readline() to avoid having a line cut somewhere.

@gitonthescene
Copy link
Owner

gitonthescene commented Apr 29, 2022

I could have sworn I tested exactly this case. Thanks for opening this issue and the example. I’ll try to work through it and will post back if I have questions.

@gitonthescene
Copy link
Owner

gitonthescene commented Apr 29, 2022

Actually, did you use the query.zip file referenced in issue #41? If not, would you mind posting the file you used to reproduce the error and as much detail about how you ran it as you can?

[EDIT] Oh, I see. This raises an error. I agree I should be more defensive especially since the sniffing is effectively optional.

@gitonthescene
Copy link
Owner

gitonthescene commented Apr 29, 2022

@b2m Would you mind testing develop before I push this out in a patch? I still don't have a good feel for how likely people are to run into this issue.

@b2m
Copy link
Contributor Author

b2m commented Apr 29, 2022

I run several szenarios on the develop branch. The overwrite via configuration is working 🥂.

What I discovered is, that the code example from the Python documentation has a problem with csv files that have unquoted fields and a lot of columns because of the limit on the chunk size on 1024.

with open(csvfilenm, newline='', **enckwarg) as csvfile:
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect, **csvkwargs)

In the case of unquoted fields a frequency approach is used to determine the csv delimiter and this fails when it only has a few lines and one of them is choped up because of the limit on 1024.

https://github.com/python/cpython/blob/11652ceccf1dbf9dd332ad52ac9bd41b4adff274/Lib/csv.py#L280-L297

So in the case of a lot of columns either increasing the chunk size or only feeding one or two lines to the sniffer helped in my experiments to correctly determine the delimiter.

@gitonthescene
Copy link
Owner

gitonthescene commented Apr 29, 2022

Awesome. Thanks. I’ll try to get this out tomorrow. I’ll make a separate issue of making the sniffer more clever.

@gitonthescene
Copy link
Owner

Fix pushed in new release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants