Problem with CSV dialect sniffing #68
Comments
I could have sworn I tested exactly this case. Thanks for opening this issue and the example. I’ll try to work through it and will post back if I have questions.
Actually, did you use the query.zip file referenced in issue #41? If not, would you mind posting the file you used to reproduce the error and as much detail about how you ran it as you can? [EDIT] Oh, I see. This raises an error. I agree I should be more defensive especially since the sniffing is effectively optional.
@b2m Would you mind testing develop before I push this out in a patch? I still don't have a good feel for how likely people are to run into this issue.
I ran several scenarios on the develop branch. The override via configuration is working 🥂. What I discovered is that the code example from the Python documentation has a problem with CSV files that have unquoted fields and a lot of columns, because the chunk size is limited to 1024.
csv-reconcile/csv_reconcile/initdb.py Lines 75 to 78 in bc723c7
With unquoted fields, a frequency approach is used to determine the CSV delimiter, and this fails when the sample has only a few lines and one of them is chopped off by the 1024 limit. So for files with a lot of columns, either increasing the chunk size or feeding only one or two lines to the sniffer helped in my experiments to determine the delimiter correctly.
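The second remedy, feeding the sniffer only a couple of complete lines, could look roughly like the sketch below. The helper names `sample_complete_lines` and `sniff_from_lines` are hypothetical and not part of csv-reconcile. The sketch also assumes that no quoted field contains a line break, since `readline()` knows nothing about CSV quoting.

```python
import csv
import io


def sample_complete_lines(csvfile, max_lines=2):
    """Return up to max_lines complete lines from an open text file."""
    lines = []
    for _ in range(max_lines):
        line = csvfile.readline()
        if not line:  # end of file reached
            break
        lines.append(line)
    return "".join(lines)


def sniff_from_lines(csvfile, max_lines=2):
    """Sniff the dialect from whole lines instead of a fixed-size chunk."""
    sample = sample_complete_lines(csvfile, max_lines)
    csvfile.seek(0)  # rewind so the caller can read the file from the start
    return csv.Sniffer().sniff(sample)
```

Because `readline()` returns whole lines regardless of their length, the sample never ends in the middle of a row, however many columns the file has.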
Awesome. Thanks. I’ll try to get this out tomorrow. I’ll make a separate issue of making the sniffer more clever.
Fix pushed in new release.
I tested the new behaviour for CSV dialect sniffing introduced for #41 in #42 and discovered the following problems:
CSVKWARGS configuration because it is applied later.
csv-reconcile/csv_reconcile/initdb.py Line 76 in bc723c7
The reason for csv.Sniffer() being unsure about the delimiter is that, when reading a fixed chunk of the CSV file, the chunk might end in the middle of a CSV line, so the number of delimiters in that line is off. Sniffing the whole file works; sniffing only part of the file throws an error.
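As an illustration of the truncation effect only (these are not the snippets from the original report, and the sniffer itself is not invoked), here is a minimal sketch that counts delimiters per line in a whole file and in its first 1024 characters. The data and the helper `delimiter_counts` are hypothetical.

```python
import io

CHUNK_SIZE = 1024  # the chunk size mentioned in this thread


def delimiter_counts(sample, delimiter):
    """Count how often the delimiter occurs in each line of a text sample."""
    return [line.count(delimiter) for line in sample.splitlines()]


# Hypothetical data: 5 rows with 60 unquoted columns each (360 characters per row).
row = ";".join(["value"] * 60) + "\n"
data = row * 5

# Every complete row has the same number of delimiters.
whole = delimiter_counts(data, ";")

# The fixed-size chunk ends inside the third row, so that row's count is off.
partial = delimiter_counts(io.StringIO(data).read(CHUNK_SIZE), ";")

print(whole)
print(partial)
```

Here every complete row contains 59 delimiters, while the cut-off third row in the chunk contains only 50. With so few lines in the sample, one inconsistent row is a large share of the evidence a frequency-based guess has to work with.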
So I would recommend using dialect sniffing only (or additionally?) when the user has not given explicit instructions on the dialect via CSVKWARGS, and using csvfile.readline() to avoid having a line cut off somewhere.
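A minimal sketch of that recommendation, under stated assumptions: `CSVKWARGS` is treated here as a plain dict of keyword arguments for `csv.reader`, which may not match how csv-reconcile actually applies it, and `open_reader` is a hypothetical helper. It implements the "only" variant: any explicit setting disables sniffing entirely.

```python
import csv
import io


def open_reader(csvfile, csvkwargs=None, sniff_lines=2):
    """Build a csv.reader, sniffing the dialect only without explicit settings."""
    csvkwargs = dict(csvkwargs or {})
    if csvkwargs:
        # Assumption: explicit user configuration always wins, so skip sniffing.
        return csv.reader(csvfile, **csvkwargs)

    # Sample whole lines so that no row is cut in the middle.
    sample = "".join(csvfile.readline() for _ in range(sniff_lines))
    csvfile.seek(0)  # rewind before handing the file to the reader
    try:
        dialect = csv.Sniffer().sniff(sample)
    except csv.Error:
        # Sniffing is optional: fall back to the default dialect on failure.
        dialect = csv.excel
    return csv.reader(csvfile, dialect)
```

For example, `open_reader(f, {"delimiter": "|"})` never calls the sniffer, while `open_reader(f)` sniffs from the first two lines and falls back to `csv.excel` if the sniffer raises `csv.Error`.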