-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSV has_headers heuristic could be improved #87791
Comments
Here is an sample of CSV input: "time","forces" when calling has_header() from csv.py on this sample, it returns false. I think the heuristic will better work if rather than just comparing number types, it would also consider casting the values in this order int -> float -> complex. If the values are similar then consider this upgraded type as the type of the column. In the end, this file would be considered float columns with headers. |
I assume the OP is referring to this sort of usage: >>> sniffer = csv.Sniffer()
>>> raw = open("mixed.csv").read()
>>> sniffer.has_header(raw)
False *sigh* I really wish the Sniffer class had never been added to the CSV module. I can't recall who wrote it (the author is long gone). Though I am responsible for the initial commits, it wasn't me or the main authors of csvmodule.c. As far as I know, it never really worked well. I can't recall ever using it. A simpler heuristic would be if the first row contains a bunch of strings and the second row contains a bunch of numbers, then the file has a header. That assumes that CSV files consist mostly of numeric data. Looking at has_header, I see this: for thisType in [int, float, complex]: I think this particular problem would be solved if the order of those types were reversed. The attached diff suggests that as well. Note that the Sniffer class currently contains no test cases, so that the test I added failed before the change and passes after doesn't mean it doesn't break someone's mission critical Sniffer usage. (Sorry, Raymond. My Github-foo is insufficient to allow me to fork, apply the diff and create a PR.) |
I've added the PR here, based on Skip's patch: |
Skip: If I understand right, in the patch the last two types -- float and int, will never have an effect because if float(x) and int(x) succeed, so will complex(x), and conversely, if complex(x) fails, float and int will also fail. So the effect of the patch will be to tolerate any mix of numeric columns when the headers are textual. Which sounds fine to me, just want to confirm that sounds good to you, because the unit test in the patch is a much narrower case. I think the test should then be something like: a b c and the code should be updated to just do Another, more strict option, - would be to special case |
Thanks @andrei.avk. You are right, only the complex test is required. I suppose it's okay to commit this, but reviewing the full code of the has_header method leaves me thinking this is just putting lipstick on a pig. If I read the code correctly, there are two (undocumented) tacit assumptions:
The second criterion means this has a header: ab,de but this doesn't: abc,def It seems to me that it would be a good idea to at least expand on the documentation of that method and maybe add at least one test case where the CSV sample doesn't have a header. I'll try to get that done and attach a patch. |
I retract my comment about fixed length strings in the non-numeric case. There are clearly test cases (which I probably wrote, considering the values) where the sample as a header but the values are of varying length. Misread of the code on my part. I have obviously not had enough coffee yet this morning. |
Here is a change to the has_header documentation and an extra test case documenting the behavior when the sample contains strings. I'm not sure about the wording of the doc change, perhaps you can tweak it? Seems kind of clumsy to me. If it seems okay to you @andrei.avk, can you fold it into your PR? |
Here's a NEWS entry. |
Skip: I've updated the PR Expanded the docs a bit, I think they give a good rough idea of the logic but I feel like they can probably be improved more, let me know what you think! I spent some time thinking about it but found it tricky! :) |
Since this is a heuristic change, I'm thinking we shouldn't backport this to 3.9. This way it will be easier for application authors to notice the change and control it within their apps. |
Łukasz: I agree. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: