read_csv() & EOF character in string cause parsing issue #5500
Comments
Further investigation using a hex editor has revealed what is going on:
|
@stephenjshaw are you able to try this on the master branch? |
On pandas 0.13.1, I had the exact same problem and solution. |
I am having the same issue and cannot find any offending characters in the lines near the line number given. Is there some way to search for weird characters given I have no clue where the issue is? |
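One way to hunt for stray characters when the reported line number is no help: a minimal sketch that scans the raw bytes of the file for control characters such as 0x1A (the EOF / Ctrl-Z character) or NUL. The helper name `find_suspect_bytes`, the choice of which bytes count as suspect, and the sample data are all assumptions made for illustration; none of this is part of pandas.

```python
import os
import tempfile

# Assumption: control bytes other than tab, LF and CR do not belong in CSV text.
SUSPECT = {b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D)} | {0x7F}


def find_suspect_bytes(path):
    """Yield (line_number, column, byte_value) for each suspect byte in the file."""
    # Binary mode, so that 0x1A is never interpreted as end-of-file.
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            for column, value in enumerate(line, start=1):
                if value in SUSPECT:
                    yield line_number, column, value


# Hypothetical sample file: an EOF character on line 2 and a NUL on line 3.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b\n1\x1a,2\n3,4\x00\n")
    sample_path = tmp.name

hits = list(find_suspect_bytes(sample_path))
os.remove(sample_path)

for line_number, column, value in hits:
    print(f"line {line_number}, column {column}: byte 0x{value:02x}")
```

Line numbers here count physical lines (split on `\n`), so they can differ from the row number in the parser's error message if quoted fields span several lines.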
another user seeing this/similar: http://stackoverflow.com/q/24005761/1240268 |
Note that, according to the documentation at http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html , error_bad_lines=False only means that lines with too many fields will be skipped. If there are other problems with your data, .read_csv() will fail rather than skip the problematic lines, cf. #6478 and https://stackoverflow.com/questions/22026181/pandas-warn-bad-lines-false-and-error-bad-lines-false-is-still-trying-to-parse-b |
I don't think this is that hard to fix. Essentially, the low-level reader returns on EOF, but it's simple enough to check whether that's actually the end of the file by reading again; if it isn't, I think we can just ignore the character or remove that line. Does anyone have a couple of test cases (e.g. we need an EOF inside a quote and one outside)? I can generate them, but maybe @stephenjshaw already has a bit of code to do this? |
I hit the same problem here and solved it with @stephenjshaw's solution. |
With no examples to really draw from, I created my own here for future reference, but I get no errors:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b\n1\x1a,2' # note the EOF in the middle of the last line
>>> read_csv(StringIO(data), engine='c')
a b
0 1� 2
>>> read_csv(StringIO(data), engine='python')
a b
0 1� 2 |
yes I think your EOF PR closed this |
IIRC I didn't do a PR for the EOF (it was the NULL char and BOM). I can add tests though for this. |
oh right ok then |
I know I'm 4 years late to this issue... but I just encountered this bug again.
When I was loading a large CSV using pandas, the error message told me to look at line 853, which is a perfectly valid line... I'm on macOS 10.12.6, with the Python 2.7 Anaconda build and pandas 0.21. This bug also exists in pandas 0.20. I haven't checked earlier versions, but it probably exists in all of them. |
@patrickwang96 : Look at your CSV string. It's malformed with that unbalanced quotation mark. The error is to be expected. |
It's not always possible to have a perfect CSV file, so when it's more important to get the file loaded than to get every row, it would be good if error_bad_lines did what's expected. I found that adding quoting=csv.QUOTE_NONE fixed my issue (as mentioned here: https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5) |
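For future readers, here is a minimal sketch of that workaround. The sample string `MALFORMED` and the helper `read_ignoring_quotes` are hypothetical names made up for illustration; `quoting` is the real `read_csv` parameter, and `csv.QUOTE_NONE` comes from the standard library. It assumes a pandas version recent enough to expose `pd.errors.ParserError`.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the quote opened on the second line is never closed.
MALFORMED = 'a,b\n1,"x\n2,y\n'


def read_ignoring_quotes(text):
    """Parse CSV text while treating quote characters as ordinary data."""
    return pd.read_csv(io.StringIO(text), quoting=csv.QUOTE_NONE)


# With default quoting, the unbalanced quote is expected to make the parser fail.
try:
    pd.read_csv(io.StringIO(MALFORMED))
except pd.errors.ParserError as exc:
    print("default quoting failed:", exc)

# With QUOTE_NONE the stray quote is kept as a literal character in the field.
frame = read_ignoring_quotes(MALFORMED)
print(frame)
```

The trade-off is that quoting is switched off for the whole file: properly quoted fields keep their quote characters, and a quoted field that contains the delimiter will be split into separate columns.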
@morganics : It would, except that |
I processed the same exact CSV file twice. One time it failed and the next time it did not. There must be some sort of race / memory condition causing this? Fun ; -) |
Of the two, probably memory condition. We don't have any concurrency for CSV parsing. Fun, indeed 😉 |
For the reason I pointed out in my answer to this question: |
While importing large text files using read_csv, we occasionally get an EOF (End of File) character within a string, which causes an exception: "Error tokenizing data. C error: EOF inside string starting at line 844863". This occurs even with "error_bad_lines = False".
Further, the line stated in the error message is not the line containing the EOF character. In this particular case the actual row was approx. 230 rows before the one stated, which hinders exception handling. (I now see this difference was caused by other "bad_lines" that were being skipped: the line quoted in the error is correct, but fewer rows had been imported.)
I feel it would be appropriate if "error_bad_lines = False" handled this exception and allowed such rows to be skipped.
I note that when importing this text file into Excel, the "premature" EOF is simply ignored.
We are running on Windows 8, with Python 2.7 and pandas 0.12.