read_csv() & EOF character in string cause parsing issue #5500

Closed
stephenjshaw opened this issue Nov 12, 2013 · 20 comments
Labels
Bug IO CSV read_csv, to_csv
Milestone

Comments

@stephenjshaw

While importing large text files using read_csv, we occasionally get an EOF (End of File) character within a string, which causes an exception: "Error tokenizing data. C error: EOF inside string starting at line 844863". This occurs even with "error_bad_lines = False".

Further, the line stated in the error message is not the line containing the EOF character. In this particular case the actual row was approx. 230 rows before the one stated, which hinders exception handling. (I now see this difference was caused by other "bad_lines" that were being skipped: the quoted error line is correct, but fewer rows had been imported.)

I feel it would be appropriate if "error_bad_lines = False" handled this exception and allowed such rows to be skipped.

I note that when importing this text file into Excel, the "premature" EOF is simply ignored.

We are running on Windows 8, with Python 2.7 and pandas 0.12.

@stephenjshaw
Author

Further investigation using a hex editor has revealed what is going on:

  • I added 0x1A ("EOF") to a different file and it did not cause any problems. Pandas read_csv imported it without error.
  • I parsed every line of the problematic CSV individually until I isolated the one causing the problem. It was over 3000 rows after the row number stated in the error message.
  • The row in question had a column with a double quote mark following the delimiter. There were not supposed to be any quote marks in the file, and there was no second double quote in the column or anywhere else on the row.
  • I think the quote mark caused the import to look for a second, terminating double quote, ignoring column delimiters and end-of-line markers until it reached the end of the file. When it didn't find one before the end of the file, I'm speculating that this triggered the "EOF inside a string" error message.
  • I was able to work around the problem by setting the quotechar to be the same as the delimiter, which tells read_csv to ignore all quotes. It now imports the file perfectly.
  • I still think "error_bad_lines" should catch this by checking whether any row contains a column with a missing terminating quote.
  • One reason I think this is important is that by adding a second such double quote, many lines apart, I was able to "fool" the system into skipping all the intervening lines, even though only two rows had an error.
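
The quote-swallowing behaviour described in the list above can be illustrated with a small sketch. It uses Python's standard-library csv module as an analogy only: pandas has its own tokenizer, so details such as error messages may differ. The sample data is hypothetical and contains one stray double quote on each of two lines.

```python
import csv
import io

# Hypothetical data: the lines for ids 1 and 4 each contain one stray double quote.
raw = (
    "id,name\n"
    '1,"alpha\n'
    "2,beta\n"
    "3,gamma\n"
    '4,"delta\n'
    "5,epsilon\n"
)

# Default quoting: the first stray quote opens a quoted field, so every
# delimiter and newline up to the next quote becomes part of that one field.
default_rows = list(csv.reader(io.StringIO(raw)))

# QUOTE_NONE: quote characters are treated as ordinary data.
literal_rows = list(csv.reader(io.StringIO(raw), quoting=csv.QUOTE_NONE))

print(len(default_rows))  # header plus two records
print(len(literal_rows))  # header plus five records
```

With default quoting, the lines for ids 1 to 4 collapse into a single record, which matches the "intervening lines skipped" observation. With quoting disabled, each physical line stays its own record and the stray quotes are kept as literal characters.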

@guyrt
Contributor

guyrt commented Dec 19, 2013

@stephenjshaw are you able to try this on the master branch?

@rcompton

On pandas 0.13.1, I had the exact same problem and solution.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 6, 2014
@DataJunkie

I am having the same issue and cannot find any offending characters in the lines near the line number given. Is there some way to search for weird characters given I have no clue where the issue is?
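
One possible way to search, given the findings earlier in this thread, is to scan the raw text for lines with an odd number of quote characters or with control characters such as 0x1A. This is a minimal sketch, not part of pandas: the helper name `find_suspect_lines` and the sample data are made up for illustration.

```python
import io


def find_suspect_lines(lines, quotechar='"'):
    """Yield (line_number, reasons) for lines that look suspicious.

    A line is flagged when it holds an odd number of quote characters or
    any control character other than tab, carriage return, or newline.
    """
    for number, line in enumerate(lines, start=1):
        reasons = []
        if line.count(quotechar) % 2 == 1:
            reasons.append("odd number of quote characters")
        if any(ord(ch) < 32 and ch not in "\t\r\n" for ch in line):
            reasons.append("control character")
        if reasons:
            yield number, reasons


# Hypothetical tab-delimited sample: line 3 has a stray quote, line 4 has 0x1A.
sample = io.StringIO('a\tb\tc\n1\t2\t3\n4\t"5\t6\n7\x1a\t8\t9\n')
suspects = list(find_suspect_lines(sample))
print(suspects)
```

For a real file, the same function can be fed an open file object instead of the StringIO. Note that a file with legitimate quoted fields spanning several lines would also be flagged, so the result is a list of candidates to inspect rather than a list of confirmed errors.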

@hayd
Contributor

hayd commented Jun 3, 2014

another user seeing this/similar: http://stackoverflow.com/q/24005761/1240268

@rcompton

rcompton commented Jun 3, 2014

Note that, according to the documentation at http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html , error_bad_lines=False only means that lines with too many fields will be skipped. If there are other problems with your data, .read_csv() will fail rather than skip the problematic lines, cf #6478 and https://stackoverflow.com/questions/22026181/pandas-warn-bad-lines-false-and-error-bad-lines-false-is-still-trying-to-parse-b

@jreback
Contributor

jreback commented Jun 3, 2014

I don't think this is that hard to fix (essentially the low-level reader returns on EOF, but simple enough to check if that's actually the end of the file by reading again, if not, then can just ignore I think / remove that line).

Anyone have a couple of test cases (e.g. need EOF inside a quote and outside)? I can generate them, but maybe @stephenjshaw already has a bit of code to do this?

@yyl
Contributor

yyl commented Jun 11, 2014

Met the same problem here and solved it with @stephenjshaw's solution.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@gfyoung
Member

gfyoung commented Aug 2, 2016

With no examples to really draw from, I created my own here for future reference, but I get no errors:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b\n1\x1a,2'  # note the EOF in the middle of the last line
>>> read_csv(StringIO(data), engine='c')
    a  b
0  12
>>> read_csv(StringIO(data), engine='python')
    a  b
0  12

@gfyoung
Member

gfyoung commented Aug 22, 2016

@jreback : In light of my examples above, IMO this is no longer an issue. Perhaps some tests?

@jreback
Contributor

jreback commented Aug 22, 2016

yes I think your EOF PR closed this
can u add that issue number here

@jreback jreback closed this as completed Aug 22, 2016
@gfyoung
Member

gfyoung commented Aug 22, 2016

IIRC I didn't do a PR for the EOF (it was the NULL char and BOM). I can add tests though for this.

@jreback
Contributor

jreback commented Aug 22, 2016

oh right ok then

gfyoung added a commit to forking-repos/pandas that referenced this issue Aug 23, 2016
@jreback jreback modified the milestones: 0.19.0, Next Major Release Aug 24, 2016
@patrickwang96

I know I'm 4 years late for this issue... but I just encountered this bug again.
I don't think this bug is actually caused by an EOF character inside a row of the CSV.
To reproduce the bug:

In [1]: import pandas as pd

In [2]: from StringIO import StringIO

In [3]: test_csv = '"a\tb\tc\n1\t2\t3'

In [4]: pd.read_csv(StringIO(test_csv), delimiter='\t')

ParserError: Error tokenizing data. C error: EOF inside string starting at line 0

In [6]: test_csv = test_csv.translate(None, '"')

In [7]: pd.read_csv(StringIO(test_csv), delimiter='\t')
Out[7]:
   a  b  c
0  1  2  3

In [8]:

When I was loading a large CSV using pandas, the error message told me to look at line 853, which is a totally correct line...
The actual problem is thousands of lines away from it.

I'm on macOS 10.12.6, Python 2.7 (Anaconda build) and pandas 0.21. This bug also exists in pandas 0.20. I'm not sure about earlier versions, but it probably exists in all of them.

@gfyoung
Member

gfyoung commented Nov 23, 2017

@patrickwang96 : Look at your CSV string. It's malformed with that unbalanced quotation mark. The error is to be expected.

@morganics

morganics commented Jan 23, 2018

It's not always possible to have a perfect CSV file, so where it's more important to have a loaded data file, and less important to get all the data, it would be good if error_bad_lines did what's expected. I found that passing quoting=csv.QUOTE_NONE fixed my issue (as mentioned here: https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5)
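
As a rough sketch of that workaround, the snippet below reuses the unbalanced-quote string from the reproduction earlier in this thread, written for Python 3. It assumes a reasonably recent pandas, where the exception class is available as `pd.errors.ParserError`; the exact error text may vary between versions.

```python
import csv
import io

import pandas as pd

# Same malformed input as the earlier reproduction: one unbalanced quote.
raw = '"a\tb\tc\n1\t2\t3'

# Default quoting: the opening quote is never closed, so tokenizing fails.
default_error = None
try:
    pd.read_csv(io.StringIO(raw), sep="\t")
except pd.errors.ParserError as exc:
    default_error = exc
print("default quoting:", default_error)

# QUOTE_NONE: the quote is kept as a literal character in the first header.
df = pd.read_csv(io.StringIO(raw), sep="\t", quoting=csv.QUOTE_NONE)
print(df.columns.tolist())
```

The trade-off is that quote characters now appear in the data (here in the first column name), so fields that were legitimately quoted would need to be cleaned up afterwards.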

@gfyoung
Member

gfyoung commented Jan 23, 2018

@morganics : It would, except that pandas doesn't know where the line ends and begins in this case. You can't handle a bad line if you can't deduce where it begins or ends unfortunately.

@edrossy

edrossy commented Sep 27, 2018

I processed the same exact CSV file twice. One time it failed and the next time it did not. There must be some sort of race / memory condition causing this? Fun ;-)

@gfyoung
Member

gfyoung commented Sep 29, 2018

Of the two, probably memory condition. We don't have any concurrency for CSV parsing. Fun, indeed 😉

@NeuroBobster

For the reason I pointed out in my answer to this question:
https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5/53173373#53173373
I would suggest making quoting=csv.QUOTE_NONE the default instead of csv.QUOTE_MINIMAL.
It's easier to realise what's going on when your strings are unexpectedly parsed with quotechars than to get the error when there's an odd number of quotechars, or no error but unexpected parsing when there's an even number of quotechars.
