bug: read_csv incorrect output with skipfooter and skip_blank_lines=True #10164

Open
jsspencer opened this issue May 18, 2015 · 4 comments
Labels
Docs IO CSV read_csv, to_csv

Comments

@jsspencer
Contributor

read_csv does not return the expected data table when all of the following hold: the CSV file contains only one line of data, skipfooter is used, the line directly after the table is blank, and skip_blank_lines=True (the default).

>>> import pandas as pd
>>> import StringIO
>>> test_csv = StringIO.StringIO('a,b,c\n1,2,3\n\nend\n')
>>> pd.read_csv(test_csv, skip_footer=2, engine='python')
Empty DataFrame
Columns: [a, b, c]
Index: []

But if more than one line is in the data table, or the line after the data table is not blank, or skip_blank_lines is set to False, everything is OK:

>>> test_csv = StringIO.StringIO('a,b,c\n1,2,3\n4,5,6\n\nend\n')
>>> pd.read_csv(test_csv, skip_footer=2, engine='python')
   a  b  c
0  1  2  3
1  4  5  6
>>> test_csv = StringIO.StringIO('a,b,c\n1,2,3\nend\n\n')
>>> pd.read_csv(test_csv, skip_footer=2, engine='python')
   a  b  c
0  1  2  3
>>> test_csv = StringIO.StringIO('a,b,c\n1,2,3\n\nend\n')
>>> pd.read_csv(test_csv, skip_footer=2, engine='python', skip_blank_lines=False)
   a  b  c
0  1  2  3

This occurs in every version from 0.15.0 onwards (i.e. since skip_blank_lines was introduced).
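
One way to sidestep the ambiguity is to trim the footer by physical line count before the text ever reaches the parser. The following is a minimal sketch using only the standard library; `drop_footer` is a hypothetical helper name, not part of pandas, and it assumes the file is small enough to hold in memory as a string.

```python
import csv
import io


def drop_footer(text, n_footer):
    """Return text without its last n_footer physical lines.

    Blank lines count as lines here, so the result does not depend on
    how a parser treats skip_blank_lines. Hypothetical helper.
    """
    lines = text.splitlines(keepends=True)
    return "".join(lines[:-n_footer]) if n_footer else text


raw = "a,b,c\n1,2,3\n\nend\n"
trimmed = drop_footer(raw, 2)

# Parse the trimmed text; empty rows are dropped explicitly.
rows = [row for row in csv.reader(io.StringIO(trimmed, newline="")) if row]
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```

The trimmed string could equally be wrapped in a StringIO and passed to pd.read_csv without skip_footer, so that the footer length is always counted in physical lines.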

@jreback
Contributor

jreback commented May 18, 2015

I think the warning in the docs is pretty clear, see here

Skipping 2 lines skips everything, but skipping 1 gets your data.

In [17]: pd.read_csv(StringIO('a,b,c\n1,2,3\n\nend\n'),skip_footer=1,engine='python')
Out[17]: 
   a  b  c
0  1  2  3

@jreback jreback added Usage Question IO CSV read_csv, to_csv labels May 18, 2015
@jreback jreback closed this as completed May 18, 2015
@jsspencer
Contributor Author

Thanks for the fast reply, but I don't think the ambiguity is quite the same as that warning. Does skip_footer use row or line numbers? Your comment implies it uses row numbers.

Swapping the blank and non-blank lines changes the behaviour:

In [4]: pd.read_csv(StringIO('a,b,c\n1,2,3\nend\n\n'),skip_footer=1,engine='python')
Out[4]: 
     a   b   c
0    1   2   3
1  end NaN NaN

In [5]: pd.read_csv(StringIO('a,b,c\n1,2,3\nend\n\n'),skip_footer=2,engine='python')
Out[5]: 
   a  b  c
0  1  2  3

So the ambiguity appears to be what pandas regards as the footer. It seems that blank lines between the last line in the data table and the first non-blank line in the footer are skipped but subsequent blank lines are not.

In [6]: pd.read_csv(StringIO('a,b,c\n1,2,3\n\n\n\nend\n\n'),skip_footer=2,engine='python')
Out[6]: 
   a  b  c
0  1  2  3

In [7]: pd.read_csv(StringIO('a,b,c\n1,2,3\n\n\n\nend1\n\nend2'), engine='python')
Out[7]: 
      a   b   c
0     1   2   3
1  end1 NaN NaN
2  end2 NaN NaN

In [8]: pd.read_csv(StringIO('a,b,c\n1,2,3\n\n\n\nend1\n\nend2'), engine='python', skip_footer=2)
Out[8]: 
      a   b   c
0     1   2   3
1  end1 NaN NaN

In [9]: pd.read_csv(StringIO('a,b,c\n1,2,3\n\n\n\nend1\n\nend2'),skip_footer=3,engine='python')
Out[9]: 
   a  b  c
0  1  2  3

Maybe not a bug but I think it's surprising that a footer can contain blank lines but not start with one if skip_blank_lines=True.
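
To make the pattern concrete, here is a toy model that reproduces every output quoted in this thread. It is only a hypothesis that fits the observations, not pandas code, and it has not been checked against the pandas source. The assumption (labelled in the code) is that blank lines are dropped eagerly only until the first two non-blank rows after the header have been read, after which the remaining physical lines are kept, the footer is sliced off, and only then are the leftover blank lines removed. `toy_read` is a hypothetical name.

```python
import csv
import io


def toy_read(text, skipfooter=0, skip_blank_lines=True):
    """Toy model of the observed behaviour (an assumption, not pandas)."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    header, body = rows[0], rows[1:]
    if skip_blank_lines:
        # ASSUMPTION: blanks are skipped while looking for the first two
        # non-blank rows; everything after that stays as physical lines.
        kept, i = [], 0
        while i < len(body) and len(kept) < 2:
            if body[i]:
                kept.append(body[i])
            i += 1
        body = kept + body[i:]
    if skipfooter:
        # The footer is counted on this partially filtered list.
        body = body[:-skipfooter]
    if skip_blank_lines:
        # Remaining blank lines are removed only after the footer is cut.
        body = [row for row in body if row]
    return header, body


# A blank line directly after a single data row is "lost" before counting:
print(toy_read("a,b,c\n1,2,3\n\nend\n", skipfooter=2)[1])  # []
# A blank line after the footer text still counts as a footer line:
print(toy_read("a,b,c\n1,2,3\nend\n\n", skipfooter=2)[1])  # [['1', '2', '3']]
```

Under this model skip_footer counts neither pure physical lines nor pure non-blank rows, which would explain why a footer can contain blank lines but cannot start with one.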

@jreback jreback added this to the Next Major Release milestone May 20, 2015
@jreback
Contributor

jreback commented May 20, 2015

@jsspencer OK, I'll buy that. Do you want to take a shot at clarifying what it should be doing and/or updating the docs to be more specific? Thanks.

@jreback
Contributor

jreback commented May 20, 2015

cc @mdmueller
cc @selasley

if you guys have thoughts about this

jsspencer pushed a commit to hande-qmc/hande that referenced this issue Sep 29, 2015
See pandas-dev/pandas#10164 for details.
Affects pandas 0.15.0-0.16.1.
@mroeschke mroeschke added Docs and removed API Design labels Apr 18, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022