Skip to content

BUG: ExcelFile.parse() skiprows arg doesn't play nice #4903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nehalecky opened this issue Sep 20, 2013 · 0 comments · Fixed by #11340
Closed

BUG: ExcelFile.parse() skiprows arg doesn't play nice #4903

nehalecky opened this issue Sep 20, 2013 · 0 comments · Fixed by #11340
Labels
Bug IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel

Comments

@nehalecky
Copy link
Contributor

related/dup #4340

Was building on this example on SO, and found a bug in ExcelFile().parse option skiprows, when passed an index.

In [2]: xls = pd.ExcelFile('example.xlsx')
In [3]: df = xls.parse(xls.sheet_names[0])
In [4]: print df
              Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
0   2011-10-01 00:00:00  1892213     E      8111            29      ALTOU
1   2011-11-01 00:00:00  1892213     E      8111            29      ALTOU
2                   NaN      NaN   NaN       NaN           NaN        NaN
3                   NaN      NaN   NaN       NaN           NaN        NaN
4             Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
5   2011-10-01 00:00:00   553961     G      8111            29        GN3
6   2011-11-01 00:00:00   553961     G      8111            29        GN3
7                   NaN      NaN   NaN       NaN           NaN        NaN
8             Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
9   2011-10-01 00:00:00  6322158     E                      29          A
10  2011-11-01 00:00:00  6322158     E                      29          A

Wanting to avoid the repeated headers and empty rows, I parse with skiprows, but no dice (no difference with previously parsed sheet):

In [5]: skip_idx = np.array([2,3,4,7,8])
In [6]: df = xls.parse(xls.sheet_names[0], skiprows=skip_idx+1)
In [8]: df
Out[8]: 
              Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
0   2011-10-01 00:00:00  1892213     E      8111            29      ALTOU
1   2011-11-01 00:00:00  1892213     E      8111            29      ALTOU
2                   NaN      NaN   NaN       NaN           NaN        NaN
3                   NaN      NaN   NaN       NaN           NaN        NaN
4             Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
5   2011-10-01 00:00:00   553961     G      8111            29        GN3
6   2011-11-01 00:00:00   553961     G      8111            29        GN3
7                   NaN      NaN   NaN       NaN           NaN        NaN
8             Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
9   2011-10-01 00:00:00  6322158     E                      29          A
10  2011-11-01 00:00:00  6322158     E                      29          A

Meanwhile, our little friend pd.read_csv(), doesn't seem to have the same hangups. Creating a csv directly from the original .xlsx and performing the same operations:

In [9]: df = pd.read_csv('example.csv')
In [10]: df
Out[10]: 
    Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
0      Oct-11  1892213     E      8111            29      ALTOU
1      Nov-11  1892213     E      8111            29      ALTOU
2         NaN      NaN   NaN       NaN           NaN        NaN
3         NaN      NaN   NaN       NaN           NaN        NaN
4   Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
5      Oct-11   553961     G      8111            29        GN3
6      Nov-11   553961     G      8111            29        GN3
7         NaN      NaN   NaN       NaN           NaN        NaN
8   Bill Date  Meter #  Type  SIC Code  Billing Days  Rate Code
9      Oct-11  6322158     E                      29          A
10     Nov-11  6322158     E                      29          A

Looks fine, and now passing the skiprows option, and I get a correctly parsed df with all the guilty lines missing:

In [11]: df = pd.read_csv('example.csv', skiprows=skip_id+1)
In [12]: df
Out[12]: 
  Bill Date  Meter # Type SIC Code  Billing Days Rate Code
0    Oct-11  1892213    E     8111            29     ALTOU
1    Nov-11  1892213    E     8111            29     ALTOU
2    Oct-11   553961    G     8111            29       GN3
3    Nov-11   553961    G     8111            29       GN3
4    Oct-11  6322158    E                     29         A
5    Nov-11  6322158    E                     29         A

Started tracing for troubleshooting, but after the 5th handoff of the skiprows parameter all over in .io, I gave up. :(

Thoughts?

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Jul 21, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants