Skip to content

[BUG] read_excel: fixes handling of multi index header and other corner cases #58899

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

PL-SalvadorFandino
Copy link

@PL-SalvadorFandino PL-SalvadorFandino commented Jun 2, 2024

@PL-SalvadorFandino PL-SalvadorFandino changed the title Holes [BUG] read_excel: fixes handling of multi index header and other corner cases Jun 2, 2024
@PL-SalvadorFandino PL-SalvadorFandino force-pushed the holes branch 2 times, most recently from 3dc434f to 77c02e5 Compare June 2, 2024 20:07
@PL-SalvadorFandino
Copy link
Author

PL-SalvadorFandino commented Jun 2, 2024

Hi, I can't see why check pre-commit.ci - pr is failing!

@Aloqeely

This comment was marked as resolved.

@Aloqeely
Copy link
Member

Aloqeely commented Jun 3, 2024

Looks like I don't have the permissions to do it, but you should be able to autofix by commenting pre-commit.ci autofix

@PL-SalvadorFandino
Copy link
Author

pre-commit.ci autofix

Copy link
Contributor

github-actions bot commented Jul 4, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 4, 2024
Comment on lines 825 to 832
ixmap = list(range(len(data)))
elif is_integer(skiprows):
ixmap = list(range(skiprows, len(data)))
elif is_list_like(skiprows):
skiprows_set = set(cast(Sequence[int], skiprows))
ixmap = [ix for ix, _ in enumerate(data) if ix not in skiprows_set]
elif callable(skiprows):
ixmap = [ix for ix, _ in enumerate(data) if not skiprows(ix)]
Copy link
Member

@rhshadrach rhshadrach Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is adding a third place where we call the skiprows function on every row of the data. This seems quite expensive, perhaps it'd be better to pass skiprows to get_sheet_data to avoid getting it in the first place. Then the rest of the implementation could proceed with skiprows=None, simplifying the logic.

The current implementation here also seems unnecessarily expensive when skiprows is None, which I think is a common case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now when skiprows is a callable, it is called on every row and the result cached so that effectively, it is only called once per row.

read_excel function has several bugs regarding how it handles
combinations of header, skiprows and index_col arguments.

The tests here showcase some of them.
@PL-SalvadorFandino
Copy link
Author

I have rebased my branch on top of the current "main" branch, because I was having some issues with the test suite unrelated to my code.

@PL-SalvadorFandino
Copy link
Author

Now, ixmap is not coerced into a list in order to reduce the overhead when skiprows is None or an integer.

The logic in that method was not handling correctly all the possible
combinations of skiprows, header and index_col arguments.

Specifically:

- it was not able to handle correctly multi index header
  with holes (for instance, `header=[0,2]`.

- multi index header and skiprows given as lists.

- forward filling index columns and skiprows gigen as lists.

- inconsistences processing one-element list arguments (for instance,
  `header=1` and `header=[1]` or `index_col=0` and `index_col=[0]` where
  handled differently).

The logic has been revamped, because it was not possible to fix all
the errors with local changes.

The mayor challenge was handling skiprows as a list, as it may remove
rows at any place (before, between or after header(s), index names and
data). Also, header row indexes reference rows **after** removing
skiprows.

To handle that we use an intermediate mapping `ixmap` which goes from
the row indixes with skiprows removed to the row indixes in `data`.

Finally, let me add that IMO, most of the functionality of
_parse_sheet should be moved down into TextParser... but that's work
for another day!
@PL-SalvadorFandino
Copy link
Author

pre-commit.ci autofix

salva and others added 2 commits July 8, 2024 21:52
For callable skiprows arguments, the callback was being called several
times in every row.

Now, it is called just once and the result cached.
@PL-SalvadorFandino
Copy link
Author

pre-commit.ci autofix

@PL-SalvadorFandino
Copy link
Author

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

not stale!

@rhshadrach rhshadrach added Bug IO Excel read_excel, to_excel and removed Stale labels Jul 14, 2024
# a row containing just the index name(s)
has_index_names = False
if is_list_header and not is_len_one_list_header and index_col is not None:
index_col_has_names = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used one place below; instead can you replace it with isinstance(index_col, str) to simplify the code.

Comment on lines +888 to +891
raise ValueError(
"named index_col can not be used together "
"with multi-index header"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this restriction new?


if has_index:
header_name, _ = pop_header_name(
data[row1], sorted(index_col_set)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why sort?

header_name, _ = pop_header_name(
data[row1], sorted(index_col_set)
)
if header_name:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the case header_name being None hit by tests?

@@ -1759,3 +1759,53 @@ def test_corrupt_files_closed(self, engine, tmp_excel):
pd.ExcelFile(tmp_excel, engine=engine)
except errors:
pass

def test_mi_header_skiprows1(self, engine, read_ext):
if engine is None and read_ext == ".xlsx":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add references to the issue as a comment in the first line of each test:

# GH#58898

Comment on lines +1765 to +1768
with open("test_mi_holes.xlsx", "rb") as f:
expected = pd.read_excel(
f, sheet_name="expected", header=[0, 1], index_col=[0, 1]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding files to the data/ directory, is it possible to generate these sheets on-the-fly? E.g.

df = pd.DataFrame(...)
df.to_excel(tmp_excel, sheet_name="test", index=False)

tmp_excel here is a pytest fixture.

@marcel-goldschen-ohm
Copy link

It would be great if a fix for this issue could be implemented quickly as it is fundamental to properly reading in data from excel files. I would suggest that this should be VERY HIGH PRIORITY.

@marcel-goldschen-ohm
Copy link

Note, I just tried https://github.com/PL-SalvadorFandino/pandas/tree/holes and it does NOT solve the following issue for me:

Consider an excel sheet with the following content (three multiindex header rows with a blank row separating rows 1 and 3):

Group, A, A, B, B
ID, 0, 1, 0, 1
 , , , , 
X, Y, Y, Y, Y
100, 0, 1, 2, 3
200, 4, 5, 6, 7

When reading in the above excel sheet using pandas.read_excel I get mangled data. For example,

pd.read_excel(..., index_col=0, header=[0,1,3], skiprows = lambda row: row not in [0,1,3,4,5])

gives the following output (index_col missing last data row, and other columns missing last header row):

Group, A, A, B, B
ID, 0, 1, 0, 1
X, 0, 1, 2, 3
200, 4, 5, 6, 7

Thinking that header=[...] indices might refer to row indices after removing skipped rows, I tried:

pd.read_excel(..., index_col=0, header=[0,1,2], skiprows = lambda row: row not in [0,1,3,4,5])

which gives the following output (index_col missing last header row, and other columns ok):

Group, A, A, B, B
ID, 0, 1, 0, 1
 , Y, Y, Y, Y
100, 0, 1, 2, 3
200, 4, 5, 6, 7

Note that if the index_col is ignored, then everything works if one assumes header=[...] denotes row indices after removing skipped rows (I find this HUGELY non-intuitive):

pd.read_excel(..., header=[0,1,2], skiprows = lambda row: row not in [0,1,3,4,5])

which gives the correct output (albeit the desired index column is now a regular column):

Group, A, A, B, B
ID, 0, 1, 0, 1
X, Y, Y, Y, Y
100, 0, 1, 2, 3
200, 4, 5, 6, 7

I have the suspicion that index_col and other columns may be treating the indices passed in header=[...] differently (e.g., row indices as in the excel sheet or row indices after removing skipped rows)? If so, I STRONGLY suggest that all row indices should be relative to the excel file, not the remaining rows after extracting the desired rows.

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_excel multiindex head with holes
6 participants