-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
[BUG] read_excel: fixes handling of multi index header and other corner cases #58899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3dc434f
to
77c02e5
Compare
Hi, I can't see why check |
This comment was marked as resolved.
This comment was marked as resolved.
Looks like I don't have the permissions to do it, but you should be able to autofix by commenting |
pre-commit.ci autofix |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
pandas/io/excel/_base.py
Outdated
ixmap = list(range(len(data))) | ||
elif is_integer(skiprows): | ||
ixmap = list(range(skiprows, len(data))) | ||
elif is_list_like(skiprows): | ||
skiprows_set = set(cast(Sequence[int], skiprows)) | ||
ixmap = [ix for ix, _ in enumerate(data) if ix not in skiprows_set] | ||
elif callable(skiprows): | ||
ixmap = [ix for ix, _ in enumerate(data) if not skiprows(ix)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is adding a third place where we call the skiprows function on every row of the data. This seems quite expensive, perhaps it'd be better to pass skiprows
to get_sheet_data
to avoid getting it in the first place. Then the rest of the implementation could proceed with skiprows=None
, simplifying the logic.
The current implementation here also seems unnecessarily expensive when skiprows
is None, which I think is a common case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, now when skiprows is a callable, it is called on every row and the result cached so that effectively, it is only called once per row.
read_excel function has several bugs regarding how it handles combinations of header, skiprows and index_col arguments. The tests here showcase some of them.
I have rebased my branch on top of the current "main" branch, because I was having some issues with the test suite unrelated to my code. |
Now, ixmap is not coerced into a list in order to reduce the overhead when |
The logic in that method was not handling correctly all the possible combinations of skiprows, header and index_col arguments. Specifically: - it was not able to handle correctly multi index header with holes (for instance, `header=[0,2]`. - multi index header and skiprows given as lists. - forward filling index columns and skiprows gigen as lists. - inconsistences processing one-element list arguments (for instance, `header=1` and `header=[1]` or `index_col=0` and `index_col=[0]` where handled differently). The logic has been revamped, because it was not possible to fix all the errors with local changes. The mayor challenge was handling skiprows as a list, as it may remove rows at any place (before, between or after header(s), index names and data). Also, header row indexes reference rows **after** removing skiprows. To handle that we use an intermediate mapping `ixmap` which goes from the row indixes with skiprows removed to the row indixes in `data`. Finally, let me add that IMO, most of the functionality of _parse_sheet should be moved down into TextParser... but that's work for another day!
pre-commit.ci autofix |
For callable skiprows arguments, the callback was being called several times in every row. Now, it is called just once and the result cached.
for more information, see https://pre-commit.ci
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
not stale! |
# a row containing just the index name(s) | ||
has_index_names = False | ||
if is_list_header and not is_len_one_list_header and index_col is not None: | ||
index_col_has_names = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used one place below; instead can you replace it with isinstance(index_col, str)
to simplify the code.
raise ValueError( | ||
"named index_col can not be used together " | ||
"with multi-index header" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this restriction new?
|
||
if has_index: | ||
header_name, _ = pop_header_name( | ||
data[row1], sorted(index_col_set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why sort?
header_name, _ = pop_header_name( | ||
data[row1], sorted(index_col_set) | ||
) | ||
if header_name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the case header_name being None hit by tests?
@@ -1759,3 +1759,53 @@ def test_corrupt_files_closed(self, engine, tmp_excel): | |||
pd.ExcelFile(tmp_excel, engine=engine) | |||
except errors: | |||
pass | |||
|
|||
def test_mi_header_skiprows1(self, engine, read_ext): | |||
if engine is None and read_ext == ".xlsx": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add references to the issue as a comment in the first line of each test:
# GH#58898
with open("test_mi_holes.xlsx", "rb") as f: | ||
expected = pd.read_excel( | ||
f, sheet_name="expected", header=[0, 1], index_col=[0, 1] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding files to the data/ directory, is it possible to generate these sheets on-the-fly? E.g.
df = pd.DataFrame(...)
df.to_excel(tmp_excel, sheet_name="test", index=False)
tmp_excel
here is a pytest fixture.
It would be great if a fix for this issue could be implemented quickly as it is fundamental to properly reading in data from excel files. I would suggest that this should be VERY HIGH PRIORITY. |
Note, I just tried https://github.com/PL-SalvadorFandino/pandas/tree/holes and it does NOT solve the following issue for me: Consider an excel sheet with the following content (three multiindex header rows with a blank row separating rows 1 and 3):
When reading in the above excel sheet using pd.read_excel(..., index_col=0, header=[0,1,3], skiprows = lambda row: row not in [0,1,3,4,5]) gives the following output (index_col missing last data row, and other columns missing last header row):
Thinking that header=[...] indices might refer to row indices after removing skipped rows, I tried: pd.read_excel(..., index_col=0, header=[0,1,2], skiprows = lambda row: row not in [0,1,3,4,5]) which gives the following output (index_col missing last header row, and other columns ok):
Note that if the index_col is ignored, then everything works if one assumes header=[...] denotes row indices after removing skipped rows (I find this HUGELY non-intuitive): pd.read_excel(..., header=[0,1,2], skiprows = lambda row: row not in [0,1,3,4,5]) which gives the correct output (albeit the desired index column is now a regular column):
I have the suspicion that index_col and other columns may be treating the indices passed in header=[...] differently (e.g., row indices as in the excel sheet or row indices after removing skipped rows)? If so, I STRONGLY suggest that all row indices should be relative to the excel file, not the remaining rows after extracting the desired rows. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.