[BUG] read_excel: fixes handling of multi index header and other corner cases #58899

PL-SalvadorFandino · 2024-06-02T18:33:16Z

closes BUG: read_excel multiindex head with holes #58898
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

PL-SalvadorFandino · 2024-06-02T20:13:23Z

Hi, I can't see why check pre-commit.ci - pr is failing!

Aloqeely · 2024-06-03T08:53:47Z

Looks like I don't have the permissions to do it, but you should be able to autofix by commenting pre-commit.ci autofix

PL-SalvadorFandino · 2024-06-03T08:55:05Z

pre-commit.ci autofix

github-actions · 2024-07-04T00:06:06Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

rhshadrach · 2024-07-04T13:16:34Z

pandas/io/excel/_base.py

+                ixmap = list(range(len(data)))
+            elif is_integer(skiprows):
+                ixmap = list(range(skiprows, len(data)))
+            elif is_list_like(skiprows):
+                skiprows_set = set(cast(Sequence[int], skiprows))
+                ixmap = [ix for ix, _ in enumerate(data) if ix not in skiprows_set]
+            elif callable(skiprows):
+                ixmap = [ix for ix, _ in enumerate(data) if not skiprows(ix)]


I believe this is adding a third place where we call the skiprows function on every row of the data. This seems quite expensive, perhaps it'd be better to pass skiprows to get_sheet_data to avoid getting it in the first place. Then the rest of the implementation could proceed with skiprows=None, simplifying the logic.

The current implementation here also seems unnecessarily expensive when skiprows is None, which I think is a common case.

Ok, now when skiprows is a callable, it is called on every row and the result cached so that effectively, it is only called once per row.

read_excel function has several bugs regarding how it handles combinations of header, skiprows and index_col arguments. The tests here showcase some of them.

PL-SalvadorFandino · 2024-07-08T15:56:28Z

I have rebased my branch on top of the current "main" branch, because I was having some issues with the test suite unrelated to my code.

PL-SalvadorFandino · 2024-07-08T16:46:23Z

Now, ixmap is not coerced into a list in order to reduce the overhead when skiprows is None or an integer.

The logic in that method was not handling correctly all the possible combinations of skiprows, header and index_col arguments. Specifically: - it was not able to handle correctly multi index header with holes (for instance, `header=[0,2]`. - multi index header and skiprows given as lists. - forward filling index columns and skiprows gigen as lists. - inconsistences processing one-element list arguments (for instance, `header=1` and `header=[1]` or `index_col=0` and `index_col=[0]` where handled differently). The logic has been revamped, because it was not possible to fix all the errors with local changes. The mayor challenge was handling skiprows as a list, as it may remove rows at any place (before, between or after header(s), index names and data). Also, header row indexes reference rows **after** removing skiprows. To handle that we use an intermediate mapping `ixmap` which goes from the row indixes with skiprows removed to the row indixes in `data`. Finally, let me add that IMO, most of the functionality of _parse_sheet should be moved down into TextParser... but that's work for another day!

PL-SalvadorFandino · 2024-07-08T19:09:09Z

pre-commit.ci autofix

For callable skiprows arguments, the callback was being called several times in every row. Now, it is called just once and the result cached.

for more information, see https://pre-commit.ci

PL-SalvadorFandino · 2024-07-08T19:56:48Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

PL-SalvadorFandino · 2024-07-11T10:43:55Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

not stale!

rhshadrach · 2024-07-21T12:53:58Z

pandas/io/excel/_base.py

-        # a row containing just the index name(s)
-        has_index_names = False
-        if is_list_header and not is_len_one_list_header and index_col is not None:
+            index_col_has_names = False


This is only used one place below; instead can you replace it with isinstance(index_col, str) to simplify the code.

rhshadrach · 2024-07-21T13:08:27Z

pandas/io/excel/_base.py

+                        raise ValueError(
+                            "named index_col can not be used together "
+                            "with multi-index header"
+                        )


Is this restriction new?

rhshadrach · 2024-07-21T13:18:15Z

pandas/io/excel/_base.py

+
+                        if has_index:
+                            header_name, _ = pop_header_name(
+                                data[row1], sorted(index_col_set)


rhshadrach · 2024-07-21T13:19:27Z

pandas/io/excel/_base.py

+                            header_name, _ = pop_header_name(
+                                data[row1], sorted(index_col_set)
+                            )
+                            if header_name:


Is the case header_name being None hit by tests?

rhshadrach · 2024-07-21T13:22:50Z

pandas/tests/io/excel/test_readers.py

@@ -1759,3 +1759,53 @@ def test_corrupt_files_closed(self, engine, tmp_excel):
                pd.ExcelFile(tmp_excel, engine=engine)
            except errors:
                pass
+
+    def test_mi_header_skiprows1(self, engine, read_ext):
+        if engine is None and read_ext == ".xlsx":


Can you add references to the issue as a comment in the first line of each test:

# GH#58898

rhshadrach · 2024-07-21T13:26:08Z

pandas/tests/io/excel/test_readers.py

+            with open("test_mi_holes.xlsx", "rb") as f:
+                expected = pd.read_excel(
+                    f, sheet_name="expected", header=[0, 1], index_col=[0, 1]
+                )


Instead of adding files to the data/ directory, is it possible to generate these sheets on-the-fly? E.g.

df = pd.DataFrame(...) df.to_excel(tmp_excel, sheet_name="test", index=False)

tmp_excel here is a pytest fixture.

marcel-goldschen-ohm · 2024-08-17T20:48:59Z

It would be great if a fix for this issue could be implemented quickly as it is fundamental to properly reading in data from excel files. I would suggest that this should be VERY HIGH PRIORITY.

marcel-goldschen-ohm · 2024-08-17T21:26:01Z

Note, I just tried https://github.com/PL-SalvadorFandino/pandas/tree/holes and it does NOT solve the following issue for me:

Consider an excel sheet with the following content (three multiindex header rows with a blank row separating rows 1 and 3):

Group, A, A, B, B
ID, 0, 1, 0, 1
 , , , , 
X, Y, Y, Y, Y
100, 0, 1, 2, 3
200, 4, 5, 6, 7

When reading in the above excel sheet using pandas.read_excel I get mangled data. For example,

pd.read_excel(..., index_col=0, header=[0,1,3], skiprows = lambda row: row not in [0,1,3,4,5])

gives the following output (index_col missing last data row, and other columns missing last header row):

Group, A, A, B, B
ID, 0, 1, 0, 1
X, 0, 1, 2, 3
200, 4, 5, 6, 7

Thinking that header=[...] indices might refer to row indices after removing skipped rows, I tried:

pd.read_excel(..., index_col=0, header=[0,1,2], skiprows = lambda row: row not in [0,1,3,4,5])

which gives the following output (index_col missing last header row, and other columns ok):

Group, A, A, B, B
ID, 0, 1, 0, 1
 , Y, Y, Y, Y
100, 0, 1, 2, 3
200, 4, 5, 6, 7

Note that if the index_col is ignored, then everything works if one assumes header=[...] denotes row indices after removing skipped rows (I find this HUGELY non-intuitive):

pd.read_excel(..., header=[0,1,2], skiprows = lambda row: row not in [0,1,3,4,5])

which gives the correct output (albeit the desired index column is now a regular column):

Group, A, A, B, B
ID, 0, 1, 0, 1
X, Y, Y, Y, Y
100, 0, 1, 2, 3
200, 4, 5, 6, 7

I have the suspicion that index_col and other columns may be treating the indices passed in header=[...] differently (e.g., row indices as in the excel sheet or row indices after removing skipped rows)? If so, I STRONGLY suggest that all row indices should be relative to the excel file, not the remaining rows after extracting the desired rows.

mroeschke · 2024-09-09T17:49:27Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

PL-SalvadorFandino requested a review from rhshadrach as a code owner June 2, 2024 18:33

PL-SalvadorFandino changed the title ~~Holes~~ [BUG] read_excel: fixes handling of multi index header and other corner cases Jun 2, 2024

PL-SalvadorFandino force-pushed the holes branch 2 times, most recently from 3dc434f to 77c02e5 Compare June 2, 2024 20:07

PL-SalvadorFandino force-pushed the holes branch from 77c02e5 to 1615371 Compare June 2, 2024 20:19

This comment was marked as resolved.

Sign in to view

github-actions bot added the Stale label Jul 4, 2024

rhshadrach reviewed Jul 4, 2024

View reviewed changes

PL-SalvadorFandino force-pushed the holes branch from 1ee6491 to b185fac Compare July 8, 2024 15:48

TST: read_excel: test multiindex headers with skip rows between.

ba8e8a1

read_excel function has several bugs regarding how it handles combinations of header, skiprows and index_col arguments. The tests here showcase some of them.

PL-SalvadorFandino force-pushed the holes branch from 2071f3e to c2dab1b Compare July 8, 2024 15:53

PL-SalvadorFandino force-pushed the holes branch from 38d8ef5 to de75c25 Compare July 8, 2024 18:39

PL-SalvadorFandino force-pushed the holes branch from d79a53a to e2cf329 Compare July 8, 2024 18:55

salva and others added 2 commits July 8, 2024 21:52

read_excel: call skiprows callback just once per row.

6ad5db5

For callable skiprows arguments, the callback was being called several times in every row. Now, it is called just once and the result cached.

[pre-commit.ci] auto fixes from pre-commit.com hooks

a23e91f

for more information, see https://pre-commit.ci

PL-SalvadorFandino force-pushed the holes branch from d8358e6 to a23e91f Compare July 8, 2024 19:53

[pre-commit.ci] auto fixes from pre-commit.com hooks

33e0d06

for more information, see https://pre-commit.ci

rhshadrach added Bug IO Excel read_excel, to_excel and removed Stale labels Jul 14, 2024

rhshadrach requested changes Jul 21, 2024

View reviewed changes

marcel-goldschen-ohm mentioned this pull request Aug 18, 2024

BUG: xl.parse index_col ignoring skiprows #50953

Open

3 tasks

mroeschke closed this Sep 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] read_excel: fixes handling of multi index header and other corner cases #58899

[BUG] read_excel: fixes handling of multi index header and other corner cases #58899

PL-SalvadorFandino commented Jun 2, 2024 •

edited

Loading

PL-SalvadorFandino commented Jun 2, 2024 •

edited

Loading

This comment was marked as resolved.

Aloqeely commented Jun 3, 2024

PL-SalvadorFandino commented Jun 3, 2024

github-actions bot commented Jul 4, 2024

rhshadrach Jul 4, 2024 •

edited

Loading

PL-SalvadorFandino Jul 8, 2024

PL-SalvadorFandino commented Jul 8, 2024

PL-SalvadorFandino commented Jul 8, 2024

PL-SalvadorFandino commented Jul 8, 2024

PL-SalvadorFandino commented Jul 8, 2024

PL-SalvadorFandino commented Jul 11, 2024

rhshadrach Jul 21, 2024

rhshadrach Jul 21, 2024

rhshadrach Jul 21, 2024

rhshadrach Jul 21, 2024

rhshadrach Jul 21, 2024

rhshadrach Jul 21, 2024

marcel-goldschen-ohm commented Aug 17, 2024

marcel-goldschen-ohm commented Aug 17, 2024

mroeschke commented Sep 9, 2024

[BUG] read_excel: fixes handling of multi index header and other corner cases #58899

[BUG] read_excel: fixes handling of multi index header and other corner cases #58899

Conversation

PL-SalvadorFandino commented Jun 2, 2024 • edited Loading

PL-SalvadorFandino commented Jun 2, 2024 • edited Loading

This comment was marked as resolved.

Aloqeely commented Jun 3, 2024

PL-SalvadorFandino commented Jun 3, 2024

github-actions bot commented Jul 4, 2024

rhshadrach Jul 4, 2024 • edited Loading

Choose a reason for hiding this comment

PL-SalvadorFandino Jul 8, 2024

Choose a reason for hiding this comment

PL-SalvadorFandino commented Jul 8, 2024

PL-SalvadorFandino commented Jul 8, 2024

PL-SalvadorFandino commented Jul 8, 2024

PL-SalvadorFandino commented Jul 8, 2024

PL-SalvadorFandino commented Jul 11, 2024

rhshadrach Jul 21, 2024

Choose a reason for hiding this comment

rhshadrach Jul 21, 2024

Choose a reason for hiding this comment

rhshadrach Jul 21, 2024

Choose a reason for hiding this comment

rhshadrach Jul 21, 2024

Choose a reason for hiding this comment

rhshadrach Jul 21, 2024

Choose a reason for hiding this comment

rhshadrach Jul 21, 2024

Choose a reason for hiding this comment

marcel-goldschen-ohm commented Aug 17, 2024

marcel-goldschen-ohm commented Aug 17, 2024

mroeschke commented Sep 9, 2024

PL-SalvadorFandino commented Jun 2, 2024 •

edited

Loading

PL-SalvadorFandino commented Jun 2, 2024 •

edited

Loading

rhshadrach Jul 4, 2024 •

edited

Loading