fix for 61123 read_excel-nrows-param-reads-extra-rows #61127

@zanuka zanuka commented Mar 15, 2025

Issue: GH-61123
When reading an Excel file with pd.read_excel and nrows=4, the result depends on whether a blank row separates the tables in the sheet. For a sheet with two stacked tables (each with a header and 3 data rows), nrows=4 should yield a DataFrame holding only the first table: one header and 3 data rows (shape (3, n)). However:

  • In test1.xlsx (with a blank row), it correctly reads the first table (header + 3 rows).
  • In test2.xlsx (no blank row), it incorrectly includes the second table’s header as a data row, resulting in a shape of (4, n).

This inconsistency occurs because read_excel doesn’t properly respect table boundaries when tables are adjacent, despite the nrows limit.
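
The two cases can be reproduced with a short script. This is a minimal sketch, assuming openpyxl is installed; the PR does not show the contents of test1.xlsx and test2.xlsx, so the column names (a, b), the cell values, and the helper build_workbook are hypothetical stand-ins built in memory.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook

# Hypothetical table layout: the real test1.xlsx / test2.xlsx are not shown in the PR.
HEADER = ["a", "b"]
TABLE = [[1, 2], [3, 4], [5, 6]]


def build_workbook(blank_row_between: bool) -> BytesIO:
    """Write two stacked tables to an in-memory .xlsx file."""
    rows = [HEADER, *TABLE]
    if blank_row_between:
        rows.append([None, None])  # empty separator row between the tables
    rows += [HEADER, *TABLE]

    wb = Workbook()
    ws = wb.active
    for row in rows:
        ws.append(row)

    buf = BytesIO()
    wb.save(buf)
    buf.seek(0)
    return buf


# Same call on both layouts; only the separator row differs.
df_blank = pd.read_excel(build_workbook(True), nrows=4, engine="openpyxl")
df_adjacent = pd.read_excel(build_workbook(False), nrows=4, engine="openpyxl")

print("blank row between tables:", df_blank.shape)
print("adjacent tables:", df_adjacent.shape)
```

Comparing the two printed shapes shows whether the installed pandas version treats the layouts differently, as described in the report above.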

Fix:

  • Modified pandas/io/excel/_base.py and related reader modules (_openpyxl.py, _pyxlsb.py, _xlrd.py) to ensure nrows limits reading to the specified number of rows, excluding subsequent table headers even when tables are adjacent.
  • Added a new test test_excel_read_tables_with_and_without_blank_row in pandas/tests/io/excel/test_readers.py to verify that nrows=4 consistently returns a DataFrame with shape (3, 2) (header + 3 data rows) for both cases.

Changes:

  • Updated Excel reader logic to stop at nrows without parsing beyond table boundaries.
  • Ensured consistent behavior across openpyxl, pyxlsb, and xlrd engines.
  • Squashed commits into a single commit for clarity.

Verification:

  • Tested with test1.xlsx (blank row) and test2.xlsx (no blank row).
  • Confirmed both now yield a DataFrame with shape (3, 2) and only the first table’s data.

Steps to Test:

  1. Run pytest pandas/tests/io/excel/test_readers.py::TestReaders::test_excel_read_tables_with_and_without_blank_row.
  2. Verify that df1.shape == (3, 2) and df2.shape == (3, 2).

Related Files:

  • pandas/io/excel/_base.py
  • pandas/io/excel/_openpyxl.py
  • pandas/io/excel/_pyxlsb.py
  • pandas/io/excel/_xlrd.py
  • pandas/tests/io/excel/test_readers.py

Closes #61123

⚡️ Commit from Jolt AI ⚡️

Fix Excel Test Indentation (https://app.usejolt.ai/code-chat/0d4546cc-38b6-4754-ae0a-55afa71f01ab)

Description:
Fix Excel Test Indentation

fixes tests
@zanuka zanuka requested a review from rhshadrach as a code owner March 15, 2025 05:07
@zanuka zanuka closed this Mar 15, 2025
Successfully merging this pull request may close these issues.

DOC: read_excel nrows parameter reads extra rows when tables are adjacent (no blank row)