-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
[WIP] Excel table output #24899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Excel table output #24899
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Gave it a very brief review but the critical thing missing here is tests. Ideally should have those first and foremost - can you add some for the expected output here?
@WillAyd, can you offer some guidance on how to test this functionality? I presume there is no suitable way to involve MS Excel in the process, what should be tested? If the function runs without errors? If the data can be loaded from the excel file? As far as I know there is not a simple way to load an excel table in python referenced by the e.g. tablename. |
Sure - check pandas/tests/io/data you will see Excel files created which explicitly show the intended layout. The corresponding writing test will compare a generated file against what was expected.
…Sent from my iPhone
On Jan 24, 2019, at 5:28 AM, Thijs Damsma ***@***.***> wrote:
@WillAyd, can you offer some guidance on how to test this functionality? I presume there is no suitable way to involve MS Excel in the process, what should be tested? If the function runs without errors? If the data can be loaded from the excel file? As far as I know there is not a simple way to load an excel table in python referenced by the e.g. tablename.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Codecov Report
@@ Coverage Diff @@
## master #24899 +/- ##
==========================================
- Coverage 92.38% 42.88% -49.5%
==========================================
Files 166 166
Lines 52398 52418 +20
==========================================
- Hits 48406 22480 -25926
- Misses 3992 29938 +25946
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24899 +/- ##
==========================================
- Coverage 92.83% 92.38% -0.45%
==========================================
Files 182 166 -16
Lines 50430 52403 +1973
==========================================
+ Hits 46815 48411 +1596
- Misses 3615 3992 +377
Continue to review full report at Codecov.
|
@WillAyd, I added a simple round-trip test. I tried to look for a example on how to check formatting, but the only test for formatting is completely commented out ( |
@tdamsma yea this might be tricky; not sure of best approach off the top of my head, but are there any testing facilities you know of with say openpyxl that would allow us to make stronger assertions about what is being written out? As is now there's just no way that the tests ensure this is actually working |
@WillAyd, I wasn't aware of any options, but I just had a better look at openpyxl and this is indeed possible. In essence we need a function that can read an excel table into a dataframe, which would also be quite useful for pandas in general. |
@tdamsma given conversation in associated issue number and some of the other pre-cursors going to close this one for now to keep our queue minimal. If you plan on addressing soon let me know, otherwise can reopen as we get back around to this |
# Conflicts: # pandas/core/generic.py # pandas/io/excel.py # pandas/io/formats/excel.py # pandas/tests/io/test_excel.py
1aab7ca
to
cbc5d01
Compare
cbc5d01
to
86d5499
Compare
pandas/core/generic.py
Outdated
@@ -2182,8 +2182,8 @@ def _repr_data_resource_(self): | |||
freeze_panes : tuple of int (length 2), optional | |||
Specifies the one-based bottommost row and rightmost column that | |||
is to be frozen. | |||
|
|||
.. versionadded:: 0.20.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to keep this
pandas/io/excel/_xlsxwriter.py
Outdated
@@ -235,3 +235,51 @@ def write_cells( | |||
) | |||
else: | |||
wks.write(startrow + cell.row, startcol + cell.col, val, style) | |||
|
|||
def write_table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this needs to be a separate method at all. Can't you just use the normal write method and apply the table after cells are written out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that can be done, but there are several differences with the normal write function so don't think the result will be that nice. Key differences:
- merge cells are not allowed for tables
- header needs to be dealt with differently
- individual call styling makes little sense (don't think it should be supported or at least discouraged) for tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disregard the above, I think I can make it work with not too big adjustments
with ensure_clean(ext) as pth: | ||
df.to_excel(pth, header=True, table="Table1") | ||
|
||
xf = ExcelFile(pth) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use this as a context manager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? This style is consistent with all the other tests
result = pd.read_excel(xf, xf.sheet_names[0], index_col=0) | ||
|
||
tm.assert_frame_equal(result, df) | ||
assert result.index.name == "foo" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better assertion to be had to make sure a table actually appears in the excel file? read_excel
isn't going to care about Table styles so I think need to approach another way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests definitely need some work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, implemented a proof of concept. Don't really like it to be hones as for openpyxl the default formatting needs to be disabled which was not really supported, and for xlsxwriter the column names need to be passed to the table explicitly, otherwise previously written header cells are over written with generated default names.
Unfortunately need some ugly hacks
@@ -2185,6 +2185,9 @@ def _repr_data_resource_(self): | |||
|
|||
.. versionadded:: 0.20.0. | |||
|
|||
table : string, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would accepting a TableStyle be an option instead (speaking only in openpyxl terms, not sure if xlsxwriter offers that)? I feel like that could do the same thing but also give users more power over output formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both accept a style, but of course the syntax is a bit different: openpyxl uses "TableStyleMedium9"
as default, and xlsxwriter uses "Table Style Medium 9"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm wasn't thinking about that as much as having the user pass in an actual object itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean something like?
style = TableStyleInfo(name="TableStyleMedium9", showFirstColumn=False,
showLastColumn=False, showRowStripes=True, showColumnStripes=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right that's what I had in mind (not tied to it just asking)
pandas/io/excel/_openpyxl.py
Outdated
@@ -389,6 +389,18 @@ def _convert_to_protection(cls, protection_dict): | |||
|
|||
return Protection(**protection_dict) | |||
|
|||
@classmethod | |||
def _to_excel_range(cls, startrow, startcol, endrow, endcol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using this can you just pass a CellRange into openpyxl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but doesn't the CellRange constructor only accept a string of the form A1:B2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think can use min_col
, min_row
, max_col
, max_row
from constructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed, the cell range can be constructed from those arguments, however the Table constructor itself really only accepts a string. Internally also the string representation is used, so I don't see a way around this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think can just cast to a str then:
>>> from openpyxl.worksheet.cell_range import CellRange
>>> str(CellRange(min_col=1, min_row=1, max_col=3, max_row=3))
'A1:C3'
That's what we are ultimately after here right? (save maybe off-by-one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was what I expected, but I was on openpyxl 2.4.8, and this apparently a 2.5.0 feature. Assuming updating the minimum supported version is not an issue I will fix this
for cell in cells: | ||
n_cols = max(n_cols, cell.col) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these be inferred outside of the loop from the dimensions of the frame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is quite some code translating a frame to excel cells, dealing with multiindexes etc. So this is not too straight forward. But that code in intertwined with the formatting code, so I considered the following options:
- get size from frame, try to deal with edge cases for multiindex, index True/False, header True/False etc
- get result from the get_formatted_cells iterator, run through it a second time
- bypass the normal writer function altogether and use a separate, dedicated write_table function
- extract the size from the writer function
I picked the latter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK. At least outside the loop though wouldn't the nrows be len(cells) and the ncols just be the length of any item within cells?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately the cells are a 1D iterator of items (each with a row and col property), not a list of rows.
@@ -408,19 +418,7 @@ def __init__( | |||
self.header = header | |||
self.merge_cells = merge_cells | |||
self.inf_rep = inf_rep | |||
|
|||
@property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To provide an interface to override (i.e. disable) the cell styling
I think this has gone a little stale. Our PR queue has grown quite large so closing for now to keep things clean but ping if you'd like to pick back up! |
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR should allow writing excel table objects. Proof of concept implemented for XlsxWriter and OpenPyXL. Exactly which options of the
to_excel
api should be supported is up for discussion, see #24862.