Skip to content

[WIP] Excel table output #24899

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 17 commits into from
Closed

Conversation

tdamsma
Copy link
Contributor

@tdamsma tdamsma commented Jan 24, 2019

This PR should allow writing excel table objects. Proof of concept implemented for XlsxWriter and OpenPyXL. Exactly which options of the to_excel api should be supported is up for discussion, see #24862.

import pandas as pd
data = {'B' : dict(col1=1), 'A' : dict(col1=2), 'C' :dict(col1=3, col2=4.1)}
df = pd.DataFrame.from_dict(data, orient='index')

df.to_excel('test1.xlsx', engine='xlsxwriter', as_table=True, index=True)
df.to_excel('test2.xlsx', engine='openpyxl', as_table=True, index=False)

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Gave it a very brief review but the critical thing missing here is tests. Ideally should have those first and foremost - can you add some for the expected output here?

@WillAyd WillAyd added the IO Excel read_excel, to_excel label Jan 24, 2019
@tdamsma
Copy link
Contributor Author

tdamsma commented Jan 24, 2019

@WillAyd, can you offer some guidance on how to test this functionality? I presume there is no suitable way to involve MS Excel in the process, what should be tested? If the function runs without errors? If the data can be loaded from the excel file? As far as I know there is not a simple way to load an excel table in python referenced by the e.g. tablename.

@WillAyd
Copy link
Member

WillAyd commented Jan 24, 2019 via email

@codecov
Copy link

codecov bot commented Jan 27, 2019

Codecov Report

Merging #24899 into master will decrease coverage by 49.49%.
The diff coverage is 12.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24899      +/-   ##
==========================================
- Coverage   92.38%   42.88%   -49.5%     
==========================================
  Files         166      166              
  Lines       52398    52418      +20     
==========================================
- Hits        48406    22480   -25926     
- Misses       3992    29938   +25946
Flag Coverage Δ
#multiple ?
#single 42.88% <12.5%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 39.86% <ø> (-56.77%) ⬇️
pandas/io/formats/excel.py 14.69% <12.5%> (-82.71%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
... and 125 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b16e2e...ca9c36e. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 27, 2019

Codecov Report

Merging #24899 into master will decrease coverage by 0.44%.
The diff coverage is 87.5%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24899      +/-   ##
==========================================
- Coverage   92.83%   92.38%   -0.45%     
==========================================
  Files         182      166      -16     
  Lines       50430    52403    +1973     
==========================================
+ Hits        46815    48411    +1596     
- Misses       3615     3992     +377
Flag Coverage Δ
#multiple 90.8% <87.5%> (-0.7%) ⬇️
#single 42.88% <12.5%> (+0.43%) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 96.63% <ø> (+2.81%) ⬆️
pandas/io/formats/excel.py 97.12% <87.5%> (-0.34%) ⬇️
pandas/io/gbq.py 25% <0%> (-75%) ⬇️
pandas/compat/__init__.py 57.91% <0%> (-34.09%) ⬇️
pandas/plotting/_misc.py 38.68% <0%> (-26.18%) ⬇️
pandas/io/common.py 72.86% <0%> (-21.19%) ⬇️
pandas/io/gcs.py 80% <0%> (-20%) ⬇️
pandas/io/s3.py 86.36% <0%> (-13.64%) ⬇️
pandas/core/computation/expr.py 88.68% <0%> (-9.12%) ⬇️
pandas/core/groupby/base.py 91.83% <0%> (-8.17%) ⬇️
... and 180 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c64c9cb...d0a6391. Read the comment docs.

@pep8speaks
Copy link

pep8speaks commented Jan 27, 2019

Hello @tdamsma! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-15 09:50:42 UTC

@tdamsma
Copy link
Contributor Author

tdamsma commented Jan 27, 2019

@WillAyd, I added a simple round-trip test. I tried to look for a example on how to check formatting, but the only test for formatting is completely commented out (test_to_excel_header_styling_xls). As far as I can tell the example sheets are only used in read tests, and not for comparison with write tests. I addressed your other comments, can you give some feedback if the general approach fits with pandas? I know there are still many tests to add, but I would like to know if I am on the right track for having this merged eventually. Thanks!

@WillAyd
Copy link
Member

WillAyd commented Jan 29, 2019

@tdamsma yea this might be tricky; not sure of best approach off the top of my head, but are there any testing facilities you know of with say openpyxl that would allow us to make stronger assertions about what is being written out? As is now there's just no way that the tests ensure this is actually working

@tdamsma
Copy link
Contributor Author

tdamsma commented Jan 30, 2019

@WillAyd, I wasn't aware of any options, but I just had a better look at openpyxl and this is indeed possible. In essence we need a function that can read an excel table into a dataframe, which would also be quite useful for pandas in general.

@WillAyd
Copy link
Member

WillAyd commented Feb 5, 2019

@tdamsma given conversation in associated issue number and some of the other pre-cursors going to close this one for now to keep our queue minimal. If you plan on addressing soon let me know, otherwise can reopen as we get back around to this

@WillAyd WillAyd closed this Feb 5, 2019
tdamsma added 2 commits July 9, 2019 16:10
# Conflicts:
#	pandas/core/generic.py
#	pandas/io/excel.py
#	pandas/io/formats/excel.py
#	pandas/tests/io/test_excel.py
@WillAyd WillAyd reopened this Jul 9, 2019
@tdamsma tdamsma force-pushed the excel-tables-#24862 branch from 1aab7ca to cbc5d01 Compare July 10, 2019 08:59
@tdamsma tdamsma force-pushed the excel-tables-#24862 branch from cbc5d01 to 86d5499 Compare July 10, 2019 09:44
@@ -2182,8 +2182,8 @@ def _repr_data_resource_(self):
freeze_panes : tuple of int (length 2), optional
Specifies the one-based bottommost row and rightmost column that
is to be frozen.

.. versionadded:: 0.20.0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to keep this

@@ -235,3 +235,51 @@ def write_cells(
)
else:
wks.write(startrow + cell.row, startcol + cell.col, val, style)

def write_table(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs to be a separate method at all. Can't you just use the normal write method and apply the table after cells are written out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that can be done, but there are several differences with the normal write function so don't think the result will be that nice. Key differences:

  • merge cells are not allowed for tables
  • header needs to be dealt with differently
  • individual call styling makes little sense (don't think it should be supported or at least discouraged) for tables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disregard the above, I think I can make it work with not too big adjustments

with ensure_clean(ext) as pth:
df.to_excel(pth, header=True, table="Table1")

xf = ExcelFile(pth)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use this as a context manager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? This style is consistent with all the other tests

result = pd.read_excel(xf, xf.sheet_names[0], index_col=0)

tm.assert_frame_equal(result, df)
assert result.index.name == "foo"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better assertion to be had to make sure a table actually appears in the excel file? read_excel isn't going to care about Table styles so I think need to approach another way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests definitely need some work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, implemented a proof of concept. Don't really like it to be hones as for openpyxl the default formatting needs to be disabled which was not really supported, and for xlsxwriter the column names need to be passed to the table explicitly, otherwise previously written header cells are over written with generated default names.

@@ -2185,6 +2185,9 @@ def _repr_data_resource_(self):

.. versionadded:: 0.20.0.

table : string, default None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would accepting a TableStyle be an option instead (speaking only in openpyxl terms, not sure if xlsxwriter offers that)? I feel like that could do the same thing but also give users more power over output formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both accept a style, but of course the syntax is a bit different: openpyxl uses "TableStyleMedium9" as default, and xlsxwriter uses "Table Style Medium 9".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm wasn't thinking about that as much as having the user pass in an actual object itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean something like?

style = TableStyleInfo(name="TableStyleMedium9", showFirstColumn=False,
                       showLastColumn=False, showRowStripes=True, showColumnStripes=True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right that's what I had in mind (not tied to it just asking)

@@ -389,6 +389,18 @@ def _convert_to_protection(cls, protection_dict):

return Protection(**protection_dict)

@classmethod
def _to_excel_range(cls, startrow, startcol, endrow, endcol):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using this can you just pass a CellRange into openpyxl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but doesn't the CellRange constructor only accept a string of the form A1:B2?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, the cell range can be constructed from those arguments, however the Table constructor itself really only accepts a string. Internally also the string representation is used, so I don't see a way around this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think can just cast to a str then:

>>> from openpyxl.worksheet.cell_range import CellRange
>>> str(CellRange(min_col=1, min_row=1, max_col=3, max_row=3))
'A1:C3'

That's what we are ultimately after here right? (save maybe off-by-one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was what I expected, but I was on openpyxl 2.4.8, and this apparently a 2.5.0 feature. Assuming updating the minimum supported version is not an issue I will fix this

for cell in cells:
n_cols = max(n_cols, cell.col)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these be inferred outside of the loop from the dimensions of the frame?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is quite some code translating a frame to excel cells, dealing with multiindexes etc. So this is not too straight forward. But that code in intertwined with the formatting code, so I considered the following options:

  • get size from frame, try to deal with edge cases for multiindex, index True/False, header True/False etc
  • get result from the get_formatted_cells iterator, run through it a second time
  • bypass the normal writer function altogether and use a separate, dedicated write_table function
  • extract the size from the writer function

I picked the latter

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK. At least outside the loop though wouldn't the nrows be len(cells) and the ncols just be the length of any item within cells?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately the cells are a 1D iterator of items (each with a row and col property), not a list of rows.

@@ -408,19 +418,7 @@ def __init__(
self.header = header
self.merge_cells = merge_cells
self.inf_rep = inf_rep

@property
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To provide an interface to override (i.e. disable) the cell styling

@WillAyd
Copy link
Member

WillAyd commented Aug 24, 2019

I think this has gone a little stale. Our PR queue has grown quite large so closing for now to keep things clean but ping if you'd like to pick back up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Output excel table objects with to_xlsx()
3 participants