-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Wish: Input / output for Open Document Spreadsheet (ODS) #2311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
It seems to me the best option for this would be:
This would be a rather substantial bit of work. A good start for the test cases would be to look at pandas/io/tests/test_excel.py and write test cases that mirror what is in that file. |
+1 for this feature. Working on an open-source data wrangling project and I would like to be able to interact with spreadsheet software that is not Excel. |
@H0R5E for now, you could just convert the resulting excel file to ODS (it's pretty much a simple set of numbers, nothing more): http://stackoverflow.com/questions/15257032/python-convert-excel-file-xls-or-xlsx-to-from-ods or you could use json as an intermediary and use |
+1 for this! |
I want to tune in with some additional notes:
Based on some initial looks, ezodf has a straight forward interface. For example, loading an ods spreadsheet into a DataFrame could go as follows: import pandas as pd
import ezodf
doc = ezodf.opendoc('somefile.ods')
print("Spreadsheet contains %d sheets.\n" % len(doc.sheets))
for sheet in doc.sheets:
if sheet.name == 'data':
data = sheet
print("Sheet name: '%s'" % sheet.name)
print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()) )
print("-"*40)
# a random cell
pos = (row, col)
data.get_cell(pos).display_form
df_dict = {}
for i, col in enumerate(data.columns()):
# col is a list that contains all the cells (rows of that column)
# assume the column name is on the first row
colname = col[0].display_form
df_dict[colname] = []
print(colname)
for j, row in enumerate(col):
# ignore the header
if j == 0:
continue
else:
df_dict[col_name].append(row.display_form)
# and convert to a DataFrame
df = pd.DataFrame(df_dict) I have no idea how odfpy and ezodf compare performance wise, or how their different memory footprints are for very large spreadsheets. I haven't tried to use odfpy because that seems more complicated to get going with. I also have not checked how robust the writing process of ezodf is. I'll try to see if I can get something similar going as for pandas/io/tests/test_excel.py. I will also mention @sorenroug from odfpy and @T0ha from ezodf just in case someone from upstream is interested in this discussion. |
After a short exploration, it seems doable to create an ods prototype reader based on the structure of pandas/io/excel.py using the ezodf module. I've forked pandas here and will implement an ods prototype reader in the io-ods branch. |
@davidovitch - one thing to keep in mind if you choose to handle both writing and reading: you can register an ODS writer in the excel module (just implementing the required methods as listed in the code) and then you can just add a simple subclass of the excel writer tests and get a ton of test cases for free. Just a thought. |
I would like to give an update on the current status. My fork can currently read ODF ODS files, but writing is not implemented yet. The corresponding pull request is still pending review, and it has been a lot more work than I originally anticipated. Just a few days ago I bumped into yet another library to read/write spreadsheet: pyexcel. What is different about this library is that it aims at creating a single API for all the different read/write libraries, and it builds on all the existing libraries out there (ezodf, xlrd, etc). This is actually kind of similar to what pandas currently has, and to what I am trying to extend in PR #9070. So I am wondering, would it make sense to use a single API library (with optional dependencies to all the relevant readers/writers for the different spreadsheet types) instead of developing something similar, but tailored to pandas use case? I am speculating that a lot of code in io/excel.py could be removed when relying on pyexcel, but it adds yet another dependency. At this stage I am not sure how all different edge cases on data types and other magic happening inside the spreadsheet interpretation differs between the current pandas implementation and pyexcel. |
The ods reader is not ready for the upcoming 0.17 release. PR #9070 is closed (see the PR for a technical discussion), and a new improved PR will be made by someone at some point in the future. I have a working version in the ezodf_reader branch of my pandas fork in case someone wants to have a look at it. Suggestions and improvements are welcome :-) |
@davidovitch thanks for your efforts! Its a shame that that reading and writing to ODF is not better supported in python, otherwise I'm sure this would be less of an ordeal. |
Working on how to structure data for paleoclimatology community, I would like to promote ODS as the standard format. Then of course, pandas with this possible reading and writing at ODS format would become the natural way to start nice analysis. +2 and more for this feature that seems so closed to be released. |
You should almost certainly be using netcdf for paleoclimate data. It's more or less industry standard. Checkout the python package 'xray' for a nice way to interface with it. On 3 November 2015 12:30:20 am AEDT, Patrick Brockmann notifications@github.com wrote:
Sent from my Android device with K-9 Mail. Please excuse my brevity. |
@naught101: I know quite well netCDF format and some conventions like CF (Climate and Forecast) widely used in the Earth System Modelling community but it seems that the PaleoClimate one has also good argument to keep working with others standard. An article that is under discussion describes this motivation and promote a JSON-LD format. Read https://www.authorea.com/users/17200/articles/19163/_show_article and please feel free to contribute it. |
@PBrockmann If your metadata is hierarchical, maybe it would be appropriate to use the hierarchical features of netCDF4/HDF5? Just a thought. In any case, I think we agree regardless that ODS support would be valuable for pandas. |
Is there any progress on this? Thanks |
Will take a look at it. Thank you for pointing it out. |
On a similar note has anyone considered Google Sheets? There are some limits to the size of a sheet (http://gappstips.com/google-sheets/google-spreadsheet-limitations/) that probably make it only usable for some cases but there are quite a lot of datasets that would fit within this restrictions and the ease of collaboration with Sheets is nice. |
@jameserrico There's a library https://github.com/embr/gcat that can read from google drive. I have a fork that I updated to work with python3 https://github.com/detrout/gcat but upstream hasn't merged the pull-request It knows how to return pandas data frames. I.e. I have some code like this: book = gcat.get_file(book_name, fmt='pandas_excel')
experiments = book.parse('Experiments', header=0)` |
I have a standalone reader styled like pandas.io.excel with tests. It uses odfpy as that seems like its still maintained. Any comments, or should I try reformatting into a pull request creating pandas/io/odf.py? I thought the .read_odf method seemed more reasonable, than overloading .read_excel |
Makes sense. Perhaps there could be a |
@naught101 The general reader might have some difficulties since there are differences in what the various formats support. CSV/TSV only have a single table, while Excel & ODF can have multiple tables per file. Also I'm currently contemplating passing the raw odfpy xml cell node for the converters callable, so those would be quite difficult to write if you didn't know the source file type. |
@detrout FYI, maybe things have changed since, but at the time of PR #9070 there was a very clear decision by the pandas devs not to do that and to keep |
yes that issue is a good starting point note that the internal code org has changed a lot since then |
Great to see the ODS reader implemented. Well done! I'm not sure this should have been closed though, as writing still isn't supported. |
i would open a new issue for write support (if it’s really needed) |
It would be nice to have, along with XLS and XLSX, also support for ODS (used notably by OpenOffice.org and Libreoffice, but not only that).
The main problem I saw is that there's a plethora of libraries to handle that: however not sure which is the "best" one (with a licensing that would fit with pandas).
The text was updated successfully, but these errors were encountered: