Skip to content

ENH: Excel to support reading Timedeltas #4332

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
timmie opened this issue Jul 23, 2013 · 33 comments
Closed

ENH: Excel to support reading Timedeltas #4332

timmie opened this issue Jul 23, 2013 · 33 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel Timedelta Timedelta data type
Milestone

Comments

@timmie
Copy link
Contributor

timmie commented Jul 23, 2013

ExcelFile should print out line or even cell warnings

Today I was spending quite some time debugging why a decoding error stopped the code from reading in a table.

I thought the skiprows counts from 0 (= Excel row 1). It was always failing.

In one line there were column headers, the next line contained units which couldn't probably be parsed.

I think it could be helpful if the parser would show the line number or even cell were it fails to ready (like due to decoding errors).

@jreback
Copy link
Contributor

jreback commented Jul 23, 2013

can you put post the error you did get?

@jtratner
Copy link
Contributor

Also it would be helpful if you could print the versions of pandas and xlrd you are using (might need openpyxl if you're not using the dev version.)

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013


pd.__version__
Out[116]: '0.9.1'

import xlrd

xlrd.__VERSION__
Out[118]: '0.9.1'

import openpyxl

openpyxl.__version__
Out[120]: '1.5.8'

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

please see an example error msg at #4339

@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

@timmie

you have a pretty old version of pandas, 0.11 has been out since april, and 0.12 is releasing this week. excel parsing uses the csv parser under the hood, and pretty sure that all of your 3 posted issues are fixed in more recent versions. Pls try and close these issues if that is the case. (e.g. #4332 , #4340)

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

With

In [38]: pd.__version__

Out[38]: '0.12.0rc1'

The issue does nit arise. But now I get:

XLDateAmbiguous: 1.0

even if I change to parse_dates=False and index_col=0.

There is one column in the xlsx that has time (not date).

But aapraently, the parser expects a datetime:


C:\Python27\lib\site-packages\pandas\io\excel.pyc in _parse_excel(self, sheetname, header, skiprows, skip_footer, index_col, has_index_names, parse_cols, parse_dates, date_parser, na_values, thousands, chunksize, **kwds)
    184                 if parse_cols is None or should_parse[j]:
    185                     if typ == XL_CELL_DATE:
--> 186                         dt = xldate_as_tuple(value, datemode)
    187                         # how to produce this first case?
    188                         if dt[0] < datetime.MINYEAR:  # pragma: no cover

C:\Python27\lib\site-packages\xlrd\xldate.pyc in xldate_as_tuple(xldate, datemode)
     78 
     79     if xldays < 61 and datemode == 0:
---> 80         raise XLDateAmbiguous(xldate)
     81 
     82     jdn = xldays + _JDN_delta[datemode]

@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

This is essentially a timedelta. Maybe just change the column formatting to text?

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

Can I not get around this?
Like let the program read it as strings and then convert later?

I am receiving these tables from elsewhere. So the process should be automatised and I'd rather not touch the Excel tables.

@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

you could try passing : dtype = { 'column_name' : object }

@jtratner
Copy link
Contributor

@timmie that sounds like an xlrd error. What happens if you just try to
read the spreadsheet outside of pandas?

On Wed, Jul 24, 2013 at 8:50 AM, jreback notifications@github.com wrote:

you could try passing : dtype = { 'column_name' : object }


Reply to this email directly or view it on GitHubhttps://github.com//issues/4332#issuecomment-21482111
.

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

@jreback

it says: ValueError: dtype is not supported with python parser

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

@jtratner
in the source xlsx file I removed the date column. Now everything goes in smoothly.

So we would need to find a way to read the time cloumn.
In Excel, the cell properties say "userdefined".

So can we read it as string or alike?

@timmie timmie closed this as completed Jul 24, 2013
@timmie timmie reopened this Jul 24, 2013
@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

ok...so maybe 2 bugs here, I thought dtype in the PythonParser worked.....

and 2 processing as @jtratner suggest....

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

I tested outside pandas:

In [51]: import xlrd

In [53]: wb = xlrd.open_workbook(example_path)

In [54]: sh = wb.sheet_by_name('mysheet')

In [76]: xldt = sh.row(21)[1]

In [78]: xldt.value

Out[78]: 0.006944444444444444

In [79]: xlrd.xldate_as_tuple(xldt.value, wb.datemode)

Out[79]: (0, 0, 0, 0, 30, 0)

Is that what you suggested?

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

Maybe this one could help to include better error msgs:

https://classic.scraperwiki.com/docs/python/python_excel_guide/
very at the end.

@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

I thought there was an issue out there to interpret this as a timedelta, can't find it so converting this issue to do that

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

Sorry now I am lost.

  • pandas did read the times in correctly
  • xlrd did.

Where shall I look next?

@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

is the wb.datemode not being passed in pandas (to xlrd)?

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

Acccording to the docs not:
not according to the docs
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.excel.ExcelFile.parse.html#pandas.io.excel.ExcelFile.parse

but the source shwos that is read automatically from the file:
https://github.com/pydata/pandas/blob/master/pandas/io/excel.py#L182

@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

best thing to prob do is do a monkey patch (for now), if you really want that column:

start by defining _parse_excel (copy it from the source code)

def _parse_excel(......):
......

from pandas.io.excel import ExcelFile
ExcelFile._parse_excel = _parse_excel

so it will use your code (and essentially fix the bug locally for yourself)

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

mmh. this appraoch is still new for me.

I cannot imagine why my file would be so exotic. It seems that xlrd tries to be overly exclicit.

Would you say it's a pandas bug or from xlrd?
And would we see improvement with openpyxl?

(BTW, thanks a lot for all your responses!)

@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

not sure

@jtratner
Copy link
Contributor

@timmie if you can share your data, I can try to figure out what's causing the bug and where the issue is occurring (can't promise super-fast turnaround, but probably by this weekend 😄)

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

@jtratner : Thank you. very generous! But this is difficult. Let me prepare an anonysed version tomorrow.

anyway, I know where the problem comes from. But do not know how to solve finally;-(

  • my data is a time series in 10min steps.
  • it starts at 00:10
  • ends at 00:00 (spreadsheets show also 24:00:00)
  • datemode = 0

The last row with 00:00 causes the problem:

  • look at this line: https://github.com/pydata/pandas/blob/master/pandas/io/excel.py#L198
  • this is the part where hourly data is separated --> works
  • reading the very cell with xlrd only returns: xldate:1.0
  • getting the date returns the error: xlrd.xldate_as_tuple( xldt.value, 0) --> xlrd.xldate.XLDateAmbiguous: 1.0
  • BUT: assuming a datemode = 1, it works
  • xlrd.xldate_as_tuple( xldt.value, 1) --> (1904, 1, 2, 0, 0, 0)

So it can be solved by the following code:

                        if dt[0] < datetime.MINYEAR:  # pragma: no cover
                            datemode = 1
                            dt_new = xldate_as_tuple(value, datemode)
                            value = datetime.time(*dt_new[3:])

The only problem is that the last two timestamps appear like

23:50:00              
1904-01-02 00:00:00

But later I will prepend a date anyway.

Is this an accepted solution for the core?

Issue #4340 still persists with this workaround
Issue #4339 is solved.

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

Thinking more over it, I would say that we would need a date_parser option similar to pd.read_csv

see: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1665

this would need to be added to:

In many data files (like my excel file), creators count from hour 1 to hour 24.
The idea behind is to show that the data values are taken at the end of a summation or averaging interval.

This thinking is the source of the confusion.

And since padas has no metadata tag, we cannot find another way to show this relation.

What are your opinions?

@jtratner
Copy link
Contributor

Yeah, a minimal example (just enough to produce the failure) would be
perfect.
On Jul 24, 2013 6:18 PM, "timmie" notifications@github.com wrote:

Thinking more over it, I would say that we would need a date_parser option
similar to pd.read_csv

see:
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1665

this would need to be added to:

In many data files (like my excel file), creators count from hour 1 to
hour 24.
The idea behind is to show that the data values are taken at the end of a
summation or averaging interval.

This thinking is the source of the confusion.

And since padas has no metadata tag, we cannot find another way to show
this relation.

What are your opinions?


Reply to this email directly or view it on GitHubhttps://github.com//issues/4332#issuecomment-21520715
.

@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

@jtratner you can actually create a timedelta64[ns] column from this

essentially:

from datetime import timedelta
Series([ timedelta(days=1,hours=1), timedelta(seconds=10,microseconds=500) ])

Out[3]: 
0   1 days, 01:00:00
1    00:00:10.000500
dtype: timedelta64[ns]

@timmie
Copy link
Contributor Author

timmie commented Jul 24, 2013

@jreback I don't think that we are after timedelta, but rather adding a date_parser here.

look at: #4332 (comment)

I have already found a workaround for exetended date parsing. But I am unsure how to feed this back into pandas core

@jreback
Copy link
Contributor

jreback commented Jul 24, 2013

you create a timedelta which is why I out the example up here

@timmie
Copy link
Contributor Author

timmie commented Jul 25, 2013

@jtratner

Please find it here:
https://github.com/timmie/example_code_data/blob/master/example_file_2013-07-25.xlsx

I may add an example script later or tomorrow...

@timmie
Copy link
Contributor Author

timmie commented Jul 26, 2013

I added example code to the repo. Please have a look at:

https://github.com/timmie/example_code_data

@jmcnamara
Copy link
Contributor

The issue here is with the way xlrd handles time that exceed 24 hours.

Basically the logic is like this:

  1. If the Excel date/time is <= 1 then it is assumed to be a time and is parsed as such.
  2. If it is > 1 then it is treated as a date. So times > 24 hours are treated as dates with times.
  3. If the date is < 61 in the 1900 epoch (i.e., Windows versions of Excel) then the date is treated as "ambiguous" due to the famous Excel 1900 leap year bug and an exception is raised.
  4. The previous doesn't happen with Excel for Mac files which use a 1904 epoch (datemode = 1 in the code examples above).

This issue has been fixed via #6934 when using xldd >= 0.9.3.

So, as far as I can see, the (confusing) root cause of this issue has been fixed and this item can be closed. @jreback

@jreback
Copy link
Contributor

jreback commented Apr 24, 2014

closed via #6934

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO Data IO issues that don't fit into a more specific label IO Excel read_excel, to_excel Timedelta Timedelta data type
Projects
None yet
Development

No branches or pull requests

4 participants