ENH: Excel to support reading Timedeltas #4332

timmie · 2013-07-23T20:48:05Z

ExcelFile should print out line or even cell warnings

Today I spent quite some time debugging why a decoding error stopped the code from reading in a table.

I thought skiprows counts from 0 (= Excel row 1). It was always failing.

One line contained the column headers; the next line contained units, which probably couldn't be parsed.

I think it would be helpful if the parser showed the line number or even the cell where it fails to read (e.g. due to decoding errors).

jreback · 2013-07-23T22:17:11Z

can you post the error you got?

jtratner · 2013-07-24T01:03:02Z

Also it would be helpful if you could print the versions of pandas and xlrd you are using (might need openpyxl if you're not using the dev version.)

timmie · 2013-07-24T09:00:19Z


pd.__version__
Out[116]: '0.9.1'

import xlrd

xlrd.__VERSION__
Out[118]: '0.9.1'

import openpyxl

openpyxl.__version__
Out[120]: '1.5.8'

timmie · 2013-07-24T09:05:25Z

please see an example error msg at #4339

jreback · 2013-07-24T10:04:34Z

@timmie

you have a pretty old version of pandas, 0.11 has been out since april, and 0.12 is releasing this week. excel parsing uses the csv parser under the hood, and pretty sure that all of your 3 posted issues are fixed in more recent versions. Pls try and close these issues if that is the case. (e.g. #4332 , #4340)

timmie · 2013-07-24T11:39:45Z

With

In [38]: pd.__version__

Out[38]: '0.12.0rc1'

The issue does not arise. But now I get:

XLDateAmbiguous: 1.0

even if I change to parse_dates=False and index_col=0.

There is one column in the xlsx that has time (not date).

But apparently, the parser expects a datetime:


C:\Python27\lib\site-packages\pandas\io\excel.pyc in _parse_excel(self, sheetname, header, skiprows, skip_footer, index_col, has_index_names, parse_cols, parse_dates, date_parser, na_values, thousands, chunksize, **kwds)
    184                 if parse_cols is None or should_parse[j]:
    185                     if typ == XL_CELL_DATE:
--> 186                         dt = xldate_as_tuple(value, datemode)
    187                         # how to produce this first case?
    188                         if dt[0] < datetime.MINYEAR:  # pragma: no cover

C:\Python27\lib\site-packages\xlrd\xldate.pyc in xldate_as_tuple(xldate, datemode)
     78 
     79     if xldays < 61 and datemode == 0:
---> 80         raise XLDateAmbiguous(xldate)
     81 
     82     jdn = xldays + _JDN_delta[datemode]

jreback · 2013-07-24T11:55:24Z

This is essentially a timedelta. Maybe just change the column formatting to text?

timmie · 2013-07-24T12:27:54Z

Can I not get around this?
Like let the program read it as strings and then convert later?

I am receiving these tables from elsewhere. So the process should be automatised and I'd rather not touch the Excel tables.

jreback · 2013-07-24T12:50:00Z

you could try passing : dtype = { 'column_name' : object }

jtratner · 2013-07-24T12:54:07Z

@timmie that sounds like an xlrd error. What happens if you just try to
read the spreadsheet outside of pandas?

timmie · 2013-07-24T13:50:20Z

@jreback

it says: ValueError: dtype is not supported with python parser

timmie · 2013-07-24T13:52:08Z

@jtratner
in the source xlsx file I removed the date column. Now everything goes in smoothly.

So we would need to find a way to read the time column.
In Excel, the cell properties say "userdefined".

So can we read it as a string or the like?

jreback · 2013-07-24T13:56:00Z

ok...so maybe 2 bugs here, I thought dtype in the PythonParser worked.....

and 2 processing as @jtratner suggest....

timmie · 2013-07-24T14:21:23Z

I tested outside pandas:

In [51]: import xlrd

In [53]: wb = xlrd.open_workbook(example_path)

In [54]: sh = wb.sheet_by_name('mysheet')

In [76]: xldt = sh.row(21)[1]

In [78]: xldt.value

Out[78]: 0.006944444444444444

In [79]: xlrd.xldate_as_tuple(xldt.value, wb.datemode)

Out[79]: (0, 0, 0, 0, 30, 0)

Is that what you suggested?

timmie · 2013-07-24T14:27:29Z

Maybe this one could help to include better error msgs:

https://classic.scraperwiki.com/docs/python/python_excel_guide/
at the very end.

jreback · 2013-07-24T14:30:25Z

I thought there was an issue out there to interpret this as a timedelta, can't find it so converting this issue to do that

timmie · 2013-07-24T14:32:48Z

Sorry now I am lost.

pandas did read the times in correctly
xlrd did.

Where shall I look next?

jreback · 2013-07-24T14:44:04Z

is the wb.datemode not being passed in pandas (to xlrd)?

timmie · 2013-07-24T15:47:15Z

Not according to the docs:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.excel.ExcelFile.parse.html#pandas.io.excel.ExcelFile.parse

but the source shows that it is read automatically from the file:
https://github.com/pydata/pandas/blob/master/pandas/io/excel.py#L182

jreback · 2013-07-24T15:59:57Z

best thing to prob do is do a monkey patch (for now), if you really want that column:

start by defining _parse_excel (copy it from the source code)

def _parse_excel(......):
......

from pandas.io.excel import ExcelFile
ExcelFile._parse_excel = _parse_excel

so it will use your code (and essentially fix the bug locally for yourself)
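
For anyone new to this approach, here is a minimal sketch of the monkey-patching pattern itself. It uses a stand-in class (`Reader`, a hypothetical name) rather than pandas' ExcelFile, so it only illustrates the mechanism and is not the actual fix:

```python
class Reader(object):
    """Stand-in for a library class; this is NOT pandas' ExcelFile."""

    def _parse(self, value):
        # Original behaviour: fails on values it does not understand.
        raise ValueError("cannot parse %r" % (value,))

    def parse(self, value):
        # Public method that delegates to the private helper.
        return self._parse(value)


def _patched_parse(self, value):
    # Replacement helper (hypothetical): return the raw value instead of raising.
    return value


# Assigning on the class swaps the method for every instance,
# including instances created before the patch was applied.
reader = Reader()
Reader._parse = _patched_parse
result = reader.parse(1.0)
print(result)
```

The patch only lasts for the running Python session, so it has to be applied before the file is parsed, every time the script runs.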

timmie · 2013-07-24T16:05:11Z

mmh. this approach is still new for me.

I cannot imagine why my file would be so exotic. It seems that xlrd tries to be overly explicit.

Would you say it's a pandas bug or from xlrd?
And would we see improvement with openpyxl?

(BTW, thanks a lot for all your responses!)

jreback · 2013-07-24T16:05:45Z

not sure

jtratner · 2013-07-24T20:57:38Z

@timmie if you can share your data, I can try to figure out what's causing the bug and where the issue is occurring (can't promise super-fast turnaround, but probably by this weekend 😄)

timmie · 2013-07-24T22:02:16Z

@jtratner : Thank you. very generous! But this is difficult. Let me prepare an anonymised version tomorrow.

anyway, I know where the problem comes from. But do not know how to solve finally;-(

my data is a time series in 10min steps.
it starts at 00:10
ends at 00:00 (spreadsheets show also 24:00:00)
datemode = 0

The last row with 00:00 causes the problem:

look at this line: https://github.com/pydata/pandas/blob/master/pandas/io/excel.py#L198
this is the part where hourly data is separated --> works
reading the very cell with xlrd only returns: xldate:1.0
getting the date returns the error: xlrd.xldate_as_tuple( xldt.value, 0) --> xlrd.xldate.XLDateAmbiguous: 1.0
BUT: assuming a datemode = 1, it works
xlrd.xldate_as_tuple( xldt.value, 1) --> (1904, 1, 2, 0, 0, 0)

So it can be solved by the following code:

                        if dt[0] < datetime.MINYEAR:  # pragma: no cover
                            datemode = 1
                            dt_new = xldate_as_tuple(value, datemode)
                            value = datetime.time(*dt_new[3:])

The only problem is that the last two timestamps appear like

23:50:00              
1904-01-02 00:00:00

But later I will prepend a date anyway.

Is this an accepted solution for the core?

Issue #4340 still persists with this workaround
Issue #4339 is solved.

timmie · 2013-07-24T22:17:41Z

Thinking more over it, I would say that we would need a date_parser option similar to pd.read_csv

see: https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1665

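this would need to be added to:

https://github.com/pydata/pandas/blob/master/pandas/io/excel.py#L199

https://github.com/pydata/pandas/blob/master/pandas/io/excel.py#L201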

In many data files (like my excel file), creators count from hour 1 to hour 24.
The idea behind is to show that the data values are taken at the end of a summation or averaging interval.

This thinking is the source of the confusion.

And since pandas has no metadata tag, we cannot find another way to show this relation.

What are your opinions?
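
To make the hour-1-to-hour-24 convention concrete, here is a rough sketch of what such a parser could do with the raw cell values. It assumes each cell is an Excel day fraction in (0, 1], with 1.0 standing for 24:00, and that the date comes from elsewhere. `interval_end_timestamp` is a hypothetical helper, not an existing pandas or xlrd function:

```python
from datetime import datetime, timedelta


def interval_end_timestamp(day, serial):
    """Hypothetical helper: combine a date with an Excel day fraction.

    `day` is a datetime at midnight; `serial` is the raw cell value,
    where 1.0 means 24:00, i.e. midnight of the following day.
    """
    # Round to whole milliseconds to absorb floating-point noise.
    offset = timedelta(milliseconds=round(serial * 86400 * 1000))
    return day + offset


day = datetime(2013, 7, 25)
first = interval_end_timestamp(day, 10.0 / 1440)  # label 00:10
last = interval_end_timestamp(day, 1.0)           # label 24:00
print(first, last)
```

Because the offset is added as a duration, the 24:00 label rolls over to the next day without ever asking xlrd to interpret 1.0 as a calendar date.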

jtratner · 2013-07-24T22:26:05Z

Yeah, a minimal example (just enough to produce the failure) would be
perfect.

jreback · 2013-07-24T23:21:41Z

@jtratner you can actually create a timedelta64[ns] column from this

essentially:

from datetime import timedelta
from pandas import Series
Series([timedelta(days=1, hours=1), timedelta(seconds=10, microseconds=500)])

Out[3]: 
0   1 days, 01:00:00
1    00:00:10.000500
dtype: timedelta64[ns]
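
For the Excel case, the raw cell value is a number of days, so it maps directly onto a timedelta. A minimal standard-library sketch, using the value 0.006944444444444444 from the xlrd session earlier in this thread; `excel_fraction_to_timedelta` is a hypothetical helper name, not part of pandas or xlrd:

```python
from datetime import timedelta


def excel_fraction_to_timedelta(serial):
    """Hypothetical helper: read an Excel serial number as a duration in days."""
    # Round to whole milliseconds to absorb floating-point noise
    # while keeping sub-second resolution.
    return timedelta(milliseconds=round(serial * 86400 * 1000))


ten_minutes = excel_fraction_to_timedelta(0.006944444444444444)
full_day = excel_fraction_to_timedelta(1.0)
print(ten_minutes, full_day)
```

Values built this way could then be collected into a Series like the one above.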

timmie · 2013-07-24T23:34:47Z

@jreback I don't think that we are after timedelta, but rather adding a date_parser here.

look at: #4332 (comment)

I have already found a workaround for extended date parsing. But I am unsure how to feed this back into pandas core

jreback · 2013-07-24T23:39:14Z

you create a timedelta which is why I put the example up here

timmie · 2013-07-25T15:17:07Z

@jtratner

Please find it here:
https://github.com/timmie/example_code_data/blob/master/example_file_2013-07-25.xlsx

I may add an example script later or tomorrow...

timmie · 2013-07-26T15:50:26Z

I added example code to the repo. Please have a look at:

https://github.com/timmie/example_code_data

jmcnamara · 2014-04-24T09:25:03Z

The issue here is with the way xlrd handles times that exceed 24 hours.

Basically the logic is like this:

If the Excel date/time is <= 1 then it is assumed to be a time and is parsed as such.
If it is > 1 then it is treated as a date. So times > 24 hours are treated as dates with times.
If the date is < 61 in the 1900 epoch (i.e., Windows versions of Excel) then the date is treated as "ambiguous" due to the famous Excel 1900 leap year bug and an exception is raised.
The previous doesn't happen with Excel for Mac files which use a 1904 epoch (datemode = 1 in the code examples above).
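
As a rough illustration, the branching can be sketched like this. It is based only on the traceback and outputs quoted in this thread (a serial with a zero day part comes back as a time, and 1.0 raises in the 1900 epoch but not in the 1904 epoch); it is not a copy of xlrd's code, and `classify_excel_serial` is a hypothetical helper:

```python
def classify_excel_serial(value, datemode):
    """Hypothetical helper: how a non-negative Excel serial would be treated.

    datemode 0 is the 1900 epoch, datemode 1 is the 1904 epoch.
    """
    if value < 0:
        raise ValueError("negative serials are not handled in this sketch")
    days = int(value)  # whole-day part of the serial number
    if days == 0:
        # Only a fraction of a day: treated as a time of day.
        return "time"
    if days < 61 and datemode == 0:
        # Early 1900-epoch dates are affected by the 1900 leap year bug.
        return "ambiguous"
    return "date"


print(classify_excel_serial(0.006944444444444444, 0))
print(classify_excel_serial(1.0, 0))
print(classify_excel_serial(1.0, 1))
```

This is why the 24:00 row (serial 1.0) fails in a Windows-style file while the other rows of the same series parse without trouble.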

This issue has been fixed via #6934 when using xlrd >= 0.9.3.

So, as far as I can see, the (confusing) root cause of this issue has been fixed and this item can be closed. @jreback

jreback · 2014-04-24T10:27:41Z

closed via #6934

timmie closed this as completed Jul 24, 2013

timmie reopened this Jul 24, 2013

This was referenced Jul 29, 2013

xls.parse: fails to skip lines #4340

Closed

fix the excel reader: hours & header #4404

Merged

This was referenced Aug 21, 2013

excel reader & skip row between data & header & docs #4631

Merged

now sectionwise: date_converter: delta / time #4632

Closed

timmie added a commit to timmie/pandas that referenced this issue Aug 22, 2013

now sectionwise: date_converter: excel / date_parser pandas-dev#4332

8f417ac

timmie referenced this issue in jreback/pandas Aug 22, 2013

CLN: modified timmie PR 4631

73fbd21

dershow mentioned this issue Jan 15, 2014

Can't read excel decimal seconds #5945

Closed

jreback modified the milestones: 0.15.0, 0.14.0 Feb 15, 2014

jreback closed this as completed Apr 24, 2014

dacoex mentioned this issue Nov 25, 2014

IO: ensure compatibility with 01-24 date formats #8891

Closed
