Skip to content

BUG: pandas.read_excel(file_name, parse_dates = False) doesn't work #47269

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
Haiz14 opened this issue Jun 7, 2022 · 10 comments
Closed
3 tasks done

BUG: pandas.read_excel(file_name, parse_dates = False) doesn't work #47269

Haiz14 opened this issue Jun 7, 2022 · 10 comments
Labels
Bug IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue

Comments

@Haiz14
Copy link

Haiz14 commented Jun 7, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.read_excel(file_name, parse_dates = False)
df.to_excel(output_name, index = False)

Issue Description

The above code parses the date from (dd/mm/Y) to (Y-mm-dd hh:mm:ss) even though i specified to not parse dates.

Expected Behavior

The written file should retain the orignal date format.

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.186+
Version : #1 SMP PREEMPT Tue Apr 5 22:31:47 CST 2022
machine : aarch64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.1
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.1.1
setuptools : 59.8.0
Cython : None
pytest : 7.1.2
hypothesis : 6.43.0
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
fastparquet : None
fsspec : 2022.3.0
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

@Haiz14 Haiz14 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 7, 2022
@Haiz14
Copy link
Author

Haiz14 commented Jun 7, 2022

The original date format is dd/mm/Y

A working exame of above issue

>>> df = pd.read_excel('combine_me/'+files_path[2], header = None, parse_dates = False)
>>> df.dropna(subset=[0], inplace = True)     
>>> df[7]                                     
7                EventDate
9      2012-01-04 00:00:00
10     2012-01-05 00:00:00
11     2012-01-06 00:00:00
12     2012-01-11 00:00:00
              ...
175    2012-12-21 00:00:00
176    2012-12-21 00:00:00
177    2012-12-24 00:00:00
178    2012-12-24 00:00:00
179    2012-12-28 00:00:00
Name: 7, Length: 172, dtype: object
>>>

I tried to specify the column which has dates as str but it still gets parsed as dates

>>> df = pd.read_excel('combine_me/'+files_path[2], header = None, converters ={7:str})
>>> df.dropna(subset=[0], inplace = True)     
>>> df[7]
7                EventDate
9      2012-01-04 00:00:00
10     2012-01-05 00:00:00
11     2012-01-06 00:00:00
12     2012-01-11 00:00:00
              ...
175    2012-12-21 00:00:00
176    2012-12-21 00:00:00
177    2012-12-24 00:00:00
178    2012-12-24 00:00:00
179    2012-12-28 00:00:00
Name: 7, Length: 172, dtype: object
>>>

@Haiz14
Copy link
Author

Haiz14 commented Jun 7, 2022

Also one more thing, sometimes the dates are parsed wrong.

i got a freelance job once to do some operations on an excel file. As u can see above, the dates are parsed and converted automatically, i changed the date format back to orignal before writing it to a file. And i sent the file back to the guy who gave me the work.

A day later he tells me the dates have month and dates as swapped in some rows. For e.g. 02/03/2002 got converted to 03/02/2002.

That was quite bad cause it may have gone undetected.

@ahawryluk
Copy link
Contributor

Hi @Haiz14. My guess (having not seen your spreadsheet) is that your cells are already stored as dates in Excel. The parse_dates argument refers to the parsing of dates stored as text, so that is likely why you're seeing this behaviour. From the read_excel docs, "If you don`t want to parse some cells as date just change their type in Excel to 'Text'."

@phofl
Copy link
Member

phofl commented Jun 7, 2022

Could you please provide a reproducible example? Something that is copy and pasteable would be perfect

@phofl phofl added IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 7, 2022
@Haiz14
Copy link
Author

Haiz14 commented Jun 7, 2022

Here's, the excel file i used to show above examples.

Consolid splits corp actions 2012.xlsx

@Haiz14
Copy link
Author

Haiz14 commented Jun 7, 2022

@ahawryluk there's a lot of excel files, it's not possible for me to convert the excel dates as text manually for that many files

@Haiz14
Copy link
Author

Haiz14 commented Jun 8, 2022

Hi @Haiz14. My guess (having not seen your spreadsheet) is that your cells are already stored as dates in Excel. The parse_dates argument refers to the parsing of dates stored as text, so that is likely why you're seeing this behaviour. From the read_excel docs, "If you don`t want to parse some cells as date just change their type in Excel to 'Text'."

Yupp you are right, check the sample example i posted (2nd post in this thread), the datatype of printed DataFrame is object not "datetime", so pandas worked fine (it didn't parse it as dates, there should be some excel metadata ig). It's an excel problem.

@Haiz14
Copy link
Author

Haiz14 commented Jun 8, 2022

My above post isnt valid because u can see in the file i gave previously, it has a string in the date column. I checked in a simpler file
2012.xlsx

It does gets parsed as datetime even though i passed parse_dates = False

>>> df = pd.read_excel('2012.xlsx', parse_dates = False)
>>> df['EventDate']
0     2012-01-04
1     2012-01-05
2     2012-01-06
3            NaT
4            NaT
         ...
185          NaT
186          NaT
187          NaT
188          NaT
189          NaT
Name: EventDate, Length: 190, dtype: datetime64[ns]

@ahawryluk
Copy link
Contributor

Thanks @Haiz14 for the sample file. Here's how I understand the issue.

Pandas loads data from xlsx files in two steps:

  1. it uses openpyxl to read values from the worksheet
  2. it places those values into a new DataFrame

Dates represented in Excel as text can be parsed by pandas in step 2, but actual Excel dates are returned in step 1 in their actual format:

>>> import openpyxl
>>> wb = openpyxl.open('2012.xlsx')
>>> ws = wb.worksheets[0]
>>> ws['H1'].value
'EventDate'
>>> ws['H1'].data_type
's'  # string
>>> ws['H2'].value
datetime.datetime(2012, 1, 4, 0, 0)
>>> ws['H2'].data_type
'd'  # date
>>> ws['H2'].number_format
'mm-dd-yy'

The value in cell H2 is an Excel date, not a string, so the value of parse_dates has no effect. The appearance of the date within Excel is controlled by the number_format of that cell, but pandas doesn't load formats, just values.

If I understand your task correctly, you need to load and save many spreadsheets, and you need both the value and the formatting of these dates to be preserved. Perhaps you could specify the output datetime_format with ExcelWriter, or you could use openpyxl rather than pandas for this task, as it can read and write the number_format for each cell.

@Haiz14
Copy link
Author

Haiz14 commented Jun 9, 2022

Ty, closing this issue then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue
Projects
None yet
Development

No branches or pull requests

3 participants