Inconsistent date parsing with read_csv #19402

errikos · 2018-01-25T22:47:45Z

Code Sample, a copy-pastable example if possible

> cat events.csv
id;name;date
1;event1;11/03/2011
2;event2;03/05/2011
3;event3;20/10/2011
4;event4;04/08/2011
5;event5;05/10/2011
6;event6;17/11/2011
7;event7;11/02/2011
8;event8;22/02/2011
9;event9;29/04/2011

import pandas as pd
pd.read_csv('events.csv', index_col=0, delimiter=';', parse_dates=[2])

      name       date
id
1   event1 2011-11-03
2   event2 2011-03-05
3   event3 2011-10-20
4   event4 2011-04-08
5   event5 2011-05-10
6   event6 2011-11-17
7   event7 2011-11-02
8   event8 2011-02-22
9   event9 2011-04-29

Problem description

Hello everyone. The main reason I am submitting this as a bug is that I spent about 2 hours on a wrong analysis because of the behaviour I explain below.

As you can see above, the date column in the CSV file is in DD/MM/YYYY format. Now I know there exists a dayfirst flag, but here is the thing: the behaviour is not consistent across the rows being parsed. With the dayfirst flag set to False (the default), some rows are parsed as being in MM/DD/YYYY format (e.g. rows 1 and 2), while others are parsed as being in DD/MM/YYYY format (e.g. rows 3 and 6). This is because rows 3 and 6 have a day number greater than 12, which cannot be a month, so they are parsed as DD/MM/YYYY.

I understand that many people will find this a minor issue and maybe it is, because there are ways to overcome this, but in my opinion the illustrated behaviour is wrong and counter-intuitive, because it hides under the hood the fact that dates are parsed per row, when the elements of a column of a DataFrame supposedly have a uniform data type.
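For reference, a minimal sketch of two ways to overcome this: either pass `dayfirst=True` so that ambiguous dates are read day-first, or load the column as plain strings and convert it with an explicit format through `pd.to_datetime`. The CSV text is the `events.csv` from the code sample, inlined with `io.StringIO` only so the snippet runs on its own; the names `CSV_TEXT`, `df_dayfirst` and `df_explicit` are illustrative, and details may differ between pandas versions.

```python
import io

import pandas as pd

# Same content as events.csv above, inlined so the snippet is self-contained.
CSV_TEXT = """id;name;date
1;event1;11/03/2011
2;event2;03/05/2011
3;event3;20/10/2011
4;event4;04/08/2011
5;event5;05/10/2011
6;event6;17/11/2011
7;event7;11/02/2011
8;event8;22/02/2011
9;event9;29/04/2011
"""

# Option 1: tell the parser that ambiguous dates are day-first.
df_dayfirst = pd.read_csv(
    io.StringIO(CSV_TEXT),
    index_col=0,
    sep=";",
    parse_dates=["date"],
    dayfirst=True,
)

# Option 2: read the column as strings, then convert with an explicit format.
# A value that does not match the format raises an error instead of being
# silently reinterpreted.
df_explicit = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0, sep=";")
df_explicit["date"] = pd.to_datetime(df_explicit["date"], format="%d/%m/%Y")

print(df_explicit)
```

With either option, row 1 should come out as 11 March 2011 rather than 3 November 2011. The explicit format is the stricter of the two, since it rejects any value that does not fit DD/MM/YYYY.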

So I think that pandas should at least give the user a warning about that (the easy way) or try to apply consistent parsing among the rows (the hard and possibly slow way).

Expected Output

Either infer the date format and apply uniform parsing over all rows or show a warning that some rows are being parsed with a different date format in mind.
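To make the expected behaviour concrete, here is a rough sketch of what "infer one format for the whole column, or warn" could look like. It is not how pandas works internally; `fits` and `infer_uniform_format` are hypothetical helpers written for this illustration, and the two candidate formats are an assumption that only covers the slash-separated dates from this report.

```python
import warnings
from datetime import datetime


def fits(value, fmt):
    """Return True if `value` parses under the strptime format `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_uniform_format(values, candidates=("%m/%d/%Y", "%d/%m/%Y")):
    """Pick one candidate format that parses every value in the column.

    Raises ValueError if no candidate fits all values, and warns if more
    than one does (the column is then ambiguous and the first one is used).
    """
    matching = [fmt for fmt in candidates if all(fits(v, fmt) for v in values)]
    if not matching:
        raise ValueError("no single candidate format parses every value")
    if len(matching) > 1:
        warnings.warn("ambiguous date format, using %r" % matching[0])
    return matching[0]


dates = [
    "11/03/2011", "03/05/2011", "20/10/2011", "04/08/2011", "05/10/2011",
    "17/11/2011", "11/02/2011", "22/02/2011", "29/04/2011",
]

fmt = infer_uniform_format(dates)
parsed = [datetime.strptime(d, fmt) for d in dates]
print(fmt)
```

For the dates in this report, only the day-first format fits every row (row 3 rules out month-first), so the whole column is parsed the same way. The cost is a full pass over the column per candidate format, which is the "hard and possibly slow way" mentioned above.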

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


chris-b1 · 2018-01-26T16:25:06Z

Thanks for the report, this is the same issue as #12585 - the fundamental problem is that we are calling out to dateutil and there isn't a way to enforce or even capture strictness there - see dateutil/dateutil#214.
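A small illustration of the dateutil behaviour in question, calling `dateutil.parser.parse` directly on two of the values from the report. The variable names are only for illustration; the first two results match the output shown in the original post.

```python
from datetime import datetime

from dateutil import parser

# Ambiguous value: read month-first by default.
first = parser.parse("11/03/2011")

# 20 cannot be a month, so the same call quietly reads this one day-first
# instead of raising an error.
third = parser.parse("20/10/2011")

# dayfirst=True changes how the ambiguous value is read.
forced = parser.parse("11/03/2011", dayfirst=True)

print(first.date(), third.date(), forced.date())
# 2011-11-03 2011-10-20 2011-03-11
```

Nothing in the return value says which interpretation was used, which is why the caller cannot tell that rows were read differently.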

So we are all for changing this, I think some work has been going on within dateutil to maybe make this possible, but I haven't followed that closely - any kind of PR to help would be welcome!

errikos changed the title ~~Inconsistent date parsing with read_csv~~ Inconsistent date parsing with read_csv Jan 25, 2018

chris-b1 closed this as completed Jan 26, 2018

chris-b1 added the Duplicate Report label Jan 26, 2018

chris-b1 added this to the No action milestone Jan 26, 2018
