[data] date32 and datetime64 handling should be the same #41358

dromare · 2023-11-23T15:35:34Z

What happened + What you expected to happen

ray.data.read_csv() reads the date column as a timestamp when the .csv file uses the MM/DD/YYYY date format and a matching timestamp parser is supplied. to_modin() then converts the column to datetime64, which is the expected behavior (the same end result as reading the file directly with pandas.read_csv()).

However, when the .csv file has dates in YYYY-MM-DD (ISO) format, the column is read as date32, and to_modin() converts it to object. An object column is what I would expect only if the format had been specified incorrectly, in line with the pandas.read_csv() documentation: "If a column or index cannot be represented as an array of datetime, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type."

Versions / Dependencies

ray 2.8.0

INSTALLED VERSIONS

commit : 2a953cf80b77e4348bf50ed724f8abc0d814d9dd
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.128-microsoft-standard
Version : #1 SMP Tue Jun 23 12:58:10 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.3
numpy : 1.25.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 23.0.1
Cython : None
pytest : 7.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : None
matplotlib : 3.8.0
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Reproduction script

from pyarrow.csv import ConvertOptions
import ray
from ray.data import read_csv
ray.init()

1. THIS WORKS:

Content of demo_5_fu_20_obs_latam_dates.csv

store_id,product_id,date,sales
S1,P1,01/01/2023,5.1
S1,P1,01/08/2023,5.2
...

path = 'demo_5_fu_20_obs_latam_dates.csv'
timestamp_parser = ['%m/%d/%Y']

convert_options = ConvertOptions(timestamp_parsers=timestamp_parser) 
data = read_csv(path, convert_options=convert_options)
data.schema()

Column Type

store_id string
product_id string
date timestamp[s]
sales double

modin_data = data.to_modin()
modin_data.dtypes

store_id object
product_id object
date datetime64[s]
sales float64

2. THIS DOES NOT WORK:

Content of demo_5_fu_20_obs.csv

store_id,product_id,date,sales
S1,P1,2023-01-01,5.1
S1,P1,2023-01-08,5.2

path = 'demo_5_fu_20_obs.csv'
timestamp_parser = ['%Y-%m-%d']

convert_options = ConvertOptions(timestamp_parsers=timestamp_parser)
data = read_csv(path, convert_options=convert_options)
data.schema()

Column Type

store_id string
product_id string
date date32[day]
sales double

modin_data = data.to_modin()
modin_data.dtypes

store_id object
product_id object
date object
sales float64
...

Issue Severity

High: It blocks me from completing my task.

dromare added the bug and triage labels Nov 23, 2023

anyscalesam added the data label Nov 28, 2023

anyscalesam assigned scottjlee Nov 28, 2023

anyscalesam added the P1 label and removed the triage label Nov 28, 2023

richardliaw added the P3 label and removed the P1 label Dec 27, 2024

richardliaw changed the title ~~[<Ray component: ray.data.read_csv>]~~ [data] date32 and datetime64 handling should be the same Dec 27, 2024
