Description
Code Sample, a copy-pastable example if possible
from datetime import datetime, date
import pandas as pd
df = pd.DataFrame([(1, "x", date.today(), datetime(2018, 1, 2, 18, 53))],
                  columns=["number", "string", "date", "datetime"])
df.dtypes
# number int64
# string object
# date object
# datetime datetime64[ns]
# dtype: object
df["date"][0]
# datetime.date(2018, 3, 9)
df.to_parquet("test.gz.parquet", compression='gzip', engine='pyarrow')
dfp = pd.read_parquet("test.gz.parquet")
dfp.dtypes
# number int64
# string object
# date datetime64[ns] <- !!!! Type change
# datetime datetime64[ns]
# dtype: object
However if I do:
import pyarrow.parquet as pq
pq.read_table("test.gz.parquet")
I get:
pyarrow.Table
number: int64
string: string
date: date32[day]
datetime: timestamp[ms]
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "number", "field_name": "number", "pandas_type": "int'
b'64", "numpy_type": "int64", "metadata": null}, {"name": "string"'
b', "field_name": "string", "pandas_type": "unicode", "numpy_type"'
b': "object", "metadata": null}, {"name": "date", "field_name": "d'
b'ate", "pandas_type": "date", "numpy_type": "object", "metadata":'
b' null}, {"name": "datetime", "field_name": "datetime", "pandas_t'
b'ype": "datetime", "numpy_type": "datetime64[ns]", "metadata": nu'
b'll}, {"name": null, "field_name": "__index_level_0__", "pandas_t'
b'ype": "int64", "numpy_type": "int64", "metadata": null}], "panda'
b's_version": "0.22.0"}'}
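The pandas metadata printed above still records the original type of every column ("pandas_type": "date" for the date column), so the columns that started out as dates can be picked out of it. A minimal sketch, assuming the metadata is available as a dict of bytes like the one shown (for example from pq.read_table(...).schema.metadata); original_date_columns is a hypothetical helper name, not part of pandas or pyarrow, and the sample dict is a trimmed-down stand-in for the real blob:

```python
import json


def original_date_columns(schema_metadata):
    """Return the names of columns whose recorded pandas_type is "date".

    schema_metadata is assumed to be a dict with a b"pandas" key holding
    the JSON blob that pandas stores in the parquet schema.
    """
    pandas_meta = json.loads(schema_metadata[b"pandas"].decode("utf-8"))
    return [
        col["name"]
        for col in pandas_meta["columns"]
        if col["pandas_type"] == "date"
    ]


# Trimmed-down stand-in for the metadata printed above (assumption: only the
# "name" and "pandas_type" fields matter for this lookup).
sample = {
    b"pandas": json.dumps({
        "columns": [
            {"name": "number", "pandas_type": "int64"},
            {"name": "string", "pandas_type": "unicode"},
            {"name": "date", "pandas_type": "date"},
            {"name": "datetime", "pandas_type": "datetime"},
        ]
    }).encode("utf-8")
}

print(original_date_columns(sample))
```

This only shows that the information needed to restore the type survives in the file; it does not change what pd.read_parquet returns.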
Problem description
pandas does not preserve the date type when reading a parquet file back.
When pandas reads a dataframe that originally had a date column, it converts that column to Timestamp (datetime64[ns]). This appears to be a pandas problem, since reading the same file with pyarrow returns the original, correct type (date32[day]).
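Until the round trip preserves the type, one possible workaround is to convert the affected columns back by hand after reading. A minimal sketch, assuming the caller already knows which columns were dates; restore_date_columns is a hypothetical helper, and the dataframe below is built in memory to stand in for the result of pd.read_parquet:

```python
from datetime import date, datetime

import pandas as pd


def restore_date_columns(df, columns):
    """Return a copy of df with the given datetime64 columns converted
    back to object columns holding datetime.date values."""
    out = df.copy()
    for name in columns:
        # Series.dt.date yields an object Series of datetime.date values.
        out[name] = out[name].dt.date
    return out


# Stand-in for the dataframe returned by pd.read_parquet in the report:
# the "date" column has already been widened to a datetime64 dtype.
dfp = pd.DataFrame({
    "date": pd.to_datetime(["2018-03-09"]),
    "datetime": [datetime(2018, 1, 2, 18, 53)],
})

restored = restore_date_columns(dfp, ["date"])
print(restored.dtypes)
print(repr(restored["date"].iloc[0]))
```

The conversion is lossy in one direction only: any time-of-day component in the listed columns is dropped, so it should be applied only to columns that really were dates.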
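Expected Output
After the round trip, the date column should still have object dtype and hold datetime.date values, as in the original dataframe.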
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-36-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: pt_BR.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None