Pandas not preserving date type on reading back parquet #20089

Closed
euclides-filho opened this issue Mar 10, 2018 · 7 comments
Labels
Datetime (Datetime data dtype) · Dtype Conversions (Unexpected or buggy dtype conversions)

Comments

@euclides-filho

Code Sample, a copy-pastable example if possible

from datetime import datetime, date
import pandas as pd

df = pd.DataFrame([(1, "x", date.today(), datetime(2018, 1, 2, 18, 53))], 
                  columns=["number", "string", "date", "datetime"])
df.dtypes

# number               int64
# string              object
# date                object
# datetime    datetime64[ns]
# dtype: object

df["date"][0]
# datetime.date(2018, 3, 9)

df.to_parquet("test.gz.parquet", compression='gzip', engine='pyarrow')

dfp = pd.read_parquet("test.gz.parquet")
dfp.dtypes

# number               int64
# string              object
# date                datetime64[ns] <- !!!! Type change
# datetime    datetime64[ns]
# dtype: object

However if I do:

import pyarrow.parquet as pq
pq.read_table("test.gz.parquet")

I get:

pyarrow.Table
number: int64
string: string
date: date32[day]
datetime: timestamp[ms]
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "number", "field_name": "number", "pandas_type": "int'
            b'64", "numpy_type": "int64", "metadata": null}, {"name": "string"'
            b', "field_name": "string", "pandas_type": "unicode", "numpy_type"'
            b': "object", "metadata": null}, {"name": "date", "field_name": "d'
            b'ate", "pandas_type": "date", "numpy_type": "object", "metadata":'
            b' null}, {"name": "datetime", "field_name": "datetime", "pandas_t'
            b'ype": "datetime", "numpy_type": "datetime64[ns]", "metadata": nu'
            b'll}, {"name": null, "field_name": "__index_level_0__", "pandas_t'
            b'ype": "int64", "numpy_type": "int64", "metadata": null}], "panda'
            b's_version": "0.22.0"}'}

Problem description

Pandas is not preserving the date type when reading back parquet.
When pandas reads a dataframe that originally had a date-typed column, it converts it to Timestamp. This appears to be a pandas problem, since reading the same file with pyarrow returns the original, correct type.
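As a workaround (not part of the original report), the Timestamp column can be converted back to `datetime.date` objects after reading. A minimal pandas-only sketch:

```python
from datetime import date

import pandas as pd

# After pd.read_parquet, the "date" column comes back as datetime64[ns].
# Series.dt.date converts it back to Python datetime.date objects,
# stored in an object-dtype column, as it was before the round trip.
df = pd.DataFrame({"date": pd.to_datetime(["2018-03-09"])})
df["date"] = df["date"].dt.date

print(df["date"].dtype)  # object
print(df["date"][0])     # 2018-03-09, a datetime.date
```

Alternatively, with a sufficiently recent pyarrow, `pq.read_table(path).to_pandas(date_as_object=True)` keeps date columns as `datetime.date` objects when converting to pandas.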

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-36-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: pt_BR.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Mar 10, 2018

datetime.date is not a first-class datatype in pandas, so this is as expected. The conversion is actually done in pyarrow.
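To illustrate the point (this example is not from the thread): pandas has no dedicated dtype for calendar dates, so date values live in object-dtype columns, and any conversion to a native datetime dtype lands on datetime64[ns]:

```python
from datetime import date

import pandas as pd

# Dates are held as plain Python objects, not a native pandas dtype.
s = pd.Series([date(2018, 3, 9)])
print(s.dtype)                  # object

# The only native datetime dtype is datetime64[ns] (Timestamp).
print(pd.to_datetime(s).dtype)  # datetime64[ns]
```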

@jreback jreback closed this as completed Mar 10, 2018
@jreback jreback added the Datetime and Dtype Conversions labels Mar 10, 2018
@jreback jreback added this to the won't fix milestone Mar 10, 2018
@jreback
Contributor

jreback commented Mar 10, 2018

You could open an issue on the Arrow tracker, but the answer is likely to be the same.

@TomAugspurger TomAugspurger modified the milestones: won't fix, No action Jul 6, 2018
@JivanRoquet

datetime.date is not a first-class datatype in pandas, so this is as expected. The conversion is actually done in pyarrow.

I believe the whole point of this issue is to request first-class support for datetime.date in Pandas.

@simonjayhawkins
Member

we have #32473 for adding a date type

@yohplala

yohplala commented Sep 21, 2020

Hello,
There is another bug that your code demonstrates. It is linked, and may be considered the same issue.
When saving to parquet, datetimes are stored in 'ms', not 'ns' (this is expected, due to a pyarrow limitation).
So there is a mistake in the metadata pandas writes.

from datetime import datetime, date
import pandas as pd

df = pd.DataFrame([(1, "x", date.today(), datetime(2018, 1, 2, 18, 53))], 
                  columns=["number", "string", "date", "datetime"])
df.dtypes

# number               int64
# string              object
# date                object
# datetime    datetime64[ns]     # <- here
# dtype: object

df.to_parquet("test.parquet", engine='pyarrow')

import pyarrow.parquet as pq
pq.read_table("test.parquet")

pyarrow.Table
number: int64
string: string
date: date32[day]
datetime: timestamp[ms]          # <- Here, yes, normal
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "number", "field_name": "number", "pandas_type": "int'
            b'64", "numpy_type": "int64", "metadata": null}, {"name": "string"'
            b', "field_name": "string", "pandas_type": "unicode", "numpy_type"'
            b': "object", "metadata": null}, {"name": "date", "field_name": "d'
            b'ate", "pandas_type": "date", "numpy_type": "object", "metadata":'
            b' null}, {"name": "datetime", "field_name": "datetime", "pandas_t'
            b'ype": "datetime", "numpy_type": "datetime64[ns]", "metadata": nu'       # <-  numpy_type: [ns]: NO
            b'll}, {"name": null, "field_name": "__index_level_0__", "pandas_t'
            b'ype": "int64", "numpy_type": "int64", "metadata": null}], "panda'
            b's_version": "0.22.0"}'}

Should I open a new ticket, or can this one be re-opened?
Thanks,
Best

@anderl80

@yohplala we're struggling with the same issue here, is this still open/resolved?

@yohplala

yohplala commented Jun 24, 2021

@yohplala we're struggling with the same issue here, is this still open/resolved?

Hi @anderl80, I have switched to using fastparquet directly instead of pandas to write parquet files (fastparquet offers additional options for appending).

Following your question, I had a look at the pyarrow docs, and they now state:
Timestamps with nanoseconds can be stored without casting when using the more recent Parquet format version 2.0

So maybe the bug has been resolved naturally; I cannot say.

7 participants