
Datetime column type could not be recognized in Spark #646

Closed · zijie0 opened this issue Jul 22, 2021 · 7 comments

zijie0 commented Jul 22, 2021

What happened:

We use fastparquet to write a pandas DataFrame with datetime columns. When we then read the parquet file with Spark, all the datetime columns come back as 'bigint' type.

It worked in the previous version (0.6.0), but breaks in the latest release, 0.7.0.

What you expected to happen:

The columns should be read back as timestamp type in Spark.

Minimal Complete Verifiable Example:

import pyspark
import pandas as pd

# Write a single datetime column with fastparquet
pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])
pdf.to_parquet('tmp.parquet', engine='fastparquet')
print(pdf.dtypes)

# Read the same file back with Spark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
sdf = spark.read.format('parquet').load('tmp.parquet')
print(sdf.dtypes)

output:
pandas schema: c datetime64[ns]
spark schema: [('c', 'bigint')]

Anything else we need to know?:

Environment:

  • fastparquet version: 0.7.0
  • Spark version: 3.0.1
  • Dask version: N/A
  • Python version: 3.7.9
  • Operating System: CentOS 7.6
  • Install method (conda, pip, source): pip/conda
@martindurant (Member)

Do you get an error or unexpected output? Can you please show?

Timestamps moved from "converted types" to "logical types" in this release; logical types have been part of the parquet spec for a long time. Maybe there's an option you need to enable in Spark to interpret them.
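One way to see what annotation fastparquet actually wrote is to inspect the file's parquet schema. A minimal sketch using fastparquet's ParquetFile on the tmp.parquet file from the example above; the exact schema-text attribute is an assumption and may differ between fastparquet versions:

from fastparquet import ParquetFile

pf = ParquetFile('tmp.parquet')
# Textual parquet schema, showing how column 'c' is annotated
# (old converted type TIMESTAMP_MICROS vs. the newer TIMESTAMP logical type)
print(pf.schema.text)
# Pandas-level dtypes as fastparquet would reconstruct them
print(pf.dtypes)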

zijie0 (Author) commented Jul 23, 2021

Hi @martindurant ,

The dataframe in pandas looks good:

In [7]: pdf
Out[7]:
           c
0 2021-01-01

But in Spark it's totally different and incompatible with the previous version, which means all existing user code would be broken by this release...

In [9]: sdf.show()
+-------------------+
|                  c|
+-------------------+
|1609459200000000000|
+-------------------+

In fastparquet 0.6.x, the Spark code works:

In [11]: sdf.show()
+-------------------+
|                  c|
+-------------------+
|2021-01-01 08:00:00|
+-------------------+

zijie0 (Author) commented Jul 23, 2021

In the latest release of Spark, they are still using release 1.10.1 of org.apache.parquet: https://github.com/apache/spark/blob/v3.1.2/pom.xml#L138

I think that's the reason why it's incompatible with the latest parquet format, 2.9.0.

@martindurant (Member)

Have you tried times='int96' when writing? The previous behaviour was truncating pandas' ns-resolution timestamps to us, which was also unfortunate.
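For reference, the times option can be passed straight through pandas' to_parquet, which forwards extra keyword arguments to fastparquet.write. A minimal sketch based on the example above:

import pandas as pd

pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])
# 'times' is forwarded to fastparquet.write(); int96 is the legacy
# Spark/Impala-style timestamp encoding that Spark 3.x still reads as timestamp
pdf.to_parquet('tmp.parquet', engine='fastparquet', times='int96')

Calling fastparquet.write('tmp.parquet', pdf, times='int96') directly achieves the same thing.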

@martindurant (Member)

PS: if you can explicitly convert your time columns to ms or us resolution, this would work too, instead of int96. I don't actually know how to persuade pandas to do this.

zijie0 (Author) commented Jul 25, 2021

I tested the solution with times='int96' and it works now!

But I didn't find a way to convert the time columns to datetime64[ms] or datetime64[us] type... I tried pd.to_datetime(pdf['c'], unit='ms') and pdf['c'].astype('datetime64[ms]'), and neither worked.
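For completeness: at the time of this thread pandas only supported nanosecond-resolution datetime64 columns, which is why those casts did not help. A minimal sketch assuming pandas 2.0 or later, where non-nanosecond resolutions became supported (not available when this issue was filed):

import pandas as pd

pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])
# Requires pandas >= 2.0; on pandas 1.x the column cannot be cast away from ns
pdf['c'] = pdf['c'].astype('datetime64[us]')
print(pdf.dtypes)  # c    datetime64[us]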

martindurant (Member) commented Jul 25, 2021 via email
