
Datetime column type could not be recognized in Spark #646

Closed · zijie0 opened this issue Jul 22, 2021 · 7 comments

zijie0 commented Jul 22, 2021

What happened:

We use fastparquet to write a pandas DataFrame with datetime columns. When we then read the parquet file with Spark, all the datetime columns come back as 'bigint' type.

It worked in the previous version (0.6.0), but breaks in the latest release, 0.7.0.

What you expected to happen:

The columns should be read back as timestamp type in Spark.

Minimal Complete Verifiable Example:

import pyspark
import pandas as pd

# Write a single datetime column with fastparquet
pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])
pdf.to_parquet('tmp.parquet', engine='fastparquet')
print(pdf.dtypes)

# Read the same file back with Spark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
sdf = spark.read.format('parquet').load('tmp.parquet')
print(sdf.dtypes)

output:
pandas schema: c datetime64[ns]
spark schema: [('c', 'bigint')]

Anything else we need to know?:

Environment:

  • fastparquet version: 0.7.0
  • Spark version: 3.0.1
  • Dask version: N/A
  • Python version: 3.7.9
  • Operating System: CentOS 7.6
  • Install method (conda, pip, source): pip/conda
@martindurant (Member)

Do you get an error or unexpected output? Can you please show?

Timestamps moved from "converted types" to "logical types" in this release; logical types have been part of the parquet spec for a long time. Maybe there's an option you need to enable in Spark to interpret them.
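One way to see what annotation fastparquet actually wrote is to inspect the file's parquet schema. A minimal sketch using fastparquet's ParquetFile on the tmp.parquet file from the example above; the exact schema-text attribute is an assumption and may differ between fastparquet versions:

from fastparquet import ParquetFile

pf = ParquetFile('tmp.parquet')
# Textual parquet schema, showing how column 'c' is annotated
# (old converted type TIMESTAMP_MICROS vs. the newer TIMESTAMP logical type)
print(pf.schema.text)
# Pandas-level dtypes as fastparquet would reconstruct them
print(pf.dtypes)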

zijie0 (Author) commented Jul 23, 2021

Hi @martindurant ,

The dataframe in pandas looks good:

In [7]: pdf
Out[7]:
           c
0 2021-01-01

But in Spark it's totally different and incompatible with the previous version, which means all existing user code would be broken by this release...

In [9]: sdf.show()
+-------------------+
|                  c|
+-------------------+
|1609459200000000000|
+-------------------+

In fastparquet 0.6.x, the Spark code works:

In [11]: sdf.show()
+-------------------+
|                  c|
+-------------------+
|2021-01-01 08:00:00|
+-------------------+

zijie0 (Author) commented Jul 23, 2021

In the latest release of Spark, they are still using release 1.10.1 of org.apache.parquet: https://github.com/apache/spark/blob/v3.1.2/pom.xml#L138

I think that's the reason why it's incompatible with the latest parquet format, 2.9.0.

@martindurant (Member)

Have you tried times='int96' when writing? The previous behaviour was truncating pandas' ns-resolution timestamps to us, which was also unfortunate.
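For reference, the times option can be passed straight through pandas' to_parquet, which forwards extra keyword arguments to fastparquet.write. A minimal sketch based on the example above:

import pandas as pd

pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])
# 'times' is forwarded to fastparquet.write(); int96 is the legacy
# Spark/Impala-style timestamp encoding that Spark 3.x still reads as timestamp
pdf.to_parquet('tmp.parquet', engine='fastparquet', times='int96')

Calling fastparquet.write('tmp.parquet', pdf, times='int96') directly achieves the same thing.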

@martindurant (Member)

PS: if you can explicitly convert your time columns to ms or us resolution, this would work too, instead of int96. I don't actually know how to persuade pandas to do this.

zijie0 (Author) commented Jul 25, 2021

I tested the solution with times='int96' and it works now!

But I didn't find a way to convert the time columns to datetime64[ms] or datetime64[us] type... I tried pd.to_datetime(pdf['c'], unit='ms') and pdf['c'].astype('datetime64[ms]'), and neither worked.
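For completeness: at the time of this thread pandas only supported nanosecond-resolution datetime64 columns, which is why those casts did not help. A minimal sketch assuming pandas 2.0 or later, where non-nanosecond resolutions became supported (not available when this issue was filed):

import pandas as pd

pdf = pd.DataFrame([[pd.to_datetime('2021-01-01')]], columns=['c'])
# Requires pandas >= 2.0; on pandas 1.x the column cannot be cast away from ns
pdf['c'] = pdf['c'].astype('datetime64[us]')
print(pdf.dtypes)  # c    datetime64[us]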

martindurant (Member) commented Jul 25, 2021 via email
