Datetime column type could not be recognized in Spark #646
Comments
Do you get an error or unexpected output? Can you please show? Timestamps moved from "converted types" to "logical types" in this release; logical types have been part of the parquet spec for a long time. Maybe there's an option you need to enable in Spark to interpret them.
Hi @martindurant, the dataframe in pandas looks good. But in Spark it's totally different, and incompatible with the previous version, which means all user code reading these files would break with this release...
In fastparquet 0.6.x, the same Spark code works.
In the latest release of Spark, they are still using release 1.10.1 of org.apache.parquet: https://github.com/apache/spark/blob/v3.1.2/pom.xml#L138. I think that's the reason why it's incompatible with the latest parquet format, 2.9.0.
Have you tried `times='int96'`?
PS: if you can explicitly convert your time columns to ms or us resolution, that would work too, instead of int96. I don't actually know how to persuade pandas to do this.
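A minimal sketch of the `times='int96'` suggestion (the file name and data here are placeholders):

```python
import pandas as pd
import fastparquet

pdf = pd.DataFrame({"c": pd.to_datetime(["2021-07-25 08:38:29"])})

# times="int96" writes timestamps in the legacy int96 encoding that
# older parquet-mr readers (such as the one bundled with Spark 3.1)
# understand, instead of the int64 logical type used by default.
fastparquet.write("test.parquet", pdf, times="int96")
```

int96 timestamps are deprecated in the parquet spec, but they remain the encoding that older Spark readers expect, which is why this works around the issue.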
I tested the solution with `times='int96'`, it works now! But I didn't find a way to convert time columns to datetime[ms] or datetime[us] type... tried with `pd.to_datetime(pdf['c'], unit='ms')` and `pdf['c'].astype('datetime[ms')`, neither worked.
Right, pandas doesn't like times in anything other than ns, but I think it can be done somehow.
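(For later readers: pandas 2.0 and newer support non-nanosecond datetime resolutions directly, which was not yet the case when this thread was written. A minimal sketch:)

```python
import pandas as pd

pdf = pd.DataFrame({"c": pd.to_datetime(["2021-07-25 08:38:29"])})

# Requires pandas >= 2.0; older versions reject non-nanosecond
# datetime dtypes, as reported above.
pdf["c"] = pdf["c"].astype("datetime64[ms]")
print(pdf.dtypes)  # c    datetime64[ms]
```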
What happened:
We use fastparquet to write a pandas dataframe with datetime columns. When we then read the parquet file with Spark, all the datetime columns become 'bigint' type.
This worked in the older version (0.6.0) but breaks in the latest release, 0.7.0.
What you expected to happen:
The datetime columns should be read back as timestamp type in Spark.
Minimal Complete Verifiable Example:
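The original snippet was not preserved; the following is a minimal sketch that reproduces the output below, assuming a single datetime column `c`, default fastparquet write options, and a local Spark 3.1 session (the path `/tmp/test.parquet` is a placeholder):

```python
import pandas as pd
import fastparquet
from pyspark.sql import SparkSession

# pandas dataframe with one datetime column (nanosecond resolution)
pdf = pd.DataFrame({"c": pd.to_datetime(["2021-07-25 08:38:29"])})
print("pandas schema:", pdf.dtypes)

# fastparquet 0.7.0 writes the timestamp as an int64 logical type by default
fastparquet.write("/tmp/test.parquet", pdf)

# Spark 3.1 (parquet-mr 1.10.1) reads the column back as bigint
spark = SparkSession.builder.getOrCreate()
sdf = spark.read.parquet("/tmp/test.parquet")
print("spark schema:", sdf.dtypes)
```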
output:

```
pandas schema: c datetime64[ns]
spark schema: [('c', 'bigint')]
```
Anything else we need to know?:
Environment:
- fastparquet version: 0.7.0
- Spark version: 3.1.2