[SPARK-19561][Python] cast TimestampType.toInternal output to long #16896
Conversation
Retest this please.
```python
def test_datetime_at_epoch(self):
    epoch = datetime.datetime.fromtimestamp(0)
    df = self.spark.createDataFrame([Row(date=epoch)])
    self.assertEqual(df.first()['date'], epoch)
```
So, before this patch, df.first() is Row(None) in this case?
Can we make a test case in class DataTypeTests(unittest.TestCase) instead?
Yes, before this patch, df.first() is Row(None).

I tried putting it in DataTypeTests first, but it was difficult to get a reasonable failing test case there. Python ints go up to 2^63 on 64-bit systems, so the value doesn't overflow to long on the Python side. The issue is that Scala's Int is 32-bit, so Py4J is the part that converts it to long. We could put the test there, but it doesn't really capture the issue IMO.
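A small sketch of the arithmetic behind that (not from the PR; `calendar.timegm` is used here for determinism, whereas the naive-datetime branch of `toInternal` actually uses `time.mktime`):

```python
import calendar
import datetime

# toInternal stores a datetime as microseconds since the Unix epoch.
dt = datetime.datetime(1970, 1, 1, 0, 10)  # ten minutes past the epoch
micros = calendar.timegm(dt.utctimetuple()) * 1000000 + dt.microsecond
print(micros)               # 600000000
print(micros <= 2**31 - 1)  # True -- fits in a 32-bit signed Int
# Py4J serializes Python ints that fit in 32 bits as java.lang.Integer,
# so any timestamp within ~35.8 minutes of the epoch (2**31 - 1 microseconds
# is about 2147 seconds) reaches the JVM as an Int rather than a Long.
```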
Hi, @davies.
python/pyspark/sql/types.py (outdated):
```diff
  seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
             else time.mktime(dt.timetuple()))
- return int(seconds) * 1000000 + dt.microsecond
+ return long(int(seconds) * 1000000 + dt.microsecond)
```
Could you just replace the `int` with `long`?
Just one minor comment.
Modified as suggested. Don't think this has been through CI at all yet.
Ping @davies
This PR is pretty tiny, and corrects a problem that led to corrupt Parquet files in our case. Can anyone spare a minute to review?
python/pyspark/sql/types.py:

```diff
  seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
             else time.mktime(dt.timetuple()))
- return int(seconds) * 1000000 + dt.microsecond
+ return long(seconds) * 1000000 + dt.microsecond
```
Yep. It looks to me like every review comment has been applied.
+1 LGTM.
LGTM, will merge it when I get a chance.
## What changes were proposed in this pull request?

Cast the output of `TimestampType.toInternal` to long to allow for proper Timestamp creation in DataFrames near the epoch.

## How was this patch tested?

Added a new test that fails without the change.

@dongjoon-hyun @davies Mind taking a look?

The contribution is my original work and I license the work to the project under the project's open source license.

Author: Jason White <jason.white@shopify.com>

Closes #16896 from JasonMWhite/SPARK-19561.

(cherry picked from commit 6f46846)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
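For context, a minimal reproduction of the bug this patch fixes, following the test discussed earlier in the thread (a sketch; the before/after outputs are as reported in this conversation, not re-verified):

```python
import datetime

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

epoch = datetime.datetime.fromtimestamp(0)
df = spark.createDataFrame([Row(date=epoch)])

# Before this patch: prints Row(date=None), because the epoch-adjacent
# internal value crossed Py4J as a 32-bit Int and was dropped on the JVM side.
# After this patch: prints the original epoch datetime.
print(df.first())
```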
Merged into master and 2.1 branch.
This PR didn't go through Jenkins and broke the build. I've reverted it from master and branch-2.1. @JasonMWhite, can you submit a new PR please? Thanks.
My bad, I did not realize that. Sorry.
Sorry, I didn't realize that either.
## What changes were proposed in this pull request?

Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int. These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.

Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python 3 does not have a `long` type, so this approach failed on Python 3.

## How was this patch tested?

Added a new PySpark-side test that fails without the change.

The contribution is my original work and I license the work to the project under the project's open source license.

Resubmission of #16896. The original PR didn't go through Jenkins and broke the build.

@davies @dongjoon-hyun @cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.

Author: Jason White <jason.white@shopify.com>

Closes #17200 from JasonMWhite/SPARK-19561.

(cherry picked from commit 206030b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
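For reference, a sketch of what the Scala-side handling described above might look like (simplified; the actual pattern match in `EvaluatePython.scala` contains many more cases, and the exact method shape here is an assumption):

```scala
import org.apache.spark.sql.types.{DataType, TimestampType}

// Simplified sketch of the Java-to-Catalyst conversion for timestamps.
// Catalyst stores TimestampType internally as a Long of microseconds.
def fromJava(obj: Any, dataType: DataType): Any = (obj, dataType) match {
  case (c: Long, TimestampType) => c        // already handled: Py4J sent a Long
  case (c: Int, TimestampType)  => c.toLong // the added case: widen Py4J Ints
  case _                        => obj      // other cases elided in this sketch
}
```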
What changes were proposed in this pull request?

Cast the output of `TimestampType.toInternal` to long to allow for proper Timestamp creation in DataFrames near the epoch.

How was this patch tested?
Added a new test that fails without the change.
@dongjoon-hyun @davies Mind taking a look?
The contribution is my original work and I license the work to the project under the project’s open source license.