Conversation

@JasonMWhite

What changes were proposed in this pull request?

Cast the output of TimestampType.toInternal to long to allow for proper Timestamp creation in DataFrames near the epoch.
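Concretely, the change casts the value returned by toInternal to long so that Py4J hands the JVM a Long rather than a 32-bit Int. A minimal standalone sketch, in the form the patch takes after the review below (the real code is a method on TimestampType in pyspark/sql/types.py; the function name here is just for illustration, and the long built-in is the Python 2 form):

import calendar
import time

def timestamp_to_internal(dt):
    # Microseconds since the epoch; the long(...) cast keeps Py4J from
    # shipping near-epoch values to the JVM as a 32-bit Int.
    if dt is not None:
        seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                   else time.mktime(dt.timetuple()))
        return long(seconds) * 1000000 + dt.microsecond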

How was this patch tested?

Added a new test that fails without the change.

@dongjoon-hyun @davies Mind taking a look?

The contribution is my original work and I license the work to the project under the project’s open source license.

@dongjoon-hyun
Member

Retest this please.

def test_datetime_at_epoch(self):
    # Regression test: a timestamp at the epoch should survive the round trip.
    epoch = datetime.datetime.fromtimestamp(0)
    df = self.spark.createDataFrame([Row(date=epoch)])
    self.assertEqual(df.first()['date'], epoch)
Member

So, before this patch, df.first() is Row(None) in this case?

Member


Can we make a test case in class DataTypeTests(unittest.TestCase) instead?

Author

Yes, before this patch, df.first() is Row(None).

I tried putting it in DataTypeTests first, but it was difficult to get a reasonable failing test case there. Python ints go up to 2^63 on 64-bit systems, so the value doesn't overflow to long on the Python side. The issue is that Scala Ints are 32-bit, so Py4J is the part that has to convert the value to a long.

We could put the test there, but it doesn't really capture the issue IMO.
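To put a number on that boundary (illustrative only, not part of the patch):

# A JVM Int tops out at 2**31 - 1. Measured in microseconds, that covers
# only about 35.8 minutes on either side of the epoch, which is why only
# near-epoch timestamps hit this path.
MAX_INT = 2 ** 31 - 1
print(MAX_INT / 1000000.0 / 60)  # ~35.79 minutes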

@dongjoon-hyun
Member

Hi, @davies .
Could you review this PR?

         seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                    else time.mktime(dt.timetuple()))
-        return int(seconds) * 1000000 + dt.microsecond
+        return long(int(seconds) * 1000000 + dt.microsecond)
Contributor

Could you just replace the int with long?

@davies
Contributor

davies commented Feb 15, 2017

Just one minor comment

@JasonMWhite
Author

Modified as suggested. Don't think this has been through CI at all yet.

@JasonMWhite
Author

Ping @davies

@JasonMWhite
Author

This PR is pretty tiny, and corrects a problem that led to corrupt Parquet files in our case. Can anyone spare a minute to review?

         seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                    else time.mktime(dt.timetuple()))
-        return int(seconds) * 1000000 + dt.microsecond
+        return long(seconds) * 1000000 + dt.microsecond
Member

Yep. For me, it looks like every review comment has been applied.

@dongjoon-hyun
Member

+1 LGTM.
Could you review and merge this please, @davies ?

@davies
Contributor

davies commented Mar 7, 2017

lgtm, will merge it when I get a chance.

asfgit pushed a commit that referenced this pull request Mar 7, 2017
## What changes were proposed in this pull request?

Cast the output of `TimestampType.toInternal` to long to allow for proper Timestamp creation in DataFrames near the epoch.

## How was this patch tested?

Added a new test that fails without the change.

dongjoon-hyun davies Mind taking a look?

The contribution is my original work and I license the work to the project under the project’s open source license.

Author: Jason White <jason.white@shopify.com>

Closes #16896 from JasonMWhite/SPARK-19561.

(cherry picked from commit 6f46846)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@davies
Contributor

davies commented Mar 7, 2017

Merged into master and 2.1 branch.

@asfgit closed this in 6f46846 on Mar 7, 2017
@cloud-fan
Contributor

This PR didn't go through Jenkins and broke the build. I've reverted it from master and branch 2.1.

@JasonMWhite can you submit a new PR please? thanks

@davies
Contributor

davies commented Mar 8, 2017

My bad, did not realize that, sorry.

@dongjoon-hyun
Member

Sorry. I didn't realize it either.

asfgit pushed a commit that referenced this pull request Mar 9, 2017
## What changes were proposed in this pull request?

Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.

These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.

Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.

## How was this patch tested?

Added a new PySpark-side test that fails without the change.

The contribution is my original work and I license the work to the project under the project’s open source license.

Resubmission of #16896. The original PR didn't go through Jenkins and broke the build. davies dongjoon-hyun

cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.

Author: Jason White <jason.white@shopify.com>

Closes #17200 from JasonMWhite/SPARK-19561.

(cherry picked from commit 206030b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
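To make the range described in the commit message above concrete, here is a small illustrative check (the helper name is made up for this sketch and is not part of the patch):

import calendar
import datetime

def micros_since_epoch(dt):
    # Simplified, UTC-only stand-in for TimestampType.toInternal.
    return calendar.timegm(dt.utctimetuple()) * 1000000 + dt.microsecond

v = micros_since_epoch(datetime.datetime(1970, 1, 1, 0, 10))  # 10 minutes after the epoch
print(-2**31 <= v <= 2**31 - 1)  # True: Py4J would have sent this value as a Java Int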
ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 9, 2017