Conversation

@JasonMWhite

What changes were proposed in this pull request?

Cast the output of TimestampType.toInternal to long to allow for proper Timestamp creation in DataFrames near the epoch.
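Concretely, the change casts the value returned by toInternal to long so that Py4J hands the JVM a Long rather than a 32-bit Int. A minimal standalone sketch, in the form the patch takes after the review below (the real code is a method on TimestampType in pyspark/sql/types.py; the function name here is just for illustration, and the long built-in is the Python 2 form):

import calendar
import time

def timestamp_to_internal(dt):
    # Microseconds since the epoch; the long(...) cast keeps Py4J from
    # shipping near-epoch values to the JVM as a 32-bit Int.
    if dt is not None:
        seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                   else time.mktime(dt.timetuple()))
        return long(seconds) * 1000000 + dt.microsecond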

How was this patch tested?

Added a new test that fails without the change.

@dongjoon-hyun @davies Mind taking a look?

The contribution is my original work and I license the work to the project under the project’s open source license.

@dongjoon-hyun
Member

Retest this please.

def test_datetime_at_epoch(self):
    # Regression test: a timestamp at the epoch should survive the round trip.
    epoch = datetime.datetime.fromtimestamp(0)
    df = self.spark.createDataFrame([Row(date=epoch)])
    self.assertEqual(df.first()['date'], epoch)
Member

So, before this patch, df.first() is Row(None) in this case?

Member


Can we make a test case in class DataTypeTests(unittest.TestCase) instead?

Author

Yes, before this patch, df.first() is Row(None).

I tried putting it in DataTypeTests first, but it was difficult to get a reasonable failing test case there. Python ints go up to 2^63 on 64-bit systems, so the value doesn't overflow to long on the Python side. The issue is that Scala Ints are 32-bit, so Py4J is the part that has to convert the value to a long.

We could put the test there, but it doesn't really capture the issue IMO.
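To put a number on that boundary (illustrative only, not part of the patch):

# A JVM Int tops out at 2**31 - 1. Measured in microseconds, that covers
# only about 35.8 minutes on either side of the epoch, which is why only
# near-epoch timestamps hit this path.
MAX_INT = 2 ** 31 - 1
print(MAX_INT / 1000000.0 / 60)  # ~35.79 minutes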

@dongjoon-hyun
Member

Hi, @davies .
Could you review this PR?

         seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                    else time.mktime(dt.timetuple()))
-        return int(seconds) * 1000000 + dt.microsecond
+        return long(int(seconds) * 1000000 + dt.microsecond)
Contributor

Could you just replace the int with long?

@davies
Contributor

davies commented Feb 15, 2017

Just one minor comment

@JasonMWhite
Author

Modified as suggested. Don't think this has been through CI at all yet.

@JasonMWhite
Author

Ping @davies

@JasonMWhite
Author

This PR is pretty tiny, and corrects a problem that led to corrupt Parquet files in our case. Can anyone spare a minute to review?

         seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                    else time.mktime(dt.timetuple()))
-        return int(seconds) * 1000000 + dt.microsecond
+        return long(seconds) * 1000000 + dt.microsecond
Member

Yep. For me, it looks like every review comment has been applied.

@dongjoon-hyun
Member

+1 LGTM.
Could you review and merge this please, @davies ?

@davies
Contributor

davies commented Mar 7, 2017

lgtm, will merge it when I get a chance.

asfgit pushed a commit that referenced this pull request Mar 7, 2017
## What changes were proposed in this pull request?

Cast the output of `TimestampType.toInternal` to long to allow for proper Timestamp creation in DataFrames near the epoch.

## How was this patch tested?

Added a new test that fails without the change.

dongjoon-hyun davies Mind taking a look?

The contribution is my original work and I license the work to the project under the project’s open source license.

Author: Jason White <jason.white@shopify.com>

Closes #16896 from JasonMWhite/SPARK-19561.

(cherry picked from commit 6f46846)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@davies
Contributor

davies commented Mar 7, 2017

Merged into master and 2.1 branch.

@asfgit closed this in 6f46846 on Mar 7, 2017
@cloud-fan
Contributor

This PR didn't go through Jenkins and broke the build. I've reverted it from master and branch 2.1.

@JasonMWhite can you submit a new PR please? thanks

@davies
Contributor

davies commented Mar 8, 2017

My bad, did not realize that, sorry.

@dongjoon-hyun
Member

Sorry. I didn't realize it either.

asfgit pushed a commit that referenced this pull request Mar 9, 2017
## What changes were proposed in this pull request?

Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.

These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.

Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.

## How was this patch tested?

Added a new PySpark-side test that fails without the change.

The contribution is my original work and I license the work to the project under the project’s open source license.

Resubmission of #16896. The original PR didn't go through Jenkins and broke the build. davies dongjoon-hyun

cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.

Author: Jason White <jason.white@shopify.com>

Closes #17200 from JasonMWhite/SPARK-19561.

(cherry picked from commit 206030b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
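To make the range described in the commit message above concrete, here is a small illustrative check (the helper name is made up for this sketch and is not part of the patch):

import calendar
import datetime

def micros_since_epoch(dt):
    # Simplified, UTC-only stand-in for TimestampType.toInternal.
    return calendar.timegm(dt.utctimetuple()) * 1000000 + dt.microsecond

v = micros_since_epoch(datetime.datetime(1970, 1, 1, 0, 10))  # 10 minutes after the epoch
print(-2**31 <= v <= 2**31 - 1)  # True: Py4J would have sent this value as a Java Int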
ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 9, 2017