[SPARK-22417][PYTHON] Fix for createDataFrame from pandas.DataFrame with timestamp #19646
Conversation
@ueshin and @HyukjinKwon this allows Spark to read non-arrow Pandas timestamps as TimestampType instead of long values, but there are a couple of things to note. I did the conversion with numpy because we can not make changes to the input pandas.DataFrame and making a copy is too expensive. When

retest this please

Test build #83361 has finished for PR 19646 at commit
python/pyspark/sql/session.py
Outdated
        has_rec_fix = True
    record_type_list.append((str(col_names[i]), curr_type))
if not has_rec_fix:
    record_type_list = None
Shall we put this conversion into an internal method?
Yeah, probably a good idea. I'll see if I can clean it up some. Thanks @viirya !
python/pyspark/sql/session.py
Outdated
col_names = cur_dtypes.names
record_type_list = []
has_rec_fix = False
for i in xrange(len(cur_dtypes)):
oh, session.py doesn't define xrange for Python 3.
Ooops, I forgot about that. thx!
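For context, the usual Python 2/3 compatibility shim used in other PySpark modules looks roughly like this (shown for illustration only, not quoting session.py):

import sys

# Python 3 has no xrange builtin; alias it to range so code written against
# the Python 2 name keeps working on both versions.
if sys.version >= '3':
    xrange = range

for i in xrange(3):
    print(i)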
retest this please
python/pyspark/sql/session.py
Outdated
# If type is a datetime64 timestamp, convert to microseconds
# NOTE: if dtype is M8[ns] then np.record.tolist() will output values as longs,
# this conversion will lead to an output of py datetime objects, see SPARK-22417
if curr_type == np.dtype('M8[ns]'):
Isn't this datetime64[ns]? What's the difference between M8[ns] and datetime64[ns]?
There shouldn't be any difference for the most part. I only used M8 here because when debugging these types, that is what was being output for the record types by numpy.record.dtype. Would you prefer datetime64 if that works?
Yes, I'd prefer it if that works, otherwise I'd like you to add some comments saying we should use M8[ns] instead of datetime64[ns].
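For reference, the two spellings name the same dtype in numpy ('M8' is just the character code for datetime64); a quick standalone check, shown only for illustration:

import numpy as np

# The shorthand 'M8[ns]' and the full name 'datetime64[ns]' compare equal.
assert np.dtype('M8[ns]') == np.dtype('datetime64[ns]')
print(np.dtype('M8[ns]'))  # prints: datetime64[ns]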
retest this please
from datetime import datetime
pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)],
                    "d": [pd.Timestamp.now().date()]})
df = self.spark.createDataFrame(pdf)
What if we specify the schema? For example:
df = self.spark.createDataFrame(pdf, "ts timestamp, d date")
I was checking this PR and ran this out of curiosity. I got:

import pandas as pd
from datetime import datetime
pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)], "d": [pd.Timestamp.now().date()]})
spark.createDataFrame(pdf, "d date, ts timestamp")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 587, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 401, in _createFromLocal
    data = list(data)
  File "/.../spark/python/pyspark/sql/session.py", line 567, in prepare
    verify_func(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1411, in verify
    verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1392, in verify_struct
    verifier(v)
  File "/.../spark/python/pyspark/sql/types.py", line 1411, in verify
    verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1405, in verify_default
    verify_acceptable_types(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1300, in verify_acceptable_types
    % (dataType, obj, type(obj))))
TypeError: field ts: TimestampType can not accept object 1509411661000000000L in type <type 'long'>
Seems we should fix this one too.
Yes, looks like that needs to be fixed also. I thought it was working when schema was supplied, but I'll double-check and add that into the tests.
Test build #83380 has finished for PR 19646 at commit
python/pyspark/sql/session.py
Outdated
:return tuple of list of records and schema
"""
# Convert pandas.DataFrame to list of numpy records
np_records = pdf.to_records(index=False)
after toRecords, what's the type of timestamp value? python datetime?
I got:

>>> pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}).to_records(index=False)[0].tolist()[0]
1509411661000000000L
>>> pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}).to_records(index=False)[0][0]
numpy.datetime64('2017-10-31T01:01:01.000000000')

whereas:

>>> pd.DataFrame({"d": [pd.Timestamp.now().date()]}).to_records(index=False)[0].tolist()[0]
datetime.date(2017, 11, 3)
>>> pd.DataFrame({"d": [pd.Timestamp.now().date()]}).to_records(index=False)[0][0]
datetime.date(2017, 11, 3)
thanks! I also tried the data type:
>>> pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}).dtypes
ts datetime64[ns]
dtype: object
>>> pd.DataFrame({"d": [pd.Timestamp.now().date()]}).dtypes
d object
dtype: object
to_records makes a numpy array of numpy records, and the timestamp dtype is datetime64. Calling tolist() on a record converts everything to a list of python objects.
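To make the long-vs-datetime behavior concrete, here is a small standalone check (illustrative only, not code from this PR) of what tolist() returns at nanosecond versus microsecond resolution:

import numpy as np

ts = np.datetime64('2017-10-31T01:01:01')

# At nanosecond resolution, tolist() falls back to a plain integer of
# nanoseconds since the epoch, which Spark then infers as LongType.
print(np.datetime64(ts, 'ns').tolist())  # 1509411661000000000

# At microsecond resolution, tolist() returns a datetime.datetime,
# which Spark infers as TimestampType.
print(np.datetime64(ts, 'us').tolist())  # datetime.datetime(2017, 10, 31, 1, 1, 1)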
@BryanCutler is this tagged to the right JIRA?
python/pyspark/sql/session.py
Outdated
    data = [schema.toInternal(row) for row in data]
    return self._sc.parallelize(data), schema

def _getNumpyRecordDtypes(self, rec):
nit: _getNumpyRecordDtypes -> _get_numpy_record_dtypes.
I know it's confusing .. but I usually use thisNamingRule mainly for API consistency and otherwise use this_naming_rule. I actually checked and read the documentation and other code a few times to clarify this for myself .. I believe this_naming_rule is preferred by PEP 8.
But I know that the doc says:
mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility.
but .. I believe we should avoid thisNamingRule if it's for internal use in particular and/or unrelated to compatibility. To my knowledge, threading.py is a similar case ...
yeah, I agree we should be using lowercase with underscores which is more of the convention for python. I was only using this format to stay consistent with the rest of the file, but I can change it. Just for the new methods right?
python/pyspark/sql/session.py
Outdated
    record_type_list.append((str(col_names[i]), curr_type))
    return record_type_list if has_rec_fix else None

def _convertFromPandas(self, pdf, schema):
ditto for naming
I think this should be linked to SPARK-22417.
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        schema = [str(x) for x in data.columns]
    data = [r.tolist() for r in data.to_records(index=False)]
seems r.tolist is the problem, how about [r[i] for i in xrange(r.size)]? Then we can get numpy.datetime64:

>>> pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}).to_records(index=False)[0].tolist()[0]
1509411661000000000L
>>> pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}).to_records(index=False)[0][0]
numpy.datetime64('2017-10-31T02:01:01.000000000+0100')
but ... numpy.datetime64 is not supported in createDataFrame IIUC:

import pandas as pd
from datetime import datetime
pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]})
print [[v for v in r] for r in pdf.to_records(index=False)]
spark.createDataFrame([[v for v in r] for r in pdf.to_records(index=False)])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 591, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 404, in _createFromLocal
    struct = self._inferSchemaFromList(data)
  File "/.../spark/python/pyspark/sql/session.py", line 336, in _inferSchemaFromList
    schema = reduce(_merge_type, map(_infer_schema, data))
  File "/.../spark/python/pyspark/sql/types.py", line 1095, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/.../spark/python/pyspark/sql/types.py", line 1072, in _infer_type
    raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <type 'numpy.datetime64'>
(It reminds me of SPARK-6857 BTW)
according to the ticket, seems we need to convert numpy.datetime64 to python datetime manually.
the problem is that nanosecond values can not be represented by a python datetime object, which only has microsecond resolution, so numpy converts them to longs. Numpy will convert microseconds and above to python datetime objects, which Spark will correctly infer.
according to the ticket, seems we need to convert numpy.datetime64 to python datetime manually.
This fix is just meant to convert nanosecond timestamps to microseconds so that calling tolist() can fit them in a python object. Does it seem ok to you guys to leave it at that scope for now?
# Check if any columns need to be fixed for Spark to infer properly
if len(np_records) > 0:
    record_type_list = self._get_numpy_record_dtypes(np_records[0])
can't we just use pdf.dtypes?

>>> pdf.dtypes[0]
dtype('int64')
>>> pdf.to_records(index=False)[0].dtype[0]
dtype('int64')

I think they are the same.
The dtype for a numpy record is in a different format:

In [16]: r
Out[16]: (0, datetime.date(2017, 11, 6), 1509411661000000000L)
In [17]: r.dtype
Out[17]: dtype((numpy.record, [(u'index', '<i8'), (u'd', 'O'), (u'ts', '<M8[ns]')]))

so when using numpy.record.astype() it has to be specified in the same format and include dtypes for all fields. If we try to do this with pandas dtypes from the DataFrame, there might be some differences that could cause errors, so I think it's safer to use the dtypes output from numpy and only change the timestamp resolution.
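As a rough sketch of that approach (simplified, with illustrative variable names; not the exact PR code), the nanosecond fields can be downcast to microseconds on the numpy side before calling tolist():

import numpy as np
import pandas as pd
from datetime import datetime

pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)],
                    "d": [pd.Timestamp.now().date()]})
np_records = pdf.to_records(index=False)

# Build a corrected record dtype, replacing any datetime64[ns] field with
# datetime64[us] so that tolist() yields datetime.datetime instead of long.
fixed_dtypes = [(name, 'datetime64[us]'
                 if np_records.dtype[name] == np.dtype('datetime64[ns]')
                 else np_records.dtype[name])
                for name in np_records.dtype.names]

# Cast each record with the corrected dtype and convert to plain Python objects.
rows = [r.astype(fixed_dtypes).tolist() for r in np_records]
print(rows)  # the timestamp is now a datetime.datetime at microsecond resolution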
if len(np_records) > 0:
    record_type_list = self._get_numpy_record_dtypes(np_records[0])
    if record_type_list is not None:
        return [r.astype(record_type_list).tolist() for r in np_records], schema
instead of doing this, we should call DataFrame.astype, which accepts a python dict. Then we can create a dict that maps column name to corrected dtype (only including columns that need the cast). We can also specify copy=False for better performance.
Then that would modify the input pandas.DataFrame from the user, which would be bad if they use it after this call. Making a copy of the DataFrame might not be good either if it is large.
ok let's copy it. Is it a valid idea to use DataFrame.astype?
I don't think this will increase performance; we will still have to iterate over each record and convert it to a list, in addition to making a copy of the timestamp data. Another issue is that using DataFrame.astype will truncate the resolution to microseconds, but Pandas will continue to store it as datetime64[ns]; see https://stackoverflow.com/a/32827472
This means that we have to change the conversion routine to separate all columns in the DataFrame and manually convert to rows of records instead of using to_records() and tolist().
I think it would be best to keep the casting on the numpy side, it's safer and keeps things simpler.
Test build #83508 has finished for PR 19646 at commit
    record_type_list.append((str(col_names[i]), curr_type))
    return record_type_list if has_rec_fix else None

def _convert_from_pandas(self, pdf, schema):
I guess we can remove the schema parameter from here because the schema doesn't affect the conversion now.
yes, it isn't used in the conversion any more, but I think it should still be here for the case when it's None and then assigned to a list of the pdf column names. That way we can keep all pandas related code in this method.
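A minimal sketch of that shape (illustrative only; the helper name and body are simplified, not the exact PR code):

import pandas as pd

def convert_from_pandas_sketch(pdf, schema=None):
    # Default the schema to the DataFrame's column names when none is given,
    # keeping the pandas-specific handling in one place.
    if schema is None:
        schema = [str(x) for x in pdf.columns]
    records = [r.tolist() for r in pdf.to_records(index=False)]
    return records, schema

pdf = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(convert_from_pandas_sketch(pdf))  # ([[1, 'x'], [2, 'y']], ['a', 'b'])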
LGTM, merging to master!
…ith timestamp

Currently, a pandas.DataFrame that contains a timestamp of type 'datetime64[ns]' when converted to a Spark DataFrame with `createDataFrame` will interpret the values as LongType. This fix will check for a timestamp type and convert it to microseconds which will allow Spark to read as TimestampType.

Added unit test to verify Spark schema is expected for TimestampType and DateType when created from pandas

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #19646 from BryanCutler/pyspark-non-arrow-createDataFrame-ts-fix-SPARK-22417.

(cherry picked from commit 1d34104)

Signed-off-by: Wenchen Fan <wenchen@databricks.com>
also backport to 2.2 since we consider it a bug.

Thanks @cloud-fan,
    df = self.spark.createDataFrame(data)
    self.assertEqual(df.first(), Row(longarray=[-9223372036854775808, 0, 9223372036854775807]))

@unittest.skipIf(not _have_pandas, "Pandas not installed")
Hi, @cloud-fan and @BryanCutler .
This seems to break branch-2.2, but has been hidden behind another SQL error. (cc @gatorsmile , @henryr)
Please see this.
cc @felixcheung since he is RM for 2.2.1.
Thanks @dongjoon-hyun for tracking this down. It looks like sql/tests.py for branch-2.2 is just missing the following:
_have_pandas = False
try:
import pandas
_have_pandas = True
except:
# No Pandas, but that's okay, we'll skip those tests
pass
This was probably added from an earlier PR in master and wasn't included when this one was cherry-picked.
I can add a patch a little bit later tonight unless someone is able to get to it first.
I can take it over. I'll submit a pr soon.
Thank you, @BryanCutler !
Great, @ueshin ! :)
Ah, in that case, maybe we need to revert one of the two original patches and fix one by one, or merge the two follow-ups into one as a hot-fix pr. cc @gatorsmile @cloud-fan
…rom pandas.DataFrame with timestamp

## What changes were proposed in this pull request?

This is a follow-up of #19646 for branch-2.2. The original pr breaks branch-2.2 because the cherry-picked patch doesn't include some code which exists in master.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #19704 from ueshin/issues/SPARK-22417_2.2/fup1.
What changes were proposed in this pull request?

Currently, a pandas.DataFrame that contains a timestamp of type 'datetime64[ns]', when converted to a Spark DataFrame with createDataFrame, will interpret the values as LongType. This fix will check for a timestamp type and convert it to microseconds, which will allow Spark to read it as TimestampType.

How was this patch tested?

Added unit test to verify the Spark schema is as expected for TimestampType and DateType when created from pandas.