[SPARK-22417][PYTHON] Fix for createDataFrame from pandas.DataFrame with timestamp #19646
Changes from all commits: cca6757, 839bb50, 3944b5c, 355edfc
@@ -23,6 +23,7 @@
if sys.version >= '3':
    basestring = unicode = str
    xrange = range
else:
    from itertools import imap as map
@@ -416,6 +417,50 @@ def _createFromLocal(self, data, schema):
        data = [schema.toInternal(row) for row in data]
        return self._sc.parallelize(data), schema
    def _get_numpy_record_dtypes(self, rec):
        """
        Used when converting a pandas.DataFrame to Spark using to_records(), this will correct
        the dtypes of records so they can be properly loaded into Spark.
        :param rec: a numpy record to check dtypes
        :return corrected dtypes for a numpy.record or None if no correction needed
        """
        import numpy as np
        cur_dtypes = rec.dtype
        col_names = cur_dtypes.names
        record_type_list = []
        has_rec_fix = False
        for i in xrange(len(cur_dtypes)):
            curr_type = cur_dtypes[i]
            # If type is a datetime64 timestamp, convert to microseconds
            # NOTE: if dtype is datetime[ns] then np.record.tolist() will output values as longs,
            # conversion from [us] or lower will lead to py datetime objects, see SPARK-22417
            if curr_type == np.dtype('datetime64[ns]'):
                curr_type = 'datetime64[us]'
                has_rec_fix = True
            record_type_list.append((str(col_names[i]), curr_type))
        return record_type_list if has_rec_fix else None
    def _convert_from_pandas(self, pdf, schema):
        """
        Convert a pandas.DataFrame to list of records that can be used to make a DataFrame
        :return tuple of list of records and schema
        """
        # If no schema supplied by user then get the names of columns only
        if schema is None:
            schema = [str(x) for x in pdf.columns]

        # Convert pandas.DataFrame to list of numpy records
        np_records = pdf.to_records(index=False)

        # Check if any columns need to be fixed for Spark to infer properly
        if len(np_records) > 0:
            record_type_list = self._get_numpy_record_dtypes(np_records[0])
|
Contributor: can't we just use …? I think they are the same.

Member (Author): The dtype for a numpy record is in a different format, so when using …
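As a side illustration (not part of the patch), the two dtype representations being compared here can be printed directly; the column names are arbitrary, and the printed output in the comments is approximate:

```python
import pandas as pd
from datetime import datetime

pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)], "x": [1]})

# Per-column dtypes on the pandas side
print(pdf.dtypes)
# ts    datetime64[ns]
# x              int64

# A single structured numpy dtype on the record side,
# roughly [('ts', '<M8[ns]'), ('x', '<i8')]
rec = pdf.to_records(index=False)[0]
print(rec.dtype)
```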
            if record_type_list is not None:
                return [r.astype(record_type_list).tolist() for r in np_records], schema
|
Contributor: Instead of doing this, we should call …

Member (Author): Then that would modify the input pandas.DataFrame from the user, which would be bad if they use it after this call. Making a copy of the DataFrame might not be good either if it is large.

Contributor: OK, let's copy it. Is it a valid idea to use …?

Member (Author): I don't think this will increase performance; we would still have to iterate over each record and convert it to a list, in addition to making a copy of the timestamp data. Another issue is that using that approach means we would have to change the conversion routine to separate all the columns in the DataFrame and manually convert them to rows of records instead of using to_records(). I think it would be best to keep the casting on the numpy side; it's safer and keeps things simpler.

        # Convert list of numpy records to python lists
        return [r.tolist() for r in np_records], schema

    @since(2.0)
    @ignore_unicode_prefix
    def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
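For readers who want to see the idea outside of Spark, here is a hedged, standalone sketch of the same conversion; the function name `convert_from_pandas` is illustrative only and not Spark's API. It mirrors the patch: cast any `datetime64[ns]` field to `datetime64[us]` on the numpy records so that `tolist()` yields Python datetime objects, while the caller's DataFrame is left untouched.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def convert_from_pandas(pdf):
    """Standalone sketch (illustrative only): pandas.DataFrame -> list of row tuples."""
    # to_records() copies the data out of the DataFrame, so the input is never modified
    np_records = pdf.to_records(index=False)

    if len(np_records) > 0:
        cur_dtypes = np_records[0].dtype
        fixed_dtypes = []
        needs_fix = False
        for name in cur_dtypes.names:
            field_type = cur_dtypes[name]
            # datetime64[ns] would tolist() to a long; [us] or coarser gives datetime objects
            if field_type == np.dtype('datetime64[ns]'):
                field_type = 'datetime64[us]'
                needs_fix = True
            fixed_dtypes.append((str(name), field_type))
        if needs_fix:
            return [r.astype(fixed_dtypes).tolist() for r in np_records]

    return [r.tolist() for r in np_records]


pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]})
print(convert_from_pandas(pdf))  # [(datetime.datetime(2017, 10, 31, 1, 1, 1),)]
print(pdf.dtypes["ts"])          # still datetime64[ns]; the input DataFrame is unchanged
```

Keeping the cast on the numpy records rather than on the DataFrame itself is the design point debated in the comments above.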
|
|
@@ -512,9 +557,7 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
        except Exception:
            has_pandas = False
        if has_pandas and isinstance(data, pandas.DataFrame):
-            if schema is None:
-                schema = [str(x) for x in data.columns]
-            data = [r.tolist() for r in data.to_records(index=False)]
|
Contributor: seems …

Member: but ...

import pandas as pd
from datetime import datetime
pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]})
print([[v for v in r] for r in pdf.to_records(index=False)])
spark.createDataFrame([[v for v in r] for r in pdf.to_records(index=False)])

Member: (It reminds me of SPARK-6857 BTW)

Contributor: According to the ticket, it seems we need to convert numpy.datetime64 to python datetime manually.

Member (Author): The problem is that nanosecond values cannot be converted to a python datetime object, which only has microsecond resolution, so numpy converts them to longs. Numpy will convert microseconds and above to python datetime objects, which Spark will correctly infer. This fix is just meant to convert nanosecond timestamps to microseconds so that the values convert to python datetime objects that Spark can infer correctly.
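To illustrate the numpy behavior described above (standard numpy semantics, independent of this patch):

```python
import numpy as np

ts = np.datetime64('2017-10-31T01:01:01')

# Nanosecond resolution cannot be represented by datetime.datetime,
# so numpy falls back to an integer count of nanoseconds since the epoch.
print(np.datetime64(ts, 'ns').tolist())  # 1509411661000000000

# Microsecond (or coarser) resolution converts to datetime.datetime,
# which Spark's schema inference maps to TimestampType.
print(np.datetime64(ts, 'us').tolist())  # datetime.datetime(2017, 10, 31, 1, 1, 1)
```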
+            data, schema = self._convert_from_pandas(data, schema)

        if isinstance(schema, StructType):
            verify_func = _make_type_verifier(schema) if verifySchema else lambda _: True
@@ -2592,6 +2592,21 @@ def test_create_dataframe_from_array_of_long(self):
        df = self.spark.createDataFrame(data)
        self.assertEqual(df.first(), Row(longarray=[-9223372036854775808, 0, 9223372036854775807]))

    @unittest.skipIf(not _have_pandas, "Pandas not installed")
Member: Hi, @cloud-fan and @BryanCutler. cc @felixcheung since he is RM for 2.2.1.

Member (Author): Thanks @dongjoon-hyun for tracking this down. It looks like sql/tests.py for branch-2.2 is just missing something that was probably added by an earlier PR in master and wasn't included when this one was cherry-picked.

Member (Author): I can add a patch a little bit later tonight unless someone is able to get to it first.

Member: I can take it over. I'll submit a PR soon.

Member: Thank you, @BryanCutler!

Member: Great, @ueshin! :)

Member: Ah, in that case, maybe we need to revert one of the two original patches and fix them one by one, or merge the two follow-ups into one as a hot-fix PR. cc @gatorsmile @cloud-fan
    def test_create_dataframe_from_pandas_with_timestamp(self):
        import pandas as pd
        from datetime import datetime
        pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)],
                            "d": [pd.Timestamp.now().date()]})
        # test types are inferred correctly without specifying schema
        df = self.spark.createDataFrame(pdf)
        self.assertTrue(isinstance(df.schema['ts'].dataType, TimestampType))
        self.assertTrue(isinstance(df.schema['d'].dataType, DateType))
        # test with schema will accept pdf as input
        df = self.spark.createDataFrame(pdf, schema="d date, ts timestamp")
        self.assertTrue(isinstance(df.schema['ts'].dataType, TimestampType))
        self.assertTrue(isinstance(df.schema['d'].dataType, DateType))

class HiveSparkSubmitTests(SparkSubmitTests):
Reviewer: I guess we can remove the schema parameter from here because the schema doesn't affect the conversion now.

Member (Author): Yes, it isn't used in the conversion any more, but I think it should still be here for the case when it's None and then assigned to a list of the pdf column names. That way we can keep all pandas related code in this method.