[SPARK-16216][SQL] CSV data source supports custom date format for writing and timezone option for both reading and writing #13912
Changes from all commits
78785e0
79a290a
8930bf7
649ab85
0e1901e
3c1343f
d03e7a0
```diff
@@ -30,6 +30,7 @@ import org.apache.spark.rdd.RDD
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.execution.command.CreateDataSourceTableUtils
 import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory, PartitionedFile}
 import org.apache.spark.sql.types._
```
```diff
@@ -179,6 +180,12 @@ private[sql] class CsvOutputWriter(
   // create the Generator without separator inserted between 2 records
   private[this] val text = new Text()

+  // A `ValueConverter` is responsible for converting a field of an `InternalRow` to `String`.
+  private type ValueConverter = (InternalRow, Int) => String
+
+  // `ValueConverter`s for all fields of the schema
+  private val fieldsConverters: Seq[ValueConverter] = dataSchema.map(_.dataType).map(makeConverter)
+
   private val recordWriter: RecordWriter[NullWritable, Text] = {
     new TextOutputFormat[NullWritable, Text]() {
       override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
```
```diff
@@ -195,18 +202,40 @@ private[sql] class CsvOutputWriter(
   private var records: Long = 0L
   private val csvWriter = new LineCsvWriter(params, dataSchema.fieldNames.toSeq)

-  private def rowToString(row: Seq[Any]): Seq[String] = row.map { field =>
-    if (field != null) {
-      field.toString
-    } else {
-      params.nullValue
+  private def rowToString(row: InternalRow): Seq[String] = {
+    var i = 0
+    val values = new Array[String](row.numFields)
+    while (i < row.numFields) {
```
Contributor

Please use … BTW, do we need …

Member (Author)

Using a functional transformation is generally slower due to virtual function calls. This part is executed a lot, so such overhead becomes really heavy. See https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex
```diff
+      if (!row.isNullAt(i)) {
+        values(i) = fieldsConverters(i).apply(row, i)
+      } else {
+        values(i) = params.nullValue
       }
+      i += 1
     }
+    values
+  }
```
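The thread above contrasts functional and imperative traversal. As a standalone illustration of the trade-off (simplified; not the PR's actual code):

```scala
// Functional traversal: every element goes through a closure, and on hot
// paths that call site tends to be megamorphic and resists JIT inlining.
def rowToStringFunctional(fields: Seq[Any], nullValue: String): Seq[String] =
  fields.map(f => if (f != null) f.toString else nullValue)

// Imperative traversal: a while loop over a preallocated array, the style
// the Databricks guide recommends for performance-critical code paths.
def rowToStringImperative(fields: Array[Any], nullValue: String): Seq[String] = {
  val out = new Array[String](fields.length)
  var i = 0
  while (i < fields.length) {
    out(i) = if (fields(i) != null) fields(i).toString else nullValue
    i += 1
  }
  out
}
```

Both produce the same result; the imperative version simply avoids allocating and invoking a function object per element.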
```diff
+
+  private def makeConverter(dataType: DataType): ValueConverter = dataType match {
+    case DateType if params.dateFormat != null =>
+      (row: InternalRow, ordinal: Int) =>
+        params.dateFormat.format(DateTimeUtils.toJavaDate(row.getInt(ordinal)))
+
+    case TimestampType if params.dateFormat != null =>
+      (row: InternalRow, ordinal: Int) =>
+        params.dateFormat.format(DateTimeUtils.toJavaTimestamp(row.getLong(ordinal)))
```
Member (Author)

@srowen Maybe you are looking for this case. I had to do it like this to avoid per-record type dispatch.

Member

This is formatting to a string, rather than parsing from a string, right?

Member (Author)

Oh, sorry. This is not part of this PR. This is spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala, lines 280 to 291 in e1dc853.

Member (Author)

This was merged before, here, but … So, this PR adds the support for …
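To unpack "avoid per-record type dispatch": the match on the data type runs once per field when the writer is built, and each record then only does indexed calls. A simplified, self-contained sketch of the pattern (stand-in types, not the PR's code):

```scala
import java.text.SimpleDateFormat
import java.util.Date

object ConverterSketch {
  // Stand-ins for Catalyst's DataType hierarchy.
  sealed trait DataType
  case object StringType extends DataType
  case object DateType extends DataType

  type ValueConverter = Any => String

  // The type match happens here, once per field of the schema...
  def makeConverter(dt: DataType, fmt: SimpleDateFormat): ValueConverter = dt match {
    case DateType => (v: Any) => fmt.format(v.asInstanceOf[Date])
    case _        => (v: Any) => v.toString
  }

  def main(args: Array[String]): Unit = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    val converters: Array[ValueConverter] =
      Array(makeConverter(StringType, fmt), makeConverter(DateType, fmt))

    // ...so converting a record is just an indexed call, with no matching.
    val record: Array[Any] = Array("spark", new Date(0L))
    println(record.indices.map(i => converters(i)(record(i))).mkString(","))
  }
}
```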
```diff
+
+    case udt: UserDefinedType[_] => makeConverter(udt.sqlType)
+
+    case dt: DataType =>
+      (row: InternalRow, ordinal: Int) =>
+        row.get(ordinal, dt).toString
+  }
```
```diff

   override def write(row: Row): Unit = throw new UnsupportedOperationException("call writeInternal")

   override protected[sql] def writeInternal(row: InternalRow): Unit = {
-    csvWriter.writeRow(rowToString(row.toSeq(dataSchema)), records == 0L && params.headerFlag)
+    csvWriter.writeRow(rowToString(row), records == 0L && params.headerFlag)
     records += 1
     if (records % FLUSH_BATCH_SIZE == 0) {
       flush()
```
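For context, this is roughly how the write-side option added here is meant to be used. A hypothetical snippet (the `dateFormat` option name comes from the diff; the DataFrame and path are made up):

```scala
// Write Date/Timestamp columns with a custom pattern instead of the
// default toString rendering; df is an existing DataFrame.
df.write
  .format("csv")
  .option("dateFormat", "yyyy/MM/dd HH:mm")  // pattern handed to SimpleDateFormat
  .save("/tmp/dates-out")
```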
Member

I don't get what this is saying -- if an input string specifies a time zone, then that is the time zone to use to parse it. That's it, right?
Member (Author)

Yes, it will use the timezone specified in the input. Since Date and Timestamp do not keep timezone information, it calculates the difference between the timezone specified in the input and the one given in the option, but the standard is the one specified in the option.

I meant to say, for example, that if timezone is set to GMT, all the read Date and Timestamp values are in the GMT timezone after calculating the differences. So, if the CSV data is as below:

…

and it is read as below:

…

it will become as below in the dataframe (the difference between GMT and PDT is 7 hours); these will be timestamp objects, but I just represented them as strings:

…

EDITED: I added an example for writing as well, just to be clear. Then, if you write this as below:

…

the results will be:

…
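To make the round trip concrete, a hypothetical version of the example described above (the `timezone` option is the one proposed in this PR; data, pattern, and paths are made up):

```scala
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Hypothetical input CSV, stamped with a PDT offset (GMT-7):
//   2016-06-25 00:00:00-07:00

val schema = StructType(Seq(StructField("ts", TimestampType)))

// With timezone=GMT, the 7-hour difference between the input's offset and
// GMT is applied on read, so the row above surfaces in the dataframe as
// the GMT wall-clock time 2016-06-25 07:00:00.
val df = spark.read
  .schema(schema)
  .option("dateFormat", "yyyy-MM-dd HH:mm:ssXXX")
  .option("timezone", "GMT")
  .csv("/tmp/timestamps.csv")

// Writing with the same option renders the values in GMT again.
df.write
  .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
  .option("timezone", "GMT")
  .csv("/tmp/timestamps-out")
```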
Member (Author)

If this behaviour looks fine, I will just change the documentation to be clearer.
Member

In the example above, the input specifies a timezone, and that must be used to interpret it. You say "it becomes in the dataframe", but what it becomes is a timestamp, which is unambiguous and has no timezone. Timezone matters when converting back to a string for display, but your example only shows the parameter used on reading, and shows no timezone in the output. I am not sure that this is the intended behavior?
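A small standalone illustration of that point: a Timestamp is an absolute instant with no zone, and the zone only enters when it is rendered as a string (not PR code):

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

val ts = new Timestamp(0L)  // the epoch instant; carries no timezone
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(fmt.format(ts))  // 1970-01-01 00:00:00

fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"))
println(fmt.format(ts))  // 1969-12-31 16:00:00 -- same instant, different rendering
```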
Member (Author)

I thought it loses the timezone information after being loaded into Spark. I mean, Timestamp and Date instances don't have timezone information in them. The timezone specified in the input is being used in the example... I am sorry; I think I didn't understand clearly. Do you mind if I ask what you expect before it is read, after it is read (in the dataframe), and after it is written, please?

(I just added an example.)
Member (Author)

I see. Thank you for the detailed explanation. I should correct the comments and the behaviour.
Member (Author)

It seems it is the default behaviour for SimpleDateFormat. I will look into this deeper and will fix it or leave some more comments tomorrow. Thanks again!
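For reference, the SimpleDateFormat default mentioned here: it parses and formats in the JVM's default timezone unless one is set explicitly. A minimal standalone illustration:

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
println(fmt.getTimeZone.getID)       // the JVM default zone, used implicitly

fmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(fmt.format(new Date(0L)))    // 1970-01-01 00:00:00, rendered in GMT
```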
Member

Why not just have the timezone as part of the dateFormat, so users can specify it?
Member

Ah, I see -- you want to control the zone this gets converted to.
Member (Author)

I will clean up the PR description and all of those soon, with a better proposal. Thanks!