32 changes: 17 additions & 15 deletions sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -269,14 +269,15 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
* character using backslash quoting mechanism</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing.</li>
* <ul>
* <li> - `PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts
* the malformed string into a new field configured by `columnNameOfCorruptRecord`. When
* a schema is set by user, it sets `null` for extra fields.</li>
* <li> - `DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li> - `FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* during parsing.
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts
* the malformed string into a new field configured by `columnNameOfCorruptRecord`. When
* a schema is set by user, it sets `null` for extra fields.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* </li>
* <li>`columnNameOfCorruptRecord` (default is the value specified in
* `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string
* created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
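As a reviewer aid, a minimal sketch of how the JSON parse-mode options documented above are used from the reader API; the `spark` session and the input path are placeholders:

```scala
// Assumes an existing SparkSession named `spark`; the path is hypothetical.
val df = spark.read
  .option("mode", "PERMISSIVE")                         // null out bad fields, keep the row
  .option("columnNameOfCorruptRecord", "_corrupt_json") // rename the capture column
  .json("/data/events.json")
```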
@@ -395,13 +396,14 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* <li>`maxMalformedLogPerPartition` (default `10`): sets the maximum number of malformed rows
* Spark will log for each partition. Malformed records beyond this number will be ignored.</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing.</li>
* <ul>
* <li> - `PERMISSIVE` : sets other fields to `null` when it meets a corrupted record. When
* a schema is set by user, it sets `null` for extra fields.</li>
* <li> - `DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li> - `FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* during parsing.
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record. When
* a schema is set by user, it sets `null` for extra fields.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* </li>
* </ul>
* @since 2.0.0
*/
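The CSV reader accepts the same `mode` values; a sketch assuming a user-supplied schema (all names hypothetical):

```scala
import org.apache.spark.sql.types._

// DROPMALFORMED silently skips rows that do not fit the schema.
val schema = new StructType().add("id", IntegerType).add("name", StringType)
val people = spark.read
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .csv("/data/people.csv")
```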
12 changes: 12 additions & 0 deletions sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -397,7 +397,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* your external database systems.
*
* You can set the following JDBC-specific option(s) for storing JDBC:
* <ul>
* <li>`truncate` (default `false`): use `TRUNCATE TABLE` instead of `DROP TABLE`.</li>
* </ul>
*
* In case of failures, users should turn off `truncate` option to use `DROP TABLE` again. Also,
* due to the different behavior of `TRUNCATE TABLE` among DBMS, it's not always safe to use this.
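A short sketch of the `truncate` option in context; the connection details and table name are hypothetical, and the option only matters with overwrite semantics:

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")

// Assumes an existing DataFrame `df`.
df.write
  .mode("overwrite")
  .option("truncate", "true") // TRUNCATE TABLE instead of DROP TABLE on overwrite
  .jdbc("jdbc:postgresql://dbhost/warehouse", "public.events", props)
```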
@@ -486,6 +488,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* }}}
*
* You can set the following JSON-specific option(s) for writing JSON files:
* <ul>
* <li>`compression` (default `null`): compression codec to use when saving to file. This can be
* one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
* `snappy` and `deflate`). </li>
@@ -495,6 +498,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* </ul>
*
* @since 1.4.0
*/
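For instance, a hedged sketch combining the codec and timestamp options (output path hypothetical):

```scala
df.write
  .option("compression", "gzip")                    // one of the shortened codec names
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // custom SimpleDateFormat pattern
  .json("/out/events-json")
```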
@@ -510,10 +514,12 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* }}}
*
* You can set the following Parquet-specific option(s) for writing Parquet files:
* <ul>
* <li>`compression` (default is the value specified in `spark.sql.parquet.compression.codec`):
* compression codec to use when saving to file. This can be one of the known case-insensitive
* shorten names(none, `snappy`, `gzip`, and `lzo`). This will override
* `spark.sql.parquet.compression.codec`.</li>
* </ul>
*
* @since 1.4.0
*/
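A one-line sketch of the Parquet variant (path hypothetical):

```scala
df.write
  .option("compression", "snappy") // overrides spark.sql.parquet.compression.codec
  .parquet("/out/events-parquet")
```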
@@ -529,9 +535,11 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* }}}
*
* You can set the following ORC-specific option(s) for writing ORC files:
* <ul>
* <li>`compression` (default `snappy`): compression codec to use when saving to file. This can be
* one of the known case-insensitive shorten names(`none`, `snappy`, `zlib`, and `lzo`).
* This will override `orc.compress`.</li>
* </ul>
*
* @since 1.5.0
* @note Currently, this method can only be used after enabling Hive support
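A sketch under the stated Hive-support requirement (path hypothetical):

```scala
// Requires a Hive-enabled SparkSession, per the note above.
df.write
  .option("compression", "zlib") // overrides orc.compress
  .orc("/out/events-orc")
```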
@@ -553,9 +561,11 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* }}}
*
* You can set the following option(s) for writing text files:
* <ul>
* <li>`compression` (default `null`): compression codec to use when saving to file. This can be
* one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
* `snappy` and `deflate`). </li>
* </ul>
*
* @since 1.6.0
*/
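A sketch, assuming the frame has been reduced to the single string column the text source expects (column name hypothetical):

```scala
df.select("value").write
  .option("compression", "bzip2")
  .text("/out/lines")
```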
@@ -571,6 +581,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* }}}
*
* You can set the following CSV-specific option(s) for writing CSV files:
* <ul>
* <li>`sep` (default `,`): sets the single character as a separator for each
* field and value.</li>
* <li>`quote` (default `"`): sets the single character used for escaping quoted values where
@@ -593,6 +604,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* </ul>
*
* @since 2.0.0
*/
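A combined sketch of a few of these writer options (path hypothetical):

```scala
df.write
  .option("sep", "|")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSZZ")
  .csv("/out/events-csv")
```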
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
@@ -161,6 +161,7 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
* schema in advance, use the version that specifies the schema to avoid the extra scan.
*
* You can set the following JSON-specific options to deal with non-standard JSON files:
* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* <li>`primitivesAsString` (default `false`): infers all primitive values as a string type</li>
@@ -175,14 +176,15 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
* <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
* character using backslash quoting mechanism</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing.</li>
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts the
* malformed string into a new field configured by `columnNameOfCorruptRecord`. When
* a schema is set by user, it sets `null` for extra fields.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* during parsing.
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts
* the malformed string into a new field configured by `columnNameOfCorruptRecord`. When
* a schema is set by user, it sets `null` for extra fields.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* </li>
* <li>`columnNameOfCorruptRecord` (default is the value specified in
* `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string
* created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
@@ -192,6 +194,7 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* </ul>
*
* @since 2.0.0
*/
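A streaming sketch; file streams need the schema up front, so `userSchema` and the directory are hypothetical:

```scala
import org.apache.spark.sql.types._

val userSchema = new StructType().add("ts", TimestampType).add("msg", StringType)
val jsonStream = spark.readStream
  .schema(userSchema)
  .option("maxFilesPerTrigger", "1") // at most one new file per trigger
  .option("mode", "PERMISSIVE")
  .json("/in/json-dir")
```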
@@ -207,6 +210,7 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
* specify the schema explicitly using [[schema]].
*
* You can set the following CSV-specific options to deal with CSV files:
* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* <li>`sep` (default `,`): sets the single character as a separator for each
@@ -245,12 +249,14 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
* <li>`maxCharsPerColumn` (default `1000000`): defines the maximum number of characters allowed
* for any given value being read.</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing.</li>
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record. When
* a schema is set by user, it sets `null` for extra fields.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* during parsing.
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record. When
* a schema is set by user, it sets `null` for extra fields.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* </li>
* </ul>
*
* @since 2.0.0
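The CSV stream reads analogously, reusing the hypothetical `userSchema` from the JSON sketch above:

```scala
val csvStream = spark.readStream
  .schema(userSchema)              // streaming CSV also needs a schema up front
  .option("sep", ";")
  .option("mode", "DROPMALFORMED")
  .csv("/in/csv-dir")
```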
@@ -263,12 +269,14 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
* Loads a Parquet file stream, returning the result as a [[DataFrame]].
*
* You can set the following Parquet-specific option(s) for reading Parquet files:
* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* <li>`mergeSchema` (default is the value specified in `spark.sql.parquet.mergeSchema`): sets
* whether we should merge schemas collected from all
* Parquet part-files. This will override
* `spark.sql.parquet.mergeSchema`.</li>
* </ul>
*
* @since 2.0.0
*/
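A sketch of the streaming Parquet options, again with a hypothetical schema and directory:

```scala
val parquetStream = spark.readStream
  .schema(userSchema)
  .option("mergeSchema", "true") // overrides spark.sql.parquet.mergeSchema
  .parquet("/in/parquet-dir")
```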
@@ -292,8 +300,10 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
* }}}
*
* You can set the following text-specific options to deal with text files:
* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* </ul>
*
* @since 2.0.0
*/
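And the text stream, which yields a single `value` column with one row per line (directory hypothetical):

```scala
val lines = spark.readStream
  .option("maxFilesPerTrigger", "10")
  .text("/in/text-dir")
```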