diff --git a/docs/sql-data-sources-csv.md b/docs/sql-data-sources-csv.md
index d5390e535eff1..2fe8f77ff6675 100644
--- a/docs/sql-data-sources-csv.md
+++ b/docs/sql-data-sources-csv.md
@@ -21,8 +21,6 @@ license: |
 Spark SQL provides `spark.read().csv("file_name")` to read a file or directory of files in CSV format into Spark DataFrame, and `dataframe.write().csv("path")` to write to a CSV file. Function `option()` can be used to customize the behavior of reading or writing, such as controlling behavior of the header, delimiter character, character set, and so on.
-
-
@@ -38,3 +36,217 @@ Spark SQL provides `spark.read().csv("file_name")` to read a file or directory o
+
+## Data Source Option
+
+Data source options of CSV can be set via:
+* the `.option`/`.options` methods of
+  * `DataFrameReader`
+  * `DataFrameWriter`
+  * `DataStreamReader`
+  * `DataStreamWriter`
+* the built-in functions below
+  * `from_csv`
+  * `to_csv`
+  * `schema_of_csv`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table">
+  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr><td><code>sep</code></td><td>,</td><td>Sets a separator for each field and value. This separator can be one or more characters.</td><td>read/write</td></tr>
+  <tr><td><code>encoding</code></td><td>UTF-8</td><td>For reading, decodes the CSV files by the given encoding type. For writing, specifies the encoding (charset) of saved CSV files.</td><td>read/write</td></tr>
+  <tr><td><code>quote</code></td><td>"</td><td>Sets a single character used for escaping quoted values where the separator can be part of the value. For reading, if you would like to turn off quotations, set an empty string rather than null. For writing, if an empty string is set, it uses u0000 (the null character).</td><td>read/write</td></tr>
+  <tr><td><code>quoteAll</code></td><td>false</td><td>A flag indicating whether all values should always be enclosed in quotes. The default is to only escape values containing a quote character.</td><td>write</td></tr>
+  <tr><td><code>escape</code></td><td>\</td><td>Sets a single character used for escaping quotes inside an already quoted value.</td><td>read/write</td></tr>
+  <tr><td><code>escapeQuotes</code></td><td>true</td><td>A flag indicating whether values containing quotes should always be enclosed in quotes. The default is to escape all values containing a quote character.</td><td>write</td></tr>
+  <tr><td><code>comment</code></td><td></td><td>Sets a single character used for skipping lines beginning with this character. By default, it is disabled.</td><td>read</td></tr>
+  <tr><td><code>header</code></td><td>false</td><td>For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is an RDD of Strings, this header option removes all lines that are identical to the header, if it exists.</td><td>read/write</td></tr>
+  <tr><td><code>inferSchema</code></td><td>false</td><td>Infers the input schema automatically from data. It requires one extra pass over the data.</td><td>read</td></tr>
+  <tr><td><code>enforceSchema</code></td><td>true</td><td>If it is set to true, the specified or inferred schema is forcibly applied to datasource files, and headers in CSV files are ignored. If the option is set to false, the schema is validated against all headers in CSV files when the header option is set to true. Field names in the schema and column names in CSV headers are checked by their positions, taking into account spark.sql.caseSensitive. Though the default value is true, it is recommended to disable the enforceSchema option to avoid incorrect results.</td><td>read</td></tr>
+  <tr><td><code>ignoreLeadingWhiteSpace</code></td><td>false (for reading), true (for writing)</td><td>A flag indicating whether or not leading whitespaces from values being read/written should be skipped.</td><td>read/write</td></tr>
+  <tr><td><code>ignoreTrailingWhiteSpace</code></td><td>false (for reading), true (for writing)</td><td>A flag indicating whether or not trailing whitespaces from values being read/written should be skipped.</td><td>read/write</td></tr>
+  <tr><td><code>nullValue</code></td><td></td><td>Sets the string representation of a null value. Since 2.0.1, this nullValue param applies to all supported types including the string type.</td><td>read/write</td></tr>
+  <tr><td><code>nanValue</code></td><td>NaN</td><td>Sets the string representation of a non-number value.</td><td>read</td></tr>
+  <tr><td><code>positiveInf</code></td><td>Inf</td><td>Sets the string representation of a positive infinity value.</td><td>read</td></tr>
+  <tr><td><code>negativeInf</code></td><td>-Inf</td><td>Sets the string representation of a negative infinity value.</td><td>read</td></tr>
+  <tr><td><code>dateFormat</code></td><td>yyyy-MM-dd</td><td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to date type.</td><td>read/write</td></tr>
+  <tr><td><code>timestampFormat</code></td><td>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</td><td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to timestamp type.</td><td>read/write</td></tr>
+  <tr><td><code>maxColumns</code></td><td>20480</td><td>Defines a hard limit of how many columns a record can have.</td><td>read</td></tr>
+  <tr><td><code>maxCharsPerColumn</code></td><td>-1</td><td>Defines the maximum number of characters allowed for any given value being read. By default, it is -1, meaning unlimited length.</td><td>read</td></tr>
+  <tr><td><code>mode</code></td><td>PERMISSIVE</td><td>Allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes. Note that Spark tries to parse only required columns in CSV under column pruning; therefore, corrupt records can differ based on the required set of fields. This behavior can be controlled by spark.sql.csv.parser.columnPruning.enabled (enabled by default).<ul><li>PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with fewer or more tokens than the schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, it sets null to the extra fields. When the record has more tokens than the length of the schema, it drops the extra tokens.</li><li>DROPMALFORMED: ignores the whole corrupted records.</li><li>FAILFAST: throws an exception when it meets corrupted records.</li></ul></td><td>read</td></tr>
+  <tr><td><code>columnNameOfCorruptRecord</code></td><td>The value specified in spark.sql.columnNameOfCorruptRecord</td><td>Allows renaming the new field having a malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.</td><td>read</td></tr>
+  <tr><td><code>multiLine</code></td><td>false</td><td>Parse one record, which may span multiple lines, per file.</td><td>read</td></tr>
+  <tr><td><code>charToEscapeQuoteEscaping</code></td><td>escape or \0</td><td>Sets a single character used for escaping the escape for the quote character. The default value is the escape character when escape and quote characters are different, \0 otherwise.</td><td>read/write</td></tr>
+  <tr><td><code>samplingRatio</code></td><td>1.0</td><td>Defines the fraction of rows used for schema inferring.</td><td>read</td></tr>
+  <tr><td><code>emptyValue</code></td><td>(for reading), "" (for writing)</td><td>Sets the string representation of an empty value.</td><td>read/write</td></tr>
+  <tr><td><code>locale</code></td><td>en-US</td><td>Sets a locale as a language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps.</td><td>read</td></tr>
+  <tr><td><code>lineSep</code></td><td>\r, \r\n and \n (for reading), \n (for writing)</td><td>Defines the line separator that should be used for parsing/writing. Maximum length is 1 character.</td><td>read/write</td></tr>
+  <tr><td><code>unescapedQuoteHandling</code></td><td>STOP_AT_DELIMITER</td><td>Defines how the CsvParser will handle values with unescaped quotes.<ul><li>STOP_AT_CLOSING_QUOTE: if unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.</li><li>BACK_TO_DELIMITER: if unescaped quotes are found in the input, consider the value as an unquoted value. This makes the parser accumulate all characters of the current parsed value until the delimiter is found. If no delimiter is found in the value, the parser continues accumulating characters from the input until a delimiter or line ending is found.</li><li>STOP_AT_DELIMITER: if unescaped quotes are found in the input, consider the value as an unquoted value. This makes the parser accumulate all characters until the delimiter or a line ending is found in the input.</li><li>SKIP_VALUE: if unescaped quotes are found in the input, the content parsed for the given value is skipped and the value set in nullValue is produced instead.</li><li>RAISE_ERROR: if unescaped quotes are found in the input, a TextParsingException is thrown.</li></ul></td><td>read</td></tr>
+  <tr><td><code>compression</code></td><td>(none)</td><td>Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).</td><td>write</td></tr>
+</table>
+
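As a quick illustration of how the options above are wired together, here is a minimal Scala sketch (not part of the patch itself; the file paths are placeholders) that passes a few of them through `option()` on both the reader and the writer:

```scala
import org.apache.spark.sql.SparkSession

object CsvOptionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CsvOptionsExample").getOrCreate()

    // Read a CSV file, overriding a few of the options documented in the table above.
    // "people.csv" is a placeholder path.
    val df = spark.read
      .option("header", "true")       // first line holds the column names
      .option("sep", ";")             // fields are separated by semicolons
      .option("inferSchema", "true")  // infer column types with one extra pass over the data
      .csv("people.csv")

    // Write the result back out, quoting every value and compressing the output files.
    df.write
      .option("header", "true")
      .option("quoteAll", "true")
      .option("compression", "gzip")
      .csv("people_out")

    spark.stop()
  }
}
```

The same keys can also be supplied in one call via `options(Map(...))`, or through the `OPTIONS` clause of `CREATE TABLE ... USING csv`, as listed above.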
+Other generic options can be found in Generic File Source Options. diff --git a/docs/sql-data-sources-text.md b/docs/sql-data-sources-text.md index d72b543f54797..fac874afa0df9 100644 --- a/docs/sql-data-sources-text.md +++ b/docs/sql-data-sources-text.md @@ -45,7 +45,7 @@ Data source options of text can be set via: * `DataFrameWriter` * `DataStreamReader` * `DataStreamWriter` - * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) +* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py index 7719d48f6ef7c..f9e37341dcd64 100644 --- a/python/pyspark/sql/readwriter.py +++ b/python/pyspark/sql/readwriter.py @@ -195,9 +195,11 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None, ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Examples -------- >>> df1 = spark.read.json('python/test_support/sql/people.json') @@ -273,9 +275,11 @@ def parquet(self, *paths, **options): ---------------- **options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Examples -------- >>> df = spark.read.parquet('python/test_support/sql/parquet_partitioned') @@ -318,9 +322,11 @@ def text(self, paths, wholetext=False, lineSep=None, pathGlobFilter=None, ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Examples -------- >>> df = spark.read.text('python/test_support/sql/text-test.txt') @@ -364,172 +370,15 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non schema : :class:`pyspark.sql.types.StructType` or str, optional an optional :class:`pyspark.sql.types.StructType` for the input schema or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``). - sep : str, optional - sets a separator (one or more characters) for each field and value. If None is - set, it uses the default value, ``,``. - encoding : str, optional - decodes the CSV files by the given encoding type. If None is set, - it uses the default value, ``UTF-8``. - quote : str, optional - sets a single character used for escaping quoted values where the - separator can be part of the value. If None is set, it uses the default - value, ``"``. If you would like to turn off quotations, you need to set an - empty string. - escape : str, optional - sets a single character used for escaping quotes inside an already - quoted value. If None is set, it uses the default value, ``\``. - comment : str, optional - sets a single character used for skipping lines beginning with this - character. By default (None), it is disabled. - header : str or bool, optional - uses the first line as names of columns. If None is set, it uses the - default value, ``false``. - - .. note:: if the given path is a RDD of Strings, this header - option will remove all lines same with the header if exists. - - inferSchema : str or bool, optional - infers the input schema automatically from data. It requires one extra - pass over the data. If None is set, it uses the default value, ``false``. 
- enforceSchema : str or bool, optional - If it is set to ``true``, the specified or inferred schema will be - forcibly applied to datasource files, and headers in CSV files will be - ignored. If the option is set to ``false``, the schema will be - validated against all headers in CSV files or the first header in RDD - if the ``header`` option is set to ``true``. Field names in the schema - and column names in CSV headers are checked by their positions - taking into account ``spark.sql.caseSensitive``. If None is set, - ``true`` is used by default. Though the default value is ``true``, - it is recommended to disable the ``enforceSchema`` option - to avoid incorrect results. - ignoreLeadingWhiteSpace : str or bool, optional - A flag indicating whether or not leading whitespaces from - values being read should be skipped. If None is set, it - uses the default value, ``false``. - ignoreTrailingWhiteSpace : str or bool, optional - A flag indicating whether or not trailing whitespaces from - values being read should be skipped. If None is set, it - uses the default value, ``false``. - nullValue : str, optional - sets the string representation of a null value. If None is set, it uses - the default value, empty string. Since 2.0.1, this ``nullValue`` param - applies to all supported types including the string type. - nanValue : str, optional - sets the string representation of a non-number value. If None is set, it - uses the default value, ``NaN``. - positiveInf : str, optional - sets the string representation of a positive infinity value. If None - is set, it uses the default value, ``Inf``. - negativeInf : str, optional - sets the string representation of a negative infinity value. If None - is set, it uses the default value, ``Inf``. - dateFormat : str, optional - sets the string that indicates a date format. Custom date formats - follow the formats at - `datetime pattern `_. # noqa - This applies to date type. If None is set, it uses the - default value, ``yyyy-MM-dd``. - timestampFormat : str, optional - sets the string that indicates a timestamp format. - Custom date formats follow the formats at - `datetime pattern `_. # noqa - This applies to timestamp type. If None is set, it uses the - default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``. - maxColumns : str or int, optional - defines a hard limit of how many columns a record can have. If None is - set, it uses the default value, ``20480``. - maxCharsPerColumn : str or int, optional - defines the maximum number of characters allowed for any given - value being read. If None is set, it uses the default value, - ``-1`` meaning unlimited length. - maxMalformedLogPerPartition : str or int, optional - this parameter is no longer used since Spark 2.2.0. - If specified, it is ignored. - mode : str, optional - allows a mode for dealing with corrupt records during parsing. If None is - set, it uses the default value, ``PERMISSIVE``. Note that Spark tries to - parse only required columns in CSV under column pruning. Therefore, corrupt - records can be different based on required set of fields. This behavior can - be controlled by ``spark.sql.csv.parser.columnPruning.enabled`` - (enabled by default). - - * ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \ - into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \ - fields to ``null``. To keep corrupt records, an user can set a string type \ - field named ``columnNameOfCorruptRecord`` in an user-defined schema. 
If a \ - schema does not have the field, it drops corrupt records during parsing. \ - A record with less/more tokens than schema is not a corrupted record to CSV. \ - When it meets a record having fewer tokens than the length of the schema, \ - sets ``null`` to extra fields. When the record has more tokens than the \ - length of the schema, it drops extra tokens. - * ``DROPMALFORMED``: ignores the whole corrupted records. - * ``FAILFAST``: throws an exception when it meets corrupted records. - - columnNameOfCorruptRecord : str, optional - allows renaming the new field having malformed string - created by ``PERMISSIVE`` mode. This overrides - ``spark.sql.columnNameOfCorruptRecord``. If None is set, - it uses the value specified in - ``spark.sql.columnNameOfCorruptRecord``. - multiLine : str or bool, optional - parse records, which may span multiple lines. If None is - set, it uses the default value, ``false``. - charToEscapeQuoteEscaping : str, optional - sets a single character used for escaping the escape for - the quote character. If None is set, the default value is - escape character when escape and quote characters are - different, ``\0`` otherwise. - samplingRatio : str or float, optional - defines fraction of rows used for schema inferring. - If None is set, it uses the default value, ``1.0``. - emptyValue : str, optional - sets the string representation of an empty value. If None is set, it uses - the default value, empty string. - locale : str, optional - sets a locale as language tag in IETF BCP 47 format. If None is set, - it uses the default value, ``en-US``. For instance, ``locale`` is used while - parsing dates and timestamps. - lineSep : str, optional - defines the line separator that should be used for parsing. If None is - set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``. - Maximum length is 1 character. - pathGlobFilter : str or bool, optional - an optional glob pattern to only include files with paths matching - the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`. - It does not change the behavior of - `partition discovery `_. # noqa - recursiveFileLookup : str or bool, optional - recursively scan a directory for files. Using this option disables - `partition discovery `_. # noqa - - modification times occurring before the specified time. The provided timestamp - must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00) - modifiedBefore (batch only) : an optional timestamp to only include files with - modification times occurring before the specified time. The provided timestamp - must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00) - modifiedAfter (batch only) : an optional timestamp to only include files with - modification times occurring after the specified time. The provided timestamp - must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00) - unescapedQuoteHandling : str, optional - defines how the CsvParser will handle values with unescaped quotes. If None is - set, it uses the default value, ``STOP_AT_DELIMITER``. - - * ``STOP_AT_CLOSING_QUOTE``: If unescaped quotes are found in the input, accumulate - the quote character and proceed parsing the value as a quoted value, until a closing - quote is found. - * ``BACK_TO_DELIMITER``: If unescaped quotes are found in the input, consider the value - as an unquoted value. This will make the parser accumulate all characters of the current - parsed value until the delimiter is found. 
If no delimiter is found in the value, the - parser will continue accumulating characters from the input until a delimiter or line - ending is found. - * ``STOP_AT_DELIMITER``: If unescaped quotes are found in the input, consider the value - as an unquoted value. This will make the parser accumulate all characters until the - delimiter or a line ending is found in the input. - * ``SKIP_VALUE``: If unescaped quotes are found in the input, the content parsed - for the given value will be skipped and the value set in nullValue will be produced - instead. - * ``RAISE_ERROR``: If unescaped quotes are found in the input, a TextParsingException - will be thrown. + + Other Parameters + ---------------- + Extra options + For the extra options, refer to + `Data Source Option `_ + in the version you use. + + .. # noqa Examples -------- @@ -595,9 +444,11 @@ def orc(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLookup=N ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Examples -------- >>> df = spark.read.orc('python/test_support/sql/orc_partitioned') @@ -963,9 +814,11 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Examples -------- >>> df.write.json(os.path.join(tempfile.mkdtemp(), 'data')) @@ -1000,9 +853,11 @@ def parquet(self, path, mode=None, partitionBy=None, compression=None): ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Examples -------- >>> df.write.parquet(os.path.join(tempfile.mkdtemp(), 'data')) @@ -1028,9 +883,11 @@ def text(self, path, compression=None, lineSep=None): ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + The DataFrame must have only one column that is of string type. Each row becomes a new line in the output file. """ @@ -1058,68 +915,14 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No * ``error`` or ``errorifexists`` (default case): Throw an exception if data already \ exists. - compression : str, optional - compression codec to use when saving to file. This can be one of the - known case-insensitive shorten names (none, bzip2, gzip, lz4, - snappy and deflate). - sep : str, optional - sets a separator (one or more characters) for each field and value. If None is - set, it uses the default value, ``,``. - quote : str, optional - sets a single character used for escaping quoted values where the - separator can be part of the value. If None is set, it uses the default - value, ``"``. If an empty string is set, it uses ``u0000`` (null character). - escape : str, optional - sets a single character used for escaping quotes inside an already - quoted value. If None is set, it uses the default value, ``\`` - escapeQuotes : str or bool, optional - a flag indicating whether values containing quotes should always - be enclosed in quotes. If None is set, it uses the default value - ``true``, escaping all values containing a quote character. - quoteAll : str or bool, optional - a flag indicating whether all values should always be enclosed in - quotes. 
If None is set, it uses the default value ``false``, - only escaping values containing a quote character. - header : str or bool, optional - writes the names of columns as the first line. If None is set, it uses - the default value, ``false``. - nullValue : str, optional - sets the string representation of a null value. If None is set, it uses - the default value, empty string. - dateFormat : str, optional - sets the string that indicates a date format. Custom date formats follow - the formats at - `datetime pattern `_. # noqa - This applies to date type. If None is set, it uses the - default value, ``yyyy-MM-dd``. - timestampFormat : str, optional - sets the string that indicates a timestamp format. - Custom date formats follow the formats at - `datetime pattern `_. # noqa - This applies to timestamp type. If None is set, it uses the - default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``. - ignoreLeadingWhiteSpace : str or bool, optional - a flag indicating whether or not leading whitespaces from - values being written should be skipped. If None is set, it - uses the default value, ``true``. - ignoreTrailingWhiteSpace : str or bool, optional - a flag indicating whether or not trailing whitespaces from - values being written should be skipped. If None is set, it - uses the default value, ``true``. - charToEscapeQuoteEscaping : str, optional - sets a single character used for escaping the escape for - the quote character. If None is set, the default value is - escape character when escape and quote characters are - different, ``\0`` otherwise.. - encoding : str, optional - sets the encoding (charset) of saved csv files. If None is set, - the default UTF-8 charset will be used. - emptyValue : str, optional - sets the string representation of an empty value. If None is set, it uses - the default value, ``""``. - lineSep : str, optional - defines the line separator that should be used for writing. If None is - set, it uses the default value, ``\\n``. Maximum length is 1 character. + Other Parameters + ---------------- + Extra options + For the extra options, refer to + `Data Source Option `_ + in the version you use. + + .. # noqa Examples -------- @@ -1159,9 +962,11 @@ def orc(self, path, mode=None, partitionBy=None, compression=None): ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Examples -------- >>> orc_df = spark.read.orc('python/test_support/sql/orc_partitioned') diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py index f7ec69a414241..08c8934fbf03d 100644 --- a/python/pyspark/sql/streaming.py +++ b/python/pyspark/sql/streaming.py @@ -484,9 +484,11 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None, ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Notes ----- This API is evolving. @@ -524,9 +526,11 @@ def orc(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLookup=N ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Examples -------- >>> orc_sdf = spark.readStream.schema(sdf_schema).orc(tempfile.mkdtemp()) @@ -558,9 +562,11 @@ def parquet(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLook ---------------- Extra options For the extra options, refer to - `Data Source Option `_. 
# noqa + `Data Source Option `_. in the version you use. + .. # noqa + Examples -------- >>> parquet_sdf = spark.readStream.schema(sdf_schema).parquet(tempfile.mkdtemp()) @@ -598,9 +604,11 @@ def text(self, path, wholetext=False, lineSep=None, pathGlobFilter=None, ---------------- Extra options For the extra options, refer to - `Data Source Option `_ # noqa + `Data Source Option `_ in the version you use. + .. # noqa + Notes ----- This API is evolving. @@ -642,154 +650,18 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non schema : :class:`pyspark.sql.types.StructType` or str, optional an optional :class:`pyspark.sql.types.StructType` for the input schema or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``). - sep : str, optional - sets a separator (one or more characters) for each field and value. If None is - set, it uses the default value, ``,``. - encoding : str, optional - decodes the CSV files by the given encoding type. If None is set, - it uses the default value, ``UTF-8``. - quote : str, optional sets a single character used for escaping quoted values where the - separator can be part of the value. If None is set, it uses the default - value, ``"``. If you would like to turn off quotations, you need to set an - empty string. - escape : str, optional - sets a single character used for escaping quotes inside an already - quoted value. If None is set, it uses the default value, ``\``. - comment : str, optional - sets a single character used for skipping lines beginning with this - character. By default (None), it is disabled. - header : str or bool, optional - uses the first line as names of columns. If None is set, it uses the - default value, ``false``. - inferSchema : str or bool, optional - infers the input schema automatically from data. It requires one extra - pass over the data. If None is set, it uses the default value, ``false``. - enforceSchema : str or bool, optional - If it is set to ``true``, the specified or inferred schema will be - forcibly applied to datasource files, and headers in CSV files will be - ignored. If the option is set to ``false``, the schema will be - validated against all headers in CSV files or the first header in RDD - if the ``header`` option is set to ``true``. Field names in the schema - and column names in CSV headers are checked by their positions - taking into account ``spark.sql.caseSensitive``. If None is set, - ``true`` is used by default. Though the default value is ``true``, - it is recommended to disable the ``enforceSchema`` option - to avoid incorrect results. - ignoreLeadingWhiteSpace : str or bool, optional - a flag indicating whether or not leading whitespaces from - values being read should be skipped. If None is set, it - uses the default value, ``false``. - ignoreTrailingWhiteSpace : str or bool, optional - a flag indicating whether or not trailing whitespaces from - values being read should be skipped. If None is set, it - uses the default value, ``false``. - nullValue : str, optional - sets the string representation of a null value. If None is set, it uses - the default value, empty string. Since 2.0.1, this ``nullValue`` param - applies to all supported types including the string type. - nanValue : str, optional - sets the string representation of a non-number value. If None is set, it - uses the default value, ``NaN``. - positiveInf : str, optional - sets the string representation of a positive infinity value. If None - is set, it uses the default value, ``Inf``. 
- negativeInf : str, optional - sets the string representation of a negative infinity value. If None - is set, it uses the default value, ``Inf``. - dateFormat : str, optional - sets the string that indicates a date format. Custom date formats - follow the formats at - `datetime pattern `_. # noqa - This applies to date type. If None is set, it uses the - default value, ``yyyy-MM-dd``. - timestampFormat : str, optional - sets the string that indicates a timestamp format. - Custom date formats follow the formats at - `datetime pattern `_. # noqa - This applies to timestamp type. If None is set, it uses the - default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``. - maxColumns : str or int, optional - defines a hard limit of how many columns a record can have. If None is - set, it uses the default value, ``20480``. - maxCharsPerColumn : str or int, optional - defines the maximum number of characters allowed for any given - value being read. If None is set, it uses the default value, - ``-1`` meaning unlimited length. - maxMalformedLogPerPartition : str or int, optional - this parameter is no longer used since Spark 2.2.0. - If specified, it is ignored. - mode : str, optional - allows a mode for dealing with corrupt records during parsing. If None is - set, it uses the default value, ``PERMISSIVE``. - - * ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \ - into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \ - fields to ``null``. To keep corrupt records, an user can set a string type \ - field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \ - schema does not have the field, it drops corrupt records during parsing. \ - A record with less/more tokens than schema is not a corrupted record to CSV. \ - When it meets a record having fewer tokens than the length of the schema, \ - sets ``null`` to extra fields. When the record has more tokens than the \ - length of the schema, it drops extra tokens. - * ``DROPMALFORMED``: ignores the whole corrupted records. - * ``FAILFAST``: throws an exception when it meets corrupted records. - - columnNameOfCorruptRecord : str, optional - allows renaming the new field having malformed string - created by ``PERMISSIVE`` mode. This overrides - ``spark.sql.columnNameOfCorruptRecord``. If None is set, - it uses the value specified in - ``spark.sql.columnNameOfCorruptRecord``. - multiLine : str or bool, optional - parse one record, which may span multiple lines. If None is - set, it uses the default value, ``false``. - charToEscapeQuoteEscaping : str, optional - sets a single character used for escaping the escape for - the quote character. If None is set, the default value is - escape character when escape and quote characters are - different, ``\0`` otherwise. - emptyValue : str, optional - sets the string representation of an empty value. If None is set, it uses - the default value, empty string. - locale : str, optional - sets a locale as language tag in IETF BCP 47 format. If None is set, - it uses the default value, ``en-US``. For instance, ``locale`` is used while - parsing dates and timestamps. - lineSep : str, optional - defines the line separator that should be used for parsing. If None is - set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``. - Maximum length is 1 character. - pathGlobFilter : str or bool, optional - an optional glob pattern to only include files with paths matching - the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`. 
- It does not change the behavior of - `partition discovery `_. # noqa - recursiveFileLookup : str or bool, optional - recursively scan a directory for files. Using this option disables - `partition discovery `_. # noqa - unescapedQuoteHandling : str, optional - defines how the CsvParser will handle values with unescaped quotes. If None is - set, it uses the default value, ``STOP_AT_DELIMITER``. - - * ``STOP_AT_CLOSING_QUOTE``: If unescaped quotes are found in the input, accumulate - the quote character and proceed parsing the value as a quoted value, until a closing - quote is found. - * ``BACK_TO_DELIMITER``: If unescaped quotes are found in the input, consider the value - as an unquoted value. This will make the parser accumulate all characters of the current - parsed value until the delimiter is found. If no delimiter is found in the value, the - parser will continue accumulating characters from the input until a delimiter or line - ending is found. - * ``STOP_AT_DELIMITER``: If unescaped quotes are found in the input, consider the value - as an unquoted value. This will make the parser accumulate all characters until the - delimiter or a line ending is found in the input. - * ``SKIP_VALUE``: If unescaped quotes are found in the input, the content parsed - for the given value will be skipped and the value set in nullValue will be produced - instead. - * ``RAISE_ERROR``: If unescaped quotes are found in the input, a TextParsingException - will be thrown. .. versionadded:: 2.0.0 + Other Parameters + ---------------- + Extra options + For the extra options, refer to + `Data Source Option `_ + in the version you use. + + .. # noqa + Notes ----- This API is evolving. diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala index ea84785f27af8..8a066bf298976 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala @@ -556,119 +556,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * is enabled. To avoid going through the entire data once, disable `inferSchema` option or * specify the schema explicitly using `schema`. * - * You can set the following CSV-specific options to deal with CSV files: - *
- * <ul>
- * <li>`sep` (default `,`): sets a separator for each field and value. This separator can be one or more characters.</li>
- * <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding type.</li>
- * <li>`quote` (default `"`): sets a single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not `null` but an empty string. This behaviour is different from `com.databricks.spark.csv`.</li>
- * <li>`escape` (default `\`): sets a single character used for escaping quotes inside an already quoted value.</li>
- * <li>`charToEscapeQuoteEscaping` (default `escape` or `\0`): sets a single character used for escaping the escape for the quote character. The default value is escape character when escape and quote characters are different, `\0` otherwise.</li>
- * <li>`comment` (default empty string): sets a single character used for skipping lines beginning with this character. By default, it is disabled.</li>
- * <li>`header` (default `false`): uses the first line as names of columns.</li>
- * <li>`enforceSchema` (default `true`): If it is set to `true`, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to `false`, the schema will be validated against all headers in CSV files in the case when the `header` option is set to `true`. Field names in the schema and column names in CSV headers are checked by their positions taking into account `spark.sql.caseSensitive`. Though the default value is true, it is recommended to disable the `enforceSchema` option to avoid incorrect results.</li>
- * <li>`inferSchema` (default `false`): infers the input schema automatically from data. It requires one extra pass over the data.</li>
- * <li>`samplingRatio` (default is 1.0): defines fraction of rows used for schema inferring.</li>
- * <li>`ignoreLeadingWhiteSpace` (default `false`): a flag indicating whether or not leading whitespaces from values being read should be skipped.</li>
- * <li>`ignoreTrailingWhiteSpace` (default `false`): a flag indicating whether or not trailing whitespaces from values being read should be skipped.</li>
- * <li>`nullValue` (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.</li>
- * <li>`emptyValue` (default empty string): sets the string representation of an empty value.</li>
- * <li>`nanValue` (default `NaN`): sets the string representation of a non-number value.</li>
- * <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity value.</li>
- * <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity value.</li>
- * <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to date type.</li>
- * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to timestamp type.</li>
- * <li>`maxColumns` (default `20480`): defines a hard limit of how many columns a record can have.</li>
- * <li>`maxCharsPerColumn` (default `-1`): defines the maximum number of characters allowed for any given value being read. By default, it is -1 meaning unlimited length.</li>
- * <li>`unescapedQuoteHandling` (default `STOP_AT_DELIMITER`): defines how the CsvParser will handle values with unescaped quotes.
- * <ul>
- * <li>`STOP_AT_CLOSING_QUOTE`: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.</li>
- * <li>`BACK_TO_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.</li>
- * <li>`STOP_AT_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter or a line ending is found in the input.</li>
- * <li>`SKIP_VALUE`: If unescaped quotes are found in the input, the content parsed for the given value will be skipped and the value set in nullValue will be produced instead.</li>
- * <li>`RAISE_ERROR`: If unescaped quotes are found in the input, a TextParsingException will be thrown.</li>
- * </ul>
- * </li>
- * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on required set of fields. This behavior can be controlled by `spark.sql.csv.parser.columnPruning.enabled` (enabled by default).
- * <ul>
- * <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To keep corrupt records, a user can set a string type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with fewer/more tokens than schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, sets `null` to extra fields. When the record has more tokens than the length of the schema, it drops extra tokens.</li>
- * <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
- * <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
- * </ul>
- * </li>
- * <li>`columnNameOfCorruptRecord` (default is the value specified in `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
- * <li>`multiLine` (default `false`): parse one record, which may span multiple lines.</li>
- * <li>`locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps.</li>
- * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator that should be used for parsing. Maximum length is 1 character.</li>
- * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.</li>
- * <li>`modifiedBefore` (batch only): an optional timestamp to only include files with modification times occurring before the specified time. The provided timestamp must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
- * <li>`modifiedAfter` (batch only): an optional timestamp to only include files with modification times occurring after the specified time. The provided timestamp must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
- * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option disables partition discovery</li>
- * </ul>
+ * You can find the CSV-specific options for reading CSV files in + * + * Data Source Option in the version you use. * * @since 2.0.0 */ diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala index cb1029579aa5e..a8af7c8ba850e 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala @@ -850,48 +850,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * format("csv").save(path) * }}} * - * You can set the following CSV-specific option(s) for writing CSV files: - *
- * <ul>
- * <li>`sep` (default `,`): sets a single character as a separator for each field and value.</li>
- * <li>`quote` (default `"`): sets a single character used for escaping quoted values where the separator can be part of the value. If an empty string is set, it uses `u0000` (null character).</li>
- * <li>`escape` (default `\`): sets a single character used for escaping quotes inside an already quoted value.</li>
- * <li>`charToEscapeQuoteEscaping` (default `escape` or `\0`): sets a single character used for escaping the escape for the quote character. The default value is escape character when escape and quote characters are different, `\0` otherwise.</li>
- * <li>`escapeQuotes` (default `true`): a flag indicating whether values containing quotes should always be enclosed in quotes. Default is to escape all values containing a quote character.</li>
- * <li>`quoteAll` (default `false`): a flag indicating whether all values should always be enclosed in quotes. Default is to only escape values containing a quote character.</li>
- * <li>`header` (default `false`): writes the names of columns as the first line.</li>
- * <li>`nullValue` (default empty string): sets the string representation of a null value.</li>
- * <li>`emptyValue` (default `""`): sets the string representation of an empty value.</li>
- * <li>`encoding` (by default it is not set): specifies encoding (charset) of saved csv files. If it is not set, the UTF-8 charset will be used.</li>
- * <li>`compression` (default `null`): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`, `snappy` and `deflate`).</li>
- * <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to date type.</li>
- * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to timestamp type.</li>
- * <li>`ignoreLeadingWhiteSpace` (default `true`): a flag indicating whether or not leading whitespaces from values being written should be skipped.</li>
- * <li>`ignoreTrailingWhiteSpace` (default `true`): a flag indicating whether or not trailing whitespaces from values being written should be skipped.</li>
- * <li>`lineSep` (default `\n`): defines the line separator that should be used for writing. Maximum length is 1 character.</li>
- * </ul>
+ * You can find the CSV-specific options for writing CSV files in + * + * Data Source Option in the version you use. * * @since 2.0.0 */ diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala index 8a278a504e4d9..c446d6b96d1a9 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala @@ -4607,6 +4607,7 @@ object functions { @scala.annotation.varargs def map_concat(cols: Column*): Column = withExpr { MapConcat(cols.map(_.expr)) } + // scalastyle:off line.size.limit /** * Parses a column containing a CSV string into a `StructType` with the specified schema. * Returns `null`, in the case of an unparseable string. @@ -4615,15 +4616,21 @@ object functions { * @param schema the schema to use when parsing the CSV string * @param options options to control how the CSV is parsed. accepts the same options and the * CSV data source. + * See + * + * Data Source Option in the version you use. * * @group collection_funcs * @since 3.0.0 */ + // scalastyle:on line.size.limit def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column = withExpr { val replaced = CharVarcharUtils.failIfHasCharVarchar(schema).asInstanceOf[StructType] CsvToStructs(replaced, options, e.expr) } + // scalastyle:off line.size.limit /** * (Java-specific) Parses a column containing a CSV string into a `StructType` * with the specified schema. Returns `null`, in the case of an unparseable string. @@ -4632,10 +4639,15 @@ object functions { * @param schema the schema to use when parsing the CSV string * @param options options to control how the CSV is parsed. accepts the same options and the * CSV data source. + * See + * + * Data Source Option in the version you use. * * @group collection_funcs * @since 3.0.0 */ + // scalastyle:on line.size.limit def from_csv(e: Column, schema: Column, options: java.util.Map[String, String]): Column = { withExpr(new CsvToStructs(e.expr, schema.expr, options.asScala.toMap)) } @@ -4660,32 +4672,44 @@ object functions { */ def schema_of_csv(csv: Column): Column = withExpr(new SchemaOfCsv(csv.expr)) + // scalastyle:off line.size.limit /** * Parses a CSV string and infers its schema in DDL format using options. * * @param csv a foldable string column containing a CSV string. * @param options options to control how the CSV is parsed. accepts the same options and the - * json data source. See [[DataFrameReader#csv]]. + * CSV data source. + * See + * + * Data Source Option in the version you use. * @return a column with string literal containing schema in DDL format. * * @group collection_funcs * @since 3.0.0 */ + // scalastyle:on line.size.limit def schema_of_csv(csv: Column, options: java.util.Map[String, String]): Column = { withExpr(SchemaOfCsv(csv.expr, options.asScala.toMap)) } + // scalastyle:off line.size.limit /** * (Java-specific) Converts a column containing a `StructType` into a CSV string with * the specified schema. Throws an exception, in the case of an unsupported type. * * @param e a column containing a struct. * @param options options to control how the struct column is converted into a CSV string. - * It accepts the same options and the json data source. + * It accepts the same options and the CSV data source. + * See + * + * Data Source Option in the version you use. 
* * @group collection_funcs * @since 3.0.0 */ + // scalastyle:on line.size.limit def to_csv(e: Column, options: java.util.Map[String, String]): Column = withExpr { StructsToCsv(options.asScala.toMap, e.expr) } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala index 6c3fbaf00e2f7..e6e65cd1b69d9 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala @@ -239,105 +239,16 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo * is enabled. To avoid going through the entire data once, disable `inferSchema` option or * specify the schema explicitly using `schema`. * - * You can set the following CSV-specific options to deal with CSV files: + * You can set the following option(s): *
 * <ul>
 * <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be considered in every trigger.</li>
- * <li>`sep` (default `,`): sets a single character as a separator for each field and value.</li>
- * <li>`encoding` (default `UTF-8`): decodes the CSV files by the given encoding type.</li>
- * <li>`quote` (default `"`): sets a single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not `null` but an empty string. This behaviour is different from `com.databricks.spark.csv`.</li>
- * <li>`escape` (default `\`): sets a single character used for escaping quotes inside an already quoted value.</li>
- * <li>`charToEscapeQuoteEscaping` (default `escape` or `\0`): sets a single character used for escaping the escape for the quote character. The default value is escape character when escape and quote characters are different, `\0` otherwise.</li>
- * <li>`comment` (default empty string): sets a single character used for skipping lines beginning with this character. By default, it is disabled.</li>
- * <li>`header` (default `false`): uses the first line as names of columns.</li>
- * <li>`inferSchema` (default `false`): infers the input schema automatically from data. It requires one extra pass over the data.</li>
- * <li>`ignoreLeadingWhiteSpace` (default `false`): a flag indicating whether or not leading whitespaces from values being read should be skipped.</li>
- * <li>`ignoreTrailingWhiteSpace` (default `false`): a flag indicating whether or not trailing whitespaces from values being read should be skipped.</li>
- * <li>`nullValue` (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.</li>
- * <li>`emptyValue` (default empty string): sets the string representation of an empty value.</li>
- * <li>`nanValue` (default `NaN`): sets the string representation of a non-number value.</li>
- * <li>`positiveInf` (default `Inf`): sets the string representation of a positive infinity value.</li>
- * <li>`negativeInf` (default `-Inf`): sets the string representation of a negative infinity value.</li>
- * <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to date type.</li>
- * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">Datetime Patterns</a>. This applies to timestamp type.</li>
- * <li>`maxColumns` (default `20480`): defines a hard limit of how many columns a record can have.</li>
- * <li>`maxCharsPerColumn` (default `-1`): defines the maximum number of characters allowed for any given value being read. By default, it is -1 meaning unlimited length.</li>
- * <li>`unescapedQuoteHandling` (default `STOP_AT_DELIMITER`): defines how the CsvParser will handle values with unescaped quotes.
- * <ul>
- * <li>`STOP_AT_CLOSING_QUOTE`: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found.</li>
- * <li>`BACK_TO_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.</li>
- * <li>`STOP_AT_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter or a line ending is found in the input.</li>
- * <li>`SKIP_VALUE`: If unescaped quotes are found in the input, the content parsed for the given value will be skipped and the value set in nullValue will be produced instead.</li>
- * <li>`RAISE_ERROR`: If unescaped quotes are found in the input, a TextParsingException will be thrown.</li>
- * </ul>
- * </li>
- * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes.
- * <ul>
- * <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To keep corrupt records, a user can set a string type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with fewer/more tokens than schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, sets `null` to extra fields. When the record has more tokens than the length of the schema, it drops extra tokens.</li>
- * <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
- * <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
- * </ul>
- * </li>
- * <li>`columnNameOfCorruptRecord` (default is the value specified in `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
- * <li>`multiLine` (default `false`): parse one record, which may span multiple lines.</li>
- * <li>`locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps.</li>
- * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator that should be used for parsing. Maximum length is 1 character.</li>
- * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.</li>
- * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option disables partition discovery</li>
 * </ul>
 *
* + * You can find the CSV-specific options for reading CSV file stream in + * + * Data Source Option in the version you use. + * * @since 2.0.0 */ def csv(path: String): DataFrame = format("csv").load(path)
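Since the streaming reader now points at the same documentation table, here is a minimal Scala sketch (not part of the patch; the input directory and schema are placeholders for illustration) showing `spark.readStream` with the stream-specific `maxFilesPerTrigger` option alongside a CSV option from the table above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object CsvStreamExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CsvStreamExample").getOrCreate()

    // Streaming CSV sources need an explicit schema unless schema inference
    // is explicitly enabled for streaming.
    val schema = new StructType()
      .add("name", StringType)
      .add("age", IntegerType)

    // "input-dir" is a placeholder directory into which new CSV files arrive.
    val people = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", "1") // stream-specific option, still documented in the Scaladoc
      .option("header", "true")          // CSV option, now documented under Data Source Option
      .csv("input-dir")

    // Echo each micro-batch to the console.
    val query = people.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```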