diff --git a/docs/sql-data-sources-text.md b/docs/sql-data-sources-text.md
index c32395f8ebb1c..d72b543f54797 100644
--- a/docs/sql-data-sources-text.md
+++ b/docs/sql-data-sources-text.md
@@ -21,8 +21,6 @@ license: |
 Spark SQL provides `spark.read().text("file_name")` to read a file or directory of text files into a Spark DataFrame, and `dataframe.write().text("path")` to write to a text file. When reading a text file, each line becomes each row that has string "value" column by default. The line separator can be changed as shown in the example below. The `option()` function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator, compression, and so on.
-
-
@@ -38,3 +36,36 @@ Spark SQL provides `spark.read().text("file_name")` to read a file or directory
+
+## Data Source Option
+
+Data source options of text can be set via:
+* the `.option`/`.options` methods of
+  * `DataFrameReader`
+  * `DataFrameWriter`
+  * `DataStreamReader`
+  * `DataStreamWriter`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table">
+  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr>
+    <td><code>wholetext</code></td>
+    <td><code>false</code></td>
+    <td>If true, read each file from input path(s) as a single row.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>lineSep</code></td>
+    <td><code>\r</code>, <code>\r\n</code>, <code>\n</code> for reading, <code>\n</code> for writing</td>
+    <td>Defines the line separator that should be used for reading or writing.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>compression</code></td>
+    <td>(none)</td>
+    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).</td>
+    <td>write</td>
+  </tr>
+</table>
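+
+For illustration only, a minimal PySpark sketch of how these options are typically passed; the paths below and the `spark` session are assumptions made for the example, not part of this page's built-in examples:
+
+```python
+# Assumes an active SparkSession bound to `spark` and hypothetical input paths.
+
+# Read each file as a single row instead of one row per line.
+whole = spark.read.option("wholetext", True).text("/tmp/text-dir")
+
+# Use a custom line separator when reading, and write gzip-compressed output.
+df = spark.read.option("lineSep", ";").text("/tmp/data.txt")
+df.write.option("compression", "gzip").option("lineSep", "\n").text("/tmp/out")
+```
+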
+Other generic options can be found in <a href="https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html">Generic File Source Options</a>.
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index b9a975ffdcc51..7719d48f6ef7c 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -313,28 +313,13 @@ def text(self, paths, wholetext=False, lineSep=None, pathGlobFilter=None,
         ----------
         paths : str or list
             string, or list of strings, for input path(s).
-        wholetext : str or bool, optional
-            if true, read each file from input path(s) as a single row.
-        lineSep : str, optional
-            defines the line separator that should be used for parsing. If None is
-            set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
-        pathGlobFilter : str or bool, optional
-            an optional glob pattern to only include files with paths matching
-            the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-            It does not change the behavior of
-            `partition discovery `_.  # noqa
-        recursiveFileLookup : str or bool, optional
-            recursively scan a directory for files. Using this option disables
-            `partition discovery `_.  # noqa
-
-            modification times occurring before the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-        modifiedBefore (batch only) : an optional timestamp to only include files with
-            modification times occurring before the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-        modifiedAfter (batch only) : an optional timestamp to only include files with
-            modification times occurring after the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
+
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_  # noqa
+            in the version you use.

         Examples
         --------
@@ -1038,13 +1023,13 @@ def text(self, path, compression=None, lineSep=None):
         ----------
         path : str
             the path in any Hadoop supported file system
-        compression : str, optional
-            compression codec to use when saving to file. This can be one of the
-            known case-insensitive shorten names (none, bzip2, gzip, lz4,
-            snappy and deflate).
-        lineSep : str, optional
-            defines the line separator that should be used for writing. If None is
-            set, it uses the default value, ``\\n``.
+
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_  # noqa
+            in the version you use.

         The DataFrame must have only one column that is of string type.
         Each row becomes a new line in the output file.
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index ad71c5041b82d..f1fbf73dce764 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -593,19 +593,13 @@ def text(self, path, wholetext=False, lineSep=None, pathGlobFilter=None,
         ----------
         paths : str or list
             string, or list of strings, for input path(s).
-        wholetext : str or bool, optional
-            if true, read each file from input path(s) as a single row.
-        lineSep : str, optional
-            defines the line separator that should be used for parsing. If None is
-            set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
-        pathGlobFilter : str or bool, optional
-            an optional glob pattern to only include files with paths matching
-            the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-            It does not change the behavior of `partition discovery`_.
-        recursiveFileLookup : str or bool, optional
-            recursively scan a directory for files. Using this option
-            disables
-            `partition discovery `_.  # noqa
+
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_  # noqa
+            in the version you use.

         Notes
         -----
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index e2c9e3126c6fb..ea84785f27af8 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -773,24 +773,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * spark.read().text("/path/to/spark/README.md")
    * }}}
    *
-   * You can set the following text-specific option(s) for reading text files:
-   * <ul>
-   * <li>`wholetext` (default `false`): If true, read a file as a single row and not split by "\n".
-   * </li>
-   * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
-   * that should be used for parsing.</li>
-   * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
-   * the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
-   * It does not change the behavior of partition discovery.</li>
-   * <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
-   * modification times occurring before the specified Time. The provided timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
-   * <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
-   * modification times occurring after the specified Time. The provided timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
-   * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
-   * disables partition discovery</li>
-   * </ul>
+   * You can find the text-specific options for reading text files in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
    *
    * @param paths input paths
    * @since 1.6.0
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 8c8def396a4d4..cb1029579aa5e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -833,13 +833,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
    * }}}
    * The text files will be encoded as UTF-8.
    *
-   * You can set the following option(s) for writing text files:
-   * <ul>
-   * <li>`compression` (default `null`): compression codec to use when saving to file. This can be
-   * one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
-   * `snappy` and `deflate`).</li>
-   * <li>`lineSep` (default `\n`): defines the line separator that should be used for writing.</li>
-   * </ul>
+   * You can find the text-specific options for writing text files in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
    *
    * @since 1.6.0
    */
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
index b369a0a59af3e..6c3fbaf00e2f7 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
@@ -413,21 +413,16 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
    * spark.readStream().text("/path/to/directory/")
    * }}}
    *
-   * You can set the following text-specific options to deal with text files:
+   * You can set the following option(s):
    * <ul>
    * <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
    * considered in every trigger.</li>
-   * <li>`wholetext` (default `false`): If true, read a file as a single row and not split by "\n".
-   * </li>
-   * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
-   * that should be used for parsing.</li>
-   * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
-   * the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
-   * It does not change the behavior of partition discovery.</li>
-   * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
-   * disables partition discovery</li>
    * </ul>
    *
+   * You can find the text-specific options for reading text files in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
+   *
    * @since 2.0.0
    */
   def text(path: String): DataFrame = format("text").load(path)
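
As a closing illustration of the streaming API touched by this last hunk, here is a minimal PySpark sketch; the input path is hypothetical and an active `spark` session is assumed, so treat it as an example rather than part of the change itself:

```python
# Stream a directory of text files, picking up at most one new file per micro-batch.
lines = (
    spark.readStream
    .option("maxFilesPerTrigger", 1)
    .option("lineSep", "\n")
    .text("/tmp/streaming-input")
)

# Echo the ingested lines to the console; call query.stop() when done.
query = lines.writeStream.format("console").start()
```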