[SPARK-27627][SQL] Make option "pathGlobFilter" as a general option for all file sources #24518
Changes from all commits: 53e9b6b, 5aac13c, a308f3f, dcab4f9, 4d2c6e1, ddf1874, 5d678e2, d8f8420
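For context, the option being generalized here is used from the DataFrame reader roughly as sketched below. The example is illustrative only (the format, paths, and example app are not taken from the PR); it assumes the behavior described in the diffs: the glob filters file names using `org.apache.hadoop.fs.GlobFilter` syntax and does not affect partition discovery.

```scala
import org.apache.spark.sql.SparkSession

object PathGlobFilterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pathGlobFilter example")
      .master("local[*]")
      .getOrCreate()

    // Only files whose names match the glob are read; partition discovery
    // is unaffected by the filter.
    val df = spark.read
      .format("parquet")                      // any file-based source
      .option("pathGlobFilter", "*.parquet")  // glob applied to file names
      .load("/tmp/input")                     // illustrative path

    df.show()
    spark.stop()
  }
}
```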
A new one-line file is added (its contents suggest a fixture that the filter should skip):

```diff
@@ -0,0 +1 @@
+do not read this
```
In `AvroFileFormat` (schema inference):

```diff
@@ -57,6 +57,10 @@ private[avro] class AvroFileFormat extends FileFormat
       options: Map[String, String],
       files: Seq[FileStatus]): Option[StructType] = {
     val conf = spark.sessionState.newHadoopConf()
+    if (options.contains("ignoreExtension")) {
+      logWarning(s"Option ${AvroOptions.ignoreExtensionKey} is deprecated. Please use the " +
+        "general data source option pathGlobFilter for filtering file names.")
+    }
     val parsedOptions = new AvroOptions(options, conf)

     // User can specify an optional avro json schema.
```
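The warning above points users at the general option. A rough migration sketch follows; it assumes an existing `SparkSession` named `spark`, and the paths are illustrative. Reading only `*.avro` files, which `ignoreExtension = false` used to control, can instead be expressed with the new filter.

```scala
// Assumes an existing SparkSession named `spark`; paths are illustrative.

// Deprecated, Avro-specific: only read files with the .avro extension.
val before = spark.read.format("avro")
  .option("ignoreExtension", "false")
  .load("/data/events")

// General option introduced by this PR: filter file names with a glob instead.
val after = spark.read.format("avro")
  .option("pathGlobFilter", "*.avro")
  .load("/data/events")
```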
In `AvroOptions`:

```diff
@@ -59,14 +59,15 @@ class AvroOptions(
    * If the option is not set, the Hadoop's config `avro.mapred.ignore.inputs.without.extension`
    * is taken into account. If the former one is not set too, file extensions are ignored.
    */
+  @deprecated("Use the general data source option pathGlobFilter for filtering file names", "3.0")
```
Review comments on the `@deprecated` annotation:

Member: I am wondering who this deprecation warning is for? Spark users don't use …

Member: I think it already shows a warning at https://github.com/apache/spark/pull/24518/files/d8f8420d9d3c97f96c1e09855e008ece3f275ad3#diff-8b28467c7f7a28d7fcf208a613a373c8R61

Member: Only in one case, at schema inference. I would remove this annotation and print the warning in the initialization of …

Member: I think we should remove … It would be great if we can put that logic into:

```scala
parameters
  .get(AvroOptions.ignoreExtensionKey)
  .map { v =>
    logWarning(...)
    v.toBoolean
  }.getOrElse(!ignoreFilesWithoutExtension)
```

However, can you make it not show the logs too many times? If we put it there, it seems like it will show the same logs multiple times.

Member: If you can find a better way, please go ahead and open a PR (and see some nits I picked below).
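One way to address the "same log multiple times" concern, sketched here as a hypothetical helper (not what the PR does; the object name and the use of SLF4J are my own assumptions), is to guard the warning with a process-wide flag:

```scala
import java.util.concurrent.atomic.AtomicBoolean

import org.slf4j.LoggerFactory

// Hypothetical helper: emits the deprecation warning at most once per JVM,
// so constructing AvroOptions repeatedly does not flood the logs.
object IgnoreExtensionDeprecation {
  private val log = LoggerFactory.getLogger(getClass)
  private val warned = new AtomicBoolean(false)

  def warnOnce(): Unit = {
    if (warned.compareAndSet(false, true)) {
      log.warn("Option ignoreExtension is deprecated. Please use the general " +
        "data source option pathGlobFilter for filtering file names.")
    }
  }
}
```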
The same hunk continues below the annotation:

```diff
   val ignoreExtension: Boolean = {
     val ignoreFilesWithoutExtensionByDefault = false
     val ignoreFilesWithoutExtension = conf.getBoolean(
       AvroFileFormat.IgnoreFilesWithoutExtensionProperty,
       ignoreFilesWithoutExtensionByDefault)

     parameters
-      .get("ignoreExtension")
+      .get(AvroOptions.ignoreExtensionKey)
       .map(_.toBoolean)
       .getOrElse(!ignoreFilesWithoutExtension)
   }
```
```diff
@@ -93,4 +94,6 @@ object AvroOptions {
       .getOrElse(new Configuration())
     new AvroOptions(CaseInsensitiveMap(parameters), hadoopConf)
   }
+
+  val ignoreExtensionKey = "ignoreExtension"
 }
```
In the PySpark reader (`option` / `options` docstrings):

```diff
@@ -120,6 +120,9 @@ def option(self, key, value):
             * ``timeZone``: sets the string that indicates a timezone to be used to parse timestamps
                 in the JSON/CSV datasources or partition values.
                 If it isn't set, it uses the default value, session local timezone.
+            * ``pathGlobFilter``: an optional glob pattern to only include files with paths matching
+                the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter.
+                It does not change the behavior of partition discovery.
         """
         self._jreader = self._jreader.option(key, to_str(value))
         return self
```

Member (on the new ``pathGlobFilter`` entry): Sorry, actually can we move this documentation to each implementation of CSV, Parquet, ORC, text? It will only work with such internal file-based sources.

HyukjinKwon marked this conversation as resolved.
```diff
@@ -132,6 +135,9 @@ def options(self, **options):
             * ``timeZone``: sets the string that indicates a timezone to be used to parse timestamps
                 in the JSON/CSV datasources or partition values.
                 If it isn't set, it uses the default value, session local timezone.
+            * ``pathGlobFilter``: an optional glob pattern to only include files with paths matching
+                the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter.
+                It does not change the behavior of partition discovery.
         """
         for k in options:
             self._jreader = self._jreader.option(k, to_str(options[k]))
```
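Since the docstring points at `org.apache.hadoop.fs.GlobFilter` for the glob syntax, here is a small standalone sketch (my own illustration, not part of the PR) of how that filter matches the file-name component of a path:

```scala
import org.apache.hadoop.fs.{GlobFilter, Path}

object GlobFilterDemo {
  def main(args: Array[String]): Unit = {
    // GlobFilter matches against the file name component of the path,
    // which is the semantics pathGlobFilter builds on.
    val filter = new GlobFilter("*.json")
    println(filter.accept(new Path("/data/part-00000.json")))    // true
    println(filter.accept(new Path("/data/part-00000.json.gz"))) // false
  }
}
```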