-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-40667][SQL] Refactor File Data Source Options #38113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
Outdated
Show resolved
Hide resolved
|
@brkyvz committed changes to address your comments, please take another review! |
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
Outdated
Show resolved
Hide resolved
|
@brkyvz comment addressed, feel free to take another look |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/FileSourceOptions.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/FileSourceOptions.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
Outdated
Show resolved
Hide resolved
sadikovi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thank you for making the changes. I left a few comments, would appreciate it if you could take a look.
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
Outdated
Show resolved
Hide resolved
|
Can one of the admins verify this patch? |
| } | ||
|
|
||
| test("SPARK-40667: Check the number of valid Avro options") { | ||
| assert(AvroOptions.getAllValidOptionNames.size == 9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the point of having this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally we want use this simple test to remind developers of what should be done when introducing a new option, but I just realized it will not serve that purpose but just piss off developers. Let me remove them.
| val CHARSET = newOption("charset") | ||
| val ENCODING = newOption("encoding", Some(CHARSET)) | ||
| val CODEC = newOption("codec") | ||
| val COMPRESSION = newOption("compression", Some(CODEC)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks confusing. why do we need to register an option twice if it has an alternaive? I thought we only need to do it once as the register API allows you to specify an alternative.
|
@brkyvz @sadikovi @cloud-fan comments addressed, please take another look. |
sadikovi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I left a few comments, can you address them before merging?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
Outdated
Show resolved
Hide resolved
brkyvz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a minor comment on the completeness of the CSV and JSON option tests. We can add the number of options count back to the test too as long as we validate that all the options are registered
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
Outdated
Show resolved
Hide resolved
| * @param alternative Alternative option name | ||
| */ | ||
| protected def newOption(name: String, alternative: String): Unit = { | ||
| // Register both of the options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can we know which one is the primary one? for example,
val charset = parameters.getOrElse(ENCODING,
parameters.getOrElse(CHARSET, StandardCharsets.UTF_8.name()))
ENCODING is the primary one as it will be respected if both are set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't really care about which one is primary here, the reason we want to track alternative options is that callers may want to provide an error / log a warning if both of the alternative options are provided. Which one will be respected could be decided by the caller.
|
thanks, merging to master! |
What changes were proposed in this pull request?
Code refactor on all File data source options:
TextOptionsCSVOptionsJSONOptionsAvroOptionsParquetOptionsOrcOptionsFileIndexrelated optionsChange semantics:
DataSourceOptions, which defines the following functions:newOption(name): Register a new optionnewOption(name, alternative): Register a new option with alternativegetAllValidOptions: retrieve all valid optionsisValidOption(name): validate a given option namegetAlternativeOption(name): get alternative option name if anyWhy are the changes needed?
Currently for each file data source, all options are placed sparsely in the options class and there is no clear list of all options supported. As more and more options are added, the readability get worse. Thus, we want to refactor those codes so that
Does this PR introduce any user-facing change?
No
How was this patch tested?