@xiaonanyang-db xiaonanyang-db commented Oct 5, 2022

What changes were proposed in this pull request?

Refactor the options classes of all file data sources:

  • TextOptions
  • CSVOptions
  • JSONOptions
  • AvroOptions
  • ParquetOptions
  • OrcOptions
  • FileIndex related options

Change semantics:

  • First, we introduce a new trait DataSourceOptions, which defines the following functions:
    • newOption(name): Register a new option
    • newOption(name, alternative): Register a new option together with an alternative name
    • getAllValidOptions: retrieve all valid options
    • isValidOption(name): validate a given option name
    • getAlternativeOption(name): get alternative option name if any
  • Then, for each class above
    • Create/update its companion object to extend from the trait above and register all valid options within it.
    • Update places where name strings are used directly to fetch option values to use those option constants instead.
    • Add a unit test for each file data source's options.
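
The trait described above could look roughly like the following. This is a minimal sketch reconstructed from the function list in this description, not the merged code: the backing map, the return types, and the `ExampleOptions` object are assumptions made for illustration. The option names are borrowed from the review snippets in this conversation.

```scala
import scala.collection.mutable

// Sketch of the trait described above; the internals are assumptions.
trait DataSourceOptions {
  // Maps each registered option name to its alternative name, if any.
  private val validOptions = mutable.Map.empty[String, Option[String]]

  /** Register a new option and return its name so it can be kept as a constant. */
  protected def newOption(name: String): String = {
    validOptions += (name -> None)
    name
  }

  /** Register a new option together with an alternative name for it. */
  protected def newOption(name: String, alternative: String): Unit = {
    // Register both names, each pointing at the other.
    validOptions += (name -> Some(alternative))
    validOptions += (alternative -> Some(name))
  }

  /** Retrieve all valid option names. */
  def getAllValidOptions: Set[String] = validOptions.keySet.toSet

  /** Validate a given option name. */
  def isValidOption(name: String): Boolean = validOptions.contains(name)

  /** Get the alternative option name, if any. */
  def getAlternativeOption(name: String): Option[String] =
    validOptions.get(name).flatten
}

// Hypothetical companion-style object that registers its options once,
// so call sites can use the constants instead of raw strings.
object ExampleOptions extends DataSourceOptions {
  val LINE_SEP: String = newOption("lineSep")

  val ENCODING = "encoding"
  val CHARSET = "charset"
  newOption(ENCODING, CHARSET)

  val COMPRESSION = "compression"
  val CODEC = "codec"
  newOption(COMPRESSION, CODEC)
}
```

With this shape, `ExampleOptions.getAllValidOptions` gives the full list of supported names in one place, and a lookup such as `parameters.get(ExampleOptions.ENCODING)` replaces a string literal at the call site.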

Why are the changes needed?

Currently, each file data source's options are scattered throughout its options class, and there is no clear list of all supported options. As more options are added, readability gets worse. We therefore want to refactor this code so that:

  • we can easily get a list of supported options for each data source
  • we enforce better practice for adding new options going forward.

Does this PR introduce any user-facing change?

No

How was this patch tested?

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-40667] Refactor File Data Source Options" to "[SPARK-40667][SQL] Refactor File Data Source Options" on Oct 6, 2022
@xiaonanyang-db
Contributor Author

@brkyvz I committed changes to address your comments, please take another look!

@xiaonanyang-db
Contributor Author

@brkyvz comment addressed, feel free to take another look

Contributor

@sadikovi sadikovi left a comment

Looks good, thank you for making the changes. I left a few comments, would appreciate it if you could take a look.

@AmplabJenkins

Can one of the admins verify this patch?

}

test("SPARK-40667: Check the number of valid Avro options") {
assert(AvroOptions.getAllValidOptionNames.size == 9)
Contributor

what's the point of having this test?

Contributor Author

Originally we wanted to use this simple test to remind developers of what should be done when introducing a new option, but I just realized it will not serve that purpose and will only annoy developers. Let me remove them.

val CHARSET = newOption("charset")
val ENCODING = newOption("encoding", Some(CHARSET))
val CODEC = newOption("codec")
val COMPRESSION = newOption("compression", Some(CODEC))
Contributor

this looks confusing. why do we need to register an option twice if it has an alternative? I thought we only need to do it once, as the register API allows you to specify an alternative.

@xiaonanyang-db
Contributor Author

@brkyvz @sadikovi @cloud-fan comments addressed, please take another look.

Contributor

@sadikovi sadikovi left a comment

Looks good. I left a few comments, can you address them before merging?

Contributor

@brkyvz brkyvz left a comment

LGTM with a minor comment on the completeness of the CSV and JSON option tests. We can also add the option-count check back to the tests, as long as we validate that all the options are registered.

* @param alternative Alternative option name
*/
protected def newOption(name: String, alternative: String): Unit = {
// Register both of the options
Contributor

@cloud-fan cloud-fan Oct 11, 2022

how can we know which one is the primary one? for example,

  val charset = parameters.getOrElse(ENCODING,
    parameters.getOrElse(CHARSET, StandardCharsets.UTF_8.name()))

ENCODING is the primary one as it will be respected if both are set.

Contributor Author

@xiaonanyang-db xiaonanyang-db Oct 11, 2022

I think we don't really care which one is primary here. The reason we want to track alternative options is that callers may want to raise an error or log a warning if both of the alternative options are provided. Which one is respected can be decided by the caller.
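
To illustrate the point about leaving the decision to the caller, a call site could resolve a pair of alternative options along these lines. This is only a sketch under assumptions: `resolveOption`, its policy that the first name wins, and the plain `Map[String, String]` used for the parameters are hypothetical and not part of this PR.

```scala
object AlternativeOptionSketch {
  /**
   * Hypothetical helper. Returns the effective value plus an optional
   * warning message when both alternative names are set at once.
   * The caller decides which name takes precedence by passing it as `primary`.
   */
  def resolveOption(
      parameters: Map[String, String],
      primary: String,
      alternative: String,
      default: String): (String, Option[String]) = {
    (parameters.get(primary), parameters.get(alternative)) match {
      case (Some(p), Some(_)) =>
        // Both names were provided: respect the primary one and report it.
        (p, Some(s"Both '$primary' and '$alternative' are set; using '$primary'."))
      case (Some(p), None) => (p, None)
      case (None, Some(a)) => (a, None)
      case (None, None) => (default, None)
    }
  }
}
```

Called with `primary = "encoding"` and `alternative = "charset"`, this reproduces the precedence of the nested `getOrElse` quoted in the comment above, and the second element of the result gives the caller something to log or turn into an error.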

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 8e85393 Oct 11, 2022