Allow custom hadoop properties to be loaded in the Spark data source #7
Conversation
Properties that start with iceberg.hadoop are copied into the Hadoop Configuration used in the Spark source. These may be set in table properties or in read and write options passed to the Spark operation. Read and write options take precedence over the table properties. Supporting these custom Hadoop properties should also be done in other Iceberg integrations in subsequent patches.
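As a usage sketch (not taken from this PR's tests): the read below assumes a Spark session already configured for Iceberg and a table reachable as db.table; the property name is only an illustration, and it is assumed that the iceberg.hadoop. prefix is stripped when the value is copied into the Configuration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWithHadoopProps {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-hadoop-props")
        .getOrCreate();

    // Any read/write option prefixed with iceberg.hadoop. is copied into the
    // Hadoop Configuration used by the source (prefix assumed to be stripped),
    // overriding the same setting from table properties or the session config.
    Dataset<Row> df = spark.read()
        .format("iceberg")
        .option("iceberg.hadoop.fs.s3a.connection.maximum", "100")
        .load("db.table");

    df.show();
  }
}
```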
Addressed the comments; this is ready for another round of review.
Addressed all comments so far.
      }

-     return Optional.of(new Writer(table, lazyConf(), format));
+     return Optional.of(new Writer(table, conf, format));
After #47, we may not need to pass in conf. (Not something we need to change here.)
      }

-     protected SparkSession lazySparkSession() {
+     private SparkSession lazySparkSession() {
This is nice to have in subclasses, which is why it is protected. We use it in findTable to get information about the catalog to use. Not a big deal if it becomes private, since we can make a quick change in our add-on library and keep track of it there.
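For illustration, a minimal sketch of the lazy-init pattern and the kind of subclass use being described here; the class names, the extension point, and the config key are hypothetical, not the actual IcebergSource API.

```java
import org.apache.spark.sql.SparkSession;

abstract class LazySessionSource {
  private SparkSession lazySpark = null;

  // Lazily create (or reuse) the active SparkSession the first time it is needed.
  protected SparkSession lazySparkSession() {
    if (lazySpark == null) {
      this.lazySpark = SparkSession.builder().getOrCreate();
    }
    return lazySpark;
  }
}

class CatalogAwareSource extends LazySessionSource {
  // A subclass can consult the session config to decide which catalog to use;
  // the property name below is hypothetical.
  String catalogName() {
    return lazySparkSession().conf().get("spark.sql.iceberg.catalog", "default");
  }
}
```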
      Table table = findTable(options, conf);
      // Set confs from table properties, but do not overwrite options from the Spark Context with
      // configurations from the table
      mergeIcebergHadoopConfs(conf, table.properties(), false);
I think this still needs to be true, in which case we can remove the option. Table properties still need to override those set in the Hadoop Configuration. Then we re-apply the ones from options to fix up precedence.
Hm, I would think that properties set in the JVM, particularly if set on the Spark Context via spark.hadoop.*, should take precedence over the table properties.
Values set in the Configuration are session-specific, and what we want is to move to table settings instead of Spark settings for configuration, like Parquet row group size, that is tied to the data. Write-specific settings from the write config can override.
Table settings should take priority over session-wide settings because session-wide config would apply to all tables, and that's not usually appropriate, as the row group size example shows.
That's fair enough. I suppose as long as the behavior is well documented, it should be clear to users how to get the final configuration they want.
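To make the precedence settled on here concrete, a minimal sketch, assuming a helper that copies iceberg.hadoop.* entries into a Configuration (the actual helper under review appears in the next snippet); the call order is an illustration, not necessarily the merged implementation.

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

final class ConfPrecedenceSketch {
  // Lowest to highest priority: session Hadoop Configuration, table properties,
  // then the read/write options passed to this specific operation.
  static Configuration mergedConf(Configuration sessionConf,
                                  Map<String, String> tableProps,
                                  Map<String, String> options) {
    Configuration conf = new Configuration(sessionConf);   // start from session-wide settings
    copyIcebergHadoopConfs(conf, tableProps);              // table settings override session settings
    copyIcebergHadoopConfs(conf, options);                 // options re-applied last so they win
    return conf;
  }

  // Hypothetical helper: copies iceberg.hadoop.* entries, stripping the prefix.
  private static void copyIcebergHadoopConfs(Configuration conf, Map<String, String> props) {
    props.forEach((key, value) -> {
      if (key.startsWith("iceberg.hadoop.")) {
        conf.set(key.substring("iceberg.hadoop.".length()), value);
      }
    });
  }
}
```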
          Configuration baseConf, Map<String, String> options, boolean overwrite) {
        options.keySet().stream()
            .filter(key -> key.startsWith("iceberg.hadoop"))
            .filter(key -> overwrite || baseConf.get(key.replaceFirst("iceberg.hadoop", "")) == null)
Doesn't overwrite discard all keys? I don't think it matters now because it isn't needed anymore.
@aokolnychyi, yes. Want to open a PR?

Yeah, will do so.