Doc: refactor Hive documentation with catalog loading examples #2544
> #### Hive catalog tables
>
> As described before, tables created by the `HiveCatalog` with the Hive engine feature enabled are directly visible to the Hive engine, so there is no need to create an overlay.
Here we should mention that for `HiveCatalog` we can now create tables using Hive's own column and partitioning syntax:

`CREATE TABLE database_a.table_b (id bigint, name string) PARTITIONED BY (dept string) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';`

Note: this creates an unpartitioned HMS table (while the underlying Iceberg table is partitioned).
Also, we could mention the compatibility matrix between Iceberg and Hive types:
- which Hive types are supported out-of-the-box since they have a direct equivalent in Iceberg
- which Hive types can be autoconverted (e.g. short -> int), if enabled
- which ones are unsupported altogether (e.g. interval).
I'm happy to address this in a follow-up PR though.
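For readers of this thread, a rough sketch of what such a compatibility note could illustrate; the table and column names below are made up, and the `smallint` conversion only applies if autoconversion is enabled:

```sql
-- Hypothetical illustration of the three categories listed above:
--   id bigint         -> direct Iceberg equivalent (long)
--   name string       -> direct Iceberg equivalent (string)
--   small_id smallint -> no direct equivalent; may be autoconverted to Iceberg int, if enabled
--   an interval column would be unsupported altogether
CREATE TABLE database_a.type_demo (id bigint, name string, small_id smallint)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
```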
thanks for the suggestion, I completely forgot this one, added now
Thanks for updating it!
We should probably make this a table eventually, like we have in the Spark docs.
@rdblue added a section at the end for forward and backward conversion.
@jackye1995: Big thanks for picking this up! It was on our TODO list, but we kept deprioritizing it.

@pvary @marton-bod @lcspinter thanks for the review, I updated based on the comments. For the Hadoop catalog table case, I have combined it with the custom catalog table case.
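For context, a minimal sketch of the Hadoop catalog table case referred to above, using the `iceberg.catalog.<catalog_name>.<key>` property pattern from this PR; the catalog name and warehouse path are placeholders, and the `warehouse` key name is an assumption here:

```sql
-- Hypothetical example: register a Hadoop catalog named "hadoop_cat" for the Hive session.
SET iceberg.catalog.hadoop_cat.type=hadoop;
SET iceberg.catalog.hadoop_cat.warehouse=hdfs://example-namenode:8020/warehouse;
```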
yyanyy left a comment:
This looks great from my perspective as a person who doesn't use Hive, thanks for this change!
> | Config Key                           | Description                                   |
> | ------------------------------------ | --------------------------------------------- |
> | iceberg.catalog.<catalog_name\>.type | type of catalog: `hive`, `hadoop` or `custom` |
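As a rough sketch of how this config key is used (the catalog names, the implementation class, and the `catalog-impl` key name are placeholders/assumptions here, not confirmed by this thread):

```sql
-- Hypothetical example: declare a Hive-metastore-backed catalog for the session.
SET iceberg.catalog.another_hive.type=hive;

-- Hypothetical example: a custom catalog pairs the type with an implementation class.
SET iceberg.catalog.my_catalog.type=custom;
SET iceberg.catalog.my_catalog.catalog-impl=com.example.MyCustomCatalog;
```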
Is custom supported? I thought that if catalog-impl is set, it overrides the value of type.
Currently the `type` must not be null. It can be any string, not necessarily `custom`. If null, the current logic will not load any catalog implementation because the catalog type is identified as `NO_CATALOG_TYPE`:

iceberg/mr/src/main/java/org/apache/iceberg/mr/Catalogs.java, lines 206 to 207 in 83886c8:

```java
if (NO_CATALOG_TYPE.equalsIgnoreCase(catalogType)) {
  return Optional.empty();
```

iceberg/mr/src/main/java/org/apache/iceberg/mr/Catalogs.java, lines 269 to 272 in 83886c8:

```java
if (catalogName != null) {
  String catalogType = conf.get(String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName));
  if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME) || catalogType == null) {
    return NO_CATALOG_TYPE;
```
@rdblue what you describe is the behavior for loading in Spark or Flink, and `type` can be null when `catalog-impl` is set.
`catalog-impl` does override `type` if both are set and both values are correct, but if someone sets `catalog-impl=xxxxx` and `type=hive`, it most likely indicates something wrong with the user input; that's why a consistent value `custom` is suggested (and also used in tests, based on what I see).
I think we should probably have followed the same approach in Hive to keep the experience consistent across all engines, but I did not notice this until the PR was merged, so really sorry for the inconsistency.
@lcspinter was there any concern at that time that prevented us from following the same pattern as Spark and Flink? If not, we can probably have another PR to make them consistent.
(this does not need to happen before this PR, we can merge this first and change it later).
Yeah, we should fix this to have consistent behavior across engines.
+1 for consistency
@rdblue @aokolnychyi cool, I will fix it in another PR.
> !!! Note
>     Due to the limitation of Hive `PARTITIONED BY` syntax, currently you can only partition by columns,
>     which is translated to the Iceberg identity partition transform.
>     You cannot partition by other Iceberg partition transforms such as `days(timestamp)`.
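A minimal sketch of what the note allows and disallows (database, table, and column names are placeholders):

```sql
-- Allowed: identity partitioning on a column.
CREATE TABLE database_a.events (id bigint, ts timestamp)
PARTITIONED BY (dept string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';

-- Not allowed in Hive DDL: transform-based partitioning such as days(ts),
-- because Hive's PARTITIONED BY syntax has no equivalent for transforms.
```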
You can't create tables partitioned by other Iceberg partition transforms, but I believe that they are supported if you create the table through some other engine. Right, @pvary?
Yes, this is only for CREATE TABLE. Tables created by other engines are created through CREATE EXTERNAL TABLE.
@rdblue Good point, that's correct, reads/writes are supported for partition transforms too, it's only the create table syntax which has the limitation. It's probably worth making that clear.
yeah, let me make that more clear
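To make that concrete, a hedged sketch: assuming a Spark session with an Iceberg catalog named `my_catalog` (a placeholder) backed by the same Hive metastore with Hive engine support enabled, the transform-partitioned table can be created in Spark and still queried from Hive; only the Hive `CREATE TABLE` syntax is limited.

```sql
-- In Spark SQL (hypothetical names): create a table partitioned by a transform.
CREATE TABLE my_catalog.database_a.logs (id bigint, ts timestamp)
USING iceberg
PARTITIONED BY (days(ts));

-- In Hive: the table can still be queried and written to.
SELECT count(*) FROM database_a.logs;
```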
Nice work, @jackye1995! Thanks for working on the docs like this!

@pvary @marton-bod addressed all the comments, please let me know if there are any other concerns, thanks.
site/docs/hive.md (outdated)

> `LOCATION 'hdfs://some_bucket/some_path/table_a';`
>
> If the Iceberg storage handler is not in Hive's classpath, then Hive cannot load or update the metadata for an Iceberg table when the storage handler is set.
> To avoid the appearance of broken tables in Hive, Iceberg will not add the storage handler to a table unless Hive support is enabled.
> The storage handler is kept in sync (added or removed) every time a table is updated.
nit: just as a quick clarification, the storage handler is added/removed only if the `engine.hive.enabled` table property changes from true to false (or vice versa), not for just any table update. So maybe we can reword it a bit:

> The storage handler is kept in sync (added or removed) every time Hive engine support for the table is updated, i.e. turned on or off in the table properties.
I thought this was set every time and changed depending on the current value of engine.hive.enabled?
Yep, you're exactly right. What I wanted to emphasize was the second part of your sentence, that the handler is set/removed from the HMS table whenever the `engine.hive.enabled` property changes, and in no other table update scenario.

> ... and changed depending on the current value of `engine.hive.enabled`
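To summarize the thread, a hedged sketch of the only kind of update that touches the storage handler; this assumes a Spark session with an Iceberg catalog named `my_catalog` (catalog and table names are placeholders):

```sql
-- Hypothetical Spark SQL example: toggling Hive engine support on an existing table.
-- Setting engine.hive.enabled=true adds the storage handler to the HMS table;
-- setting it to false removes it. Other table updates leave the handler untouched.
ALTER TABLE my_catalog.database_a.table_b
SET TBLPROPERTIES ('engine.hive.enabled'='true');
```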
marton-bod left a comment:
Looks good, thanks for your work @jackye1995. Just left a couple more minor comments, but otherwise good to go I think
site/docs/hive.md (outdated)

> For example, setting this in the `hive-site.xml` loaded by Spark will enable the storage handler for all tables created by Spark.
>
> !!! Warning
>     When using Tez, you also have to disable vectorization for now (`hive.vectorized.execution.enabled=false`)
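For reference, a sketch of the two settings quoted above; the `iceberg.engine.hive.enabled` property name is taken from the Iceberg Hive docs and should be treated as an assumption in this context:

```sql
-- Hive session on Tez: disable vectorization, per the warning above.
SET hive.vectorized.execution.enabled=false;

-- The hive-site.xml entry referenced above ("setting this in the hive-site.xml loaded by Spark")
-- is shown only as a comment, since it is a config-file entry rather than a session variable:
--   iceberg.engine.hive.enabled=true
```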
Can you include a version number? That way it shows up if we grep for the version and helps us keep these up to date.
site/docs/hive.md (outdated)

> For cases 2 and 3 above, users can create an overlay of an Iceberg table in the Hive metastore,
> so that different table types can work together in the same Hive environment.
> See [CREATE EXTERNAL TABLE](#create-external-table) for more details.
We should drop EXTERNAL here as well, right?
I changed it to mention both `CREATE EXTERNAL TABLE` and `CREATE TABLE`, just to avoid confusion between the two use cases.
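For the overlay use case, a minimal sketch (reusing the location from the fragment quoted earlier; the database and table names are placeholders):

```sql
-- Hypothetical overlay of an Iceberg table that lives outside the Hive catalog
-- (e.g. a Hadoop catalog or custom catalog table), so that Hive can query it.
CREATE EXTERNAL TABLE database_a.table_a
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://some_bucket/some_path/table_a';
```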
@rdblue @marton-bod thanks for the new review, I moved the type conversion to the end to cover both forward and backward conversions. Please verify whether the mappings are correct. Apart from that, I think everything should be good to go.
site/docs/hive.md (outdated)

> | float  | float  |  |
> | double | double |  |
> | date   | date   |  |
> | time   | string |  |
Not 100% sure about the types going from Iceberg to Hive; I am writing this based on the object inspectors in the MR package. Please confirm whether this conversion is correct, thanks! @marton-bod @pvary
restart test

This PR has been open for quite a long time and the major concerns are resolved, so we will merge this and address further comments in new PRs to keep things moving. Thanks for the reviews.
@pvary, @marton-bod, @lcspinter, @rdblue
As @openinx suggested in #2535, we lacked documentation for Hive catalog loading after #2129 was merged. This PR adds examples of loading the catalog and creating tables with custom catalogs. I also reorganized the doc to follow the same structure as the Spark and Flink docs, and added more details for each section.
There was also some documentation in the code that had not been updated, so I fixed that as well.
The doc changes are quite messy; using the split view to read the updated content might be easier, thanks!