Doc: refactor Hive documentation with catalog loading examples #2544
> #### Hive catalog tables
>
> As described before, tables created by the `HiveCatalog` with the Hive engine feature enabled are directly visible to the Hive engine, so there is no need to create an overlay.
Here we should mention that for `HiveCatalog` we can now create tables using Hive's own column and partitioning syntax:

`CREATE TABLE database_a.table_b (id bigint, name string) PARTITIONED BY (dept string) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';`

Note: this creates an unpartitioned HMS table (while the underlying Iceberg table is partitioned).
Also, we could mention the compatibility matrix between Iceberg and Hive types:
- which Hive types are supported out-of-the-box since they have a direct equivalent in Iceberg
- which Hive types can be autoconverted (e.g. short -> int), if enabled
- which ones are unsupported altogether (e.g. interval).
I'm happy to address this in a follow-up PR though.
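For readers of this thread, a rough sketch of what such a compatibility note could illustrate; the table and column names below are made up, and the `smallint` conversion only applies if autoconversion is enabled:

```sql
-- Hypothetical illustration of the three categories listed above:
--   id bigint         -> direct Iceberg equivalent (long)
--   name string       -> direct Iceberg equivalent (string)
--   small_id smallint -> no direct equivalent; may be autoconverted to Iceberg int, if enabled
--   an interval column would be unsupported altogether
CREATE TABLE database_a.type_demo (id bigint, name string, small_id smallint)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
```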
thanks for the suggestion, I completely forgot this one, added now
Thanks for updating it!
We should probably make this a table eventually, like we have in the Spark docs.
@rdblue added a section at the end for forward and backward conversion.
@jackye1995: Big thanks for picking this up! It was on our TODO list, but we kept deprioritizing it.

@pvary @marton-bod @lcspinter thanks for the review, I updated based on the comments. For the Hadoop catalog table case, I have combined it with the custom catalog table case.
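For context, a minimal sketch of the Hadoop catalog table case referred to above, using the `iceberg.catalog.<catalog_name>.<key>` property pattern from this PR; the catalog name and warehouse path are placeholders, and the `warehouse` key name is an assumption here:

```sql
-- Hypothetical example: register a Hadoop catalog named "hadoop_cat" for the Hive session.
SET iceberg.catalog.hadoop_cat.type=hadoop;
SET iceberg.catalog.hadoop_cat.warehouse=hdfs://example-namenode:8020/warehouse;
```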
yyanyy left a comment:
This looks great from my perspective as a person who doesn't use Hive, thanks for this change!
> | Config Key                           | Description                                   |
> | ------------------------------------ | --------------------------------------------- |
> | iceberg.catalog.<catalog_name\>.type | type of catalog: `hive`, `hadoop` or `custom` |
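As a rough sketch of how this config key is used (the catalog names, the implementation class, and the `catalog-impl` key name are placeholders/assumptions here, not confirmed by this thread):

```sql
-- Hypothetical example: declare a Hive-metastore-backed catalog for the session.
SET iceberg.catalog.another_hive.type=hive;

-- Hypothetical example: a custom catalog pairs the type with an implementation class.
SET iceberg.catalog.my_catalog.type=custom;
SET iceberg.catalog.my_catalog.catalog-impl=com.example.MyCustomCatalog;
```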
Is custom supported? I thought that if catalog-impl is set, it overrides the value of type.
Currently the `type` must not be null. It can be any string, not necessarily `custom`. If null, the current logic will not load any catalog implementation because the catalog type is identified as `NO_CATALOG_TYPE`:

iceberg/mr/src/main/java/org/apache/iceberg/mr/Catalogs.java, lines 206 to 207 in 83886c8:

```java
if (NO_CATALOG_TYPE.equalsIgnoreCase(catalogType)) {
  return Optional.empty();
```

iceberg/mr/src/main/java/org/apache/iceberg/mr/Catalogs.java, lines 269 to 272 in 83886c8:

```java
if (catalogName != null) {
  String catalogType = conf.get(String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName));
  if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME) || catalogType == null) {
    return NO_CATALOG_TYPE;
```
@rdblue what you describe is the behavior for loading in Spark or Flink, and `type` can be null when `catalog-impl` is set.
`catalog-impl` does override `type` if both are set and both values are correct, but if someone sets `catalog-impl=xxxxx` and `type=hive`, it most likely indicates something wrong with the user input; that's why a consistent value `custom` is suggested (and also used in tests, based on what I see).
I think we should probably have followed the same approach in Hive to keep the experience consistent across all engines, but I did not notice this until the PR was merged, so really sorry for the inconsistency.
@lcspinter was there any concern at that time that prevented us from following the same pattern as Spark and Flink? If not, we can probably have another PR to make them consistent.
(this does not need to happen before this PR, we can merge this first and change it later).
Yeah, we should fix this to have consistent behavior across engines.
+1 for consistency
@rdblue @aokolnychyi cool, I will fix it in another PR.
> !!! Note
>     Due to the limitation of Hive `PARTITIONED BY` syntax, currently you can only partition by columns,
>     which is translated to the Iceberg identity partition transform.
>     You cannot partition by other Iceberg partition transforms such as `days(timestamp)`.
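A minimal sketch of what the note allows and disallows (database, table, and column names are placeholders):

```sql
-- Allowed: identity partitioning on a column.
CREATE TABLE database_a.events (id bigint, ts timestamp)
PARTITIONED BY (dept string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';

-- Not allowed in Hive DDL: transform-based partitioning such as days(ts),
-- because Hive's PARTITIONED BY syntax has no equivalent for transforms.
```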
You can't create tables partitioned by other Iceberg partition transforms, but I believe that they are supported if you create the table through some other engine. Right, @pvary?
Yes, this is only for CREATE TABLE. Tables created by other engines are created through CREATE EXTERNAL TABLE.
@rdblue Good point, that's correct, reads/writes are supported for partition transforms too, it's only the create table syntax which has the limitation. It's probably worth making that clear.
yeah, let me make that more clear
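To make that concrete, a hedged sketch: assuming a Spark session with an Iceberg catalog named `my_catalog` (a placeholder) backed by the same Hive metastore with Hive engine support enabled, the transform-partitioned table can be created in Spark and still queried from Hive; only the Hive `CREATE TABLE` syntax is limited.

```sql
-- In Spark SQL (hypothetical names): create a table partitioned by a transform.
CREATE TABLE my_catalog.database_a.logs (id bigint, ts timestamp)
USING iceberg
PARTITIONED BY (days(ts));

-- In Hive: the table can still be queried and written to.
SELECT count(*) FROM database_a.logs;
```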
Nice work, @jackye1995! Thanks for working on the docs like this!

@pvary @marton-bod addressed all the comments, please let me know if there are any other concerns, thanks.
site/docs/hive.md (outdated)

> `LOCATION 'hdfs://some_bucket/some_path/table_a';`
>
> If the Iceberg storage handler is not in Hive's classpath, then Hive cannot load or update the metadata for an Iceberg table when the storage handler is set.
> To avoid the appearance of broken tables in Hive, Iceberg will not add the storage handler to a table unless Hive support is enabled.
> The storage handler is kept in sync (added or removed) every time a table is updated.
nit: just as a quick clarification, the storage handler is added/removed only if the `engine.hive.enabled` table property changes from true to false (or vice versa), not for just any table update. So maybe we can reword it a bit:

> The storage handler is kept in sync (added or removed) every time Hive engine support for the table is updated, i.e. turned on or off in the table properties.
I thought this was set every time and changed depending on the current value of engine.hive.enabled?
Yep, you're exactly right. What I wanted to emphasize was the second part of your sentence, that the handler is set/removed from the HMS table whenever the `engine.hive.enabled` property changes, and in no other table update scenario.

> ... and changed depending on the current value of `engine.hive.enabled`
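To summarize the thread, a hedged sketch of the only kind of update that touches the storage handler; this assumes a Spark session with an Iceberg catalog named `my_catalog` (catalog and table names are placeholders):

```sql
-- Hypothetical Spark SQL example: toggling Hive engine support on an existing table.
-- Setting engine.hive.enabled=true adds the storage handler to the HMS table;
-- setting it to false removes it. Other table updates leave the handler untouched.
ALTER TABLE my_catalog.database_a.table_b
SET TBLPROPERTIES ('engine.hive.enabled'='true');
```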
marton-bod left a comment:
Looks good, thanks for your work @jackye1995. Just left a couple more minor comments, but otherwise good to go I think
site/docs/hive.md (outdated)

> For example, setting this in the `hive-site.xml` loaded by Spark will enable the storage handler for all tables created by Spark.
>
> !!! Warning
>     When using Tez, you also have to disable vectorization for now (`hive.vectorized.execution.enabled=false`)
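For reference, a sketch of the two settings quoted above; the `iceberg.engine.hive.enabled` property name is taken from the Iceberg Hive docs and should be treated as an assumption in this context:

```sql
-- Hive session on Tez: disable vectorization, per the warning above.
SET hive.vectorized.execution.enabled=false;

-- The hive-site.xml entry referenced above ("setting this in the hive-site.xml loaded by Spark")
-- is shown only as a comment, since it is a config-file entry rather than a session variable:
--   iceberg.engine.hive.enabled=true
```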
Can you include a version number? That way it shows up if we grep for the version and helps us keep these up to date.
site/docs/hive.md (outdated)

> For cases 2 and 3 above, users can create an overlay of an Iceberg table in the Hive metastore,
> so that different table types can work together in the same Hive environment.
> See [CREATE EXTERNAL TABLE](#create-external-table) for more details.
We should drop EXTERNAL here as well, right?
I changed it to mention both `CREATE EXTERNAL TABLE` and `CREATE TABLE`, just to avoid confusion between the two use cases.
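For the overlay use case, a minimal sketch (reusing the location from the fragment quoted earlier; the database and table names are placeholders):

```sql
-- Hypothetical overlay of an Iceberg table that lives outside the Hive catalog
-- (e.g. a Hadoop catalog or custom catalog table), so that Hive can query it.
CREATE EXTERNAL TABLE database_a.table_a
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://some_bucket/some_path/table_a';
```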
@rdblue @marton-bod thanks for the new review, I moved the type conversion to the end to cover both forward and backward conversions. Please verify whether the mappings are correct. Apart from that, I think everything should be good to go.
site/docs/hive.md (outdated)

> | float  | float  |  |
> | double | double |  |
> | date   | date   |  |
> | time   | string |  |
Not 100% sure about the types going from Iceberg to Hive; I am writing this based on the object inspectors in the MR package. Please confirm whether this conversion is correct, thanks! @marton-bod @pvary
restart test

This PR has been open for quite a long time and the major concerns are resolved, so we will merge this and address further comments in new PRs to keep things moving. Thanks for the reviews.
@pvary, @marton-bod, @lcspinter, @rdblue
As @openinx suggested in #2535, we lacked documentation for Hive catalog loading after #2129 was merged. This PR adds examples of loading the catalog and creating tables with custom catalogs. I also reorganized the doc to follow the same structure as the Spark and Flink docs, and added more details for each section.
There was also some documentation in the code that had not been updated, so I fixed that as well.
The doc changes are quite messy; using the split view to read the updated content might be easier, thanks!