Hive: unify catalog experience across engines #2565
```java
Optional<Catalog> hadoopCatalog = Catalogs.loadCatalog(conf, catalogName);
Assert.assertTrue(hadoopCatalog.isPresent());
Assert.assertTrue(hadoopCatalog.get() instanceof HadoopCatalog);
Assert.assertEquals("HadoopCatalog{name=barCatalog, location=/tmp/mylocation}", hadoopCatalog.get().toString());
```
I think hardcoding the `toString` output in the assert can become brittle and difficult to maintain over time. Shouldn't we add a `warehouseLocation()` method to `HadoopCatalog`, and then assert on `catalog.name()` and `((HadoopCatalog) catalog).warehouseLocation()` for the same effect?
This was in the original test, so I did not look too closely into it, sorry. I agree it would be better to test with those methods, but unfortunately `HadoopCatalog.warehouseLocation()` does not exist, so we cannot use it; I suppose that is why the test reads the warehouse location through `toString()` this way.
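To illustrate the suggestion, here is a minimal sketch of asserting on accessors instead of on `toString()`. `FakeHadoopCatalog` and its `warehouseLocation()` accessor are hypothetical stand-ins, since the real `HadoopCatalog` does not expose that method in the version under review:

```java
public class AssertionStyleSketch {
  // Hypothetical stand-in for HadoopCatalog with the proposed accessor.
  static class FakeHadoopCatalog {
    private final String name;
    private final String warehouseLocation;

    FakeHadoopCatalog(String name, String warehouseLocation) {
      this.name = name;
      this.warehouseLocation = warehouseLocation;
    }

    String name() {
      return name;
    }

    String warehouseLocation() {
      return warehouseLocation;
    }

    @Override
    public String toString() {
      return "HadoopCatalog{name=" + name + ", location=" + warehouseLocation + "}";
    }
  }

  public static void main(String[] args) {
    FakeHadoopCatalog catalog = new FakeHadoopCatalog("barCatalog", "/tmp/mylocation");

    // Brittle: any change to the toString() format breaks an assertion on this value.
    System.out.println(catalog.toString());

    // Proposed: assert on the individual accessors instead.
    System.out.println(catalog.name());
    System.out.println(catalog.warehouseLocation());
  }
}
```

The accessor-based assertions keep the test coupled only to the catalog's name and warehouse location, not to the formatting of its string representation.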
```diff
-String catalogType = conf.get(String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName));
-if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME) || catalogType == null) {
+String catalogType = conf.get(InputFormatConfig.catalogTypeConfigKey(catalogName));
+if (catalogName.equals(ICEBERG_HADOOP_TABLE_NAME)) {
```
Do we need to update the javadocs of this method according to this change?
updated
I moved the doc of this method to the `Catalogs` class level, because private method docs are not generated in the HTML.
```java
 * @return Hadoop config key of catalog type for the catalog name
 */
public static String catalogTypeConfigKey(String catalogName) {
  return String.format(InputFormatConfig.CATALOG_TYPE_TEMPLATE, catalogName);
```
Shall we add `type` to `CatalogProperties` as well, so we can reuse `catalogPropertyConfigKey` here?
It was not added because `type` is not consistent across engines. Flink has to use `catalog-type` instead of `type`, because `type` is a reserved config key.
That being said, we can probably add it and specify that Flink has to use `catalog-type`. Let me do that.
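For illustration, key construction under the suggested reuse might look like the following sketch. The `iceberg.catalog.%s.%s` template and the method bodies are assumptions based on this discussion, not the released constants:

```java
public class ConfigKeySketch {
  // Generic per-catalog property key, e.g. iceberg.catalog.barCatalog.warehouse.
  // The template string here is illustrative, not the actual released constant.
  static String catalogPropertyConfigKey(String catalogName, String property) {
    return String.format("iceberg.catalog.%s.%s", catalogName, property);
  }

  // If "type" were added to CatalogProperties, the type key could reuse the
  // generic helper instead of keeping a separate CATALOG_TYPE_TEMPLATE.
  static String catalogTypeConfigKey(String catalogName) {
    return catalogPropertyConfigKey(catalogName, "type");
  }

  public static void main(String[] args) {
    System.out.println(catalogTypeConfigKey("barCatalog"));
    // iceberg.catalog.barCatalog.type
  }
}
```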
I see. Which one does Spark use? Would it make sense to standardize with the other engines now and use `catalog-type` here too?
Yeah, that's actually one way we can consider. @rdblue @aokolnychyi, would it be acceptable if we unify all engines to use `catalog-type` instead, with `type` just an alias for `catalog-type` in Spark and Hive?
Since Ryan and Anton have not replied, I will keep it like this; we can always add `catalog-type` as a new catalog property in a new PR.
In addition, I am treating `CatalogUtil.ICEBERG_CATALOG_TYPE` as the catalog property to use, and removed the reference to `InputFormatConfig.CATALOG_TYPE_TEMPLATE`. I also added documentation inside `CatalogUtil` to make things clear.
I don't personally like that `CatalogUtil.ICEBERG_CATALOG_TYPE` lives in `CatalogUtil` instead of `CatalogProperties`, but we already released it in 0.11 as a public variable, so I decided to keep it this way.
```java
AssertHelpers.assertThrows(
    "should complain about catalog not supported", UnsupportedOperationException.class,
    "Unknown catalog type", () -> Catalogs.loadCatalog(conf, null));
}
```
Don't we need a test case for `legacy_custom` too?
Yeah there was no test for that, I will add it.
Actually, because `CATALOG_LOADER_CLASS` (`iceberg.mr.catalog.loader.class`) is no longer used, we cannot load a custom catalog in legacy mode, so there is no test for it.
Looking at https://github.com/apache/iceberg/pull/2129/files#diff-c183ea3aa154c2a5012f87d7a06dba3cff3f27975384e9fb4040fe6850a98bd6L192-L193, this seems like a backwards-incompatible change. @lcspinter is this intentional, or should we add that part of the logic back?
For now, I have updated the code to allow this legacy behavior, and added the missing test.
Thanks @jackye1995 for this, a few comments.
```java
 * @param catalogName catalog name
 * @return Hadoop config key of catalog warehouse location for the catalog name
 */
public static String catalogWarehouseConfigKey(String catalogName) {
```
Do we use this in other parts of the code than the ones we will deprecate when we deprecate the old configs?
If so, we might want to add `@deprecated` so nobody starts to use it.
No. I added this method because the warehouse property is commonly used. Technically, we could just use the `catalogPropertyConfigKey(...)` method for everything.
The warehouse should be an internal thing for the `HadoopCatalog`; I think no other catalog should use that. Am I missing something?
@pvary it is used in all catalogs except `HiveCatalog`, because Hive uses `HiveConf.ConfVars.METASTOREWAREHOUSE` instead.
I might be missing something, but we use `catalogWarehouseConfigKey` only in tests. The reason is that in the tests we would like to set/get the warehouse for the `HadoopCatalog` and the `CustomCatalog` (which again is only a test class). In production code we do not need this key, so I think we should move it from the live code to a test class.
Sure, I can remove that method
@marton-bod depending on which one gets merged first, I will make the change in the other PR accordingly.
We have merged the Hive doc PR, so I have added updates based on that.
@pvary @marton-bod updated and fixed all tests, should be good for another review, thanks!
@lcspinter: could you please take a look, as you are the one most familiar with this code now? Thanks,
lcspinter left a comment:
Thanks, @jackye1995 for the PR! lgtm
marton-bod left a comment:
LGTM, thanks for fixing this!
Thanks @jackye1995 for the patch and @marton-bod, @lcspinter for the reviews!
* Add 0.12.0 release notes pt 2
* Add more blurbs and fix formatting.
  - Add blurbs for #2565, #2583, and #2547.
  - Make formatting consistent.
* Add blurb for #2613 Hive Vectorized Reader
* Reword blurbs for #2565 and #2365
* More changes based on review comments
* More updates to the 0.12.0 release notes
* Add blurb for #2232 fix parquet row group filters
* Add blurb for #2308
As discussed in #2544, make the Hive catalog experience consistent with Spark and Flink, specifically:

- If `type` is `null`, look at `catalog-impl`; if that is also `null`, then use `HiveCatalog` as the default.
- Remove `CATALOG_WAREHOUSE_TEMPLATE` and `CATALOG_CLASS_TEMPLATE`, because they are unified in `CatalogProperties`.

I also rewrote all the tests to make each test case clear in separate tests instead of a single giant test; please take a look in case I missed any edge case in testing.

@pvary @marton-bod @lcspinter @aokolnychyi @rdblue
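The catalog resolution order described in the PR can be sketched roughly as follows. This is a minimal illustration only; the real logic lives in `Catalogs.loadCatalog`, and the property key names used here are assumptions based on the discussion:

```java
import java.util.HashMap;
import java.util.Map;

public class CatalogResolutionSketch {
  // Mirrors the described precedence: explicit type first, then catalog-impl,
  // then HiveCatalog as the default. Key names are illustrative, not the
  // actual constants from InputFormatConfig/CatalogProperties.
  static String resolveCatalog(Map<String, String> conf, String catalogName) {
    String type = conf.get("iceberg.catalog." + catalogName + ".type");
    if (type != null) {
      return type; // e.g. "hadoop" or "hive"
    }
    String impl = conf.get("iceberg.catalog." + catalogName + ".catalog-impl");
    if (impl != null) {
      return impl; // fully-qualified custom catalog class
    }
    return "hive"; // default: HiveCatalog
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(resolveCatalog(conf, "barCatalog")); // hive

    conf.put("iceberg.catalog.barCatalog.type", "hadoop");
    System.out.println(resolveCatalog(conf, "barCatalog")); // hadoop
  }
}
```

This matches the behavior Spark and Flink already use, which is the point of the unification.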