Add REST Catalog tests to Spark 3.5 integration test #11093
Changes from all commits (diff of TestBaseWithCatalog.java, package org.apache.iceberg.spark):
@@ -18,6 +18,11 @@
  */
 package org.apache.iceberg.spark;
 
+import static org.apache.iceberg.CatalogProperties.CATALOG_IMPL;
+import static org.apache.iceberg.CatalogUtil.ICEBERG_CATALOG_TYPE;
+import static org.apache.iceberg.CatalogUtil.ICEBERG_CATALOG_TYPE_HADOOP;
+import static org.apache.iceberg.CatalogUtil.ICEBERG_CATALOG_TYPE_HIVE;
+import static org.apache.iceberg.CatalogUtil.ICEBERG_CATALOG_TYPE_REST;
 import static org.assertj.core.api.Assertions.assertThat;
 
 import java.io.File;
@@ -36,17 +41,38 @@
 import org.apache.iceberg.catalog.SupportsNamespaces;
 import org.apache.iceberg.catalog.TableIdentifier;
 import org.apache.iceberg.hadoop.HadoopCatalog;
+import org.apache.iceberg.inmemory.InMemoryCatalog;
+import org.apache.iceberg.rest.RESTCatalog;
+import org.apache.iceberg.rest.RESTCatalogServer;
+import org.apache.iceberg.rest.RESTServerExtension;
 import org.apache.iceberg.util.PropertyUtil;
 import org.junit.jupiter.api.AfterAll;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.api.extension.RegisterExtension;
 import org.junit.jupiter.api.io.TempDir;
 
 @ExtendWith(ParameterizedTestExtension.class)
 public abstract class TestBaseWithCatalog extends TestBase {
   protected static File warehouse = null;
 
+  @RegisterExtension
+  private static final RESTServerExtension REST_SERVER_EXTENSION =
+      new RESTServerExtension(
+          Map.of(
+              RESTCatalogServer.REST_PORT,
+              RESTServerExtension.FREE_PORT,
+              // An in-memory SQLite database is, by default, private to the connection that
+              // created it. If more than one JDBC connection backed by in-memory SQLite is
+              // created behind one JdbcCatalog, different connections could see different
+              // views of table state even within the same catalog. Reference:
+              // https://www.sqlite.org/inmemorydb.html
+              CatalogProperties.CLIENT_POOL_SIZE,
+              "1"));
+
+  protected static RESTCatalog restCatalog;
+
   @Parameters(name = "catalogName = {0}, implementation = {1}, config = {2}")
   protected static Object[][] parameters() {
     return new Object[][] {
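For readers unfamiliar with the SQLite behavior referenced in the comment above, here is a hypothetical standalone demonstration, not part of this PR (class name is illustrative; assumes the sqlite-jdbc driver on the classpath), of why two plain in-memory connections do not share state:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SqliteInMemoryDemo {
      public static void main(String[] args) throws SQLException {
        // Each plain ":memory:" connection gets its own private database.
        try (Connection c1 = DriverManager.getConnection("jdbc:sqlite::memory:");
            Connection c2 = DriverManager.getConnection("jdbc:sqlite::memory:")) {
          try (Statement s = c1.createStatement()) {
            s.execute("CREATE TABLE t (id INTEGER)");
          }
          try (Statement s = c2.createStatement()) {
            s.executeQuery("SELECT count(*) FROM t"); // throws: c2 has no table t
          } catch (SQLException e) {
            System.out.println("c2 cannot see t: " + e.getMessage());
          }
        }
      }
    }

Pinning the catalog's pool to a single connection (CLIENT_POOL_SIZE = 1) sidesteps this, since every operation reuses the connection that created the tables.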
@@ -59,13 +85,14 @@ protected static Object[][] parameters() {
   }
 
   @BeforeAll
-  public static void createWarehouse() throws IOException {
+  public static void setUpAll() throws IOException {
     TestBaseWithCatalog.warehouse = File.createTempFile("warehouse", null);
     assertThat(warehouse.delete()).isTrue();
+    restCatalog = REST_SERVER_EXTENSION.client();
   }
 
   @AfterAll
-  public static void dropWarehouse() throws IOException {
+  public static void tearDownAll() throws IOException {
     if (warehouse != null && warehouse.exists()) {
       Path warehousePath = new Path(warehouse.getAbsolutePath());
       FileSystem fs = warehousePath.getFileSystem(hiveConf);

Review thread on `restCatalog = REST_SERVER_EXTENSION.client();`:

Comment: I feel like we're almost there, but we're still doing too much here that should be done in the extension. We should make this test base implement the BeforeAllCallback interface and pull the catalog out of the extension's store:

    @Override
    public void beforeAll(ExtensionContext extensionContext) {
      restCatalog =
          (RESTCatalog)
              extensionContext
                  .getStore(ExtensionContext.Namespace.create("rest"))
                  .get("rest-catalog");
    }

This will actually get called before the @BeforeAll methods.

Comment (truncated in the page): I think if you were to use the …

Reply: As discussed offline, the biggest issues with moving REST client init into a BeforeEach call are:

1. The REST server's port is assigned dynamically, so the test class needs a way to obtain it from the extension.
2. The Spark session is created in the BeforeAll stage and needs catalog connection details at that point.

For 1, the current workaround is to give the extension class a getter method so that the test class can obtain the necessary information. For 2, as an example, the existing test Hive Metastore server for Spark tests also has its port dynamically assigned; it is initialized in the test class itself during BeforeAll so that the server's hiveConf can be used to initialize the Spark session in later steps of the BeforeAll calls. Looking back at the REST client: unless we refactor the Spark session to be initialized in the BeforeEach stage, we are unlikely to be able to delay REST client init to BeforeEach. Meanwhile, initializing the Spark session in BeforeEach (spinning up a session per test) would be quite expensive.

Reply (follow-up): The other way to refactor this is to initialize the Spark session in a BeforeAllCallback extension class, which probably also means refactoring the existing HiveCatalog and HadoopCatalog initialization into separate extension classes for the existing Spark tests. As mentioned before, as long as the Spark session is initialized in the BeforeAll stage, we are unlikely to be able to delay catalog init to BeforeEach. But if the Spark session is also initialized in a separate extension class, we might not need getters on the catalog extensions at all (information can pass freely between extension classes); we would only need a getter on the Spark extension class so that the test classes can get a handle to the Spark session.
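To make the store-based handoff concrete, here is a minimal sketch of the producing side. This is not the PR's code: the class name, key, port, and the way the server is started are all assumptions; RESTCatalog, its initialize method, and the JUnit 5 store API are real.

    import java.util.Map;
    import org.apache.iceberg.rest.RESTCatalog;
    import org.junit.jupiter.api.extension.BeforeAllCallback;
    import org.junit.jupiter.api.extension.ExtensionContext;

    // Hypothetical sketch: the server extension publishes the REST client into the
    // JUnit 5 extension store so test bases can retrieve it in beforeAll().
    public class RESTServerExtensionSketch implements BeforeAllCallback {
      private static final ExtensionContext.Namespace REST_NAMESPACE =
          ExtensionContext.Namespace.create("rest");

      @Override
      public void beforeAll(ExtensionContext context) {
        RESTCatalog catalog = new RESTCatalog();
        // Placeholder URI: a real extension would first start the in-process REST
        // server on a free port, then point the client at it.
        catalog.initialize("rest", Map.of("uri", "http://localhost:8181"));
        // getRoot() scopes the store to the whole test run, so nested test classes
        // share one server and one client.
        context.getRoot().getStore(REST_NAMESPACE).put("rest-catalog", catalog);
      }
    }

The retrieval side is the beforeAll override quoted in the thread above; because extension callbacks run before @BeforeAll methods, the field would be populated by the time setUpAll() runs.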
@@ -89,13 +116,37 @@ public static void dropWarehouse() throws IOException {
   protected TableIdentifier tableIdent = TableIdentifier.of(Namespace.of("default"), "table");
   protected String tableName;
 
+  private void configureValidationCatalog() {
+    if (catalogConfig.containsKey(ICEBERG_CATALOG_TYPE)) {
+      switch (catalogConfig.get(ICEBERG_CATALOG_TYPE)) {
+        case ICEBERG_CATALOG_TYPE_HADOOP:
+          this.validationCatalog =
+              new HadoopCatalog(spark.sessionState().newHadoopConf(), "file:" + warehouse);
+          break;
+        case ICEBERG_CATALOG_TYPE_REST:
+          this.validationCatalog = restCatalog;
+          break;
+        case ICEBERG_CATALOG_TYPE_HIVE:
+          this.validationCatalog = catalog;
+          break;
+        default:
+          throw new IllegalArgumentException("Unknown catalog type");
+      }
+    } else if (catalogConfig.containsKey(CATALOG_IMPL)) {
+      switch (catalogConfig.get(CATALOG_IMPL)) {
+        case "org.apache.iceberg.inmemory.InMemoryCatalog":
+          this.validationCatalog = new InMemoryCatalog();
+          break;
+        default:
+          throw new IllegalArgumentException("Unknown catalog impl");
+      }
+    }
+    this.validationNamespaceCatalog = (SupportsNamespaces) validationCatalog;
+  }
+
   @BeforeEach
   public void before() {
-    this.validationCatalog =
-        catalogName.equals("testhadoop")
-            ? new HadoopCatalog(spark.sessionState().newHadoopConf(), "file:" + warehouse)
-            : catalog;
-    this.validationNamespaceCatalog = (SupportsNamespaces) validationCatalog;
+    configureValidationCatalog();
 
     spark.conf().set("spark.sql.catalog." + catalogName, implementation);
     catalogConfig.forEach(

Review thread on the validation catalog change:

Comment: What happens if we don't change how the validation catalog is configured?

Reply: Something like: you create a table in the REST catalog while validating against the Hive catalog.

Comment: My point is that all tests should pass even if we make no changes to the validation catalog.

Reply: If I understand correctly, the original code sets the validation catalog to a Hadoop catalog when the catalog under test is testhadoop, and to the Hive catalog otherwise. That worked in the old days, when only two catalog types, Hadoop and Hive, were tested: if the test subject was not a Hadoop catalog, setting the validation catalog to the Hive catalog sufficed. However, with the introduction of a third type, the REST catalog, the test subject can now be a REST catalog; without changing how the validation catalog is assigned, tables would be created through REST but validated against Hive. Unless you are suggesting that we should make changes to …
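As an illustration of the failure mode described above, a hypothetical test method inside TestBaseWithCatalog (not part of this PR; it assumes the base class's `sql` helper and the standard `Catalog.tableExists` API) that passes only when the catalog under test and validationCatalog share a backend:

    // Hypothetical: if the Spark catalog under test is REST-backed but
    // validationCatalog still points at Hive, the table created below is
    // invisible to the validator and this assertion fails.
    @TestTemplate
    public void testValidationCatalogSeesCreatedTable() {
      sql("CREATE TABLE %s (id BIGINT) USING iceberg", tableName);
      assertThat(validationCatalog.tableExists(tableIdent))
          .as("validation catalog must share a backend with the catalog under test")
          .isTrue();
    }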
Review thread:

Comment: Is this refresh only needed for REST?

Reply: Yes. RESTTableOperations and the other TableOperations implementations have different mechanisms for refreshing metadata.
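For context, a minimal sketch of the kind of refresh being discussed (the helper name is illustrative; `loadTable` and `refresh` are standard Iceberg APIs):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    // Illustrative helper, not code from this PR: re-fetch table metadata before
    // validating, since a REST-backed client may hold a stale view of it.
    static Table loadFreshTable(Catalog catalog, TableIdentifier ident) {
      Table table = catalog.loadTable(ident);
      // RESTTableOperations refreshes through the REST server; other
      // implementations re-read metadata files or the metastore directly.
      table.refresh();
      return table;
    }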