Spark: Fix CREATE VIEW IF NOT EXISTS failure when non-Iceberg view exists in SparkSessionCatalog #14930
Conversation
@huaxingao The previous PR was automatically closed due to a force push, so I've opened a new one.
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

public class TestSparkSessionCatalogWithExtensions {
I would suggest first fixing the issue in one Spark version and backporting it later
+1 to first fix 4.1 and then back-porting
  }
}

public static void setUpCatalog() {
nit: private?
spark.conf().set("spark.sql.catalog.spark_catalog.type", "hive");
}

public static void resetSparkCatalog() {
nit: private?
protected static TestHiveMetastore metastore = null;
protected static HiveConf hiveConf = null;
protected static SparkSession spark = null;
protected static JavaSparkContext sparkContext = null;
is this necessary? If not, can we remove?
spark
    .conf()
    .set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog");
spark.conf().set("spark.sql.catalog.spark_catalog.type", "hive");
Should we add spark.sessionState().catalogManager().reset() when flipping these configs (either inside the helper methods or immediately after calling them in the tests), similar to how spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/TestSparkSessionCatalog.java does it?
@nastra @huaxingao Thanks for the review and the suggestion. I've updated the commit. Appreciate your feedback!
@stuxuhai thanks for submitting the PR. We might need to revisit the behavior of views in the `SparkSessionCatalog` more broadly. Basically, right now we either have the actual Iceberg catalog or the underlying Spark session catalog implementing Spark's `ViewCatalog` API, and the delegation between the two is where these inconsistencies come from. I understand that using a session catalog that mixes Iceberg and non-Iceberg views is surprising in its current form.
}

@AfterEach
public void useHiveCatalog() {
I believe this is then going to use Spark's HiveSessionCatalog right? We might want to be clearer here as otherwise one would expect that we're using Iceberg's HiveCatalog
try {
  // create Hive view
  spark.sql(String.format("CREATE VIEW %s AS SELECT 1 AS id", viewName));
} finally {
I don't think we need any of those try-finally blocks
}

try {
  spark.sql(String.format("CREATE VIEW IF NOT EXISTS %s AS SELECT 2 AS id", viewName));
instead of using `spark.sql(...)` you can directly use `sql(...)`
}

@TestTemplate
public void testCreateViewWithExistingHiveView() {
nit: no need to use test as a prefix in the method names as it doesn't add any value and we try to avoid using that prefix for new tests
}

@TestTemplate
public void testCreateViewIfNotExistsWithExistingHiveView() {
can you please also add the same set of tests for tables, where we have a v1 hive table and we want to create another one using/not using IF NOT EXISTS
this is so that it's easier to check how tables/views behave exactly and to align their behavior
@nastra Thanks a lot for the thoughtful feedback and for taking the time to review this. Based on our testing, applying this change should not affect existing behavior. For example, with the following sequence:

-- create hive view
create view test_hive_view as select 1 as id, 'hive_view' as name;
-- use SparkSessionCatalog to create iceberg view
create view test_iceberg_view as select 2 as id, 'iceberg_view' as name;
-- ERROR: org.apache.iceberg.exceptions.NoSuchIcebergViewException: Not an iceberg view
create view if not exists test_hive_view as select 1 as id, 'iceberg' as name;
-- ERROR: [VIEW_NOT_FOUND] The view test_hive_view cannot be found.
create or replace view test_hive_view as select 2 as id, 'create or replace by iceberg' as name;
-- ERROR: [VIEW_NOT_FOUND] The view test_hive_view cannot be found.
drop view test_hive_view;
-- Succeeds, but actually queries the Hive view instead of the Iceberg view
select * from test_iceberg_view;
-- Succeeds, but should fail with WRONG_COMMAND_FOR_OBJECT_TYPE
drop table test_iceberg_view;
-- ERROR: [VIEW_NOT_FOUND] The view test_iceberg_table cannot be found (should be WRONG_COMMAND_FOR_OBJECT_TYPE)
drop view test_iceberg_table;
-- Only drops metadata but does not delete data unless PURGE is specified, which is also different in behavior
drop table test_hive_managed_table;

From a user perspective, these behaviors are quite surprising and make it difficult to reason about how `SparkSessionCatalog` handles views. My understanding is that the underlying session catalog should be treated as the source of truth for the global namespace, but at the moment `SparkSessionCatalog` delegates view operations to the Iceberg catalog without consulting it first. For the specific `CREATE VIEW IF NOT EXISTS` failure, this PR restores the expected semantics by checking the session catalog before delegating. Thanks again for the review and for the discussion, I really appreciate it.
I think it would be great to summarize all the gaps that lead to weird/inconsistent behavior, e.g. by writing them down as tests (which would currently fail). Right now I'm missing an overview to make a proper decision on how we want to proceed, since we really want to fix this issue in a consistent manner. I wouldn't want to fix this only in one place, knowing that the same issue exists in a different place as well.
Thanks for the suggestion, that makes a lot of sense. I agree that having a comprehensive view of all the gaps leading to these inconsistent behaviors would be very helpful, and that addressing them in a consistent way is the right direction. I can summarize the observed issues and also try to capture them in a set of tests that document the current behavior (and would currently fail). That should give us a clearer picture to evaluate different approaches going forward. I'll follow up with an update once I have that ready. Thanks again for the guidance.
This is a follow-up PR. The previous PR was closed after the branch was force-reset to `apache:main`.

### Purpose

This PR fixes a bug where `CREATE VIEW IF NOT EXISTS` fails with a `NoSuchIcebergViewException: Not an iceberg view` (wrapped in a `QueryExecutionException`) instead of succeeding silently when a non-Iceberg view (e.g., a Hive view) already exists in the `SparkSessionCatalog`.

### The Problem

When `SparkSessionCatalog` is configured with

- `spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions`
- `spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog`
- `spark.sql.catalog.spark_catalog.type=hive`

and a user runs `CREATE VIEW IF NOT EXISTS db.view_name AS ...` while `db.view_name` already exists as a Hive view (or any non-Iceberg table/view):

1. `SparkSessionCatalog.createView` currently delegates directly to the underlying Iceberg catalog (as `ViewCatalog.createView`).
2. The Iceberg catalog finds no Iceberg view under that identifier and throws `NoSuchIcebergViewException`.
3. Spark's analysis rules expect a `ViewAlreadyExistsException` to handle the `IF NOT EXISTS` logic. Because they receive a different exception, the query fails entirely.
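The exception-type mismatch can be illustrated in isolation. The sketch below uses minimal stand-in classes (not the real Spark or Iceberg types) to show why `IF NOT EXISTS` handling that only catches `ViewAlreadyExistsException` lets any other exception fail the whole query:

```java
public class IfNotExistsMismatch {
  // Stand-ins for the real exception types.
  static class ViewAlreadyExistsException extends RuntimeException {}

  static class NoSuchIcebergViewException extends RuntimeException {
    NoSuchIcebergViewException(String msg) {
      super(msg);
    }
  }

  // Stand-in for Spark's IF NOT EXISTS handling: only the expected
  // exception type is treated as "view already there, do nothing";
  // anything else propagates and fails the query.
  static String createViewIfNotExists(Runnable createView) {
    try {
      createView.run();
      return "created";
    } catch (ViewAlreadyExistsException e) {
      return "ignored (already exists)";
    }
  }

  public static void main(String[] args) {
    System.out.println(createViewIfNotExists(() -> {}));
    System.out.println(
        createViewIfNotExists(
            () -> {
              throw new ViewAlreadyExistsException();
            }));
    try {
      createViewIfNotExists(
          () -> {
            throw new NoSuchIcebergViewException("Not an iceberg view");
          });
    } catch (NoSuchIcebergViewException e) {
      System.out.println("query failed: " + e.getMessage());
    }
  }
}
```

Running this prints `created`, `ignored (already exists)`, and then `query failed: Not an iceberg view`, mirroring the reported failure mode.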
### The Fix

Before delegating the creation to the Iceberg catalog, we explicitly check whether the identifier already exists in the underlying session catalog (which is the source of truth for the global namespace).

If `getSessionCatalog().tableExists(ident)` returns true, we immediately throw `ViewAlreadyExistsException`. This allows Spark's analysis rules to catch the exception and skip the operation, as per `IF NOT EXISTS` semantics.
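As a rough illustration of the guard, the sketch below reduces `SessionCatalog` and the exception type to minimal stand-ins (not the actual Spark/Iceberg classes) and checks the session catalog before delegating:

```java
import java.util.HashSet;
import java.util.Set;

public class CreateViewGuard {
  // Stand-in for the underlying Spark session catalog.
  interface SessionCatalog {
    boolean tableExists(String ident);
  }

  static class ViewAlreadyExistsException extends RuntimeException {
    ViewAlreadyExistsException(String ident) {
      super("View already exists: " + ident);
    }
  }

  // Sketch of the fix: consult the session catalog (the source of truth
  // for the global namespace) before delegating to the Iceberg catalog.
  static String createView(SessionCatalog sessionCatalog, String ident, Set<String> icebergViews) {
    if (sessionCatalog.tableExists(ident)) {
      throw new ViewAlreadyExistsException(ident);
    }
    icebergViews.add(ident); // simplified stand-in for the Iceberg catalog delegation
    return "created " + ident;
  }

  public static void main(String[] args) {
    Set<String> icebergViews = new HashSet<>();
    SessionCatalog session = ident -> ident.equals("db.hive_view"); // a Hive view already exists
    System.out.println(createView(session, "db.new_view", icebergViews));
    try {
      createView(session, "db.hive_view", icebergViews);
    } catch (ViewAlreadyExistsException e) {
      // Under IF NOT EXISTS, Spark catches this exception type and no-ops.
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

With the check in place, Spark's `IF NOT EXISTS` path sees the exception type it knows how to swallow, so the statement becomes a no-op instead of an error.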
### Verification

- Added a test to `TestSparkSessionCatalog` verifying that `CREATE VIEW IF NOT EXISTS` succeeds when a Hive view already exists.
- Verified that `CREATE VIEW` (without `IF NOT EXISTS`) correctly throws `AnalysisException` (Table or view already exists).