Conversation

@stuxuhai
Contributor

This is a follow-up PR. The previous PR was closed after the branch was force-reset to apache:main.

Purpose

This PR fixes a bug where CREATE VIEW IF NOT EXISTS fails with a NoSuchIcebergViewException: Not an iceberg view (wrapped in QueryExecutionException) instead of succeeding silently when a non-Iceberg view (e.g., a Hive view) already exists in the SparkSessionCatalog.

The Problem

When SparkSessionCatalog is configured with:

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive

  1. A user executes CREATE VIEW IF NOT EXISTS db.view_name AS ....
  2. db.view_name already exists as a Hive view (or any other non-Iceberg table/view).
  3. SparkSessionCatalog.createView currently delegates directly to the underlying Iceberg catalog (asViewCatalog.createView).
  4. The Iceberg catalog (e.g., HiveCatalog) attempts to load the view. Since it is not an Iceberg view, it throws NoSuchIcebergViewException.
  5. Spark expects ViewAlreadyExistsException to handle the IF NOT EXISTS logic. Because it receives a different exception, the query fails entirely.

The Fix

Before delegating the creation to the Iceberg catalog, we explicitly check if the identifier already exists in the underlying session catalog (which is the source of truth for the global namespace).

If getSessionCatalog().tableExists(ident) returns true, we immediately throw ViewAlreadyExistsException. This allows Spark's analysis rules to correctly catch the exception and ignore the operation as per IF NOT EXISTS semantics.
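
As a rough illustration of the control-flow change, here is a self-contained sketch using simplified stand-ins (SessionCatalogSketch and both exception classes below are hypothetical stand-ins, not Iceberg's actual classes or signatures):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-ins for the real exception types.
class ViewAlreadyExistsException extends Exception {
  ViewAlreadyExistsException(String ident) { super("View already exists: " + ident); }
}

class NoSuchIcebergViewException extends RuntimeException {
  NoSuchIcebergViewException(String ident) { super("Not an iceberg view: " + ident); }
}

class SessionCatalogSketch {
  // Objects known to the underlying session catalog (Hive tables and v1 views).
  private final Set<String> sessionObjects = new HashSet<>();

  void registerHiveView(String ident) { sessionObjects.add(ident); }

  boolean tableExists(String ident) { return sessionObjects.contains(ident); }

  // Before the fix: delegate straight to the Iceberg catalog, which rejects
  // an existing non-Iceberg view with NoSuchIcebergViewException.
  void createViewBefore(String ident) {
    if (sessionObjects.contains(ident)) {
      throw new NoSuchIcebergViewException(ident);
    }
    sessionObjects.add(ident);
  }

  // After the fix: check the session catalog first and throw the exception
  // Spark expects, so IF NOT EXISTS can swallow it.
  void createViewAfter(String ident) throws ViewAlreadyExistsException {
    if (tableExists(ident)) {
      throw new ViewAlreadyExistsException(ident);
    }
    sessionObjects.add(ident);
  }

  // Spark's IF NOT EXISTS handling only catches ViewAlreadyExistsException.
  boolean createViewIfNotExists(String ident) {
    try {
      createViewAfter(ident);
      return true;  // view was created
    } catch (ViewAlreadyExistsException e) {
      return false; // silently ignored, per IF NOT EXISTS semantics
    }
  }
}
```

With the old path, an existing Hive view surfaces NoSuchIcebergViewException and the query fails; with the new path, IF NOT EXISTS becomes a silent no-op.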

Verification

  • Added a new unit test in TestSparkSessionCatalog to verify that CREATE VIEW IF NOT EXISTS succeeds when a Hive view exists.
  • Verified that CREATE VIEW (without IF NOT EXISTS) correctly throws AnalysisException (Table or view already exists).

@github-actions github-actions bot added the spark label Dec 27, 2025
@stuxuhai
Contributor Author

stuxuhai commented Jan 6, 2026

@huaxingao The previous PR was automatically closed due to a force push, so I’ve opened a new one.
Could you please help review it when you have time? Thanks!

import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

public class TestSparkSessionCatalogWithExtensions {
Contributor

I would suggest to first fix the issue in one Spark version and later backport stuff

Contributor

+1 to first fix 4.1 and then back-porting

@nastra nastra requested a review from huaxingao January 8, 2026 16:04
}
}

public static void setUpCatalog() {
Contributor

nit: private?

spark.conf().set("spark.sql.catalog.spark_catalog.type", "hive");
}

public static void resetSparkCatalog() {
Contributor

nit: private?

protected static TestHiveMetastore metastore = null;
protected static HiveConf hiveConf = null;
protected static SparkSession spark = null;
protected static JavaSparkContext sparkContext = null;
Contributor

is this necessary? If not, can we remove?

spark
.conf()
.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog");
spark.conf().set("spark.sql.catalog.spark_catalog.type", "hive");
Contributor

Should we add spark.sessionState().catalogManager().reset() when flipping these configs (either inside the helper methods or immediately after calling them in the tests), similar to how spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/TestSparkSessionCatalog.java does it?

@stuxuhai
Contributor Author

@nastra @huaxingao Thanks for the review and the suggestion. I've updated the commit. Appreciate your feedback!

@nastra
Contributor

nastra commented Jan 14, 2026

@stuxuhai thanks for submitting the PR. We might need to revisit the behavior of views in the SparkSessionCatalog (there was also #14557 that ran into issues with Legacy Hive views) and define the behavior we actually want to have.

Basically, right now the SparkSessionCatalog only supports Iceberg views (since 1.8.0), which is why we have checks like

  public boolean viewExists(Identifier ident) {
    return (asViewCatalog != null && asViewCatalog.viewExists(ident))
        || (isViewCatalog() && getSessionCatalog().viewExists(ident));
  }

where we either have the actual Iceberg catalog or the underlying Spark session catalog implementing Spark's ViewCatalog API. If neither is the case, we don't fall back to Spark's session catalog and check getSessionCatalog().tableExists(ident), which would also detect a v1 view.
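
This gap can be illustrated with a small self-contained sketch (ViewLookupSketch and its two sets are hypothetical stand-ins, not the actual Iceberg classes):

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in illustrating the gap: viewExists() misses a v1 Hive view, while
// a tableExists()-style check against the session catalog would detect it.
class ViewLookupSketch {
  private final Set<String> icebergViews = new HashSet<>();
  private final Set<String> sessionObjects = new HashSet<>(); // tables + v1 views

  void addHiveView(String ident) { sessionObjects.add(ident); }

  void addIcebergView(String ident) { icebergViews.add(ident); }

  // Mirrors the quoted check: since V2SessionCatalog does not implement
  // Spark's ViewCatalog API, the session-catalog branch never fires, so
  // only Iceberg views are visible here.
  boolean viewExists(String ident) {
    return icebergViews.contains(ident);
  }

  // The broader session-catalog check that also sees v1 Hive views.
  boolean tableExists(String ident) {
    return sessionObjects.contains(ident) || icebergViews.contains(ident);
  }
}
```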

I understand that with IF NOT EXISTS on a v1 Hive view you'd expect the statement not to fail, but what are the implications when you e.g. run DESCRIBE or SHOW VIEWS? I haven't tested that, so we might want to verify that all of the operations you can run against a view don't produce weird results once this diff is applied.
I'm still undecided on what the right approach here would be, and I've seen that e.g. creating a table behaves slightly differently from creating a view.

}

@AfterEach
public void useHiveCatalog() {
Contributor

I believe this is then going to use Spark's HiveSessionCatalog right? We might want to be clearer here as otherwise one would expect that we're using Iceberg's HiveCatalog

try {
  // create Hive view
  spark.sql(String.format("CREATE VIEW %s AS SELECT 1 AS id", viewName));
} finally {
Contributor

I don't think we need any of those try-finally blocks

}

try {
  spark.sql(String.format("CREATE VIEW IF NOT EXISTS %s AS SELECT 2 AS id", viewName));
Contributor

instead of using spark.sql(...) you can directly use sql(...)

}

@TestTemplate
public void testCreateViewWithExistingHiveView() {
Contributor

nit: no need to use test as a prefix in the method names as it doesn't add any value and we try to avoid using that prefix for new tests

}

@TestTemplate
public void testCreateViewIfNotExistsWithExistingHiveView() {
Contributor

can you please also add the same set of tests for tables, where we have a v1 hive table and we want to create another one using/not using IF NOT EXISTS

Contributor

this is so that it's easier to check how tables/views behave exactly and to align their behavior

@stuxuhai
Contributor Author

@nastra Thanks a lot for the thoughtful feedback and for taking the time to review this.

Based on our testing, applying this change should not affect the behavior of DESCRIBE or SHOW VIEWS. The impact is limited to CREATE VIEW statements. While using SparkSessionCatalog in practice, we encountered a number of confusing and unintuitive behaviors, especially when working with non-Iceberg views and tables. This PR is intended to address one of those cases.

For example, with the following sequence:

-- create hive view
create view test_hive_view as select 1 as id, 'hive_view' as name;

-- use SparkSessionCatalog to create iceberg view
create view test_iceberg_view as select 2 as id, 'iceberg_view' as name;

-- ERROR: org.apache.iceberg.exceptions.NoSuchIcebergViewException: Not an iceberg view
create view if not exists test_hive_view as select 1 as id, 'iceberg' as name;

-- ERROR: [VIEW_NOT_FOUND] The view test_hive_view cannot be found.
create or replace view test_hive_view as select 2 as id, 'create or replace by iceberg' as name;

-- ERROR: [VIEW_NOT_FOUND] The view test_hive_view cannot be found.
drop view test_hive_view;

-- Succeeds, but actually queries the Hive view instead of the Iceberg view
select * from test_iceberg_view;

-- Succeeds, but should fail with WRONG_COMMAND_FOR_OBJECT_TYPE
drop table test_iceberg_view;

-- ERROR: [VIEW_NOT_FOUND] The view test_iceberg_table cannot be found (should be WRONG_COMMAND_FOR_OBJECT_TYPE)
drop view test_iceberg_table;

-- Only drops metadata but does not delete data unless PURGE is specified, which is also different in behavior
drop table test_hive_managed_table;

From a user perspective, these behaviors are quite surprising and make it difficult to reason about how SparkSessionCatalog should be used safely.

My understanding is that SparkSessionCatalog is primarily intended to manage Iceberg tables and views. For non-Iceberg objects, it would be ideal if the behavior could fall back to Spark’s session catalog so that the results remain consistent with running Spark without Iceberg.

At the moment, getSessionCatalog() returns a V2SessionCatalog instance, and due to current Spark design constraints, V2SessionCatalog does not implement the ViewCatalog interface, which makes fallback handling for Hive views more complicated.

For the specific CREATE VIEW IF NOT EXISTS case, it appears that the issue can be resolved with a small and localized change, which is what this PR focuses on. I’m very happy to iterate on this further and explore the best long-term approach together.

Thanks again for the review and for the discussion — I really appreciate it.

@nastra
Contributor

nastra commented Jan 14, 2026

I think it would be great to summarize all gaps that lead to weird/inconsistent behavior by e.g. writing this down through tests (which would currently fail). Right now I'm missing an overview to make a proper decision on how we would want to proceed, since we really want to fix this issue in a consistent manner. I wouldn't want to fix this only in one place, knowing that the same issue exists in a different place as well.
Would you be willing to summarize all of this in text or in tests so that we can start thinking about the best approach going forward?

@stuxuhai
Contributor Author

Thanks for the suggestion — that makes a lot of sense.

I agree that having a comprehensive view of all the gaps leading to these inconsistent behaviors would be very helpful, and that addressing them in a consistent way is the right direction.

I can summarize the observed issues and also try to capture them in a set of tests that document the current behavior (and would fail). That should give us a clearer picture to evaluate different approaches going forward.

I’ll follow up with an update once I have that ready. Thanks again for the guidance.
