Refresh cached table in IcebergTableProvider #1877

@CTTY

Description

Is your feature request related to a problem or challenge?

In the current IcebergTableProvider, we allow users to create a table provider directly from a static Table, enabling queries without configuring a catalog (#650).

However, this creates a subtle issue for users reading a live, changing table: IcebergTableProvider will not automatically refresh the table. Users must manually refresh the catalog to ensure they see the latest data:

// Refresh context to avoid getting a stale table
let catalog = Arc::new(IcebergCatalogProvider::try_new(client).await?);
ctx.register_catalog("catalog", catalog);

Supporting static tables in IcebergTableProvider also means the catalog may be None. When the catalog is None, users must construct and register a new static table every time they want to read the table.

This problem has become even more noticeable now that we support INSERT INTO in DataFusion, allowing users to read and write Iceberg tables within the same session:

INSERT INTO test_table VALUES ...;
SELECT * FROM test_table; 
-- The inserted rows won't appear because the registered table wasn't refreshed.

There is some ongoing work related to this, such as #1297, but I believe we need a broader design discussion to address this issue once and for all. Hence this issue.

Describe the solution you'd like

  • Option 1: Split the existing IcebergTableProvider into two providers
  1. IcebergTableProvider: holds an Arc<dyn Catalog> and does not cache a Table. Whenever it needs the Iceberg Table, it calls catalog.load_table.
  2. IcebergStaticTableProvider: holds only a cached table: Table and is not aware of any catalog.
    This way, users can choose the table provider that fits their use case, and each provider behaves predictably for that use case. However, this is a breaking change.
  • Option 2: Refresh when a catalog is available
    This is essentially what #1297 (Support retrieving the latest Iceberg table on table scan) suggests, except that the latest code has catalog: Option<Arc<dyn Catalog>> rather than catalog: Arc<dyn Catalog>, so we can only refresh when the catalog is not None.
    This option requires fewer changes and won't break existing use cases, but users will need to take extra care to get the behavior they want.
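
To make the difference between the two options concrete, here is a minimal, self-contained sketch. Everything in it is a hypothetical stand-in rather than the actual iceberg-rust or DataFusion API: Table is reduced to a snapshot id, Catalog::load_table is synchronous and takes no table identifier (the real one is async and fallible), and InMemoryCatalog, CatalogBackedProvider, StaticProvider, and RefreshingProvider are illustrative names only.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for iceberg's Table: only the snapshot id is modeled.
#[derive(Clone, Debug, PartialEq)]
pub struct Table {
    pub snapshot_id: u64,
}

// Hypothetical, simplified stand-in for iceberg's Catalog trait.
pub trait Catalog: Send + Sync {
    fn load_table(&self) -> Table;
}

// Toy catalog whose current snapshot advances on every "commit".
pub struct InMemoryCatalog {
    current: Mutex<u64>,
}

impl InMemoryCatalog {
    pub fn new() -> Self {
        Self { current: Mutex::new(1) }
    }

    // Simulates a write (e.g. INSERT INTO) producing a new snapshot.
    pub fn commit(&self) {
        *self.current.lock().unwrap() += 1;
    }
}

impl Catalog for InMemoryCatalog {
    fn load_table(&self) -> Table {
        Table { snapshot_id: *self.current.lock().unwrap() }
    }
}

// Option 1, part 1: catalog-backed provider with no cached Table.
pub struct CatalogBackedProvider {
    catalog: Arc<dyn Catalog>,
}

impl CatalogBackedProvider {
    pub fn new(catalog: Arc<dyn Catalog>) -> Self {
        Self { catalog }
    }

    // Every scan loads the table from the catalog, so it is never stale.
    pub fn scan(&self) -> Table {
        self.catalog.load_table()
    }
}

// Option 1, part 2: static provider that knows nothing about a catalog.
pub struct StaticProvider {
    table: Table,
}

impl StaticProvider {
    pub fn new(table: Table) -> Self {
        Self { table }
    }

    // Always returns the table it was constructed with.
    pub fn scan(&self) -> Table {
        self.table.clone()
    }
}

// Option 2: a single provider that refreshes only when a catalog is present.
pub struct RefreshingProvider {
    catalog: Option<Arc<dyn Catalog>>,
    table: Table,
}

impl RefreshingProvider {
    pub fn new(table: Table, catalog: Option<Arc<dyn Catalog>>) -> Self {
        Self { catalog, table }
    }

    // Refreshes the cached table if possible; otherwise serves the cached one.
    pub fn scan(&mut self) -> Table {
        if let Some(catalog) = &self.catalog {
            self.table = catalog.load_table();
        }
        self.table.clone()
    }
}
```

In this model, after a commit the catalog-backed provider and the Option 2 provider constructed with Some(catalog) observe the new snapshot, while the static provider and the Option 2 provider constructed with None keep returning the snapshot they started with. With Option 2, that last case looks the same to the caller as the refreshing case, which is the extra caution mentioned above.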

Would love to hear if there are any other potential solutions!

Willingness to contribute

I would be willing to contribute to this feature with guidance from the Iceberg Rust community.

Labels

enhancement: New feature or request
