Description
Is your feature request related to a problem or challenge?
In the current IcebergTableProvider, we allow users to create a table provider directly from a static Table, enabling queries without configuring a catalog (#650).
However, this creates a subtle issue for users reading a live, changing table: IcebergTableProvider will not automatically refresh the table. Users must manually refresh the catalog to ensure they see the latest data:
// Refresh context to avoid getting a stale table
let catalog = Arc::new(IcebergCatalogProvider::try_new(client).await?);
ctx.register_catalog("catalog", catalog);
Supporting static tables in IcebergTableProvider also means the catalog may be None. When the catalog is None, users must construct and register a new static table every time they want to read the table.
This problem has become even more noticeable now that we support INSERT INTO in DataFusion, allowing users to read and write Iceberg tables within the same session:
INSERT INTO test_table VALUES ...;
SELECT * FROM test_table;
-- The inserted rows won't appear because the registered table wasn't refreshed.
There is some ongoing work related to this, such as #1297, but I believe we need a broader design discussion to address this issue once and for all. Hence this issue.
Describe the solution you'd like
- Option 1: Split the existing IcebergTableProvider into two providers
  - IcebergTableProvider: this provider holds an Arc<dyn Catalog> and does not keep a Table cache. Whenever it needs the Iceberg Table, it calls catalog.load_table.
  - IcebergStaticTableProvider: this provider only holds a cached table: Table and is not aware of any catalog.
This way, users can pick the table provider that fits their use case, and each provider has a single, well-defined behavior. However, this would be a breaking change.
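To make Option 1 concrete, here is a rough sketch of the split. It is not the actual iceberg-rust or DataFusion API: `Table`, `Catalog`, and `MemoryCatalog` below are simplified, synchronous stand-ins I made up for illustration (the real `load_table` is async and takes a table identifier), and the `table()` accessors are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for an Iceberg `Table`; it only tracks a snapshot id.
#[derive(Clone, Debug, PartialEq)]
pub struct Table {
    pub snapshot_id: u64,
}

/// Simplified, synchronous stand-in for the catalog trait (an assumption;
/// the real `load_table` is async and takes a table identifier).
pub trait Catalog: Send + Sync {
    fn load_table(&self, name: &str) -> Result<Table, String>;
}

/// Toy catalog whose single table advances by one snapshot per commit.
pub struct MemoryCatalog {
    snapshot: AtomicU64,
}

impl MemoryCatalog {
    pub fn new() -> Self {
        Self { snapshot: AtomicU64::new(1) }
    }

    /// Simulates a write (such as INSERT INTO) that produces a new snapshot.
    pub fn commit(&self) {
        self.snapshot.fetch_add(1, Ordering::SeqCst);
    }
}

impl Catalog for MemoryCatalog {
    fn load_table(&self, _name: &str) -> Result<Table, String> {
        Ok(Table { snapshot_id: self.snapshot.load(Ordering::SeqCst) })
    }
}

/// Catalog-backed provider: keeps no cached `Table` and reloads on every access.
pub struct IcebergTableProvider {
    catalog: Arc<dyn Catalog>,
    table_name: String,
}

impl IcebergTableProvider {
    pub fn new(catalog: Arc<dyn Catalog>, table_name: &str) -> Self {
        Self { catalog, table_name: table_name.to_string() }
    }

    /// Always asks the catalog, so the result reflects the latest snapshot.
    pub fn table(&self) -> Result<Table, String> {
        self.catalog.load_table(&self.table_name)
    }
}

/// Static provider: holds a fixed `Table` and knows nothing about catalogs.
pub struct IcebergStaticTableProvider {
    table: Table,
}

impl IcebergStaticTableProvider {
    pub fn new(table: Table) -> Self {
        Self { table }
    }

    /// Returns the table exactly as it was when the provider was built.
    pub fn table(&self) -> &Table {
        &self.table
    }
}
```

With this shape, a provider built from a catalog sees a new snapshot right after a commit, while a static provider built from an earlier `Table` keeps returning the old one. The type makes the staleness behavior explicit instead of depending on whether a catalog happens to be set.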
- Option 2: Refresh when a Catalog is available
  This is basically what #1297 (Support retrieving the latest Iceberg table on table scan) suggests, except that the latest code has catalog: Option<Arc<dyn Catalog>> rather than catalog: Arc<dyn Catalog>, so we can only refresh when the catalog is not None.
  This option requires fewer changes and won't break existing use cases, but users will need to take extra care to get the behavior they want.
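For comparison, a rough sketch of Option 2 under the same assumptions: `Table`, `Catalog`, and `MemoryCatalog` are simplified, synchronous stand-ins rather than the real iceberg-rust types, and `with_catalog`, `from_static`, and `current_table` are hypothetical names used only for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for an Iceberg `Table`; it only tracks a snapshot id.
#[derive(Clone, Debug, PartialEq)]
pub struct Table {
    pub snapshot_id: u64,
}

/// Simplified, synchronous stand-in for the catalog trait (an assumption;
/// the real `load_table` is async and takes a table identifier).
pub trait Catalog: Send + Sync {
    fn load_table(&self, name: &str) -> Result<Table, String>;
}

/// Toy catalog whose single table advances by one snapshot per commit.
pub struct MemoryCatalog {
    snapshot: AtomicU64,
}

impl MemoryCatalog {
    pub fn new() -> Self {
        Self { snapshot: AtomicU64::new(1) }
    }

    /// Simulates a write (such as INSERT INTO) that produces a new snapshot.
    pub fn commit(&self) {
        self.snapshot.fetch_add(1, Ordering::SeqCst);
    }
}

impl Catalog for MemoryCatalog {
    fn load_table(&self, _name: &str) -> Result<Table, String> {
        Ok(Table { snapshot_id: self.snapshot.load(Ordering::SeqCst) })
    }
}

/// Single provider type: refreshes only when a catalog is present.
pub struct IcebergTableProvider {
    catalog: Option<Arc<dyn Catalog>>,
    table_name: String,
    table: Table,
}

impl IcebergTableProvider {
    /// Catalog-backed construction; loads the initial table from the catalog.
    pub fn with_catalog(catalog: Arc<dyn Catalog>, table_name: &str) -> Result<Self, String> {
        let table = catalog.load_table(table_name)?;
        Ok(Self { catalog: Some(catalog), table_name: table_name.to_string(), table })
    }

    /// Static construction; there is no catalog, so the table never refreshes.
    pub fn from_static(table: Table) -> Self {
        Self { catalog: None, table_name: String::new(), table }
    }

    /// Reloads from the catalog when one is set, else returns the stored table.
    pub fn current_table(&self) -> Result<Table, String> {
        match &self.catalog {
            Some(catalog) => catalog.load_table(&self.table_name),
            None => Ok(self.table.clone()),
        }
    }
}
```

The two constructors return the same type, so nothing in the signature tells a caller whether reads will be fresh or stale. That is the "extra care" trade-off of this option.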
Would love to hear if there are any other potential solutions!
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community