
feat: support refreshing Iceberg tables #5707

Open

lbooker42 wants to merge 59 commits into main
Conversation

@lbooker42 (Contributor) commented Jul 2, 2024

Add two methods of refreshing tables:

  • Manual refreshing - the user specifies which snapshot to load; the engine parses the snapshot, adds or removes Iceberg data files as needed, and notifies downstream tables of the changes
  • Auto-refreshing - at a user-configurable interval, the engine queries Iceberg for the latest snapshot, then parses and loads it

Example code:

Java: automatically and manually refreshing tables

import io.deephaven.iceberg.util.*;
import org.apache.iceberg.catalog.*;

adapter = IcebergToolsS3.createS3Rest(
        "minio-iceberg",
        "http://rest:8181",
        "s3a://warehouse/wh",
        "us-east-1",
        "admin",
        "password",
        "http://minio:9000");

//////////////////////////////////////////////////////////////////////

import io.deephaven.extensions.s3.*;

s3_instructions = S3Instructions.builder()
    .regionName("us-east-1")
    .credentials(Credentials.basic("admin", "password"))
    .endpointOverride("http://minio:9000")
    .build()

import io.deephaven.iceberg.util.IcebergUpdateMode;

// Automatic refreshing every 1 second 
iceberg_instructions = IcebergInstructions.builder()
    .dataInstructions(s3_instructions)
    .updateMode(IcebergUpdateMode.autoRefreshing(1_000L))
    .build()

// Automatic refreshing (default 60 seconds)
iceberg_instructions = IcebergInstructions.builder()
    .dataInstructions(s3_instructions)
    .updateMode(IcebergUpdateMode.AUTO_REFRESHING)
    .build()

// Load the table and monitor changes
sales_multi = adapter.readTable(
        "sales.sales_multi",
        iceberg_instructions)

//////////////////////////////////////////////////////////////////////

// Manual refreshing
iceberg_instructions = IcebergInstructions.builder()
    .dataInstructions(s3_instructions)
    .updateMode(IcebergUpdateMode.MANUAL_REFRESHING)
    .build()

// Load a table with a specific snapshot
sales_multi = adapter.readTable(
        "sales.sales_multi",
        5120804857276751995,
        iceberg_instructions)

// Update the table to a specific snapshot
sales_multi.update(848129305390678414)

// Update to the latest snapshot
sales_multi.update()

Python: automatically and manually refreshing tables

from deephaven.experimental import s3, iceberg

local_adapter = iceberg.adapter_s3_rest(
        name="minio-iceberg",
        catalog_uri="http://rest:8181",
        warehouse_location="s3a://warehouse/wh",
        region_name="us-east-1",
        access_key_id="admin",
        secret_access_key="password",
        end_point_override="http://minio:9000")

#################################################

s3_instructions = s3.S3Instructions(
        region_name="us-east-1",
        access_key_id="admin",
        secret_access_key="password",
        endpoint_override="http://minio:9000"
        )

# Auto-refresh every 1000 ms
iceberg_instructions = iceberg.IcebergInstructions(
        data_instructions=s3_instructions,
        update_mode=iceberg.IcebergUpdateMode.auto_refreshing(1000))

sales_multi = local_adapter.read_table(table_identifier="sales.sales_multi", instructions=iceberg_instructions)

#################################################

# Manually refresh the table
iceberg_instructions = iceberg.IcebergInstructions(
        data_instructions=s3_instructions,
        update_mode=iceberg.IcebergUpdateMode.MANUAL_REFRESHING)

sales_multi = local_adapter.read_table(
    table_identifier="sales.sales_multi",
    snapshot_id=5120804857276751995,
    instructions=iceberg_instructions)

# Update to specific snapshots (in sequence order), then to the latest
sales_multi.update(848129305390678414)
sales_multi.update(3019545135163225470)
sales_multi.update()

@lbooker42 lbooker42 added this to the 0.36.0 milestone Jul 2, 2024
@lbooker42 lbooker42 self-assigned this Jul 2, 2024
@lbooker42 lbooker42 requested a review from rcaudy July 2, 2024 15:47
/**
* Notify the listener of a {@link TableLocationKey} encountered while initiating or maintaining the location
* subscription. This should occur at most once per location, but the order of delivery is <i>not</i>
* guaranteed.
*
* @param tableLocationKey The new table location key
*/
void handleTableLocationKey(@NotNull ImmutableTableLocationKey tableLocationKey);
void handleTableLocationKeyAdded(@NotNull ImmutableTableLocationKey tableLocationKey);
Member
Good change, may be breaking for DHE, please consult Andy pre-merge.

void beginTransaction();

void endTransaction();

/**
* Notify the listener of a {@link TableLocationKey} encountered while initiating or maintaining the location
* subscription. This should occur at most once per location, but the order of delivery is <i>not</i>
@rcaudy (Member) commented Jul 3, 2024
Consider whether we can have add + remove + add. What about remove + add in the same pull?
Should document that this may change the "at most once per location" guarantee, and define semantics.
I think it should be something like:
We allow re-add of a removed TLK. Downstream consumers should process these in an order that respects delivery and transactionality.

Within one transaction, expect at most one of "remove" or "add" for a given TLK.
Within one transaction, we can allow remove followed by add, but not add followed by remove. This dictates that we deliver pending removes before pending adds in processPending.
That is, one transaction allows:

  1. Replace a TLK (remove followed by add)
  2. Remove a TLK (remove)
  3. Add a TLK (add)
    Double add, double remove, or add followed by remove is right out.

Processing an addition to a transaction.

  1. Remove: If there's an existing accumulated remove, error. Else, if there's an existing accumulated add, error. Else, accumulate the remove.
  2. Add: If there's an existing accumulated add, error. Else, accumulate the add.

Across multiple transactions delivered as a batch, ensure that the right end-state is achieved.

  1. Add + remove collapses pairwise to no-op
  2. Remove + add (assuming prior add) should be processed in order. We might very well choose to not allow re-add at this time, I don't expect Iceberg to do this. If we do allow it, we need to be conscious that the removed location's region(s) need(s) to be used for previous data, while the added one needs to be used for current data.
  3. Multiple adds or multiple removes of the same TLK, without the opposite operation intervening, are an error.

null token should be handled exactly the same as a single-element transaction.

Processing a transaction:

  1. Process removes first. If there's an add pending, delete the pending add and swallow the remove (they collapse to a no-op). Else, if there's a remove pending, error. Else, store the remove as pending.
  2. Process adds. If there's an add pending, error. Else, store the add as pending.

Note: removal support means that RegionedColumnSources may no longer be immutable! We need to be sure that we are aware of whether a particular TLP might remove data, and ensure that in those cases the RCS is not marked immutable. REVISED: ONLY REPLACE IS AN ISSUE FOR IMMUTABILITY, AS LONG AS WE DON'T REUSE SLOTS.

We discussed that TLPs should probably specify whether they are guaranteeing that they will never remove TLKs, and whether their TLs will never remove or modify rows. I think if and when we encounter data sources that require modify support, we should probably just use SourcePartitionedTable instead of PartitionAwareSourceTable.
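The accumulation rules above can be sketched as a small standalone class (illustrative only — the class and method names are hypothetical, not the Deephaven implementation): within one transaction, remove-then-add of the same TLK is a legal replace, while double adds, double removes, and add-then-remove are errors, and commit delivers removes before adds.

```python
# Illustrative sketch of the per-transaction accumulation rules described
# above; names and structure are assumptions, not Deephaven code.
class TransactionAccumulator:
    def __init__(self):
        self.pending_removes = set()
        self.pending_adds = set()

    def accumulate_remove(self, tlk):
        if tlk in self.pending_removes:
            raise ValueError(f"double remove of {tlk}")
        if tlk in self.pending_adds:
            raise ValueError(f"add followed by remove of {tlk}")
        self.pending_removes.add(tlk)

    def accumulate_add(self, tlk):
        if tlk in self.pending_adds:
            raise ValueError(f"double add of {tlk}")
        self.pending_adds.add(tlk)

    def commit(self):
        # Deliver pending removes before pending adds, so a replace
        # (remove + add of the same TLK) is processed in order.
        removes = sorted(self.pending_removes)
        adds = sorted(self.pending_adds)
        self.pending_removes.clear()
        self.pending_adds.clear()
        return removes, adds
```

Accumulating a remove checks for a pending add first, which is what makes add-then-remove within a single transaction an error while remove-then-add (replace) is allowed.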

Contributor (Author)

I'm not sure if I need to handle the RCS immutability question in this PR since Iceberg will not modify rows.

Member
Removing a region makes the values in the corresponding row key range disappear. That's OK for immutability.
If you allow a new region to use the same slot, or allow the old region to reincarnate in the same slot potentially with different data, you are violating immutability.

Not reusing slots means that a long-lived iceberg table may eventually exhaust its row key space.
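As a back-of-the-envelope illustration of that exhaustion concern (the constants below are assumptions for illustration, not Deephaven's actual region sizing): with a fixed number of row-key bits reserved per region slot and slots never reused, the number of replace cycles a long-lived table can survive is bounded.

```python
# Illustrative arithmetic only; the key-space width and per-region bits
# are assumed values, not taken from Deephaven internals.
ROW_KEY_BITS = 63      # non-negative long row keys
REGION_ROW_BITS = 30   # assume ~1B rows addressable per region slot

# Each replaced region burns a slot forever if slots are never reused.
max_slots = 2 ** (ROW_KEY_BITS - REGION_ROW_BITS)
print(max_slots)  # 2**33 = 8589934592 slots before key-space exhaustion
```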

Member
Replace (remove + add of a TLK) requires some kind of versioning of the TL, in a way that the TLK is aware of in order to ensure that we provide the table with the right TL for the version. AbstractTableLocationProvider's location caching layer is not currently sufficient for atomically replacing TLs.

final Collection<ImmutableTableLocationKey> immutableTableLocationKeys = foundLocationKeys.stream()
.map(LiveSupplier::get)
.collect(Collectors.toList());

// TODO (https://github.com/deephaven/deephaven-core/issues/867): Refactor around a ticking partition table
Member
Check and see if we can just close this ticket, and maybe delete the todo.

@rcaudy (Member) left a comment:
Finished table data infrastructure. May need to look further at Iceberg code and tests.

@rcaudy (Member) left a comment:
Minor commentary. We can dig into the rest of the Iceberg layer tomorrow.

@rcaudy (Member) left a comment:
Partial review of Iceberg-specific code.


@Value.Immutable
@BuildableStyle
public abstract class IcebergUpdateMode {
Member
I think we might want to consider whether initial snapshot should be a parameter.

* Update a manually refreshing table location provider with a specific snapshot from the catalog. If the
* {@code snapshotId} is not found in the list of snapshots for the table, an {@link IllegalArgumentException}
* is thrown. The input snapshot must also be newer (higher in sequence number) than the current snapshot or
* an {@link IllegalArgumentException} is thrown.
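The validation the Javadoc describes can be sketched as follows (illustrative only; the function name, the mapping shape, and the sequence-number comparison are assumptions, not the actual provider code, and Python's ValueError stands in for IllegalArgumentException):

```python
# Illustrative sketch of the documented snapshot-validation rules;
# not Deephaven code.
def resolve_snapshot(snapshots, snapshot_id, current_sequence_number):
    """snapshots: mapping of snapshot_id -> sequence_number for the table."""
    if snapshot_id not in snapshots:
        raise ValueError(f"Snapshot {snapshot_id} not found for this table")
    if snapshots[snapshot_id] <= current_sequence_number:
        raise ValueError(
            f"Snapshot {snapshot_id} is not newer than the current snapshot")
    return snapshots[snapshot_id]
```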
Member
Does Iceberg have a SnapshotNotFoundException? If not, should we add one of our own? IAE is OK, though.

Contributor (Author)
Iceberg does not produce an error when a snapshot is not matched. Intending to keep these as IAE rather than create an exception used exactly once.


Comment on lines 135 to 137
def update(self, snapshot_id:Optional[int] = None):
"""
Updates the table with a specific snapshot. If no snapshot is provided, the most recent snapshot is used.
Member
Updates the table to match the contents of the specified snapshot. This may result in row removes and additions that will be propagated asynchronously via this IcebergTable's UpdateGraph.


5 participants