Flink: FLIP-27 source enumerator helper classes #4329
Conversation
Force-pushed from fc10e58 to 9c1001d.
```diff
  */
-class ScanContext implements Serializable {
+@Internal
+public class ScanContext implements Serializable {
```
This is made public because it is accessed by the ContinuousSplitPlannerImpl class in this PR.
yittg left a comment:

Thanks @stevenzwu, I have some questions about this change; hoping you can help.
```java
private void validate() {
  Preconditions.checkArgument(scanContext.snapshotId() == null,
      "Can't set snapshotId in ScanContext for continuous enumerator");
  Preconditions.checkArgument(scanContext.asOfTimestamp() == null,
      "Can't set asOfTimestamp in ScanContext for continuous enumerator");
  Preconditions.checkArgument(scanContext.startSnapshotId() == null,
      "Can't set startSnapshotId in ScanContext for continuous enumerator");
  Preconditions.checkArgument(scanContext.endSnapshotId() == null,
      "Can't set endSnapshotId in ScanContext for continuous enumerator");
```
Does that mean scanContext#monitorInterval should be null, because it is configured in IcebergEnumeratorConfig?
That is a good question. I am also pondering whether we should merge IcebergEnumeratorConfig with ScanContext. As you pointed out, there are some overlapping configs, like monitorInterval and startSnapshotId. Please let me know your preference; I will also get some input from Ryan.
Right now, I am leaning toward merging them.
Merged IcebergEnumeratorConfig into ScanContext.
```java
public Builder startingStrategy(StartingStrategy strategy) {
  this.startingStrategy = strategy;
  return this;
}
```
Would it be better to use named strategy methods that accept the required options? For example:
```java
public Builder startingStrategy(StartingStrategy strategy) {
  this.startingStrategy = strategy;
  return this;
}

public Builder startingWithSpecificSnapshot(long startId) {
  this.startingStrategy = StartingStrategy.SPECIFIC_START_SNAPSHOT_ID;
  this.startSnapshotId = startId;
  return this;
}
```
Agree, I like your suggestion.
Actually, while working on merging the configs into ScanContext, I think we need to stay with the simple POJO style for ScanContext.Builder.fromProperties.
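For context, a minimal sketch of why the plain setter style matters for a property-driven path; the property keys and builder method names here are illustrative assumptions, not the PR's exact API:

```java
// Sketch only: a flat Map<String, String> maps each option onto one independent
// setter, which is hard to reconcile with strategy-specific convenience methods.
static ScanContext fromProperties(Map<String, String> properties) {
  ScanContext.Builder builder = ScanContext.builder();
  String startSnapshotId = properties.get("start-snapshot-id");    // illustrative key
  if (startSnapshotId != null) {
    builder.startSnapshotId(Long.parseLong(startSnapshotId));
  }
  String startingStrategy = properties.get("starting-strategy");   // illustrative key
  if (startingStrategy != null) {
    builder.startingStrategy(StartingStrategy.valueOf(startingStrategy));
  }
  return builder.build();
}
```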
```java
}

@VisibleForTesting
static HistoryEntry getStartSnapshot(Table table, IcebergEnumeratorConfig enumeratorConfig) {
```
It's a little confusing whether the start snapshot is counted or not. It is exclusive based on TableScan#appendsBetween. However, that does not match the intuition for the EARLIEST_SNAPSHOT strategy, or even SPECIFIC_START_SNAPSHOT_ID and SPECIFIC_START_SNAPSHOT_TIMESTAMP.
The start snapshot is exclusive. I am thinking that we can document the behavior better in the StartingStrategy enum class.
IIUC, these strategies have AFTER semantics?
EARLIEST_SNAPSHOT actually means START_AFTER_EARLIEST_SNAPSHOT;
SPECIFIC_START_SNAPSHOT_ID actually means START_AFTER_SNAPSHOT_ID;
SPECIFIC_START_SNAPSHOT_TIMESTAMP actually means START_AFTER_SNAPSHOT_TIMESTAMP.
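For reference, a small sketch of the exclusive-start behavior under discussion; the snapshot id variables are placeholders:

```java
// TableScan#appendsBetween(fromSnapshotId, toSnapshotId) returns appends committed
// after fromSnapshotId up to and including toSnapshotId, so the files added by the
// start snapshot itself are not part of the incremental scan.
TableScan incrementalScan = table.newScan()
    .appendsBetween(startSnapshotId, endSnapshotId);
```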
Force-pushed from c70e105 to 651affb.
```java
if (isStreaming) {
  if (startingStrategy == StreamingStartingStrategy.SPECIFIC_START_SNAPSHOT_ID) {
    Preconditions.checkArgument(startSnapshotId != null,
        "startSnapshotId cannot be null for SPECIFIC_START_SNAPSHOT_ID starting strategy");
```
We probably want to remove the specific variable names from the error messages. We should also phrase error messages more directly: (problem): (context).
This should probably be: Invalid starting snapshot for SPECIFIC_START_SNAPSHOT_ID strategy: null
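A hedged example of the suggested "(problem): (context)" phrasing applied to the same precondition; the final wording in the PR may differ:

```java
Preconditions.checkArgument(startSnapshotId != null,
    "Invalid starting snapshot id for SPECIFIC_START_SNAPSHOT_ID strategy: null");
```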
```java
/**
 * Start incremental mode from a specific startTimestamp.
 * Starting snapshot has a timestamp lower than or equal to the specified timestamp.
 */
SPECIFIC_START_SNAPSHOT_TIMESTAMP
```
What happens if table history doesn't go back that far? It should probably fail because the user's request can't be satisfied.
Switched to SnapshotUtil#snapshotIdAsOfTime, which handles this scenario internally (by throwing an exception).
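A minimal sketch of the replacement call, assuming the surrounding variable names from the diffs above; since SnapshotUtil#snapshotIdAsOfTime fails when the table history does not reach back far enough, no extra validation is needed here:

```java
// Throws if no snapshot is older than or equal to the requested timestamp.
long startSnapshotId = SnapshotUtil.snapshotIdAsOfTime(table, scanContext.startSnapshotTimestamp());
Snapshot startSnapshot = table.snapshot(startSnapshotId);
```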
INCREMENTAL_AFTER_TIMESTAMP?
```java
/**
 * Start incremental mode from a specific startSnapshotId
 */
SPECIFIC_START_SNAPSHOT_ID,
```
Inclusive of the changes in this snapshot?
Mentioned the exclusive behavior in the class-level Javadoc.
```java
this.scanContext = scanContext;
// Within a JVM, table name should be unique across sources.
// Hence it is used as worker pool thread name prefix.
this.workerPool = ThreadPools.newWorkerPool("iceberg-worker-pool-" + table.name(), scanContext.planParallelism());
```
Isn't it possible to process the table twice in the same JVM? I agree operator ID would be better.
Made the thread pool name a constructor argument.
Because the FLIP-27 source interface doesn't expose the operator ID, tableName-UUID is used for now to guarantee uniqueness. Comments will be updated.
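A hedged sketch of the tableName-UUID scheme described above; the prefix and variable names are illustrative, not the PR's exact code:

```java
// The FLIP-27 source API does not expose the operator ID, so a UUID suffix keeps
// pool names unique even if the same table is read twice in one JVM.
String threadPoolName = table.name() + "-" + UUID.randomUUID();
ExecutorService workerPool = ThreadPools.newWorkerPool(
    "iceberg-plan-worker-pool-" + threadPoolName, scanContext.planParallelism());
```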
```java
public ContinuousEnumerationResult planSplits(IcebergEnumeratorPosition lastPosition) {
  table.refresh();
  if (lastPosition != null) {
    return discoverDeltaSplits(lastPosition);
```
I would avoid using "Delta" because that has multiple meanings. Incremental is better.
```java
}

@VisibleForTesting
static HistoryEntry getStartSnapshot(Table table, ScanContext scanContext) {
```
Here you probably want to use SnapshotUtil, which handles most of these cases using the current table state's ancestors.
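A hedged sketch of what using SnapshotUtil could look like here, walking the current snapshot's ancestors instead of the raw history entries and returning a Snapshot rather than a HistoryEntry; the exact method choice is an assumption:

```java
// org.apache.iceberg.util.SnapshotUtil.currentAncestors walks back from the
// current snapshot, so only snapshots reachable from the current state are considered.
for (Snapshot ancestor : SnapshotUtil.currentAncestors(table)) {
  if (ancestor.snapshotId() == scanContext.startSnapshotId()) {
    return ancestor;
  }
}
throw new IllegalArgumentException(
    "Start snapshot id not found in table history: " + scanContext.startSnapshotId());
```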
| "Snapshot id not found in history: {}" + scanContext.startSnapshotId()); | ||
| } | ||
| break; | ||
| case SPECIFIC_START_SNAPSHOT_TIMESTAMP: |
There was a problem hiding this comment.
| case LATEST_SNAPSHOT: | ||
| startEntry = historyEntries.get(historyEntries.size() - 1); | ||
| break; | ||
| case EARLIEST_SNAPSHOT: |
There was a problem hiding this comment.
| switch (scanContext.startingStrategy()) { | ||
| case TABLE_SCAN_THEN_INCREMENTAL: | ||
| case LATEST_SNAPSHOT: | ||
| startEntry = historyEntries.get(historyEntries.size() - 1); |
| startEntry = historyEntries.get(0); | ||
| break; | ||
| case SPECIFIC_START_SNAPSHOT_ID: | ||
| Optional<HistoryEntry> matchedEntry = historyEntries.stream() |
```java
 */
private ContinuousEnumerationResult discoverInitialSplits() {
  HistoryEntry startSnapshotEntry = getStartSnapshot(table, scanContext);
  LOG.info("get startSnapshotId {} based on starting strategy {}",
```
```java
}
if (startingStrategy == StreamingStartingStrategy.SPECIFIC_START_SNAPSHOT_TIMESTAMP) {
  Preconditions.checkArgument(startSnapshotTimestamp != null,
      "startSnapshotTimestamp cannot be null for SPECIFIC_START_SNAPSHOT_TIMESTAMP starting strategy");
```
Should we validate that startSnapshotId isn't supplied when the user has set the startingStrategy to be by timestamp, as well as making sure there's no startSnapshotTimestamp provided when using the snapshot-id starting strategy?
Sure, will add them.
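A hedged sketch of the additional cross-checks agreed on here, following the field names in the diff above; message text is illustrative:

```java
if (startingStrategy == StreamingStartingStrategy.SPECIFIC_START_SNAPSHOT_TIMESTAMP) {
  Preconditions.checkArgument(startSnapshotTimestamp != null,
      "Invalid starting snapshot timestamp for SPECIFIC_START_SNAPSHOT_TIMESTAMP strategy: null");
  Preconditions.checkArgument(startSnapshotId == null,
      "Invalid starting snapshot id for SPECIFIC_START_SNAPSHOT_TIMESTAMP strategy: not null");
}
```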
Force-pushed from 651affb to 8112ddf.
```java
this.table = table;
this.scanContext = scanContext;
this.workerPool = ThreadPools.newWorkerPool(
    "iceberg-enumerator-pool-" + threadPoolName, scanContext.planParallelism());
```
Nit: Would it be more informative to mention split-planner-pool-** or something that correlates more to the class name?
If we use the current naming convention in FlinkInputFormat, we can probably change it to iceberg-plan-worker-pool-<threadPoolName>.
```java
}

@VisibleForTesting
static Snapshot getStartSnapshot(Table table, ScanContext scanContext) {
```
How would you deal with an empty table, i.e. when no snapshots have been produced yet?
I guess this method should return an Optional<Snapshot>?
Good point, it is a corner case. Will switch to Optional and also add a unit test.
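A minimal sketch of an Optional-based signature for the empty-table case; the helper in the last line is hypothetical, standing in for the existing strategy-based lookup:

```java
@VisibleForTesting
static Optional<Snapshot> getStartSnapshot(Table table, ScanContext scanContext) {
  if (table.currentSnapshot() == null) {
    // Empty table: no snapshot has been produced yet.
    return Optional.empty();
  }
  return Optional.of(resolveStartSnapshot(table, scanContext));  // hypothetical strategy-based lookup
}
```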
Force-pushed from 4dcb0e2 to 54baa78.
Force-pushed from 0b9a4ae to 3fc48b3.
```java
if (matchedSnapshotById != null) {
  return Optional.of(matchedSnapshotById);
} else {
  throw new IllegalArgumentException(
```
We usually prefer Preconditions.checkArgument instead of an extra if statement.
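A hedged rewrite of the if/else using Preconditions.checkArgument; the message text is illustrative:

```java
Snapshot matchedSnapshotById = table.snapshot(scanContext.startSnapshotId());
Preconditions.checkArgument(matchedSnapshotById != null,
    "Start snapshot id not found in history: %s", scanContext.startSnapshotId());
return Optional.of(matchedSnapshotById);
```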
```java
case INCREMENTAL_FROM_SNAPSHOT_TIMESTAMP:
  long snapshotIdAsOfTime = SnapshotUtil.snapshotIdAsOfTime(table, scanContext.startSnapshotTimestamp());
  Snapshot matchedSnapshotByTimestamp = table.snapshot(snapshotIdAsOfTime);
  if (matchedSnapshotByTimestamp != null) {
```
If we are guaranteed that the snapshot from snapshotIdAsOfTime is known, then we can avoid this check.
| LOG.info("Skip incremental scan because table is empty"); | ||
| return new ContinuousEnumerationResult(Collections.emptyList(), lastPosition, lastPosition); | ||
| } else { | ||
| if (lastPosition.snapshotId() != null && currentSnapshot.snapshotId() == lastPosition.snapshotId()) { |
Nit: this could be an else if
```java
import org.junit.rules.TemporaryFolder;
import org.junit.rules.TestRule;

public class TestContinuousSplitPlannerImplStartStrategy {
```
```java
Assert.assertEquals(1, result.splits().size());
IcebergSourceSplit split = result.splits().iterator().next();
Assert.assertEquals(1, split.task().files().size());
Assert.assertEquals(dataFile.path().toString(), split.task().files().iterator().next().file().path().toString());
```
Minor: you can use Iterables.getOnlyElement to avoid calling iterator().next() without validation.
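A hedged sketch of the suggestion applied to the assertions above (Iterables is the relocated Guava class); because getOnlyElement fails unless the collection has exactly one element, the explicit size assertions become redundant:

```java
IcebergSourceSplit split = Iterables.getOnlyElement(result.splits());
FileScanTask fileScanTask = Iterables.getOnlyElement(split.task().files());
Assert.assertEquals(dataFile.path().toString(), fileScanTask.file().path().toString());
```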
```java
public ContinuousSplitPlannerImpl(Table table, ScanContext scanContext, String threadPoolName) {
  this.table = table;
  this.scanContext = scanContext;
  this.workerPool = ThreadPools.newWorkerPool(
```
Minor: for testing, could we set this to null so that we don't spawn a worker pool?
Also, the pool is never closed. Should this be closeable? If not, should we pass in the worker pool so that its lifecycle is attached to a closeable instance?
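A hedged sketch of attaching the pool's lifecycle to the planner by making it closeable; whether ContinuousSplitPlanner itself extends Closeable is an assumption here:

```java
public class ContinuousSplitPlannerImpl implements ContinuousSplitPlanner, Closeable {
  private final ExecutorService workerPool;

  // ... constructor as in the diff above ...

  @Override
  public void close() throws IOException {
    // Shut down the pool this planner created so its threads do not leak.
    workerPool.shutdown();
  }
}
```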
```java
ContinuousEnumerationResult emptyTableInitialDiscoveryResult = splitPlanner.planSplits(null);
Assert.assertTrue(emptyTableInitialDiscoveryResult.splits().isEmpty());
Assert.assertNull(emptyTableInitialDiscoveryResult.fromPosition());
Assert.assertNull(emptyTableInitialDiscoveryResult.toPosition().snapshotId());
```
We may want to add an isEmpty method to check this rather than checking snapshotId() is null.
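A hedged sketch of such a helper on IcebergEnumeratorPosition; the field checked here is an assumption based only on the assertions above:

```java
// Hypothetical helper: true when the position does not point at any snapshot yet
// (e.g. for an empty table).
boolean isEmpty() {
  return snapshotId == null;
}

// The last assertion above would then read:
// Assert.assertTrue(emptyTableInitialDiscoveryResult.toPosition().isEmpty());
```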
… also added unit test coverage for the scenario
…Impl that closes private thread pool. address other comments from Ryan too.
Force-pushed from 3b79198 to 3ed15ce.
Thanks, @stevenzwu!
It mainly contains classes for streaming read (continuous split discovery), along with config and enumerator position classes.