
Conversation

Contributor

@rionmonster rionmonster commented Dec 27, 2025

Purpose

Linked issue: close #2262

Per issue #2262, this pull request addresses a race condition that could intermittently cause the IcebergRewriteITCase.testLogTableCompaction test case to fail (particularly during CI builds).

Brief change log

This change updates the LakeTableTieringManager to address potential race conditions surrounding state transitions. In some cases, delayed callbacks or retries could act on tables before their new state was recorded, or stale entries in the pending queue could be handed out even though the table was no longer eligible for tiering. Additional changes harden the table request (requestTable()) and removal (removeLakeTable()) paths as well.

Tests

These changes were initially verified after successfully reproducing the issue (via iterating over the IcebergRewriteITCase.testLogTableCompaction test). After the root cause was identified, new tests were introduced to help verify these issues:

  • LakeTableTieringManagerTest.testRequestTableSkipsStaleQueueEntries - Verifies the changes made to LakeTableTieringManager.requestTable to ensure that invalid or stale state transitions do not negatively affect the tiering process.
  • LakeTableTieringManagerTest.testScheduledStateRecordedBeforeTimerCallbackRuns - Verifies that the scheduled state is recorded before the timer callback runs, so an immediately firing callback cannot create side effects (see the ordering sketch after this list).
  • LakeTableTieringManagerTest.testRemoveLakeTableRemovesAllPendingQueueOccurrences - Verifies that any duplicate entries within the internal pendingTieringTables queue are removed during table removal (via removeLakeTable()).
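
For context, the ordering the second test verifies can be illustrated with a minimal, self-contained sketch. The class, field, and method names below are hypothetical and not taken from the Fluss codebase; the point is simply that the new state is recorded before the delayed callback is scheduled, so even a callback that fires immediately observes the recorded state.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class TieringSchedulerSketch {
    enum TieringState { NEW, SCHEDULED, PENDING }

    private final Map<Long, TieringState> tieringStates = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    void scheduleTiering(long tableId, long delayMs) {
        // Record the new state *before* scheduling the callback, so a callback
        // that fires immediately (delayMs == 0) still observes SCHEDULED.
        tieringStates.put(tableId, TieringState.SCHEDULED);
        timer.schedule(() -> onTieringDue(tableId), delayMs, TimeUnit.MILLISECONDS);
    }

    private void onTieringDue(long tableId) {
        // Only advance to PENDING if the table is still in the recorded state.
        tieringStates.replace(tableId, TieringState.SCHEDULED, TieringState.PENDING);
    }
}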

API and Format

N/A

Documentation

N/A

Reviewer(s) Requested

@swuferhong (as original reporter), @beryllw (as original author)

checkFileStatusInIcebergTable(t1, 3, false);

// Ensure tiering job has fully processed the previous writes
assertReplicaStatus(t1Bucket, i);
Contributor


I'm curious about this. If the file count in Iceberg is already 3, shouldn't that also mean that tiering has already tiered all records, since we only write 3 records?

Contributor Author

@rionmonster rionmonster Dec 29, 2025


@luoyuxia

I agree that it's a bit odd. The additional replica assertion seems to alleviate the race condition we were seeing in this specific test case in the failing CI runs (at least based on my local attempts to reproduce the issue).

It may simply be a testing artifact rather than a legitimate issue, but if this doesn't seem like a proper fix, we can explore other avenues. It looked like a fairly common kind of race condition, so leveraging an existing assertion seemed like a decent approach. Happy to dig further, though, if we feel we need more exhaustive checks.

Contributor


@rionmonster I tried to reproduce it locally, but couldn't. I think it would be better to explore the root cause, as I'm afraid it is caused by another, more critical issue.

Contributor Author


@luoyuxia

Good to know! It seemed to be pretty consistent regarding pass/fail on my end, but clearly there’s something else at play.

I’ll do some more exploration and see what I find. Thanks for the feedback!

Contributor Author

@rionmonster rionmonster Dec 30, 2025


@luoyuxia

I've been digging into this a bit further, and it seems there's some disparity between the files actually being written and the latest offsets retrieved for them (specifically after asynchronous operations such as compaction). I think we may need some mechanism to improve consistency, at least within the bounds of the tests.

I wrote a little monitor to run through several iterations of the tests to see what the state of various bits looked like during each iteration. Here's a sample of those:

Successful Test (Successfully Performs Compaction as Expected)
[MONITOR] log_table_33 - TableBucket{tableId=32, bucket=0}

IcebergFiles IcebergSnapshotId LakeSnapshotId ReplicaLakeSnapId ReplicaLakeLogEndOff Timestamp
0 -1 -1 -1 -1 1767113560800
0 -1 -1 -1 -1 1767113561046
... ... ... ... ... ...
3 5182733673261799288 5182733673261799288 5443797100773076340 3 1767113615059
3 5182733673261799288 5182733673261799288 5443797100773076340 3 1767113615307
2 2575057976625237982 2575057976625237982 5443797100773076340 4 1767113615557
2 2575057976625237982 2575057976625237982 5443797100773076340 4 1767113615808

Failing Test (File appeared to never be properly written before expected offset)
[MONITOR] log_table_34 - TableBucket{tableId=33, bucket=0}

IcebergFiles IcebergSnapshotId LakeSnapshotId ReplicaLakeSnapId ReplicaLakeLogEndOff Timestamp
0 -1 -1 -1 -1 1767113616327
... ... ... ... ... ...
2 7273969972093574431 7273969972093574431 7273969972093574431 2 1767113861627
2 7273969972093574431 7273969972093574431 7273969972093574431 2 1767113861882
2 7273969972093574431 7273969972093574431 7273969972093574431 2 1767113862135
2 7273969972093574431 7273969972093574431 7273969972093574431 2 1767113862381
2 7273969972093574431 7273969972093574431 7273969972093574431 2 1767113862633
[ASSERTION FAILURE] Expected offset 3 but got 2 for bucket TableBucket{tableId=33, bucket=0}
  Replica Lake Snapshot ID: 7273969972093574431
  Current State:
    Iceberg Files: 2
    Iceberg Snapshot ID: 7273969972093574431
    Lake Snapshot ID (from admin): 7273969972093574431
    Replica Lake Snapshot ID: 7273969972093574431
    Replica Lake Log End Offset: 2

I'm not sure if this is more an artifact of the tests themselves or a legitimate issue. Any thoughts? Happy to continue digging. Given the inconsistency, it feels like a race condition, either in writing the files or in reading stale offsets from the data lake directly. In a real-world environment this might be tolerable (I suspect it's just a minor latency spike that would eventually resolve), but in the confines of a test it's flaky.
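
As a rough illustration of the kind of consistency wait I have in mind (a purely hypothetical helper, not an existing Fluss test utility), the test could poll until the replica log end offset reaches the expected value before asserting on the Iceberg files:

import java.time.Duration;
import java.util.function.BooleanSupplier;

final class TestWaitUtil {
    // Polls the condition until it holds or the timeout elapses.
    static void waitUntil(BooleanSupplier condition, Duration timeout) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() > deadline) {
                throw new AssertionError("Condition not met within " + timeout);
            }
            Thread.sleep(100); // re-check periodically instead of asserting on a single snapshot
        }
    }
}

// Usage idea (fetchReplicaLogEndOffset is a stand-in for however the test reads the offset):
// TestWaitUtil.waitUntil(() -> fetchReplicaLogEndOffset(t1Bucket) >= expectedOffset, Duration.ofMinutes(2));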

Contributor

@luoyuxia luoyuxia Jan 10, 2026


> it was possible to request a table from the queue that was no longer actually in the "Pending" state.

@rionmonster
Could you please explain a bit about the case in which a table in the queue would no longer be in the "Pending" state? Even if it's not in the "Pending" state, the tiering service can still request the table and do the tiering. Also, from the log, I can't see a line like "Fail to change state for table ".

To me, it looks like pendingTieringTables is empty, but I still can't figure out why.

Contributor


Also, could you please share your reproduction branch so that I can reproduce it in my local environment?

Contributor Author


@luoyuxia

Sure! I'm still cleaning up some of my notes from the troubleshooting session, but hopefully this workflow makes sense:

  1. A table T becomes eligible and is assigned normally (Pending to Tiering).
  2. Later, a stale duplicate entry for T ends up in the pending queue (e.g., due to retries, late timer callbacks, or test churn).
  3. Before the fix, requestTable() would poll T from the pending queue and attempt to assign it again, even though its actual state was no longer Pending.
  4. That resulted in an invalid state transition attempt (non-Pending/Tiering to Tiering), but the table was still returned to the tiering service, causing things to become inconsistent between the pendingTieringTables and tieringStates.

The key difference with this change is that requestTable() now verifies the table is actually in the Pending state before assigning it, instead of relying solely on its presence in the pending queue.
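
As a rough sketch of that check (the class name, field names, and types below are simplified guesses, not the actual LakeTableTieringManager internals):

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class PendingQueueSketch {
    enum TieringState { PENDING, TIERING, TIERED }

    private final Queue<Long> pendingTieringTables = new ArrayDeque<>();
    private final Map<Long, TieringState> tieringStates = new HashMap<>();

    /** Returns the next genuinely pending table id, skipping stale queue entries, or null. */
    Long requestTable() {
        Long tableId;
        while ((tableId = pendingTieringTables.poll()) != null) {
            // Skip entries whose table was removed or has already left the Pending state.
            if (tieringStates.get(tableId) != TieringState.PENDING) {
                continue;
            }
            tieringStates.put(tableId, TieringState.TIERING);
            return tableId;
        }
        return null;
    }
}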

Contributor


@rionmonster

> Later, a stale duplicate entry for T ends up in the pending queue (e.g., due to retries, late timer callbacks, or test churn).

As for the failing test, which case caused a stale duplicate entry for T to enter the pending queue? Did you find any logs for it?

> would poll T from the pending queue and attempt to assign it again, even though its actual state was no longer Pending

If it's an invalid state change, we'll print

Fail to change state for table xxx

But I can't see such a log in the failing CI run.

> but the table was still returned to the tiering service, causing things to become inconsistent between the pendingTieringTables and tieringStates.

To me, from the log it looks like the table isn't returned to the tiering service anymore.

So I'm a little doubtful about this case.

Could you please share your reproduction branch? I want to reproduce it in my local environment to find the root cause.

Contributor Author

@rionmonster rionmonster Jan 10, 2026


@luoyuxia

Based on our private conversations exploring this, I'm including the following summary of those findings here for visibility:

Okay, I was able to reproduce the failure with some extended debugging logs added throughout the LakeTableTieringManager. I've created a gist here with the logs. Reviewing them, it looks like we have the following chain of events:

  • At 411079 we see a series of "polled tableId=x but tablePath is null (state=null, epoch=null)" log entries, which repeat 10+ times
  • After this burst we can see that the test fails (since rewrite/compaction didn't complete in time)

I think we can interpret this as:

  • Tiering service requested work (via requestTable())
  • Manager pulled table ids from pendingTieringTables but one of the following was true:
    • tablePaths didn't contain the id
    • tieringStates didn't contain the id
    • tableTierEpoch didn't contain the id
  • This means that the table ids were no longer registered (e.g., previously dropped/removed) but were still present in pendingTieringTables
  • This causes requestTable() calls to continually drain/recurse over these stale entries, which could delay processing of the actual pending tables
  • This delay could be enough to cause the failed assertion

I think the proposed fixes would help with this, although we may also need to adjust removeLakeTable() to ensure the requested table is removed from the pending queue, which I'll add to the PR:

public void removeLakeTable(long tableId) {
    inLock(lock, () -> {
        // Omitted for brevity

        // Drop any (possibly duplicate) occurrences of the table from the pending queue
        pendingTieringTables.removeIf(id -> id == tableId);
    });
}

I've created a branch with these additional logs if you would like to explore it yourself at https://github.com/rionmonster/fluss/tree/for-yuxia-with-logs

Contributor

@luoyuxia luoyuxia left a comment


@rionmonster Thanks for the PR. Only one question.


[FLUSS-2262][lake][server] Address lake table tiering stability during state handling
