[FLUSS-2262][lake] Improved Stability for Iceberg Log Table Compaction Test #2265
base: main
Conversation
```java
checkFileStatusInIcebergTable(t1, 3, false);

// Ensure tiering job has fully processed the previous writes
assertReplicaStatus(t1Bucket, i);
```
I'm curious about this. If the file count in Iceberg is already 3, shouldn't that also mean the tiering has already tiered all records, since we only write 3 records?
I agree that it's a bit odd. The additional replica assertion seems to alleviate the race conditions we were seeing in this specific test case on the failing CI instances (at least based on my local attempts to reproduce the issue).
It may simply be a testing artifact as opposed to a legitimate issue, but if this doesn't feel like a proper fix, we can explore other avenues. It looked like one of those common race conditions, so leveraging an existing function to help seemed like a decent approach. Happy to dig further, though, if we feel we need more exhaustive checks.
@rionmonster I tried to reproduce it locally, but failed. I think it'll be better to explore the root cause, as I'm afraid it's caused by another critical issue.
Good to know! It seemed to be pretty consistent regarding pass/fail on my end, but clearly there’s something else at play.
I’ll do some more exploration and see what I find. Thanks for the feedback!
I've been digging into this a bit further and it seems there's some disparity between the files actually being written and the latest offsets retrieved for them (specifically after asynchronous operations such as compaction). I think we may need some mechanism to improve the consistency, at least within the bounds of the tests.
I wrote a little monitor to run through several iterations of the tests to see what the state of the various pieces looked like during each iteration. Here's a sample of those runs:
Successful Test (Compaction Performed as Expected)
[MONITOR] log_table_33 - TableBucket{tableId=32, bucket=0}
| IcebergFiles | IcebergSnapshotId | LakeSnapshotId | ReplicaLakeSnapId | ReplicaLakeLogEndOff | Timestamp |
|---|---|---|---|---|---|
| 0 | -1 | -1 | -1 | -1 | 1767113560800 |
| 0 | -1 | -1 | -1 | -1 | 1767113561046 |
| ... | ... | ... | ... | ... | ... |
| 3 | 5182733673261799288 | 5182733673261799288 | 5443797100773076340 | 3 | 1767113615059 |
| 3 | 5182733673261799288 | 5182733673261799288 | 5443797100773076340 | 3 | 1767113615307 |
| 2 | 2575057976625237982 | 2575057976625237982 | 5443797100773076340 | 4 | 1767113615557 |
| 2 | 2575057976625237982 | 2575057976625237982 | 5443797100773076340 | 4 | 1767113615808 |
Failing Test (File Appeared Never to Be Properly Written Before the Expected Offset)
[MONITOR] log_table_34 - TableBucket{tableId=33, bucket=0}
| IcebergFiles | IcebergSnapshotId | LakeSnapshotId | ReplicaLakeSnapId | ReplicaLakeLogEndOff | Timestamp |
|---|---|---|---|---|---|
| 0 | -1 | -1 | -1 | -1 | 1767113616327 |
| ... | ... | ... | ... | ... | ... |
| 2 | 7273969972093574431 | 7273969972093574431 | 7273969972093574431 | 2 | 1767113861627 |
| 2 | 7273969972093574431 | 7273969972093574431 | 7273969972093574431 | 2 | 1767113861882 |
| 2 | 7273969972093574431 | 7273969972093574431 | 7273969972093574431 | 2 | 1767113862135 |
| 2 | 7273969972093574431 | 7273969972093574431 | 7273969972093574431 | 2 | 1767113862381 |
| 2 | 7273969972093574431 | 7273969972093574431 | 7273969972093574431 | 2 | 1767113862633 |
[ASSERTION FAILURE] Expected offset 3 but got 2 for bucket TableBucket{tableId=33, bucket=0}
Replica Lake Snapshot ID: 7273969972093574431
Current State:
Iceberg Files: 2
Iceberg Snapshot ID: 7273969972093574431
Lake Snapshot ID (from admin): 7273969972093574431
Replica Lake Snapshot ID: 7273969972093574431
Replica Lake Log End Offset: 2
I'm not sure if this is more of an artifact of the tests themselves or a legitimate issue. Any thoughts? I'm happy to continue digging. Given the inconsistency, it feels like a race condition, either in writing the files or in reading stale offsets from the data lake directly. In a real-world environment this might be tolerable (I suspect it's just a minor latency spike that would eventually resolve), but in the confines of a test, it's flaky.
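For context, the monitor itself was just a scheduled polling loop along the lines of the sketch below. The actual Iceberg/Fluss lookups are passed in as `Supplier` callbacks here because the exact admin calls aren't shown above, so treat every name in this block as a placeholder rather than a real Fluss API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Periodically dumps whatever state the caller wires in via the suppliers. */
public class BucketStateMonitor {

    public static ScheduledExecutorService start(
            String label,
            Supplier<Long> icebergFileCount,
            Supplier<Long> icebergSnapshotId,
            Supplier<Long> lakeSnapshotId,
            Supplier<Long> replicaLakeSnapshotId,
            Supplier<Long> replicaLogEndOffset) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Poll roughly every 250 ms, which matches the timestamp spacing in the samples above.
        scheduler.scheduleAtFixedRate(
                () ->
                        System.out.printf(
                                "[MONITOR] %s | files=%d | icebergSnap=%d | lakeSnap=%d | replicaSnap=%d | logEndOff=%d | ts=%d%n",
                                label,
                                icebergFileCount.get(),
                                icebergSnapshotId.get(),
                                lakeSnapshotId.get(),
                                replicaLakeSnapshotId.get(),
                                replicaLogEndOffset.get(),
                                System.currentTimeMillis()),
                0,
                250,
                TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```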
> it was possible to request a table from the queue that was no longer actually in the "Pending" state.

@rionmonster
Could you please explain a bit about the case in which a table in the queue is no longer actually in the "Pending" state? Even though it's not in the "Pending" state, the tiering service can still request the table and do the tiering. Also, from the log, I can't see a line like "Fail to change state for table".
To me, it looks like `pendingTieringTables` is empty, but I still can't figure out why.
Also, could you please share your reproduction branch so that I can reproduce it in my local env?
Sure! Cleaning up some of my notes from the troubleshooting session, but hopefully this workflow makes sense:

- A table `T` becomes eligible and is assigned normally (Pending to Tiering).
- Later, a stale duplicate entry for `T` ends up in the pending queue (e.g., due to retries, late timer callbacks, or test churn).
- Before the fix, `requestTable()` would poll `T` from the pending queue and attempt to assign it again, even though its actual state was no longer Pending.
- That resulted in an invalid state transition attempt (non-Pending/Tiering to Tiering), but the table was still returned to the tiering service, causing things to become inconsistent between `pendingTieringTables` and `tieringStates`.

The key difference with this change is that `requestTable()` now actually verifies the table is in the Pending state before assigning it, instead of just relying on its presence in the pending queue.
> Later, a stale duplicate entry for `T` ends up in the pending queue (e.g., due to retries, late timer callbacks, or test churn).

As for the failing test, which case caused a stale duplicate `T` to enter the pending queue? Did you find any logs for it?

> would poll `T` from the pending queue and attempt to assign it again, even though its actual state was no longer Pending

If it's an invalid state change, we'll print
`Fail to change state for table xxx`
but I can't see such a log in the failing CI run.

> but the table was still returned to the tiering service, causing things to become inconsistent between the `pendingTieringTables` and `tieringStates`.

To me, it looks like the table won't be returned to the tiering service anymore, judging from the log.
So I'm a little doubtful about this case.
Could you please share your reproduction branch? I want to reproduce it in my local env to find the root cause.
Based on our private conversations exploring this, I'm including the following summary here as well:
Okay, I was able to reproduce the failure successfully with some extended debugging logs added throughout the LakeTableTieringManager. I've created a gist here with the logs. Reviewing them, it looks like we have the following chain of events:
- At `411079` we have a series of these "polled tableId=x but tablePath is null (state=null, epoch=null)" type calls, repeated 10+ times
- After this burst we can see that the test fails (since rewrite/compaction didn't complete in time)

I think we can interpret this as:

- The tiering service requested work (via `requestTable()`)
- The manager pulled table ids from `pendingTieringTables`, but one of the following was true:
  - `tablePaths` didn't contain the id
  - `tieringStates` didn't contain the id
  - `tableTierEpoch` didn't contain the id
- This means the table ids were no longer registered (e.g., previously dropped/removed) but were still present in `pendingTieringTables`
- This causes `requestTable()` calls to continually drain/recurse/loop over these stale entries, which could delay processing of the actual pending tables
- That delay could be enough to cause the failed assertion
I think the proposed fixes would help with this, although we may also need to adjust the `removeLakeTable()` call to ensure the requested table is removed from the pending tables too, which I'll add to the PR:
```java
public void removeLakeTable(long tableId) {
    inLock(lock, () -> {
        // Omitted for brevity
        pendingTieringTables.remove(tableId);
    });
}
```
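One caveat, offered only as a sketch under the assumption that `pendingTieringTables` is a standard `java.util` collection of boxed ids: `Collection.remove(Object)` drops only the first matching entry, so if duplicate entries can accumulate (as described above), a `removeIf` sweep would be needed to clear every occurrence:

```java
public void removeLakeTable(long tableId) {
    inLock(lock, () -> {
        // Omitted for brevity
        // removeIf clears every queued occurrence of the id, not just the first match.
        pendingTieringTables.removeIf(id -> id == tableId);
    });
}
```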
I've created a branch with these additional logs if you would like to explore it yourself at https://github.com/rionmonster/fluss/tree/for-yuxia-with-logs
luoyuxia left a comment
@rionmonster Thanks for the PR. Only one question.
Commits:
- [FLUSS-2262][lake][server] Address lake table tiering stability during state handling
- …during table removal
Purpose
Linked issue: close #2262
Per Issue #2262, this pull request addresses a race condition that could sometimes occur and result in the `IcebergRewriteITCase.testLogTableCompaction` test case failing (particularly during CI builds).

Brief change log
This change updates the `LakeTableTieringManager` to address potential race conditions surrounding state transitions. In some cases, delayed callbacks or retries could act on tables before their new state was recorded, or stale entries in the pending queue could be handed out even though the table was no longer eligible for tiering. Some additional changes were made to improve the table requesting/removal use cases as well.

Tests
These changes were initially verified after successfully reproducing the issue (by iterating over the `IcebergRewriteITCase.testLogTableCompaction` test). After the root cause was identified, new tests were introduced to help verify these issues:

- `LakeTableTieringManagerTest.testRequestTableSkipsStaleQueueEntries` - Verifies the changes made to `LakeTableTieringManager.requestTable` to ensure that invalid or stale state transitions do not negatively affect the tiering process.
- `LakeTableTieringManagerTest.testScheduledStateRecordedBeforeTimerCallbackRuns` - Verifies potential race conditions related to immediate state changes to ensure no side effects are created.
- `LakeTableTieringManagerTest.testRemoveLakeTableRemovesAllPendingQueueOccurrences` - Verifies that any potential duplicate entries within the internal `pendingTieringTables` queue are removed during table removal (via `removeLakeTable()`).
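The following is a standalone sketch of the invariant the first test above checks, written against the toy queue/state-map model rather than the real `LakeTableTieringManager` API (JUnit 5 and AssertJ are assumed; every class and field name below is illustrative only):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import org.junit.jupiter.api.Test;

// Exercises a toy queue/state-map model, not the real LakeTableTieringManager API.
class StaleQueueEntrySketchTest {

    enum TieringState { PENDING, TIERING }

    @Test
    void staleQueueEntriesAreSkipped() {
        Queue<Long> pendingTieringTables = new ArrayDeque<>();
        Map<Long, TieringState> tieringStates = new HashMap<>();

        // Table 1 is a stale entry: queued, but no longer registered as PENDING.
        pendingTieringTables.add(1L);
        // Table 2 is genuinely pending.
        pendingTieringTables.add(2L);
        tieringStates.put(2L, TieringState.PENDING);

        // Drain the queue the way the fixed requestTable() does: skip anything not PENDING.
        Long assigned = null;
        Long tableId;
        while ((tableId = pendingTieringTables.poll()) != null) {
            if (tieringStates.get(tableId) == TieringState.PENDING) {
                tieringStates.put(tableId, TieringState.TIERING);
                assigned = tableId;
                break;
            }
        }

        // The stale id must never be handed out; only the truly pending table is assigned.
        assertThat(assigned).isEqualTo(2L);
        assertThat(tieringStates.get(2L)).isEqualTo(TieringState.TIERING);
    }
}
```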
API and Format
N/A
Documentation
N/A
Reviewer(s) Requested
@swuferhong (as original reporter), @beryllw (as original author)