[fix][ml] Persist correct markDeletePosition to prevent message loss #18237
Sorry we missed your PR before, @AnonHxy.
Lgtm
Codecov Report
@@ Coverage Diff @@
## master #18237 +/- ##
============================================
+ Coverage 34.91% 38.60% +3.69%
- Complexity 5707 8220 +2513
============================================
Files 607 683 +76
Lines 53396 67289 +13893
Branches 5712 7218 +1506
============================================
+ Hits 18644 25979 +7335
- Misses 32119 38321 +6202
- Partials 2633 2989 +356
Could you please cherry-pick this PR to branch-2.9? Thanks.
Thanks for cherry picking it @congbobo184!
…pache#18237) (cherry picked from commit d612858) (cherry picked from commit 159333c)
Fixes: #18236
Motivation
See #18236 for details on the problematic behavior.
The fundamental problem is that the cursor persists the `readPosition` instead of the `markDeletePosition` in the `ManagedCursor#internalResetCursor` method. This bug is rare for three reasons: the method is only called on specific occasions, the broker persists the correct `markDeletePosition` in ZooKeeper, and the cursor data in BookKeeper is only used when the broker doesn't update the cursor in ZooKeeper.
I first found this bug by producing to a newly created topic and subscription where the subscription was set to "latest". In that case, when the broker evaluates the following code, the `position` evaluates to `ledgerId:0` instead of `ledgerId:-1`. It seems that this bug affects all reset cursor calls except "earliest", because "earliest" always goes to entryId -1.
pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java
Lines 1186 to 1191 in fa328a4
After setting the position incorrectly for `latest` and for regular resets, the method persists `newPosition` (which is defined by `position`) as the `markDeletePosition`:
pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java
Line 1281 in fa328a4
As a result, the cursor's persisted data is incorrect for those two cases.
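The off-by-one described above can be sketched with a simplified model. This is only an illustration, not the code in `ManagedCursorImpl`: the `Position` record and all method names are hypothetical, and the sketch assumes the reset target and its predecessor live in the same ledger.

```java
// Simplified model of the bug. Position and every method name here are
// hypothetical stand-ins, not Pulsar's real classes or APIs.
final class CursorResetSketch {
    /** A (ledgerId, entryId) pair. entryId -1 means "before the first entry". */
    record Position(long ledgerId, long entryId) {
        Position previous() { return new Position(ledgerId, entryId - 1); }
        Position next() { return new Position(ledgerId, entryId + 1); }
    }

    /** Buggy behavior: the reset target (a read position) is stored as the mark-delete position. */
    static Position persistedByBuggyReset(Position resetTarget) {
        return resetTarget;
    }

    /** Fixed behavior: the reset target is the read position, so mark-delete is one entry earlier. */
    static Position persistedByFixedReset(Position resetTarget) {
        return resetTarget.previous();
    }

    /** After recovery, reading resumes right after the persisted mark-delete position. */
    static Position firstReadAfterRecovery(Position persistedMarkDelete) {
        return persistedMarkDelete.next();
    }

    public static void main(String[] args) {
        Position target = new Position(4, 0); // reset to the first entry of ledger 4
        System.out.println("buggy: " + firstReadAfterRecovery(persistedByBuggyReset(target)));
        System.out.println("fixed: " + firstReadAfterRecovery(persistedByFixedReset(target)));
    }
}
```

In this model, a cursor recovered from the buggy data resumes at entry 1 of ledger 4 and skips entry 0, while the fixed variant resumes at the reset target itself.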
Interestingly, we do not see this bug much because a cleanly closed cursor ledger persists its `markDeletePosition` in ZooKeeper, and when the broker recovers the cursor, it just uses the ZooKeeper data:
pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java
Line 458 in fa328a4
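A minimal sketch of why the recovery path usually hides the bug, under the assumption that recovery prefers the ZooKeeper position whenever one was written and otherwise falls back to the cursor ledger. The `recover` method, its parameters, and the `Position` record are hypothetical; the real recovery logic in `ManagedCursorImpl` is more involved.

```java
import java.util.Optional;

// Hypothetical simplification of cursor recovery; not Pulsar's actual API.
final class CursorRecoverySketch {
    record Position(long ledgerId, long entryId) {}

    /**
     * zkPosition is assumed to be present only when the cursor was closed cleanly.
     * Otherwise the last position written to the cursor's BookKeeper ledger is used.
     */
    static Position recover(Optional<Position> zkPosition, Position cursorLedgerPosition) {
        return zkPosition.orElse(cursorLedgerPosition);
    }

    public static void main(String[] args) {
        Position correct = new Position(4, -1);  // what ZooKeeper holds after a clean close
        Position incorrect = new Position(4, 0); // what the buggy reset wrote to the cursor ledger
        System.out.println("clean close:   " + recover(Optional.of(correct), incorrect));
        System.out.println("unclean close: " + recover(Optional.empty(), incorrect));
    }
}
```

Only the second call, which models a broker that never updated ZooKeeper, surfaces the incorrect position.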
Note that the only reason the ZooKeeper metadata is correct is that we update the `markDeletePosition` to a new value in a callback run after persisting a different `markDeletePosition`:
pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java
Line 1217 in fa328a4
Note also that this bug does not occur unless you produce a message to the topic after triggering the `resetCursor` logic. If you do not produce a message, you'll see a log line like `Current position 4:0 is ahead of last position 4:-1`, which was introduced by #2673. It's possible that PR partially found this bug.
Finally, this bug may be related to #15031 and #15067.
Modifications
Update `ManagedCursor#internalResetCursor` so that it persists the `markDeletePosition` instead of the `readPosition` to the cursor's BookKeeper ledger. The rest of the changes make sure that the correct variables are shared around.
`resetCursor` updates the `readPosition`.
Verifying this change
This PR includes two new tests that fail without the code changes and all current tests continue to pass.
Does this pull request potentially affect one of the following parts:
The only possible "breaking" change is that this PR asserts that the `resetCursor` logic resets the read position (not the mark delete position). I think this is very sensible, but I'll need verification. Note that we already follow this logic in the way that we update the `markDeletePosition` in memory and in ZooKeeper. This change just updates what gets written to the cursor's ledger.
The main breaking change is that users have likely adapted their expectations of `resetCursor` to be off by one when using specific message ids. We'll need to communicate this change. The one benefit is that those users will get one extra message, which is generally preferable to one less message.
Documentation
doc
Matching PR in forked repository
PR in forked repository: michaeljmarshall#7