Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][ml] Persist correct markDeletePosition to prevent message loss #18237

Merged
merged 2 commits into from
Oct 31, 2022

Conversation

michaeljmarshall
Copy link
Member

@michaeljmarshall michaeljmarshall commented Oct 28, 2022

Fixes: #18236

Motivation

See #18236 for details on the problematic behavior.

The fundamental problem is that the cursor persists the readPosition instead of persisting the markDeletePosition in the ManagedCursor#internalResetCursor method. This method is only called on specific occasions, the broker persists the correct markDeletePosition in zookeeper, and the bookkeeper's cursor data is only used when the broker doesn't update the cursor in zookeeper, so this bug is rare.

I found this bug at first by producing to a newly created topic and subscription where the subscription was set to "latest". In that case, when the broker evaluates the following code, the position evaluates to ledgerId:0 instead of ledgerId:-1. It seems that this bug affects all reset cursor calls except "earliest" because of the way "earliest" always goes to entryId -1.

protected void internalResetCursor(PositionImpl position, AsyncCallbacks.ResetCursorCallback resetCursorCallback) {
if (position.equals(PositionImpl.EARLIEST)) {
position = ledger.getFirstPosition();
} else if (position.equals(PositionImpl.LATEST)) {
position = ledger.getLastPosition().getNext();
}

After setting the position incorrectly for latest and for regular resets, the method persists newPosition (which is defined by position) as the markDeletePosition:

internalAsyncMarkDelete(newPosition, isCompactionCursor() ? getProperties() : Collections.emptyMap(),

As a result, the cursor's persisted data is incorrect for those two cases.

Interestingly, we do not see this bug much because a cleanly closed cursor ledger persists its markDeletePosition in Zookeeper, and when the broker recovers the cursor, it just uses the zookeeper data:

Note that the only reason the zk metadata is correct is because we update the markDeletePosition to a new value in a callback run after persisting a different markDeletePosition:

PositionImpl newMarkDeletePosition = ledger.getPreviousPosition(newPosition);

Note also that this bug does not occur unless you produce a message to the topic after triggering the resetCursor logic. If you do not produce a message, you'll see a log line like Current position 4:0 is ahead of last position 4:-1, which was introduced by #2673. It's possible that PR partially found this bug.

Finally, this bug may be related to #15031 #15067.

Modifications

  • Update ManagedCursor#internalResetCursor so that it persists the markDeletePosition instead of the readPosition to the cursor's bookkeeper ledger. The rest of the changes make sure that the correct variables are shared around.
  • Improve logging to make it clearer that resetCursor updates the readPosition
  • Fix test assertion ordering for semi-related test (this test failed as I iterated over the solution)

Verifying this change

This PR includes two new tests that fail without the code changes and all current tests continue to pass.

Does this pull request potentially affect one of the following parts:

The only possible "breaking" change is that this PR asserts that the resetCursor logic resets the read position (not the mark delete position). I think this is very sensible, but I'll need verification. Note that we already follow this logic in the way that we update the markDeletePosition in memory and in zookeeper. This change just updates what gets written to the cursor's ledger.

The main breaking change is that users likely adapted their expectations of resetCursor to be off by one when using specific message ids. We'll need to communicate this change. The one benefit is that those users will get one extra message, which is generally preferable over one less message.

Documentation

  • doc

Matching PR in forked repository

PR in forked repository: michaeljmarshall#7

@michaeljmarshall michaeljmarshall added type/bug The PR fixed a bug or issue reported a bug area/broker doc-not-needed Your PR changes do not impact docs release/2.9.4 release/2.8.5 release/2.11.1 release/2.10.3 labels Oct 28, 2022
@michaeljmarshall michaeljmarshall added this to the 2.12.0 milestone Oct 28, 2022
@michaeljmarshall michaeljmarshall self-assigned this Oct 28, 2022
@github-actions github-actions bot added the doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. label Oct 28, 2022
@AnonHxy
Copy link
Contributor

AnonHxy commented Oct 28, 2022

It seems that this PR some the same problem #17668 cc @HQebupt

@codelipenghui
Copy link
Contributor

Sorry we missed your PR before @AnonHxy

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@codecov-commenter
Copy link

codecov-commenter commented Oct 28, 2022

Codecov Report

Merging #18237 (4c61048) into master (6c65ca0) will increase coverage by 3.69%.
The diff coverage is 36.62%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #18237      +/-   ##
============================================
+ Coverage     34.91%   38.60%   +3.69%     
- Complexity     5707     8220    +2513     
============================================
  Files           607      683      +76     
  Lines         53396    67289   +13893     
  Branches       5712     7218    +1506     
============================================
+ Hits          18644    25979    +7335     
- Misses        32119    38321    +6202     
- Partials       2633     2989     +356     
Flag Coverage Δ
unittests 38.60% <36.62%> (+3.69%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...keeper/mledger/impl/ReadOnlyManagedLedgerImpl.java 0.00% <0.00%> (ø)
...pache/bookkeeper/mledger/offload/OffloadUtils.java 0.00% <0.00%> (ø)
.../main/java/org/apache/pulsar/PulsarStandalone.java 0.00% <0.00%> (ø)
.../apache/pulsar/broker/admin/impl/ClustersBase.java 76.66% <0.00%> (+67.62%) ⬆️
...pache/pulsar/broker/admin/impl/NamespacesBase.java 63.32% <ø> (+48.79%) ⬆️
.../pulsar/broker/service/AbstractBaseDispatcher.java 50.92% <0.00%> (+5.06%) ⬆️
...che/pulsar/broker/service/BacklogQuotaManager.java 12.39% <0.00%> (+2.91%) ⬆️
.../pulsar/broker/service/BrokerServiceException.java 38.88% <0.00%> (+13.88%) ⬆️
...ava/org/apache/pulsar/broker/service/Consumer.java 68.08% <0.00%> (+6.07%) ⬆️
...ava/org/apache/pulsar/broker/service/Producer.java 62.42% <0.00%> (+2.78%) ⬆️
... and 712 more

@Technoboy-
Copy link
Contributor

It seems that this PR some the same problem #17668 cc @HQebupt

@HQebupt
Do you agree to use this pr and close #17668, as this pr has added the test to avoid the regression?

@congbobo184
Copy link
Contributor

could you please cherry-pick this PR to branch-2.9? thanks.

congbobo184 pushed a commit that referenced this pull request Nov 17, 2022
@congbobo184 congbobo184 added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Nov 17, 2022
@michaeljmarshall
Copy link
Member Author

Thanks for cherry picking it @congbobo184!

congbobo184 pushed a commit that referenced this pull request Dec 7, 2022
liangyepianzhou pushed a commit that referenced this pull request Dec 14, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jan 10, 2023
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jan 11, 2023
@michaeljmarshall michaeljmarshall deleted the fix-internal-reset-cursor branch February 13, 2023 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker cherry-picked/branch-2.9 Archived: 2.9 is end of life cherry-picked/branch-2.10 cherry-picked/branch-2.11 doc Your PR contains doc changes, no matter whether the changes are in markdown or code files. doc-not-needed Your PR changes do not impact docs release/2.8.5 release/2.9.4 release/2.10.3 release/2.11.0 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Message lost after forcefully stopping broker
8 participants