Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [ml] Add entry fail due to race condition about add entry failed/timeout and switch ledger #22221

Merged
merged 1 commit into from
May 21, 2024

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Mar 7, 2024

Issue 1 without enable managedLedgerAddEntryTimeoutSeconds

Background: Flow when adding entry failed

  • Switch to a new ledger
  • Close and duplicate all the pending add requests[3]
    • The callback of the add entry request will be discarded because it is closed.
  • Redo all the pending add requests.

Issue

time add entries failed switch ledger
1 add the first entry failed
2 start to switch ledger, new ledger was created
3 close all pending OpAddEntry[3]
4 recycle OpAddEntry due to change state to COMPLETED failed[4]
5 get an NPE when duplicating pending OpAddEntry

Reproduce: no test to reproduce this case yet.


Issue 2 that enabled managedLedgerAddEntryTimeoutSeconds

Background: Flow when adding entry timeout

  • (Highlight) Close the first add entry request[1].
  • Switch to a new ledger
  • Close and duplicate all the pending add requests
    • The callback of the add entry request will be discarded because it is closed.
  • Redo all the pending add requests.

Issue
The flow when adding entry timeout closing the first add entry request at first, it may cause an issue like this:

  • timeout task - step 1: Close the first add entry request.
  • callback of OpAddEntry 1: discard the callback of the first OpAddEntry because it is closed, recycle the object OpAddEntry.
  • callback of OpAddEntry 2: tries do a completed callback
    • Confirm the first request equals the current one: false, and the first OpAddEntry has been recycled, Pulsar will get an NPE[2].

Reproduce: the new test testAddEntryResponseTimeout


Modifications

  • Recycle OpAddEntry after it is closed when change state to COMPLETED failed.
  • Make the timeout process exactly the same as the process that failed to add

Footnotes

[1]: OpAddEntry.handleAddTimeoutFailure

void handleAddTimeoutFailure(final LedgerHandle ledger, Object ctx) {
  if (checkAndCompleteOp(ctx)) {
      this.close();
      this.handleAddFailure(ledger);
  }
}

[2]: OpAddEntry.run

OpAddEntry firstInQueue = ml.pendingAddEntries.poll();
if (firstInQueue == null) {
    return;
}
if (this != firstInQueue) {
    firstInQueue.failed(new ManagedLedgerException("Unexpected add entry op when complete the add entry op."));
    return;
}

[3]: ManagedLedgerImpl.createNewOpAddEntryForNewLedger

do {
    existsOp = pendingAddEntries.poll();
    existsOp.close();
    pendingAddEntries.add(OpAddEntry.createNoRetainBuffer(....));
} while (...)

[4]: OpAddEntry.addComplete

if (!STATE_UPDATER.compareAndSet(OpAddEntry.this, State.INITIATED, State.COMPLETED)) {
    OpAddEntry.this.recycle();
    return;
}

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added this to the 3.3.0 milestone Mar 7, 2024
@poorbarcode poorbarcode self-assigned this Mar 7, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 7, 2024
@poorbarcode poorbarcode changed the title [draft] [fix] [ml] Add entry always fail when enabled "managedLedgerAddEntryTimeoutSeconds" [fix] [ml] Add entry always fail when enabled "managedLedgerAddEntryTimeoutSeconds" Mar 11, 2024
@poorbarcode poorbarcode changed the title [fix] [ml] Add entry always fail when enabled "managedLedgerAddEntryTimeoutSeconds" [fix] [ml] Add entry fail due to race condition about add entry failed/timeout and switch ledger Mar 11, 2024
@poorbarcode poorbarcode requested review from lhotari and dao-jun and removed request for dao-jun March 13, 2024 18:44
@poorbarcode poorbarcode requested a review from lhotari April 15, 2024 08:21
@poorbarcode
Copy link
Contributor Author

/pulsarbot rerun-failure-checks

@codelipenghui
Copy link
Contributor

@poorbarcode Could you please rebase to the master branch since the last commit of this PR was happened a month ago.

@poorbarcode poorbarcode force-pushed the fix/AddEntryTimeoutSeconds branch from 8f1969f to cda2b7d Compare May 20, 2024 12:12
@poorbarcode
Copy link
Contributor Author

@poorbarcode Could you please rebase to the master branch since the last commit of this PR was happened a month ago.

Rebased from branch-master.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
this seems a pretty serious bug, to port to all the active branches

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 95.65217% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 73.20%. Comparing base (bbc6224) to head (cda2b7d).
Report is 291 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #22221      +/-   ##
============================================
- Coverage     73.57%   73.20%   -0.37%     
+ Complexity    32624     2511   -30113     
============================================
  Files          1877     1889      +12     
  Lines        139502   141424    +1922     
  Branches      15299    15518     +219     
============================================
+ Hits         102638   103536     +898     
- Misses        28908    29883     +975     
- Partials       7956     8005      +49     
Flag Coverage Δ
inttests 27.44% <30.43%> (+2.86%) ⬆️
systests 24.66% <43.47%> (+0.33%) ⬆️
unittests 72.22% <95.65%> (-0.63%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...org/apache/bookkeeper/mledger/impl/OpAddEntry.java 73.84% <100.00%> (+1.02%) ⬆️
...okkeeper/mledger/impl/ShadowManagedLedgerImpl.java 57.35% <100.00%> (+1.68%) ⬆️
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 81.12% <92.30%> (+0.45%) ⬆️

... and 344 files with indirect coverage changes

@poorbarcode poorbarcode merged commit ae9616b into apache:master May 21, 2024
50 checks passed
Technoboy- pushed a commit that referenced this pull request May 21, 2024
Technoboy- pushed a commit that referenced this pull request May 21, 2024
poorbarcode added a commit that referenced this pull request May 28, 2024
…d/timeout and switch ledger (#22221)

(cherry picked from commit ae9616b)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request May 31, 2024
…d/timeout and switch ledger (apache#22221)

(cherry picked from commit b798e7f)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 4, 2024
…d/timeout and switch ledger (apache#22221)

(cherry picked from commit b798e7f)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 4, 2024
…d/timeout and switch ledger (apache#22221)

(cherry picked from commit b798e7f)
coderzc pushed a commit that referenced this pull request Jun 5, 2024
…d/timeout and switch ledger (#22221)

(cherry picked from commit ae9616b)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 7, 2024
…d/timeout and switch ledger (apache#22221)

(cherry picked from commit b798e7f)
@lhotari
Copy link
Member

lhotari commented Aug 21, 2024

@poorbarcode I have created #23208 about the excessive log warnings that this change causes. Please take a look.

@lhotari
Copy link
Member

lhotari commented Aug 21, 2024

PR that fixes the excessive warnings: #23209

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants