Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][test] Fix ManagedCursorTest.testForceCursorRecovery #23518

Merged
merged 6 commits into from
Nov 5, 2024

Conversation

ZhaoGuorui666
Copy link
Contributor

@ZhaoGuorui666 ZhaoGuorui666 commented Oct 27, 2024

Fixes #23417

Motivation

Ideally, recoverFromLedger should only call operationComplete upon successful recovery and operationFailed upon failure. However, in the actual implementation, both methods may be called due to the following reasons:

Concurrent Execution: If recoverFromLedger is executed in different threads, it may simultaneously meet both success and failure conditions, leading to both callbacks being triggered.

After adding logs, we may see both:

  • Cursor recovery success
  • Cursor recovery failed

Modifications

Use an atomic variable. Use an AtomicBoolean in the callback to mark whether the callback method has already been called, preventing duplicate calls.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Copy link

@ZhaoGuorui666 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Oct 27, 2024
@nodece
Copy link
Member

nodece commented Oct 28, 2024

@ZhaoGuorui666
Copy link
Contributor Author

I understand the ManagedCursorImpl.recoverFromLedger() called by the bottom layer of this test, which calls Bookeeper.syncOpenLedger() instead of TestPulsarMockBookKeeper

image

image

@ZhaoGuorui666
Copy link
Contributor Author

Execute 3000 times, all passed
image

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.31%. Comparing base (bbc6224) to head (f3c433e).
Report is 701 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23518      +/-   ##
============================================
+ Coverage     73.57%   74.31%   +0.74%     
- Complexity    32624    34906    +2282     
============================================
  Files          1877     1943      +66     
  Lines        139502   147001    +7499     
  Branches      15299    16196     +897     
============================================
+ Hits         102638   109247    +6609     
- Misses        28908    29306     +398     
- Partials       7956     8448     +492     
Flag Coverage Δ
inttests 27.34% <ø> (+2.76%) ⬆️
systests 24.39% <ø> (+0.07%) ⬆️
unittests 73.67% <ø> (+0.82%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 650 files with indirect coverage changes

@nodece
Copy link
Member

nodece commented Oct 30, 2024

In this test, we are using org.apache.bookkeeper.mledger.impl.ManagedCursorTest.TestPulsarMockBookKeeper.

ManagedCursorImpl.recoverFromLedger() will call this method from that:

  public void asyncOpenLedger(final long lId, final DigestType digestType, final byte[] passwd,
                final OpenCallback cb, final Object ctx) {
            if (ledgerErrors.containsKey(lId)) {
                cb.openComplete(ledgerErrors.get(lId), null, ctx);
                return; // TODO: Please add a return.
            }
            super.asyncOpenLedger(lId, digestType, passwd, cb, ctx);
        }

It seems that the test results were affected by multiple callback calls.

Copy link
Member

@nodece nodece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment.

@ZhaoGuorui666
Copy link
Contributor Author

@nodece Thank you for your prompt and suggestion. It has been revised.

@summeriiii
Copy link
Contributor

2024-10-30T12:38:15,591 - ERROR - [main:ManagedCursorImpl@553] - [my_test_ledger] Error opening metadata ledger -1 for cursor c1: Bookie handle is not available
2024-10-30T12:38:15,592 - INFO  - [main:ManagedCursorImpl@750] - [my_test_ledger] Cursor c1 recovered to position 3:-1
2024-10-30T12:38:15,592 - INFO  - [main:MetaStoreImpl@257] - expectedVersion:0 [my_test_ledger] Creating consumer c1 on meta-data store with cursorsLedgerId: -1
markDeleteLedgerId: 3
markDeleteEntryId: -1
lastActive: 1730263095578

2024-10-30T12:38:15,593 - INFO  - [test-OrderedScheduler-0-0:ManagedCursorImpl@550] - [my_test_ledger] Opened ledger -1 for cursor c1. rc=-7
2024-10-30T12:38:15,593 - ERROR - [test-OrderedScheduler-0-0:ManagedCursorImpl@553] - [my_test_ledger] Error opening metadata ledger -1 for cursor c1: No such ledger exists on Bookies
2024-10-30T12:38:15,593 - INFO  - [test-OrderedScheduler-0-0:ManagedCursorImpl@750] - [my_test_ledger] Cursor c1 recovered to position 3:-1
2024-10-30T12:38:15,593 - INFO  - [test-OrderedScheduler-0-0:MetaStoreImpl@257] - expectedVersion:0 [my_test_ledger] Creating consumer c1 on meta-data store with cursorsLedgerId: -1
markDeleteLedgerId: 3
markDeleteEntryId: -1
lastActive: 1730263095578

2024-10-30T12:38:15,601 - INFO  - [bookkeeper-ml-scheduler-OrderedScheduler-1-0:ManagedCursorImpl$25@2772] - ledger.getStore().asyncUpdateCursorInfo operation complete
2024-10-30T12:38:15,602 - WARN  - [bookkeeper-ml-scheduler-OrderedScheduler-1-0:ManagedCursorImpl$25@2779] - [my_test_ledger] Failed to update cursor metadata for c1 due to version conflict org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException:

java.lang.AssertionError:
Expected :true
Actual   :false
<Click to see difference>


	at org.testng.Assert.fail(Assert.java:110)
	at org.testng.Assert.failNotEquals(Assert.java:1577)
	at org.testng.Assert.assertTrue(Assert.java:56)
	at org.testng.Assert.assertTrue(Assert.java:66)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorTest.testForceCursorRecovery(ManagedCursorTest.java:4858)

From the log of failed test, we can see expectedVersion:0 [my_test_ledger] Creating consumer c1 on meta-data store with cursorsLedgerId: -1 twice.
The root cause of this flaky test is we call the cb.openComplete() twice, which may cause race condition: when the first cursorInfo has not been saved, the second update cursorInfo started use the oldVersion, so when to save the second cursorInfo meta, it will throw BadVersion exception

So We don't need to add AtomicBoolean callbackInvoked = new AtomicBoolean(false);, we can just fix this to add return in TestPulsarMockBookKeeper.asyncOpenLedger or add else like

public void asyncOpenLedger(final long lId, final DigestType digestType, final byte[] passwd,
        final OpenCallback cb, final Object ctx) {
    if (ledgerErrors.containsKey(lId)) {
        cb.openComplete(ledgerErrors.get(lId), null, ctx);
    } else {
        super.asyncOpenLedger(lId, digestType, passwd, cb, ctx);
    }
}

@ZhaoGuorui666
Copy link
Contributor Author

2024-10-30T12:38:15,591 - ERROR - [main:ManagedCursorImpl@553] - [my_test_ledger] Error opening metadata ledger -1 for cursor c1: Bookie handle is not available
2024-10-30T12:38:15,592 - INFO  - [main:ManagedCursorImpl@750] - [my_test_ledger] Cursor c1 recovered to position 3:-1
2024-10-30T12:38:15,592 - INFO  - [main:MetaStoreImpl@257] - expectedVersion:0 [my_test_ledger] Creating consumer c1 on meta-data store with cursorsLedgerId: -1
markDeleteLedgerId: 3
markDeleteEntryId: -1
lastActive: 1730263095578

2024-10-30T12:38:15,593 - INFO  - [test-OrderedScheduler-0-0:ManagedCursorImpl@550] - [my_test_ledger] Opened ledger -1 for cursor c1. rc=-7
2024-10-30T12:38:15,593 - ERROR - [test-OrderedScheduler-0-0:ManagedCursorImpl@553] - [my_test_ledger] Error opening metadata ledger -1 for cursor c1: No such ledger exists on Bookies
2024-10-30T12:38:15,593 - INFO  - [test-OrderedScheduler-0-0:ManagedCursorImpl@750] - [my_test_ledger] Cursor c1 recovered to position 3:-1
2024-10-30T12:38:15,593 - INFO  - [test-OrderedScheduler-0-0:MetaStoreImpl@257] - expectedVersion:0 [my_test_ledger] Creating consumer c1 on meta-data store with cursorsLedgerId: -1
markDeleteLedgerId: 3
markDeleteEntryId: -1
lastActive: 1730263095578

2024-10-30T12:38:15,601 - INFO  - [bookkeeper-ml-scheduler-OrderedScheduler-1-0:ManagedCursorImpl$25@2772] - ledger.getStore().asyncUpdateCursorInfo operation complete
2024-10-30T12:38:15,602 - WARN  - [bookkeeper-ml-scheduler-OrderedScheduler-1-0:ManagedCursorImpl$25@2779] - [my_test_ledger] Failed to update cursor metadata for c1 due to version conflict org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException:

java.lang.AssertionError:
Expected :true
Actual   :false
<Click to see difference>


	at org.testng.Assert.fail(Assert.java:110)
	at org.testng.Assert.failNotEquals(Assert.java:1577)
	at org.testng.Assert.assertTrue(Assert.java:56)
	at org.testng.Assert.assertTrue(Assert.java:66)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorTest.testForceCursorRecovery(ManagedCursorTest.java:4858)

From the log of failed test, we can see expectedVersion:0 [my_test_ledger] Creating consumer c1 on meta-data store with cursorsLedgerId: -1 twice. The root cause of this flaky test is we call the cb.openComplete() twice, which may cause race condition: when the first cursorInfo has not been saved, the second update cursorInfo started use the oldVersion, so when to save the second cursorInfo meta, it will throw BadVersion exception

So We don't need to add AtomicBoolean callbackInvoked = new AtomicBoolean(false);, we can just fix this to add return in TestPulsarMockBookKeeper.asyncOpenLedger or add else like

public void asyncOpenLedger(final long lId, final DigestType digestType, final byte[] passwd,
        final OpenCallback cb, final Object ctx) {
    if (ledgerErrors.containsKey(lId)) {
        cb.openComplete(ledgerErrors.get(lId), null, ctx);
    } else {
        super.asyncOpenLedger(lId, digestType, passwd, cb, ctx);
    }
}

Yes, doing an if else or adding a return is the way to truly solve the problem. Here are the test results:

企业微信截图_17302657085370

Copy link
Member

@nodece nodece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ZhaoGuorui666 ZhaoGuorui666 requested a review from lhotari November 4, 2024 08:53
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good work @ZhaoGuorui666. Thanks for putting effort in fixing flaky tests.

@nodece nodece merged commit d2d05c2 into apache:master Nov 5, 2024
52 checks passed
visxu pushed a commit to vissxu/pulsar that referenced this pull request Nov 6, 2024
lhotari pushed a commit that referenced this pull request Nov 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky-test: ManagedCursorTest.testForceCursorRecovery
6 participants