
[fix][storage] refresh the ledgers map when the offload complete failed #17228

Closed
wants to merge 12 commits

Conversation

@zymap zymap (Member) commented Aug 23, 2022

Motivation

We found that an incorrect state can occur when the offload-complete step fails.
The metadata update failed with a connection loss exception even though the
ledger info was written to the meta store successfully, which leaves the
in-memory data different from the meta store. The offloader then removes the
previous offload information and cleans up the ledgers.
We have already added a retry when the meta store receives a connection loss exception.
This PR tries to ensure the in-memory data does not diverge from the meta
store when the exception is thrown but the data was already updated in the meta store.
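
For illustration, a minimal sketch of the idea (the MetaStore interface, LedgerInfo shape, and method names below are hypothetical placeholders, not the Pulsar API): when the offload-complete write fails, refresh the in-memory ledger map from the meta store instead of trusting the possibly stale local state, because the write may in fact have been applied on the server side.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentSkipListMap;

class OffloadCompleteRefreshSketch {

    // hypothetical store abstraction, not the Pulsar MetaStore API
    interface MetaStore {
        CompletableFuture<Void> markOffloadComplete(long ledgerId);
        CompletableFuture<Map<Long, String>> readLedgerList();
    }

    private final MetaStore metaStore;
    private final ConcurrentSkipListMap<Long, String> ledgers = new ConcurrentSkipListMap<>();

    OffloadCompleteRefreshSketch(MetaStore metaStore) {
        this.metaStore = metaStore;
    }

    CompletableFuture<Void> completeOffload(long ledgerId, String offloadedInfo) {
        return metaStore.markOffloadComplete(ledgerId)
                // success: apply the same change to the in-memory view
                .thenRun(() -> ledgers.put(ledgerId, offloadedInfo))
                // failure: the store may still have accepted the write (e.g. the
                // ack was lost), so re-read the list instead of leaving the
                // in-memory map possibly stale
                .handle((ok, ex) -> ex == null
                        ? CompletableFuture.<Void>completedFuture(null)
                        : metaStore.readLedgerList().thenAccept(fresh -> {
                            ledgers.clear();
                            ledgers.putAll(fresh);
                        }))
                .thenCompose(f -> f);
    }
}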

Modifications

Describe the modifications you've done.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes / no)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)
  • The rest endpoints: (yes / no)
  • The admin cli options: (yes / no)
  • Anything that affects deployment: (yes / no / don't know)

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@zymap zymap added the type/enhancement label Aug 23, 2022
@zymap zymap self-assigned this Aug 23, 2022
@github-actions github-actions bot added the doc-not-needed label Aug 23, 2022
@zymap zymap (Member Author) commented Aug 29, 2022

ping @hangc0276 @codelipenghui. Could you please take a look?

@codelipenghui codelipenghui (Contributor) left a comment

After checking the offloading process: can we just add a check in the
cleanupOffloaded() method? If we encounter a ZooKeeper operation timeout,
we should only clean up the offloaded data after the metadata has been refreshed.

I see you have added lastOffloadCompleteFailed and refreshedIfOffloadCompleteFailed. If I understand correctly, you want to avoid subsequent write operations based on the unrefreshed metadata. The managed ledger
already handles this case by updating the znode with ledgersStat.
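
A rough sketch of what such a guard could look like (the class, fields, and method signature here are simplified placeholders, not the actual ManagedLedgerImpl code):

class CleanupOffloadedGuardSketch {
    private volatile boolean metadataRefreshedSinceFailure = true;

    void onMetaStoreOperationFailed() {
        metadataRefreshedSinceFailure = false;
    }

    void onMetadataRefreshed() {
        metadataRefreshedSinceFailure = true;
    }

    void cleanupOffloaded(long ledgerId, String reason) {
        if (!metadataRefreshedSinceFailure) {
            // The in-memory ledger info may be stale; skip the cleanup until
            // a refresh confirms what is actually stored.
            return;
        }
        // ... delete the offloaded copy of ledgerId from tiered storage ...
    }
}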

Comment on lines 3250 to 3259
if (injection != null) {
    lastOffloadCompleteFailed = true;
    refreshedIfOffloadCompleteFailed = false;
    injection.throwException(ledgerId);
}
if (exception == null) {
    log.info("[{}] End Offload. ledger={}, uuid={}", name, ledgerId, uuid);
} else {
    lastOffloadCompleteFailed = true;
    refreshedIfOffloadCompleteFailed = false;
Contributor

Please check the code style.

refreshFuture.whenComplete((unused, throwable) -> {
    if (throwable != null) {
        log.error("Failed to refresh the ledger info list", throwable);
        unlockingPromise.completeExceptionally(throwable);
Contributor

Should we add a return here?
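
For illustration, a minimal sketch of the suggested early return (the surrounding class is simplified, the logger call is commented out, and completing the promise on success is an assumption about the omitted code):

import java.util.concurrent.CompletableFuture;

class RefreshCompletionSketch {
    static void propagate(CompletableFuture<Void> refreshFuture,
                          CompletableFuture<Void> unlockingPromise) {
        refreshFuture.whenComplete((unused, throwable) -> {
            if (throwable != null) {
                // log.error("Failed to refresh the ledger info list", throwable);
                unlockingPromise.completeExceptionally(throwable);
                return; // early return so the success path below does not also run
            }
            unlockingPromise.complete(null);
        });
    }
}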

Contributor

+1

    log.warn("[{}] Failed to complete offload of ledger {}, uuid {}",
            name, ledgerId, uuid, exception);
}
});
}

private Injection injection;
Contributor

Please do not add this mechanism to core classes.
We can use Mockito to interact with the internals and simulate problems.

@@ -220,6 +220,10 @@ public class ManagedLedgerImpl implements ManagedLedger, CreateCallback {
private long lastOffloadLedgerId = 0;
private long lastOffloadSuccessTimestamp = 0;
private long lastOffloadFailureTimestamp = 0;
@Getter
private boolean lastOffloadCompleteFailed = false;
Contributor

I don't think we are handling concurrent access to these fields correctly;
they should be at least volatile.
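
A sketch of the suggested change (enclosing class simplified, Lombok @Getter omitted):

class OffloadFlagsSketch {
    // volatile so that a write from the offload callback thread is visible to
    // other threads without extra locking (the same would apply to
    // refreshedIfOffloadCompleteFailed in the patch)
    private volatile boolean lastOffloadCompleteFailed = false;
}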

@hangc0276 (Contributor)

When completeLedgerInfoForOffloaded(ledgerId, uuid) fails, does it run into the next whenComplete part and call cleanupOffloaded(ledgerId, uuid, driverName, driverMetadata, "Metastore failure") to delete the ledger data from the tiered storage?

@zymap zymap (Member Author) commented Sep 6, 2022

When completeLedgerInfoForOffloaded(ledgerId, uuid) fails, does it run into the next whenComplete part and call cleanupOffloaded(ledgerId, uuid, driverName, driverMetadata, "Metastore failure") to delete the ledger data from the tiered storage?

Yes

@zymap zymap (Member Author) commented Sep 6, 2022

@hangc0276 @eolivelli @codelipenghui Thanks for your review!

I reconsidered and changed the implementation of the PR. Please take a look again. Thank you.

@zymap zymap (Member Author) commented Sep 6, 2022


This is another issue I want to mention; I can send it to the mailing list if you prefer.

I took a deeper look yesterday. What we want to resolve with this PR is keeping the ledgers map consistent between memory and the ZooKeeper server when offloading fails.

I saw that the Pulsar metadata handler retries the operation when ZooKeeper throws a connection loss exception, but the operation may still fail after the retry.

For example, we update the ledgers map in memory after successfully updating the LedgerInfo in ZooKeeper. If the ZooKeeper update executes successfully on the server but the client sees a connection loss, we retry on the connection loss exception, and the callback may then receive a BadVersion exception. At that moment the in-memory ledgers list differs from the ZooKeeper server, and that may cause other issues on the broker.

I'm not sure if I'm missing something, but it looks like there are many places in our code where we do not consider that situation.
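
A tiny self-contained simulation of that sequence (an AtomicLong stands in for a versioned znode; this is not ZooKeeper or the Pulsar metadata client):

import java.util.concurrent.atomic.AtomicLong;

class BadVersionAfterRetrySketch {
    // the "server side" version of the znode
    static final AtomicLong serverVersion = new AtomicLong(5);

    // versioned write: succeeds only when the expected version matches
    static boolean setData(long expectedVersion) {
        return serverVersion.compareAndSet(expectedVersion, expectedVersion + 1);
    }

    public static void main(String[] args) {
        long expected = 5;
        boolean first = setData(expected);  // applied on the server (version is now 6) ...
        // ... but pretend the ack was lost (connection loss), so the client retries:
        boolean retry = setData(expected);  // fails -> this is the BadVersion the callback sees
        System.out.println("first=" + first + " retry=" + retry); // first=true retry=false
    }
}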

@zymap zymap (Member Author) commented Sep 7, 2022

ping @eolivelli @hangc0276 @codelipenghui

@codelipenghui codelipenghui added this to the 2.11.0 milestone Sep 7, 2022
@eolivelli eolivelli (Contributor) left a comment

overall LGTM
I have left some suggestions

@Override
public void operationComplete(ManagedLedgerInfo mlInfo, Stat stat) {
    ledgersStat = stat;
    synchronized (this) {
Contributor

What about an explicit:
synchronized (ManagedLedgerImpl.this) {

}
}
}
metadataMutex.unlock();
Contributor

this should be in a finally block
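
A sketch of the suggested shape (a ReentrantLock stands in for the managed ledger's metadataMutex, and the lock is assumed to have been acquired before the asynchronous metadata read was issued):

import java.util.concurrent.locks.ReentrantLock;

class UnlockInFinallySketch {
    private final ReentrantLock metadataMutex = new ReentrantLock();

    void operationComplete(/* ManagedLedgerInfo mlInfo, Stat stat */) {
        try {
            synchronized (this) {
                // ... apply the refreshed ledger list to the in-memory map ...
            }
        } finally {
            // released even if applying the refreshed list throws
            metadataMutex.unlock();
        }
    }
}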

@Jason918 Jason918 changed the title [fix][ml] refresh the ledgers map when the offload complete failed [fix][storage] refresh the ledgers map when the offload complete failed Sep 7, 2022
@zymap zymap (Member Author) commented Sep 8, 2022

@eolivelli PTAL. thank you

public void operationComplete(ManagedLedgerInfo mlInfo, Stat stat) {
    ledgersStat = stat;
    try {
        synchronized (ManagedLedgerImpl.this) {
Contributor

I think there could be an issue here: maybe another operation updated the ledger list (taking the lock first), and then this code takes the lock, which will mess up the ledger list again.

Member Author

All ledger update operations should be guarded by the metadata lock, because we need to make sure the ledger stat is the latest version.
The synchronized block makes sure there is no remove/add operation while a ledger is closing or being created.

I haven't found other places that operate on the map without locks. Do you know of other places that still have concurrency issues?

Contributor

The synchronized block makes sure there is no remove/add operation while a ledger is closing or being created.

Yes, that is what I wanted to say: if there are other operations that update the ledgers, it will introduce a problem.

For example:

  1. asyncRefreshLedgersInfoOnBadVersion gets the metadataMutex lock
  2. we get the returned ManagedLedgerInfo at line 2370
  3. another operation changes the ledgers; we are waiting here at line 2373
  4. after the ledgers have changed, we continue to run lines 2374 - 2377
  5. we lose the changes made at step 3

It might also introduce a deadlock: another thread could hold the synchronized lock while waiting for the metadataMutex, while this code holds the metadataMutex and waits for the synchronized lock, so neither can release its lock.

Contributor

The method synchronized void ledgerClosed(final LedgerHandle lh) updates the ledgers with only the synchronized lock, not the metadataMutex.

Member Author

Because when we use the metadataMutex we only try to lock it, I think it won't cause a deadlock.

When we hold the metadataMutex, all the other places I checked will retry; they won't update the ledgers or metadata successfully. They have to wait for the refresh to finish before doing anything else.

@@ -2366,6 +2402,9 @@ private void maybeOffload(CompletableFuture<PositionImpl> finalPromise) {
unlockingPromise.whenComplete((res, ex) -> {
    offloadMutex.unlock();
    if (ex != null) {
        if (FutureUtil.unwrapCompletionException(ex) instanceof ManagedLedgerException) {
Contributor

It looks like we don't need the check here? asyncRefreshLedgersInfoOnBadVersion already checks the exception type.

Member Author

Because the exception may be a CompletionException, we need to unwrap it to make sure it is a ManagedLedgerException.
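
For reference, a generic sketch of the unwrap-then-check pattern being described (not the FutureUtil implementation itself):

import java.util.concurrent.CompletionException;

class UnwrapSketch {
    static Throwable unwrap(Throwable t) {
        // async callbacks often receive the real cause wrapped in a CompletionException
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }
    // callers can then check: unwrap(ex) instanceof ManagedLedgerException
}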

@zymap zymap (Member Author) commented Sep 9, 2022

@codelipenghui PTAL

@zymap zymap (Member Author) commented Sep 9, 2022

ping @eolivelli

@zymap zymap (Member Author) commented Sep 14, 2022

ping @eolivelli @codelipenghui

@eolivelli eolivelli (Contributor) left a comment

LGTM

@@ -2957,7 +2998,12 @@ public void asyncOffloadPrefix(Position pos, OffloadCallback callback, Object ct
promise.whenComplete((result, exception) -> {
    offloadMutex.unlock();
    if (exception != null) {
        callback.offloadFailed(new ManagedLedgerException(exception), ctx);
        Throwable t = FutureUtil.unwrapCompletionException(exception);
Contributor

Why do we call asyncRefreshLedgersInfoOnBadVersion in offloadPrefix but not here?
asyncOffloadPrefix is a public method.

Member Author

Great catch! We need to do the refresh here

@Jason918 (Contributor)

For example, we update the ledgers map in memory after successfully updating the LedgerInfo in ZooKeeper. If the ZooKeeper update executes successfully on the server but the client sees a connection loss, we retry on the connection loss exception, and the callback may then receive a BadVersion exception. At that moment the in-memory ledgers list differs from the ZooKeeper server, and that may cause other issues on the broker.

@zymap I wonder why we don't handle this case in the metadata layer: retry the write operation if we get BadVersion after a connection loss, so the exception is not thrown up to the broker module. That way the fix would also apply to other metadata writes.

@zymap zymap (Member Author) commented Sep 15, 2022

@Jason918 Because we cannot know what caused the bad version. If the data was written by another broker in some cases, we will also get a bad version.
The bad-version case is complicated; we suspect this scenario exists, but we don't have evidence for it yet.

I added some logs in another PR to check it.

@Jason918 (Contributor)

Because we cannot know what caused the bad version. If the data was written by another broker in some cases, we will also get a bad version.
The bad-version case is complicated; we suspect this scenario exists, but we don't have evidence for it yet.

I see, we store ledgersStat locally. The "refresh" approach makes sense to me.

synchronized (ManagedLedgerImpl.this) {
    for (LedgerInfo li : mlInfo.getLedgerInfoList()) {
        long ledgerId = li.getLedgerId();
        ledgers.put(ledgerId, li);
Contributor

is it possible some ledgerIds need to be removed (anything not present in mlInfo.getLedgerInfoList())?
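
For illustration, a sketch of what a full reconciliation including removals could look like (placeholder types and names, not the actual patch):

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListMap;

class RefreshLedgersSketch {
    record LedgerInfo(long ledgerId) { }

    private final ConcurrentSkipListMap<Long, LedgerInfo> ledgers = new ConcurrentSkipListMap<>();

    synchronized void applyRefreshedList(Iterable<LedgerInfo> fromStore) {
        Set<Long> seen = new HashSet<>();
        for (LedgerInfo li : fromStore) {
            ledgers.put(li.ledgerId(), li);   // add or overwrite what the store reports
            seen.add(li.ledgerId());
        }
        ledgers.keySet().retainAll(seen);     // drop entries the store no longer has
    }
}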

Member Author

Because we trigger this operation after offloading fails, there shouldn't be any remove operation on the metadata store. If there were a remove, it would have succeeded or failed before the offload, right?

@Jason918 (Contributor)

Discussed with @zymap: this should be merged after #17512 helps confirm the root cause of this issue. Moving to release/2.10.3.

FYI @codelipenghui

@eolivelli eolivelli (Contributor) commented Sep 20, 2022

This patch is related to this problem, but from another perspective.
#17736

@eolivelli eolivelli (Contributor) left a comment

I am not sure about this patch anymore.
In the case of BadVersion we don't know the cause of the inconsistency,
and we should run through the whole recovery procedure of the ledger.
See my linked PR.

@zymap zymap (Member Author) commented Sep 20, 2022

I will close this PR.

Yours should be a better way to handle this case.

@zymap zymap closed this Sep 20, 2022
Labels
doc-not-needed (Your PR changes do not impact docs), release/2.8.5, release/2.9.4, release/2.10.3, type/enhancement (The enhancements for the existing features or docs, e.g. reduce memory usage of the delayed messages)

6 participants