Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix][broker] Fix NPE when ledger id not found in OpReadEntry #15837

Merged
merged 2 commits into from
Jun 7, 2022
Merged

[Fix][broker] Fix NPE when ledger id not found in OpReadEntry #15837

merged 2 commits into from
Jun 7, 2022

Conversation

mattisonchao
Copy link
Member

Motivation

In current implementation, We have more than one potential NPE relate ManagedLedgerImpl#startReadOperationOnLedger method.

1. Unbox NPE

PositionImpl startReadOperationOnLedger(PositionImpl position, OpReadEntry opReadEntry) {
Long ledgerId = ledgers.ceilingKey(position.getLedgerId());
if (null == ledgerId) {
opReadEntry.readEntriesFailed(new ManagedLedgerException.NoMoreEntriesToReadException("The ceilingKey(K key"
+ ") method is used to return the least key greater than or equal to the given key, "
+ "or null if there is no such key"), null);
}
if (ledgerId != position.getLedgerId()) {
// The ledger pointed by this position does not exist anymore. It was deleted because it was empty. We need
// to skip on the next available ledger
position = new PositionImpl(ledgerId, 0);
}
return position;

According to this code, we can know that when the ledger id is null, we will fail OpReadEntry. However, there is no direct return here after failure. NPE will be thrown on line 2238. Because we want to unbox null.

2. null OpReadEntry context

According to the code above, we can see when we fail OpReadEntry, we pass a null value as context(line 2235). there is an NPE when the dispatcher gets the callback.

private synchronized void internalReadEntriesFailed(ManagedLedgerException exception, Object ctx) {
havePendingRead = false;
ReadEntriesCtx readEntriesCtx = (ReadEntriesCtx) ctx;
Consumer c = readEntriesCtx.getConsumer();
readEntriesCtx.recycle();
long waitTimeMillis = readFailureBackoff.next();

3. OpReadEntry#create

public static OpReadEntry create(ManagedCursorImpl cursor, PositionImpl readPositionRef, int count,
ReadEntriesCallback callback, Object ctx, PositionImpl maxPosition) {
OpReadEntry op = RECYCLER.get();
op.readPosition = cursor.ledger.startReadOperationOnLedger(readPositionRef, op);
op.cursor = cursor;
op.count = count;
op.callback = callback;
op.entries = Lists.newArrayList();
if (maxPosition == null) {
maxPosition = PositionImpl.LATEST;
}
op.maxPosition = maxPosition;
op.ctx = ctx;
op.nextReadPosition = PositionImpl.get(op.readPosition);
return op;
}

When we create OpReadEntry and then call ManagedCursor#startReadOperationOnLedger , assuming the ledger id is equal to null. At this point, we will fail this OpReadEntry. But at the current time, other parameters are not initialized. When we call OpReadEntry#readEntriesFailed. The cursor will be null and the NPE will be thrown.

@Override
public void readEntriesFailed(ManagedLedgerException exception, Object ctx) {
cursor.readOperationCompleted();
if (!entries.isEmpty()) {

Modifications

  • Remove fail OpReadEntry logic in ManagedCursor#startReadOperationOnLedger, when ledger id equals null, we can return the original value. the ManagedLedger will validate if this operation is legal.

From the perspective of the overall design, the OpReadEntry is just a middle state object, that may have illegal value, we have to check this OpReadEntry is valid in ManagedLedger#asyncReadEntries(Current we already do that).

Verifying this change

  • Make sure that the change passes the CI checks.

Documentation

  • doc-not-needed
    (Please explain why)

@mattisonchao mattisonchao self-assigned this May 30, 2022
@mattisonchao mattisonchao added this to the 2.11.0 milestone May 30, 2022
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label May 30, 2022
@mattisonchao mattisonchao added area/broker and removed doc-not-needed Your PR changes do not impact docs labels May 30, 2022
@github-actions
Copy link

@mattisonchao:Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@HQebupt
Copy link
Contributor

HQebupt commented May 30, 2022

/pulsarbot run-failure-checks

@Anonymitaet Anonymitaet added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels May 31, 2022
@mattisonchao mattisonchao reopened this Jun 2, 2022
@mattisonchao
Copy link
Member Author

mattisonchao commented Jun 7, 2022

After discussing with @Technoboy-, we may find why the ledgers.ceilingKey get Null; please consider this condition:

Precondition

  • We have a ledger 12222:-1 at the ManagedLedger ledgers.
  • We have an inactive cursor that read position is 12222:0.

Cases

  1. When a new producer wants to produce a message on the topic but adds an entry to BK failed. It will prepare to close the current ledger and create a new one. because the current ledger entries are 0. So, we decided to delete it immediately. The logic shows below:

synchronized void ledgerClosed(final LedgerHandle lh) {
final State state = STATE_UPDATER.get(this);
LedgerHandle currentLedger = this.currentLedger;
if (currentLedger == lh && (state == State.ClosingLedger || state == State.LedgerOpened)) {
STATE_UPDATER.set(this, State.ClosedLedger);
} else if (state == State.Closed) {
// The managed ledger was closed during the write operation
clearPendingAddEntries(new ManagedLedgerAlreadyClosedException("Managed ledger was already closed"));
return;
} else {
// In case we get multiple write errors for different outstanding write request, we should close the ledger
// just once
return;
}
long entriesInLedger = lh.getLastAddConfirmed() + 1;
if (log.isDebugEnabled()) {
log.debug("[{}] Ledger has been closed id={} entries={}", name, lh.getId(), entriesInLedger);
}
if (entriesInLedger > 0) {
LedgerInfo info = LedgerInfo.newBuilder().setLedgerId(lh.getId()).setEntries(entriesInLedger)
.setSize(lh.getLength()).setTimestamp(clock.millis()).build();
ledgers.put(lh.getId(), info);
} else {
// The last ledger was empty, so we can discard it
ledgers.remove(lh.getId());
mbean.startDataLedgerDeleteOp();
bookKeeper.asyncDeleteLedger(lh.getId(), (rc, ctx) -> {
mbean.endDataLedgerDeleteOp();
log.info("[{}] Delete complete for empty ledger {}. rc={}", name, lh.getId(), rc);
}, null);
}
trimConsumedLedgersInBackground();
maybeOffloadInBackground(NULL_OFFLOAD_PROMISE);
if (!pendingAddEntries.isEmpty()) {
// Need to create a new ledger to write pending entries
createLedgerAfterClosed();
}
}

After line:1643, we may have empty ledgers.

  1. In the meantime, the new subscription wants to read messages and create the OpReadEntry. but the ledgers are empty then NPE occur at startReadOperationOnLedger method.

/cc @Technoboy- @codelipenghui

@mattisonchao
Copy link
Member Author

And because the ManagedLedger has the state to check if we can do something. So, I think the null check at OpReadEntry is enough.

Copy link
Contributor

@gaoran10 gaoran10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gaoran10
Copy link
Contributor

gaoran10 commented Jun 7, 2022

Maybe we could add a unit test to reproduce this problem, for example reading a position greater than max ledger id exists in ManagedLedgerImpl#ledgers? Or the ledgers is an empty set.

@mattisonchao
Copy link
Member Author

@gaoran10 Got it, I will do that.

@mattisonchao
Copy link
Member Author

Maybe we could add a unit test to reproduce this problem, for example reading a position greater than max ledger id exists in ManagedLedgerImpl#ledgers? Or the ledgers is an empty set.

Done, PTAL.

@mattisonchao mattisonchao requested a review from gaoran10 June 7, 2022 06:27
@eolivelli eolivelli merged commit 7a3ad61 into apache:master Jun 7, 2022
codelipenghui pushed a commit that referenced this pull request Jun 10, 2022
@codelipenghui codelipenghui added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Jun 10, 2022
codelipenghui pushed a commit that referenced this pull request Jun 28, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jul 4, 2022
BewareMyPower pushed a commit that referenced this pull request Jul 27, 2022
@BewareMyPower BewareMyPower added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Jul 27, 2022
Jason918 pushed a commit to Jason918/pulsar that referenced this pull request Aug 7, 2022
@Jason918
Copy link
Contributor

Jason918 commented Aug 7, 2022

move release/2.7.5 to #16966

This was referenced Apr 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker cherry-picked/branch-2.8 Archived: 2.8 is end of life cherry-picked/branch-2.9 Archived: 2.9 is end of life cherry-picked/branch-2.10 doc-not-needed Your PR changes do not impact docs release/2.8.4 release/2.9.3 release/2.10.2 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants