[Bug] MarkDeletePosition causes the corresponding ledger to not be deleted in borderline cases #19077

Open
1 of 2 tasks
wolfstudy opened this issue Dec 27, 2022 · 6 comments
Labels
Stale, type/bug

Comments

@wolfstudy
Member

wolfstudy commented Dec 27, 2022

Search before asking

  • I searched in the issues and found nothing similar.

Version

Bookie Version: 4.14.4
Broker Version: 2.9.2
OS: Linux CentOS 7

Minimal reproduce step

With the versions above, we found that some EntryLogs in Bookie could not be deleted for a long time. At first we suspected that minor GC or major GC was not being triggered, so we adjusted the minor GC and major GC thresholds, but this had little effect. We then scanned a problematic EntryLog file, obtained the list of all ledgers in it, fetched the ledger metadata for each ledger, and parsed out the topic corresponding to each ledger. We found the following:

  1. The ledger metadata is as follows:
2022-12-26 20:28:06,184 - INFO  - [main:ParseEntryLog@217] - ManagedLedger: LedgerID: 1174539 --- LedgerCreateTime: 1654146303273 --- LedgerState: CLOSED --- Ensembles: {0=[xxx:3181, xxx:3181, xxx:3181]} --- TopicName: pulsar-xxx/test/test2-partition-2
  2. The topic stats internal is as follows:

stats-internal.log

image
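For anyone who wants to repeat this check on their own scan output, here is a minimal sketch that splits the metadata log line quoted above into fields. The function names are hypothetical, and treating LedgerCreateTime as epoch milliseconds is an assumption based on the magnitude of the value, not something confirmed in this report.

```python
from datetime import datetime, timezone

# Log line copied from the report above (output of the reporter's scan tool).
LINE = (
    "2022-12-26 20:28:06,184 - INFO  - [main:ParseEntryLog@217] - ManagedLedger: "
    "LedgerID: 1174539 --- LedgerCreateTime: 1654146303273 --- LedgerState: CLOSED "
    "--- Ensembles: {0=[xxx:3181, xxx:3181, xxx:3181]} "
    "--- TopicName: pulsar-xxx/test/test2-partition-2"
)


def parse_ledger_line(line: str) -> dict:
    """Split the 'Key: value --- Key: value' tail of the line into a dict."""
    # Everything after the 'ManagedLedger: ' marker holds the fields.
    _, _, tail = line.partition("ManagedLedger: ")
    fields = {}
    for part in tail.split(" --- "):
        # Split on the first ': ' only, so values such as host:port stay intact.
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


def created_at(fields: dict) -> datetime:
    """Interpret LedgerCreateTime as epoch milliseconds (an assumption)."""
    millis = int(fields["LedgerCreateTime"])
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


fields = parse_ledger_line(LINE)
print(fields["LedgerID"], fields["LedgerState"], fields["TopicName"])
print(created_at(fields).strftime("%Y-%m-%d"))
```

Under that assumption the ledger was created in early June 2022, well before the scan date in the log line.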

The markDeletePosition of every subscription under this topic has advanced to the last message, which means that all messages in this topic have been consumed and acknowledged. However, the ledger has not been deleted. As a result, the proportion of valid data in the EntryLog does not meet the trigger condition for major GC, so the EntryLog persists for a long time and cannot be deleted.
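To illustrate why one undeleted ledger can pin a whole EntryLog, here is a simplified model of the usage-ratio check. It is a sketch, not BookKeeper's actual GC code: the function names, the byte sizes, the second ledger ID, and the 0.8 threshold are all hypothetical values chosen for illustration.

```python
def usage_ratio(ledger_bytes: dict, live_ledgers: set) -> float:
    """Fraction of the entry log's bytes that belong to ledgers still alive."""
    total = sum(ledger_bytes.values())
    live = sum(size for lid, size in ledger_bytes.items() if lid in live_ledgers)
    return live / total if total else 0.0


def is_compaction_candidate(ratio: float, threshold: float) -> bool:
    """In this model an entry log is compacted only when its ratio is below the threshold."""
    return ratio < threshold


# Hypothetical entry log: ledger 1174539 holds most of the bytes.
entry_log = {1174539: 900, 1170000: 100}
threshold = 0.8  # hypothetical major GC threshold

# Ledger 1174539 still exists, so 90% of the log counts as valid data.
stuck = usage_ratio(entry_log, live_ledgers={1174539})
print(stuck, is_compaction_candidate(stuck, threshold))

# Once the ledger is deleted, nothing in the log is valid any more.
freed = usage_ratio(entry_log, live_ledgers=set())
print(freed, is_compaction_candidate(freed, threshold))
```

In this model the log only becomes reclaimable after the ledger itself is deleted, which matches the behavior described above.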

Looking at the topic stats internal, the topic is currently at a boundary position: markDeletePosition points to the last message of the previous ledger, and the next (new) ledger contains no messages. So the question is whether a ledger in this boundary case cannot be deleted. Note that the ledger is already in the CLOSED state.

The following is the data in the EntryLog file scanned by the scan script:
entry.log

What did you expect to see?

The CLOSED ledger will be deleted.

What did you see instead?

The CLOSED ledger is not deleted.

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@wolfstudy wolfstudy added the type/bug label Dec 27, 2022
@gaoran10
Contributor

gaoran10 commented Dec 27, 2022

It seems that it's due to the read position still belonging to the ledger 1174539, so the trim ledger task will not check the ledger 1174539, it only checks ledgers less than 1174539.
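The selection rule described in this comment can be sketched as below. This is a simplified model of the comment, not the actual Pulsar trim code; the function name and the two neighboring ledger IDs (1174000 and 1174900) are hypothetical.

```python
def trim_candidates(ledger_ids, read_position_ledger_id):
    """Return ledgers strictly older than the ledger holding the read position."""
    return [lid for lid in sorted(ledger_ids) if lid < read_position_ledger_id]


# 1174000 and 1174900 are hypothetical neighbors of the ledger from this issue.
ledgers = [1174000, 1174539, 1174900]

# Read position still inside 1174539: that ledger is never examined.
print(trim_candidates(ledgers, 1174539))

# Read position moved to the next ledger: 1174539 becomes a candidate.
print(trim_candidates(ledgers, 1174900))
```

Under this model, a fully consumed ledger stays out of the candidate list for as long as the read position does not move into a newer ledger.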

@wolfstudy
Member Author

> It seems that it's due to the read position still belonging to the ledger 1174539, so the trim ledger task will not check the ledger 1174539, it only checks ledgers less than 1174539.

The handling of this boundary case looks wrong. The entries count in stats-internal starts from 1, but the messageID of the first message in a ledger is ledgerID:entryID = 1:0, so entry IDs start from 0. When MarkDeletePosition is at 1174539:226934, it means that 226935 messages in ledger 1174539 have been acked. We have also configured retention and TTL, with a maximum of no more than 10 days, and this ledger holds data from 3 months ago, so this data must already have been expired by TTL.
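The off-by-one between the entries count and the entry ID can be checked with a short sketch. The helper names are hypothetical, and the entries count of 226935 is inferred from the comment above rather than read from the attached stats.

```python
def parse_position(position: str) -> tuple:
    """Split a 'ledgerID:entryID' string into two integers."""
    ledger_id, entry_id = position.split(":")
    return int(ledger_id), int(entry_id)


def acked_count(mark_delete_position: str) -> int:
    """Entry IDs start at 0, so entry ID n means n + 1 acked entries."""
    _, entry_id = parse_position(mark_delete_position)
    return entry_id + 1


def ledger_fully_acked(mark_delete_position: str, ledger_id: int, entries: int) -> bool:
    """True when the mark-delete position covers the last entry of the ledger."""
    md_ledger, md_entry = parse_position(mark_delete_position)
    if md_ledger > ledger_id:
        return True
    return md_ledger == ledger_id and md_entry == entries - 1


position = "1174539:226934"
print(acked_count(position))
# entries=226935 is an assumption inferred from the comment above.
print(ledger_fully_acked(position, ledger_id=1174539, entries=226935))
```

If the ledger's entries count is indeed 226935, the mark-delete position sits exactly on its last entry, which is the boundary case under discussion.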

@wolfstudy
Member Author

wolfstudy commented Dec 28, 2022

Found a new clue: in this case, all ledger flags have the status NoLedger.

image

This is strange. From Bookie's perspective, the current state of this ledger is CLOSED, which we can determine from the ledger metadata. From the Broker's perspective, the state of this ledger is NoLedger, which we can confirm with the topic stats-internal:

image

In our scenario, reset cursor is never called; messages are expired through TTL. It seems that we have hit the same problem, but I am not sure whether it is related to the following fix:

@lhotari @michaeljmarshall PTAL thanks!

@github-actions

The issue had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Jan 28, 2023
@3286360470

Is this problem solved? Anyone else looking at this question?

@github-actions github-actions bot removed the Stale label Jul 1, 2023
@github-actions

github-actions bot commented Aug 1, 2023

The issue had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Aug 1, 2023
3 participants