Consumer not persisting Cursor #180
@merlimat @rdhabalia we really need help with this; it's causing issues for us in production.
@sschepens A few questions to try to identify the issue:
Given the current implementation (using a Guava rate limiter), subsequent acks are tracked in memory. The problem might be that, if there are no more writes on the topic, the persisted position for the cursor doesn't get a chance to be updated.
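To make that mechanism concrete, here is a minimal, hypothetical sketch of the behavior described above (not Pulsar's actual ManagedCursor code), assuming a Guava RateLimiter gates how often the mark-delete position is written to durable storage:

```java
import com.google.common.util.concurrent.RateLimiter;

// Hypothetical illustration only: class and field names are made up, and the
// "persist" step stands in for a write to BookKeeper/ZooKeeper.
class RateLimitedCursorSketch {
    // Roughly one durable update per second; everything else stays in memory.
    private final RateLimiter persistLimiter = RateLimiter.create(1.0);
    private volatile long inMemoryMarkDelete = -1;  // latest acked position (memory only)
    private volatile long persistedMarkDelete = -1; // last position flushed to storage

    void acknowledge(long position) {
        inMemoryMarkDelete = position;
        // Persistence is piggy-backed on the ack path: if no further acks arrive,
        // the in-memory position is never flushed and is lost on broker restart.
        if (persistLimiter.tryAcquire()) {
            persistedMarkDelete = position;
        }
    }
}
```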
Yes it is; we also tried consuming harder.
It wouldn't appear so, as backlog reported by broker before restarting is 0, which would confirm that it considers all messages were acked.
We're already using ack timeout for all consumers.
I'll try, we're now using
Hmm, shouldn't this at least have a way of checking when the state has not been saved for some time and forcing a save?
Right now all the checking is done when processing an acknowledgment. There was a plan to fix that by going through all the topics every 1 minute and flushing all the pending cursor updates, though it's still like this.
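A rough sketch of that idea (hypothetical names, not the actual fix): a background task flushes whatever position is still pending in memory, so the cursor gets persisted even when acknowledgments stop arriving:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of a periodic cursor flush; the "persist" step
// again stands in for a write to BookKeeper/ZooKeeper.
class PeriodicCursorFlushSketch {
    private final AtomicLong pendingMarkDelete = new AtomicLong(-1);   // updated on every ack
    private final AtomicLong persistedMarkDelete = new AtomicLong(-1); // last durably stored position
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicCursorFlushSketch() {
        // Every minute, write out the pending position if it changed.
        scheduler.scheduleAtFixedRate(this::flushIfDirty, 1, 1, TimeUnit.MINUTES);
    }

    void acknowledge(long position) {
        pendingMarkDelete.set(position);
    }

    private void flushIfDirty() {
        long pending = pendingMarkDelete.get();
        if (pending != persistedMarkDelete.get()) {
            persistedMarkDelete.set(pending);
        }
    }
}
```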
@merlimat does changing
it requires re-loading the topics,
Are these consumers on the same subscription or in 4 different subscriptions?
4 different subscriptions; all are shared subscriptions and are hosted on the same 2 instances.
Ok, anything in the logs that could signal there was any failure in recovering the cursor after the unload?
I couldn't find any log specifying a failure recovering the cursor; on the contrary, the cursor appeared to be recovered correctly, but at a really old position.
If you're able to reproduce it, can you paste the stats-internal for the topic (after consuming and before restarting)?
I wasn't able to reproduce this again, but we had already set markDeleteRatio=0, and also reset the consumer.
@merlimat this has happened again for different consumers when restarting a broker. We have the following persistence settings for the namespace:
This is absolutely not tolerable; we must find a way to fix this.
By the way, these are all partitioned topics with 10 partitions.
@sschepens as @merlimat mentioned: is it possible to get stats-internal for a given topic before restarting the server?
@rdhabalia I'm checking this; we restarted another broker and the backlog bumped up yet again, and the consumer had no individuallyDeletedMessages pending...
@sschepens In the stats-internal you can also see the real cursor position, apart from the individuallyDeletedMessages. That could give a few hints on where to look. Also, can you attach the logs for that particular topic, before/after restart? That might be useful as well.
@sschepens also, having both the internal and normal stats might be useful in setting up a quick test to try to reproduce the issue.
@merlimat here I upload the topic stats-internal for each partition before and after restart. As the name indicates, the first one is a test consumer; I'm more interested in the second one. I can see that the first did have several unacked messages before restarting; even so, with each restart it seems to get more and more lagged.
I'm attaching the logs of the instance before and after restart.
We're collecting logs mentioning the topic from all other instances after the restart.
@sschepens You have time-based message retention enabled, correct? (I'm seeing older ledgers are not being deleted, just wanted to make sure.)
@sschepens about the violet consumer, in the logs you pasted, does it get backlog on all the partitions? I don't see any problem in the "after" stats for that subscription.
These are the logs of all of our brokers before and after restart.
I cannot know that right now; we don't have our metrics split by partition, and we didn't get the normal stats before and after.
Yes, this is our retention policy: {
"retentionTimeInMinutes" : 10080,
"retentionSizeInMB" : 2147483647
}
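For reference, a retention policy like the one above can also be applied with the Java admin client; a rough sketch, where the service URL and namespace name are placeholders:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class SetRetentionExample {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL and namespace; adjust for the real cluster.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceUrl("http://localhost:8080")
                .build()) {
            // 10080 minutes (7 days) of time-based retention, effectively unlimited size.
            admin.namespaces().setRetention("my-tenant/my-namespace",
                    new RetentionPolicies(10080, 2147483647));
        }
    }
}
```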
@merlimat I would expect the increased backlog to come only from partitions that got unloaded from the restarted broker, which should be visible in the logs.
@sschepens My point is that, looking at the "after" internal stats, the cursor for f6f49168781f402c99ddfa871bc0e90c-fury-orders.purchases-api doesn't show backlog: {
"entriesAddedCounter": 36299,
"numberOfEntries": 6101124,
"totalSize": 784103746,
"currentLedgerEntries": 36299,
"currentLedgerSize": 4663005,
"lastLedgerCreatedTimestamp": "2017-02-03 17:47:24.912",
"lastLedgerCreationFailureTimestamp": null,
"waitingCursorsCount": 5,
"pendingAddEntriesCount": 0,
"lastConfirmedEntry": "69312:36298",
"state": "LedgerOpened",
"ledgers": [
....
"f6f49168781f402c99ddfa871bc0e90c-fury-orders.purchases-api": {
"markDeletePosition": "69312:36297",
"readPosition": "69312:36299",
"waitingReadOp": true,
"pendingReadOps": 0,
"messagesConsumedCounter": 36298,
"cursorLedger": 69336,
"cursorLedgerLastEntry": 31461,
"individuallyDeletedMessages": "[]",
"lastLedgerSwitchTimestamp": "2017-02-03 17:47:24.972",
"state": "Open"
},
.... If you subtract …
Ok, found the same thing on partition-6: "f6f49168781f402c99ddfa871bc0e90c-fury-orders.purchases-api": {
"markDeletePosition": "59439:44524",
"readPosition": "61542:30970",
"waitingReadOp": false,
"pendingReadOps": 0,
"messagesConsumedCounter": -986146,
"cursorLedger": -1,
"cursorLedgerLastEntry": -1,
"individuallyDeletedMessages": "[(59439:44525‥60577:45163], (60577:45164‥60886:818], (60886:819‥60886:21168], (60886:21169‥60886:27590], (60886:27591‥60886:41528], (60886:41529‥60886:49252], (60886:49253‥61145:10375], (61145:10376‥61145:36353], (61145:36354‥61542:7158], (61542:7159‥61542:14589], (61542:14590‥61542:26054], (61542:26055‥61542:29862], (61542:29863‥61542:30509], (61542:30510‥61542:30515], (61542:30516‥61542:30517], (61542:30518‥61542:30534], (61542:30535‥61542:30537], (61542:30538‥61542:30556], (61542:30557‥61542:30559], (61542:30560‥61542:30563], (61542:30565‥61542:30566], (61542:30567‥61542:30568], (61542:30569‥61542:30573], (61542:30577‥61542:30581], (61542:30582‥61542:30583], (61542:30585‥61542:30587], (61542:30589‥61542:30591], (61542:30610‥61542:30613], (61542:30619‥61542:30637], (61542:30638‥61542:30640], (61542:30641‥61542:30643], (61542:30647‥61542:30649], (61542:30672‥61542:30674], (61542:30675‥61542:30677], (61542:30678‥61542:30679]]",
"lastLedgerSwitchTimestamp": "2017-02-03 18:33:43.431",
"state": "NoLedger"
},
So, that cursor had a single unacked message (59439:44525) before restart: "f6f49168781f402c99ddfa871bc0e90c-fury-orders.purchases-api": {
"markDeletePosition": "59439:44524",
"readPosition": "70009:12925",
"waitingReadOp": true,
"pendingReadOps": 0,
"messagesConsumedCounter": 12922,
"cursorLedger": -1,
"cursorLedgerLastEntry": -1,
"individuallyDeletedMessages": "[(59439:44525‥70009:12922]]",
"lastLedgerSwitchTimestamp": "2017-02-03 18:16:48.247",
"state": "NoLedger"
},
The same message still appears as "unacked" even after restart. Is it possible that the consumer is getting some unexpected data and fails to consume that particular message? You can dump the content of the message with: $ pulsar-admin persistent peek-messages $MY_TOPIC -s $MY_SUBSCRIPTION
The contents of the message seem to be fine. For some reason it is still unacked.
The message is from yesterday; for some reason it was never consumed.
So, if you have set the ack timeout, this message should keep getting resent to the consumer. You can enable debug logs (either on the broker or the consumer) to check that.
@merlimat this message is effectively being redelivered constantly.
Just to clarify, is the consumer acknowledging the message? About how to detect that: one way is to monitor the storage size constantly increasing (though in your case, with retention, that is not feasible). Otherwise, monitor the re-delivery rate for the consumers (in the topic stats). About the dead-letter queue: it's an interesting topic and I would support that, just not enabled by default. Potentially it could be an option on the namespace itself to define a dead-letter topic to dump messages into after N delivery attempts.
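A rough sketch of the stats-based approach with the Java admin client; the topic name and service URL are placeholders, and the getter names are the ones used by recent client releases (older clients expose public fields instead):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class RedeliveryMonitorSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL and topic name.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceUrl("http://localhost:8080")
                .build()) {
            TopicStats stats = admin.topics().getStats(
                    "persistent://my-tenant/my-namespace/my-topic-partition-0");
            // A non-zero, steady redelivery rate for a subscription hints at a
            // message that keeps being redelivered without ever being acked.
            stats.getSubscriptions().forEach((name, sub) ->
                    System.out.printf("%s redeliver rate: %.1f msg/s%n",
                            name, sub.getMsgRateRedeliver()));
        }
    }
}
```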
No, because it's failing to process it.
I don't think the message re-delivery rate is going to work; messages are potentially being redelivered constantly because they could spend
Of course, dead-letter queues should be optional, but there are also things to consider: should there be a dead-letter topic for each consumer? If not, how would we handle this?
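One possible interim answer at the application level (not a broker feature): a consumer keeps its own per-message delivery count and republishes to a per-subscription dead-letter topic after N attempts. A rough sketch, with all topic and subscription names as placeholders:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class DeadLetterSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder names: topic, subscription, and DLQ topic are hypothetical.
        String topic = "persistent://my-tenant/my-namespace/my-topic";
        String dlqTopic = topic + "-my-subscription-dlq"; // one DLQ per subscription
        int maxAttempts = 5;

        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic(topic)
                .subscriptionName("my-subscription")
                .subscriptionType(SubscriptionType.Shared)
                .ackTimeout(30, TimeUnit.SECONDS)
                .subscribe();

        Producer<byte[]> dlqProducer = client.newProducer()
                .topic(dlqTopic)
                .create();

        // Delivery attempts tracked in consumer memory only (lost on restart).
        Map<MessageId, Integer> attempts = new HashMap<>();

        while (true) {
            Message<byte[]> msg = consumer.receive();
            int n = attempts.merge(msg.getMessageId(), 1, Integer::sum);
            try {
                process(msg); // application logic; throws on failure
                consumer.acknowledge(msg);
                attempts.remove(msg.getMessageId());
            } catch (Exception e) {
                if (n >= maxAttempts) {
                    // Give up: park the payload on the dead-letter topic and ack,
                    // so the subscription's mark-delete position can move forward.
                    dlqProducer.send(msg.getData());
                    consumer.acknowledge(msg);
                    attempts.remove(msg.getMessageId());
                } else {
                    consumer.negativeAcknowledge(msg); // or just let the ack timeout fire
                }
            }
        }
    }

    private static void process(Message<byte[]> msg) {
        // placeholder for real processing
    }
}
```

Recent Pulsar clients also offer a built-in dead-letter policy on the consumer builder that covers roughly this pattern, but it would not have been available at the time of this thread.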
Another thought, @merlimat. The line of thought is: what can we possibly do to prevent this from happening? (A consumer re-processing all messages when a broker restarts because it did not acknowledge a message.)
@merlimat I'm thinking: what would happen if a consumer like the one we had issues with has a single message that it cannot process, and that message gets expired via TTL or retention policies, would the broker immediately discard it from
Yes, we thought about that many times. The thing is, storing that information avoids the big backlog on restart but doesn't address the root cause of why the message wasn't acked. In the end, having a hole in the ack sequence is always the symptom of a separate problem. And for the system, the data cannot be removed from disk, because the storage is not a traditional key-value store where you can point-delete data. About the size of the … To summarize my point of view:
When the broker applies TTL, it will acknowledge the messages and thus close the gap and move the mark-delete position forward.
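For completeness, a namespace-level message TTL can be set with the Java admin client, roughly like this (the service URL, namespace name, and TTL value are placeholders):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class SetMessageTtlExample {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL, namespace, and TTL value.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceUrl("http://localhost:8080")
                .build()) {
            // Messages older than 24 hours are expired (and acknowledged) by the broker.
            admin.namespaces().setNamespaceMessageTTL("my-tenant/my-namespace", 86400);
        }
    }
}
```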
Closing this one since the changes are already implemented and merged in #276 |
Original issue description:
We have 4 consumers for the same topic, all supposedly up to date (0 backlog).
But when restarting brokers, thus loading cursor state from ZooKeeper and BookKeeper, we consistently get one consumer reset a long way backwards.
Looking at the ZooKeeper metadata, that consumer is the only one with an outdated markDeleteLedgerId and markDeleteEntryId, and this happens for all 10 partitions of the consumer.
Not only does this consumer have an outdated cursor, it also seems to not be updating it as it consumes the backlog generated when a broker is restarted and the consumer is thus reset.