Catastrophic frequent random subscription freezes, especially on high-traffic topics. #6054
Comments
FWIW, @devinbost I am suspicious of the netty version in the 2.4.X stream. There is a known memory usage issue (netty/netty#8814) in the 4.1.32 version of netty that is used in that stream. When I was testing, I would see issues like you are describing. Then I patched in the latest version of netty, and things seemed better. I have been running with a patched 2.4.2 for a while and have not seen any issues. You may be experiencing something completely different, but it might be worth trying 2.4.2 + latest netty. |
@devinbost have you looked into |
@sijie We have, but I can't remember exactly what our findings were. |
@cdbartholomew We actually started experiencing this issue before 2.4.0. |
I noticed that each topic lives on a single broker, which creates a single point of failure. |
We (StreamNative) have been helping folks from Tencent develop a feature called ReadOnly brokers. It allows a topic to have multiple owners (one writable owner and multiple read-only owners). It has been running in production for a while. They will contribute it back soon. |
Incorrect permits have been one of the main reasons for consumers becoming stalled. You can use the "unload" command to unload a topic or a namespace bundle to trigger a consumer reconnect. It resets the consumer state to mitigate the problem. @codelipenghui and @jiazhai are working on a proposal to improve the permits protocol. I am not sure if your problem is related to permits, but if the same problem occurs, the first thing you should do is to use |
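For reference, a minimal sketch of the unload mitigation described above, assuming a hypothetical topic persistent://public/default/my-topic in the public/default namespace and a working pulsar-admin CLI:

```bash
# Inspect subscription state first; look at msgRateOut and each consumer's availablePermits
bin/pulsar-admin topics stats persistent://public/default/my-topic

# Unload the topic so it is reassigned and consumers are forced to reconnect
bin/pulsar-admin topics unload persistent://public/default/my-topic

# Or unload the whole namespace (all of its bundles)
bin/pulsar-admin namespaces unload public/default
```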
Interestingly, the available permits seem to be fine (1000). Are you able to get some more stats so that we can help debug? |
We will need to capture some stats when this happens next. After my team made some changes to improve the stability of the Zookeeper cluster, the frequency of this issue decreased on v2.4.0. So, we will need to update one of the clusters to use 2.4.2 to reproduce this issue again. |
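One possible way to capture stats automatically the next time a topic freezes (a sketch only; the topic name is a placeholder and the 30-second interval is arbitrary):

```bash
# Snapshot topic stats and internal stats periodically while the freeze is happening
mkdir -p stats
while true; do
  ts=$(date +%Y%m%dT%H%M%S)
  bin/pulsar-admin topics stats persistent://public/default/my-topic > "stats/stats-$ts.json"
  bin/pulsar-admin topics stats-internal persistent://public/default/my-topic > "stats/internal-$ts.json"
  sleep 30
done
```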
I also noticed that some of the bookkeeper nodes are currently running: I don't know if that would have anything to do with the issue. |
Usually, this kind of problem is not related to BookKeeper. Is the broker running the same version, and what is that version? |
The brokers are all running |
The brokers are also running as the function workers. (There aren't dedicated function worker instances.) |
Given it is running a special version, it is hard for us to realize if there are any special changes in that version. I think it is better to provide some jstack or heap dump when this problem happens. Otherwise it is hard for the community to help here. |
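A sketch of how those diagnostics could be captured on the affected broker host, assuming a single broker JVM and the standard JDK tools (jps, jstack, jmap) on the PATH:

```bash
# Find the broker's JVM PID (the Pulsar broker main class)
BROKER_PID=$(jps -l | grep -i pulsar | awk '{print $1}')

# Thread dump; take several a few seconds apart to see whether threads stay stuck
jstack "$BROKER_PID" > "broker-jstack-$(date +%s).txt"

# Heap dump of live objects (note: this pauses the JVM while the file is written)
jmap -dump:live,format=b,file="broker-heap-$(date +%s).hprof" "$BROKER_PID"
```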
@sijieg Thanks for the advice. |
@devinbost There is a fix (#5894) for the broker stopping dispatching messages, and it was released in 2.5.0. |
This issue is a duplicate of #5311 |
We noticed that after one of these situations, our Zookeeper nodes had gotten out of sync. We ran a diff after scraping the Zookeeper data on each of the ZK instances, and we noticed that only one ZK instance was behind (although we weren't able to check the ledger data since it's constantly changing.) The difference we noticed was that the ZK instance that was behind was missing several nodes in: |
@addisonj Have you noticed anything similar? |
@lhotari I reproduced the issue using this as the Pulsar Function:
I chained one function (with parallelism of 4) to another function (with parallelism of 4), both using the same class. |
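Purely as an illustration of the chained setup described (not the original function), a hypothetical deployment of two instances of the same function class with parallelism 4 might look like this; the class name, function names, topics, and jar path are all assumptions:

```bash
# Stage 1: consume the input topic, produce to an intermediate topic
bin/pulsar-admin functions create \
  --jar my-functions.jar \
  --classname com.example.PassThroughFunction \
  --name stage-1 \
  --inputs persistent://public/default/input \
  --output persistent://public/default/intermediate \
  --parallelism 4

# Stage 2: consume the intermediate topic, produce to the output topic
bin/pulsar-admin functions create \
  --jar my-functions.jar \
  --classname com.example.PassThroughFunction \
  --name stage-2 \
  --inputs persistent://public/default/intermediate \
  --output persistent://public/default/output \
  --parallelism 4
```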
@rdhabalia This happened after I applied the default broker.conf settings (fixing the non-default configs for ManagedLedger and a few other things that we had) and restarted the brokers. |
We discovered that certain non-default broker.conf settings seem to be key to reproducing this bug, but we haven't yet identified which of them are responsible.
It's obvious that
It was only after adding these that the bug re-appeared:
We also noticed an ERROR message on the affected broker that consistently appears (in Pulsar 2.6.2) at the exact time it freezes:
While it's frozen, after waiting for a while, a series of similar errors will appear on that same broker but with an actual (not -1) ledgerId:
As soon as that broker is stopped, data will resume flowing briefly before freezing again on a different broker, which will show the error. It looks like this issue is closely related to apache/bookkeeper#2716. We also noticed no clear JVM memory issues, though a few minutes after the freeze occurs, we did notice a memory leak show up in the logs. The WARN-level logs around the leak are in leak_log.txt. |
Many thanks to @lhotari for figuring out how to reproduce this bug outside of my environment. |
I have reported a separate issue about a direct memory leak: #10738. It includes a full repro case with a Helm deployment that can be used to reproduce the issue in minikube (requires sufficient RAM) or in any k8s environment. |
Any news regarding this one? |
@JohnMops did you update to Pulsar 2.8.0? |
What are your broker.conf settings? And how many messages are you processing? We've noticed that some settings increase the frequency of this, but I'm still working on investigating the root cause. Something is causing the broker to not ack messages, but we're not sure about the exact cause yet. |
Related to #10813, except that issue seems not to occur if batching is disabled. As an update, I've reproduced this bug even after correcting the non-standard broker.conf settings that were mentioned earlier in this issue. |
I mapped out more of the ack flow, so I will add it to what I documented here and make it more readable. The bracket notation below is intended to specify an instance of the class (to distinguish it from a static method call). When From somewhere (it's not clear to me exactly where yet) we call From there, the client gets the After a rollover, Something triggers It's still not clear where the acks are getting lost, so there must be another part of the flow that I'm missing. |
After spending many hours digging through the ack paths and not finding any issues, I took another look at the client ( If there's a connectivity issue, that would explain why this bug has been so hard to reproduce and why it can't be reproduced locally. It could be that some network hardware is doing something weird with the connection, and the client doesn't handle it correctly and gets stuck in a In the function logs, I'm noticing a lot of messages like this:
It looks like the function keeps trying to reconnect, but the connection is immediately (within 1 millisecond) closed. So, the function is never able to re-establish a healthy connection. If it can't establish a healthy connection, it can't produce messages or receive acks from the broker. So, this is a viable root cause. |
This bug has been resolved in DataStax Luna Streaming 2.7.2_1.1.21 |
@devinbost Which version of Apache Pulsar has this fix? |
Hi @devinbost, is this in 2.9.1? |
@skyrocknroll @marcioapm All required fixes are included in Apache Pulsar versions 2.7.4 and 2.8.2. |
Still seeing the stalled topic issue in 2.9.2. |
Describe the bug
Topics randomly freeze, causing catastrophic topic outages on a weekly (or more frequent) basis. This has been an issue for as long as my team has used Pulsar, and it has been communicated to a number of folks on the Pulsar PMC.
(I thought an issue was already created for this bug, but I couldn't find it anywhere.)
To Reproduce
We have not figured out how to reproduce the issue. It's random (seems to be non-deterministic) and doesn't seem to have any clues in the broker logs.
Expected behavior
Topics should never randomly stop working in such a way that the only resolution is restarting the problem broker.
Steps to Diagnose and Temporarily Resolve
Step 2: Check the rate out on the topic (click on the topic in the dashboard, or run a stats query on the topic and look at "msgRateOut").
If the rate out is 0, this is likely a frozen topic, but verify by doing the following:
In the Pulsar dashboard, click on the broker that the topic lives on. If you see multiple topics with a rate out of 0, proceed to the next step; if not, it could be a different issue, so investigate further (see the stats sketch below).
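A sketch of the same check from the command line, assuming a hypothetical topic name and that jq is installed:

```bash
# Overall dispatch rate for the topic; a frozen topic stays at 0 while backlog grows
bin/pulsar-admin topics stats persistent://public/default/my-topic | jq '.msgRateOut'

# Per-subscription view: dispatch rate, backlog, and unacked messages
bin/pulsar-admin topics stats persistent://public/default/my-topic \
  | jq '.subscriptions | to_entries[] | {subscription: .key, msgRateOut: .value.msgRateOut, msgBacklog: .value.msgBacklog, unackedMessages: .value.unackedMessages}'
```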
Step 3: Stop the broker on the server that the topic is living on.
pulsar-broker stop
Step 4: Wait for the backlog to be consumed and all the functions to be rescheduled (typically about 5-10 minutes).
Environment:
This has been an issue with previous versions of Pulsar as well.
Additional context
The problem was MUCH worse with Pulsar 2.4.2, so our team needed to roll back to 2.4.0 (which still has the problem, but less frequently).
This is preventing the team from progressing in its use of Pulsar, and it's causing SLA problems for those who use our service.