Post Process Forwarder - KafkaError "Offset Out of Range" #478
Can you try modifying this line and see if it helps?
@BYK Thank you, I'll have a try on the manifest and update it. Just an FYI, I did fix this by effectively destroying the volume for the Kafka backend and recreating it. I also noticed my volume size is pretty low for the amount of tasks we have coming in. Does Kafka have an inherent GC that cleans things up? It looks as though the disk was filling upwards of 8GB.
@BYK It came back. It looks like I get the same error with both earliest and latest offsets. This is POST fixing things, so it has unexpectedly broken itself again. Trying to think of what I can do to fix it.
Hello, https://sentry.uptactics.com/share/issue/2a83806a230e4c4196f86c83890ad62e/
Thanks @DandyDeveloper for outlining your solution path. We had the same issue and were also able to resolve it for now by destroying the persistent volume of Kafka. So far the issue has not come back for us the way it did in your case.
I have a solution:
This solution works the same way for the other Snuba components (just replace the group and/or topic).
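The exact commands did not survive in this transcript, so here is a minimal sketch of what such an offset reset typically looks like in this setup. The service, group, and topic names (`post-process-forwarder`, `snuba-post-processor`, `events`) are assumptions, as is the use of the stock Confluent image where the Kafka CLI tools are on the PATH without a `.sh` suffix; check `docker-compose ps` and the failing consumer's logs for the names on your install.

```bash
# Stop the consumer(s) in the affected group first; the reset only works
# while the group is inactive (see the later comments in this thread).
docker-compose stop post-process-forwarder

# Inside the Kafka container, move the group's committed offset to the end
# of the topic. Swap --group/--topic for other Snuba consumers as needed.
docker-compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 \
  --group snuba-post-processor \
  --topic events \
  --reset-offsets --to-latest --execute

# Bring the consumer back up.
docker-compose start post-process-forwarder
```

Note that jumping to `--to-latest` skips everything between the lost offset and the head of the topic, which is the data-loss trade-off discussed further down in this thread.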
@rmisyurev This actually worked for me, thank you very much! But to be honest I don't know why. The first time I got an error message like this:
I just stopped all containers, then only started the zookeeper and kafka containers, executed the statement above and stopped both containers again. I am sure it is not the right solution, so be careful, but for me it worked.
@rmisyurev - thanks for sharing the steps that helped you! I'll make sure to add these to some future documentation.
Thanks so much for the helpful comments here, especially @rmisyurev — I looked at your post with the steps above. This issue has been plaguing me for weeks, and I tried everything, including adding a ton of consumer replicas, which didn't help at all. What I think has finally solved it (in addition to giving Kafka 8GB of RAM) was actually setting the …
This issue has gone three weeks without activity. In another week, I will close it. But! If you comment or otherwise update it, I will reset the clock, and if you label it accordingly, I will leave it alone. "A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀
Hi, thanks for all the quick fixes, but unfortunately none of them fixed it for us. Currently Sentry runs OK for a few days with: … The only way to get the system fully operational again is deleting everything related to Kafka (volumes etc.) and rebuilding it. Any suggestions on how to make Sentry run more stable?
Same issue here on 20.11.1. I did all the steps mentioned by @rmisyurev, but to no avail. @bbreijer, where do you run this setting? EDIT: Updating to 21.1.0 did not solve the issue. sentry_onpremise_snuba-consumer_1, sentry_onpremise_snuba-outcomes-consumer_1 and sentry_onpremise_snuba-transactions-consumer_1 all keep restarting all the time. EDIT2: I changed the commit-batch-size in the docker-compose.yml file.
That is where I also changed it.
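For anyone else hunting for it: the flag is part of the consumer/forwarder commands defined in docker-compose.yml. A quick way to locate and apply the change (the service name and the example value are assumptions; check your own file):

```bash
# Find every service command that sets the flag
grep -n -- "--commit-batch-size" docker-compose.yml

# After raising the value in docker-compose.yml (e.g. from 1 to 100),
# recreate the affected services so the new command takes effect
docker-compose up -d --force-recreate post-process-forwarder
```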
Edited the commit-batch-size and removed all Snuba (Kafka) instances, but still:
errors all the time and continuous restarts. EDIT:
etc. Stuck at these issues that keep coming back:
EDIT: Fixed this by:
and
EDIT: Tried:
and:
but this issue still persists, with sentry_onpremise_snuba-outcomes-consumer_1 restarting every time. The other containers do stay up now though :)
I have the same issue. How can I fix this?
@kfkawalec
@gabn88 thanks a lot for sharing your solution here. It is a bit hard to follow though; do you think this is something we can implement in the repo to save others some hassle and time?
@BYK No, the issue is that consumers actively registered to the topic/group MUST be killed before you can actually run through and reset the offsets, because some consumers may still be running. Unfortunately, I think this is a deeper issue with the Snuba components being bad at "finding" where they left off in the case of disconnects or other failure criteria, although that's difficult to prove. This is housekeeping that needs to be performed manually for the time being; the best that can be done is to provide clear instructions on how to resolve it (which ideally should be provided by Sentry themselves, or contributed to their docs). The best way I've found is using similar steps as @gabn88 above: specifically, exec'ing into Kafka and resetting the topic/consumer group that has problems, as sketched below.
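To make that housekeeping a bit more concrete, here is a rough sketch of how to find and inspect the problem group before resetting it (same naming assumptions as the earlier sketch):

```bash
# List every consumer group Kafka knows about
docker-compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 --list

# Describe the suspect group: a CURRENT-OFFSET that is missing or far behind
# LOG-END-OFFSET, on a topic whose old segments have already been deleted,
# matches the "offset out of range" symptom in this thread
docker-compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 --describe --group snuba-post-processor
```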
@BYK I knew I recognised your name! As a follow-up, since you're actively part of the Sentry dev cycle: is this something Sentry is aware of with the Snuba components?
Pinging @lynnagara to represent @getsentry/sns here and prove you right or wrong 🙂
We are collecting some common cases and solutions to document. This may even be a …
As far as we are concerned (and I am aware), we don't have these issues, so I don't think Snuba is at fault here. Typically this happens when there's a burst of events and then Kafka just rolls over, causing offsets to get lost or become invalid, and then all hell breaks loose. Sound familiar? 😁
From where, though? Have we met before, and am I being a terrible person for not recognizing your GitHub handle? 😅
@BYK I'm not trying to point fingers; I agree it looks like a combination of Kafka exploding from an excessive number of events and the consumer being unable to start back up and find its way. I think adding an inherent failover mechanism to the Snuba consumers for this would work for everyone, though the biggest problem would be coordinating consumers to terminate their connections to Kafka whilst the "problem child" resolves its offset issue. I'm sure there are a bunch more things as well.
Sentry forums, from a while back. Thank you for all your service! Stupidity alert: I thought I was in the Sentry chart repository, which is why I was being more defensive about introducing an automated fix in the chart, not realising I was in the onprem repo… I am an idiot.
Hi @DandyDeveloper, what is likely happening, and what I would like you to confirm, is that the post-process forwarder cannot keep up with the event ingestion rate. Events keep piling up until Kafka drops the old ones for running out of space, and the post-process consumer no longer has a committed offset to start from. Would you share the amount of events you ingest per second, to ensure this is the scenario we are talking about? Now, one point that was not discussed is that resetting the offset of the post-process consumer means losing data: post-processing would never run on the events that were dropped, nor on those that were skipped. This is not a behavior Snuba would take automatically, and I'd argue it should not, as this kind of data loss should not happen transparently. Now, what to do about it?
Happy to provide further clarifications. Filippo
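One way to sanity-check this scenario on your own install is to look at the topic's retention settings and at how quickly its earliest offset moves forward. A sketch under the same naming assumptions as the earlier examples; the exact tool invocations can vary with the Kafka version shipped in your image:

```bash
# Show per-topic retention overrides (broker defaults apply otherwise,
# e.g. time-based retention of 168 hours unless configured differently)
docker-compose exec kafka kafka-configs \
  --bootstrap-server kafka:9092 \
  --describe --entity-type topics --entity-name events

# Earliest available offsets per partition (--time -2); if these keep jumping
# past the post-processor's committed offsets, events are being deleted
# before they are post-processed
docker-compose exec kafka kafka-run-class kafka.tools.GetOffsetShell \
  --broker-list kafka:9092 --topic events --time -2
```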
@fpacifici Thank you for the reply. Just to be clear, I have overcome these issues for the time being myself. My comments and observations were in response to @BYK, just giving my 2 cents on the situation as other users are experiencing this. I am almost certain this was the case, though, for me and for everyone else above: the consumers fell far behind the ingestion rate and Kafka dropped the old events. I'll see if I can fetch some numbers for you, but I have since made sure events were scaled down (as individuals were misusing the Sentry instance and misunderstood what is classified as an event ;) ). As a result, I don't really have these problems anymore. I believe it could be helpful if Snuba at least had the capability to self-diagnose and inform the user about what may have happened, or even documentation that covers the ingestion workflow and what people should "look out for". I understand that self-healing might not be ideal because that could increase the loss of documents. Parallel consumption could well solve this issue for most individuals; presumably this means a single consumer could be fed more resources in order to open more threads for processing events? Or do you mean we could have multiple consumers on a single topic? Either way, in my case, and for others using Kubernetes-based deployments, it'll make things much easier to scale out.
@DandyDeveloper I came back to mention potential data loss, but @fpacifici beat me to it. I think we can at least show people the option with a big red warning: we can automatically recover, but that will mean losing all the events from the past x hours. Transparency and ease of understanding the system are quite important, I agree.
Oh, thanks a lot. Quite literally my duty, but still happy that it was a memorable experience for you :)
This section now covers the most common questions/issues we get regarding self-hosted. It also addresses https://app.asana.com/0/1169344595888357/1188448821391289, meaning it fixes getsentry/self-hosted#478 and fixes getsentry/self-hosted#788.
After a while of running, my post-process forwarder appears to be having issues communicating with Kafka (the "Offset Out of Range" KafkaError in the title):
Oddly, my Snuba consumer is working fine, Sentry appears to be running as expected, and I can see the Snuba consumer running in the Kafka logs.
Can someone give me some suggestions on what to check to figure out what the post-process forwarder is looking for?
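A starting point, consistent with the diagnosis discussed in the comments above: compare what the forwarder has committed against what Kafka still has on disk, and check whether the Kafka volume is filling up. A sketch only; the data path, group name, and topic are assumptions based on the stock Confluent image and a default install:

```bash
# Is the Kafka data volume running out of space (and therefore deleting old
# segments)? /var/lib/kafka/data is the default in the Confluent image.
docker-compose exec kafka df -h /var/lib/kafka/data
docker-compose exec kafka sh -c 'du -sh /var/lib/kafka/data/events-*'

# Does the forwarder's consumer group still have a valid committed offset?
docker-compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 --describe --group snuba-post-processor
```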