Post Process Forwarder - KafkaError "Offset Out of Range" #478

Closed
DandyDeveloper opened this issue May 7, 2020 · 25 comments · Fixed by getsentry/develop#271

DandyDeveloper commented May 7, 2020

After a while of running, my post-process forwarder appears to be having issues communicating with Kafka:

Traceback (most recent call last):
  File "/usr/local/bin/sentry", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/__init__.py", line 164, in main
    cli(prog_name=get_prog(), obj={}, max_content_width=100)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/decorators.py", line 68, in inner
    return ctx.invoke(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/decorators.py", line 30, in inner
    return ctx.invoke(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/commands/run.py", line 318, in post_process_forwarder
    initial_offset_reset=options["initial_offset_reset"],
  File "/usr/local/lib/python2.7/site-packages/sentry/utils/services.py", line 104, in <lambda>
    context[key] = (lambda f: lambda *a, **k: getattr(self, f)(*a, **k))(key)
  File "/usr/local/lib/python2.7/site-packages/sentry/eventstream/kafka/backend.py", line 184, in run_post_process_forwarder
    raise Exception(error)
Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}

Oddly, my Snuba consumer is working fine, Sentry appears to be running as expected, and I can see the Snuba consumer running in the Kafka logs.

Can someone give me some suggestions on what to look at to figure out why the post-process forwarder is failing here?

BYK (Member) commented May 7, 2020

Can you try modifying this line to:

command: run post-process-forwarder --commit-batch-size 1 --initial-offset-reset earliest

and see if it helps?
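For anyone following along: that line lives in the onpremise docker-compose.yml under the post-process-forwarder service. A sketch of the relevant excerpt (the service name and YAML anchors may differ between onpremise releases, so treat it as illustrative):

    post-process-forwarder:
      << : *sentry_defaults
      command: run post-process-forwarder --commit-batch-size 1 --initial-offset-reset earliest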

BYK self-assigned this May 7, 2020
BYK added the needs-info label May 7, 2020
DandyDeveloper (Author) commented:

@BYK Thank you, I'll give it a try in the manifest and update it.

Just an FYI, I did fix this by effectively destroying the volume for the Kafka backend and recreating it.

I also noticed my volume size is pretty low for the amount of tasks we have coming in. Does Kafka have an inherent GC that cleans things up? It looks as though the disk was filling upwards of 8GB.
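On the GC question: Kafka has no garbage collector as such; it prunes old log segments based on retention settings, which is also what makes committed offsets fall out of range when consumers lag too far behind. A sketch of capping retention, assuming the confluentinc/cp-kafka image used by the onpremise docker-compose setup (it maps KAFKA_* environment variables to broker settings; the values below are illustrative):

    kafka:
      environment:
        KAFKA_LOG_RETENTION_HOURS: "24"          # keep at most 24h of data
        KAFKA_LOG_RETENTION_BYTES: "4294967296"  # and at most ~4GB per partition
        KAFKA_LOG_SEGMENT_BYTES: "1073741824"    # roll segments at 1GB so old ones can be deleted

Note that tighter retention makes OFFSET_OUT_OF_RANGE more likely for lagging consumers, since their committed offsets disappear sooner.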

DandyDeveloper (Author) commented May 11, 2020

@BYK It came back; it looks like I get the same error with both earliest and latest offsets. This is after fixing things, so it has unexpectedly broken itself again.

Trying to think of what I can do to fix it.

ProxiBlue commented:

Hello,
I am getting the same issue, and it eventually kills Sentry; it simply stops logging any issues.
This has been going on for some time.
I can give you access to a Digital Ocean Droplet where this is happening, if that will help.
I upped the instance to 8GB of memory to test if resources are the issue.

https://sentry.uptactics.com/share/issue/2a83806a230e4c4196f86c83890ad62e/

ProxiBlue commented:

I upped the instance to 8GB of memory to test if resources are the issue.

I think that helped. Today (since 7h ago) the error stopped.

[screenshot: Selection_320]

The readme/docs note at least 2GB of memory; it seems I needed 8GB to stop the error. Possibly resource related?

joscha-alisch commented:

Thanks @DandyDeveloper for outlining your solution path. We had the same issue and were also able to resolve it for now by destroying the persistent volume of Kafka. So far the issue hasn't come back like it did in your case.

rmisyurev commented:

I have a solution:

  1. Log on to the Kafka container
    docker exec -it onpremise_kafka_1 bash
  2. Get the consumer group list
    kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --list
  3. Get the group info
    kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --describe
  4. Check what will happen to the offset with a dry run (optional)
    kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latest
  5. Set the offset to latest and execute
    kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --topic events --reset-offsets --to-latest --execute
  6. Profit

This solution works the same way for the other Snuba components (just replace the group and/or topic).
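The same steps can also be run from the host without an interactive shell, e.g. (a sketch; the container name onpremise_kafka_1 comes from step 1 above and may differ depending on your checkout directory):

    docker exec onpremise_kafka_1 kafka-consumer-groups \
      --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor --describe
    docker exec onpremise_kafka_1 kafka-consumer-groups \
      --bootstrap-server 127.0.0.1:9092 --group snuba-post-processor \
      --topic events --reset-offsets --to-latest --execute

Keep in mind that resetting to latest skips every event between the old and the new offset for that group, so post-processing will never run for those events.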

flokoe commented Jul 31, 2020

@rmisyurev This actually worked for me, thank you very much! But to be honest I don't know why.

The first time I got an error message like this:

Error: Assignments can only be reset if the group 'snuba-post-processor' is inactive, but the current state is Stable.

I just stopped all containers, then only started zookeeper and kafka containers, executed the statement above and stopped both containers again.
After that I started all containers and everything was fine.

I am sure it is not the right solution, so be careful, but for me it worked.
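A sketch of that container shuffle with the service names from the onpremise docker-compose.yml (zookeeper and kafka; adjust if yours differ). The stop/start dance is needed because offsets can only be reset while the consumer group has no active members:

    docker-compose stop
    docker-compose up -d zookeeper kafka
    docker-compose exec kafka kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 \
      --group snuba-post-processor --topic events --reset-offsets --to-latest --execute
    docker-compose stop zookeeper kafka
    docker-compose up -d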

BYK (Member) commented Aug 3, 2020

@rmisyurev - thanks for sharing the steps that helped you! I'll make sure to add these to some future documentation.

jdotjdot commented:

Thanks so much for the helpful comments here, especially @rmisyurev — I looked at your post with the kafka-consumer-groups commands at least 30 times while debugging this.

This issue has been plaguing me for weeks, and I tried everything, including adding a ton of consumer replicas, which didn't help at all. What I think has finally solved it (in addition to giving Kafka 8GB of RAM) was actually setting the --commit-batch-size argument on the post-process forwarder. The Helm chart for Sentry defaults this to 1 for some reason, which is way too low (see here). I set it to 100, and for the first time I'm seeing the lag go down rather than up. Wanted to share this in case anyone else runs across this issue.
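For the docker-compose based install, the equivalent change is the command line quoted earlier in the thread, just with the larger batch size (a sketch, with the same caveat about service naming as above):

    command: run post-process-forwarder --commit-batch-size 100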

github-actions bot commented Jan 4, 2021

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Accepted, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

bbreijer commented Jan 6, 2021

Hi, thanks for all the quick fixes, but unfortunately none of them fixed it for us. Currently Sentry runs OK for a few days with:
--commit-batch-size 20 --initial-offset-reset earliest.
The first problem I'm experiencing is that the 2FA check fails; the rest seems to operate fine.

The only way to get the system fully operational again is deleting everything related to Kafka (volumes etc.) and rebuilding it.

Any suggestions on how to make Sentry run more stably?

gabn88 commented Jan 18, 2021

Same issue here on 20.11.1. I did all the steps mentioned by @rmisyurev, but to no avail.
Now upgrading Sentry to the latest version and redoing all the steps. Hardware is an Intel i7-4770 @ 3.40GHz with 16GB on Docker under Windows 10 (using WSL), so that shouldn't be an issue for roughly 10k events per day, I would say? The Sentry worker still runs at 2x100% CPU every now and then.

@bbreijer where do you run this setting?

EDIT: Updating to 21.1.0 did not solve the issue. sentry_onpremise_snuba-consumer_1, sentry_onpremise_snuba-outcomes-consumer_1 and sentry_onpremise_snuba-transactions-consumer_1 all keep restarting all the time.

EDIT2: I changed the commit-batch-size in the docker-compose.yml file.

bbreijer commented:

EDIT2: I changed the commit-batch-size in the docker-compose.yml file.

That is where I also changed it.

gabn88 commented Jan 18, 2021

Edited the commit-batch-size and removed all snuba (kafka) instances, but still:

    f.__message = self.__consumer.poll(timeout=1.0)
  File "/usr/src/snuba/snuba/subscriptions/consumer.py", line 120, in poll
    message = self.__consumer.poll(timeout)
  File "/usr/src/snuba/snuba/utils/streams/synchronized.py", line 217, in poll
    message = self.__consumer.poll(timeout)
  File "/usr/src/snuba/snuba/utils/streams/backends/kafka.py", line 400, in poll
    raise OffsetOutOfRange(str(error))
snuba.utils.streams.backends.abstract.OffsetOutOfRange: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}

errors all the time and continuous restarts.

EDIT:
Ah, I found out that I had to do:

kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --execute

etc. But how do I know which topic each group uses?

Stuck at these issues that keep coming back:

+ exec gosu snuba snuba subscriptions --auto-offset-reset=latest --consumer-group=snuba-events-subscriptions-consumers --topic=events --result-topic=events-subscription-results --dataset=events --commit-log-topic=snuba-commit-log --commit-log-group=snuba-consumers --delay-seconds=60 --schedule-ttl=60

2021-01-18 15:13:33,298 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 5877628}

2021-01-18 15:13:33,299 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...

EDIT: Fixed this by:

 kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group snuba-events-subscriptions-consumers --topic events --reset-offsets --to-latest --execute

and

+ exec gosu snuba snuba consumer --storage outcomes_raw --auto-offset-reset=earliest --max-batch-time-ms 750

2021-01-18 15:27:59,791 New partitions assigned: {Partition(topic=Topic(name='outcomes'), index=0): 6250846}

2021-01-18 15:27:59,795 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...

EDIT: Tried:

kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group outcomes_raw --topic outcomes --reset-offsets --to-latest --execute

and:

kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group outcomes_raw --topic events --reset-offsets --to-latest --execute

but this issue still persists, with sentry_onpremise_snuba_outcomes-consumer_1 restarting every time. The other containers do stay up now though :)
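To answer the "which topic is used?" question above: describing a consumer group prints a TOPIC column for every topic it has committed offsets on, so listing and describing all groups gives the full group-to-topic mapping. A read-only sketch, reusing the container name from earlier in the thread:

    docker exec onpremise_kafka_1 bash -c '
      for g in $(kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --list); do
        echo "== $g"
        kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --group "$g" --describe
      done'

Also note that outcomes_raw is a Snuba storage name rather than necessarily a consumer group; the --list output shows the group the outcomes consumer actually commits under, which is the one to reset.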

kfkawalec commented:

I have the same issue. How can I fix this?

gabn88 commented Jan 18, 2021

@kfkawalec
You have to look at the error, then per error determine which Snuba consumer group to reset, and reset it manually.

BYK (Member) commented Jan 19, 2021

@gabn88 thanks a lot for sharing your solution here. It is a bit hard to follow, though; do you think this is something we can implement in the repo to save others some hassle and time?

DandyDeveloper (Author) commented:

@BYK No, the issue is that consumers actively registered to the topic/group MUST be killed before you can actually run through and reset the offsets, because some consumers may still be running.

Unfortunately, I think this is a deeper issue with the Snuba components being bad at "finding" where they left off in the case of disconnects or other failure conditions. Although that's difficult to prove.

This is housekeeping that needs to be performed manually for the time being; the best that can be done is to provide clear instructions on how to resolve it (which ideally should be provided by Sentry themselves, or contributed to their docs).

The best way I've found is using similar steps to @gabn88's above. Specifically, exec'ing into Kafka and resetting the topic/consumer group that has problems.

DandyDeveloper (Author) commented:

@BYK I knew I recognised your name! As a follow-up, since you're actively part of the Sentry dev cycle: is this something Sentry is aware of with the Snuba components?

BYK (Member) commented Jan 20, 2021

Unfortunately, I think this is a deeper issue with the Snuba components being bad at "finding" where they left off in the case of disconnects/other criteria for failure. Although that's difficult to prove.

Pinging @lynnagara to represent @getsentry/sns here and prove you right or wrong 🙂

This is housekeeping that needs to be performed manually for the time being, the best that can be done is provide clear instructions on how to resolve it (which ideally should be provided by Sentry themselves, or contributed to their docs).

We are collecting some common cases and solutions to document. This may even be a fix-kafka-snuba.sh script if I understand correctly.

As a follow up, since you're actively part of the Sentry dev cycle. Is this something Sentry is aware of with the Snuba components?

As far as we are concerned (and I am aware), we don't have these issues, so I don't think Snuba is at fault here. Typically this happens when there's a burst of events and then Kafka just rolls over, causing offsets to get lost or become invalid, and then all hell breaks loose. Sound familiar? 😁

@BYK I knew I recognised your name!

From where tho, have we met before and I am being a terrible person for not recognizing your GitHub handle? 😅

DandyDeveloper (Author) commented Jan 21, 2021

@BYK I'm not trying to point fingers; I agree it looks like a combination of Kafka exploding from an excessive number of events and the consumer being unable to start back up and find its way.

I think adding an inherent failover mechanism to the Snuba consumers for this would work for everyone. Though the biggest problem would be coordinating consumers to terminate their connections to Kafka whilst the "problem child" resolves its offset issue. I'm sure there are a bunch more things as well.

From where tho

Sentry forums from a while back. Thank you for all your service!

Stupidity alert: I thought I was in the Sentry chart repository, which is why I was being more defensive about introducing an automated fix in the chart, not realising I am in the onprem repo... I am an idiot.

fpacifici (Contributor) commented:

Hi @DandyDeveloper ,
a couple of clarifications about the behavior you are observing. From your first message, you see an OFFSET_OUT_OF_RANGE from the post-process forwarder while the primary Snuba consumer is fine.
This is not surprising behavior if your ingestion rate is too high for the ingestion pipeline, as the Snuba consumer processes messages in parallel while the post-process forwarder does not yet.
See https://getsentry.github.io/event-ingestion-graph/ for details. Note that post-processing is downstream of the Snuba consumer.

What is likely happening, and what I would like you to confirm, is that post-processing cannot keep up with the event ingestion rate. Events keep piling up until Kafka drops the old ones for running out of space, and the post-process consumer no longer has a committed offset to start from. Would you share the number of events you ingest per second, so we can make sure this is the scenario we are talking about?

Now, one point that was not discussed: resetting the offset of the post-process consumer means losing data. Post-processing would never run on the events that were dropped and all those that were skipped. This is not a behavior Snuba would apply automatically, and I'd argue it should not, as this kind of data loss should not happen transparently.

Now what to do about it?

  • Good news: the parallel consumer is going to be deployed for post-processing soonish — likely within H1 — which would make it much faster at processing events.
  • The reason post-processing cannot keep up (assuming this was your issue) is that by default the onpremise deployment creates only one partition, which means events are strictly consumed sequentially. Do you send all your load to one single project, or is it spread across multiple projects? If your load is already spread among multiple projects, you can increase the partition count of the events topic in Kafka (see the sketch after this list). If you then start multiple instances of the post-process consumer (post-process-forwarder), they will consume the partitions in parallel and you will increase your throughput.
  • If your load is entirely in one project, the situation is more complex. Every project is assigned to one and only one partition, which means that increasing partitions won't help if you have only one project. In that case, is there any way you can divide your load across multiple Sentry projects?
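A rough sketch of the partition increase (standard kafka-topics flags; the partition count here is illustrative, adding partitions cannot be undone, and older Kafka builds may need --zookeeper zookeeper:2181 instead of --bootstrap-server):

    docker exec onpremise_kafka_1 kafka-topics --bootstrap-server 127.0.0.1:9092 \
      --alter --topic events --partitions 4

After that, start additional post-process-forwarder instances so that each partition has a consumer.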

Happy to provide further clarifications.

Filippo

DandyDeveloper (Author) commented:

@fpacifici Thank you for the reply. Just to be clear, I have overcome these issues for the time being myself. My comments and observations were in response to @BYK, just giving my 2 cents on the situation as other users are experiencing this.

I am almost certain this was the case, though, for me and for everyone else above: the ingestion rate was far beyond expectations and Kafka dropped the events. I'll see if I can fetch some numbers for you, but I have since made sure events were scaled down (as individuals were misusing the Sentry instance and misunderstood what is classified as an event ;) ). As a result, I don't really have these problems anymore.

I believe Snuba having the capability to at least self-diagnose and inform the user about what may have happened could be helpful here, though, or even documentation that covers the ingestion workflow and what people should "look out for". I understand that self-healing might not be ideal because it could increase the loss of documents.

Parallel consumption could well solve this issue for most individuals. Presumably this means a single consumer could be fed more resources in order to open more threads for processing events? Or do you mean we could have multiple consumers on a single topic?

Either way, in my case and for others using Kubernetes-based deployments, it'll make things much easier to scale out.

BYK (Member) commented Jan 22, 2021

@DandyDeveloper I came back to mention potential data loss but @fpacifici beat me to it. I think we can at least show people the option with a big red warning: we can automatically recover but that will mean you'll lose all the events in the past x hours. Transparency and easiness to understand the system is quite important, I agree.

Sentry forums from a while back. Thank you for all your service!

Oh, thanks a lot. Quite literally my duty but still happy that it was a memorable experience for you :)

BYK added commits to getsentry/develop that referenced this issue on Feb 12 and Feb 15, 2021:
"This section now covers the most common questions/issues we get regarding self-hosted. It also addresses https://app.asana.com/0/1169344595888357/1188448821391289 meaning it fixes getsentry/self-hosted#478, and fixes getsentry/self-hosted#788."

github-actions bot locked and limited conversation to collaborators Mar 3, 2021

a-hariti (Jun 8, 2024) and stephanie-anderson (Jun 10, 2024) pushed commits with the same message to getsentry/sentry-docs that referenced this issue.