Segfaults during consumer shutdown #254
@vwassan can you share some code snippets and answer the following questions:

I can reproduce this with the Karafka test suite:
Ok, I can reproduce on all the versions. We need to introduce a lock around the assignment states. Karafka does not seem to be directly affected because its shutdown model is not the same as rdkafka's, so I will focus on rdkafka now (as I could not reproduce this with Karafka).
@mensfeld Thank you for looking into the issue!
@vwassan I'm good. I have a fix running locally. I need, however, one or two days to solidify it and run all the regressions to make sure it's stable. So far so good. There are a few ways to tackle this and I need to pick one that will suit all ;)
@vwassan ok, bad news. My patch does indeed make things better, but under heavy load it can still crash (3 times in 6 hours of rebalances) and it adds to the shutdown time. I created a new issue here to provide debug info for the librdkafka people: confluentinc/librdkafka#4312 If they are slow to react I will try to include the patch in an rdkafka release, or I will share it with you here so you can at least partially mitigate this.
Maciej, we'd be down to try the patch 😄
@methodmissing you will have to wait for it, because if I am right (running stress tests once more) it cannot be mitigated directly in rdkafka but only in karafka or in a custom consumer using rdkafka-ruby (or the default simple one from rdkafka-ruby). So far so good. The nature of this seems to be related to resource allocation during the initial rebalance for a given consumer instance, and not per se to the shutdown. There seems to be some dangling state from the first, not completed rebalance. This would explain why Karafka users don't see it: Karafka has a different flow for startup and shutdown, hence the chances of such short-lived assignments are much lower than with using rdkafka-ruby directly 🤷 Atm the bypass has been running for 2h with no crashes.
At the moment I am running 64 consumers for over 4 hours, no crashes. Starts to look solid, at least locally. edit: ok, got the first crash after several hours, so the solution is not final 😅 though it's still much better than the previous crash on every run. edit: I can replicate it with Karafka as well - a lot of effort but doable. Let's check if the patch works there as well.
@methodmissing @vwassan it was already fixed by Confluent, so I won't ship my patch. It's a matter of releasing 2.2.0. I will bump it here probably in the next release, as I need to run extensive testing to make sure it's stable. When bumping karafka-rdkafka to
2.2.0-rc-1 does not present this behaviour. I will close this once rdkafka catches up on releases.
@methodmissing @vwassan do you want to test out the RC release of librdkafka with the fix?
Nvmd. It's back in 2.2.0 🤦
@mensfeld sorry! somehow I missed this message
@vwassan nvmd. They broke it again. I already commented and got another stable reproduction. Let's see what they have to say ;)
@methodmissing @vwassan it's a different issue 😅 though the effect is the same. I do, however, know how to mitigate this and hopefully will craft a PR next week.
Thanks @mensfeld. Appreciate your efforts!
confluentinc/librdkafka#4312 (comment) sounds exciting
When running in production, we've also started to see a lot of segfaults during producer shutdowns. I haven't been able to reproduce locally. However, a hunch from a co-worker helped us fix the issue client-side by doing:

```ruby
at_exit do
  client.close
  client = nil
  GC.start
end
```

This suggests that the bug is most likely in rdkafka-ruby's finalizer logic.
@dmariassy the same applies to rdkafka-ruby itself:

If any, I would recommend using
If you have a librdkafka stacktrace it could be helpful. WaterDrop had one crash in the last 6 months across CI and my envs, and it was only related to an internal librdkafka race condition.
Sorry, my example above was incomplete. We do close the producer before setting it to nil. I corrected the omission. The full example is:
@dmariassy I will need:

Without a stable repro I unfortunately cannot fix it, as none of my higher-level libs or code presents this behaviour anymore, nor do any of the specs or things in rdkafka-ruby itself. WaterDrop uses a different (opinionated) finalizer hook for cleanups: https://github.com/karafka/waterdrop/blob/master/lib/waterdrop/producer.rb You may try to take its code and implement it yourself, or check whether the same issue is present via WaterDrop.
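For illustration, a minimal sketch of that general idea, assuming a hypothetical `ProducerWrapper` class (this is not WaterDrop's actual code): close the producer explicitly and deterministically instead of relying on GC finalizers.

```ruby
# Hypothetical sketch: an explicit, idempotent cleanup hook for an rdkafka producer.
require "rdkafka"

class ProducerWrapper
  def initialize(config)
    @producer = Rdkafka::Config.new(config).producer
    @closed = false

    # Register an explicit shutdown path so close runs before VM teardown,
    # while the Ruby objects are still fully alive.
    at_exit { close }
  end

  def produce(**args)
    @producer.produce(**args)
  end

  def close
    return if @closed

    @closed = true
    @producer.close
  end
end
```

The key point is that `close` is idempotent and triggered explicitly (here via `at_exit`), so librdkafka resources are released before finalization rather than during it.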
@dmariassy if I can reproduce it in any way (except by not closing the producer at all), it should be fixable (minus known librdkafka issues that can only be partially mitigated). If I reproduce it, I will probably open a new issue, as this one relates to other things that are either already fixed and refer to the consumer, or won't be fixed because they are in librdkafka and have been reported.
Hi @mensfeld, sorry for the late reply. I finally have a stacktrace I can share 🎉

As for the other pieces of info:
@dmariassy thanks! Just to be sure: I spoke recently with @peterzhu2118 about your shutdown procedures and there was a case of a multi-threaded shutdown with a timeout on close. I assume this is not the case here, right? Can you show me more of the shutdown procedure you guys are using?
Also one more thing: was the process a fork?
We still call:

```ruby
def close_client
  begin
    client.flush
  rescue => e
    @logger.error("Error flushing messages while closing rdkafka client: #{e.inspect}")
  end

  closed = T.let(false, T::Boolean)
  start = Time.now

  t = Thread.new do
    client.close # This is the line referenced in the stack trace: `abstract_rdkafka_writer.rb:207`
    closed = true
  rescue => ex
    @logger.error("Error closing rdkafka client: #{ex.inspect}")
  end

  sleep 0.1 while !closed && Time.now - start < 5 # wait 5 seconds for the client to close

  unless closed
    # The client will never close if it cannot connect to the broker
    @logger.error("Failed to close rdkafka client within the allowed timeout")
  end
rescue => e
  @logger.error("Error closing rdkafka client: #{e.inspect}")
ensure
  # These two lines were added as an earlier attempt to prevent segfaults. It worked in some cases.
  client = nil
  GC.start
  @logger.info("Rdkafka close sequence complete")
end
```

Yes
rdkafka-ruby uses librdkafka: https://github.com/confluentinc/librdkafka/blob/v2.0.2/src/rdkafka.c#L1107C8-L1107C8 Which leads me to another question: was the producer initialized in the parent process and used from the fork?
ref:
will try to repro tomorrow.
@dmariassy yeah I can crash it:
@dmariassy stop using config and producer references from the original process in the fork (and don't close it there) and things will work as expected. Quote from WaterDrop:

My repro is as follows. This crashes VMs at scale:

```ruby
config = { :"bootstrap.servers" => "localhost:9092", :"queue.buffering.max.ms" => 30_000 }

producer = Rdkafka::Config.new(config).producer
producer.produce(topic: 'test', payload: 'a')

10.times { sleep(rand); fork { 10.times { producer.produce(topic: 'test', payload: 'a') }; producer.close } }
```

This does NOT crash:

```ruby
config = { :"bootstrap.servers" => "localhost:9092", :"queue.buffering.max.ms" => 30_000 }

10.times {
  sleep(rand);
  fork {
    producer = Rdkafka::Config.new(config).producer
    10.times {
      producer.produce(topic: 'test', payload: 'a')
    }
    producer.close
  }
}
```
The last case is a librdkafka bug on closing while a metadata refresh is running on a cluster change. This is reported and is being fixed in librdkafka. Your case seems to be the one described and reproduced by me above.
Thanks for looking into it! Our library wrapper already defers the producer initialization until the first
@dmariassy from your stacktrace, the librdkafka code, and my POC it seems there's "something" out there that does not work as expected. I of course may be wrong, and then I will get back to this issue, but as of now it looks like you close a producer that is referenced from a fork.
This version seems to crash more often:

```ruby
config = {
  :"bootstrap.servers" => "localhost:9092",
  :"queue.buffering.max.ms" => 30_000
}

producer = Rdkafka::Config.new(config).producer
producer.produce(topic: 'test', payload: 'a')

sleep(2)

20.times do |i|
  sleep(rand / 10)

  fork do
    10.times do
      producer.produce(topic: 'test', payload: 'a')
    end

    sleep(0.01)
    producer.close
  end
end
```
Yeah @dmariassy I was not able to crash this:

```ruby
config = {
  :"bootstrap.servers" => "localhost:9092",
  :"queue.buffering.max.ms" => 30_000
}

20.times do |i|
  sleep(rand / 10)

  fork do
    producer = Rdkafka::Config.new(config).producer
    producer.produce(topic: 'test', payload: 'a')

    10.times do
      producer.produce(topic: 'test', payload: 'a')
    end

    sleep(0.01)
    producer.close
  end
end
```

It clearly indicates resource leakage.
I think we might be closing from the master process since that's where our
You can do what WaterDrop does: https://github.com/karafka/waterdrop/blob/master/lib/waterdrop/producer.rb#L97 This is lazily deferred to the post-fork stage (as long as you do not operate from the main thread). It is still worth closing correctly, but such a trick should also work.
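A rough sketch of that lazy, pid-aware initialization idea, assuming a hypothetical `ForkAwareProducer` wrapper (this is not WaterDrop's actual implementation):

```ruby
# Hypothetical sketch: build the producer lazily per process so a fork never
# reuses the librdkafka handle created in its parent.
require "rdkafka"

class ForkAwareProducer
  def initialize(config)
    @config = config
    @pid = nil
    @producer = nil
  end

  def produce(**args)
    producer.produce(**args)
  end

  def close
    # Only close a handle that was created in this very process.
    @producer.close if @producer && @pid == Process.pid
  end

  private

  # Rebuild the client whenever the current pid differs from the pid that
  # created it, i.e. after a fork.
  def producer
    if @producer.nil? || @pid != Process.pid
      @pid = Process.pid
      @producer = Rdkafka::Config.new(@config).producer
    end

    @producer
  end
end
```

With such a wrapper each fork builds and closes its own client, which matches the non-crashing repro above.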
I am closing this one. The last issues were fixed in librdkafka 2.3.0. We should get a 1:1 alignment with karafka-rdkafka and a merger of both in the upcoming months.
Check if 2.0.2 has a correct lock for the rebalance-in-progress close flow for rdkafka-ruby and karafka
details here: confluentinc/librdkafka#4308
more details here: confluentinc/librdkafka#4312
ref #242
ref #250