Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native crash when producer is closed #7

Closed
Pchelolo opened this issue Aug 16, 2016 · 9 comments
Closed

Native crash when producer is closed #7

Pchelolo opened this issue Aug 16, 2016 · 9 comments
Assignees
Labels

Comments

@Pchelolo
Copy link
Contributor

Here's a relevant part of the crash report:

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000000

VM Regions Near 0:
--> 
    __TEXT                 00000001022f8000-00000001032f9000 [ 16.0M] r-x/rwx SM=COW  /Users/USER/*

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libc++abi.dylib                 0x00007fff9c7afb1e __dynamic_cast + 34
1   node-librdkafka.node            0x00000001044a1bea RdKafka::Topic::create(RdKafka::Handle*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, RdKafka::Conf*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) + 232 (TopicImpl.cpp:102)
2   node-librdkafka.node            0x000000010448fb0e NodeKafka::Topic::New(Nan::FunctionCallbackInfo<v8::Value> const&) + 1380 (topic.cc:33)
3   node-librdkafka.node            0x000000010448ff4d Nan::imp::FunctionCallbackWrapper(v8::FunctionCallbackInfo<v8::Value> const&) + 131 (nan_callbacks_12_inl.h:175)
4   node                            0x000000010245f8b5 v8::internal::FunctionCallbackArguments::Call(void (*)(v8::FunctionCallbackInfo<v8::Value> const&)) + 373
5   node                            0x00000001024b901f v8::internal::Builtin_HandleApiCallConstruct(int, v8::internal::Object**, v8::internal::Isolate*) + 1375
6   ???                             0x000004463bb0961b 0 + 4699695650331

Unfortunately I couldn't create a small tests case to reliably reproduce, but the cause seem to be quite clear. When produce and disconnect calls are placed with very unlucky timing, we've got a race:

  1. EvenThread: disconnect called, ProducerDisconnect work is placed.
  2. EventThread: produce called - since we're updating the _isConnected only after the disconnect happened, the produce call goes to maybeTopic and then to native code and gets all the way till here
  3. ProducerDisconnect work executed on background thread and deletes the m_client
  4. EventThread: calls RdKafka::Topic::create -> boom, SIGSEGV

I think the easiest solution is to update JS _isConnected property before the actual disconnect happens - then all these racy code paths will be protected. However, maybe it's better to invest time in fixing those races on the native level, not sure which path do you wanna choose.

@webmakersteve
Copy link
Contributor

Do you have a big test case you can send me a gist of? If not I can just try to write one myself from your description.

It looks like RdKafka::Topic::create needs to use the thread lock in case we are currently disconnecting.

I'll fix this at the native level with some JS level checks to make sure it isn't throwing an unnecessary unhandled exception.

@Pchelolo
Copy link
Contributor Author

Do you have a big test case you can send me a gist of? If not I can just try to write one myself from your description.

@webmakersteve my 'big test case' is tests for https://github.com/wikimedia/change-propagation repo that I'm currently trying to upgrade to use your client, but it's a very messy WIP right now, I don't think you will make it run in the current state. I can push my WIP work and give you instructions how to run it if you want, but honestly I think you will just waist your time setting it up, bc even if you set it up, it's reproducible quite rarely, it's a data race after all.

@Pchelolo Pchelolo changed the title Native crash when consumers are closed Native crash when producer is closed Aug 16, 2016
@Pchelolo
Copy link
Contributor Author

Pchelolo commented Aug 16, 2016

@webmakersteve Got it! Here's a small test:

const Kafka = require('./lib');
const producer = new Kafka.Producer({
    'metadata.broker.list': 'localhost:9092'
});
producer.connect(undefined, () => {
    producer.disconnect(() => {
        console.log('Disconnected');
    });
    while (true) {
        producer.Topic('test', {});
    }
});

It crashes repeatedly.

@webmakersteve
Copy link
Contributor

Reproduced.

I'd like to fix this issue regardless, but the way I would recommend producing to topics anyway is to make the topics and handle them yourself.

e.g.

producer.connect(undefined, (err) => {
   if (err) {
      return;
   }

   var kafkaTopic = producer.Topic('test', {});

   for (var i = 0; i < 2000; i++) {
      producer.produce({ topic: kafkaTopic, message: new Buffer('message' });
   }
});

And you can do that for whatever topics you'd like. maybeTopic was put in there for backwards compatibility with our servers.

It used to create a topic every time produce was called, which was a waste of memory and CPU. So I changed topics to be wrapped objects so that they would be inaccessible at the same time as their connections (generally because they would be in the same scope).

Looking at this problem I realize I'll need to have some way to invalidate topics when the consumer or producer is disconnected unless i just rely on using JavaScript to guard it. I would probably need a callback that the Topic can listen to so it can dispose of itself.

A situation like your test case where are you manually initializing a topic after you have called disconnect would likely be a programming error. It shouldn't segfault anyway (and that's why it should be fixed), but I think if you change up how you do it you can probably get around the problem until i refactor it a bit.

@Pchelolo
Copy link
Contributor Author

A situation like your test case where are you manually initializing a topic after you have called disconnect would likely be a programming error. It shouldn't segfault anyway (and that's why it should be fixed), but I think if you change up how you do it you can probably get around the problem until i refactor it a bit.

Ye, sure, it's not a valid situation of course and if it didn't segfault but throw an error or something I wouldn't create this ticket, just segfault is not acceptable as a consequence of any programming errors :)

@webmakersteve
Copy link
Contributor

Absolutely! And that's why I'm going to fix it! Haha.

webmakersteve added a commit that referenced this issue Aug 16, 2016
Fixes test case (as far as I can tell) for Issue #7
@webmakersteve webmakersteve self-assigned this Aug 16, 2016
@webmakersteve
Copy link
Contributor

webmakersteve commented Aug 16, 2016

I believe this fixes that particular test case: https://github.com/Blizzard/node-rdkafka/tree/topic-connection-check

Do you want to try that and ensure it also works for you?

Still doesn't solve the problem of "What if I create a topic and then try to use that topic after I have disconnected?"

That will need to be handled in a special way.

@Pchelolo
Copy link
Contributor Author

@webmakersteve I've already changed stuff in my WIP, so it doesn't do this stupid stuff that was causing the crash any more, but looking at your commit that should work fine.. Thank you

webmakersteve added a commit that referenced this issue Aug 16, 2016
Fixes test case (as far as I can tell) for Issue #7
webmakersteve added a commit that referenced this issue Aug 16, 2016
@webmakersteve
Copy link
Contributor

Closing this issue. Reopen it if you find it not working :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants