-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Producer shouldn't use multiprocessing #151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
So are you talking about having multiple threads write to the same producer, or a producer per thread? IIRC the Connection class is not reentrant and doesn't play well with threads. |
Neither. I'm saying that |
I see. I'm generally ok with the idea of a thread based async producer, but I'm not sure that it should come at the cost of the multiprocessing based one. I'd be interested in seeing any benchmarking that showed the cost multiprocessing.Queue exceeds the cost of sending via a thread. |
There was also some work a while back to make this pluggable between threading / multiprocessing / gevent / etc. I think the previous discussion was on #37 . Not sure whether any of those folks have made progress on this. No PR that I know of, other than the gevent support. |
@wizzat the cost of sending over multiprocessing.Queue is huge. See my gist for an example benchmark: https://gist.github.com/snaury/9939482 As you can see threading queues are not impacted by message sizes, while multiprocessing queues are. What's even more important is that sending via multiprocessing.Queue can never be faster than threading.Queue because multiprocessing.Queue actually uses threads under the hood (otherwise you'd find that multiprocessing.Pipe obviously has very small buffer size and thus is not asynchronous at all). Performance, however, is not that big a problem (at least for me). Any performance improvements from not paying pickle/unpickle tax over pipes would be a plus (though, compared to network speed, it would of course be negligible), but the biggest problem is that by using multiprocessing kafka-python disallows uses of multiprocessing where it might give more gain. I might want to use multiprocessing where there's a lot of computation involved (it's not just numbers, e.g. text parsing is also computation), and where parallelism is really important. Imagine that during those computations I want to write results to Kafka, but I don't want to synchronously wait for any ACKs from Kafka, thus I want async=True. When kafka-python tries to use multiprocessing it, simply speaking, bans parallelism in my application, because nested processes are not allowed by multiprocessing. I would be ok if there was e.g. MultiProcessProducer (similar to MultiProcessConsumer), but default shouldn't use multiprocessing, unless explicitly asked to. |
Thank you for the detailed response! I really want to try some performance benchmarking myself, but I'm 100% in favor of a threaded solution. Do you have a PR for this? |
@wizzat not yet, will prepare one in a day or two. |
I think we can close this issue. |
yes indeed -- thanks for all the help on this! |
Producer, when in async mode, uses multiprocessing. Unfortunately this breaks when application itself already uses multiprocessing, because nested processes are not allowed by multiprocessing. What's even more important is that this forces messages to be synchronously pickled and sent over the channel (multiprocessing.Queue) which should be unnecessary. I wanted to use Producer with batch_send=True, but due to these two reasons I effectively can't. Since async and batching is probably more about I/O wouldn't it be better to use threads for async Producers?
The text was updated successfully, but these errors were encountered: