This repository has been archived by the owner on Mar 24, 2021. It is now read-only.

Documentation on PyKafka vs kafka-python #334

Closed
microamp opened this issue Nov 9, 2015 · 17 comments

@microamp

microamp commented Nov 9, 2015

Hello. I'd like to play around with Kafka, but I don't know which client to start with. I know there is at least one other Python client, kafka-python. Is there any documentation comparing the two? I'll start with PyKafka in the meantime. :)

@emmettbutler
Contributor

@microamp Thanks, this is a great idea. There's currently no documentation on this, but to my knowledge the main differences are the specifics of the Python API and PyKafka's implementation of the BalancedConsumer. PyKafka strives to keep the API as pythonic as possible, which means using useful features of the language where appropriate for client code simplicity. This includes things like context managers for object cleanup and futures for asynchronous error handling. PyKafka's balanced consumer implements the Kafka project's notion of the "high level consumer", which uses ZooKeeper to balance consumption of partitions between multiple nodes in a consumer group. From what I understand, kafka-python is waiting until Kafka 0.9, when this functionality will be supported natively by the Kafka server itself, to implement self-balancing consumers.
Also, the last time we did a speed test (which was admittedly a while ago at this point), PyKafka's consumer outperformed kafka-python. I unfortunately no longer have the results from that test, so you may not want to bet too hard on PyKafka being significantly faster or slower - just figured I'd mention it.
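
For readers following along, here is a minimal sketch of the two features mentioned above: a producer used as a context manager, and a ZooKeeper-backed balanced consumer. It assumes PyKafka 2.x is installed and that a broker and ZooKeeper are reachable. The function names, host strings, topic, and group are placeholders rather than anything from this thread, and the import is deferred so the module loads even without PyKafka.

```python
def produce_sync(hosts, topic_name, payloads):
    """Send each payload and wait for the broker to acknowledge it."""
    from pykafka import KafkaClient  # third-party; imported lazily

    client = KafkaClient(hosts=hosts)
    topic = client.topics[topic_name]
    # The context manager cleans up the producer on exit, even on error.
    with topic.get_sync_producer() as producer:
        for payload in payloads:
            producer.produce(payload)


def consume_balanced(hosts, zookeeper, topic_name, group, limit=10):
    """Read up to `limit` messages as one member of a consumer group."""
    from pykafka import KafkaClient  # third-party; imported lazily

    client = KafkaClient(hosts=hosts)
    topic = client.topics[topic_name]
    # Partitions are shared among all consumers that join the same group.
    consumer = topic.get_balanced_consumer(
        consumer_group=group,
        zookeeper_connect=zookeeper,
        auto_commit_enable=True,
    )
    received = []
    try:
        for message in consumer:
            if message is not None:
                received.append((message.offset, message.value))
            if len(received) >= limit:
                break
    finally:
        consumer.stop()
    return received
```

A call might look like `produce_sync("127.0.0.1:9092", b"test", [b"hello"])` followed by `consume_balanced("127.0.0.1:9092", "127.0.0.1:2181", b"test", b"testgroup")`. Starting the same consumer in a second process with the same group name should split the partitions between the two.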

@emmettbutler
Contributor

Some more research: the libraries differ in the Python versions they support. PyKafka supports 2.7, 3.4, 3.5, and PyPy, while kafka-python adds 2.6 and drops 3.5. kafka-python also requires a ZooKeeper connection for offset management, which PyKafka does not. kafka-python supports Kafka versions 0.8.0 through 0.8.2, whereas PyKafka supports only 0.8.2.

@microamp
Author

@emmett9001

Thanks a lot for the reply. I find the information very helpful.

It's good to know that PyKafka supports Python 3.4+. It was still a work in progress the last time I checked a few months back. Good work, guys.

rduplain added a commit that referenced this issue Nov 10, 2015
rduplain added a commit that referenced this issue Nov 10, 2015
rduplain added a commit that referenced this issue Nov 11, 2015
yungchin added a commit that referenced this issue Nov 13, 2015
…tension

The producer-futures feature was backed out of master, which means the
expected interface for RdKafkaProducer._produce() has changed back, too.
I've addressed all merge conflicts here - the change in the _produce()
interface will be addressed in the next commit.

* parsely/master: (26 commits)
  changelog updates for 2.0.3, dev version
  increment version
  Catch IOError in recvall_into util.
  Catch IOError during connection response.
  re-import weakref
  Revert 52ae7a1
  Link #334 to README.
  drop autocommit logging to DEBUG level. fixes #337
  update after socket error in offset manager discovery
  remove unused condition
  unconditionally update partition leaders on update
  Load all topic values on values method.
  fix typo causing interpreter error in reset_offsets`
  fix outdenting error
  clarify functools import
  catch all exceptions when removing from zookeeper
  be very specific about the error we expect
  producer: minimal changes for gc'ability
  balancedconsumer: minimal changes for gc'ability (RFC)
  add logging, fix some retry/reconnect/update logic in simpleconsumer
  ...

Signed-off-by: Yung-Chin Oei <yungchin@yungchin.nl>

Conflicts:
	.travis.yml
	pykafka/simpleconsumer.py
	tests/pykafka/test_producer.py

@ottomata
Contributor

A difference between kafka-python and pykafka is the producer interface. kafka-python does not require that you know the topic when instantiating the producer. This is convenient if you need to produce to topics dynamically based on input (which I do!) :)
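
One way to approximate dynamic topic selection with a client whose producers are bound to a single topic is to create producers lazily and reuse them. The sketch below is only an illustration: `TopicProducerCache` and `producer_factory` are hypothetical names, not part of either library, and the factory is assumed to return an object with `produce()` and `stop()` methods.

```python
class TopicProducerCache:
    """Create one producer per topic on first use, then reuse it.

    Hypothetical helper, not part of PyKafka or kafka-python.
    """

    def __init__(self, producer_factory):
        # producer_factory: callable taking a topic name and returning an
        # object that exposes produce(payload) and stop().
        self._factory = producer_factory
        self._producers = {}

    def produce(self, topic_name, payload):
        """Route a payload to the producer for `topic_name`."""
        producer = self._producers.get(topic_name)
        if producer is None:
            producer = self._factory(topic_name)
            self._producers[topic_name] = producer
        producer.produce(payload)

    def stop(self):
        """Stop every cached producer and forget them."""
        for producer in self._producers.values():
            producer.stop()
        self._producers.clear()
```

With PyKafka, the factory could plausibly be `lambda name: client.topics[name].get_producer()`, where `client` is a `KafkaClient`. The trade-off is one producer (and its resources) per distinct topic, so this suits a modest number of topics.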

@amontalenti
Contributor

@ottomata That seems like an interesting request for us to look at. Want to open a separate issue about that?

@ottomata
Contributor

Sure!

@cscheffler
Contributor

@emmett9001 @ottomata Just got pointed at this thread and thought I'd make a late contribution.

We compared pykafka and kafka-python about 2 months ago while trying to decide which one to use. In the end, the deciding factor for us was that balanced consumers were much easier to manage in pykafka.

We also discovered later that a pykafka producer doesn't die when a Kafka broker restarts, while our kafka-python producers did.

Below are performance figures from a 3-node Kafka cluster running in EC2, using a single producer or consumer. The three numbers for each test are the quartiles measured for that test (25th percentile, median, 75th percentile).

  • pykafka producer: 41400 – 46500 – 50200 messages per second
  • pykafka consumer: 12100 – 14400 – 23700 messages per second
  • kafka-python producer: 26500 – 27700 – 29500 messages per second
  • kafka-python consumer: 35000 – 37300 – 39100 messages per second

So, for clarification, the median performance of a pykafka producer was 46500 messages per second, with an interquartile range of 41400 (25th percentile) to 50200 (75th percentile). Hope that makes sense.
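
As a rough illustration of how a "25th percentile – median – 75th percentile" summary like the ones above can be derived from repeated runs, here is a small sketch using only the standard library (Python 3.8+). The sample values are made up for the example and are not the measurements from this benchmark. The choice of the `inclusive` interpolation method is also an assumption.

```python
import statistics


def throughput_quartiles(samples):
    """Return (q1, median, q3) for a list of messages-per-second samples."""
    # n=4 splits the data into quartiles, giving three cut points.
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q1, median, q3


# Hypothetical per-run throughput figures, in messages per second.
runs = [41000, 43000, 46000, 49000, 52000]
q1, median, q3 = throughput_quartiles(runs)
print(f"{q1:.0f} – {median:.0f} – {q3:.0f} messages per second")
```

Reporting quartiles rather than a mean keeps a single unusually slow or fast run from dominating the summary.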

@emmettbutler
Contributor

This is awesome, thanks for the performance numbers @cscheffler. Do you have anything to share on the methodology you used to find them?

@ottomata
Contributor

Cool! For the producer bench, did you just use the default parameters? I assume async with req_acks=1?

@rghv

rghv commented Nov 25, 2015

@cscheffler Could you please share links to the test scripts, if they are open-sourced? I see https://github.com/cscheffler/kafka-demo, which uses pykafka. It would be a great help if you could share the kafka-python test scripts used in your comparison. Thanks!

@emmettbutler
Contributor

This writeup by @jofusa is the most thorough comparative benchmark of the Python Kafka clients I've seen.

@soedjais

soedjais commented Oct 27, 2016

Leaving the URL of another recent benchmark comparing pykafka 2.3.1, kafka-python 1.1.1, and confluent-kafka 0.9.1:
http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
Edit: already mentioned above by @emmett9001

@johnistan
Contributor

Original author here. Just an FYI: those are one and the same.

@amitt001

amitt001 commented Jul 19, 2017

It's July 2017. Is there any update or a more recent comparison?
I think even kafka-python supports balanced consumers now.

@AKarbas

AKarbas commented Jul 17, 2019

It's Jul, 2019! Any updates on the comparison? :)

@guoruibiao

It's April 2020! Newbie here. What I want to find out is which one is friendlier for us.

@prncoprs

It’s Sept 2020! Any updates?
