
SerializingProducer is much slower than Producer in Python #1440

Open

zacharydestefano89 opened this issue Oct 6, 2022 · 6 comments
zacharydestefano89 commented Oct 6, 2022

Description

I was working on code to produce messages to a Kafka topic. The messages are protobuf bytes, and I used SerializingProducer to pass the schema information. I also tried a separate method where I imitated what was done here

It was able to produce and flush messages at a rate of about 12 messages per second. For my use case, this is way too slow.

When I just used Producer and took out any schema information, the rate suddenly jumped to hundreds of messages per second.
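To make the comparison concrete, here is a minimal timing harness in the spirit of what the reporter describes. The `produce_fn`/`flush_fn` wiring is an assumption about their setup, not their actual code; with confluent-kafka-python you would pass in wrappers around `Producer.produce` (or `SerializingProducer.produce`) and `Producer.flush`:

```python
import time

def measure_produce_rate(produce_fn, messages, flush_fn):
    """Time a batch of produce calls plus one final flush.

    produce_fn / flush_fn are assumed to wrap a real producer, e.g.:
        produce_fn = lambda m: producer.produce("my-topic", value=m)
        flush_fn = producer.flush
    (producer and topic names here are hypothetical).
    """
    start = time.perf_counter()
    for m in messages:
        produce_fn(m)
    flush_fn()
    elapsed = time.perf_counter() - start
    return len(messages) / elapsed

# Demonstrate the call shape with no-op stand-ins for the producer:
rate = measure_produce_rate(lambda m: None, [b"x"] * 1000, lambda: None)
```

Running the same harness once with `Producer` and once with `SerializingProducer` isolates the serializer overhead from everything else in the job.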

How to reproduce

  1. Write a job to put thousands of messages onto a Kafka topic
  2. Have the job put schema information into each message and time it
  3. Compare it to the same job that does not put in schema information

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()):

From requirements.txt with the Python library:
confluent-kafka==1.7.0

From console:

>>> import confluent_kafka
>>> confluent_kafka.libversion()
('1.7.0', 17236223)
>>> confluent_kafka.version()
('1.7.0', 17235968)
  • Apache Kafka broker version:
    Confluent Cloud

  • Client configuration: {...}

Producer config:

{'bootstrap.servers': '...',
 'error_cb': <function error_cb at 0x7fd2dc01f820>,
 'sasl.mechanism': 'PLAIN',
 'sasl.password': '***************************',
 'sasl.username': '***************',
 'security.protocol': 'SASL_SSL'}
  • Operating system:

Run from docker container derived from Python 3.8.8 base

First line of Dockerfile:
FROM python:3.8.8

  • Provide client logs (with 'debug': '..' as necessary)

Using SerializingProducer:

INFO:root:Now adding 221 messages to Kafka topic. INFO mode will display the first and last 3 messages, DEBUG mode will display all of them
[2022-10-06, 20:05:42 UTC] {docker.py:310} INFO - INFO:root:2022-10-06T20:05:42.031972+00:00 : Adding message starting `user_i_d: "******` onto Kafka buffer under topic `***`
...
[2022-10-06, 20:06:16 UTC] {docker.py:310} INFO - INFO:root:Now flushing Kafka producer
[2022-10-06, 20:06:16 UTC] {docker.py:310} INFO - INFO:root:Time to produce and flush for chunk of 221 messages: 34.54440498352051 seconds

Using Producer:

[2022-10-06, 20:06:16 UTC] {docker.py:310} INFO - INFO:root:Now adding 54 messages to Kafka topic. INFO mode will display the first and last 3 messages, DEBUG mode will display all of them
[2022-10-06, 20:06:16 UTC] {docker.py:310} INFO - INFO:root:2022-10-06T20:06:16.675951+00:00 : Adding message starting `b'\n\****\x1` onto Kafka buffer under topic `****`
...
[2022-10-06, 20:06:16 UTC] {docker.py:310} INFO - INFO:root:Now flushing Kafka producer
[2022-10-06, 20:06:16 UTC] {docker.py:310} INFO - INFO:root:Time to produce and flush for chunk of 54 messages: 0.18948936462402344 seconds
  • Critical issue: Not critical, have a workaround
@mhowlett
Contributor

> It was able to produce and flush messages at a rate of about 12 messages per second

are you flushing after every produce? (this will be slow)
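mhowlett's point is that `flush()` blocks until every queued message has been delivered, so calling it per message turns the pipeline into one round trip per record. A self-contained sketch of the two patterns, using a `FakeProducer` stand-in so it runs without a broker (the method names `produce`/`poll`/`flush` match the real confluent-kafka-python API):

```python
class FakeProducer:
    """Stand-in for confluent_kafka.Producer, for illustration only."""
    def __init__(self):
        self.queued = 0
        self.flush_calls = 0
    def produce(self, topic, value):
        self.queued += 1          # real client appends to an internal queue
    def poll(self, timeout):
        pass                      # real client serves delivery callbacks
    def flush(self):
        self.flush_calls += 1     # real client BLOCKS until queue drains
        self.queued = 0

messages = [b"payload"] * 100

# Slow pattern: flush after every produce -> one delivery wait per message.
slow = FakeProducer()
for m in messages:
    slow.produce("my-topic", m)
    slow.flush()

# Fast pattern: produce everything, poll occasionally, flush once at the end.
fast = FakeProducer()
for m in messages:
    fast.produce("my-topic", m)
    fast.poll(0)  # non-blocking; keeps callbacks flowing
fast.flush()
```

Here `slow.flush_calls` is 100 while `fast.flush_calls` is 1; with a real broker each extra flush is a full delivery round trip.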

@zacharydestefano89
Author

> It was able to produce and flush messages at a rate of about 12 messages per second
>
> are you flushing after every produce? (this will be slow)

I tried both flushing after every produce and flushing after producing many messages. In both cases, messages were put on the topic at that aforementioned rate, 12 per second.

@mhowlett
Contributor

> ~100s messages per second.

you should be able to get tens of thousands of messages per second without the protobuf serdes. I don't have a good feel for how performant the protobuf serdes are (and you don't say anything about the size of your messages), but 12 per second seems very low.

It doesn't seem like we have a benchmark application for Python, we should write one (marking as enhancement).

@edenhill
Contributor

I get the feeling it is doing a schema-registry lookup for each message, which would explain the low throughput.
Maybe worth checking, somehow?
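edenhill's hypothesis is easy to reason about: a registry lookup is an HTTP round trip, so doing one per message caps throughput near 1/latency. The sketch below shows why caching the (subject, schema) → id mapping collapses a whole batch into a single lookup; the function names are illustrative, not the library's internals:

```python
import functools

lookup_count = 0

def registry_lookup(subject, schema_str):
    """Stands in for an HTTP round trip to Schema Registry."""
    global lookup_count
    lookup_count += 1
    return 1  # fake schema id

@functools.lru_cache(maxsize=None)
def cached_lookup(subject, schema_str):
    # Identical (subject, schema) pairs hit the cache, not the network.
    return registry_lookup(subject, schema_str)

for _ in range(221):  # the issue's 221-message batch
    cached_lookup("my-topic-value", "<protobuf schema>")
```

After the loop, `lookup_count` is 1: only the first message pays the round trip. At ~80 ms per uncached lookup, 221 lookups is roughly the 34-second figure in the logs above, which fits the reported symptom.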

@CTCC1

CTCC1 commented Oct 25, 2022

I reported the unnecessary lookup in 2020 in #935.
It was fixed by #1133, so 1.8.2+.
So I think upgrading to 1.8.2+ should fix this.
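Per CTCC1's suggestion, a quick sanity check that the installed client is at or past the fixed version. `confluent_kafka.version()` returns a tuple like `('1.7.0', 17235968)` (shown in the reporter's console output above); parsing the string avoids depending on the packed integer:

```python
def at_least(version_str, required=(1, 8, 2)):
    """True if a dotted version string is >= the required (major, minor, patch)."""
    parts = tuple(int(p) for p in version_str.split(".")[:3])
    return parts >= required

# With confluent-kafka installed you would pass confluent_kafka.version()[0]:
needs_upgrade = not at_least("1.7.0")  # the reporter's version predates the fix
```

So the pinned `confluent-kafka==1.7.0` in the reporter's requirements.txt predates the fix, and bumping the pin to 1.8.2 or later is the suggested remedy.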

@pranavrth
Member

Can you please confirm if it was fixed with the version upgrade?
