
Performance degradation in kafka sink #18762

Closed
Ilmarii opened this issue Oct 4, 2023 · 3 comments · Fixed by #18770
Assignees
dsmith3197
Labels
meta: regression (This issue represents a regression)
type: bug (A code related bug)

Comments

@Ilmarii
Contributor

Ilmarii commented Oct 4, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Our deployment topology: vector-agent -> kafka -> vector-aggregator.
After updating vector-agents from 0.32.2 to 0.33.0, I noticed a significant decrease in the number of incoming messages to Kafka. For agents with a large flow of logs (~1000 logs/s), the sending rate to Kafka dropped from 1000 to 120 per second, and disk buffer usage also began to grow.
I also noticed that the kafka_queue_messages metric used to fluctuate around 100-700; after the update it sits at exactly 64.
Maybe this is related to #18634? (I only suspect this because of the matching number.)

Configuration

data_dir: /vector-data-dir
api:
  enabled: true
  address: '0.0.0.0:8686'
  playground: true
log_schema:
  host_key: host
  message_key: message
  source_type_key: source_type
  timestamp_key: timestamp
expire_metrics_secs: 60
sources:
  kubernetes_logs_src:
    type: kubernetes_logs
    max_line_bytes: 1048576
    glob_minimum_cooldown_ms: 5000
    max_read_bytes: 102400
sinks:
  kafka_sink:
    type: "kafka"
    inputs: ["kubernetes_logs_src"]
    bootstrap_servers: "kafka-server"
    topic: "kafka-topic"
    key_field: "stream_key"
    encoding:
      codec: "json"
    acknowledgements:
      enabled: true
    buffer:
      type: "disk"
      max_size: 1073741824
    librdkafka_options:
      "batch.num.messages": "500000"
      "batch.size": "10000000"
      "compression.codec": "zstd"
      "enable.idempotence": "true"
      "message.max.bytes": "10000000"
      "queue.buffering.max.ms": "500"
    sasl:
      enabled: true
      mechanism: "SCRAM-SHA-512"
    tls:
      enabled: true

Version

0.33.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@Ilmarii added the type: bug label on Oct 4, 2023
@dsmith3197 self-assigned this on Oct 4, 2023
@dsmith3197
Contributor

Hi @Ilmarii,

Thanks for pointing this out. I believe you are correct that this behavior is related to #18634.

The limit was previously hard-coded to 100k, which was too high for many users and resulted in Vector OOM'ing when the upstream applied back pressure. I updated the limit to 64 with the understanding that FutureProducer::send() returned a result once the message was enqueued on the underlying producer's queue. However, that is not the case: as you are encountering, it only returns once the message is actually sent.

Rather than hard-coding this value to 64 or 100k, it should probably be set equal to queue.buffering.max.messages.
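For illustration, here is a minimal sketch of that idea using the rdkafka crate and a tokio Semaphore: the number of sends still awaiting a delivery report is bounded by a value taken from queue.buffering.max.messages instead of a hard-coded 64. This is not Vector's actual code or the fix in #18770; the broker address, topic, key, and message contents are placeholders.

use std::sync::Arc;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};
use rdkafka::util::Timeout;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical value; in Vector this would come from librdkafka_options.
    let queue_max_messages: usize = 100_000;

    let mut config = ClientConfig::new();
    config
        .set("bootstrap.servers", "kafka-server:9092") // placeholder broker
        .set("queue.buffering.max.messages", queue_max_messages.to_string());
    let producer: FutureProducer = config.create()?;

    // Cap the number of in-flight sends at the same value as librdkafka's
    // queue.buffering.max.messages, rather than hard-coding 64 or 100k.
    let permits = Arc::new(Semaphore::new(queue_max_messages));

    let mut handles = Vec::new();
    for i in 0..1_000u64 {
        // Each task holds a permit until FutureProducer::send() resolves. The
        // future completes when the delivery report arrives (the message was
        // actually sent), not when the message is merely enqueued.
        let permit = permits.clone().acquire_owned().await?;
        let producer = producer.clone();
        handles.push(tokio::spawn(async move {
            let payload = format!("log line {i}");
            let record = FutureRecord::to("kafka-topic")
                .key("stream_key")
                .payload(&payload);
            let result = producer.send(record, Timeout::Never).await;
            drop(permit);
            result.map(|_| ()).map_err(|(err, _msg)| err)
        }));
    }
    for handle in handles {
        handle.await??;
    }
    Ok(())
}

The point of the sketch is that each permit is released only after send() resolves, i.e. after librdkafka reports delivery, so the semaphore size directly controls how many messages can be in flight at once.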

@dsmith3197
Contributor

@Ilmarii For now, I'd advise downgrading to your last Vector version until we release a fix.

@sim0nx

sim0nx commented Oct 12, 2023

Just to confirm: I am experiencing the same issue after upgrading to 0.33.0; it basically killed my whole logging pipeline.
A downgrade restored the previous behaviour.

Looking forward to a fix :-)
