SERVICE AVAILABILITY: vector validate hangs indefinitely when using kafka source on v0.34.0 and beyond - Denial of Service for systemd managed instances #20687

Closed
reyjrar opened this issue Jun 18, 2024 · 7 comments · Fixed by #20698
Labels
type: bug A code related bug.

Comments

@reyjrar

reyjrar commented Jun 18, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

vector validate hangs indefinitely when using kafka sources with tls.enabled. I verified this happens by bisecting versions from v0.32.1 (where it worked) through v0.39.0. The regression was introduced in v0.34.0.

I'm not really sure what's happening here, but this is the state:

  • Upgraded from 0.32.1 to 0.39.0, vector validate hangs
  • Downgrade through a ton of releases until 0.33.1, now vector validate is fine
  • Restart vector
  • Stop vector
  • Upgrade to 0.34.0, vector validate is now fine
  • Start 0.34.0, everything is fine
  • Once v0.34.0 has run for the first time, vector validate hangs again and the service cannot be restarted.

I ran through this cycle a few times before opening this ticket. However, I'm now at the point where anything v0.34.0+ fails to validate (it will start just fine, but the systemd service runs vector validate as a pre-start step, which times out, so I'm DEAD in the water).

I had to forcefully downgrade vector to v0.33.1 everywhere because the validate action is broken.
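
For context, the systemd unit runs validate as a pre-start step, so a hanging validate blocks every start and restart. A minimal sketch of that layout, assuming the stock packaged unit shape (paths are illustrative):

[Service]
# Pre-start config check: if this hangs, systemd waits out the start timeout
# (DefaultTimeoutStartSec, typically 90s), fails the unit, and the service
# never starts or restarts.
ExecStartPre=/usr/bin/vector validate
ExecStart=/usr/bin/vector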

Configuration

Kafka source with TLS enabled.

Version

vector 0.32.1 (x86_64-unknown-linux-gnu 9965884 2023-08-21 14:52:38.330227446)

Debug Output

2024-06-18T01:26:17.033101Z TRACE vector::validate: Healthcheck for elasticsearch_ecs starting.
2024-06-18T01:26:17.033127Z  INFO vector::topology::builder: Healthcheck disabled.
~ Health check disabled for "elasticsearch_ecs"
2024-06-18T01:26:17.033166Z TRACE vector::validate: Healthcheck for elasticsearch_ecs done.
2024-06-18T01:26:17.033234Z TRACE tower::buffer::worker: worker polling for next message
2024-06-18T01:26:17.033254Z TRACE rdkafka::producer::base_producer: Destroy ThreadedProducer
2024-06-18T01:26:17.033255Z TRACE tower::buffer::worker: buffer already closed
2024-06-18T01:26:17.033265Z TRACE tower::buffer::worker: worker polling for next message
2024-06-18T01:26:17.033270Z TRACE rdkafka::producer::base_producer: Stopping polling
2024-06-18T01:26:17.033281Z TRACE tower::buffer::worker: buffer already closed
2024-06-18T01:26:17.033283Z TRACE rdkafka::producer::base_producer: Waiting for polling thread termination
2024-06-18T01:26:17.033932Z TRACE rdkafka::producer::base_producer: Polling thread loop started
2024-06-18T01:26:17.134016Z TRACE rdkafka::producer::base_producer: Polling thread loop terminated
2024-06-18T01:26:17.134152Z TRACE rdkafka::producer::base_producer: Polling stopped
2024-06-18T01:26:17.134166Z TRACE rdkafka::producer::base_producer: ThreadedProducer destroyed
2024-06-18T01:26:17.134226Z TRACE rdkafka::util: Destroying client: 0x7f6757d47800
2024-06-18T01:26:17.136206Z TRACE rdkafka::util: Destroyed client: 0x7f6757d47800
2024-06-18T01:26:17.136265Z TRACE rdkafka::producer::base_producer: Destroy ThreadedProducer
2024-06-18T01:26:17.136276Z TRACE rdkafka::producer::base_producer: Stopping polling
2024-06-18T01:26:17.136286Z TRACE rdkafka::producer::base_producer: Waiting for polling thread termination
2024-06-18T01:26:17.215017Z TRACE rdkafka::producer::base_producer: Polling thread loop terminated
2024-06-18T01:26:17.215153Z TRACE rdkafka::producer::base_producer: Polling stopped
2024-06-18T01:26:17.215168Z TRACE rdkafka::producer::base_producer: ThreadedProducer destroyed
2024-06-18T01:26:17.215228Z TRACE rdkafka::util: Destroying client: 0x7f6757d46400
2024-06-18T01:26:17.217298Z TRACE rdkafka::util: Destroyed client: 0x7f6757d46400
2024-06-18T01:26:17.217318Z TRACE rdkafka::util: Destroying queue: 0x7f6757deff00
2024-06-18T01:26:17.217330Z TRACE rdkafka::util: Destroyed queue: 0x7f6757deff00
2024-06-18T01:26:17.217342Z TRACE rdkafka::consumer::base_consumer: Destroying consumer: 0x7f6757d45000
2024-06-18T01:26:17.217392Z TRACE rdkafka::consumer::base_consumer: Consumer destroyed: 0x7f6757d45000
2024-06-18T01:26:17.217404Z TRACE rdkafka::util: Destroying client: 0x7f6757d45000
2024-06-18T01:26:17.219134Z TRACE rdkafka::util: Destroyed client: 0x7f6757d45000
2024-06-18T01:26:17.219200Z TRACE rdkafka::util: Destroying queue: 0x7f6757dd5bc0
2024-06-18T01:26:17.219214Z TRACE rdkafka::util: Destroyed queue: 0x7f6757dd5bc0
2024-06-18T01:26:17.219226Z TRACE rdkafka::consumer::base_consumer: Destroying consumer: 0x7f6757d42800
2024-06-18T01:26:17.219223Z TRACE rdkafka::consumer::stream_consumer: Shut down stream consumer wake loop: 0x7f6757d45000
2024-06-18T01:26:17.219254Z TRACE rdkafka::consumer: Running pre-rebalance with Assign(TPL {ecs/0: offset=Invalid metadata="", error=Ok(()); ecs/1: offset=Invalid metadata="", error=Ok(()); ecs/2: offset=Invalid metadata="", error=Ok(()); ecs/3: offset=Invalid metadata="", error=Ok(()); ecs/4: offset=Invalid metadata="", error=Ok(()); ecs/5: offset=Invalid metadata="", error=Ok(()); ecs/6: offset=Invalid metadata="", error=Ok(()); ecs/7: offset=Invalid metadata="", error=Ok(()); ecs/8: offset=Invalid metadata="", error=Ok(()); ecs/9: offset=Invalid metadata="", error=Ok(()); ecs/10: offset=Invalid metadata="", error=Ok(()); ecs/11: offset=Invalid metadata="", error=Ok(())})

<CRICKETS>

Example Data

Not needed

Additional Context

That TRACE rdkafka::consumer: Running pre-rebalance with Assign line is the last thing I see. When I strace the process, the only output for a full 6 minutes is:

$ sudo strace -p 58163
strace: Process 58163 attached
futex(0x7fa1258a0030, FUTEX_WAIT_BITSET_PRIVATE, 4294967295, NULL, FUTEX_BITSET_MATCH_ANY

References

No response

@reyjrar added the type: bug label on Jun 18, 2024
@reyjrar
Author

reyjrar commented Jun 18, 2024

I tried setting:

export SSL_CERT_FILE="<location of CA trust>"

Still hung.

I added:

export VECTOR_OPENSSL_NO_PROBE=true

Still hung.

I tried enabling the "Legacy OpenSSL" mode as well.

Worked once, then hung indefinitely.

The trace logs from the first run, which worked, for comparison:

2024-06-18T02:38:09.986203Z  INFO vector::topology::builder: Healthcheck disabled.
~ Health check disabled for "kafka_json"
2024-06-18T02:38:09.986234Z TRACE vector::validate: Healthcheck for kafka_json done.
2024-06-18T02:38:09.986308Z TRACE tower::buffer::worker: worker polling for next message
2024-06-18T02:38:09.986313Z TRACE rdkafka::producer::base_producer: Destroy ThreadedProducer
2024-06-18T02:38:09.986322Z TRACE tower::buffer::worker: buffer already closed
2024-06-18T02:38:09.986332Z TRACE rdkafka::producer::base_producer: Stopping polling
2024-06-18T02:38:09.986343Z TRACE rdkafka::producer::base_producer: Waiting for polling thread termination
2024-06-18T02:38:09.986609Z TRACE rdkafka::producer::base_producer: Polling thread loop started
2024-06-18T02:38:10.086694Z TRACE rdkafka::producer::base_producer: Polling thread loop terminated
2024-06-18T02:38:10.086833Z TRACE rdkafka::producer::base_producer: Polling stopped
2024-06-18T02:38:10.086849Z TRACE rdkafka::producer::base_producer: ThreadedProducer destroyed
2024-06-18T02:38:10.086917Z TRACE rdkafka::util: Destroying client: 0x7f2deb347800
2024-06-18T02:38:10.088773Z TRACE rdkafka::util: Destroyed client: 0x7f2deb347800
2024-06-18T02:38:10.088800Z TRACE rdkafka::producer::base_producer: Destroy ThreadedProducer
2024-06-18T02:38:10.088811Z TRACE rdkafka::producer::base_producer: Stopping polling
2024-06-18T02:38:10.088821Z TRACE rdkafka::producer::base_producer: Waiting for polling thread termination
2024-06-18T02:38:10.167422Z TRACE rdkafka::producer::base_producer: Polling thread loop terminated
2024-06-18T02:38:10.167560Z TRACE rdkafka::producer::base_producer: Polling stopped
2024-06-18T02:38:10.167574Z TRACE rdkafka::producer::base_producer: ThreadedProducer destroyed
2024-06-18T02:38:10.167633Z TRACE rdkafka::util: Destroying client: 0x7f2deb346400
2024-06-18T02:38:10.169647Z TRACE rdkafka::util: Destroyed client: 0x7f2deb346400
2024-06-18T02:38:10.169724Z TRACE tower::buffer::worker: worker polling for next message
2024-06-18T02:38:10.169741Z TRACE tower::buffer::worker: buffer already closed
2024-06-18T02:38:10.169771Z TRACE rdkafka::util: Destroying queue: 0x7f2dc53812e0
2024-06-18T02:38:10.169784Z TRACE rdkafka::util: Destroyed queue: 0x7f2dc53812e0
2024-06-18T02:38:10.169796Z TRACE rdkafka::consumer::base_consumer: Destroying consumer: 0x7f2deb345000
2024-06-18T02:38:10.169853Z TRACE rdkafka::consumer::base_consumer: Consumer destroyed: 0x7f2deb345000
2024-06-18T02:38:10.169876Z TRACE rdkafka::util: Destroying client: 0x7f2deb345000
2024-06-18T02:38:10.171737Z TRACE rdkafka::util: Destroyed client: 0x7f2deb345000
2024-06-18T02:38:10.171793Z TRACE rdkafka::util: Destroying queue: 0x7f2deb3d6fa0
2024-06-18T02:38:10.171806Z TRACE rdkafka::util: Destroyed queue: 0x7f2deb3d6fa0
2024-06-18T02:38:10.171806Z TRACE rdkafka::consumer::stream_consumer: Shut down stream consumer wake loop: 0x7f2deb345000
2024-06-18T02:38:10.171819Z TRACE rdkafka::consumer::base_consumer: Destroying consumer: 0x7f2deb342800
2024-06-18T02:38:12.727404Z TRACE rdkafka::consumer::base_consumer: Consumer destroyed: 0x7f2deb342800
2024-06-18T02:38:12.727429Z TRACE rdkafka::util: Destroying client: 0x7f2deb342800
2024-06-18T02:38:12.729419Z TRACE rdkafka::util: Destroyed client: 0x7f2deb342800
2024-06-18T02:38:12.729499Z TRACE rdkafka::consumer::stream_consumer: Shut down stream consumer wake loop: 0x7f2deb342800
-----------------------------------------------------------------------------------------------------------------------
Validated

I re-ran it, and it tried to rebalance again:

2024-06-18T02:38:14.280403Z  INFO vector::topology::builder: Healthcheck disabled.
~ Health check disabled for "kafka_json"
2024-06-18T02:38:14.280441Z TRACE vector::validate: Healthcheck for kafka_json done.
2024-06-18T02:38:14.280497Z TRACE tower::buffer::worker: worker polling for next message
2024-06-18T02:38:14.280510Z TRACE tower::buffer::worker: buffer already closed
2024-06-18T02:38:14.280541Z TRACE rdkafka::producer::base_producer: Destroy ThreadedProducer
2024-06-18T02:38:14.280574Z TRACE tower::buffer::worker: worker polling for next message
2024-06-18T02:38:14.280583Z TRACE rdkafka::producer::base_producer: Stopping polling
2024-06-18T02:38:14.280587Z TRACE tower::buffer::worker: buffer already closed
2024-06-18T02:38:14.280596Z TRACE rdkafka::producer::base_producer: Waiting for polling thread termination
2024-06-18T02:38:14.280823Z TRACE rdkafka::producer::base_producer: Polling thread loop started
2024-06-18T02:38:14.380910Z TRACE rdkafka::producer::base_producer: Polling thread loop terminated
2024-06-18T02:38:14.381054Z TRACE rdkafka::producer::base_producer: Polling stopped
2024-06-18T02:38:14.381068Z TRACE rdkafka::producer::base_producer: ThreadedProducer destroyed
2024-06-18T02:38:14.381133Z TRACE rdkafka::util: Destroying client: 0x7f6787947800
2024-06-18T02:38:14.382983Z TRACE rdkafka::util: Destroyed client: 0x7f6787947800
2024-06-18T02:38:14.383028Z TRACE rdkafka::producer::base_producer: Destroy ThreadedProducer
2024-06-18T02:38:14.383039Z TRACE rdkafka::producer::base_producer: Stopping polling
2024-06-18T02:38:14.383050Z TRACE rdkafka::producer::base_producer: Waiting for polling thread termination
2024-06-18T02:38:14.460695Z TRACE rdkafka::producer::base_producer: Polling thread loop terminated
2024-06-18T02:38:14.460847Z TRACE rdkafka::producer::base_producer: Polling stopped
2024-06-18T02:38:14.460867Z TRACE rdkafka::producer::base_producer: ThreadedProducer destroyed
2024-06-18T02:38:14.460939Z TRACE rdkafka::util: Destroying client: 0x7f6787946400
2024-06-18T02:38:14.463040Z TRACE rdkafka::util: Destroyed client: 0x7f6787946400
2024-06-18T02:38:14.463066Z TRACE rdkafka::util: Destroying queue: 0x7f67879d6fa0
2024-06-18T02:38:14.463078Z TRACE rdkafka::util: Destroyed queue: 0x7f67879d6fa0
2024-06-18T02:38:14.463090Z TRACE rdkafka::consumer::base_consumer: Destroying consumer: 0x7f6787942800
2024-06-18T02:38:14.463119Z TRACE rdkafka::consumer: Running pre-rebalance with Assign(TPL {ecs/6: offset=Invalid metadata="", error=Ok(()); ecs/7: offset=Invalid metadata="", error=Ok(()); ecs/8: offset=Invalid metadata="", error=Ok(()); ecs/9: offset=Invalid metadata="", error=Ok(()); ecs/10: offset=Invalid metadata="", error=Ok(()); ecs/11: offset=Invalid metadata="", error=Ok(())})
2024-06-18T02:40:18.353591Z  INFO vector::signal: Signal received. signal="SIGINT"
2024-06-18T02:40:18.556170Z  INFO vector::signal: Signal received. signal="SIGINT"
2024-06-18T02:40:19.059678Z  INFO vector::signal: Signal received. signal="SIGINT"

SIGINT can't stop it, I have to pkill -9 it from another window.

I don't think it's the TLS changes. I suspect something is causing #17497 to run during a validate, which doesn't feel right.

@jches
Contributor

jches commented Jun 18, 2024

Just to confirm: it's the vector validate command that deadlocks/hangs indefinitely here, right? Running this is not causing problems for any other running vector processes that are part of the same consumer group, but does prevent a service restart?

I agree this probably isn't related to TLS settings; however, I am not able to observe this behavior with a simple kafka source -> blackhole sink configuration on either 0.34.2 or 0.39.0. Are you able to provide any more info about your configuration, and/or pare down your config to a minimal reproduction of the issue?

@reyjrar
Author

reyjrar commented Jun 18, 2024

Correct, it only happens on validate.

Here's the config that's hanging most of the time. If I remove this from my VECTOR_CONFIG variable, the hang disappears.

[sources.kafka_ecs]
type = "kafka"
topics = ["ecs"]
group_id = "ecs-to-es"
bootstrap_servers = "host1.example.com:9093,host2.example.com:9093"
tls.enabled = true

[transforms.ecs_jsonify]
type = "remap"
inputs = [ "kafka_ecs" ]
source = """
. = parse_json!(.message)
"""

[sinks.elasticsearch_ecs]
type = "elasticsearch"
inputs = [ "ecs_jsonify" ]
endpoints = ["http://localhost:9200"]
bulk.index = "{{ @metadata.beat  }}-{{ @metadata.version  }}-{{ labels.location  }}-%Y.%m.%d"
pipeline = "ecs-geoip-info"
healthcheck.enabled = false

@jches
Contributor

jches commented Jun 18, 2024

With my simple test setup of kafka source -> blackhole sink it looks like vector validate doesn't stay connected long enough to trigger a consumer group rebalance, but if I do a kafka source -> kafka sink it does force the other connected consumers to rebalance.

Typically the source created for the vector validate command does not receive any partition assignment during these rebalances, and it exits cleanly. But sometimes it does get an assignment, and then I do see a deadlock with the logs below, similar to your initial report (notice how the consumer is destroyed, followed by a consumer rebalance callback being called). I'm not sure yet but this could be a bug in the underlying rdkafka library; I don't think vector versions prior to 0.34 hooked into these rebalance callbacks and would not be affected. Other consumers do continue to process normally after the rebalance (partitions assigned to the validate source get reassigned after it disconnects).

It seems to be easier to reproduce on 0.39.0 fwiw, maybe because of the bias change in that version, but I don't see it happening as reliably with 0.34.2, though it does still happen there.

2024-06-18T22:38:32.130605Z TRACE rdkafka::consumer::base_consumer: Destroying consumer: 0x7f57ec1c1400
2024-06-18T22:38:32.130631Z TRACE rdkafka::consumer: Running pre-rebalance with Assign(TPL {test_topic/0: ...snipped... })
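
For reference, a rough sketch of the kind of minimal kafka source -> kafka sink config used for this test (broker address, topic, and group names are placeholders, not taken from the report above):

[sources.in_kafka]
type = "kafka"
bootstrap_servers = "localhost:9092"
topics = ["test_topic"]
group_id = "repro-group"

[sinks.out_kafka]
type = "kafka"
inputs = ["in_kafka"]
bootstrap_servers = "localhost:9092"
topic = "test_topic_out"
encoding.codec = "json"

Running vector validate against a config like this while other consumers in the same group are connected is what triggers the rebalance; the hang only shows up on runs where the validate consumer happens to receive a partition assignment.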

@reyjrar
Author

reyjrar commented Jun 19, 2024

I can confirm it tends to happen more reliably on 0.39.0 than on 0.34.*. That mirrors my experience. It is SO comforting to know that this is not just a "me" thing. Thank you for looking into this. What I'm curious about is that I have:

healthchecks.enabled = false

So why are connections happening in the vector validate step at all?

I guess I'm expecting that to be just a syntax check, which is traditionally the purpose of ExecStartPre in systemd units. "Will it do something useful when it's running?" is generally left as an exercise for the reader.

Is the "right" way to fix this to add a vector check-syntax command and point ExecStartPre at that, so it's not making any connections?

@jches
Contributor

jches commented Jun 19, 2024

Looks like the healthcheck option is specific to sinks and unrelated to vector validate. By default, validate sets up the whole topology of sources, transforms, and sinks and checks the live configuration, connectivity, etc.

One solution for kafka sources would be to use a different/random consumer group name than the real config to avoid these spurious rebalances.
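
A hedged sketch of one way to express that, assuming vector's ${VAR:-default} environment-variable interpolation in config files (KAFKA_GROUP_SUFFIX is a made-up variable name for illustration):

# In the kafka source: the suffix is empty for the real service, and the
# caller sets it only for the validate run, so the validate consumer joins
# a throwaway group instead of the production one.
group_id = "ecs-to-es${KAFKA_GROUP_SUFFIX:-}"

Invoked for validation as, for example, KAFKA_GROUP_SUFFIX="-validate-$RANDOM" vector validate.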

For the vector pipelines I manage, we do not run the validate command at all in our production environment, specifically to avoid the rebalancing issue. We run unit tests ahead of time in CI, and validating again at deploy time just isn't necessary in our case. You can also use the vector validate --no-environment option, which gives something closer to a syntax-only check, I believe.
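
For the systemd setup discussed above, a rough sketch of how that could look as a drop-in override (the path is illustrative, and --no-environment is the flag mentioned in the previous paragraph):

# /etc/systemd/system/vector.service.d/override.conf (hypothetical location)
[Service]
# Clear the packaged pre-start command, then replace it with a validate run
# that skips environment checks, so no Kafka or Elasticsearch connections
# are made during pre-start.
ExecStartPre=
ExecStartPre=/usr/bin/vector validate --no-environment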

@jches
Contributor

jches commented Jun 19, 2024

With the fix in #20698, the consumer that's set up for validate won't attempt to join the consumer group. That solves both the spurious rebalance issue I had seen before and the deadlock, which was happening because the main source task isn't polled during validate.
