Error occurring in the event catcher pod when messaging is enabled #607

Closed
abellotti opened this issue Aug 6, 2020 · 4 comments

Errors are occurring in the 1-vmware-infra-event-catcher-* pod when messaging is enabled for the pods.

Steps to reproduce:

  1. Deploy a ManageIQ project with the Messaging service (kafka/zookeeper) enabled:

apiVersion: manageiq.org/v1alpha1
kind: ManageIQ
metadata:
  name: aab-miq
spec:
  applicationDomain: aab-miq.sampledomain.com
  httpdAuthenticationType: internal
  deployMessagingService: true

  2. In ManageIQ, create a VMware provider.
  3. Select the newly created provider, then run Configuration->Refresh Relationships and Power States.
  4. Browse the newly detected Virtual Machines.
  5. Select a Virtual Machine and perform a power operation, e.g. Power->Power On.
  6. Monitor the pods with oc get pods; you'll see the Error status on the 1-vmware-infra-event-catcher pod, then eventually CrashLoopBackOff:
NAME                                              READY   STATUS             RESTARTS   AGE
1-event-handler-6bc66cbb9f-9xgc9                  1/1     Running            0          59m
1-generic-7cfbd96c48-pc6mn                        1/1     Running            0          59m
1-generic-7cfbd96c48-rd2tj                        1/1     Running            0          59m
1-priority-74f6c4c668-5pwqt                       1/1     Running            0          59m
1-priority-74f6c4c668-vvwwt                       1/1     Running            0          59m
1-remote-console-5796d54b65-q78x6                 1/1     Running            0          59m
1-reporting-58dfbcd48-b846j                       1/1     Running            0          59m
1-reporting-58dfbcd48-ftz62                       1/1     Running            0          59m
1-schedule-5dbfcbfcb-282v5                        1/1     Running            0          59m
1-ui-6f65b8d974-m9crb                             1/1     Running            0          59m
1-vmware-infra-event-catcher-2-57cddffd5b-tkqdm   0/1     CrashLoopBackOff   9          47m
1-vmware-infra-operations-2-765754d557-sdw5r      1/1     Running            0          47m
1-vmware-infra-refresh-2-8f9fd74b9-pbnbm          1/1     Running            0          47m
1-web-service-5f69fcf65-jgvql                     1/1     Running            0          59m
httpd-858c5894c7-jd5nm                            1/1     Running            0          1h
kafka-6867548dc-wsz2c                             1/1     Running            0          1h
memcached-84987cbdf5-9gnnv                        1/1     Running            0          1h
orchestrator-854978d8d5-xt667                     1/1     Running            0          1h
postgresql-56f89c8fc7-g7ppc                       1/1     Running            0          1h
zookeeper-587bb95864-2rpl4                        1/1     Running            0          1h

The oc logs output for 1-vmware-infra-event-catcher-2-57cddffd5b-tkqdm shows the following error when failing to send an event:

{"@timestamp":"2020-08-06T15:52:08.184519 ","hostname":"1-vmware-infra-event-catcher-2-57cddffd5b-tkqdm","pid":8,"tid":"2b12b4da7964","level":"err","message":"EMS [10.8.96.135] as [abellott@redhat.com] ID [15] PID [8] GUID [e758e408-53ab-4c77-bd04-9129ab5574a4] An error has occurred during work processing: Failed to send messages to manageiq.ems-events/0
/opt/manageiq/manageiq-gemset/gems/ruby-kafka-1.2.0/lib/kafka/producer.rb:438:in `deliver_messages_with_retries'
/opt/manageiq/manageiq-gemset/gems/ruby-kafka-1.2.0/lib/kafka/producer.rb:261:in `block in deliver_messages'
/opt/manageiq/manageiq-gemset/gems/activesupport-5.2.4.3/lib/active_support/notifications.rb:170:in `instrument'
/opt/manageiq/manageiq-gemset/gems/ruby-kafka-1.2.0/lib/kafka/instrumenter.rb:21:in `instrument'
/opt/manageiq/manageiq-gemset/gems/ruby-kafka-1.2.0/lib/kafka/producer.rb:254:in `deliver_messages'
/opt/manageiq/manageiq-gemset/gems/manageiq-messaging-0.1.6/lib/manageiq/messaging/kafka/common.rb:48:in `raw_publish'
/opt/manageiq/manageiq-gemset/gems/manageiq-messaging-0.1.6/lib/manageiq/messaging/kafka/topic.rb:8:in `publish_topic_impl'
/opt/manageiq/manageiq-gemset/gems/manageiq-messaging-0.1.6/lib/manageiq/messaging/client.rb:174:in `publish_topic'
/var/www/miq/vmdb/app/models/ems_event.rb:271:in `publish_event'
/var/www/miq/vmdb/app/models/ems_event.rb:28:in `add_queue'
/opt/manageiq/manageiq-gemset/bundler/gems/manageiq-providers-vmware-32e5783fc77c/app/models/manageiq/providers/vmware/infra_manager/event_catcher/runner.rb:74:in `queue_event'
/opt/manageiq/manageiq-gemset/bundler/gems/manageiq-gems-pending-f205ad57017e/lib/gems/pending/util/duplicate_blocker/dedup_handler.rb:73:in `[]'
/opt/manageiq/manageiq-gemset/bundler/gems/manageiq-gems-pending-f205ad57017e/lib/gems/pending/util/duplicate_blocker/dedup_handler.rb:73:in `handle'
/opt/manageiq/manageiq-gemset/bundler/gems/manageiq-gems-pending-f205ad57017e/lib/gems/pending/util/duplicate_blocker.rb:76:in `block (2 levels) in dedup_instance_method'
/var/www/miq/vmdb/app/models/manageiq/providers/base_manager/event_catcher/runner.rb:124:in `process_event'
/var/www/miq/vmdb/app/models/manageiq/providers/base_manager/event_catcher/runner.rb:186:in `block in process_events'
/var/www/miq/vmdb/app/models/manageiq/providers/base_manager/event_catcher/runner.rb:184:in `each'
/var/www/miq/vmdb/app/models/manageiq/providers/base_manager/event_catcher/runner.rb:184:in `process_events'
/var/www/miq/vmdb/app/models/manageiq/providers/base_manager/event_catcher/runner.rb:203:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:256:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:253:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:253:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:113:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:95:in `start'
/var/www/miq/vmdb/lib/workers/bin/run_single_worker.rb:113:in `\u003cmain\u003e' Worker exiting."}

agrare commented Aug 6, 2020

It appears that you have kafka env vars set up so that the event catcher will publish events to the manageiq.ems-events topic, but something about that kafka broker is not set up properly.
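
A quick way to confirm which messaging settings the worker actually sees is to dump the pod's environment while it is (briefly) running between restarts; the pod name is taken from the listing above, and MESSAGING_HOSTNAME is the variable discussed later in this issue:

  oc exec 1-vmware-infra-event-catcher-2-57cddffd5b-tkqdm -- env | grep -i messaging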

abellotti commented Aug 12, 2020

Thanks @agrare for the pointers on running the manageiq-messaging client on the pod; that allowed me to debug further.

While trying to test #604 on the kafka side, I was seeing the same errors mentioned here when deploying our own Kafka as part of the project.

After configuring an external kafka server, I was able to publish messages to it from my notebook, but doing the same from a pod I would get the failure mentioned here, even though the Kafka server is reachable by IP address.

The problem occurs when exercising ruby-kafka directly, so it is below our manageiq-messaging gem.

Testing in a pod as follows:

$ cd /var/www/miq/vmdb
$ source ./container_env
$ rails console

# The ruby-kafka gem is required as "kafka"
require "kafka"

# Point a client at the external kafka broker
kclient = Kafka.new(["<kafka_server_ip>:9092"], :client_id => "miq-pod", :logger => Rails.logger)

# Publish a test message to the topic the event catcher uses
kclient.deliver_message("Hello World!!!", :topic => "manageiq.ems-events")

You'll notice in the Rails log that the failed connection to kafka is NOT to <kafka_server_ip> but to the FQDN of the server instead.

The issue here is that what is specified in MESSAGING_HOSTNAME is NOT what the ruby-kafka client ultimately connects to. The client only uses MESSAGING_HOSTNAME as the bootstrap address; the broker's metadata response then points it at the KAFKA_ADVERTISED_HOST_NAME configured on the kafka server, and that is where the subsequent connections go.

My notebook had the FQDN of the kafka server in its hosts file, the pod did not, which explains why it worked in one and not the other.
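
One way to see which host name the broker is actually advertising is to query its metadata, e.g. with kcat/kafkacat if it is available where you are testing (the broker address placeholder matches the one used above):

  kcat -L -b <kafka_server_ip>:9092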

The deployment recommendation would be to start the kafka server with the advertised host name set to a reachable FQDN (or just the IP address for dev), e.g.:

  bin/kafka-server-start.sh config/server.properties  \
	  --override advertised.host.name=<kafka_server_ip>	\
	  --override log.segment.bytes=10485760 		        \
	  --override log.retention.bytes=10485760

Then I placed that same advertised host name (in this case <kafka_server_ip>) as the MESSAGING_HOSTNAME
in the kafka-secrets for the external kafka service, as follows:

  oc create secret generic kafka-secrets      \
    --from-literal=hostname=<kafka_server_ip> \
    --from-literal=username=root              \
    --from-literal=password=<password>

That did the trick.
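
As a side check, the hostname stored in the secret can be read back to confirm it matches the broker's advertised host name (the jsonpath/base64 pipeline here is just one way to inspect it):

  oc get secret kafka-secrets -o jsonpath='{.data.hostname}' | base64 -d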

NAME                                              READY   STATUS    RESTARTS   AGE
1-event-handler-5cd44cd874-5vl6k                  1/1     Running   0          29m
1-generic-74f4c7c879-bkb97                        1/1     Running   0          29m
1-generic-74f4c7c879-p4qvc                        1/1     Running   0          29m
1-priority-6dc8c96465-glxp8                       1/1     Running   0          29m
1-priority-6dc8c96465-nmqmm                       1/1     Running   0          29m
1-remote-console-6d7f555668-qwbvg                 1/1     Running   0          29m
1-reporting-7f454f9699-2kw7v                      1/1     Running   0          29m
1-reporting-7f454f9699-nvzwb                      1/1     Running   0          29m
1-schedule-577ddb7d69-29l7t                       1/1     Running   0          29m
1-ui-7c558f7cc7-4cdh9                             1/1     Running   0          29m
1-vmware-infra-event-catcher-2-86cbc5bb99-frzvd   1/1     Running   0          25m
1-vmware-infra-operations-2-648bb7895d-h7q46      1/1     Running   0          25m
1-vmware-infra-refresh-2-57df44d89b-jnzmz         1/1     Running   0          25m
1-web-service-5fd999986-72rb2                     1/1     Running   0          29m
httpd-858c5894c7-rq5wx                            1/1     Running   0          35m
memcached-84987cbdf5-shkrn                        1/1     Running   0          35m
orchestrator-84c5678dfb-7xsqc                     1/1     Running   0          35m
postgresql-56f89c8fc7-tnt4m                       1/1     Running   0          35m

Of course there is no kafka pod here, since I was testing with an external kafka server.

While this resolved the connectivity to an external kafka service, it might be the same issue for the local kafka pod. We probably need to override KAFKA_ADVERTISED_HOST_NAME on that server to "kafka", since that is the default hostname we connect to.
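
A minimal sketch of that override, assuming the in-project kafka runs from a deployment named kafka and picks up its advertised host name from the KAFKA_ADVERTISED_HOST_NAME environment variable (both assumptions on my part):

  oc set env deployment/kafka KAFKA_ADVERTISED_HOST_NAME=kafka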

miq-bot commented Mar 6, 2023

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

miq-bot added the stale label Mar 6, 2023
miq-bot closed this as completed Jun 12, 2023

miq-bot commented Jun 12, 2023

This issue has been automatically closed because it has not been updated for at least 3 months.

Feel free to reopen this issue if it is still valid.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.
