
Ensure ems workers are killed by their server/orchestrator pod #20290

Conversation

jrafanie (Member) commented Jun 18, 2020

Fixes #20288

Previously, direct calls to ems#destroy assumed you were calling it locally, on the same server as each of the ems's workers, and would fail to find the pid if not local. Additionally, in pods, only the worker's orchestrator pod has permission to kill a worker pod, so this would fail with permission errors such as:

deployments.apps "1-xyz-event-catcher-1" is forbidden: User "abc" cannot patch resource "deployments" in API group "apps" in the namespace "123" for PATCH https:...]

The ems.destroy_queue method calls _queue_task from the AsyncDeleteMixin, which doesn't specify the server_guid or queue_name, so a UI request to delete the ems could be initiated on a UI appliance and picked up by that same appliance, which isn't where the ems's worker processes are running; it would then call kill on workers that don't exist locally.

Now, we queue each worker's kill method with the queue_name 'miq_server', so it's handled by the server "process" on appliances or by the orchestrator in pods, and with the server_guid of the worker's server, since an ems's workers can be on different servers.
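Conceptually, the queueing now looks something like the sketch below. This is hypothetical code, not the PR diff verbatim: ems_workers is a placeholder for however the ems's worker rows are found, while the MiqQueue.put option keys follow its usual conventions.

# Hypothetical sketch of the fix described above. Each worker's kill is queued
# on the 'miq_server' queue, targeted at that worker's own server, so the
# server process (or the orchestrator pod) performs the kill.
ems_workers.each do |worker| # ems_workers: placeholder for the worker lookup
  MiqQueue.put(
    :class_name  => worker.class.name,
    :instance_id => worker.id,
    :method_name => "kill",
    :queue_name  => "miq_server",          # picked up by the server/orchestrator
    :server_guid => worker.miq_server.guid # workers can live on different servers
  )
end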

jrafanie (Member Author)

@carbonin I'll need help testing this in pods. 🤣
It works in appliances.

jrafanie force-pushed the kill_ems_workers_on_their_server_before_ems_destroy branch from 3fb43b2 to 9dde30e June 18, 2020 20:45
jrafanie force-pushed the kill_ems_workers_on_their_server_before_ems_destroy branch from 9dde30e to aadc622 June 19, 2020 20:38
miq-bot (Member) commented Jun 19, 2020

Checked commit jrafanie@aadc622 with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
4 files checked, 0 offenses detected
Everything looks fine. 🏆

end

def wait_for_ems_workers_removal
return if Rails.env.test?
jrafanie (Member Author)
I'm not sure how else to test this, given that this method loops and waits for the worker rows to be removed.

Member
I wonder if we could stub #kill_async and have it execute directly instead of putting test specifics in the main code... might play with this later

jrafanie (Member Author)
Yeah, that's a possibility. The problems I had in the two ext_management_system_spec.rb examples changed in this PR (see the sketch below):

  • I'd need to stub #kill_async to delete the rows.
  • I'd need to stub using any_instance, since destroy queues for the Ems and deliver would get a different Ems instance, or else stub deliver to get my specific Ems instance with the stubbed method.
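For illustration, the stubbing approach under discussion might look roughly like this in a spec. This is a hypothetical sketch, not code from this PR; the kill_async receiver, the block arity, and the "ems_#{ems.id}" queue_name convention are all assumptions.

# Hypothetical spec sketch: stub kill_async to remove the worker row directly,
# so the wait loop's condition is satisfied without real processes.
allow_any_instance_of(MiqWorker).to receive(:kill_async) do |worker|
  worker.destroy # simulate the server killing the worker and removing its row
end

ems.destroy
expect(MiqWorker.where(:queue_name => "ems_#{ems.id}")).to be_empty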

return if Rails.env.test?

quiesce_loop_timeout = ::Settings.server.worker_monitor.quiesce_loop_timeout || 5.minutes
worker_monitor_poll = (::Settings.server.worker_monitor.poll || 1.second).to_i_with_method
jrafanie (Member Author)
I grabbed these values from the worker quiesce code.
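For context, the loop those settings feed plausibly has a shape like the following. This is a hypothetical sketch, since the full diff isn't quoted in this thread; ems_workers is a placeholder for however the ems's worker rows are looked up.

# Hypothetical shape of wait_for_ems_workers_removal's polling loop,
# reusing the timeout and poll interval from the worker quiesce settings.
quiesce_loop_timeout = ::Settings.server.worker_monitor.quiesce_loop_timeout || 5.minutes
worker_monitor_poll  = (::Settings.server.worker_monitor.poll || 1.second).to_i_with_method

started_on = Time.now.utc
loop do
  break if ems_workers.empty?                                 # rows removed: workers are gone
  break if (Time.now.utc - started_on) > quiesce_loop_timeout # give up after the quiesce timeout
  sleep(worker_monitor_poll)
end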

jrafanie (Member Author)

@agrare @carbonin I think this is ready to go. I tested in an upstream appliance and in pods by monkey patching the queueing code in the console, verifying that calling Ems#destroy queues a message that is targeted to, and picked up successfully by, the orchestrator, which then kills the workers, modulo the problems noted below [1][2].

[1] Event catchers for amazon are not killed immediately (it seems like aws might be rescuing Exception or trapping signals).
[2] The refresh worker for amazon, which has multiple ems queue names, doesn't get killed immediately because we're only looking for a singular ems queue name. These workers exit after the managers and ems are destroyed, so it's possible that a refresh puts the ems/managers back.
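The console monkey patch could have been as simple as wrapping MiqQueue.put to print the targeting options before queueing; the following is a hypothetical reconstruction, since the exact patch isn't shown in this thread.

# Hypothetical console snippet: wrap MiqQueue.put to confirm that Ems#destroy
# targets the 'miq_server' queue with each worker's server_guid.
class << MiqQueue
  alias_method :orig_put, :put
  def put(options)
    # print where the message is targeted, then queue it normally
    puts "queue_name=#{options[:queue_name]} server_guid=#{options[:server_guid]} method_name=#{options[:method_name]}"
    orig_put(options)
  end
end

ems.destroy # should show kill messages on 'miq_server' with the workers' server_guids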

carbonin (Member)

@agrare Look good to you?

agrare (Member) left a comment

👍 LGTM


carbonin self-assigned this Jun 24, 2020
carbonin merged commit 9600648 into ManageIQ:master Jun 24, 2020
jrafanie deleted the kill_ems_workers_on_their_server_before_ems_destroy branch June 24, 2020 14:41
simaishi pushed a commit that referenced this pull request Jun 25, 2020
…ver_before_ems_destroy

Ensure ems workers are killed by their server/orchestrator pod

(cherry picked from commit 9600648)
simaishi (Contributor)

Jansa backport details:

$ git log -1
commit 167fc6f6d1d4e0a76c184d023d13f036a19dbfe1
Author: Nick Carboni <ncarboni@redhat.com>
Date:   Wed Jun 24 10:41:38 2020 -0400

    Merge pull request #20290 from jrafanie/kill_ems_workers_on_their_server_before_ems_destroy

    Ensure ems workers are killed by their server/orchestrator pod

    (cherry picked from commit 9600648fdbd0803809742fbf33f25c69b2c99f06)
