Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure ems workers are killed by their server/orchestrator pod #20290

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 24 additions & 2 deletions app/models/ext_management_system.rb
Original file line number Diff line number Diff line change
Expand Up @@ -622,11 +622,33 @@ def destroy_queue
task.id
end

def ems_workers
MiqWorker.find_alive.where(:queue_name => queue_name)
jrafanie marked this conversation as resolved.
Show resolved Hide resolved
end

def wait_for_ems_workers_removal
return if Rails.env.test?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how else to test this if this method will loop and wait for the worker rows to be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could stub the #kill_async and have it execute it directly instead of putting test specifics in the main code...might play with this later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a possibility. The problems I had in the 2 ext_management_system_spec.rb examples changed in this PR:

  • I'd need to stub #kill_async to delete the rows
  • I'd need to either stub using any_instance since destroy queues for the Ems and deliver would get a different Ems unless you any_instance or you'd need to stub deliver to get your specific Ems instance with the stubbed method.


quiesce_loop_timeout = ::Settings.server.worker_monitor.quiesce_loop_timeout || 5.minutes
worker_monitor_poll = (::Settings.server.worker_monitor.poll || 1.second).to_i_with_method
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I grabbed these values from the worker quiesce code.

kill_ems_workers_started_on = Time.now.utc

loop do
# killed workers will have their row removed, so we wait for this
break unless ems_workers.exists?
break if (Time.now.utc - kill_ems_workers_started_on) > quiesce_loop_timeout

sleep worker_monitor_poll
end
end

def destroy(task_id = nil)
disable!(:validate => false) if enabled?

# kill workers
MiqWorker.find_alive.where(:queue_name => queue_name).each(&:kill)
# Async kill each ems worker and wait until their row is removed before we delete
# the ems/managers to ensure a worker doesn't recreate the ems/manager.
ems_workers.each(&:kill_async)
wait_for_ems_workers_removal

_log.info("Destroying #{child_managers.count} child_managers")
child_managers.destroy_all
Expand Down
15 changes: 15 additions & 0 deletions app/models/miq_worker.rb
Original file line number Diff line number Diff line change
Expand Up @@ -394,6 +394,21 @@ def kill
destroy
end

# kill needs be done by the worker's orchestrator pod / server process
# TODO: Note, stop is async through the queue, while kill is sync. Should kill be async too?
# Also, this looks a lot like MiqServer#stop_worker_queue except stop_worker is called on the server row whereas
# we're calling kill on the worker row.
def kill_async
MiqQueue.put_unless_exists(
:class_name => self.class.name,
:instance_id => id,
:method_name => 'kill',
:queue_name => 'miq_server',
:server_guid => miq_server.guid,
:zone => miq_server.my_zone
)
end

def kill_process
if containerized_worker?
delete_container_objects
Expand Down
13 changes: 10 additions & 3 deletions spec/models/ext_management_system_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -666,8 +666,15 @@

it "destroys an ems with active workers" do
ems = FactoryBot.create(:ext_management_system)
worker = FactoryBot.create(:miq_ems_refresh_worker, :queue_name => ems.queue_name, :status => "started")
worker = FactoryBot.create(:miq_ems_refresh_worker, :queue_name => ems.queue_name, :status => "started", :miq_server => EvmSpecHelper.local_miq_server)

ems.destroy

# Simulate another process delivering the worker kill message
queue_message = MiqQueue.order(:id).first
status, message, result = queue_message.deliver
queue_message.delivered(status, message, result)

expect(ExtManagementSystem.count).to eq(0)
expect(worker.class.exists?(worker.id)).to eq(false)
end
Expand Down Expand Up @@ -719,8 +726,8 @@
ems.destroy_queue

expect(MiqQueue.count).to eq(1)

deliver_queue_message
deliver_queue_message # ems destroy message
deliver_queue_message # worker kill message

expect(MiqQueue.count).to eq(0)
expect(ExtManagementSystem.count).to eq(0)
Expand Down
27 changes: 27 additions & 0 deletions spec/models/miq_worker_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -392,6 +392,33 @@ def check_has_required_role(worker_role_names, expected_result)
end
end

describe "#kill_async" do
let!(:remote_server) { EvmSpecHelper.remote_guid_miq_server_zone[1] }
let!(:local_server) { EvmSpecHelper.local_guid_miq_server_zone[1] }

it "queues local worker to local server" do
worker = FactoryBot.create(:miq_worker, :miq_server => local_server)
worker.kill_async
msg = MiqQueue.where(:method_name => 'kill', :class_name => worker.class.name).first
expect(msg).to have_attributes(
:queue_name => 'miq_server',
:server_guid => local_server.guid,
:zone => local_server.my_zone
)
end

it "queues remote worker to remote server" do
worker = FactoryBot.create(:miq_worker, :miq_server => remote_server)
worker.kill_async
msg = MiqQueue.where(:method_name => 'kill', :class_name => worker.class.name).first
expect(msg).to have_attributes(
:queue_name => 'miq_server',
:server_guid => remote_server.guid,
:zone => remote_server.my_zone
)
end
end

describe "#stopping_for_too_long?" do
subject { @worker.stopping_for_too_long? }

Expand Down