Mark workers associated with failed systemd units as stopped #23182

agrare · 2024-09-11T18:09:09Z

If we start a systemd unit and it fails, the miq_worker record associated with it can be left in "creating" without ever being cleaned up.

When we stop and clean up any failed systemd units, we should also mark any associated miq_worker records as stopped so that they can be cleaned up by the `clean_worker_records` method.

INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#cleanup_failed_systemd_services) Disabling failed unit files: [opentofu-runner.service]
INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#cleanup_failed_systemd_services) Stopping worker records for failed units: [opentofu-runner.service]
INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#clean_worker_records) SQL Record for Worker [OpentofuWorker] with ID: [71], PID: [], GUID: [46e4cdf4-22b8-426>

TODO

Live test on an appliance

Fixes ManageIQ/manageiq-providers-embedded_terraform#59

If a systemd unit has failed but there is still a miq_worker record associated with it, we should mark that worker record as stopped. The subsequent `clean_worker_records` method will then clean it up.
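
To make the intended flow concrete, here is a minimal standalone sketch of the behavior described above. It is not the code from this PR: `WorkerRecord`, `SystemdSketch`, the `unit_name` attribute, and the `generic.service` unit are hypothetical stand-ins, and the systemd calls themselves are left out. Only the method names `cleanup_failed_systemd_services` and `clean_worker_records` and the `opentofu-runner.service` / `OpentofuWorker` values are taken from the log lines above.

```ruby
# Simplified, standalone sketch; NOT the actual ManageIQ implementation.
# WorkerRecord and SystemdSketch are hypothetical stand-ins for miq_worker
# records and MiqServer::WorkerManagement::Systemd.
WorkerRecord = Struct.new(:id, :type, :unit_name, :status)

class SystemdSketch
  attr_reader :workers, :log

  def initialize(workers, failed_units)
    @workers      = workers
    @failed_units = failed_units
    @log          = []
  end

  # Disable the failed units, then mark any worker record tied to one of
  # them as stopped so the later cleanup pass can remove it.
  def cleanup_failed_systemd_services
    return if @failed_units.empty?

    @log << "Disabling failed unit files: [#{@failed_units.join(", ")}]"
    # (The real code would talk to systemd here; omitted in this sketch.)

    @log << "Stopping worker records for failed units: [#{@failed_units.join(", ")}]"
    @workers.each do |worker|
      worker.status = "stopped" if @failed_units.include?(worker.unit_name)
    end
  end

  # Remove the records of workers that are already marked stopped.
  def clean_worker_records
    stopped, @workers = @workers.partition { |w| w.status == "stopped" }
    stopped.each { |w| @log << "Removing record for worker [#{w.type}] with ID: [#{w.id}]" }
    stopped
  end
end

workers = [
  WorkerRecord.new(71, "OpentofuWorker", "opentofu-runner.service", "creating"),
  WorkerRecord.new(72, "MiqGenericWorker", "generic.service", "started"), # hypothetical healthy worker
]
manager = SystemdSketch.new(workers, ["opentofu-runner.service"])
manager.cleanup_failed_systemd_services
removed = manager.clean_worker_records
```

In this sketch the record stuck in "creating" is the only one removed; without the "mark as stopped" step, `clean_worker_records` would never see it as eligible for cleanup.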

miq-bot · 2024-09-27T14:55:40Z

Checked commits agrare/manageiq@2906f85~...728e223 with ruby 3.1.5, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
2 files checked, 0 offenses detected
Everything looks fine. 🏆

agrare · 2024-09-27T15:53:54Z

Okay, I ran a live test on a master appliance build with this applied. I enabled the embedded_terraform role first and set the container_image later, and confirmed that the failed workers are marked stopped and later deleted. Once the container_image setting is set properly, the worker pulls the correct image the next time it starts up. Taking out of WIP.

Fryguy · 2024-10-08T19:37:53Z

Backported to radjabov in commit e6e6c81.

commit e6e6c81e8cceafbbb2be8ee4852c8aaf8bf23867
Author: Jason Frey <fryguy9@gmail.com>
Date:   Fri Sep 27 16:04:07 2024 -0400

    Merge pull request #23182 from agrare/mark_workers_for_failed_units_stopped
    
    Mark workers associated with failed systemd units as stopped
    
    (cherry picked from commit de72e9e6b5d67e724113fd6852ec31867fada811)


agrare added the bug label Sep 11, 2024

agrare requested review from jrafanie and Fryguy as code owners September 11, 2024 18:09

miq-bot added the wip label Sep 11, 2024

agrare changed the title ~~[WIP] Mark workers associated with failed systemd units as stopped~~ Mark workers associated with failed systemd units as stopped Sep 11, 2024

agrare removed the wip label Sep 11, 2024

agrare changed the title ~~Mark workers associated with failed systemd units as stopped~~ [WIP] Mark workers associated with failed systemd units as stopped Sep 18, 2024

agrare added the wip label Sep 18, 2024

agrare added 2 commits September 27, 2024 10:52

Fix method name typo

2906f85

Stop any worker records for failed systemd units

728e223

If a systemd unit has failed but there is still a miq_worker record associated with it, we should mark that worker record as stopped. The subsequent `clean_worker_records` method will then clean it up.

agrare force-pushed the mark_workers_for_failed_units_stopped branch from b1e30ad to 728e223 Compare September 27, 2024 14:52

agrare changed the title ~~[WIP] Mark workers associated with failed systemd units as stopped~~ Mark workers associated with failed systemd units as stopped Sep 27, 2024

agrare added core/workers and removed wip labels Sep 27, 2024

Fryguy approved these changes Sep 27, 2024


Fryguy merged commit de72e9e into ManageIQ:master Sep 27, 2024
8 checks passed

Fryguy self-assigned this Sep 27, 2024

agrare added the radjabov/yes? label Sep 30, 2024

agrare deleted the mark_workers_for_failed_units_stopped branch September 30, 2024 17:39

Fryguy added radjabov/yes and removed radjabov/yes? labels Oct 8, 2024

Fryguy added a commit that referenced this pull request Oct 8, 2024

Merge pull request #23182 from agrare/mark_workers_for_failed_units_s…

e6e6c81

…topped Mark workers associated with failed systemd units as stopped (cherry picked from commit de72e9e)

Fryguy added radjabov/backported and removed radjabov/yes labels Oct 8, 2024
