Mark workers associated with failed systemd units as stopped #23182

Merged
Fryguy merged 2 commits into ManageIQ:master from agrare:mark_workers_for_failed_units_stopped
Sep 27, 2024

Conversation

agrare
Member

@agrare agrare commented Sep 11, 2024

If we start a systemd unit and it fails, the miq_worker record associated with it can be left in "creating" without ever being cleaned up.

When we stop and clean up any failed systemd units, we should also mark any associated miq_worker records as stopped so that they can be cleaned up by the clean_worker_records method.

INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#cleanup_failed_systemd_services) Disabling failed unit files: [opentofu-runner.service]
INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#cleanup_failed_systemd_services) Stopping worker records for failed units: [opentofu-runner.service]
INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#clean_worker_records) SQL Record for Worker [OpentofuWorker] with ID: [71], PID: [], GUID: [46e4cdf4-22b8-426>
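
The two steps described above (mark workers for failed units as stopped, then let the record cleanup remove them) can be sketched in plain Ruby. This is a minimal in-memory illustration, not the code from this PR: the `Worker` struct, its `unit_name` field, and the `WorkerCleanupSketch` class are hypothetical stand-ins for the real ActiveRecord model and the `MiqServer::WorkerManagement::Systemd` methods, and the list of failed unit names is passed in by hand rather than queried from systemd.

```ruby
# Hypothetical stand-in for a miq_worker row; the real model is an ActiveRecord class.
# `unit_name` is an assumed field used here to link a worker to its systemd unit.
Worker = Struct.new(:id, :type, :unit_name, :status, keyword_init: true)

# Hypothetical in-memory sketch of the cleanup flow described in this PR.
class WorkerCleanupSketch
  attr_reader :workers

  def initialize(workers)
    @workers = workers
  end

  # Step 1: mark every worker tied to a failed unit as stopped.
  # In the real code the failed unit names would come from systemd.
  def stop_workers_for_failed_units(failed_unit_names)
    failed = workers.select { |w| failed_unit_names.include?(w.unit_name) }
    failed.each { |w| w.status = "stopped" }
    failed
  end

  # Step 2: drop records that are already stopped and return the removed ones.
  def clean_worker_records
    stopped, @workers = workers.partition { |w| w.status == "stopped" }
    stopped
  end
end

workers = [
  Worker.new(id: 71, type: "OpentofuWorker", unit_name: "opentofu-runner.service", status: "creating"),
  Worker.new(id: 72, type: "MiqGenericWorker", unit_name: "manageiq-generic.service", status: "started")
]

sketch = WorkerCleanupSketch.new(workers)
sketch.stop_workers_for_failed_units(["opentofu-runner.service"])
removed = sketch.clean_worker_records
```

Without step 1, the worker with ID 71 would stay in "creating" and step 2 would never pick it up, which is the stuck-record situation this PR addresses.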

TODO

  • Live test on an appliance

Fixes ManageIQ/manageiq-providers-embedded_terraform#59

@agrare agrare added the bug label Sep 11, 2024
@miq-bot miq-bot added the wip label Sep 11, 2024
@agrare agrare changed the title from "[WIP] Mark workers associated with failed systemd units as stopped" to "Mark workers associated with failed systemd units as stopped" Sep 11, 2024
@agrare agrare removed the wip label Sep 11, 2024
@agrare agrare changed the title from "Mark workers associated with failed systemd units as stopped" to "[WIP] Mark workers associated with failed systemd units as stopped" Sep 18, 2024
@agrare agrare added the wip label Sep 18, 2024
If a systemd unit has failed but there is still a miq_worker record
associated with it, we should mark that worker record as stopped.  It
will then be cleaned up by the subsequent `clean_worker_records` method.
@agrare agrare force-pushed the mark_workers_for_failed_units_stopped branch from b1e30ad to 728e223 Compare September 27, 2024 14:52
@miq-bot
Member

miq-bot commented Sep 27, 2024

Checked commits agrare/manageiq@2906f85~...728e223 with ruby 3.1.5, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
2 files checked, 0 offenses detected
Everything looks fine. 🏆

@agrare agrare changed the title from "[WIP] Mark workers associated with failed systemd units as stopped" to "Mark workers associated with failed systemd units as stopped" Sep 27, 2024
@agrare agrare added core/workers and removed wip labels Sep 27, 2024
@agrare
Member Author

agrare commented Sep 27, 2024

Okay, I ran a live test on a master appliance build with this applied. I enabled the embedded_terraform role first and set the container_image later, and confirmed that the failed workers are marked stopped and later deleted. Once the container_image setting is set properly, the worker pulls the correct image the next time it starts up. Taking out of WIP.

@Fryguy Fryguy merged commit de72e9e into ManageIQ:master Sep 27, 2024
8 checks passed
@Fryguy Fryguy self-assigned this Sep 27, 2024
@agrare agrare deleted the mark_workers_for_failed_units_stopped branch September 30, 2024 17:39
@Fryguy
Member

Fryguy commented Oct 8, 2024

Backported to radjabov in commit e6e6c81.

commit e6e6c81e8cceafbbb2be8ee4852c8aaf8bf23867
Author: Jason Frey <fryguy9@gmail.com>
Date:   Fri Sep 27 16:04:07 2024 -0400

    Merge pull request #23182 from agrare/mark_workers_for_failed_units_stopped
    
    Mark workers associated with failed systemd units as stopped
    
    (cherry picked from commit de72e9e6b5d67e724113fd6852ec31867fada811)

Fryguy added a commit that referenced this pull request Oct 8, 2024
…topped

Mark workers associated with failed systemd units as stopped

(cherry picked from commit de72e9e)
Development

Successfully merging this pull request may close these issues.

OpentofuWorker record stuck in "creating" even though service is failed
3 participants