
[RFC] EmbeddedAnsible with ansible-runner-based implementation #45

Closed
Fryguy opened this issue Apr 16, 2019 · 10 comments · Fixed by ManageIQ/manageiq#18687

Comments

@Fryguy
Member

Fryguy commented Apr 16, 2019

Architecture

General approach

The current AWX implementation works by creating a provider that talks to an AWX instance and uses the provider refresh to pull data into the database. CRUD operations on AWX objects go through the provider API, where the object is created in AWX and then brought in via EMS refresh. After that, callers use the ManageIQ models to do whatever they need to with the data.

As such, all of the ManageIQ callers use the provider API as an abstraction layer, and we can take advantage of that. Instead of having provider CRUD operations go to a provider, we can write the data directly into the database tables as if a "refresh" had occurred immediately.
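As a rough illustration (class and method names follow the existing embedded-ansible conventions, but the body is a sketch rather than the PR's actual code), a "create" operation can simply write the row itself instead of calling out to AWX:

```ruby
class ManageIQ::Providers::EmbeddedAnsible::AutomationManager::ConfigurationScriptSource < ConfigurationScriptSource
  # Hypothetical sketch: the provider "CRUD" entry point writes straight to the
  # configuration_script_sources table -- no AWX API call, no EMS refresh needed.
  def self.create_in_provider(manager_id, params)
    create!(params.merge(:manager_id => manager_id)) # the record itself is the source of truth
  end
end
```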

Repositories

A repository is created as a ManageIQ::Providers::EmbeddedAnsible::AutomationManager::ConfigurationScriptSource (< ConfigurationScriptSource). For the implementation in this PR, the git repos are cloned into Rails.root.join("tmp/git_repos/:id"). This works great for a single appliance, but will not work as well for federated appliances, nor for appliances that can't access the internet directly. As such, a different design is needed, which is described below in the git repo management section.

Once the repository is cloned, the playbooks are each synced as a ManageIQ::Providers::EmbeddedAnsible::AutomationManager::Playbook (< ConfigurationScriptPayload < ConfigurationScriptBase, table name configuration_scripts). In this PR I've also pulled in the "name" attribute as the playbook description, though I'm not sure if that is correct.
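A minimal sketch of that sync step, assuming rugged for the clone and treating every top-level *.yml file as a playbook (paths and helper names are illustrative, not the PR's actual code):

```ruby
require "rugged"

def sync_repository(source) # source is a ConfigurationScriptSource
  repo_dir = Rails.root.join("tmp", "git_repos", source.id.to_s)
  Rugged::Repository.clone_at(source.scm_url, repo_dir.to_s) unless Dir.exist?(repo_dir)

  Dir.glob(repo_dir.join("*.yml")).each do |path|
    ManageIQ::Providers::EmbeddedAnsible::AutomationManager::Playbook.find_or_create_by!(
      :configuration_script_source => source,
      :manager_id                  => source.manager_id,
      :name                        => File.basename(path)
    )
  end
end
```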

Service Template

When designing a service, the service template is saved as a ManageIQ::Providers::EmbeddedAnsible::AutomationManager::ConfigurationScript, which is a subclass of ConfigurationScript, which in turn is a subclass of ConfigurationScriptBase (table name configuration_scripts).

CONFUSION NOTE: Both service templates and playbooks are stored in the same table, but with different subclasses and different column usage. Additionally confusing: unlike playbooks, which get a subclass named after the native term, the class here is ConfigurationScript instead of the native term JobTemplate, yet some of the relationships still use the term job_template.

For the purposes of this PoC, I've stored some of the options for the service template in the variables column, but I don't believe that is the correct way to do it. We will have to go back to the original design to see where the Tower provider stores those values during refresh.

Service execute

When an ansible service template is ordered, a ServiceTemplateProvisionRequest (< MiqRequest) is started, which goes through automate, and ultimately an instance of a ServiceAnsiblePlaybook (< Service) is executed. In the general Service flow there are two main methods that need to be implemented: execute and check_completed. In the execute method, a ManageIQ::Providers::EmbeddedAnsible::AutomationManager::Job (< OrchestrationStack) is created as a resource for this service and "launched", moving on to the check_completed step.
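Roughly, the execute side looks like the following (method bodies are illustrative; launch_runner_job is a placeholder for whatever actually kicks off the workflow described in the next section):

```ruby
class ServiceAnsiblePlaybook < Service
  def execute(action)
    stack = ManageIQ::Providers::EmbeddedAnsible::AutomationManager::Job.create!(
      :name => "#{name} (#{action})"
    )
    add_resource!(stack, :name => action) # the OrchestrationStack resource for this service
    launch_runner_job(stack, action)      # hypothetical helper that starts the AnsibleRunnerWorkflow
  end
end
```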

Launching ansible-runner

For launching ansible-runner, we are using the ManageIQ::Providers::AnsibleRunnerWorkflow class, which will eventually use the Ansible::Runner helper class. (Note: this workflow class was created as a helper for provider authors to create ansible-based operations; however, the code itself is not provider specific, and it should be moved out of the providers namespace and into the Ansible::Runner namespace instead.)

CONFUSION NOTE: The workflow class is a subclass of ::Job, which is our generic state machine using MiqTasks. This is completely unrelated to ManageIQ::Providers::EmbeddedAnsible::AutomationManager::Job, which is just a resource representation for the service.

The AnsibleRunnerWorkflow, being a self-contained Job, will launch ansible-runner with JSON output, asynchronously poll whether the ansible-runner execution has completed, and, once it detects completion, grab the results, store them in the MiqTask context, and clean up the ansible-runner execution temp directory.
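Conceptually, the workflow's state machine does something like the following (the state names, the Ansible::Runner call signature, and the runner_running?/runner_stdout helpers are assumptions for illustration, not the class's actual API):

```ruby
class AnsibleRunnerWorkflow < ::Job
  def start
    # Assumed call signature; the real Ansible::Runner API may differ.
    response = Ansible::Runner.run_async(env_vars, extra_vars, playbook_path)
    context[:runner_base_dir] = response.base_dir # persisted so later signals can find this run
    queue_signal(:poll_runner)
  end

  def poll_runner
    if runner_running?(context[:runner_base_dir]) # hypothetical helper
      queue_signal(:poll_runner, :deliver_on => 1.minute.from_now)
    else
      context[:ansible_runner_stdout] = runner_stdout(context[:runner_base_dir]) # the JSON event lines
      FileUtils.rm_rf(context[:runner_base_dir])  # clean up the runner temp dir
      queue_signal(:finish)
    end
  end
end
```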

Service check_completed

In the meantime, the check_completed step of the ServiceAnsiblePlaybook is run periodically. In this implementation, the MiqTask associated with the AnsibleRunnerWorkflow is watched for completion. Once it has been marked as finished, the service can move on with its post-execution steps.
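A sketch of that polling check, assuming the service can look up the workflow's MiqTask (the miq_task_for helper is hypothetical):

```ruby
class ServiceAnsiblePlaybook < Service
  def check_completed(action)
    task = miq_task_for(action) # hypothetical lookup of the AnsibleRunnerWorkflow's MiqTask
    return [false, nil] unless task.state == MiqTask::STATE_FINISHED

    error = task.status == MiqTask::STATUS_OK ? nil : task.message
    [true, error] # [done?, error message]
  end
end
```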

Services page

The services page shows the details of the ServiceAnsiblePlaybook, and the user can drill into the provision details. One of those details is the ansible stdout. In the AWX-based implementation, this was one of the few places where the database records were not used; instead, an asynchronous call would be made to AWX directly to fetch the stdout on demand. In the new ansible-runner design we don't have that option. For now, this implementation happens to have the information already stored in the AnsibleRunnerWorkflow's associated MiqTask, and since we have a relationship between the ServiceAnsiblePlaybook and the MiqTask, we can get the data directly from the database. We may not want to store this information in the MiqTask permanently, so a better design might be needed, which I'll elaborate on in the Ansible stdout section.

The stdout is extracted from the stored JSON records; however, it has ANSI escape codes for terminal colors embedded. In the previous implementation, one could ask AWX for the HTML version, but we don't have that in this implementation. So, instead, we use the terminal ruby gem, which converts the raw terminal output to HTML, replacing ANSI escape sequences with CSS classes. For this PoC, I've used the default CSS file that comes with the terminal gem, which styles the HTML by wrapping it in a div and scoping the style to that wrapper div. We will likely want the UI team to have the freedom to style this directly, so instead we can forgo the built-in CSS in favor of styles directly in our ManageIQ stylesheets.
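For reference, the conversion itself is a one-liner with the terminal gem (the rendered output shown is approximate):

```ruby
require "terminal"

raw  = "\e[0;32mok: [localhost]\e[0m"
html = Terminal.render(raw)
# => something like "<div class=\"term-fg32\">ok: [localhost]</div>"
```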

Installing ansible-runner
  • On Mac
brew install ansible python
pip3 install ansible-runner
source /usr/local/Cellar/ansible/2.7.10/libexec/bin/activate && pip3 install psutil && deactivate
  • On Fedora/CentOS
sudo wget -O /etc/yum.repos.d/ansible-runner.repo https://releases.ansible.com/ansible-runner/ansible-runner.el7.repo
sudo dnf install ansible-runner
Git repo management

@mkanoor and I had started on a federated git repo management design back when we had the idea that the automate models would work better stored in git repos, allowing us to run them as of any point in time as well as giving us history tracking, auditing, and reverting capabilities.

The premise was that an appliance would be given the git_owner role, which would behave much like the db_owner role. This appliance would be allowed internet access and thus could clone from public locations like GitHub and/or private git instances. A record would be put into the git_repositories table, so that if we needed to fail over the appliance we could re-clone.

All other appliances, if they needed to access something about the git repository, would git clone/fetch from the appliance with the git_owner role. This would allow non-internet connected appliances to get at the data in an on-demand fashion.

Some of these classes already exist, such as the GitRepository, GitReference, GitBranch, and GitTag models, as well as the GitWorktree class which manages the on-disk repositories using the rugged gem.

The work that still needs to occur is to

  • complete these classes
  • expose the git protocol from the appliance, likely through Apache, but with some sort of server to server authentication (perhaps similar to how we do MiqServer.api_system_auth_token_for_region?)
  • have a way to identify the appliance with the git_owner role, likely in a similar fashion to MiqRegion#remote_ui_miq_server

Once these are completed, we can ensure a git repo by checking whether our on-disk clone exists: if it doesn't, we git clone from the git_owner appliance; if it exists but is not up to date (checked by comparing against the expected SHA stored in the git_repositories table), we git fetch from the git_owner appliance.
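A hedged sketch of that ensure step (attribute and helper names such as directory_name, expected_sha, and git_owner_url are assumptions about the eventual design, not existing code):

```ruby
require "rugged"

def ensure_git_repo(git_repository)
  path = git_repository.directory_name # on-disk location of the bare clone

  if Dir.exist?(path)
    repo = Rugged::Repository.new(path)
    # Fetch from the git_owner appliance only if we don't already have the expected commit.
    repo.fetch("origin") unless repo.exists?(git_repository.expected_sha)
  else
    Rugged::Repository.clone_at(git_owner_url(git_repository), path, :bare => true)
  end
end
```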

Additionally, this would allow us to support things like "Update on Launch", because we would know the expected SHA for launching and can ensure we use that SHA, so when doing an Update on Launch we git fetch first and update the expected SHA.

As an extra bonus, once all of this is done, @mkanoor and I will be able to realize our git-based automate design 😄

Seeding

I'm not sure we need to seed any more than what's in the PR (i.e. default credentials for "localhost"). The original code had to create defaults for a number of things in order to please AWX, but those aren't necessarily needed for the new implementation. Even so, we need to research each one of those. (cc @carbonin)

Ansible stdout

In this implementation, ansible stdout is stored in the MiqTask and its associated AnsibleRunnerWorkflow job. (cc @agrare) These stdouts can get really big, so it's probably best to store them only once. We probably also do not want to store them in the MiqTask, as that class could get cleaned up eventually, so it's probably better to hang a binary_blob entry off of the ServiceAnsiblePlaybook instance.

Another complication here is how the UI is implemented, since this was originally a special case for asynchronously fetching the stdout from AWX on demand. In the original implementation, the backend code would start a special MiqTask specifically to get the output as HTML and temporarily store it in the task. Then, the UI would wait_for_task and, when it was done, delete the MiqTask.

None of this is needed anymore, and I think the backend code could be changed such that when the AnsibleRunnerWorkflow is completed, the data is extracted from the MiqTask and stored as a binary_blob. Later, when the UI asks for the output, no MiqTask is needed, as the data is already in the database and can be served directly. Even better, this can probably be done as a normal controller action, where the controller just asks the model for the raw output and the TerminalToHtml call is done in the controller (since that's the more logical place to convert raw data to presentation HTML).
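A possible shape for that simplified flow, storing the output once as a BinaryBlob and rendering it on demand (the binary_blobs association, the raw_stdout/store_stdout! methods, and the exact TerminalToHtml call are illustrative assumptions):

```ruby
class ServiceAnsiblePlaybook < Service
  def store_stdout!(raw_output)
    blob = BinaryBlob.new(:name => "ansible_stdout", :data_type => "text")
    blob.binary   = raw_output # written once, when the AnsibleRunnerWorkflow completes
    blob.resource = self
    blob.save!
  end

  def raw_stdout
    binary_blobs.find_by(:name => "ansible_stdout")&.binary
  end
end

class ServiceController < ApplicationController
  def ansible_stdout
    service = Service.find(params[:id])
    # Conversion to presentation HTML happens here, in the controller.
    render :html => TerminalToHtml.render(service.raw_stdout).html_safe
  end
end
```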

Automate methods that are playbooks directly (without the service/service catalog)

Automate methods that are playbooks can use the AnsiblePlaybookWorkflow directly. Unlike the Service modeling, which has its own execute and check_completed callouts, the automate methods do not.


TODO

Credential management

TODO

This section will likely need UI work.

Some settings in the service, such as logging, verbosity

TODO

Using the embedded_ansible or perhaps automate role

TODO

Upgrades

TODO

Tests

TODO

@carbonin
Member

Do we want to run ansible-galaxy to fetch required roles before running a playbook or when we update a git repo?

Right now I don't think we'll ever fetch the contents of the requirements.yml for repos added for embedded ansible.

I believe that AWX would do this, but we can try to confirm that.

@Fryguy
Member Author

Fryguy commented Jul 23, 2019

  • GitRepository's clone is not process-safe. We need to add a clone lock around the Dir.exists? call and the actual clone (see the sketch after this list).
  • This does not yet use the SCM credentials. SCM credentials are tied to a ConfigurationScriptSource, but GitRepository is expected to own its own credentials, so I have a double-ownership problem. GitRepository will have to be refactored somehow before I can handle it, which is why I want to do it in a follow-up.
  • The default ansible playbook consolidated thing may not update properly. (i.e. it can clone on first seed, but may not update on subsequent seed)
  • Checkouts created after running the playbook still need to be cleaned up.
  • Git repos over ssh don't work just yet.
  • Testing of individual credentials
  • Verify multi-role support (incl. zones and broadcast of deletes)
  • Ensure old job runs' stdouts don't "blow up" post-upgrade (because old stdouts are likely not available)
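On the first item above, a minimal sketch of a process-safe clone using an OS-level file lock (purely an illustration of the idea: it only serializes workers on the same appliance and may not be the mechanism GitRepository ends up using):

```ruby
require "rugged"

def ensure_clone(url, path)
  File.open("#{path}.lock", File::RDWR | File::CREAT) do |lock_file|
    lock_file.flock(File::LOCK_EX) # serialize competing processes on this appliance
    Rugged::Repository.clone_at(url, path) unless Dir.exist?(path)
  end
end
```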

@NickLaMuro
Member

NickLaMuro commented Jul 23, 2019

Using this comment as an excuse to get myself tagged as a participant on this issue, but a list of the currently in flight and completed PRs for this effort can be found with this pull request search:

https://github.com/pulls?page=1&q=is%3Apr+label%3A%22embedded+ansible%22+archived%3Afalse&utf8=%E2%9C%93

Probably wouldn't hurt adding the embedded ansible label to this issue if you have a chance.

@carbonin
Member

ManageIQ/manageiq-appliance#240 handles making the roles provided by plugins available for playbook runs.

It doesn't actually have anything to do with the initial git repo we create; that's for playbooks, and we don't actually provide any of those currently. So we could solve this one:

  • The default ansible playbook consolidated thing may not update properly. (i.e. it can clone on first seed, but may not update on subsequent seed)

by removing the consolidated repo and leaving the roles ....

@carbonin
Member

Opened ManageIQ/manageiq#19056 to remove the default consolidated playbook repo thing.

@carbonin
Member

So based on https://github.com/ansible/awx/blob/128fa8947ac620add275a15cb07577178745a849/awx/playbooks/project_update.yml#L141-L165 it looks like pulling the roles down from ansible galaxy was a part of the project update process in AWX.

That said, I'm not sure when/how we should do this. They kept the whole project repo on disk, which meant that they could install roles into that directory directly. This, I assume, led to things like the "clean" option, which would remove the role files. Since we're using bare repos and since the playbook lives somewhere other than where we're executing ansible-runner, this becomes a bit more difficult.

My first thought was to run it from AnsibleRunner. That way we won't have role conflicts between different playbook runs, but it means we need to find the requirements file from just the playbook path. Any other ideas?

@NickLaMuro
Member

@carbonin I saw this yesterday, but I was going to dig into it a bit more this morning since I do have to understand the playbook specifics that you linked (but also the parts around it that set up some of the conditionals), so I will get back to you on that.

Though, I don't know that we need to make this a playbook like they did; in fact, I think using rugged for this makes more sense since we are already using it, and I think the playbook would end up different from what they have anyway since we are using bare repos instead.

I think if I am reading that playbook right, they are always doing an ansible-galaxy call except when no change is needed (I think the equivalent of git's "Everything is up to date"), so the one thing we would lose by using bare repos is the caching of the galaxy roles. I think that is fine, just an FYI.


Anyway, I think the plan of doing a check for a "#{playbooks_path}/roles/requirements.yml" to decide if we make an extra ansible-galaxy call is fine and works for me.
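A sketch of that check, assuming AwesomeSpawn for shelling out (playbooks_path and roles_dir are illustrative):

```ruby
require "awesome_spawn"

requirements = File.join(playbooks_path, "roles", "requirements.yml")
if File.exist?(requirements)
  # Only pull galaxy roles when the repo actually ships a requirements file.
  AwesomeSpawn.run!(
    "ansible-galaxy",
    :params => ["install", "-r", requirements, "--roles-path", roles_dir]
  )
end
```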

@Fryguy
Member Author

Fryguy commented Jul 25, 2019

Updated the OP with Fedora/CentOS instructions.

@carbonin
Member

carbonin commented Aug 2, 2019

Created https://bugzilla.redhat.com/show_bug.cgi?id=1737149 to track the issue of ansible.cfg files included in repos. Originally raised here ManageIQ/manageiq#19079 (comment)

@Fryguy
Member Author

Fryguy commented Nov 14, 2019

Completed in ManageIQ/manageiq#18687 and subsequent PRs
