Assumptions about queue message for a server are broken in pods #19957

Open
carbonin opened this issue Mar 12, 2020 · 4 comments

@carbonin
Member

Currently, when running ManageIQ in pods from https://github.com/ManageIQ/manageiq-pods, workers associated with the same "server" don't actually share a process space or filesystem because they run in different containers.

This upends the assumption that process- or filesystem-level shared resources will be accessible to all workers on the same server. We rely on that assumption when we queue messages targeted at a specific server.

An example of this is the AnsibleRunnerWorkflow class. This job queues each of its states to avoid tying up a worker while the ansible playbook runs, but those queue messages are pinned to a specific server, not to a particular worker. This means that when running workers as separate containers, we could clone the playbook repo in one container and then try to run a playbook from that repo in a separate one. In the course of a single playbook run, at least three messages are sent over the queue between cloning the repo and cleaning it up. It's very unlikely that all of those messages get picked up by the same worker that queued the first one, so it's fair to consider this feature completely broken.
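
For context, the pinning happens because each state transition is re-queued with the originating server's GUID. Roughly, it looks like the sketch below (simplified and from memory, not the exact code in AnsibleRunnerWorkflow; the option names follow MiqQueue.put):

```ruby
# Simplified sketch of how a workflow state gets re-queued; the real method
# lives in the Job state machine code and may differ in detail.
def queue_signal(*args, deliver_on: nil)
  MiqQueue.put(
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "signal",
    :args        => args,
    :role        => "embedded_ansible",
    :deliver_on  => deliver_on,
    :server_guid => MiqServer.my_server.guid # pins the message to this server,
                                             # not to the worker that queued it
  )
end
```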

I'm sure there are other examples of this kind of behavior, but I don't know where they might be.

It seems like this might be an issue for smartstate if we are mounting something or trying to copy data to/from the local disk. Maybe V2V could also run into issues?

The best idea I currently have for this is to add a worker guid to queue messages when we need the message to be processed by a particular worker instance, and to use deliver_on so the worker waits before consuming the message (think thread context switching). I don't like the idea of adding to the queue, though, so this may not be the best way forward.
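
As a sketch of that idea (hypothetical: MiqQueue has no worker guid column today and it would need to be added; :deliver_on does exist, and current_worker_guid is just a placeholder for however the worker would identify itself):

```ruby
# Hypothetical sketch of pinning a queue message to a worker rather than a
# server. :worker_guid is not a real MiqQueue column today; :deliver_on is.
def queue_signal(*args, deliver_on: nil)
  MiqQueue.put(
    :class_name  => self.class.name,
    :instance_id => id,
    :method_name => "signal",
    :args        => args,
    :worker_guid => current_worker_guid,                  # placeholder for the worker's own guid
    :deliver_on  => deliver_on || 10.seconds.from_now.utc # give that worker time to
                                                          # come back around to the queue
  )
end
```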

From a pods-first viewpoint, I would like to spin up Job instances for this kind of work and then poll for the status of the job. It would be that separate pod's responsibility to take care of most of what we're doing in the workflow class. That way we don't have to worry about separate workers trying to monitor a process. I'm not sure what we would implement on an appliance to fulfill the same use case, though.
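
For the polling side, something like the sketch below is what I have in mind, assuming we'd go through the kubeclient gem; the job name, namespace, and connection details are made up for illustration:

```ruby
require 'kubeclient'

# Sketch only: connect to the batch API and poll a Job until it finishes.
# SSL options are omitted and the names here are illustrative.
token = File.read("/var/run/secrets/kubernetes.io/serviceaccount/token")
batch = Kubeclient::Client.new("https://kubernetes.default.svc/apis/batch", "v1",
                               auth_options: {bearer_token: token})

job = batch.get_job("ansible-playbook-run-123", "manageiq")
until job.status.succeeded.to_i > 0 || job.status.failed.to_i > 0
  sleep(10)
  job = batch.get_job("ansible-playbook-run-123", "manageiq")
end
```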

I made a quick fix to the AnsibleRunnerWorkflow case in #19956, but this bug report is really about the larger issue so we can get input about other features that might be affected.

cc @Fryguy @chessbyte @NickLaMuro

@carbonin carbonin added the bug label Mar 12, 2020
@NickLaMuro
Member

From a pods-first viewpoint, I would like to spin up Job instances for this kind of work

So I think I like this approach in general; I'm just wondering if it makes sense to proactively look at more ways to increase the boot speed for Job pods, since a roughly 10-15 second penalty to boot this into a new process every time just to spin up the required work seems like it wouldn't be scalable long term.

Though, speaking specifically about using this approach for EmbeddedAnsible, only a very small subset of our app is really required to kick off an ansible-runner job (a bit more, I guess, if we are talking about having to run the GitWorktree code and load a few other database records).

Anyway, just the first thing that popped into my head after reading that.

@Fryguy
Member

Fryguy commented Mar 12, 2020

I think the assumption is that ansible-runner itself would be the thing launched as a Kube Job.

In my mind we need a MonitoredProcess abstraction, which can then have different backends for pods (Kube Jobs) and appliances (Processes), and callers can just go through that abstraction. I haven't fully thought it through, though, and I'm not sure how we'd handle things like the SmartState cases.
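
Purely as illustration, that abstraction could look something like this (none of these classes exist; the podified? check and the backends are stand-ins):

```ruby
# Purely illustrative sketch; none of these classes exist in ManageIQ today and
# the podified? check is a stand-in for the real environment detection.
class MonitoredProcess
  def self.backend_class
    podified? ? KubeJobBackend : ProcessBackend
  end

  def self.run(cmd)
    backend_class.new(cmd).tap(&:start)
  end

  def self.podified?
    ENV.key?("KUBERNETES_SERVICE_HOST") # stand-in for the real check
  end
end

# Appliance backend: spawn and monitor a local process.
class ProcessBackend
  def initialize(cmd)
    @cmd = cmd
  end

  def start
    @pid = Process.spawn(*@cmd)
    Process.detach(@pid)
  end

  def running?
    Process.kill(0, @pid)
    true
  rescue Errno::ESRCH
    false
  end
end

# Pods backend: create a Kube Job and watch it via the Kubernetes API
# (left unimplemented in this sketch).
class KubeJobBackend
  def initialize(cmd)
    @cmd = cmd
  end

  def start
    raise NotImplementedError, "create a batch/v1 Job that runs #{@cmd.join(' ')}"
  end

  def running?
    raise NotImplementedError, "poll the Job's status"
  end
end
```

Callers on either platform would then just do something like MonitoredProcess.run(["ansible-runner", "run", ...]) and poll the returned object.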

@NickLaMuro
Member

I think the assumption is that ansible-runner itself would be the thing launched as a Kube Job.

I feel like that is still missing a step, since you need the repo checked out on the same container/volume/whatever, and that isn't part of the CLI call to ansible-runner.

That said, I'm unsure of the capabilities of "Kube Jobs" and whether they allow passing a filesystem of sorts along to the job being executed, so my point might be moot if that is the case.
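
For what it's worth, a Kube Job's pod template can declare volumes and init containers, so one way to get the repo onto the same filesystem as the runner would be an init container that clones it into an emptyDir volume shared with the ansible-runner container. A rough sketch, with image names, repo URL, and paths made up for illustration:

```ruby
# Rough sketch of a batch/v1 Job spec expressed as a Ruby hash: an init
# container clones the repo into an emptyDir volume that the runner container
# mounts. Image names, repo URL, and paths are illustrative only.
job = {
  :apiVersion => "batch/v1",
  :kind       => "Job",
  :metadata   => {:name => "ansible-playbook-run", :namespace => "manageiq"},
  :spec       => {
    :template => {
      :spec => {
        :restartPolicy  => "Never",
        :volumes        => [{:name => "runner-dir", :emptyDir => {}}],
        :initContainers => [{
          :name         => "clone-repo",
          :image        => "example/git:latest",
          :command      => ["git", "clone", "https://example.com/playbooks.git", "/runner/project"],
          :volumeMounts => [{:name => "runner-dir", :mountPath => "/runner"}]
        }],
        :containers => [{
          :name         => "runner",
          :image        => "example/ansible-runner:latest",
          :command      => ["ansible-runner", "run", "/runner", "-p", "playbook.yml"],
          :volumeMounts => [{:name => "runner-dir", :mountPath => "/runner"}]
        }]
      }
    }
  }
}
# e.g. batch_client.create_job(Kubeclient::Resource.new(job))
```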

@carbonin
Member Author

Yeah, at the very start of development of ansible_runner we asked for a container image that would handle checking out the repo and all of that, but I don't think that ever happened. I would imagine that's what we would build for this.
