Add container events to pod #16583
Conversation
@agrare @Ladas @cben @gmcculloug I successfully tested this locally, but I'm not sure we will get the same results in production:
Can we expect the same results with a large OpenShift env? Does the refresh triggered by the event ALWAYS happen before the policy actions?
Looking into it, I still see this occasionally in evm.log:

[----] W, [2017-12-03T17:13:18.139644 #8150:8e7114] WARN -- : MIQ(EmsEvent#parse_policy_parameters) Unable to find target [container_group], skipping policy evaluation

so maybe the event fires several times:

13m 14m 3 goodbye-openshift Pod spec.containers{goodbye-openshift} Normal BackOff kubelet, ocp-compute01.10.35.48.187.xip.io Back-off pulling image "openshift/hello-openshiftaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
13m 14m 3 goodbye-openshift Pod spec.containers{goodbye-openshift} Warning Failed kubelet, ocp-compute01.10.35.48.187.xip.io Error: ImagePullBackOff

Looks to me like we do need to reconnect targetless events on refresh, as per #16497.
2,3) So even with refresh + reconnect, the refresh is async, so the policy will probably be executed before the refresh happens and it will not have the container/pod associated. In a bigger env it's expected that any event tied to the creation of an entity will not have the refreshed entity in time (processing the refresh takes much more time than processing the policy). The policy, or the state machine invoking the policy, should have an async waiting loop (code that checks whether the Pod/Container is in our DB and sends an Automate :retry if not). Then of course we will need the post-refresh event reconnect, so the association is filled. In the H release we will be creating the Pod/Container with the event (no reconnect should be needed) and the refresh will be targeted based on the event, so it will be much quicker. We will still need the waiting loop though, to make sure the policy has the record and all of the record's attributes needed for the processing.
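A minimal sketch of what such an async waiting step could look like as an Automate state method; the service-model lookup keys and the retry interval are assumptions, not the actual implementation:

```ruby
# Illustrative only: an Automate "wait for inventory" state, as discussed above.
# The lookup keys are assumptions; the real identifiers come from the provider's
# event parser, and the service-model query may differ.
event = $evm.root['event_stream']

pod = $evm.vmdb(:container_group).find_by(:name   => event.container_group_name,
                                          :ems_id => event.ems_id)

if pod
  $evm.root['ae_result'] = 'ok'              # target is in the DB, run the policy
else
  $evm.root['ae_result'] = 'retry'           # ask Automate to re-run this state
  $evm.root['ae_retry_interval'] = 1.minute
end
```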
containergroup_containerkilling,Pod Container Killing,Default,container_operations
containergroup_containerstarted,Pod Container Started,Default,container_operations
containergroup_containerstopped,Pod Container Stopped,Default,container_operations
containergroup_containerunhealthy,Pod Container Unhealthy,Default,container_operations
❤️ the names with "Pod" at the beginning; this is presently the only way for a user to know which events belong in which policies.
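For context, the rows above seed event definition names; a control policy attached to a pod reacts when an event with one of those names is raised against it. A rough sketch from a rails console (the ad-hoc lookup and call site are illustrative assumptions; the real call happens inside event handling):

```ruby
# Illustrative only: raise one of the seeded events against a pod so that any
# control policy attached to that pod can react. The ad-hoc lookup stands in
# for the association done by the event handling code.
pod = ContainerGroup.find_by(:name => "goodbye-openshift")
MiqEvent.raise_evm_event(pod, "containergroup_containerunhealthy") if pod
```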
My plan was to call the event handle again after the reconnect. So the flow would be:
Do you think it's a good intermediate solution? EDIT: this is also a good approach for alerts.
@Ladas, do you mean we should delay the policy until we have both the container spec AND status? The current criterion for both normal policy execution and the proposed reconnect is the pod existing, which only guarantees the spec.
@cben yeah, I think we need a waiting step in the event handling (async wait), waiting for the entity to be in the DB. This might apply to more events, not just events upon creation, since the refresh can take a long time.

@zeari what do you mean by the :handle, calling the event handle again after the reconnect?

@agrare I wonder, how do we usually solve this? Was refresh_new_target the way? Was it sync? I know the current refresh_sync is blocking, so that is not an option. The async waiting step sounds like the best option.
@zeari I'm not sure what the correct way to process that would be; we would have to send the event again to invoke the handle. We should not be calling handle directly from the refresh worker.
@gmcculloug Do you know the downsides to calling handle again on an event after the first time yielded no target?
@zeari The only downside that I can think of with raising the event the second time is that refresh would be called again.
LGTM 👍
Continuing the event reconnect discussion on #16497. This PR LGTM 👍
LGTM 👍
@lfu @gmcculloug
@miq-bot add_label gaprindashvili/yes
@miq-bot add_label fine/yes
For VMs we raise a creation_event during post_refresh, which is what people usually attach their policy events to (so @gmcculloug tells me). Can we do this for containers/container_groups as well? It looks like we already have a creation_event on container_images.
Yes, there is a separate BZ to add such creation events for pods etc. These are nice because they're guaranteed to already have inventory. But the discussion here (moved to #16497) was about processing "real" external events that arrive before inventory.
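A rough sketch of what a synthesized creation event for pods could look like, by analogy with the VM flow described above; the hook name and event name are assumptions, not an existing API:

```ruby
# Illustrative only: after a refresh saves new pods, raise a synthesized
# creation event for each of them, so attached policies have a target that is
# guaranteed to already be in the DB. Hook and event name are assumptions.
def raise_pod_creation_events(new_pods)
  new_pods.each do |pod|
    MiqEvent.raise_evm_event(pod, "containergroup_created")
  end
end
```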
@agrare as @cben says, in the world of long refreshes and fast events, any event can have its target missing. So it can be:
or:
So my biggest issue with just running the event handler again is that, if a customer adds some action which isn't idempotent in addition to triggering the policy event, it could break their workflows. This is a pretty fundamental change in how event handlers are run, and we cannot say for certain that this won't cause regressions for customers. Today, if a customer puts a policy event on a native VmCreatedEvent from VMware, it isn't guaranteed that the target is in the DB, so this is no different; which is why we have the synthesized events when a VM is created in the DB.
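To illustrate the concern, here is a hypothetical customer Automate method attached to the policy event that is not safe to re-run; the addresses and message are made up for the example:

```ruby
# Hypothetical customer method attached to the policy event: it emails on every
# run, so re-running the handler for the same raw event would notify twice.
event = $evm.root['event_stream']
$evm.execute('send_email',
             'oncall@example.com',            # to
             'manageiq@example.com',          # from
             'Pod container unhealthy',       # subject
             "Pod #{event.try(:container_group_name)} reported an unhealthy container")
```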
@lfu @gmcculloug @agrare so, can we go forward with this PR, ManageIQ/manageiq-content#225, and ManageIQ/manageiq-providers-kubernetes#181?
Sounds good to me.
@cben I would probably wait with ManageIQ/manageiq-content#225, since that might cause a lot of failures with missing targets, unless we solve the 'wait for the target to be in the DB' somehow.
@zeari @cben can you explain a bit better why you want to switch all of these from being associated with the container to being associated with the pod? It isn't clear to me why this would help.
Right now I'm 👎 on this unless @gmcculloug tells me I'm way off base and there's nothing to worry about 😄. This is a big change, and we can't just spring it on customers with custom automate methods that might not behave well when run multiple times. If we were waiting to execute the policy until the item was in the DB, maybe; but that won't handle the case where the container will never be in the DB.
@agrare Well, there isn't code to associate those events with containers. While there was some pre-work to have them attached to containers, I don't think these events currently work for any customer, as they were never associated correctly with containers 😅
These MiqEvents weren't even defined. We're not switching them, we're adding them to pods.

If you mean why not add Container policies and target them there: no deep reasons; we discussed this back and forth, and @simon3z and @bazulay decided on pods.

> on customers with custom automate methods that might not behave well when run multiple times.

I don't think these events currently work for any customer 😅

EDIT: Also, can we please separate the discussion of calling .handle twice? That's a separate PR; it's an orthogonal goal to make *all* node/pod/container events robust. Unless I'm missing some way it affects this PR?
I think we are done with that discussion. It's not a viable solution and I closed the other PR.
Force-pushed from 2c78025 to 8ed149b
@agrare
Force-pushed from 8ed149b to 99bbac4
containergroup_killing,Pod Container Killing,Default,container_operations
containergroup_started,Pod Container Started,Default,container_operations
containergroup_stopped,Pod Container Stopped,Default,container_operations
containergroup_unhealthy,Pod Container Unhealthy,Default,container_operations
@agrare I think we should at least keep the labels more indicative, e.g. "Pod Container Created".
Checked commits zeari/manageiq@81db841~...99bbac4 with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
👍 LGTM
Add container events to pod (cherry picked from commit 8e2bfac)
https://bugzilla.redhat.com/show_bug.cgi?id=1530651
Gaprindashvili backport details:
Add container events to pod (cherry picked from commit 8e2bfac)
https://bugzilla.redhat.com/show_bug.cgi?id=1530653
Fine backport details:
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1496179

Container events were previously unhandled; with this PR they are associated with their parent pod (see the sketch after the list below).

Needs to be merged with:
ManageIQ/manageiq-providers-kubernetes#181
ManageIQ/manageiq-content#225
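A rough illustration of the pod association, under assumptions: the field names below are guesses and the real mapping lives in ManageIQ/manageiq-providers-kubernetes#181. The idea is that a parsed container event carries enough pod identifiers for the core EmsEvent handling to find its ContainerGroup target.

```ruby
# Illustrative only: shape of a parsed container event that points at its
# parent pod. Field names are assumptions, not the actual parser output.
def parse_container_event(kube_event, ems_id)
  {
    :event_type           => "POD_#{kube_event[:reason].to_s.upcase}",
    :source               => "KUBERNETES",
    :timestamp            => kube_event[:lastTimestamp],
    :container_group_name => kube_event.dig(:involvedObject, :name),      # the parent pod
    :container_namespace  => kube_event.dig(:involvedObject, :namespace),
    :full_data            => kube_event,
    :ems_id               => ems_id,
  }
end
```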
@cben @moolitayer @enoodle Please review
cc @simon3z @bazulay
@miq-bot add_label bug, providers/containers, automate