Add container events to pod #16583
Conversation
@agrare @Ladas @cben @gmcculloug I successfully tested this locally, but I'm not sure we will get the same results in production:
Can we expect the same results with a large OpenShift env? Does the refresh triggered by the event ALWAYS happen before the policy actions?
Looking into it, I still see this occasionally in evm.log:

[----] W, [2017-12-03T17:13:18.139644 #8150:8e7114] WARN -- : MIQ(EmsEvent#parse_policy_parameters) Unable to find target [container_group], skipping policy evaluation

so maybe the event fires several times:

13m 14m 3 goodbye-openshift Pod spec.containers{goodbye-openshift} Normal BackOff kubelet, ocp-compute01.10.35.48.187.xip.io Back-off pulling image "openshift/hello-openshiftaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
13m 14m 3 goodbye-openshift Pod spec.containers{goodbye-openshift} Warning Failed kubelet, ocp-compute01.10.35.48.187.xip.io Error: ImagePullBackOff

Looks to me like we do need to reconnect targetless events on refresh, as per #16497.
2,3) So even with refresh + reconnect, the refresh is async, so the policy will probably be executed before the refresh happens and it will not have the container/pod associated. In a bigger env it's expected that any event tied to the creation of an entity will not have the refreshed entity in time (processing the refresh takes much more time than processing the policy). The policy, or the state machine invoking the policy, should have an async waiting loop (code that checks whether the Pod/Container is in our DB and sends an Automate :retry if not). Then of course we will need the post-refresh event reconnect, so the association is filled. In the H release we will be creating the Pod/Container with the event (no reconnect should be needed) and the refresh will be targeted based on the event, so it will be much quicker. We will still need the waiting loop though, to make sure the policy has the record and all of the record's attributes needed for the processing.
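A minimal sketch of what such an async waiting step could look like as an Automate state method; the service-model lookup keys and the retry interval are assumptions, not the actual implementation:

```ruby
# Illustrative only: an Automate "wait for inventory" state, as discussed above.
# The lookup keys are assumptions; the real identifiers come from the provider's
# event parser, and the service-model query may differ.
event = $evm.root['event_stream']

pod = $evm.vmdb(:container_group).find_by(:name   => event.container_group_name,
                                          :ems_id => event.ems_id)

if pod
  $evm.root['ae_result'] = 'ok'              # target is in the DB, run the policy
else
  $evm.root['ae_result'] = 'retry'           # ask Automate to re-run this state
  $evm.root['ae_retry_interval'] = 1.minute
end
```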
containergroup_containerkilling,Pod Container Killing,Default,container_operations
containergroup_containerstarted,Pod Container Started,Default,container_operations
containergroup_containerstopped,Pod Container Stopped,Default,container_operations
containergroup_containerunhealthy,Pod Container Unhealthy,Default,container_operations
❤️ the names with "Pod" at the beginning; this is presently the only way for a user to know which events belong in which policies.
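For context, the rows above seed event definition names; a control policy attached to a pod reacts when an event with one of those names is raised against it. A rough sketch from a rails console (the ad-hoc lookup and call site are illustrative assumptions; the real call happens inside event handling):

```ruby
# Illustrative only: raise one of the seeded events against a pod so that any
# control policy attached to that pod can react. The ad-hoc lookup stands in
# for the association done by the event handling code.
pod = ContainerGroup.find_by(:name => "goodbye-openshift")
MiqEvent.raise_evm_event(pod, "containergroup_containerunhealthy") if pod
```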
My plan was to call the event handle again after the reconnect. So the flow would be:
Do you think it's a good intermediate solution? EDIT: this is also a good approach for alerts.
@Ladas, do you mean we should delay the policy until we have both the container spec AND status? The current criterion for both normal policy execution and the proposed reconnect is the pod existing, which only guarantees the spec.
@cben yeah, I think we need a waiting step in the event handling (async wait), waiting for the entity to be in the DB. This might apply to more events, not just events upon creation, since the refresh can take a long time.

@zeari what do you mean by the :handle, calling the event handle again after the reconnect?

@agrare I wonder, how do we usually solve this? Was refresh_new_target the way? Was it sync? I know the current refresh_sync is blocking, so that is not an option. The async waiting step sounds like the best option.
@zeari I'm not sure what the correct way to process that would be; we would have to send the event again to invoke the handle. We should not be calling handle directly from the refresh worker.
@gmcculloug Do you know the downsides to calling handle again on an event after the first time yielded no target?
@zeari The only downside that I can think of with raising the event the second time is that refresh would be called again.
LGTM 👍
Continuing the event reconnect discussion on #16497. This PR LGTM 👍
LGTM 👍
@lfu @gmcculloug
@miq-bot add_label gaprindashvili/yes
@miq-bot add_label fine/yes
For VMs we raise a creation_event during post_refresh, which is what people usually attach their policy events to (so @gmcculloug tells me). Can we do this for containers/container_groups as well? It looks like we already have a creation_event on container_images.
Yes, there is a separate BZ to add such creation events for pods etc. These are nice because they're guaranteed to already have inventory. But the discussion here (moved to #16497) was about processing "real" external events that arrive before inventory.
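A rough sketch of what a synthesized creation event for pods could look like, by analogy with the VM flow described above; the hook name and event name are assumptions, not an existing API:

```ruby
# Illustrative only: after a refresh saves new pods, raise a synthesized
# creation event for each of them, so attached policies have a target that is
# guaranteed to already be in the DB. Hook and event name are assumptions.
def raise_pod_creation_events(new_pods)
  new_pods.each do |pod|
    MiqEvent.raise_evm_event(pod, "containergroup_created")
  end
end
```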
@agrare as @cben says, in the world of long refreshes and fast events, any event can have its target missing. So it can be:
or:
So my biggest issue with just running the event handler again is that, if a customer adds some action which isn't idempotent in addition to triggering the policy event, it could break their workflows. This is a pretty fundamental change in how event handlers are run, and we cannot say for certain that this won't cause regressions for customers. Today, if a customer puts a policy event on a native VmCreatedEvent from VMware, it isn't guaranteed that the target is in the DB, so this is no different; which is why we have the synthesized events when a VM is created in the DB.
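To illustrate the concern, here is a hypothetical customer Automate method attached to the policy event that is not safe to re-run; the addresses and message are made up for the example:

```ruby
# Hypothetical customer method attached to the policy event: it emails on every
# run, so re-running the handler for the same raw event would notify twice.
event = $evm.root['event_stream']
$evm.execute('send_email',
             'oncall@example.com',            # to
             'manageiq@example.com',          # from
             'Pod container unhealthy',       # subject
             "Pod #{event.try(:container_group_name)} reported an unhealthy container")
```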
@lfu @gmcculloug @agrare so, can we go forward with this PR, ManageIQ/manageiq-content#225, and ManageIQ/manageiq-providers-kubernetes#181?
Sounds good to me.
@cben I would probably wait with ManageIQ/manageiq-content#225, since that might cause a lot of failures with missing targets, unless we solve the 'wait for the target to be in the DB' somehow.
@zeari @cben can you explain a bit better why you want to switch all of these from being associated with the container to being associated with the pod? It isn't clear to me why this would help.
Right now I'm 👎 on this unless @gmcculloug tells me I'm way off base and there's nothing to worry about 😄. This is a big change, and we can't just spring it on customers with custom automate methods that might not behave well when run multiple times. If we were waiting to execute the policy until the item was in the DB, maybe; but that won't handle the case where the container will never be in the DB.
@agrare Well, there isn't code to associate those events with containers. While there was some pre-work to have them attached to containers, I don't think these events currently work for any customer, as they were never associated correctly with containers 😅
These MiqEvents weren't even defined. We're not switching them, we're adding them to pods.

If you mean why not add Container policies and target them there: no deep reasons; we discussed this back and forth, and @simon3z and @bazulay decided on pods.

> on customers with custom automate methods that might not behave well when run multiple times.

I don't think these events currently work for any customer 😅

EDIT: Also, can we please separate the discussion of calling .handle twice? That's a separate PR; it's an orthogonal goal to make *all* node/pod/container events robust. Unless I'm missing some way it affects this PR?
I think we are done with that discussion. It's not a viable solution and I closed the other PR.
Force-pushed from 2c78025 to 8ed149b
@agrare
Force-pushed from 8ed149b to 99bbac4
containergroup_killing,Pod Container Killing,Default,container_operations
containergroup_started,Pod Container Started,Default,container_operations
containergroup_stopped,Pod Container Stopped,Default,container_operations
containergroup_unhealthy,Pod Container Unhealthy,Default,container_operations
@agrare I think we should at least keep the labels more indicative, e.g. "Pod Container Created".
Checked commits zeari/manageiq@81db841~...99bbac4 with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
👍 LGTM
Add container events to pod (cherry picked from commit 8e2bfac)
https://bugzilla.redhat.com/show_bug.cgi?id=1530651
Gaprindashvili backport details:
Add container events to pod (cherry picked from commit 8e2bfac)
https://bugzilla.redhat.com/show_bug.cgi?id=1530653
Fine backport details:
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1496179

Container events were previously unhandled; with this PR they are associated with their parent pod (see the sketch after the list below).

Needs to be merged with:
ManageIQ/manageiq-providers-kubernetes#181
ManageIQ/manageiq-content#225
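A rough illustration of the pod association, under assumptions: the field names below are guesses and the real mapping lives in ManageIQ/manageiq-providers-kubernetes#181. The idea is that a parsed container event carries enough pod identifiers for the core EmsEvent handling to find its ContainerGroup target.

```ruby
# Illustrative only: shape of a parsed container event that points at its
# parent pod. Field names are assumptions, not the actual parser output.
def parse_container_event(kube_event, ems_id)
  {
    :event_type           => "POD_#{kube_event[:reason].to_s.upcase}",
    :source               => "KUBERNETES",
    :timestamp            => kube_event[:lastTimestamp],
    :container_group_name => kube_event.dig(:involvedObject, :name),      # the parent pod
    :container_namespace  => kube_event.dig(:involvedObject, :namespace),
    :full_data            => kube_event,
    :ems_id               => ems_id,
  }
end
```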
@cben @moolitayer @enoodle Please review
cc @simon3z @bazulay
@miq-bot add_label bug, providers/containers, automate