Skip invalid container_images #94
e61f8cb to cdf9afd
Do you have details on what the input looked like? It's generally a good idea not to abort refresh on errors.
|
Also interested in that. Currently I think we assume the container -> container image link always exists, so removing that might cause unexpected problems (e.g. screens that rely on that relation). Maybe we can fix the parsing to match new cases? Or maybe this is due to a bug in kubernetes that we need to mitigate (missing image element in some cases)?
@cben no, not yet. I was hoping to be able to tell from the log line on the customer env whether it is something we need to update that regex to accommodate or something we need to work around. @blomquisg were you guys able to find out from the customer anything about the invalid image?
https://bugzilla.redhat.com/show_bug.cgi?id=1484337 |
@enoodle Until we know more, this would log the exact input, and let refresh complete 👍
Makes sense. Is it enough to skip the container status, or the whole container? |
@@ -1077,11 +1077,14 @@ def parse_quantity(resource) # parse a string with a suffix into a int\float
  end

  def parse_container_status(container, pod_id)
    container_image = parse_container_image(container.image, container.imageID)
    return if container_image.nil?
Can you log more info on the affected container? Ideally the whole input spec and status, which would have to happen in the caller.
This returns nil, but the caller does containers_index[cn.name].merge!(parse_container_status(cn, pod.metadata.uid)). Doesn't merge!(nil) crash?
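To make the concern concrete, here is a minimal standalone sketch of what Ruby does when a Hash is merged with nil. The names `parse_status_stub` and `merge_status` are hypothetical stand-ins, not the provider's parser; only the `Hash#merge!` behaviour is standard Ruby.

```ruby
# Hypothetical stand-in for the per-pod container index discussed above.
containers_index = { "web" => { :name => "web" } }

# Hypothetical stub for a parser that returns nil on unparsable input.
def parse_status_stub(valid)
  valid ? { :state => "running" } : nil
end

def merge_status(index, name, status)
  index[name].merge!(status)
  :merged
rescue TypeError => err
  # Hash#merge! requires a Hash (or an object implicitly convertible to one),
  # so passing nil raises TypeError.
  [:failed, err.class]
end

ok  = merge_status(containers_index, "web", parse_status_stub(true))
bad = merge_status(containers_index, "web", parse_status_stub(false))
```

So an unguarded `merge!(nil)` would raise, and the caller needs an explicit nil check before merging.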
Good catch, I'll have parse_pod catch a nil status.
So would it be enough to log cn from here?
out of scope: I'm wondering if we should systematically catch exceptions in (*) will need similar helper for |
cdf9afd to 7a49bf0
@@ -766,8 +766,14 @@ def parse_pod(pod)

unless pod.status.nil? || pod.status.containerStatuses.nil?
  pod.status.containerStatuses.each do |cn|
    container_status = parse_container_status(cn, pod.metadata.uid)
    if container_status.nil?
      _log.warn("Invalid container status: pod [#{pod.metadata.uid}] container [#{cn}]")
ideally also log containers_index[cn.name]
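A rough sketch of the guard being discussed: skip a nil status, log both the status entry and the matching spec entry from the index, and carry on with the remaining containers. Everything here (`ContainerStatusStub`, `parse_status_sketch`, `merge_statuses_sketch`, the `warnings` array standing in for `_log`) is a hypothetical illustration, not the PR's code.

```ruby
# Hypothetical stand-in for one entry of pod.status.containerStatuses.
ContainerStatusStub = Struct.new(:name, :image)

# Hypothetical parser: returns nil when the image is missing or empty.
def parse_status_sketch(cn)
  return nil if cn.image.to_s.empty?
  { :image => cn.image }
end

def merge_statuses_sketch(containers_index, statuses, warnings)
  statuses.each do |cn|
    status = parse_status_sketch(cn)
    if status.nil?
      # Record the raw status and the spec-side entry, then skip this one.
      warnings << "Invalid container status: container [#{cn.to_a.inspect}] " \
                  "spec [#{containers_index[cn.name].inspect}]"
      next
    end
    containers_index[cn.name].merge!(status)
  end
  containers_index
end

index = { "web" => { :name => "web" }, "db" => { :name => "db" } }
statuses = [ContainerStatusStub.new("web", "docker.io/foobar/app"),
            ContainerStatusStub.new("db", "")]
warnings = []
result = merge_statuses_sketch(index, statuses, warnings)
```

The valid container still gets its status merged, while the invalid one keeps only its spec-side data and leaves a log entry behind.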
LGTM 👍 |
image_parts = docker_pullable_re.match(image)
if image_parts.nil?
  _log.warn("Invalid image #{image}")
  return
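For readers without the full file, a self-contained sketch of the same pattern: match, warn on a nil match, and return nil instead of raising. `SIMPLE_IMAGE_RE` is a deliberately simplified, hypothetical pattern (the real `docker_pullable_re` is not shown in this thread), and a stdlib `Logger` stands in for `_log`.

```ruby
require "logger"
require "stringio"

# Hypothetical, simplified pattern: "name" optionally followed by ":tag".
SIMPLE_IMAGE_RE = %r{\A(?<name>[a-z0-9._/-]+)(?::(?<tag>[\w.-]+))?\z}

def parse_image_sketch(image, log)
  parts = SIMPLE_IMAGE_RE.match(image.to_s)
  if parts.nil?
    # Log the exact input so the offending value can be found later.
    log.warn("Invalid image #{image.inspect}")
    return nil
  end
  { :name => parts[:name], :tag => parts[:tag] }
end

log_output = StringIO.new
log = Logger.new(log_output)

good = parse_image_sketch("docker.io/foobar/app:1.0", log)
bad  = parse_image_sketch("", log)
```

Using `inspect` in the log message makes empty strings and nil distinguishable, which matters when the goal is to learn what the bad input looked like.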
I'm good with logging an error and getting a bug filed. I'm not good with throwing an exception and stopping the refresh for one bad image, which tends to lead to severity 1 issues :D
@agrare have we learned anything about the inputs causing the problem you are fixing? |
@agrare do we have a bz for the issue fixed here? do we plan to eventually backport it? |
@moolitayer customer is no longer seeing this issue, we are assuming the problematic container was deleted but they were hitting this for about a week. |
7a49bf0 to 4351cc5
@cben I have also been thinking we need something along the lines of what you said
@jameswnl and I were kicking some ideas around: what if we had a way to build up a set of errors that we raise at the end of the refresh as a notification to the user, without stopping the refresh but more visible than errors in the logs? What do you think?
I think this would be a good addition as well, along the lines of "we want to let the user know something is wrong but don't want to stop the refresh for it". Honestly even exiting the refresh with an error usually goes unnoticed unless someone is looking for it. |
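One possible shape for the "collect errors, report at the end" idea, as a sketch only. `RefreshIssues`, `guard`, and `summary` are hypothetical names invented for illustration; how the summary would actually be surfaced to the user is left open.

```ruby
# Hypothetical collector; none of these names come from the provider code.
class RefreshIssues
  attr_reader :issues

  def initialize
    @issues = []
  end

  # Run the block; on failure, record the error and return nil instead of raising.
  def guard(context)
    yield
  rescue StandardError => err
    @issues << "#{context}: #{err.class}: #{err.message}"
    nil
  end

  # One message for the end of the refresh, or nil when nothing went wrong.
  def summary
    return nil if @issues.empty?
    "Refresh finished with #{@issues.size} skipped item(s):\n" + @issues.join("\n")
  end
end

issues = RefreshIssues.new
images = ["docker.io/foobar/app:1.0", nil, "docker.io/foobar/db:2"]

parsed = []
images.each_with_index do |image, i|
  name = issues.guard("image[#{i}]") do
    raise ArgumentError, "missing image" if image.nil?
    image.split(":").first
  end
  parsed << name unless name.nil?
end

report = issues.summary
```

The refresh loop runs to completion, and the single `report` string could then feed a notification rather than only a log line.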
Sure, but we have no idea what it is. Hopefully the logging in this PR would help to catch such input. |
@agrare @moolitayer @cben I think we need specs for this. We can unit-test |
@simon3z yeah, I can add specs. I wish we knew what was causing the parsing error, but we can come back and add a test specifically for that once this is applied and we find out.
@agrare ah sorry I forgot to write my thoughts on that 😄 ...I think it can be a particular state (especially when overloaded?) when a Pod has been created but not yet picked up by the scheduler. |
Oh okay :) so do you think the image name is nil? |
@agrare if we're talking about these:
  status:
    ...
    containerStatuses:
    - ...
      image: docker.io/foobar/...
      imageID: docker-pullable://docker.io/foobar/...
then yes. In particular The one provided on "creation" (maybe it's better to say on "definition") instead must be there since the beginning:
  spec:
    containers:
    - ...
      image: docker.io/foobar/...
@simon3z okay and do you think it makes sense to skip these containers if they don't have an |
Personally I think it'd be better to parse and save the container and leave the image link blank until it is filled in but that's just me :) |
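A minimal sketch of that alternative, assuming plain hashes in place of the Kubernetes spec and status entries: the container is always built from the spec, and the image link stays nil until a status with an image shows up. `build_container_sketch` and the hash keys are hypothetical, chosen only for this illustration.

```ruby
# Hypothetical builder: spec and status are plain hashes standing in for
# spec.containers[] and status.containerStatuses[] entries.
def build_container_sketch(spec, status)
  image_ref = status && status[:image]
  image_ref = nil if image_ref.to_s.empty?
  {
    :name            => spec[:name],
    :requested_image => spec[:image], # spec-side image, per the comment above
    :container_image => image_ref     # stays nil until status reports an image
  }
end

spec = { :name => "web", :image => "docker.io/foobar/app" }

pending = build_container_sketch(spec, nil)
running = build_container_sketch(
  spec,
  { :image => "docker.io/foobar/app", :imageID => "docker-pullable://docker.io/foobar/app" }
)
```

The trade-off raised earlier in the thread still applies: anything that assumes the container -> container image link always exists would have to tolerate the nil.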
@simon3z hmm actually I don't know that it is a missing |
4351cc5 to 15c6799
Checked commits agrare/manageiq-providers-kubernetes@22f8ab0~...15c6799 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0 |
👍 to ignore code climate, |
Backported to Euwe via ManageIQ/manageiq#15918 |
Backported to Fine via ManageIQ/manageiq#16019 |
This PR caught the invalid image :-) Example:
cc @enoodle does this ^^ tell you anything? (Should open BZ, but wanted to dump info somewhere for now) |
@cben there is also another type where the imageID is blank:
|
https://bugzilla.redhat.com/show_bug.cgi?id=1484337