
Skip invalid container_images #94

Merged
merged 3 commits
Aug 29, 2017

Conversation

agrare
Member

@agrare agrare commented Aug 15, 2017

[----] E, [2017-08-08T05:55:00.221227 #15805:1041130] ERROR -- : [NoMethodError]: undefined method `[]' for nil:NilClass  Method:[rescue in block in refresh]
[----] E, [2017-08-08T05:55:00.221475 #15805:1041130] ERROR -- : /var/www/miq/vmdb/app/models/manageiq/providers/kubernetes/container_manager/refresh_parser.rb:767:in `parse_image_name'

https://bugzilla.redhat.com/show_bug.cgi?id=1484337

@cben
Contributor

cben commented Aug 16, 2017

Do you have details on what the input looked like?
cc @zgalor this is in fetch_hawk_inv (unrelated adjacent log line, known other issue)

It's generally a good idea not to abort refresh on errors.

  • I'm not sure we should skip the container too.
  • If we just drop things in the parser, existing records can be deleted.
  • I'd like some mechanism stronger than logging to communicate problems. The user should see "Refreshed with 3 errors" or something.
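The concern about dropped records can be seen with a toy sketch; the in-memory store and naive "save" below are purely illustrative, not ManageIQ's actual save_inventory code:

```ruby
# Illustrative only: a refresh that persists exactly what the parser
# emitted will delete stored records for anything the parser silently
# skipped, which is the risk of dropping items on parse errors.
db     = {"img-1" => "registry/app@sha256:aaa", "img-2" => "registry/app@sha256:bbb"}
parsed = {"img-1" => "registry/app@sha256:aaa"}  # img-2 was skipped by the parser

deleted = db.keys - parsed.keys  # a naive "save" would drop these
db = parsed

puts "records deleted by refresh: #{deleted.inspect}"  # ["img-2"]
```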

@moolitayer

Do you have details on what the input looked like?

Also interested in that. Currently I think we assume the container -> container image link always exists, so removing that might cause unexpected problems (e.g. screens that rely on that relation). Maybe we can fix the parsing to match new cases? Or maybe this is due to a bug in Kubernetes that we need to mitigate (a missing image element in some cases)?

@agrare
Member Author

agrare commented Aug 16, 2017

Do you have details on what the input looked like?

@cben no, not yet. I was hoping to be able to tell from the log line on the customer env whether it is something we need to update that regex to accommodate, or something we need to work around.

@blomquisg were you guys able to find out from the customer anything about the invalid image?

@cben
Contributor

cben commented Aug 23, 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1484337
It has the full backtrace but no more details on the input that causes this.

@cben
Contributor

cben commented Aug 23, 2017

@enoodle Until we know more, this would log the exact input, and let refresh complete 👍

Currently I think we assume the container -> container image link always exists so removing that might cause unexpected problems (e.g screens that relay on that relation).

Makes sense. Is it enough to skip the container status, or the whole container?
Hmm, I think some containers don't have a container status (e.g. just created), so this should be OK.
On master, skipping parse_container_status should be enough; you'll get a Container with partial fields.
On fine/euwe this was parse_container IIRC; you'd get a ContainerDefinition without a Container.
@zeari is that right?

@@ -1077,11 +1077,14 @@ def parse_quantity(resource) # parse a string with a suffix into a int\float
end

def parse_container_status(container, pod_id)
container_image = parse_container_image(container.image, container.imageID)
return if container_image.nil?
Contributor

Can you log more info on the affected container? Ideally the whole input spec and status, which would have to happen in the caller.

This returns nil, but caller does containers_index[cn.name].merge!(parse_container_status(cn, pod.metadata.uid)) — doesn't merge!(nil) crash?
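A quick check confirms the suspicion (the data here is hypothetical; containers_index is just a plain hash, not the parser's actual structure):

```ruby
# Hash#merge! requires an argument convertible to a Hash, so passing the
# nil returned by parse_container_status would raise TypeError.
containers_index = {"c1" => {:name => "c1"}}

begin
  containers_index["c1"].merge!(nil)
rescue TypeError => e
  puts "crashes: #{e.class}"  # TypeError
end

# Guarding in the caller, as suggested below, avoids the crash:
status = nil  # what parse_container_status returns for an invalid image
containers_index["c1"].merge!(status) unless status.nil?
```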

Member Author

Good catch, I'll have parse_pod catch a nil status.
So would it be enough to log cn from here?


@cben
Contributor

cben commented Aug 23, 2017

Out of scope: I'm wondering if we should systematically catch exceptions in process_collection (*), log the whole offending input, and potentially continue to the next item.

(*) will need similar helper for get_*_graph methods that currently just do .each.
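A sketch of that idea, with process_collection here as an illustrative stand-in rather than the real ManageIQ helper:

```ruby
# Hypothetical per-item rescue: log the whole offending input and keep
# going, so one bad element no longer aborts the whole refresh.
def process_collection(collection, name)
  results = []
  collection.each do |item|
    begin
      results << yield(item)
    rescue => e
      puts "Error parsing #{name} item #{item.inspect}: #{e.message}"
    end
  end
  results
end

# "two" raises inside the block and is skipped; the rest are processed.
parsed = process_collection([1, "two", 3], "numbers") { |n| Integer(n) * 10 }
puts parsed.inspect  # [10, 30]
```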

@@ -766,8 +766,14 @@ def parse_pod(pod)

unless pod.status.nil? || pod.status.containerStatuses.nil?
pod.status.containerStatuses.each do |cn|
container_status = parse_container_status(cn, pod.metadata.uid)
if container_status.nil?
_log.warn("Invalid container status: pod [#{pod.metadata.uid}] container [#{cn}]")
Contributor

ideally also log containers_index[cn.name]

@cben
Contributor

cben commented Aug 23, 2017

LGTM 👍

image_parts = docker_pullable_re.match(image)
if image_parts.nil?
_log.warn("Invalid image #{image}")
return
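The effect of the guard can be sketched with a simplified pattern; the real docker_pullable_re in refresh_parser.rb is more elaborate than this illustrative regex:

```ruby
# Simplified stand-in for docker_pullable_re, for illustration only.
docker_pullable_re = %r{\A(?<host>[^/]+)/(?<name>[^@]+)@(?<digest>sha256:\h+)\z}

images = [
  "docker.io/foobar/app@sha256:#{'a' * 64}",  # well-formed
  "docker://sha256:#{'b' * 64}",              # unexpected form
]

images.each do |image|
  image_parts = docker_pullable_re.match(image)
  if image_parts.nil?
    puts "Invalid image #{image}"      # warn and skip instead of crashing
  else
    puts "host=#{image_parts[:host]}"
  end
end
```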


I fear that we will swallow future bugs this way.
The question is whether we should 'warn' or 'error'. When we error, we tend to get bugs filed.
I'm thinking that if we have an image missing from our inventory for unknown reasons, we want a bug.
cc @cben @simon3z

Member Author

I'm good with logging an error and getting a bug filed; I'm not good with throwing an exception and stopping the refresh for one bad image, which tends to lead to severity 1 issues :D

@moolitayer

moolitayer commented Aug 24, 2017

@agrare have we learned anything about the inputs causing the problem you are fixing?
We should get this PR merged, since we should not fail refresh due to one problem.
I'm thinking there might be another PR needed to fix an underlying issue, or maybe there is a problematic output, in which case we need to file an issue on openshift/kubernetes.

@moolitayer

@agrare do we have a BZ for the issue fixed here? Do we plan to eventually backport it?

@agrare
Member Author

agrare commented Aug 24, 2017

@moolitayer the customer is no longer seeing this issue; we are assuming the problematic container was deleted, but they were hitting this for about a week.
The BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1484337; I'll amend the commit to reflect this.

@agrare
Member Author

agrare commented Aug 24, 2017

@cben I have also been thinking we need something along the lines of what you said

I'd like some mechanism stronger than logging to communicate problems. User should see "Refreshed with 3 errors" or something.

@jameswnl and I were kicking some ideas around; we were thinking that if we had a way to build up a set of errors, we could raise them at the end of the refresh as a notification to the user, without stopping the refresh but also more visibly than errors in the logs. What do you think?

I'm wondering if we should systematically catch exceptions in process_collection (*), log the whole offending input, and potentially continue to next item.

I think this would be a good addition as well, along the lines of "we want to let the user know something is wrong but don't want to stop the refresh for it". Honestly, even exiting the refresh with an error usually goes unnoticed unless someone is looking for it.
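The notification idea could look roughly like this; ErrorCollector and every name below are illustrative sketches, not the actual ManageIQ notification API:

```ruby
# Hedged sketch: accumulate per-item errors during refresh and raise one
# user-visible summary at the end instead of aborting on the first error.
class ErrorCollector
  def initialize
    @errors = []
  end

  def record(item, error)
    @errors << {:item => item, :message => error.message}
  end

  def any?
    !@errors.empty?
  end

  def summary
    "Refreshed with #{@errors.size} errors"
  end
end

collector  = ErrorCollector.new
containers = [{:image => "registry/app:1"}, {:image => nil}, {:image => nil}]

containers.each do |cn|
  begin
    raise ArgumentError, "invalid image" if cn[:image].nil?
    # ... parse the container as usual ...
  rescue => e
    collector.record(cn, e)  # keep going; don't stop the refresh
  end
end

puts collector.summary if collector.any?  # "Refreshed with 2 errors"
```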

@cben
Contributor

cben commented Aug 24, 2017

I'm thinking there might be another PR needed to fix an underlying issue

Sure, but we have no idea what it is. Hopefully the logging in this PR will help catch such input.
Actually, it'll go unnoticed, since refresh won't fail :-[
But we keep getting logs from this customer; we should make sure we search for this message...

@simon3z
Contributor

simon3z commented Aug 25, 2017

@agrare @moolitayer @cben I think we need specs for this. We can unit-test parse_pod and parse_container_image as we do for other parsers.

@agrare
Member Author

agrare commented Aug 25, 2017

@simon3z yeah, I can add specs. I wish we knew what was causing the parsing error, but we can come back and add a test specifically for that once we apply this and find out.

@simon3z
Contributor

simon3z commented Aug 25, 2017

@simon3z yeah I can add specs, I wish we knew what was causing the parsing error but we can come back and add a test specifically for that when we apply this and find out

@agrare ah sorry, I forgot to write my thoughts on that 😄 ...I think it can be a particular state (especially when overloaded?) where a Pod has been created but not yet picked up by the scheduler.

@agrare
Member Author

agrare commented Aug 25, 2017

@simon3z yeah I can add specs, I wish we knew what was causing the parsing error but we can come back and add a test specifically for that when we apply this and find out

@agrare ah sorry I forgot to write my thoughts on that 😄 ...I think it can be a particular state (especially when overloaded?) when a Pod has been created but not yet picked up by the scheduler.

Oh okay :) so do you think the image name is nil?

@simon3z
Contributor

simon3z commented Aug 25, 2017

Oh okay :) so do you think the image name is nil?

@agrare if we're talking about these:

  status:
    ...
    containerStatuses:
    - ...
      image: docker.io/foobar/...
      imageID: docker-pullable://docker.io/foobar/...

then yes. In particular, imageID is filled in much later in the game, once the image has been downloaded to the node (which can take a while).

The one provided on "creation" (maybe it's better to say on "definition"), instead, must be there from the beginning:

  spec:
    containers:
    - ...
      image: docker.io/foobar/...

@agrare
Member Author

agrare commented Aug 25, 2017

@simon3z okay, and do you think it makes sense to skip these containers if they don't have an imageID yet? We could just not link them up with an image but keep the container. Up to you; I don't know enough about the assumptions made here.

@agrare
Member Author

agrare commented Aug 25, 2017

Personally I think it'd be better to parse and save the container and leave the image link blank until it is filled in, but that's just me :)

@agrare
Member Author

agrare commented Aug 25, 2017

@simon3z hmm, actually I don't know that it is a missing imageID; from the original backtrace in the BZ it fails on this line:
hostname = image_parts[:host] || image_parts[:host2] || image_parts[:localhost], so it's nothing to do with the image_ref
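That failure mode reproduces easily: Regexp#match returns nil for non-matching input, and indexing nil raises exactly the NoMethodError from the log (the regex here is a simplified stand-in, not the real one):

```ruby
# Simplified stand-in regex; the point is the nil MatchData, not the pattern.
docker_pullable_re = %r{\Adocker-pullable://(?<host>[^/]+)/}

image_parts = docker_pullable_re.match("docker://sha256:d130c244f9cd")
begin
  hostname = image_parts[:host]  # image_parts is nil here
rescue NoMethodError => e
  puts "refresh aborts with: #{e.class}"  # NoMethodError
end
```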

@miq-bot
Member

miq-bot commented Aug 25, 2017

Checked commits agrare/manageiq-providers-kubernetes@22f8ab0~...15c6799 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks fine. 👍

@simon3z simon3z merged commit f006a25 into ManageIQ:master Aug 29, 2017
@agrare agrare deleted the skip_invalid_image_names branch August 29, 2017 15:22
@cben
Contributor

cben commented Aug 29, 2017

👍 to ignoring Code Climate; parse_image_name is indeed complex, but out of scope here.

@agrare agrare added this to the Sprint 68 Ending Sep 4, 2017 milestone Aug 29, 2017
@simaishi

Backported to Euwe via ManageIQ/manageiq#15918

@simaishi

Backported to Fine via ManageIQ/manageiq#16019

@cben
Contributor

cben commented Sep 26, 2017

This PR caught the invalid image :-) Example:

MIQ(ManageIQ::Providers::Openshift::ContainerManager::RefreshParser#parse_pod) Invalid container: pod - [4649a035-1691-11e7-9ba8-0050568704dd] container - [#<Kubeclient::Pod name="author", state={:waiting=>{:reason=>"CrashLoopBackOff", :message=>"Back-off 5m0s restarting failed container=author pod=author-5-vf7dy_yzhang15(4649a035-1691-11e7-9ba8-0050568704dd)"}}, lastState={:terminated=>{:exitCode=>1, :reason=>"Error", :startedAt=>"2017-09-25T19:42:43Z", :finishedAt=>"2017-09-25T19:43:14Z", :containerID=>"docker://4191dc91456ffa4251348c1c2d0de803f0d3d05e7f1e697e871cb126c16a1318"}}, ready=false, restartCount=46546, image="172.30.101.240:5000/yzhang15/author@sha256:a574f1929deeedda313e78068f35b07141e9f0b12a72e709c51ac582f94da6cf", imageID="docker://sha256:d130c244f9cdb82435eca2c950bb6fdd4357b4ada599aacabf8691760983562b", containerID="docker://4191dc91456ffa4251348c1c2d0de803f0d3d05e7f1e697e871cb126c16a1318">]
MIQ(ManageIQ::Providers::Openshift::ContainerManager::RefreshParser#parse_image_name) Invalid image_ref docker://sha256:d130c244f9cdb82435eca2c950bb6fdd4357b4ada599aacabf8691760983562b

cc @enoodle does this ^^ tell you anything?

(Should open BZ, but wanted to dump info somewhere for now)

@agrare
Member Author

agrare commented Sep 26, 2017

@cben there is also another type where the imageID is blank:

MIQ(ManageIQ::Providers::Openshift::ContainerManager::RefreshParser#parse_image_name) Invalid image_ref 
MIQ(ManageIQ::Providers::Openshift::ContainerManager::RefreshParser#parse_pod) Invalid container: pod - [74a74292-77bc-11e7-aa07-0050568704dd] container - [#<Kubeclient::Pod name="pnp-a4me-widget-738", state={:waiting=>{:reason=>"ImagePullBackOff", :message=>"Back-off pulling image \"docker.example.com/riptide_apps/a4me-widget@sha256:cad220c47ccaa05df88d35b4f25abf81d6cf26aaf357eb1302b485ba9a590aaa\""}}, lastState={}, ready=false, restartCount=0, image="docker.example.com/riptide_apps/a4me-widget@sha256:cad220c47ccaa05df88d35b4f25abf81d6cf26aaf357eb1302b485ba9a590aaa", imageID="">]
