
Skip invalid container_images #94

Merged
merged 3 commits
Aug 29, 2017

Conversation

agrare
Member

@agrare agrare commented Aug 15, 2017

[----] E, [2017-08-08T05:55:00.221227 #15805:1041130] ERROR -- : [NoMethodError]: undefined method `[]' for nil:NilClass  Method:[rescue in block in refresh]
[----] E, [2017-08-08T05:55:00.221475 #15805:1041130] ERROR -- : /var/www/miq/vmdb/app/models/manageiq/providers/kubernetes/container_manager/refresh_parser.rb:767:in `parse_image_name'

https://bugzilla.redhat.com/show_bug.cgi?id=1484337

@cben
Contributor

cben commented Aug 16, 2017

Do you have details on what the input looked like?
cc @zgalor this is in fetch_hawk_inv (unrelated adjacent log line, known other issue)

It's generally a good idea not to abort refresh on errors.

  • I'm not sure we should skip the container too.
  • If we just drop things in the parser, existing records can be deleted.
  • I'd like some mechanism stronger than logging to communicate problems. The user should see "Refreshed with 3 errors" or something.
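The concern about dropped records can be seen with a toy sketch; the in-memory store and naive "save" below are purely illustrative, not ManageIQ's actual save_inventory code:

```ruby
# Illustrative only: a refresh that persists exactly what the parser
# emitted will delete stored records for anything the parser silently
# skipped, which is the risk of dropping items on parse errors.
db     = {"img-1" => "registry/app@sha256:aaa", "img-2" => "registry/app@sha256:bbb"}
parsed = {"img-1" => "registry/app@sha256:aaa"}  # img-2 was skipped by the parser

deleted = db.keys - parsed.keys  # a naive "save" would drop these
db = parsed

puts "records deleted by refresh: #{deleted.inspect}"  # ["img-2"]
```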

@moolitayer

Do you have details on what the input looked like?

Also interested in that. Currently I think we assume the container -> container image link always exists, so removing that might cause unexpected problems (e.g. screens that rely on that relation). Maybe we can fix the parsing to match new cases? Or maybe this is due to a bug in Kubernetes that we need to mitigate (a missing image element in some cases)?

@agrare
Member Author

agrare commented Aug 16, 2017

Do you have details on what the input looked like?

@cben no, not yet. I was hoping to be able to tell from the log line on the customer env whether it is something we need to update that regex to accommodate, or something we need to work around.

@blomquisg were you guys able to find out from the customer anything about the invalid image?

@cben
Contributor

cben commented Aug 23, 2017

https://bugzilla.redhat.com/show_bug.cgi?id=1484337
It has the full backtrace but no more details on the input that causes this.

@cben
Contributor

cben commented Aug 23, 2017

@enoodle Until we know more, this would log the exact input, and let refresh complete 👍

Currently I think we assume the container -> container image link always exists so removing that might cause unexpected problems (e.g screens that relay on that relation).

Makes sense. Is it enough to skip the container status, or the whole container?
Hmm, I think some containers don't have a container status (e.g. just created), so this should be OK.
On master, skipping parse_container_status should be enough; you'll get a Container with partial fields.
On fine/euwe this was parse_container IIRC; you'd get a ContainerDefinition without a Container.
@zeari is that right?

@@ -1077,11 +1077,14 @@ def parse_quantity(resource) # parse a string with a suffix into a int\float
end

def parse_container_status(container, pod_id)
container_image = parse_container_image(container.image, container.imageID)
return if container_image.nil?
Contributor

Can you log more info on the affected container? Ideally the whole input spec and status, which would have to happen in the caller.

This returns nil, but caller does containers_index[cn.name].merge!(parse_container_status(cn, pod.metadata.uid)) — doesn't merge!(nil) crash?
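A quick check confirms the suspicion (the data here is hypothetical; containers_index is just a plain hash, not the parser's actual structure):

```ruby
# Hash#merge! requires an argument convertible to a Hash, so passing the
# nil returned by parse_container_status would raise TypeError.
containers_index = {"c1" => {:name => "c1"}}

begin
  containers_index["c1"].merge!(nil)
rescue TypeError => e
  puts "crashes: #{e.class}"  # TypeError
end

# Guarding in the caller, as suggested below, avoids the crash:
status = nil  # what parse_container_status returns for an invalid image
containers_index["c1"].merge!(status) unless status.nil?
```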

Member Author

Good catch, I'll have parse_pod catch a nil status.
So would it be enough to log cn from here?


@cben
Contributor

cben commented Aug 23, 2017

Out of scope: I'm wondering if we should systematically catch exceptions in process_collection (*), log the whole offending input, and potentially continue to the next item.

(*) will need similar helper for get_*_graph methods that currently just do .each.
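A sketch of that idea, with process_collection here as an illustrative stand-in rather than the real ManageIQ helper:

```ruby
# Hypothetical per-item rescue: log the whole offending input and keep
# going, so one bad element no longer aborts the whole refresh.
def process_collection(collection, name)
  results = []
  collection.each do |item|
    begin
      results << yield(item)
    rescue => e
      puts "Error parsing #{name} item #{item.inspect}: #{e.message}"
    end
  end
  results
end

# "two" raises inside the block and is skipped; the rest are processed.
parsed = process_collection([1, "two", 3], "numbers") { |n| Integer(n) * 10 }
puts parsed.inspect  # [10, 30]
```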

@@ -766,8 +766,14 @@ def parse_pod(pod)

unless pod.status.nil? || pod.status.containerStatuses.nil?
pod.status.containerStatuses.each do |cn|
container_status = parse_container_status(cn, pod.metadata.uid)
if container_status.nil?
_log.warn("Invalid container status: pod [#{pod.metadata.uid}] container [#{cn}]")
Contributor

ideally also log containers_index[cn.name]

@cben
Contributor

cben commented Aug 23, 2017

LGTM 👍

image_parts = docker_pullable_re.match(image)
if image_parts.nil?
_log.warn("Invalid image #{image}")
return
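The effect of the guard can be sketched with a simplified pattern; the real docker_pullable_re in refresh_parser.rb is more elaborate than this illustrative regex:

```ruby
# Simplified stand-in for docker_pullable_re, for illustration only.
docker_pullable_re = %r{\A(?<host>[^/]+)/(?<name>[^@]+)@(?<digest>sha256:\h+)\z}

images = [
  "docker.io/foobar/app@sha256:#{'a' * 64}",  # well-formed
  "docker://sha256:#{'b' * 64}",              # unexpected form
]

images.each do |image|
  image_parts = docker_pullable_re.match(image)
  if image_parts.nil?
    puts "Invalid image #{image}"      # warn and skip instead of crashing
  else
    puts "host=#{image_parts[:host]}"
  end
end
```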


I fear that we will swallow future bugs this way.
The question is whether we should 'warn' or 'error'. When we error, we tend to get bugs filed.
I'm thinking that if we have an image missing from our inventory for unknown reasons, we want a bug.
cc @cben @simon3z

Member Author

I'm good with logging an error and getting a bug filed; I'm not good with throwing an exception and stopping the refresh for one bad image, which tends to lead to severity 1 issues :D

@moolitayer

moolitayer commented Aug 24, 2017

@agrare have we learned anything about the inputs causing the problem you are fixing?
We should get this PR merged, since we should not fail refresh due to one problem.
I'm thinking there might be another PR needed to fix an underlying issue, or maybe there is a problematic output, in which case we need to file an issue on openshift/kubernetes.

@moolitayer

@agrare do we have a BZ for the issue fixed here? Do we plan to eventually backport it?

@agrare
Member Author

agrare commented Aug 24, 2017

@moolitayer the customer is no longer seeing this issue; we are assuming the problematic container was deleted, but they were hitting this for about a week.
The BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1484337; I'll amend the commit to reflect this.

@agrare
Member Author

agrare commented Aug 24, 2017

@cben I have also been thinking we need something along the lines of what you said

I'd like some mechanism stronger than logging to communicate problems. User should see "Refreshed with 3 errors" or something.

@jameswnl and I were kicking some ideas around; we were thinking that if we had a way to build up a set of errors, we could raise them at the end of the refresh as a notification to the user, without stopping the refresh but also more visibly than errors in the logs. What do you think?

I'm wondering if we should systematically catch exceptions in process_collection (*), log the whole offending input, and potentially continue to next item.

I think this would be a good addition as well, along the lines of "we want to let the user know something is wrong but don't want to stop the refresh for it". Honestly, even exiting the refresh with an error usually goes unnoticed unless someone is looking for it.
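The notification idea could look roughly like this; ErrorCollector and every name below are illustrative sketches, not the actual ManageIQ notification API:

```ruby
# Hedged sketch: accumulate per-item errors during refresh and raise one
# user-visible summary at the end instead of aborting on the first error.
class ErrorCollector
  def initialize
    @errors = []
  end

  def record(item, error)
    @errors << {:item => item, :message => error.message}
  end

  def any?
    !@errors.empty?
  end

  def summary
    "Refreshed with #{@errors.size} errors"
  end
end

collector  = ErrorCollector.new
containers = [{:image => "registry/app:1"}, {:image => nil}, {:image => nil}]

containers.each do |cn|
  begin
    raise ArgumentError, "invalid image" if cn[:image].nil?
    # ... parse the container as usual ...
  rescue => e
    collector.record(cn, e)  # keep going; don't stop the refresh
  end
end

puts collector.summary if collector.any?  # "Refreshed with 2 errors"
```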

@cben
Contributor

cben commented Aug 24, 2017

I'm thinking there might be another PR needed to fix an underlying issue

Sure, but we have no idea what it is. Hopefully the logging in this PR will help catch such input.
Actually, it'll go unnoticed, since refresh won't fail :-[
But we keep getting logs from this customer; we should make sure we search for this message...

@simon3z
Contributor

simon3z commented Aug 25, 2017

@agrare @moolitayer @cben I think we need specs for this. We can unit-test parse_pod and parse_container_image as we do for other parsers.

@agrare
Member Author

agrare commented Aug 25, 2017

@simon3z yeah, I can add specs. I wish we knew what was causing the parsing error, but we can come back and add a test specifically for that once we apply this and find out.

@simon3z
Contributor

simon3z commented Aug 25, 2017

@simon3z yeah I can add specs, I wish we knew what was causing the parsing error but we can come back and add a test specifically for that when we apply this and find out

@agrare ah sorry, I forgot to write my thoughts on that 😄 ...I think it can be a particular state (especially when overloaded?) where a Pod has been created but not yet picked up by the scheduler.

@agrare
Member Author

agrare commented Aug 25, 2017

@simon3z yeah I can add specs, I wish we knew what was causing the parsing error but we can come back and add a test specifically for that when we apply this and find out

@agrare ah sorry I forgot to write my thoughts on that 😄 ...I think it can be a particular state (especially when overloaded?) when a Pod has been created but not yet picked up by the scheduler.

Oh okay :) so do you think the image name is nil?

@simon3z
Contributor

simon3z commented Aug 25, 2017

Oh okay :) so do you think the image name is nil?

@agrare if we're talking about these:

  status:
    ...
    containerStatuses:
    - ...
      image: docker.io/foobar/...
      imageID: docker-pullable://docker.io/foobar/...

then yes. In particular, imageID is filled in much later in the game, once the image has been downloaded to the node (which can take a while).

The one provided on "creation" (maybe it's better to say on "definition"), instead, must be there from the beginning:

  spec:
    containers:
    - ...
      image: docker.io/foobar/...

@agrare
Member Author

agrare commented Aug 25, 2017

@simon3z okay, and do you think it makes sense to skip these containers if they don't have an imageID yet? We could just not link them up with an image but keep the container. Up to you; I don't know enough about the assumptions made here.

@agrare
Member Author

agrare commented Aug 25, 2017

Personally I think it'd be better to parse and save the container and leave the image link blank until it is filled in, but that's just me :)

@agrare
Member Author

agrare commented Aug 25, 2017

@simon3z hmm, actually I don't know that it is a missing imageID; from the original backtrace in the BZ it fails on this line:
hostname = image_parts[:host] || image_parts[:host2] || image_parts[:localhost], so it's nothing to do with the image_ref
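That failure mode reproduces easily: Regexp#match returns nil for non-matching input, and indexing nil raises exactly the NoMethodError from the log (the regex here is a simplified stand-in, not the real one):

```ruby
# Simplified stand-in regex; the point is the nil MatchData, not the pattern.
docker_pullable_re = %r{\Adocker-pullable://(?<host>[^/]+)/}

image_parts = docker_pullable_re.match("docker://sha256:d130c244f9cd")
begin
  hostname = image_parts[:host]  # image_parts is nil here
rescue NoMethodError => e
  puts "refresh aborts with: #{e.class}"  # NoMethodError
end
```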

@miq-bot
Member

miq-bot commented Aug 25, 2017

Checked commits agrare/manageiq-providers-kubernetes@22f8ab0~...15c6799 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks fine. 👍

@simon3z simon3z merged commit f006a25 into ManageIQ:master Aug 29, 2017
@agrare agrare deleted the skip_invalid_image_names branch August 29, 2017 15:22
@cben
Contributor

cben commented Aug 29, 2017

👍 to ignoring Code Climate; parse_image_name is indeed complex, but out of scope here.

@agrare agrare added this to the Sprint 68 Ending Sep 4, 2017 milestone Aug 29, 2017
@simaishi

Backported to Euwe via ManageIQ/manageiq#15918

@simaishi

Backported to Fine via ManageIQ/manageiq#16019

@cben
Contributor

cben commented Sep 26, 2017

This PR caught the invalid image :-) Example:

MIQ(ManageIQ::Providers::Openshift::ContainerManager::RefreshParser#parse_pod) Invalid container: pod - [4649a035-1691-11e7-9ba8-0050568704dd] container - [#<Kubeclient::Pod name="author", state={:waiting=>{:reason=>"CrashLoopBackOff", :message=>"Back-off 5m0s restarting failed container=author pod=author-5-vf7dy_yzhang15(4649a035-1691-11e7-9ba8-0050568704dd)"}}, lastState={:terminated=>{:exitCode=>1, :reason=>"Error", :startedAt=>"2017-09-25T19:42:43Z", :finishedAt=>"2017-09-25T19:43:14Z", :containerID=>"docker://4191dc91456ffa4251348c1c2d0de803f0d3d05e7f1e697e871cb126c16a1318"}}, ready=false, restartCount=46546, image="172.30.101.240:5000/yzhang15/author@sha256:a574f1929deeedda313e78068f35b07141e9f0b12a72e709c51ac582f94da6cf", imageID="docker://sha256:d130c244f9cdb82435eca2c950bb6fdd4357b4ada599aacabf8691760983562b", containerID="docker://4191dc91456ffa4251348c1c2d0de803f0d3d05e7f1e697e871cb126c16a1318">]
MIQ(ManageIQ::Providers::Openshift::ContainerManager::RefreshParser#parse_image_name) Invalid image_ref docker://sha256:d130c244f9cdb82435eca2c950bb6fdd4357b4ada599aacabf8691760983562b

cc @enoodle does this ^^ tell you anything?

(Should open BZ, but wanted to dump info somewhere for now)

@agrare
Member Author

agrare commented Sep 26, 2017

@cben there is also another type where the imageID is blank:

MIQ(ManageIQ::Providers::Openshift::ContainerManager::RefreshParser#parse_image_name) Invalid image_ref 
MIQ(ManageIQ::Providers::Openshift::ContainerManager::RefreshParser#parse_pod) Invalid container: pod - [74a74292-77bc-11e7-aa07-0050568704dd] container - [#<Kubeclient::Pod name="pnp-a4me-widget-738", state={:waiting=>{:reason=>"ImagePullBackOff", :message=>"Back-off pulling image \"docker.example.com/riptide_apps/a4me-widget@sha256:cad220c47ccaa05df88d35b4f25abf81d6cf26aaf357eb1302b485ba9a590aaa\""}}, lastState={}, ready=false, restartCount=0, image="docker.example.com/riptide_apps/a4me-widget@sha256:cad220c47ccaa05df88d35b4f25abf81d6cf26aaf357eb1302b485ba9a590aaa", imageID="">]
