OCPBUGS-35199: daemon: skip imageInspect during checkOS for PinnedImages #4402

Closed
wants to merge 1 commit into from

Conversation

hexfusion

@hexfusion hexfusion commented Jun 11, 2024

This PR is a follow-up to #4347 and #3821. This PR skips the imageInspect check if PinnedImages feature gate is enabled and the osImage has been pulled locally.
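
The decision described above can be sketched roughly as follows. This is a minimal illustration in Python, not the daemon's Go code: the names `image_pulled_locally` and `should_skip_image_inspect` and the injectable `run` parameter are hypothetical. The local check reuses the `podman images -q --filter reference=<image>` command that appears in the daemon logs later in this thread, and assumes that any non-empty output means the image is present.

```python
import subprocess
from typing import Callable, List


def _run_podman(args: List[str]) -> str:
    """Run podman with the given arguments and return its stdout."""
    result = subprocess.run(
        ["podman", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def image_pulled_locally(
    image: str, run: Callable[[List[str]], str] = _run_podman
) -> bool:
    """Return True if podman lists at least one image ID for the reference."""
    output = run(["images", "-q", "--filter", f"reference={image}"])
    return bool(output.strip())


def should_skip_image_inspect(
    pinned_images_enabled: bool,
    os_image: str,
    run: Callable[[List[str]], str] = _run_podman,
) -> bool:
    """Skip the remote inspect only when the gate is on and the image is local."""
    if not pinned_images_enabled:
        # Hypothetical gate flag: with the feature off, keep the existing check.
        return False
    return image_pulled_locally(os_image, run)
```

Passing `run` explicitly keeps the sketch testable without podman installed; in real use the default runner would shell out on the node.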

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 11, 2024
@openshift-ci-robot

@hexfusion: This pull request references Jira Issue OCPBUGS-35199, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This PR is a follow-up to #3821 and skips the imageInspect check if PinnedImages feature is enabled and the osImage has been pulled locally.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 11, 2024

openshift-ci bot commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hexfusion
Once this PR has been reviewed and has the lgtm label, please assign djoshy for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
@openshift-ci-robot

@hexfusion: This pull request references Jira Issue OCPBUGS-35199, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

This PR is a follow-up to #4347 and #3821. This PR skips the imageInspect check if PinnedImages feature is enabled and the osImage has been pulled locally.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


openshift-ci bot commented Jun 11, 2024

@hexfusion: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-hypershift | 6ce6cd3 | link | true | /test e2e-hypershift |
| ci/prow/e2e-gcp-op-techpreview | 6ce6cd3 | link | false | /test e2e-gcp-op-techpreview |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 6ce6cd3 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/e2e-vsphere-ovn-upi-zones | 6ce6cd3 | link | false | /test e2e-vsphere-ovn-upi-zones |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sergiordlr

sergiordlr commented Jun 13, 2024

We ran an upgrade in a disconnected cluster, with pinned images and an empty pull-secret, so the nodes had no access to any registry.

Upgrade from 4.17.0-0.nightly-2024-06-06-061523 to 4.16.0-0.ci.test-2024-06-12-092356-ci-ln-frc8mnk-latest (ci image with our fix)

We saw this error in the MCD logs:

2024-06-13T10:11:04.640197053+00:00 stderr F I0613 10:11:04.640183  151780 rpm-ostree.go:316] Running captured: podman images -q --filter reference=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307
2024-06-13T10:11:10.825691692+00:00 stderr F I0613 10:11:10.825643  151780 pinned_image_set.go:426] Completed scheduling 25% of images
2024-06-13T10:11:20.840856559+00:00 stderr F I0613 10:11:20.840805  151780 pinned_image_set.go:426] Completed scheduling 50% of images
2024-06-13T10:11:30.856404386+00:00 stderr F I0613 10:11:30.856359  151780 pinned_image_set.go:426] Completed scheduling 75% of images
2024-06-13T10:11:40.872787010+00:00 stderr F I0613 10:11:40.872741  151780 pinned_image_set.go:426] Completed scheduling 100% of images
2024-06-13T10:11:42.981134742+00:00 stderr F I0613 10:11:42.981101  151780 pinned_image_set.go:527] CRI-O config file is up to date, no reload required
2024-06-13T10:12:03.746391202+00:00 stderr F I0613 10:12:03.746352  151780 certificate_writer.go:288] Certificate was synced from controllerconfig resourceVersion 114001
2024-06-13T10:12:04.720906853+00:00 stderr F time="2024-06-13T10:12:04Z" level=warning msg="Failed, retrying in 1s ... (1/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 54.86.200.156:443: i/o timeout"
2024-06-13T10:13:05.741726483+00:00 stderr F time="2024-06-13T10:13:05Z" level=warning msg="Failed, retrying in 2s ... (2/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 54.221.103.142:443: i/o timeout"
2024-06-13T10:14:03.185800816+00:00 stderr F I0613 10:14:03.185750  151780 daemon.go:1364] Shutting down MachineConfigDaemon

The image that triggers the error is the original coreos image, not the target coreos image:

$ oc adm release info registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-06-061523 --pullspecs| grep coreos
  rhel-coreos                                    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307
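
One way to confirm which image the failing lookups refer to is to compare the digests in the retry warnings against known pullspecs. The following is a minimal sketch under stated assumptions: the helper names `failing_digests` and `match_images` are hypothetical, and the sketch assumes failures can be recognized by the `level=warning` and `Failed, retrying` markers seen in the log excerpt above.

```python
import re
from typing import Dict, Set

# A sha256 digest as it appears in image pullspecs.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def failing_digests(log_text: str) -> Set[str]:
    """Collect digests from log lines that report a retried failure."""
    digests: Set[str] = set()
    for line in log_text.splitlines():
        if "level=warning" in line and "Failed, retrying" in line:
            digests.update(DIGEST_RE.findall(line))
    return digests


def match_images(log_text: str, candidates: Dict[str, str]) -> Set[str]:
    """Return the labels of candidate pullspecs whose digest appears in a failure."""
    seen = failing_digests(log_text)
    matched: Set[str] = set()
    for label, pullspec in candidates.items():
        found = DIGEST_RE.search(pullspec)
        if found and found.group(0) in seen:
            matched.add(label)
    return matched
```

Called with the MCD log text and a mapping such as `{"original": <pullspec>, "target": <pullspec>}`, it returns the labels whose digests show up in failed lookups.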

After reporting this error for a long while, and even after several MCD restarts, the configuration is eventually applied. I don't know why the MCD stops restarting and eventually decides to apply the configuration.

2024-06-13T10:19:44.284840446+00:00 stderr F I0613 10:19:44.284800  155375 pinned_image_set.go:527] CRI-O config file is up to date, no reload required
2024-06-13T10:20:05.045533819+00:00 stderr F I0613 10:20:05.045476  155375 certificate_writer.go:288] Certificate was synced from controllerconfig resourceVersion 114001
2024-06-13T10:20:06.014479760+00:00 stderr F time="2024-06-13T10:20:06Z" level=warning msg="Failed, retrying in 1s ... (1/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 44.194.103.74:443: i/o timeout"
2024-06-13T10:21:07.035257526+00:00 stderr F time="2024-06-13T10:21:07Z" level=warning msg="Failed, retrying in 2s ... (2/2). Error: (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 54.173.5.6:443: i/o timeout"
2024-06-13T10:21:55.224373042+00:00 stderr F I0613 10:21:55.224329  155375 pinned_image_set.go:302] Reconciling pinned image set: 99-worker-pinned-release: generation: 1
2024-06-13T10:21:55.328957322+00:00 stderr F I0613 10:21:55.328920  155375 pinned_image_set.go:527] CRI-O config file is up to date, no reload required
2024-06-13T10:22:09.059813690+00:00 stderr F W0613 10:22:09.059770  155375 daemon.go:2620] Unable to check manifest for matching hash: error parsing image name "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307": (Mirrors also failed: [ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: reading manifest sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307 in ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocp/release: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7ab7b8bdb9f6ddde8d3f860324cacbb2e2ec7438b4228aadbf6fcb71dc1aa307: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 54.86.200.156:443: i/o timeout
2024-06-13T10:22:09.059813690+00:00 stderr F I0613 10:22:09.059793  155375 rpm-ostree.go:316] Running captured: rpm-ostree kargs
2024-06-13T10:22:09.171178020+00:00 stderr F I0613 10:22:09.171136  155375 update.go:2621] Validated on-disk state
2024-06-13T10:22:09.222870849+00:00 stderr F I0613 10:22:09.222830  155375 update.go:2643] Adding SIGTERM protection
2024-06-13T10:22:09.242960693+00:00 stderr F I0613 10:22:09.242923  155375 update.go:1009] Checking Reconcilable for config rendered-worker-c11566a8572146defaf95ca346654742 to rendered-worker-64602b930cbae5db1feb30e67b974b39
2024-06-13T10:22:09.288117764+00:00 stderr F I0613 10:22:09.288074  155375 update.go:2621] Starting update from rendered-worker-c11566a8572146defaf95ca346654742 to rendered-worker-64602b930cbae5db1feb30e67b974b39: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
2024-06-13T10:22:09.322117482+00:00 stderr F I0613 10:22:09.322082  155375 update.go:757] Calculating node disruption actions
2024-06-13T10:22:09.322117482+00:00 stderr F I0613 10:22:09.322111  155375 drain.go:121] Checking drain required for node disruption actions

Eventually we get an upgrade that is reported to be successful

$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci.test-2024-06-12-092356-ci-ln-frc8mnk-latest   True        False         90m     Cluster version is 4.16.0-0.ci.test-2024-06-12-092356-ci-ln-frc8mnk-latest

Nevertheless, the oc debug command does not work, because it still tries to use the original image instead of the one that corresponds to the new cluster version. The tools imagestream in the openshift namespace shows that the new image could not be imported:

$ oc get is -n openshift tools -oyaml
....
  tags:
  - conditions:
    - generation: 8
      lastTransitionTime: "2024-06-13T11:01:04Z"
      message: 'Internal error occurred: [you may not have access to the container
        image "ec2-18-217-101-126.us-east-2.compute.amazonaws.com:5000/ocpupg/release@sha256:4ff6d13185bb1e378d8527ad6d3a3a92e11024488c1a74bffc42b0c9f8f21fd7",
        registry.build03.ci.openshift.org/ci-ln-frc8mnk/stable@sha256:4ff6d13185bb1e378d8527ad6d3a3a92e11024488c1a74bffc42b0c9f8f21fd7:
        Get "https://registry.build03.ci.openshift.org/v2/": dial tcp 54.172.72.33:443:
        i/o timeout]'
      reason: InternalError
      status: "False"
      type: ImportSuccess
    items:

We can observe similar import failures in these imagestreams in the openshift namespace:

    name: cli
    name: cli-artifacts
    name: driver-toolkit
    name: installer
    name: installer-artifacts
    name: must-gather
    name: network-tools
    name: oauth-proxy
    name: tests
    name: tools
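
A list like the one above could be gathered programmatically by scanning each imagestream's tag conditions for a failed `ImportSuccess`. This is a minimal sketch, assuming the input is the parsed JSON from `oc get is -n openshift -o json` and that the tag name is stored under `tag`, alongside the `conditions` shown in the YAML excerpt; the function name `failed_imports` is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def failed_imports(image_stream_list: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (imagestream, tag, message) for every tag whose import failed."""
    failures: List[Tuple[str, str, str]] = []
    for stream in image_stream_list.get("items") or []:
        name = (stream.get("metadata") or {}).get("name", "")
        tags = (stream.get("status") or {}).get("tags") or []
        for tag in tags:
            # "conditions" may be absent or null when the import succeeded.
            for cond in tag.get("conditions") or []:
                if cond.get("type") == "ImportSuccess" and cond.get("status") == "False":
                    failures.append((name, tag.get("tag", ""), cond.get("message", "")))
    return failures
```

The result can be printed or grouped by imagestream name to see at a glance which imports are blocked by the missing registry access.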

Extensions are working fine after the upgrade

$ oc debug -q --image registry.build03.ci.openshift.org/ci-ln-frc8mnk/stable@sha256:4ff6d13185bb1e378d8527ad6d3a3a92e11024488c1a74bffc42b0c9f8f21fd7 node/ip-10-0-52-40 -- chroot /host rpm -q usbguard
usbguard-1.0.0-15.el9.x86_64

We are holding the PR until we decide whether the "reading manifest" error needs to be fixed before merging.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 13, 2024
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2024
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 12, 2024
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Nov 12, 2024

openshift-ci bot commented Nov 12, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot

@hexfusion: This pull request references Jira Issue OCPBUGS-35199. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

This PR is a follow-up to #4347 and #3821. This PR skips the imageInspect check if PinnedImages feature gate is enabled and the osImage has been pulled locally.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.