This repository has been archived by the owner on May 7, 2021. It is now read-only.

kola: add run-upgrade command #1168

Merged 4 commits on Jan 27, 2020

Conversation

@jlebon (Member) commented Jan 22, 2020

This adds a new `run-upgrade` command focused on running upgrade tests.
It also adds a single test in that testsuite: `fcos.upgrade.basic`.

To run this test, one can do:

```
kola run-upgrade -v \
        --cosa-build /path/to/meta.json \
        --qemu-image /path/to/starting-image.qcow2
```

You can tell kola to automatically detect the parent image to start
from:

```
kola run-upgrade -v \
        --cosa-build /path/to/meta.json \
        --find-parent-image
```

For FCOS, this will fetch the metadata for the latest release for the
target stream. On AWS, it will use the AMI from there as the starting
image. On the QEMU platform, it will download the QEMU image locally
(with signature verification). The code is extensible to add support for
RHCOS and other target platforms.
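For illustration, that lookup corresponds roughly to the following queries against the published FCOS stream metadata (a sketch only; the exact endpoint and fields kola uses may differ, and `stable`/`us-east-1` are just example values):

```
curl -s https://builds.coreos.fedoraproject.org/streams/stable.json > stable.json

# AMI to boot as the starting image on AWS
jq -r '.architectures.x86_64.images.aws.regions["us-east-1"].image' stable.json

# compressed qcow2 (and its detached signature) to download for the qemu platform
jq -r '.architectures.x86_64.artifacts.qemu.formats["qcow2.xz"].disk.location' stable.json
jq -r '.architectures.x86_64.artifacts.qemu.formats["qcow2.xz"].disk.signature' stable.json
```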

Why make it a separate command from `run`? Multiple reasons:

1. As shown above, it's about multiple artifacts, not just the system
   under test. By contrast, `run` is largely about using a single
   artifact input. For example, on AWS, `--aws-ami` points to the
   *starting* image, and `--cosa-build` points to the target upgrade.
2. It's more expensive than other tests. To make it truly cross-platform
   and self-contained, it works by pushing the OSTree content to the
   node and serving it from there to itself (see the sketch after this
   list). Therefore, it's not a test that developers would necessarily
   be interested in running locally very often (though it's definitely
   adapted for local tests too when needed).
3. Unlike `run`, it has some special semantics like
   `--find-parent-image` to make it easier to use.
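To make point 2 concrete, the "serve it to itself" idea boils down to something like the following on the node. This is only a conceptual sketch with made-up paths and ref; the actual test drives the update through the regular update stack rather than a manual rebase, and any small HTTP server would do (`python3` is just for illustration):

```
# the target build's OSTree repo has been copied onto the node, e.g. under /var/srv/upgrade
cd /var/srv/upgrade/repo && python3 -m http.server 8080 &

# point the node at itself and upgrade to the target build
ostree remote add --no-gpg-verify upgrade-test http://localhost:8080
rpm-ostree rebase upgrade-test:fedora/x86_64/coreos/testing-devel
systemctl reboot
```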

Now, this is only part of the FCOS upgrade testing story. Here's roughly
how I see this all fit together:

1. The FCOS pipeline runs `kola run-upgrade -p qemu` and possibly
   `kola run-upgrade -p aws` after the basic `kola run` tests have
   passed.
2. Once the build is clean and pushed out to S3, its content will be
   imported into the annex/compose repo.
3. Once there, we can do more realistic tests by targeting the annex
   repo and a dedicated Cincinnati. For example, we can have canary
   nodes following those updates that started from various previous
   releases to catch any state-dependent issues. Another, more explicit
   approach is a test that starts nodes at selected releases and gates
   new releases on that result.

Essentially, the main advantage of this test is that we can do some
upgrade testing *before* pushing out any bits at all to S3. The major
bug category this is intended to catch is state-dependent issues (i.e.
anything that *isn't* captured by the OSTree commit).

However, it does also exercise many of the major parts of the update
system (zincati, rpm-ostree, ostree, libcurl), though it's clearly not
a replacement for more realistic e2e tests downstream.

@jlebon (Member, Author) commented Jan 22, 2020

Marking as WIP for now. I have to split out lots of commits into separate prep PRs first.

@cgwalters (Member) commented:

Only glanced at this, but I think we want `--cosa-build` to still define the start, and `--cosa-upgrade-target` to take the end or so.

@jlebon (Member, Author) commented Jan 22, 2020

Yeah, I'm open to tweaking the current interface. My reasoning for making `--cosa-build` the target build is that I consider the build we update *to* to be the actual artifact under test. The previous build is normally one that has already been released in the wild.

So, to flip this around: `--cosa-build` and `--cosa-parent-build`? IOW, `--cosa-build` is constant, while `--cosa-parent-build` is the thing that could be changed (in fact, we could make it take multiple parent builds and execute those in parallel). Note also that `--cosa-build` is mandatory, whereas `--cosa-parent-build` isn't (there's `--find-parent-image`, which can automatically figure out the parent to use).
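For illustration, the interface being discussed would look roughly like this (`--cosa-parent-build` is only being proposed in this thread, not an existing flag):

```
kola run-upgrade -v \
        --cosa-build /path/to/target/meta.json \
        --cosa-parent-build /path/to/parent/meta.json
```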

@jlebon (Member, Author) commented Jan 22, 2020

Another way to look at this is that `--cosa-build` is always about the test artifact we actually care about, regardless of whether it's `kola run` or `kola run-upgrade`.

@cgwalters (Member) commented:

> Another way to look at this is that `--cosa-build` is always about the test artifact we actually care about, regardless of whether it's `kola run` or `kola run-upgrade`.

OK, that's convincing, yes.

@jlebon (Member, Author) commented Jan 22, 2020

OK, split out prep patches in #1170!

@jlebon force-pushed the pr/fcos-upgrade branch 2 times, most recently from e9816e3 to 0d23362 on January 23, 2020 at 20:09
@jlebon changed the title from "WIP: kola: add run-upgrade command" to "kola: add run-upgrade command" on Jan 23, 2020
@jlebon marked this pull request as ready for review on January 23, 2020 at 20:09
@jlebon force-pushed the pr/fcos-upgrade branch 2 times, most recently from c8f5d83 to a5bf6d8 on January 23, 2020 at 21:43

@jlebon (Member, Author) commented Jan 23, 2020

OK, this now works on top of #1170! I added streaming decompression and
signature verification to make `--find-parent-image` for the qemu
platform faster.

There's definitely a lot more we could do on top of this, though I think
it's good enough for now to get in as is, so we can at least start using
it in the pipeline.

Some follow-up improvements:

- add `--cosa-previous-build`
- add support for more platforms (though really, AWS is the only cloud we can test right now for FCOS until we start uploading elsewhere)
- add support for RHCOS

The individual commit messages (excerpts):

- "And while we're there, rename the functions to be more descriptive. This
  is prep for doing streaming decompression and GPG verification of
  downloaded qemu images."
- "That way a caller that wants to use the streaming interface can also
  just pass `""` as the key file to get the default keyring."
- "This function will download a compressed file, decompress it, and
  verify its signature in a streaming fashion. What we lose in return is
  the ability to resume file downloads if we're interrupted. I think that
  trade-off is worth it though for a faster and more efficient common
  case."
- The remaining commit's message repeats the PR description above.
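That streaming download path could be illustrated with a shell pipeline like the one below. This is only a sketch of the trade-off (the PR implements it as a Go function inside kola); `$QCOW_XZ_URL` and `$SIG_FILE` are placeholders, and the signature covers the compressed bytes:

```
# Verify the detached signature over the compressed stream while decompressing it,
# without ever writing the .xz file to disk (which is also why an interrupted
# download can no longer be resumed). Real code would also need to check gpg's
# exit status, which this bash one-liner does not propagate.
curl -sL "$QCOW_XZ_URL" \
  | tee >(gpg --verify "$SIG_FILE" -) \
  | xz -dc > starting-image.qcow2
```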

@jlebon (Member, Author) commented Jan 24, 2020

Rebased!

jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this pull request Jan 24, 2020
Start running the new upgrade test right after building the QEMU image.
In the AWS test job, run the upgrade test on AWS in parallel.

For more information, see:
coreos/mantle#1168
An inline review comment was left on this hunk:

```
	}
	kola.QEMUOptions.DiskImage = decompressedQcowLocal
case "aws":
	kola.AWSOptions.AMI, err = parentCosaBuild.FindAMI(kola.AWSOptions.Region)
```

See also openshift/installer#2906 - we'll likely at some point need to copy the code from the installer to make images from storage, which gets into terraform vs. something else here, or forking out to `openshift-install instantiate-coreos-image` or something.

@cgwalters (Member) left a comment:

Looks great overall! Very nice code.
Would be nice probably to support the no-zincati case for RHCOS...or OTOH we could just add zincati to RHCOS and leave it disabled by default.

@jlebon (Member, Author) commented Jan 27, 2020

Thanks for the review!

> Would be nice probably to support the no-zincati case for RHCOS...or OTOH we could just add zincati to RHCOS and leave it disabled by default.

Yeah, I left space for this enhancement in the code. I guess the closest equivalent would be to upload the oscontainer into the node's image storage, write its pullspec to `/etc/pivot/image-pullspec`, and run `systemctl start machine-config-daemon-host.service` the way the MCD does it? That way we at least test parts of the same MCD/podman/rpm-ostree path as in a cluster. (But yeah, again, the emphasis here is on disk-dependent hysteresis; it's just a bonus if we also get to exercise the same mechanisms as the real thing, as a sanity check ahead of the more realistic downstream e2e tests.)
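A hand-wavy sketch of that flow (the pullspec is a placeholder; the exact steps are whatever the MCD actually does):

```
# after pushing the oscontainer into the node's image storage
echo "quay.io/example/machine-os-content@sha256:<digest>" > /etc/pivot/image-pullspec
systemctl start machine-config-daemon-host.service
```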

Gonna merge this one now! I'd like to get this and coreos/fedora-coreos-pipeline#190 in before the next stable release (which should be this week).

jlebon added a commit to coreos/fedora-coreos-pipeline that referenced this pull request on Jan 28, 2020 (same commit message as above).