
ci-operator/config/openshift/installer/master: Move to openshift-install #1677

Merged
merged 1 commit into openshift:master from next-gen-installer
Sep 27, 2018

Conversation

wking
Member

@wking wking commented Sep 26, 2018

This is the next-gen installer. With the new installer, we only need the installer binary and terraform to launch and destroy clusters. I've removed the smoke tests for now, because I haven't looked into building them without Bazel. Hopefully we'll get expanded e2e testing and other OpenShift projects using our installer soon to keep us honest.

Also missing from the new installer is a way to set expirationDate. But we'll just not leak until we regain the ability to set that, right? ;)

The new config dumps less cruft into the output directory (most of the generated output goes into a temporary directory), so I've adjusted the openshift-install calls to just use an artifacts subdir for their state storage. A bonus of this approach is that if the installer hangs up, we'll capture anything it wrote to disk without needing an explicit cp call.
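The state-storage approach described above boils down to pointing the installer's working directory into the artifacts tree; a minimal sketch (the paths and the helper are assumptions for illustration, not the actual template):

```shell
#!/bin/sh
# Rough sketch, not the real template: keep all installer state under
# the artifacts directory so the CI system uploads whatever exists,
# even if the installer hangs or dies before any explicit 'cp' runs.
ARTIFACT_DIR="${ARTIFACT_DIR:-/tmp/artifacts/installer}"

setup_state_dir() {
  # Create the state directory up front; partial output lands here too.
  mkdir -p "$1" && echo "$1"
}

STATE_DIR="$(setup_state_dir "${ARTIFACT_DIR}")"
echo "installer state will be collected from ${STATE_DIR}"
```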

I've also dropped the 3.11 config, since the installer is 4.0-only.

We'll need openshift/installer#333 to land before this PR will work.

/hold

And even once that PR lands, I dunno if this one will work. @sallyom, can you run this through its paces or give me some pointers?

CC @crawford; once this lands, we can delete the old installer without breaking CI.

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 26, 2018
@wking wking force-pushed the next-gen-installer branch 2 times, most recently from ae3bfb4 to 5073bea on September 26, 2018 at 08:57
@crawford
Contributor

/approve

@wking
Member Author

wking commented Sep 26, 2018

I've pushed 5073bea -> 606af40 updating the kubeconfig path, now that the installer is writing straight to /tmp/artifacts/installer. Also, openshift/installer#333 landed, and openshift/installer#332 is in the pipe, so:

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2018
@wking
Member Author

wking commented Sep 26, 2018

Rebased around #1690.

@crawford
Contributor

crawford commented Sep 26, 2018

/hold

Waiting for one more fix in the installer.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2018
cp terraform.tfvars /tmp/artifacts/installer/
echo "Invoking installer ..."
OPENSHIFT_INSTALL_CLUSTER_NAME="${INSTANCE_PREFIX}" \
OPENSHIFT_INSTALL_BASE_DOMAIN=origin-ci-int-aws.dev.rhcloud.com \
Contributor

Be aware that at some point in the next month or so we'll likely need to move this back out of the job definition and make it parametrizable for different teams and content streams. But it's fine for now.

@@ -181,7 +180,7 @@ objects:
- name: AWS_SHARED_CREDENTIALS_FILE
value: /etc/openshift-installer/.awscred
- name: KUBECONFIG
value: /tmp/cluster/generated/auth/kubeconfig
value: /tmp/artifacts/installer/auth/kubeconfig
Contributor

Putting kubeconfig in artifacts isn't a great idea if we accidentally put a secret in there.

Contributor

(i.e. if something shows up in there that is longer lived than the cluster)

Member Author

Putting kubeconfig in artifacts isn't a great idea if we accidentally put a secret in there.

I'd guessed that we were generating the secret for this CI run on the fly. Is that not accurate? Or is this just a "watch out for long-lived secrets in the future"?

from: installer-bazel
to: installer-smoke
source_path: /go/src/github.com/openshift/installer/bin/terraform
to: installer
Contributor

Is the replacement dockerfile centos based and using the right tooling?

Member Author

Is the replacement dockerfile centos based and using the right tooling?

There is no replacement Dockerfile. openshift/release:golang-1.10 has what we need for hack/build.sh && hack/get-terraform.sh, and I was hoping to build the :installer image with just that here. But I don't know what I'm doing with ci-operator; maybe that's not how it works?

Member Author

There is no replacement Dockerfile.

This was a dead end. I've pushed a replacement Dockerfile with openshift/installer#343, but it's based on scratch, not CentOS. Do we need CentOS for something?

Contributor

ART team recommends this; it may be outright required, I'm not sure. (https://mojo.redhat.com/docs/DOC-1179058#jive_content_id_Images)

I also quickly noticed I can't exec in with /bin/bash, which is awfully handy for debugging. Actually, given the way we're probably going to use this in Hive, we'll need a more substantial base layer.

Member Author

I also quickly noticed I can't exec in with /bin/bash, which is awfully handy for debugging.

The Dockerfile is stuffing the binaries into scratch (for folks who just need the binaries), but I couldn't figure out the right phrasing for that here (lucky for you ;), so we're currently using the base image (defined here) as the base layer that gets the binaries added. I chat about this a bit in the 3f2f01c commit message if you want more details.

Contributor

Ah, so your published CI output is quite different from what your scratch Dockerfile.ci would produce if you built it yourself? As long as the published image is CentOS/RHEL based, I think we'll have what we need.

You could also consider multiple Dockerfiles if need be. I can hack my own for now until this is being published.

@smarterclayton
Contributor

(none of my comments are blocking)

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2018
@wking
Member Author

wking commented Sep 26, 2018

/hold cancel

openshift/installer#340 landed, so we should be ok here, although I still haven't gotten this tested locally (@sallyom's been helping me with her notes, but I've been distracted).

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 26, 2018
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2018
@wking
Member Author

wking commented Sep 26, 2018

Rebased around #1688 and updated with a @sallyom suggestion to avoid:

error: could not resolve inputs: could not determine inputs for step [input:root]: could not resolve base image: imagestreamtags.image.openshift.io "installer-test-base:" is forbidden: User "wking" cannot get imagestreamtags.image.openshift.io in the namespace "stable": no RBAC policy matched

in local testing. e4ae796 -> 3144205.

@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2018
@wking
Member Author

wking commented Sep 26, 2018

Ok, in local testing with this slightly-tweaked template, I'm getting:

$ ci-operator -template templates/cluster-launch-installer-e2e-new.yaml -config ~/src/openshift/release/ci-operator/config/openshift/installer/master.yaml -git-ref=openshift/installer@master -namespace wking-next-gen-installer
2018/09/26 13:57:05 Resolved openshift/installer@master to commit 60991d84b2a266187c3a46f4bb91662cd15afccc
2018/09/26 13:57:05 Resolved source https://github.com/openshift/installer to master@60991d84
2018/09/26 13:57:06 Resolved openshift/release:golang-1.10 to sha256:43ad4740f25e4cbc6de8a964e2ce65ce50fbc24fc4e5e268e0a7370ca3b09bd1
2018/09/26 13:57:07 Resolved openshift/origin-v4.0:base to sha256:8026efab865dcb2d776b47ce05d639c6cb4113225df75c1fbce785bd6dfc88b7
2018/09/26 13:57:07 Using namespace wking-next-gen-installer
2018/09/26 13:57:07 Running [input:root], [input:base], [release-inputs], src, bin, installer, unit, gofmt, govet, golint, [output:stable:installer], [images], cluster-launch-installer-e2e-new, [release:latest]
2018/09/26 13:57:07 Creating namespace wking-next-gen-installer
2018/09/26 13:57:07 Setting a soft TTL of 1h0m0s for the namespace
2018/09/26 13:57:07 Setting a hard TTL of 12h0m0s for the namespace
2018/09/26 13:57:08 warning: Could not mark the namespace to be deleted later because you do not have permission to update the namespace (details: namespaces "wking-next-gen-installer" is forbidden: User "wking" cannot update namespaces in the namespace "wking-next-gen-installer": no RBAC policy matched)
2018/09/26 13:57:08 Setting up pipeline imagestream for the test
2018/09/26 13:57:09 Populating secrets for test
2018/09/26 13:57:09 Tagging openshift/release:golang-1.10 into pipeline:root
2018/09/26 13:57:09 Tagging openshift/origin-v4.0:base into pipeline:base
2018/09/26 13:57:09 Tagged release images from openshift/origin-v4.0:${component}, images will be pullable from registry.svc.ci.openshift.org/wking-next-gen-installer/stable:${component}
2018/09/26 13:57:09 Building src
2018/09/26 13:57:12 Build src already succeeded in 27s
2018/09/26 13:57:12 Executing test golint
2018/09/26 13:57:12 Executing test gofmt
2018/09/26 13:57:12 Building bin
2018/09/26 13:57:12 Executing test unit
2018/09/26 13:57:12 Executing test govet
2018/09/26 13:57:13 Build bin already succeeded in 35s
2018/09/26 13:57:13 Building installer
2018/09/26 13:57:14 Build installer failed, printing logs:
Pulling image "docker-registry.default.svc:5000/wking-next-gen-installer/pipeline@sha256:d913d470e79f89cf1797725d2aa0884fd9d3424fb902f41a697df3e4f1d708d4" ...
Pulling image "docker-registry.default.svc:5000/wking-next-gen-installer/pipeline@sha256:e3eb8f4e82cfffda0568c0853ddbe11c695b877365be67ed1b77393bea68df43" ...
error: open /tmp/build/inputs/Dockerfile: no such file or directory
2018/09/26 13:57:19 Container test in pod gofmt completed successfully
2018/09/26 13:57:19 Pod gofmt succeeded after 4s
2018/09/26 13:57:21 Container test in pod golint completed successfully
2018/09/26 13:57:21 Pod golint succeeded after 5s
2018/09/26 13:57:33 Container test in pod unit completed successfully
2018/09/26 13:57:33 Pod unit succeeded after 19s
2018/09/26 14:00:24 Container test in pod govet completed successfully
2018/09/26 14:00:24 Pod govet succeeded after 3m10s
2018/09/26 14:00:25 Ran for 3m19s
error: could not run steps: could not wait for build: the build installer failed with reason GenericBuildFailed: Generic Build failure - check logs for details.

I dunno where the /tmp/build/inputs/Dockerfile is coming from; grepping hasn't turned it up for me in either this repo or ci-operator. And that may not even be the main issue.

@wking
Member Author

wking commented Sep 26, 2018

Probably something in here is assuming the presence of a Dockerfile, when I'm trying to get it to use base as its build image, run hack/build.sh && hack/get-terraform.sh, and stick the resulting two binaries into the output installer container.
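For context on the build failure above: a ci-operator `images` entry normally points the build at a Dockerfile in the source tree, which is why a missing Dockerfile aborts the build. A hypothetical stanza might look like the following (the `dockerfile_path` field name comes from later ci-operator releases; the 2018 schema and the path here are assumptions):

```yaml
# Hypothetical sketch; field and image names are assumptions.
images:
- from: base
  to: installer
  dockerfile_path: images/installer/Dockerfile.ci
```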

@dgoodwin
Contributor

Regarding the missing region, see openshift/installer#335. I'm fairly confident that Terraform is not respecting the same methods and ordering for loading AWS credentials as the installer.

@wking
Member Author

wking commented Sep 27, 2018

Regarding the missing region, see openshift/installer#335.

Ah, thanks for the reproduction instructions :). Does openshift/installer#346 fix it for you?

@wking
Member Author

wking commented Sep 27, 2018

A bunch more installer work later, and now I'm running:

diff --git a/ci-operator/config/openshift/installer/master.yaml b/ci-operator/config/openshift/installer/master.yaml
index 9f451bb..f030c07 100644
--- a/ci-operator/config/openshift/installer/master.yaml
+++ b/ci-operator/config/openshift/installer/master.yaml
@@ -44,12 +44,3 @@ tests:
 - as: unit
   commands: go test ./pkg/...
   from: src
-- as: gofmt
-  commands: IS_CONTAINER=TRUE ./hack/go-fmt.sh .
-  from: src
-- as: govet
-  commands: IS_CONTAINER=TRUE ./hack/go-vet.sh ./...
-  from: src
-- as: golint
-  commands: IS_CONTAINER=TRUE ./hack/go-lint.sh -min_confidence 0.3 $(go list -f '{{ .ImportPath }}' ./...)
-  from: src
diff --git a/ci-operator/templates/cluster-launch-installer-e2e.yaml b/ci-operator/templates/cluster-launch-installer-e2e.yaml
index 898b69f..23b2c88 100644
--- a/ci-operator/templates/cluster-launch-installer-e2e.yaml
+++ b/ci-operator/templates/cluster-launch-installer-e2e.yaml
@@ -8,6 +8,7 @@ parameters:
   required: true
 - name: NAMESPACE
   required: true
+  value: "wking-next-gen-installer"
 - name: IMAGE_FORMAT
   required: true
 - name: LOCAL_IMAGE_INSTALLER
@@ -16,6 +17,7 @@ parameters:
   required: true
 - name: CLUSTER_TYPE
   required: true
+  value: "aws"
 - name: TEST_FOCUS
 - name: TEST_SKIP
 - name: TEST_FOCUS_SERIAL
@@ -60,7 +62,7 @@ objects:
       emptyDir: {}
     - name: cluster-profile
       secret:
-        secretName: ${JOB_NAME_SAFE}-cluster-profile
+        secretName: cluster-profile
 
     containers:
 
@@ -212,7 +214,7 @@ objects:
       - name: OPENSHIFT_INSTALL_CLUSTER_NAME
         value: ${NAMESPACE}-${JOB_NAME_HASH}
       - name: OPENSHIFT_INSTALL_BASE_DOMAIN
-        value: origin-ci-int-aws.dev.rhcloud.com
+        value: coreservices.team.coreos.systems
       - name: OPENSHIFT_INSTALL_EMAIL_ADDRESS
         value: test@ci.openshift.io
       - name: OPENSHIFT_INSTALL_PASSWORD

(with the base domain change because I'm using teamcoreservices credentials to launch the cluster). Launched with:

$ oc delete is pipeline stable -n wking
$ ci-operator -template ci-operator/templates/cluster-launch-installer-e2e.yaml -config ci-operator/config/openshift/installer/master.yaml -git-ref=openshift/installer@master -namespace wking

the setup container runs to a successful completion:

Apply complete! Resources: 154 added, 0 changed, 0 destroyed.
...
time="2018-09-27T22:23:14Z" level=debug msg="Generating Cluster Metadata..." 

The test container is still waiting:

Waiting for API at https://wking-ef260-api.coreservices.team.coreos.systems:6443 to respond ... 

but the API returns something when I hit it directly:

$ curl -ks https://wking-ef260-api.coreservices.team.coreos.systems:6443/ | head
{
  "paths": [
    "/api",
    "/api/v1",
    "/apis",
    "/apis/",
    "/apis/admissionregistration.k8s.io",
    "/apis/admissionregistration.k8s.io/v1beta1",
    "/apis/apiextensions.k8s.io",
    "/apis/apiextensions.k8s.io/v1beta1",

So it's possible that I just need to be patient. Or it's possible that I broke the wait-for-the-API logic.
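The wait logic under suspicion is just a retry loop; a generic, hedged sketch of the pattern (the real template's command, attempt count, and delay are assumptions):

```shell
#!/bin/sh
# Generic poll helper, a sketch of the template's
# "Waiting for API at ... to respond" loop (timings are assumptions).
poll() {
  # $1: command to try, $2: max attempts, $3: seconds between attempts
  tries="$2"
  while [ "${tries}" -gt 0 ]; do
    if eval "$1"; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "${tries}" -gt 0 ]; then
      sleep "$3"
    fi
  done
  return 1
}

# A real invocation would look something like:
# poll 'curl -ks --max-time 5 "https://example-api:6443/" >/dev/null' 180 10
```

If the loop's probe command is stricter than a bare `curl -ks` (for example, requiring a particular path or status), a responding API could still look "not ready" to the script, which would explain the symptom above.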

@wking
Member Author

wking commented Sep 27, 2018

Anyhow, I think this is close enough that I'd like to land it, and if there are any remaining bugs we can sort them out on the fly. @smarterclayton, can you take another look and drop a /lgtm if you like it?

@wking
Member Author

wking commented Sep 27, 2018

The cluster continues:

API at https://wking-ef260-api.coreservices.team.coreos.systems:6443  has responded
Waiting for router to be created ...

so better and better :).

@wking
Member Author

wking commented Sep 27, 2018

I ctrl-C'ed ci-operator (because I wanted to fix my patched namespace from wking-next-gen-installer -> wking), and checking the logs for the teardown container, things look healthy there too:

...
time="2018-09-27T22:39:40Z" level=debug msg="Deleting subnets"
time="2018-09-27T22:39:40Z" level=debug msg="Exiting deleting subnets"
time="2018-09-27T22:39:40Z" level=debug msg="goroutine deleteSubnets complete"
time="2018-09-27T22:39:42Z" level=debug msg="Deleting security groups"
time="2018-09-27T22:39:42Z" level=debug msg="Exiting deleting security groups"
time="2018-09-27T22:39:42Z" level=debug msg="goroutine deleteSecurityGroups complete"
...

@crawford
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 470869e into openshift:master Sep 27, 2018
@openshift-ci-robot
Contributor

@wking: Updated the following 5 configmaps:

  • ci-operator-openshift-installer configmap using the following files:
    • key release-3.11.yaml using file ``
    • key master.yaml using file ci-operator/config/openshift/installer/master.yaml
  • job-config configmap using the following files:
    • key openshift-installer-master-presubmits.yaml using file ci-operator/jobs/openshift/installer/openshift-installer-master-presubmits.yaml
  • prow-job-cluster-launch-installer-e2e-smoke configmap using the following files:
    • key cluster-launch-installer-e2e-smoke.yaml using file ``
  • prow-job-cluster-launch-installer-e2e configmap using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/cluster-launch-installer-e2e.yaml
  • cluster-profile-aws configmap using the following files:
    • key openshift.yaml using file ``

In response to this:

[the PR description, quoted in full at the top of this thread]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the next-gen-installer branch September 27, 2018 23:32
wking added a commit to wking/openshift-installer that referenced this pull request Sep 28, 2018
The final installer-origin-release and tectonic-installer consumers
were removed in openshift/release@3f2f01c4
(ci-operator/config/openshift/installer/master: Move to
openshift-install, 2018-09-26, openshift/release#1677).

The final reference to kubernetes-e2e was removed locally in 40eb6f0
(Update README.md, 2018-06-22, coreos/tectonic-installer#3310).
wking added a commit to wking/openshift-release that referenced this pull request Sep 29, 2018
Like d44e106 (remove ref to openshift.yaml aws installer config,
2018-09-27, openshift#1755), this is cleaning up after 3f2f01c
(ci-operator/config/openshift/installer/master: Move to
openshift-install, 2018-09-26, openshift#1677).
wking added a commit to wking/openshift-release that referenced this pull request Oct 4, 2018
We'd dropped this in 3f2f01c
(ci-operator/config/openshift/installer/master: Move to
openshift-install, 2018-09-26, openshift#1677).  But since
openshift/installer@98a35316 (pkg/asset/installconfig: Add
_CI_ONLY_STAY_AWAY_OPENSHIFT_INSTALL_AWS_USER_TAGS, 2018-09-28,
openshift/installer#364), the installer supports setting it again, so
we can add it back.

As the name suggests, this variable is not a stable API.  Soon the new
installer will be able to load the configuration from YAML (again),
but we can keep adjusting as the installer evolves.
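Setting the variable from that commit could look roughly like this (the variable name comes from openshift/installer#364 as cited above; the JSON tag-map shape and the three-hour expiry window are assumptions, not taken from the installer source):

```shell
#!/bin/sh
# Sketch only: the env var name is from openshift/installer#364; the
# JSON shape and the expiry window here are assumptions.
EXPIRATION="$(date -u -d '+3 hours' '+%Y-%m-%dT%H:%M+0000')"
export _CI_ONLY_STAY_AWAY_OPENSHIFT_INSTALL_AWS_USER_TAGS="{\"expirationDate\": \"${EXPIRATION}\"}"
echo "${_CI_ONLY_STAY_AWAY_OPENSHIFT_INSTALL_AWS_USER_TAGS}"
```

As the variable's name warns, this is a CI-only escape hatch, not a stable API.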
wking added a commit to wking/openshift-release that referenced this pull request Oct 17, 2018
I'd removed these in 3f2f01c
(ci-operator/config/openshift/installer/master: Move to
openshift-install, 2018-09-26, openshift#1677) while removing the shell from
the setup container.  But we got a shell back in b20d571
(cluster-launch-installer-e2e: Start setting expirationDate again,
2018-09-29, openshift#1761), and restoring these handlers will help:

* Trigger early pod wrap-up when the setup container dies.  Both the
  test and teardown containers are monitoring the 'exit' file so they
  can bail early.

* Forward TERM to the installer, so we can exit gracefully when
  someone TERMs the shell script.

For the signal handling to work, we need to keep the shell process
around, so I've dropped the 'exec' from the openshift-install
invocation and moved it to the background instead (so 'jobs -p' will
list it).
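The handler arrangement in that commit message is a common shell pattern; a minimal sketch (the shared 'exit' file path and the placeholder background job are assumptions):

```shell
#!/bin/sh
# Sketch of the setup container's signal handling: keep the shell in the
# foreground so it can forward TERM, run the real work in the background
# (so 'jobs -p' lists it), and drop a shared 'exit' file that the test
# and teardown containers watch so they can bail early.
SHARED="${SHARED:-/tmp/shared}"
mkdir -p "${SHARED}"

trap 'kill $(jobs -p) 2>/dev/null; touch "${SHARED}/exit"; exit 1' TERM

sleep 1 &   # stand-in for the backgrounded openshift-install invocation
wait "$!"
touch "${SHARED}/exit"
```

An exec'd installer would replace the shell entirely, which is exactly why the commit drops `exec` in favor of backgrounding plus `wait`.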
wking added a commit to wking/openshift-release that referenced this pull request Feb 26, 2019
The teardown container requires oc (for retrieving cluster logs), gzip
(for compressing those logs), and openshift-install (for destroying
the cluster).  We were already copying in oc from another container;
with this commit we copy in the statically-linked openshift-install
too.  That leaves gzip, which will be built in to IMAGE_TESTS as "a
command needed in the CI-test context".  Using IMAGE_TESTS for
teardown means we no longer need to copy oc into the shared volume for
the teardown container (although we may need to copy it in for other
containers, e.g. the src template's test container).

I've also dropped an unnecessary /bin/ prefix from the 'create
cluster' invocations, which we'd been dragging around from the
pre-shell days of 3f2f01c
(ci-operator/config/openshift/installer/master: Move to
openshift-install, 2018-09-26, openshift#1677).