Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

openshift-installer-presubmits: Use TEST_SKIP for e2e-aws #1341

Closed
wants to merge 1 commit into from

Conversation

wking
Copy link
Member

@wking wking commented Aug 31, 2018

Improve our coverage by blacklisting known failures. I generated this string from an e2e-aws-all run today against the master codebase with this helper script in my ~/bin/junit-blacklist.py:

import os
import re
import sys
from xml.etree.ElementTree import ElementTree


def get_failures(stream=sys.stdin):
    tree = ElementTree()
    tree.parse(stream)
    for test in tree.findall('testcase'):
        if test.find('skipped') is not None:
            continue
        elif test.find('failure') is not None:
            print('failure: {}'.format(test.attrib['name']))
            yield test.attrib['name']
        else:
            print('success: {}'.format(test.attrib['name']))


def get_all_failures(root=os.curdir):
    for path, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith('.xml'):
                yield from get_failures(stream=open(os.path.join(path, filename)))


def process_failure(failure):
    chars = []
    depth = 0
    for char in failure:
        if char == '[':
            depth += 1
            continue
        if char == ']':
            depth -= 1
            continue
        if depth == 0:
            chars.append(char)
    processed = re.escape(''.join(chars).strip())
    processed = processed.replace(r'\ ', ' ').replace(r'\-', '-').replace(r'\/', '/').replace(r'\=', '=')  # Work around https://bugs.python.org/issue29995, fixed in Python 3.7
    return processed


if __name__ == '__main__':
    failures = sorted(set(process_failure(failure=failure) for failure in get_all_failures()))
    if len(failures) == 1:
        print(failures[0])
    elif failures:
        print('>')
        print('.*(({}'.format(failures[0]))
        for failure in failures[1:]:
            print('*)|({}'.format(failure))
        print('*)).*')

and this invocation:

$ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/170/pull-ci-origin-installer-e2e-aws-all/3/artifacts/e2e-aws-all/junit/
$ cd storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/170/pull-ci-origin-installer-e2e-aws-all/3/artifacts/e2e-aws-all/junit
$ junit-blacklist.py

The regexp is because Ginko's -skip takes a regexp. Presumably that's a Go regexp, which, like Python regexps, are similar to POSIX's Extended Regular Expressions (EREs). From the JUnit XML, the script collects names for all failed tests and assembles a regexp to ignore all of them. My target regexp pattern is:

.*((test-1-name)|(test-2-name)|(test-3-name)).*

but that would get a bit hard to read. So instead, I inject optional whitespace after each name:

.*((test-1-name *)|(test-2-name *)|(test-3-name *)).*

and then use YAML's folded style to get:

>
.*((test-1-name
*)|(test-2-name
*)|(test-3-name
*)).*

As we improve the installer, we can gradually remove entries from this blacklist by cutting a single line from the folded regexp.

CC @sallyom, @smarterclayton. It would be good if there was a way to test this before going live. Maybe I should put it in a new e2e-aws-blacklist? Or we can leave it in e2e-aws and revert if this breaks something ;).

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 31, 2018
@wking
Copy link
Member Author

wking commented Aug 31, 2018

Prow does not like my folded string :p. How much do we care about readability? Enough to patch ordered-prow-config? CC @petr-muller, @stevekuznetsov.

@smarterclayton
Copy link
Contributor

smarterclayton commented Aug 31, 2018 via email

@wking wking force-pushed the installer-e2e-aws-blacklist branch from d9bb362 to 7991c1a Compare August 31, 2018 22:39
@wking
Copy link
Member Author

wking commented Aug 31, 2018

You could try this on all first, then move to e2e-AWS and reset all.

That sounds like a low-stress approach. I've switched over to it with d9bb362 -> 7991c1a.

Also, be aware we’re growing the suite by 140% in a few hours, so you may want to regen after that lands.

I'm fine either way. With a blacklist approach, this is always going to be an issue.

Any thoughts on folding or not?

@@ -108,10 +108,54 @@ presubmits:
- --template=/usr/local/e2e-aws-all
- --target=e2e-aws-all
env:
- name: TEST_FOCUS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want focus + skip? If you don't focus conformance you get a LOT of tests as your base to skip from

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want focus + skip? If you don't focus conformance you get a LOT of tests as your base to skip from

Ah, that makes sense. Restored with 7991c1a -> d486714.

@wking wking force-pushed the installer-e2e-aws-blacklist branch from 7991c1a to d486714 Compare August 31, 2018 23:26
@smarterclayton
Copy link
Contributor

Re folding that’s Steve’s call.

@smarterclayton
Copy link
Contributor

Readability probably isn’t a huge issue since this is misty going to be you and me and not last long.

@wking wking force-pushed the installer-e2e-aws-blacklist branch from d486714 to 2482470 Compare September 4, 2018 17:14
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 4, 2018
@wking wking force-pushed the installer-e2e-aws-blacklist branch from 2482470 to 03f6041 Compare September 4, 2018 17:16
@wking
Copy link
Member Author

wking commented Sep 4, 2018

Re folding that’s Steve’s call.

To help get this landed, I've pushed d486714 -> 03f6041 dropping the folding. I can re-fold later (and update the line-ordering test to allow it) if @stevekuznetsov wants to go that way.

@stevekuznetsov
Copy link
Contributor

With the

  value: |
    whatever

formatting instead of

  value: -|
    whatever

Is there technically a newline in the front of the value? Does it matter? Otherwise this LGTM

@wking wking force-pushed the installer-e2e-aws-blacklist branch from 03f6041 to 75e85bc Compare September 4, 2018 17:23
@wking
Copy link
Member Author

wking commented Sep 4, 2018

Is there technically a newline in the front of the value?

Maybe? I've just pushed 03f6041 -> 75e85bc to use value: .*(...).* without any fancy YAML literal characters ;).

@stevekuznetsov
Copy link
Contributor

Will LGTM once tests pass

@wking
Copy link
Member Author

wking commented Sep 4, 2018

Ah, the most recent error was a request for line wrapping. That seems even harder to read to me than the long line, so I've gone back to the | + single long line recommended here.

And ordered-prow-config didn't like that either, so I've just pushed 8706b44 with their exact wrapping. All green, @stevekuznetsov.

@wking
Copy link
Member Author

wking commented Sep 4, 2018

Actually, give me a minute. Looks like I can trim this further based on this recent run.

@wking wking force-pushed the installer-e2e-aws-blacklist branch from 8706b44 to eb7cef2 Compare September 4, 2018 18:22
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 4, 2018
@wking
Copy link
Member Author

wking commented Sep 4, 2018

Ok, I've updated based on run 4. Changes vs. run 3 were (in the wrapped form):

 >-
-.*((DNS should answer endpoint and wildcard queries for the cluster
-*)|(HostPath should support subPath
-*)|(Image hard prune should delete orphaned blobs
+.*((Image hard prune should delete orphaned blobs
 *)|(Image hard prune should show orphaned blob deletions in dry-run mode
 *)|(Image limit range should deny a docker image reference exceeding limit on openshift\.io/image-tags resource
 *)|(Image prune of schema 1 should prune old image
@@ -236,38 +19,4 @@
 *)|(Image prune with --prune-registry==false should prune old image but skip registry
 *)|(Image prune with default --all flag should prune both internally managed and external images
 *)|(Image resource quota should deny a push of built image exceeding openshift\.io/imagestreams quota
-*)|(Prometheus when installed to the cluster should start and expose a secured proxy and unsecured metrics
-*)|(Prometheus when installed to the cluster should start and expose a secured proxy and verify build metrics
-*)|(The HAProxy router should expose prometheus metrics for a route
-*)|(The HAProxy router should expose the profiling endpoints
-*)|(The HAProxy router should respond with 503 to unrecognized hosts
-*)|(The HAProxy router should serve routes that were created from an ingress
-*)|(The HAProxy router should set Forwarded headers appropriately
-*)|(deploymentconfigs with multiple image change triggers  should run a successful deployment with a trigger used by different containers
-*)|(deploymentconfigs with multiple image change triggers  should run a successful deployment with multiple triggers
-*)|(forcePull should affect pulling builder images  ForcePull test case execution custom
-*)|(forcePull should affect pulling builder images  ForcePull test case execution docker
-*)|(forcePull should affect pulling builder images  ForcePull test case execution s2i
-*)|(imagechangetriggers  imagechangetriggers should trigger builds of all types
-*)|(jenkins autoprovision Pipeline build config should autoprovision jenkins
-*)|(oc new-app  should fail with a --name longer than 58 characters
-*)|(oc new-app  should succeed with a --name of 58 characters
-*)|(openshift mongodb image  creating from a template should instantiate the template
-*)|(process valueFrom in build strategy environment variables  should fail resolving unresolvable valueFrom in docker build environment variable references
-*)|(process valueFrom in build strategy environment variables  should fail resolving unresolvable valueFrom in sti build environment variable references
-*)|(process valueFrom in build strategy environment variables  should successfully resolve valueFrom in docker build environment variables
-*)|(process valueFrom in build strategy environment variables  should successfully resolve valueFrom in s2i build environment variables
-*)|(prune builds based on settings in the buildconfig   buildconfigs should have a default history limit set when created via the group api
-*)|(prune builds based on settings in the buildconfig   buildconfigs should not have a default history limit set when created via the legacy api
-*)|(prune builds based on settings in the buildconfig  should prune builds after a buildConfig change
-*)|(prune builds based on settings in the buildconfig  should prune canceled builds based on the failedBuildsHistoryLimit setting
-*)|(prune builds based on settings in the buildconfig  should prune completed builds based on the successfulBuildsHistoryLimit setting
-*)|(prune builds based on settings in the buildconfig  should prune errored builds based on the failedBuildsHistoryLimit setting
-*)|(prune builds based on settings in the buildconfig  should prune failed builds based on the failedBuildsHistoryLimit setting
-*)|(templateinstance impersonation tests should pass impersonation creation tests
-*)|(templateinstance impersonation tests should pass impersonation update tests
-*)|(templateinstance security tests  should pass security tests
-*)|(templateservicebroker bind test  should pass bind tests
-*)|(templateservicebroker end-to-end test  should pass an end-to-end test
-*)|(templateservicebroker security test  should pass security tests
 *)).*

@smarterclayton, did something happen to the parallel tests? Run 4 only seems to have results from the serial tests.

@wking wking force-pushed the installer-e2e-aws-blacklist branch from eb7cef2 to 192e17f Compare September 4, 2018 18:27
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 4, 2018
@smarterclayton
Copy link
Contributor

That’s weird... looks like something is broken, possibly in the job. Your serial suite is incorrect too, it should be openshift/conformance/serial, not registry serial.

@wking
Copy link
Member Author

wking commented Sep 4, 2018

Your serial suite is incorrect too, it should be openshift/conformance/serial, not registry serial.

I've filed #1351 for this.

@wking
Copy link
Member Author

wking commented Sep 4, 2018

That’s weird... looks like something is broken, possibly in the job.

From the job 4 build log:

• Failure in Spec Setup (BeforeEach) [9.872 seconds]
[sig-storage] Projected
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/common/projected.go:34
  should be consumable in multiple volumes in the same pod [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] [BeforeEach]
  /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:684

  Failed to create PSP e2e-test-privileged-psp
  Expected error:
      <*errors.StatusError | 0xc420dcdcb0>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {SelfLink: "", ResourceVersion: "", Continue: ""},
              Status: "Failure",
              Message: "podsecuritypolicies.extensions \"e2e-test-privileged-psp\" already exists",
              Reason: "AlreadyExists",
              Details: {
                  Name: "e2e-test-privileged-psp",
                  Group: "extensions",
                  Kind: "podsecuritypolicies",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
      podsecuritypolicies.extensions "e2e-test-privileged-psp" already exists
  not to have occurred
...
	 -------------------------------------------------------------------
	|                                                                   |
	|  Ginkgo timed out waiting for all parallel nodes to report back!  |
	|                                                                   |
	 -------------------------------------------------------------------

Maybe it's bumping into a resource leak?

I want to improve our e2e-aws coverage by blacklisting known failures.
I generated this string from an e2e-aws-all run today against the
master codebase [1,2] with this helper script:

  $ cat ~/bin/junit-blacklist.py
  #!/usr/bin/env python3

  import os
  import re
  import sys
  from xml.etree.ElementTree import ElementTree

  def get_failures(stream=sys.stdin):
      tree = ElementTree()
      tree.parse(stream)
      for test in tree.findall('testcase'):
          if test.find('skipped') is not None:
              continue
          elif test.find('failure') is not None:
              print('failure: {}'.format(test.attrib['name']))
              yield test.attrib['name']
          else:
              print('success: {}'.format(test.attrib['name']))

  def get_all_failures(root=os.curdir):
      for path, dirnames, filenames in os.walk(root):
          for filename in filenames:
              if filename.endswith('.xml'):
                  yield from get_failures(stream=open(os.path.join(path, filename)))

  def process_failure(failure):
      chars = []
      depth = 0
      for char in failure:
          if char == '[':
              depth += 1
              continue
          if char == ']':
              depth -= 1
              continue
          if depth == 0:
              chars.append(char)
      processed = re.escape(''.join(chars).strip())
      processed = processed.replace(r'\ ', ' ').replace(r'\-', '-').replace(r'\/', '/').replace(r'\=', '=')  # Work around https://bugs.python.org/issue29995, fixed in Python 3.7
      return processed

  if __name__ == '__main__':
      failures = sorted(set(process_failure(failure=failure) for failure in get_all_failures()))
      if len(failures) == 1:
          print(failures[0])
      elif failures:
          print('>-')
          print('.*(({}'.format(failures[0]))
          for failure in failures[1:]:
              print('*)|({}'.format(failure))
          print('*)).*')

and this invocation:

  $ wget -r -e robots=off -np -H https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/170/pull-ci-origin-installer-e2e-aws-all/4/artifacts/e2e-aws-all/junit/
  $ cd storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/170/pull-ci-origin-installer-e2e-aws-all/3/artifacts/e2e-aws-all/junit
  $ junit-blacklist.py

The regexp is because Ginko's -skip takes a regexp [3].  Presumably
that's a Go regexp [4], which, like Python regexps [5], are similar to
POSIX's Extended Regular Expressions (EREs) [6].  From the JUnit XML,
the script collects names for all failed tests and assembles a regexp
to ignore all of them.  My target regexp pattern is:

  .*(({test-1-name})|({test-2-name})|({test-3-name})).*

but that would get a bit hard to read.  So instead, I tried injecting
optional whitespace after each name:

  .*(({test-1-name} *)|({test-2-name} *)|({test-3-name} *)).*

and then using YAML's folded style [7] with the stripping indicator
[8] to get:

  >-
  .*(({test-1-name}
  *)|({test-2-name}
  *)|({test-3-name}
  *)).*

As we improve the installer, we'd gradually remove entries from this
blacklist by cutting a single line from the folded regexp.  The
ordered-prow-config test didn't like my line wrapping, though [9]:

  --- /tmp/tmp.hlKAjJ4Mf4/jobs/openshift/installer/openshift-installer-presubmits.yaml  2018-08-31 23:26:44.138011998 +0000
  +++ hack/../ci-operator/jobs/openshift/installer/openshift-installer-presubmits.yaml  2018-08-31 23:26:44.495040384 +0000
  @@ -113,53 +111,8 @@ presubmits:
           - name: TEST_FOCUS_SERIAL
             value: Suite:openshift/registry/serial
           - name: TEST_SKIP
  -          value: >
  -            .*((DNS should answer endpoint and wildcard queries for the cluster
  -            ...
  -            *)).*
  +          value: |
  +            .*((DNS should answer endpoint and wildcard queries for the cluster...*)).*

and it didn't like my attempt to use one long line (with stripping
[8]) [10]:

  --- /tmp/tmp.kwtUW4jpNh/jobs/openshift/installer/openshift-installer-presubmits.yaml  2018-09-04 17:59:44.796949829 +0000
  +++ hack/../ci-operator/jobs/openshift/installer/openshift-installer-presubmits.yaml  2018-09-04 17:59:44.910959930 +0000
  @@ -113,8 +111,60 @@ presubmits:
           - name: TEST_FOCUS_SERIAL
             value: Suite:openshift/registry/serial
           - name: TEST_SKIP
  -          value: |-
  -            .*((DNS should answer endpoint and wildcard queries for the cluster)|...).*
  +          value: .*((DNS should answer endpoint and wildcard queries for the cluster)|(HostPath
  +            should support subPath)|(Image hard prune should delete orphaned blobs)|(Image
  +            ...
  +            pass security tests)).*

so with this commit I've just used the form with the test-required
wrapping (without the optional whitespace hack).

This change is fairly dramatic, so I'm temporarily hijacking
e2e-aws-all to exercise it.  If it works well there, I'll move this
TEST_SKIP into e2e-aws and restore e2e-aws-all to its previous setup.

Keeping the FOCUS variables keeps the tests from expanding into
untested waters [11].  If we want to drop filters later, we'd need a
new run to figure out what would need blacklisting in the expanded
suite.

[1]: openshift/installer#170 (comment)
[2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/170/pull-ci-origin-installer-e2e-aws-all/3/
[3]: https://github.com/onsi/ginkgo/blame/v1.6.0/README.md#L30
[4]: https://golang.org/pkg/regexp/#pkg-overview
[5]: https://docs.python.org/3/library/re.html#regular-expression-syntax
[6]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04
[7]: http://yaml.org/spec/1.2/spec.html#style/block/folded
[8]: http://yaml.org/spec/1.2/spec.html#id2794534
[9]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/1341/pull-ci-openshift-release-master-ordered-prow-config/111/build-log.txt
[10]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/1341/pull-ci-openshift-release-master-ordered-prow-config/137/build-log.txt
[11]: openshift#1341 (comment)
@wking wking force-pushed the installer-e2e-aws-blacklist branch from 192e17f to 546bfdf Compare September 4, 2018 20:59
@wking
Copy link
Member Author

wking commented Sep 4, 2018

I've rebased around #1351 with 192e17f -> 546bfdf. I'll kick off another round in openshift/installer#170 with the new focus, and then update the blacklist here once that returns.

@smarterclayton
Copy link
Contributor

Yeah, it looks like ginkgo failed on the first run (the tests got run and logged, just died at the end). That can be a problem with the nodes (like we were having before with firewall rules) or it can be an indication that a test totally and completely failed and/or hung, leading to the timeout of the run.

I think we might be missing a ginkgo setting to print all communications (even when hung) before exiting in these cases - Kube has occasionally printed this stuff. I'll have to look at that later.

@wking
Copy link
Member Author

wking commented Sep 4, 2018

Yeah, it looks like ginkgo failed on the first run (the tests got run and logged, just died at the end).

Same for job 5, so probably not a flake.

I think we might be missing a ginkgo setting to print all communications (even when hung) before exiting in these cases - Kube has occasionally printed this stuff. I'll have to look at that later.

The Ginkgo docs suggest -stream for debugging hung tests. I'll launch a cluster locally and see if I can reproduce the hang with -stream enabled.

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2018
@openshift-bot
Copy link
Contributor

@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@wking: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/owners 546bfdf link /test owners

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Copy link
Member Author

wking commented Dec 7, 2018

@smarterclayton has taken over growing the tests in openshift/origin.

/close

@openshift-ci-robot
Copy link
Contributor

@wking: Closed this PR.

In response to this:

@smarterclayton has taken over growing the tests in openshift/origin.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the installer-e2e-aws-blacklist branch December 7, 2018 19:04
derekhiggins pushed a commit to derekhiggins/release that referenced this pull request Oct 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants