
[e2e failure] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicaSet Should scale ... #54574

Closed
spiffxp opened this issue Oct 25, 2017 · 29 comments

Labels: kind/bug · kind/failing-test · milestone/needs-attention · priority/critical-urgent · sig/autoscaling
Milestone: v1.9

spiffxp (Member) commented Oct 25, 2017

/priority critical-urgent
/sig autoscaling

This test case started failing recently and affects a number of jobs: triage report

This is affecting multiple jobs on the release-master-blocking testgrid dashboard, and prevents us from cutting 1.9.0-alpha.2 (kubernetes/sig-release#22). Is there work ongoing to bring this job back to green?

triage cluster b75045e2cb613e12dca1

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:39
timeout waiting 15m0s for 5 replicas
Expected error:
    <*errors.errorString | 0xc4202cafe0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:128

Suspect range from gci-gce-serial: 060b4b8...51244eb

Suspect range from gci-gke-serial: b1e2d7a...82a52a9

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. labels Oct 25, 2017
@spiffxp spiffxp changed the title [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicaSet Should scale ... [e2e failure] [sig-autoscaling] [HPA] Horizontal pod autoscaling (scale resource: CPU) [sig-autoscaling] [Serial] [Slow] ReplicaSet Should scale ... Oct 25, 2017
spiffxp (Member, Author) commented Oct 25, 2017

@kubernetes/sig-autoscaling-test-failures

spiffxp (Member, Author) commented Oct 25, 2017

/priority failing-test

@k8s-ci-robot k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Oct 25, 2017
MaciekPytel (Contributor) commented:

@DirectXMan12 the moment this test started failing coincides exactly with the merge of #53743, which, by the way, is a very large commit to HPA that was not tagged with sig-autoscaling and therefore slipped past us completely.

I think we should revert #53743 for now and merge it again after fixing it.

cc: @mwielgus

DirectXMan12 (Contributor) commented Oct 25, 2017

I apologize for missing the SIG autoscaling label (although I'm surprised that the bot didn't complain about it; perhaps because I'm the one who submitted it?).

I'll track down why it's failing.

DirectXMan12 (Contributor) commented Oct 25, 2017

Found the issue. When you write your scaleTargetRef, it's now important to actually specify an APIVersion field. It didn't matter before, but it was always poor form to refer to kind: ReplicaSet without apiVersion: extensions/v1beta1 (or apps/v1beta2).
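
As a minimal sketch (not the actual e2e code; the names, target kind, and replica counts are illustrative), a correctly-specified scaleTargetRef built with the Go client types looks like this:

```go
package main

import (
	autoscalingv1 "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newExampleHPA builds an HPA whose scaleTargetRef carries an explicit
// APIVersion; leaving that field empty is what the failing tests did.
func newExampleHPA() *autoscalingv1.HorizontalPodAutoscaler {
	minReplicas := int32(1)
	return &autoscalingv1.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "example-hpa"},
		Spec: autoscalingv1.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv1.CrossVersionObjectReference{
				APIVersion: "extensions/v1beta1", // required now that scale lookups resolve the group-version
				Kind:       "ReplicaSet",
				Name:       "example-rs",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 5,
		},
	}
}

func main() { _ = newExampleHPA() }
```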

DirectXMan12 (Contributor) commented:

Will have a PR up in a couple of minutes.

DirectXMan12 (Contributor) commented Oct 25, 2017

... aaand the apps API group doesn't set registry subresource versions correctly, so the group-version on scales returned by apps is apps/v1beta2 (incorrectly).

EDIT: @liggitt correctly pointed out that I misread things, and that apps/v1beta2.Scale is a real thing, unfortunately. I'll have to add a slightly different fix to the PR.
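
A quick sketch (for illustration only) of why that matters: both Scale kinds exist as distinct Go types, so anything resolving the /scale subresource can't assume a single group-version.

```go
package main

import (
	"fmt"

	appsv1beta2 "k8s.io/api/apps/v1beta2"
	autoscalingv1 "k8s.io/api/autoscaling/v1"
)

func main() {
	// apps/v1beta2 defines its own Scale kind, distinct from
	// autoscaling/v1.Scale; a generic scale client must map between them.
	fmt.Printf("%T\n", appsv1beta2.Scale{})   // v1beta2.Scale
	fmt.Printf("%T\n", autoscalingv1.Scale{}) // v1.Scale
}
```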

DirectXMan12 (Contributor) commented:

PR posted ^

spiffxp (Member, Author) commented Oct 31, 2017

PRs continue to await review

k8s-github-robot pushed a commit that referenced this issue Nov 6, 2017
…nd-hpa-gvks

Automatic merge from submit-queue (batch tested with PRs 53645, 54734, 54586, 55015, 54688). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Fix Incorrect Scale Subresources and HPA e2e ScaleTargetRefs

The HPA e2es failed to actually set `apiVersion` on the created HPAs, which previously was ignored. Since the polymorphic scale client was merged, this behavior is no longer tolerated (it was never correct to begin with, but it accidentally worked).

Additionally, the `apps` resources have their own version of scale.  Until `apps/v1beta1` and `apps/v1beta2` go away, we need to support those versions in the scale client.

Together, these broke some of the HPA e2es.

Fixes #54574

```release-note
NONE
```
@spiffxp spiffxp added this to the v1.9 milestone Nov 7, 2017
spiffxp (Member, Author) commented Nov 9, 2017

/reopen
I'm still seeing this here https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gce-serial

Unless we decide to punt that job from release-master-blocking, this is now impacting 1.9.0-alpha.3 (kubernetes/sig-release#27)

@k8s-ci-robot k8s-ci-robot reopened this Nov 9, 2017
spiffxp (Member, Author) commented Nov 9, 2017

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 9, 2017
spiffxp (Member, Author) commented Nov 9, 2017

/status approved-for-milestone
(new comment since the bot doesn't accept edits) I can type, I swear

abgworrall (Contributor) commented:

I'm also seeing this on our OS image validation testgrid: https://k8s-testgrid.appspot.com/sig-node-cos-image#e2e-gce-cosbeta-k8sdev-serial

DirectXMan12 (Contributor) commented:

Looking into the current set of failures

DirectXMan12 (Contributor) commented:

Looking at the failure logs, I'm seeing

horizontal.go:189] failed to query scale subresource for Deployment/e2e-tests-horizontal-pod-autoscaling-zxc9r/test-deployment: the server could not find the requested resource

So, I tried reproducing locally (provider=local, hack/local-up-cluster.sh), and I can't reproduce it. HPA seems to be able to fetch scale properly for replicasets and deployments in a vanilla fresh cluster-up environment. Is there something special about the way we stand up those test environments?

frobware (Contributor) commented:

Going to investigate this too.

frobware (Contributor) commented:

> I'm still seeing this here https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gce-serial

@spiffxp this appears to be passing now. As highlighted by @MaciekPytel on slack/sig-autoscaling, #55413 might be significant here.

spiffxp (Member, Author) commented Nov 13, 2017

/remove-priority critical-urgent
/priority important-soon
Agree, this has moved off of release-master-blocking, with the exception of soak-gci-gce, which I would like to kick out of the list of blocking tests wholesale anyway.

This is still affecting some upgrade tests, which I'm not actively watching yet. Once we hit code freeze, I will be watching them, and will bump priority accordingly. Does something need to be cherry-picked into the release-1.8 branch?

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Nov 13, 2017
dims (Member) commented Nov 16, 2017

/assign @frobware

@frobware feel free to reassign/unassign; I assigned it based on your comment 2 days ago.

k8s-ci-robot (Contributor) commented:

@dims: GitHub didn't allow me to assign the following users: frobware.

Note that only kubernetes members can be assigned.

In response to this:

> /assign @frobware
>
> @frobware feel free to reassign/unassign, i assigned it based on your comment 2 days ago

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

spiffxp (Member, Author) commented Nov 20, 2017

/close
I'm no longer seeing this on sig-release-master-blocking or sig-release-master-upgrade

spiffxp (Member, Author) commented Nov 27, 2017

/reopen
FYI @kubernetes/sig-autoscaling-test-failures: now that some of the upgrade jobs have been fixed, I'm seeing this again in a number of jobs:

e.g. triage cluster b75045e2cb613e12dca1

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:41
timeout waiting 15m0s for 3 replicas
Expected error:
    <*errors.errorString | 0xc4202d9330>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/autoscaling/horizontal_pod_autoscaling.go:122

triage report

@DirectXMan12 @frobware (taking a total guess) are there fixes that need to be cherry-picked into release-1.8?

tracking this against v1.9.0-beta.1 (kubernetes/sig-release#34)

@k8s-ci-robot k8s-ci-robot reopened this Nov 27, 2017
@k8s-github-robot k8s-github-robot removed this from the v1.9 milestone Nov 27, 2017
spiffxp (Member, Author) commented Nov 27, 2017

/remove-priority important-soon
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 27, 2017
@spiffxp spiffxp added this to the v1.9 milestone Nov 27, 2017
DirectXMan12 (Contributor) commented:

@spiffxp I'd guess we'd have to cherry-pick the test fixes back to the 1.8 test suite if you've got instances of it running against 1.9 code.

DirectXMan12 (Contributor) commented Nov 29, 2017

The fix needed should be #54586. Let me try to repro locally (1.9 cluster, 1.8 tests) and see what happens.

DirectXMan12 (Contributor) commented:

I've reproduced locally. The backport seems to fix the issue (just doing one final test run). Should have a PR up shortly.

spiffxp (Member, Author) commented Dec 1, 2017

Now tracking against v1.9.0-beta.2 (kubernetes/sig-release#39)

k8s-github-robot pushed a commit that referenced this issue Dec 1, 2017
…est-scale-gvks

Automatic merge from submit-queue.

[e2e] make sure to specify APIVersion in HPA tests

Previously, the HPA controller ignored APIVersion when resolving the
scale subresource for a kind, meaning that if it was set incorrectly in the
HPA's scaleTargetRef, it would not matter. This was the case for
several of the HPA e2e tests.

Since the polymorphic scale client was merged into Kubernetes 1.9, and we need
to do upgrade testing, APIVersion now matters. This updates the HPA e2es to
care about APIVersion, by passing kind as a full GroupVersionKind, and not
just a string.

Fixes #54574 (again)

```release-note
NONE
```
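
To illustrate the shape of the change (a sketch, not the actual e2e diff), passing a full GroupVersionKind instead of a bare kind string looks like:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Before: the target kind was a bare string, so the generated
	// scaleTargetRef had no apiVersion for the scale client to resolve.
	kind := "ReplicaSet"

	// After: a full GroupVersionKind carries the apiVersion too.
	gvk := schema.GroupVersionKind{Group: "apps", Version: "v1beta2", Kind: "ReplicaSet"}

	fmt.Println(kind)                                  // ReplicaSet
	fmt.Println(gvk.GroupVersion().String(), gvk.Kind) // apps/v1beta2 ReplicaSet
}
```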
k8s-github-robot commented:

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@DirectXMan12 @spiffxp @kubernetes/sig-autoscaling-misc

Action required: During code freeze, issues in the milestone should be in progress.
If this issue is not being actively worked on, please remove it from the milestone.
If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.

Action Required: This issue has not been updated since Dec 1. Please provide an update.

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
  • sig/autoscaling: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
