go.k8s.io/triage is out of date #9271

Closed
spiffxp opened this issue Sep 5, 2018 · 20 comments
Labels: area/kettle area/triage kind/bug

spiffxp (Member) commented Sep 5, 2018

/area kettle
/kind bug
/assign

http://velodrome.k8s.io/dashboard/db/bigquery-metrics?panelId=12&fullscreen&orgId=1

Called out in https://github.com/kubernetes/test-infra/blob/master/docs/oss-oncall-log.md

Last log entry

==== 2018-08-29 14:55:36 PDT ========================================
PULLED 471
ACK irrelevant 469
EXTEND-ACK  2
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-channelalpha/5155 True True 2018-08-29 14:23:01 PDT SUCCESS
gs://kubernetes-jenkins/pr-logs/pull/batch/pull-kubernetes-e2e-kops-aws/104011 True True 2018-08-29 14:05:59 PDT SUCCESS
ACK "finished.json" 2
Downloading JUnit artifacts.

Replace the pod

spiffxp@spiffxp-macbookpro:kettle (master %)$ k get pods
NAME                      READY     STATUS    RESTARTS   AGE
kettle-5df45c4dcb-7tnx9   1/1       Running   202        26d
spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl delete pod -l app=kettle
pod "kettle-5df45c4dcb-7tnx9" deleted
spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl rollout status deployment/kettle
deployment "kettle" successfully rolled out
spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl get pod -l app=kettle
NAME                      READY     STATUS        RESTARTS   AGE
kettle-5df45c4dcb-7tnx9   1/1       Terminating   202        26d
kettle-5df45c4dcb-fzkjt   1/1       Running       0          16s

Watch the logs

spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl logs -f $(kubectl get pod -l app=kettle -oname)
Activated service account credentials for: [kettle@k8s-gubernator.iam.gserviceaccount.com]
Loading builds from gs://kubernetes-jenkins/pr-logs
already have 1296792 builds
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_kubebench/74/kubeflow-kubebench-presubmit/144
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_examples/242/kubeflow-examples-presubmit/672
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_examples/242/kubeflow-examples-presubmit/666
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_examples/242/kubeflow-examples-presubmit/664
gs://kubernetes-jenkins/pr-logs/pull/cloud-provider-gcp/50/cloud-provider-gcp-tests/30
# ...
k8s-ci-robot added area/kettle kind/bug labels Sep 5, 2018
spiffxp (Member Author) commented Sep 5, 2018

I will close this when the spice is flowing once more

krzyzacy (Member) commented Sep 5, 2018

hummm restart the pod?

spiffxp (Member Author) commented Sep 5, 2018

At last glance the kettle pod is still trying to catch up. There is data flowing into BigQuery. I'm not sure whether anything needs to be done to refresh triage or the metrics driving our velodrome dashboard. Going to wait a bit longer.
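
(One quick way to verify data really is landing: ask BigQuery for the newest build start time kettle has written. A minimal sketch, assuming the google-cloud-bigquery client is installed and that kettle writes to build.all in the k8s-gubernator project; the table and column names are assumptions, not verified here.)

# Minimal sketch: check how fresh kettle's BigQuery data is.
# Assumes google-cloud-bigquery is installed; the project, table, and
# "started" column are assumptions, adjust to match kettle's actual schema.
from google.cloud import bigquery

client = bigquery.Client(project="k8s-gubernator")
query = "SELECT MAX(started) AS latest_start FROM `k8s-gubernator.build.all`"
for row in client.query(query).result():
    print("most recent build start:", row.latest_start)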

spiffxp (Member Author) commented Sep 6, 2018

spiffxp (Member Author) commented Sep 6, 2018

https://prow.k8s.io/?type=periodic&job=ci-test-infra-triage is unhappy, unclear why though

BenTheElder (Member) commented:

/assign
currently oncall and looking into this

spiffxp (Member Author) commented Sep 24, 2018

/unassign @BenTheElder
Trying to get to the point where a run takes less than two hours. In the meantime I've updated with results from a manual run on my laptop.

https://storage.googleapis.com/k8s-gubernator/triage/index.html
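
(For reference, the manual refresh boils down to running the triage summarizer locally and copying its output over the objects behind that index page. A rough sketch of the upload step, assuming the google-cloud-storage client; the object name below is a placeholder, not the verified filename triage serves.)

# Rough sketch: push locally generated triage results into the bucket that
# backs https://storage.googleapis.com/k8s-gubernator/triage/index.html.
# Assumes google-cloud-storage is installed; the object name is a placeholder.
from google.cloud import storage

client = storage.Client(project="k8s-gubernator")
bucket = client.bucket("k8s-gubernator")
blob = bucket.blob("triage/failure_data.json")  # placeholder object name
blob.upload_from_filename("failure_data.json")  # output of the local run
print("uploaded to", blob.public_url)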

spiffxp (Member Author) commented Sep 24, 2018

/area triage
/remove-area kettle

BenTheElder (Member) commented:

Thanks @spiffxp!

spiffxp (Member Author) commented Sep 25, 2018

https://storage.googleapis.com/k8s-gubernator/triage/index.html now looks truncated midway through, trying to figure out why

spiffxp (Member Author) commented Sep 26, 2018

It was truncated because I was downloading an old tarball of results. Whoops. One last thing, going to try and get https://k8s-testgrid.appspot.com/sig-testing-misc#triage populated
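
(To avoid the stale-download mistake next time: checking the object's last-updated time before pulling it would have caught this. A short sketch under the same assumptions as above, with a placeholder object name.)

# Sketch: confirm the triage results object is fresh before downloading it.
# Assumes google-cloud-storage; the object name is a placeholder.
from google.cloud import storage

blob = storage.Client().bucket("k8s-gubernator").get_blob("triage/failure_data.json")
print("last updated:", None if blob is None else blob.updated)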

spiffxp (Member Author) commented Sep 26, 2018

/area kettle
https://k8s-testgrid.appspot.com/sig-testing-misc#metrics-kettle

I think kettle may be failing again

spiffxp (Member Author) commented Sep 26, 2018

Kettle's log output is confusing me; it's streaming:

gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335654
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335666
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335650
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335640
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335655
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335649
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335647
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335663
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335662
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335646
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335659
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335645
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335652
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335641
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335639
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335635
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335633
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335631
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335643
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335628
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335630

Trying a delete/rollout and seeing what happens

spiffxp (Member Author) commented Sep 27, 2018

It fell into the same loop, but on a different bucket. I have kicked kettle again, and it seems to be going further this time?

I looked at logs for the past few days in Stackdriver. Normal behavior is (see the sketch after the lists below):

  • make_db.py gets called around 1am PDT
  • Starts loading buckets, logging "Loading builds from gs://kubernetes-jenkins/pr-logs"; pr-logs is hardcoded as the first bucket to load
  • Second to last bucket loaded is gs://kubernetes-jenkins/logs/
  • Last bucket loaded is gs://istio-circleci/

Last night:

  • I see the entry for kubernetes-jenkins/logs,
  • ... but not istio-circleci
  • Instead, it starts looping on ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new as described in the previous comment
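
(To make the expected ordering concrete, here is an illustrative sketch of how that enumeration order can arise, assuming pr-logs is hardcoded first and the remaining buckets come from a config file in load order. The file name and structure are assumptions; the real logic lives in kettle/make_db.py and its bucket config.)

# Illustrative sketch of the bucket load order described above; file name and
# structure are assumptions, not the real kettle config.
import yaml

def buckets_in_load_order(config_path="buckets.yaml"):
    with open(config_path) as f:
        configured = list(yaml.safe_load(f))  # e.g. ["gs://kubernetes-jenkins/logs/", ...]
    # pr-logs is hardcoded first; the rest follow in config order, which is
    # why gs://istio-circleci/ is expected to be the last bucket in a healthy run.
    order = ["gs://kubernetes-jenkins/pr-logs"]
    order.extend(b for b in configured if b not in order)
    return order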

spiffxp (Member Author) commented Sep 27, 2018

Current suspicion: while deciding how to enumerate builds for a given job, we hit an error reading the latest-build.txt file; that error is silently swallowed and kicks us onto the non-sequential path:

https://github.com/kubernetes/test-infra/blob/master/kettle/make_db.py#L137-L145

This path ends up going through a while True loop that can keep spinning:

https://github.com/kubernetes/test-infra/blob/master/kettle/make_db.py#L96-L109
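
(A minimal sketch of the suspected failure mode, not the actual make_db.py code: if reading latest-build.txt raises and the exception is swallowed, enumeration falls back to a listing loop that can spin on the same job indefinitely. All names below are illustrative.)

# Illustrative sketch of the suspected failure mode (not the real make_db.py):
# a swallowed error on latest-build.txt forces the non-sequential path, whose
# `while True` listing loop can end up re-listing the same job forever.
def get_builds(gcs, job_dir):
    try:
        latest = int(gcs.read(job_dir + "latest-build.txt"))
        return range(1, latest + 1)   # sequential path: walk build numbers directly
    except Exception:
        pass                          # the error is silently dropped here
    builds, page_token = [], None
    while True:                       # non-sequential path: page through listings
        page, page_token = gcs.list(job_dir, page_token)
        builds.extend(page)
        if not page_token:            # if this never empties, we loop forever
            break
    return builds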

spiffxp (Member Author) commented Sep 27, 2018

Now suspect this is related to tot being down while the rest of prow was down due to the outage on 2018-09-25 (https://docs.google.com/document/d/1kwqU4sCycwxfTsV774lnrtFakCg90rMXNShmjSqyEJI/view)

BenTheElder (Member) commented:

Is this stable now or still giving us problems?

spiffxp (Member Author) commented Sep 28, 2018

I think it's stable. I'll close after verifying, and open a follow-up issue for extending how far back we look. I'm tempted to punt on hardening kettle if we're going to revisit tot/snowflake IDs.

spiffxp (Member Author) commented Sep 28, 2018

/close

k8s-ci-robot (Contributor) commented:

@spiffxp: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
