go.k8s.io/triage is out of date #9271

Closed
spiffxp opened this issue Sep 5, 2018 · 20 comments
Labels: area/kettle area/triage kind/bug

spiffxp (Member) commented Sep 5, 2018

/area kettle
/kind bug
/assign

http://velodrome.k8s.io/dashboard/db/bigquery-metrics?panelId=12&fullscreen&orgId=1

Called out in https://github.com/kubernetes/test-infra/blob/master/docs/oss-oncall-log.md

Last log entry

==== 2018-08-29 14:55:36 PDT ========================================
PULLED 471
ACK irrelevant 469
EXTEND-ACK  2
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-channelalpha/5155 True True 2018-08-29 14:23:01 PDT SUCCESS
gs://kubernetes-jenkins/pr-logs/pull/batch/pull-kubernetes-e2e-kops-aws/104011 True True 2018-08-29 14:05:59 PDT SUCCESS
ACK "finished.json" 2
Downloading JUnit artifacts.

Replace the pod

spiffxp@spiffxp-macbookpro:kettle (master %)$ k get pods
NAME                      READY     STATUS    RESTARTS   AGE
kettle-5df45c4dcb-7tnx9   1/1       Running   202        26d
spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl delete pod -l app=kettle
pod "kettle-5df45c4dcb-7tnx9" deleted
spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl rollout status deployment/kettle
deployment "kettle" successfully rolled out
spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl get pod -l app=kettle
NAME                      READY     STATUS        RESTARTS   AGE
kettle-5df45c4dcb-7tnx9   1/1       Terminating   202        26d
kettle-5df45c4dcb-fzkjt   1/1       Running       0          16s

Watch the logs

spiffxp@spiffxp-macbookpro:kettle (master %)$ kubectl logs -f $(kubectl get pod -l app=kettle -oname)
Activated service account credentials for: [kettle@k8s-gubernator.iam.gserviceaccount.com]
Loading builds from gs://kubernetes-jenkins/pr-logs
already have 1296792 builds
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_kubebench/74/kubeflow-kubebench-presubmit/144
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_examples/242/kubeflow-examples-presubmit/672
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_examples/242/kubeflow-examples-presubmit/666
gs://kubernetes-jenkins/pr-logs/pull/kubeflow_examples/242/kubeflow-examples-presubmit/664
gs://kubernetes-jenkins/pr-logs/pull/cloud-provider-gcp/50/cloud-provider-gcp-tests/30
# ...
k8s-ci-robot added area/kettle kind/bug labels Sep 5, 2018
spiffxp (Member Author) commented Sep 5, 2018

I will close this when the spice is flowing once more

krzyzacy (Member) commented Sep 5, 2018

hummm restart the pod?

spiffxp (Member Author) commented Sep 5, 2018

At last glance the kettle pod is still trying to catch up. There is data flowing into BigQuery. I'm not sure whether anything needs to be done to refresh triage or the metrics driving our velodrome dashboard. Going to wait a bit longer.
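
(One quick way to verify data really is landing: ask BigQuery for the newest build start time kettle has written. A minimal sketch, assuming the google-cloud-bigquery client is installed and that kettle writes to build.all in the k8s-gubernator project; the table and column names are assumptions, not verified here.)

# Minimal sketch: check how fresh kettle's BigQuery data is.
# Assumes google-cloud-bigquery is installed; the project, table, and
# "started" column are assumptions, adjust to match kettle's actual schema.
from google.cloud import bigquery

client = bigquery.Client(project="k8s-gubernator")
query = "SELECT MAX(started) AS latest_start FROM `k8s-gubernator.build.all`"
for row in client.query(query).result():
    print("most recent build start:", row.latest_start)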

spiffxp (Member Author) commented Sep 6, 2018

spiffxp (Member Author) commented Sep 6, 2018

https://prow.k8s.io/?type=periodic&job=ci-test-infra-triage is unhappy, unclear why though

BenTheElder (Member) commented:

/assign
currently oncall and looking into this

spiffxp (Member Author) commented Sep 24, 2018

/unassign @BenTheElder
Trying to get to the point where a run takes less than two hours. In the meantime I've updated with results from a manual run on my laptop.

https://storage.googleapis.com/k8s-gubernator/triage/index.html
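
(For reference, the manual refresh boils down to running the triage summarizer locally and copying its output over the objects behind that index page. A rough sketch of the upload step, assuming the google-cloud-storage client; the object name below is a placeholder, not the verified filename triage serves.)

# Rough sketch: push locally generated triage results into the bucket that
# backs https://storage.googleapis.com/k8s-gubernator/triage/index.html.
# Assumes google-cloud-storage is installed; the object name is a placeholder.
from google.cloud import storage

client = storage.Client(project="k8s-gubernator")
bucket = client.bucket("k8s-gubernator")
blob = bucket.blob("triage/failure_data.json")  # placeholder object name
blob.upload_from_filename("failure_data.json")  # output of the local run
print("uploaded to", blob.public_url)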

spiffxp (Member Author) commented Sep 24, 2018

/area triage
/remove-area kettle

BenTheElder (Member) commented:

Thanks @spiffxp!

spiffxp (Member Author) commented Sep 25, 2018

https://storage.googleapis.com/k8s-gubernator/triage/index.html now looks truncated midway through, trying to figure out why

spiffxp (Member Author) commented Sep 26, 2018

It was truncated because I was downloading an old tarball of results. Whoops. One last thing, going to try and get https://k8s-testgrid.appspot.com/sig-testing-misc#triage populated
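
(To avoid the stale-download mistake next time: checking the object's last-updated time before pulling it would have caught this. A short sketch under the same assumptions as above, with a placeholder object name.)

# Sketch: confirm the triage results object is fresh before downloading it.
# Assumes google-cloud-storage; the object name is a placeholder.
from google.cloud import storage

blob = storage.Client().bucket("k8s-gubernator").get_blob("triage/failure_data.json")
print("last updated:", None if blob is None else blob.updated)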

spiffxp (Member Author) commented Sep 26, 2018

/area kettle
https://k8s-testgrid.appspot.com/sig-testing-misc#metrics-kettle

I think kettle may be failing again

spiffxp (Member Author) commented Sep 26, 2018

Kettle's log output is confusing me; it's streaming:

gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335654
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335666
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335650
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335640
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335655
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335649
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335647
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335663
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335662
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335646
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335659
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335645
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335652
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335641
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335639
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335635
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335633
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335631
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335643
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335628
gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new/1044652866447335630

Trying a delete/rollout and seeing what happens

spiffxp (Member Author) commented Sep 27, 2018

It fell into the same loop, but on a different bucket. I have kicked kettle again, and it seems to be going further this time?

I looked at logs for the past few days in Stackdriver. Normal behavior is (see the sketch after the lists below):

  • make_db.py gets called around 1am PDT
  • Starts loading buckets, logging "Loading builds from gs://kubernetes-jenkins/pr-logs"; pr-logs is hardcoded as the first bucket to load
  • Second to last bucket loaded is gs://kubernetes-jenkins/logs/
  • Last bucket loaded is gs://istio-circleci/

Last night:

  • I see the entry for kubernetes-jenkins/logs,
  • ... but not istio-circleci
  • Instead, it starts looping on ci-kubernetes-e2e-gce-stable1-beta-upgrade-cluster-new as described in the previous comment
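
(To make the expected ordering concrete, here is an illustrative sketch of how that enumeration order can arise, assuming pr-logs is hardcoded first and the remaining buckets come from a config file in load order. The file name and structure are assumptions; the real logic lives in kettle/make_db.py and its bucket config.)

# Illustrative sketch of the bucket load order described above; file name and
# structure are assumptions, not the real kettle config.
import yaml

def buckets_in_load_order(config_path="buckets.yaml"):
    with open(config_path) as f:
        configured = list(yaml.safe_load(f))  # e.g. ["gs://kubernetes-jenkins/logs/", ...]
    # pr-logs is hardcoded first; the rest follow in config order, which is
    # why gs://istio-circleci/ is expected to be the last bucket in a healthy run.
    order = ["gs://kubernetes-jenkins/pr-logs"]
    order.extend(b for b in configured if b not in order)
    return order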

spiffxp (Member Author) commented Sep 27, 2018

Current suspicion: while deciding how to enumerate builds for a given job, we hit an error reading the latest-build.txt file; that error is silently swallowed and kicks us onto the non-sequential path:

https://github.com/kubernetes/test-infra/blob/master/kettle/make_db.py#L137-L145

This path ends up going through a while True loop that can keep spinning:

https://github.com/kubernetes/test-infra/blob/master/kettle/make_db.py#L96-L109
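
(A minimal sketch of the suspected failure mode, not the actual make_db.py code: if reading latest-build.txt raises and the exception is swallowed, enumeration falls back to a listing loop that can spin on the same job indefinitely. All names below are illustrative.)

# Illustrative sketch of the suspected failure mode (not the real make_db.py):
# a swallowed error on latest-build.txt forces the non-sequential path, whose
# `while True` listing loop can end up re-listing the same job forever.
def get_builds(gcs, job_dir):
    try:
        latest = int(gcs.read(job_dir + "latest-build.txt"))
        return range(1, latest + 1)   # sequential path: walk build numbers directly
    except Exception:
        pass                          # the error is silently dropped here
    builds, page_token = [], None
    while True:                       # non-sequential path: page through listings
        page, page_token = gcs.list(job_dir, page_token)
        builds.extend(page)
        if not page_token:            # if this never empties, we loop forever
            break
    return builds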

spiffxp (Member Author) commented Sep 27, 2018

Now suspect this is related to tot being down while the rest of prow was down due to the outage on 2018-09-25 (https://docs.google.com/document/d/1kwqU4sCycwxfTsV774lnrtFakCg90rMXNShmjSqyEJI/view)

BenTheElder (Member) commented:

Is this stable now or still giving us problems?

spiffxp (Member Author) commented Sep 28, 2018

I think it's stable. I'll close after verifying, and open a follow-up issue for extending how far back we look. I'm tempted to punt on hardening kettle if we're going to revisit tot/snowflake IDs.

spiffxp (Member Author) commented Sep 28, 2018

/close

k8s-ci-robot (Contributor) commented:

@spiffxp: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
