Automation misbehaving due to Github October 21 incident #9881

Closed
spiffxp opened this issue Oct 22, 2018 · 6 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.

Comments

Member

spiffxp commented Oct 22, 2018

Opening this as an issue to refer to if/when you find weird behavior over the next day or two. GitHub is mitigating an incident.

At a minimum, I guarantee prow isn't doing things right now because:

"Out of an abundance of caution we have taken steps to ensure the integrity of your data, including pausing webhook events and other internal processing systems."

We may have some mitigation of our own to do, since it's unclear whether GitHub will release/replay the paused webhooks or whether we'll need to do some sort of reconciliation on our own.
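
If it does come to reconciling on our side, one rough way to scope the work would be to pick out every issue or PR that changed while webhooks were paused and re-run automation against just those. A minimal sketch in Python, assuming each item is a dict with a timezone-aware `updated_at` (a hypothetical shape, not Prow's actual data model) and that the bounds of the pause window are known. The timestamps below are placeholders, not the real incident times:

```python
from datetime import datetime, timezone


def changed_during_pause(items, paused_from, resumed_at):
    """Return the items whose last update falls inside the pause window.

    `items` is a hypothetical list of dicts with a "number" and a
    timezone-aware "updated_at"; it is not Prow's real data model.
    """
    return [
        item for item in items
        if paused_from <= item["updated_at"] <= resumed_at
    ]


if __name__ == "__main__":
    # Placeholder window, NOT the actual incident timestamps.
    paused_from = datetime(2018, 1, 1, 0, 0, tzinfo=timezone.utc)
    resumed_at = datetime(2018, 1, 2, 0, 0, tzinfo=timezone.utc)

    items = [
        {"number": 1, "updated_at": datetime(2017, 12, 31, 23, 0, tzinfo=timezone.utc)},
        {"number": 2, "updated_at": datetime(2018, 1, 1, 12, 0, tzinfo=timezone.utc)},
        {"number": 3, "updated_at": datetime(2018, 1, 2, 6, 0, tzinfo=timezone.utc)},
    ]

    # Only items updated inside the window are flagged for a re-check.
    for item in changed_during_pause(items, paused_from, resumed_at):
        print(f"re-check #{item['number']}")
```

This only narrows down what to look at. Whether a flagged item actually needs anything re-triggered still depends on what the missed events were.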

Timeline

  • 2018-10-21 00:16:09 PT: status.github.com "We are investigating reports of elevated error rates."
  • 2018-10-21 00:17:05 PT: status.github.com "We're failing over a data storage system in order to restore access to GitHub.com."
  • 2018-10-21 00:17:36 PT: #sig-contributor-experience notices "they're having some issues right now"
  • 2018-10-21 00:20:52 PT: amwat given heads up in #testing-ops
  • 2018-10-22 00:09:24 PT: status.github.com "We've completed validation of data consistency and have enabled some background jobs. We're continuing to monitor as the system recovers and expect to resume delivering webhooks at 16:45UTC."
  • 2018-10-22 00:09:29 PT: opened this issue
  • 2018-10-22 00:09:50 PT: status.github.com "We have resumed delivery of webhooks and will continue to monitor as we process a delayed backlog of events"
  • 2018-10-22 00:09:50 PT: Notified kubernetes-dev@
  • 2018-10-22 00:10:32 PT: status.github.com "We have temporarily paused delivery of webhooks while we address an issue. We are working to resume delivery as soon as possible"

/kind bug
/priority critical-urgent
/assign @amwat
as go.k8s.io/oncall lists him as the test-infra oncall

spiffxp added the kind/bug and priority/critical-urgent labels Oct 22, 2018
Member Author

spiffxp commented Oct 22, 2018

Notified kubernetes-dev@

Contributor

amwat commented Oct 22, 2018

Update: looks like we are slowly starting to receive webhooks for past events and things are starting to move.
https://status.github.com/messages

Contributor

amwat commented Oct 22, 2018

Update: old webhooks have caught up, and new webhooks are working normally.
https://status.github.com/messages

Member Author

spiffxp commented Oct 22, 2018

Looking good to me. I can't tell that anything merged that shouldn't have, but I haven't looked beyond the spot checks linked here.

Contributor

amwat commented Oct 23, 2018

Everything (related to this outage at least) is looking fine.
/close

@k8s-ci-robot
Contributor

@amwat: Closing this issue.

In response to this:

Everything (related to this outage at least) is looking fine.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


3 participants