weave-npc should reconcile ipsets/rules on restart #3771
> Note I said in #3764 that it is written not to "simply log": "it appears to me that the intention is to re-raise the panic and hence exit the whole program. I am mystified why it keeps on running"
The code snippet explicitly intercepts the panic from the crashed goroutine with recover(), logs it, and moves on. It does anything besides logging only if the "ReallyCrash" global variable is set to true, or if an additional handler is registered globally that raises a new panic or makes an exit syscall. The code works as written; the behavior, however, is wrong.
Best Regards,
Quentin Machu,
Head of DevOps, BitMEX
(Reply by email to Bryan Boreham's comment of Feb 19, 2020, 03:46 -0800.)
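A minimal, self-contained sketch of the recover-and-log pattern described in the reply above, assuming it behaves like Kubernetes' `HandleCrash` helper. The names `reallyCrash`, `panicHandlers`, `handleCrash` and `runController` are hypothetical local stand-ins, not the real `k8s.io/apimachinery` API: the panic is recovered and logged, extra handlers run, and the panic is re-raised only when the flag is set.

```go
package main

import (
	"fmt"
	"log"
)

// reallyCrash plays the role of the "ReallyCrash" global discussed above
// (hypothetical local copy): when false, a recovered panic is only logged.
var reallyCrash = false

// panicHandlers are extra callbacks run on every recovered panic
// (hypothetical; stands in for globally registered handlers).
var panicHandlers []func(interface{})

// handleCrash is meant to be deferred. It recovers a panic, logs it, runs
// the registered handlers, and re-panics only if reallyCrash is set.
func handleCrash() {
	if r := recover(); r != nil {
		log.Printf("observed a panic: %v", r)
		for _, fn := range panicHandlers {
			fn(r)
		}
		if reallyCrash {
			panic(r) // propagate, which would terminate the process
		}
	}
}

// runController simulates a controller goroutine body that hits a panic.
func runController() {
	defer handleCrash()
	panic("unexpected object from informer")
}

func main() {
	runController()
	// With reallyCrash == false we get here: the panic was swallowed and
	// the process keeps running without a working controller.
	fmt.Println("still running")
}
```

Running it logs the recovered panic to stderr and then prints `still running`. Appending a handler such as `func(interface{}) { os.Exit(1) }` to `panicHandlers`, or setting `reallyCrash` to true, would instead turn the panic into a process exit, which is the behavior the thread argues for.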
I found …
I think it will block at this line: https://github.com/kubernetes/client-go/blob/319dbfd0ed290ad5dbbbe252d27cb5bc9181e6be/tools/cache/controller.go#L147 when the panic was called in …
@gobomb yes, very good. I recreated the problem and here is a stack trace from it waiting:
Do you know if there is a Kubernetes issue filed?
Found your Kubernetes issue: kubernetes/kubernetes#93641
Forking this off from #3764, as per @bboreham's request.
TL;DR:
As of right now, when one of weave-npc's controllers/goroutines panics, weave-npc simply logs the panic rather than propagating it so that the goroutine, or weave-npc as a whole, gets restarted (restarting the whole process could also break a panic loop if its in-memory structures are in an unexpected state). This leaves weave-npc running in a non-functioning state.
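One way to get the "restart the goroutine, or the whole process" behavior is sketched below. This is a hedged illustration, not weave-npc code: `runOnce`, `supervise` and `maxRestarts` are hypothetical names. A panic is converted into an error, the work is retried a bounded number of times, and if it keeps panicking the caller exits non-zero so that the container is restarted with fresh in-memory state.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
)

// runOnce runs fn and converts a panic into an error instead of swallowing it.
func runOnce(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	fn()
	return nil
}

// supervise restarts fn after each panic, up to maxRestarts times. If fn
// keeps panicking (e.g. because in-memory state is corrupt), it gives up and
// returns an error so the caller can exit instead of running half-broken.
func supervise(fn func(), maxRestarts int) error {
	for attempt := 0; attempt <= maxRestarts; attempt++ {
		err := runOnce(fn)
		if err == nil {
			return nil // fn returned normally
		}
		log.Printf("attempt %d failed: %v", attempt+1, err)
	}
	return errors.New("controller keeps panicking; giving up")
}

func main() {
	calls := 0
	// flaky simulates a controller that panics twice, then succeeds.
	flaky := func() {
		calls++
		if calls < 3 {
			panic("transient failure")
		}
	}
	if err := supervise(flaky, 5); err != nil {
		log.Print(err)
		os.Exit(1) // non-zero exit: the kubelet restarts the DaemonSet pod's container
	}
	fmt.Println("controller finished after", calls, "runs")
}
```

This prints `controller finished after 3 runs`. Bounding the retries is the design choice that matters here: retrying forever would reproduce the panic loop mentioned above, whereas exiting hands recovery to Kubernetes.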
Furthermore, restarting weave-npc incurs 10+ seconds of downtime, because it resets every ipset and rule and then re-creates them, instead of gracefully reconciling the host's current state with the desired state. The trouble is that when a bad informer sends unexpected data (as in the issue above), all weave-npc containers crash at once, causing cluster-wide downtime, potentially prolonged by the API server slowing down under the sheer volume of requests.
/cc @murali-reddy