Skip to content
This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

weave-npc should reconcile ipsets/rules on restart #3771

Open
Quentin-M opened this issue Feb 12, 2020 · 7 comments
Open

weave-npc should reconcile ipsets/rules on restart #3771

Quentin-M opened this issue Feb 12, 2020 · 7 comments

Comments

@Quentin-M
Copy link

Forking off this issue: #3764 as per @bboreham's request.

TL;DR >

As of right now, when one of weave-npc's controller/go-routine panics, weave-npc will simply log the panic rather than propagating it in order to restart the go-routine, or in order to restart weave-npc as a whole (thus potentially saving it from a panic loop if the memory structures are in an unexpected state. This will leave weave-npc running in a non-functioning state.

Furthermore, when weave-npc restarts, it incurs a 10s+ downtime as weave-npc resets every IPSets/Rules, then re-creates them, instead of gracefully reconciling the host / desired / current states. The trouble is that, when a bad informer sends unexpected data (as per issue above), all weave-npc containers will crash at once, hence creating a full cluster downtime - potentially lengthened by the slowdown of the API due to sheer amount of requests.

/cc @murali-reddy

@bboreham
Copy link
Contributor

Note I said in #3764 that it is written not to "simply log".

"it appears to me that the intention is to re-raise the panic and hence exit the whole program. I am mystified why it keeps on running"

@Quentin-M
Copy link
Author

Quentin-M commented Feb 20, 2020 via email

@bboreham
Copy link
Contributor

Only if the "ReallyCrash" global variable is set to true,

ReallyCrash defaults to true. Why would it change, in this context?

@bboreham bboreham changed the title weave-npc should gracefully restart upon go routine panic weave-npc should reconcile ipsets/rules on restart Feb 26, 2020
@gobomb
Copy link
Contributor

gobomb commented Jul 16, 2020

I found ReallyCrash is true but the process didn't crash. I am confused.

@gobomb
Copy link
Contributor

gobomb commented Jul 20, 2020

I think it will block at this line: https://github.com/kubernetes/client-go/blob/319dbfd0ed290ad5dbbbe252d27cb5bc9181e6be/tools/cache/controller.go#L147 , when the panic was called in wait.Until(c.processLoop, time.Second, stopCh) .

@bboreham
Copy link
Contributor

bboreham commented Aug 4, 2020

@gobomb yes, very good.

I recreated the problem and here is a stack trace from it waiting:

goroutine 107 [semacquire]:
sync.runtime_Semacquire(0xc0003e6568)
        /usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc0003e6560)
        /usr/local/go/src/sync/waitgroup.go:130 +0x64
k8s.io/apimachinery/pkg/util/wait.(*Group).Wait(0xc0003e6560)
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:49 +0x2d
panic(0x13d9360, 0x17cdc10)
        /usr/local/go/src/runtime/panic.go:969 +0x166
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:57 +0x1c8
panic(0x13d9360, 0x17cdc10)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/weaveworks/weave/npc.(*controller).AddPod(0xc00044b260, 0x181ad20, 0xc000120000, 0xc00002c000, 0x0, 0x0)
        /go/src/github.com/weaveworks/weave/npc/controller.go:125 +0x136

Do you know if there is a Kubernetes issue filed?

@bboreham
Copy link
Contributor

bboreham commented Aug 4, 2020

Found your Kubernetes issue: kubernetes/kubernetes#93641

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants