
Resources leak until timeout if response fails to return to the Client #1020

Open
Bolodya1997 opened this issue Jul 14, 2021 · 14 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers), stability (The problem is related to system stability)

Comments

@Bolodya1997

Bolodya1997 commented Jul 14, 2021

Expected Behavior

All allocated resources should either be accessible from the Client or be cleaned up in a short time.

Current Behavior

Resources can end up allocated and cleaned up only when the timeout happens (a sketch of the race follows the list below):

  1. The NSC requests icmp-responder with a request timeout of T.
  2. The Request blocks in discover until an NSE registers.
  3. The NSE registers after a T-t duration.
  4. Within the remaining t, the Request reaches the NSE, allocates IP addresses and starts returning back.
  5. The request timeout happens.
  6. The IP addresses allocated in step 4 are not accessible from the Client and will be cleaned up only on the (NSMgr -> NSE) token timeout.
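
A minimal, self-contained Go sketch of this race, using hypothetical names and timings rather than the real SDK chain: the client's request context expires while the NSE is still allocating and returning the connection, so the allocated IP is never visible to the Client.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// connection is a hypothetical stand-in for the NSM Connection returned by the NSE.
type connection struct {
	id string
	ip string
}

// requestNSE models the (NSMgr -> NSE) part of the chain: it waits for the NSE
// to register, then allocates an IP and returns the connection.
func requestNSE(registered <-chan struct{}) <-chan *connection {
	out := make(chan *connection, 1)
	go func() {
		<-registered                      // Request blocks in discover until the NSE registers
		time.Sleep(50 * time.Millisecond) // NSE allocates IP addresses and starts returning back
		out <- &connection{id: "id-b", ip: "172.16.1.2"}
	}()
	return out
}

func main() {
	const T = 100 * time.Millisecond // NSC request timeout

	registered := make(chan struct{})
	// The NSE registers only shortly before the request timeout (after T-t).
	time.AfterFunc(80*time.Millisecond, func() { close(registered) })

	ctx, cancel := context.WithTimeout(context.Background(), T)
	defer cancel()

	select {
	case c := <-requestNSE(registered):
		fmt.Println("client got connection", c.id, c.ip)
	case <-ctx.Done():
		// The IP is allocated on the NSE side, but the Client never sees the
		// connection ID, so it cannot Close it. The resources stay allocated
		// until the (NSMgr -> NSE) token expires.
		fmt.Println("request timed out:", ctx.Err())
	}
}
```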

The issue happens in real life:

https://github.com/networkservicemesh/integration-k8s-packet/actions/runs/1029599216

@Bolodya1997 Bolodya1997 added the bug Something isn't working label Jul 14, 2021
@Bolodya1997
Author

So here is a basic problem:

... -> hop-a (id-a) -> hop-b (id-b) -> ...

If a Request allocates some resources in hop-b and doesn't return the id-b ID back to hop-a, hop-a is unable to Close these resources. Even if it tries to Close, it will end up closing an id-b' Connection which doesn't exist on hop-b.

I don't think this can be solved without adding backward monitoring from hop-b to hop-a; in that case hop-b would receive a monitor break and close the id-b Connection. A simplified sketch of the problem is below.
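
A minimal sketch with simplified stand-in types (not the real NSM api types) showing why hop-a cannot Close what hop-b allocated without knowing id-b:

```go
package main

import (
	"errors"
	"fmt"
)

// hop is a hypothetical element of the Request/Close chain; it keeps the
// resources it allocated keyed by its own per-hop segment ID.
type hop struct {
	name      string
	resources map[string]string // segment ID -> allocated resource
}

// close releases resources only if the caller knows this hop's segment ID.
func (h *hop) close(segmentID string) error {
	if _, ok := h.resources[segmentID]; !ok {
		return errors.New("no Connection with ID " + segmentID + " on " + h.name)
	}
	delete(h.resources, segmentID)
	return nil
}

func main() {
	hopB := &hop{name: "hop-b", resources: map[string]string{
		"id-b": "IP 172.16.1.2", // allocated during the Request that never returned
	}}

	// hop-a never received id-b because the response didn't make it back, so
	// the only ID it can use belongs to a Connection hop-b doesn't know about.
	if err := hopB.close("id-b'"); err != nil {
		fmt.Println("Close from hop-a fails:", err)
	}
	fmt.Println("hop-b still holds:", hopB.resources)
}
```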

@edwarnicke @denis-tingaikin
Thoughts?

@Bolodya1997 Bolodya1997 added the question Further information is requested label Jul 14, 2021
@denis-tingaikin
Member

denis-tingaikin commented Jul 14, 2021

It seems to me that a Close call is missing. The monitor server knows when the stream breaks, so it can produce an event for closing the resources.

@Bolodya1997
Author

Bolodya1997 commented Jul 16, 2021

It seems to me that a Close call is missing. The monitor server knows when the stream breaks, so it can produce an event for closing the resources.

So here is the problem with the monitor server:

  1. The monitor server doesn’t actually know which Connection the left side was requesting, because monitoring is another API that is not part of the Request/Close chain.
  2. Even if it could know it, we would be imposing a huge restriction - “if you request monitor events, please hold this stream until your death, otherwise the whole left side would be closed” - so it affects our current solution with cmd-nsc-init and cmd-nsc, and it also prevents non-owners of the Connection from monitoring it.

@Bolodya1997
Author

Actually, by backward monitoring here I don't mean sending monitor updates from the right side to the left side. The only thing we need is some streaming request from the left side to the right side that would be part of the Request/Close chain and have the lifetime of the corresponding Connection.

@edwarnicke
Thoughts?

@edwarnicke
Member

@Bolodya1997 This isn't a leak as long as things are cleaned up when the expireTime expires. That said, please see networkservicemesh/deployments-k8s#906

@Bolodya1997
Author

@edwarnicke
Generally you are right. But it can lead to problematic issues like networkservicemesh/sdk-vpp#315. Also, this issue was initially filed due to a failure in our integration tests caused by an exhausted IP pool in IPAM for the same reason.

@Bolodya1997
Author

@edwarnicke
Please take a look; here are 2 new solutions:

Monitor based solution

We can add a client chain element before the networkservice.Client that creates a monitor stream on initial Request failure. If there is a corresponding Connection on the remote side, we should close it (a sketch of such an element follows the bullet below).

  • This solution doesn't require any new API or backward monitoring.
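
A rough sketch of this idea, with simplified stand-in types instead of the real networkservice.NetworkServiceClient and monitor API; the find helper here is hypothetical and stands in for opening a monitor stream and looking up the Connection that the remote side created for this Request.

```go
// Package cleanup sketches the "monitor based" idea with simplified stand-in
// types; the real element would wrap networkservice.NetworkServiceClient from
// the NSM SDK.
package cleanup

import "context"

// conn is a simplified stand-in for the NSM Connection.
type conn struct{ id string }

// client is a simplified stand-in for the next client in the chain. find is a
// hypothetical helper that opens a monitor stream to the remote side and looks
// up the Connection created for this Request, if any.
type client interface {
	Request(ctx context.Context, connID string) (*conn, error)
	Close(ctx context.Context, c *conn) error
	find(ctx context.Context, connID string) (*conn, bool)
}

// cleanupClient sits before the rest of the client chain: on initial Request
// failure it checks the remote side via the monitor stream and closes any
// Connection that was left behind there.
type cleanupClient struct {
	next client
}

func (c *cleanupClient) Request(ctx context.Context, connID string) (*conn, error) {
	result, err := c.next.Request(ctx, connID)
	if err == nil {
		return result, nil
	}
	// The initial Request failed before the Connection came back to us. If the
	// remote side still holds a corresponding Connection, close it now instead
	// of waiting for the (NSMgr -> NSE) token timeout.
	if leaked, ok := c.next.find(context.Background(), connID); ok {
		_ = c.next.Close(context.Background(), leaked)
	}
	return nil, err
}
```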

Backward monitoring solution

We can add an establish API (a sketch of the server-side handling follows the list):

  1. The Client requests a stream from the Server on the Request context and sends an initial event with its connection ID.
  2. The Client requests the Server.
  3. The Server receives a Request for which it already has a stream and calls next.Request.
  4. If the path wasn't modified during the Request, the Server closes the stream.
  5. The Server returns back the Connection.
  6. The Client sends a close event and closes the stream.
  7. The Server receives the close event and also closes the stream.
  8. If the Server doesn't receive a close event but the stream closes with a context timeout, it closes the Connection on its side.
  • This solution doesn't affect healing in any way, because it works only with initial Requests.
  • This solution will work in case of a sudden network failure (solution 1 simply can't send Close in such a case).
  • But it needs a new API and backward monitoring.
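
A rough sketch of the server side of such an establish stream, with hypothetical types (this API does not exist in NSM yet): if the stream ends without a close event, the Server closes the Connection it holds.

```go
// Package establish sketches the server side of the proposed "establish" API.
// The types and names here are hypothetical.
package establish

import "context"

// event is a hypothetical message on the establish stream: the Client sends an
// initial event carrying its connection ID (step 1) and a close event once it
// has received the Connection back (step 6).
type event struct {
	connID  string
	isClose bool
}

// serveEstablish handles one establish stream on the Server for the lifetime
// of the corresponding Request. closeConn stands in for closing the local
// Connection with the given ID.
func serveEstablish(ctx context.Context, events <-chan event, closeConn func(connID string)) {
	var connID string
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return // Client closed the stream after the path check (steps 4, 7)
			}
			if ev.isClose {
				return // the Connection reached the Client, nothing to clean up
			}
			connID = ev.connID // initial event: remember which Connection this stream guards
		case <-ctx.Done():
			// No close event and the stream died with a context timeout (step 8):
			// the Connection never made it back to the Client, so close it here.
			if connID != "" {
				closeConn(connID)
			}
			return
		}
	}
}
```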

@glazychev-art glazychev-art self-assigned this Aug 3, 2021
@denis-tingaikin
Member

@edwarnicke Do you agree with the solution from @Bolodya1997 and @glazychev-art in #1051?

To me it looks a bit complicated, but I'm fine with it if you are OK with it.

@Bolodya1997
Author

If this gets fixed, we need to set the request timeout in heal tests back to 15s - networkservicemesh/deployments-k8s#2541.

@Bolodya1997
Author

@edwarnicke
Should we think about something other than a timeout to close leaked connections?

@Bolodya1997
Author

This can possibly lead (and sometimes does lead) to very painful bugs with kernel interfaces in the VPP Forwarder - networkservicemesh/sdk-vpp#315.

@denis-tingaikin
Member

denis-tingaikin commented Sep 1, 2021

@Bolodya1997
The currently proposed solution is not better than simply using a timeout.
So I'm moving this to the backlog until we get the motivation to come back to it and find a simpler solution.

@denis-tingaikin denis-tingaikin removed the question Further information is requested label Sep 1, 2021
@denis-tingaikin denis-tingaikin added the stability The problem is related to system stability label Sep 7, 2021
@denis-tingaikin
Member

denis-tingaikin commented Sep 29, 2021

@edwarnicke Should we do this before the release? I think we can try testing a behaviour that handles ctx.Done() and calls next.Server(ctx).Close(ctx, ...).

@edwarnicke
Member

@denis-tingaikin I think this is probably a good idea... I'd suggest we do it in its own chain element and place it before 'begin' (as it will then clean up begin). It should be pretty quick and easy :) A minimal sketch of such an element is below.
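
A minimal sketch of such an element, using simplified stand-in interfaces rather than the real networkservice.NetworkServiceServer: after next.Request succeeds, it checks whether the request context is already done and, if so, closes the Connection immediately instead of letting it wait for the token timeout.

```go
// Package timeoutcleanup sketches the chain element suggested above with
// simplified stand-in types; the real element would implement
// networkservice.NetworkServiceServer and be placed before 'begin' in the chain.
package timeoutcleanup

import "context"

// conn is a simplified stand-in for the NSM Connection.
type conn struct{ id string }

// server is a simplified stand-in for the next server in the chain.
type server interface {
	Request(ctx context.Context, connID string) (*conn, error)
	Close(ctx context.Context, c *conn) error
}

type cleanupServer struct {
	next server
}

func (s *cleanupServer) Request(ctx context.Context, connID string) (*conn, error) {
	result, err := s.next.Request(ctx, connID)
	if err != nil {
		return nil, err
	}
	select {
	case <-ctx.Done():
		// The request context is already done, so the Connection will never
		// reach the Client. Close it right away; a fresh context is used
		// because ctx has expired.
		_ = s.next.Close(context.Background(), result)
		return nil, ctx.Err()
	default:
		return result, nil
	}
}

func (s *cleanupServer) Close(ctx context.Context, c *conn) error {
	return s.next.Close(ctx, c)
}
```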
