interface leak on initial Request cancel #1133
Comments
Another example where the NSE side interface load-balan-t6nu got stuck.
filter string: (same forwarder logs) |
I was able to reproduce it on a 4-node KIND cluster with a basic deployment of kernel-nse and kernel-nsc by tweaking the NSM_REQUEST_TIMEOUT parameter. I double-checked that the requests in the nsc kept failing due to context timeout, left the system running for a couple of minutes, and monitored the forwarder logs for "vppapi TapCreateV3 returned error" printouts.
|
I checked two ideas to handle the interface cleanup: Other possibilities to check:
|
I'm not in favor of option 1, since IMHO parallel actions could alter the interface index list in the meantime. What if create employed an independent (background) context to execute the VPP API call TapCreateV3? |
The problem should be resolved in v1.14.0-rc.3. Could you please check it? |
Hi, Regarding this one: #1133 (comment) I also ran a test with NSM 1.14.0-rc.4, which looked much better than rc.3: there were still excess interfaces in the NSE, but far fewer than with rc.3. I couldn't spot any forwarder restarts due to OOM either. |
Hi, @zolug. Could you please also test |
Hi, In case of #1133 (comment): When running Meridio while a script continuously removed the LB NSE, so that it was only available intermittently for short periods, the excess interfaces were not cleaned up after stopping the script that interfered with the availability of the NSE. What I observed in the second case was that, despite the request context's timeout/cancel, begin let the "connect" continue. Thus, from the perspective of nsmgr and forwarder, a connection including interfaces was established, and on a subsequent try/request the NSC was able to recover the connection through Monitor Connection via nsmgr. But as soon as the nsmgr got a failure response from the forwarder during one of the "reconnects" because the NSE was not around, the connection state was changed to Down. In the NSC this triggered a reconnect with reselect instead of an ordinary request, prompting the nsmgr to close the "old connection" first. Now, this also failed in my case despite the 1 minute timeout in the beginServer (due to the NSE not being found), and the cleanup could not remove the interface because the context had already expired.
IMHO, it would be preferable to have a separate context for tap cleanup (and probably for create as well).
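To illustrate the "separate context for cleanup" idea, here is a minimal Go sketch. It is not sdk-vpp code: `detachedContext`, `deleteTap` and `cleanupTimeout` are hypothetical names, and `deleteTap` only simulates a VPP call that honours its context. It assumes Go 1.21 or newer for `context.WithoutCancel`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cleanupTimeout is a hypothetical value chosen for illustration only.
const cleanupTimeout = 2 * time.Second

// detachedContext keeps the parent's values but ignores its cancellation
// and deadline; the returned context is bounded by its own timeout.
func detachedContext(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), timeout)
}

// deleteTap is a hypothetical stand-in for a VPP delete call that
// gives up as soon as its context is done.
func deleteTap(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(10 * time.Millisecond): // pretend the API call takes 10ms
		return nil
	}
}

// cleanupWithExpiredParent reports whether the delete succeeds when run with
// the already-cancelled request context and when run with a detached context.
func cleanupWithExpiredParent() (withParent, withDetached bool) {
	parent, cancel := context.WithCancel(context.Background())
	cancel() // the request context is already done

	withParent = deleteTap(parent) == nil

	ctx, cancelCleanup := detachedContext(parent, cleanupTimeout)
	defer cancelCleanup()
	withDetached = deleteTap(ctx) == nil
	return withParent, withDetached
}

func main() {
	withParent, withDetached := cleanupWithExpiredParent()
	fmt.Println(withParent, withDetached) // prints "false true"
}
```

The detached context still needs its own timeout, otherwise a hanging VPP call would block the cleanup indefinitely.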
|
Hi, @zolug! The problem is related to a small context timeout before any |
@Ex4amp1e Could you please share the current technical solution and decomposition to cover the issue? Thanks. |
@denis-tingaikin We tried a draft solution that overrides the contexts wherever VPP operations are used: networkservicemesh/sdk-vpp#867. But we decided on another solution: overriding the VPP connection instead of overriding the create/delete function contexts in general. This will cover all VPP operations and make them safe. Decomposition and plan:
|
This problem looks resolved. @zolug if you get a chance, could you check this image |
Unfortunately, I can still see the problem. I have attached a forwarder log file where the problem is visible. Check id: |
@NikitaSkrynnik It looks like we missed the branches where the incoming request had enough time left but was cancelled. At the moment we just return the same context there, so it can still be cancelled. I think these lines should be covered: |
@zolug We've handled the mentioned context case, so please check ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.14.2-rc.1 when you get a chance. |
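The missed branch discussed above (enough time left on the request, but the request gets cancelled) can be sketched roughly as follows. This is an illustration under assumptions, not the actual sdk-vpp change: `naiveExtend` and `alwaysDetach` are hypothetical helpers, and `context.WithoutCancel` requires Go 1.21 or newer.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// naiveExtend (hypothetical) only replaces the context when too little time
// is left; otherwise it returns the caller's context unchanged, so a later
// cancel by the caller still aborts the VPP operation.
func naiveExtend(ctx context.Context, min time.Duration) (context.Context, context.CancelFunc) {
	if dl, ok := ctx.Deadline(); !ok || time.Until(dl) >= min {
		return ctx, func() {} // same context: still cancellable by the caller
	}
	return context.WithTimeout(context.WithoutCancel(ctx), min)
}

// alwaysDetach (hypothetical) detaches from the caller's cancellation in
// every branch and keeps at least min, or the remaining time if it is longer.
func alwaysDetach(ctx context.Context, min time.Duration) (context.Context, context.CancelFunc) {
	timeout := min
	if dl, ok := ctx.Deadline(); ok && time.Until(dl) > timeout {
		timeout = time.Until(dl)
	}
	return context.WithTimeout(context.WithoutCancel(ctx), timeout)
}

// survivesCancel reports whether each derived context is still usable after
// the request (which had plenty of time left) is cancelled.
func survivesCancel() (naive, detached bool) {
	request, cancel := context.WithTimeout(context.Background(), time.Minute)

	n, cancelN := naiveExtend(request, time.Second)
	defer cancelN()
	d, cancelD := alwaysDetach(request, time.Second)
	defer cancelD()

	cancel() // the client gives up although the deadline is far away
	return n.Err() == nil, d.Err() == nil
}

func main() {
	naive, detached := survivesCancel()
	fmt.Println(naive, detached) // prints "false true"
}
```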
An interface can be leaked if the initial Request's context gets cancelled before the connection is established.
The problem might occur in kernelTap if the context gets cancelled during the vppConn tapCreate call. In that case the call returns with an error, so the interface is not saved into the ifindex storage.
Therefore, even though the Request should initiate a Close due to the failed kernelTap create (e.g. in kernelTapServer), del won't be able to look up the interface in the storage.
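The failure mode can be modelled with a small, self-contained Go sketch. Everything in it is hypothetical (`fakeVPP`, `ifIndexStore`, `create`, `del` are not sdk-vpp types or functions), and it assumes that VPP finishes creating the tap even though the caller's context is done before the reply is processed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// fakeVPP is a hypothetical stand-in for the VPP dataplane.
type fakeVPP struct {
	nextIndex uint32
	taps      map[uint32]string // swIfIndex -> interface name
}

// tapCreate models a call where VPP creates the tap, but the caller gets
// an error because its context is already done (assumed behaviour).
func (v *fakeVPP) tapCreate(ctx context.Context, name string) (uint32, error) {
	v.nextIndex++
	v.taps[v.nextIndex] = name // the interface now exists in VPP
	if err := ctx.Err(); err != nil {
		return 0, fmt.Errorf("tap create %q: %w", name, err)
	}
	return v.nextIndex, nil
}

// ifIndexStore models the per-connection ifindex storage.
type ifIndexStore map[string]uint32

func create(ctx context.Context, v *fakeVPP, s ifIndexStore, connID, name string) error {
	idx, err := v.tapCreate(ctx, name)
	if err != nil {
		return err // the index is never stored on the error path
	}
	s[connID] = idx
	return nil
}

func del(v *fakeVPP, s ifIndexStore, connID string) error {
	idx, ok := s[connID]
	if !ok {
		return errors.New("no ifindex stored for " + connID)
	}
	delete(v.taps, idx)
	delete(s, connID)
	return nil
}

// leakedAfterCancelledCreate runs create with a cancelled context, then the
// Close-style cleanup, and returns how many taps are left in the fake VPP.
func leakedAfterCancelledCreate() int {
	v := &fakeVPP{taps: map[uint32]string{}}
	s := ifIndexStore{}
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	if err := create(ctx, v, s, "conn-1", "tap-1"); err != nil {
		_ = del(v, s, "conn-1") // Close after the failed create finds nothing to delete
	}
	return len(v.taps)
}

func main() {
	fmt.Println("leaked taps:", leakedAfterCancelledCreate()) // prints "leaked taps: 1"
}
```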
NSM version: v1.13.2-rc.1
Logs:
Trace the following request:
Use filter string:
load-balan-2ezO|9a29a6db-aed2-48d4-b5fa-c75f7337328e|95b833d9-10fc-43d4-9c30-54f2ad7f1c2b|load-balan-tZer
NSC side load-balan-2ezO interface will not be removed.
nsm_forwarder-vpp-mwbc2_forwarder-vpp.log
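For anyone without a log viewer at hand, the filter string above can be applied with a few lines of Go. This is only a convenience sketch: it treats the filter as a regular expression (an alternation of literal IDs) and makes no assumption about the log line format; `matchLines` is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// filter is the filter string from the issue, used as a regular expression.
const filter = `load-balan-2ezO|9a29a6db-aed2-48d4-b5fa-c75f7337328e|95b833d9-10fc-43d4-9c30-54f2ad7f1c2b|load-balan-tZer`

// matchLines returns the lines that match pattern, in their original order.
func matchLines(pattern string, lines []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, l := range lines {
		if re.MatchString(l) {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, l := range matchLines(filter, lines) {
		fmt.Println(l)
	}
}
```

Usage, assuming the file is saved as `filter.go`: `go run filter.go < nsm_forwarder-vpp-mwbc2_forwarder-vpp.log`.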