NSM usage on high load #1031
Comments
@edwarnicke Do we need these improvements? Or would fixing just issues [1-3] (or probably [1-2]) already be enough for us?
Here's the better question: why is 50 Clients exhausting resources enough to lead to this behavior?
Actually it is not 50 clients - it is 50 clients sending Close and 50 clients sending Request, so it is more like 100 clients.
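For illustration only, a minimal Go sketch of that load pattern; `doRequest` and `doClose` are hypothetical stand-ins for the real NSM client calls, and the 15 s timeout is an assumption:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// doRequest and doClose are hypothetical stand-ins for the real NSM client
// calls (Request/Close towards NSMgr); only the concurrency pattern matters here.
func doRequest(ctx context.Context, id int) error { return nil }
func doClose(ctx context.Context, id int) error   { return nil }

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(2)
		// 50 clients requesting new connections...
		go func(id int) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) // timeout value is illustrative
			defer cancel()
			if err := doRequest(ctx, id); err != nil {
				log.Printf("request %d failed: %v", id, err)
			}
		}(i)
		// ...while another 50 close their existing connections at the same time.
		go func(id int) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
			defer cancel()
			if err := doClose(ctx, id); err != nil {
				log.Printf("close %d failed: %v", id, err)
			}
		}(i)
	}
	wg.Wait()
}
```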
Good to know. But even so... I'm surprised that 100 Clients is causing problems. Do we know why? What's the bottleneck in NSM? Or is NSM just getting throttled out by 100 Pods sharing the Node?
Yes, the whole NSM just starts working incredibly slowly.
@Bolodya1997 OK.. do we have a sense of why?
I guess mostly because of swapping.
Ah... what is using the memory?
Just rechecked right now: it is not related to swapping - RAM consumption only reaches 40% on the VM.
Tested this on Packet; everything is OK for 50 (100 at peak) clients.
That's interesting... do you have the detailed logs that show the particular things that are taking so long to program? I typically see that sort of thing taking less than 100ms locally... so I'm curious where the bottleneck is in your Packet runs. You can get the detailed logs by setting NSM_LOG_LEVEL to 'DEBUG'.
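As a minimal sketch of one way to surface such slow calls, assuming a plain google.golang.org/grpc server (the interceptor and the 100 ms threshold are illustrative, not NSM sdk API):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// slowCallLogger logs every unary gRPC call that takes longer than threshold,
// which helps to spot which requests blow past the ~100ms expectation.
func slowCallLogger(threshold time.Duration) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		start := time.Now()
		resp, err := handler(ctx, req)
		if elapsed := time.Since(start); elapsed > threshold {
			log.Printf("slow gRPC call: %s took %s (err: %v)", info.FullMethod, elapsed, err)
		}
		return resp, err
	}
}

func main() {
	// Illustrative wiring only; real NSM components assemble their servers
	// inside the sdk, so this just shows where such an interceptor would plug in.
	srv := grpc.NewServer(grpc.UnaryInterceptor(slowCallLogger(100 * time.Millisecond)))
	defer srv.Stop()
}
```

The DEBUG logs mentioned above give the same information per chain element; this only illustrates the measurement idea.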
Here are the logs for a single client:
Filed an issue to track a VPP problem: networkservicemesh/sdk-vpp#345.
Description
This issue shows the results of an investigation into NSM behavior under high load.
Context
Environment: 8 GB RAM, 2x2.4 GHz CPU, Ubuntu 18.04.5 LTS (VM)
Client: `cmd-nsc` with setup:
Endpoint: `cmd-icmp-responder` with setup:
Test scenario
Behavior
When CPU/memory resources run short, NSM starts working slowly: passing a chain element can take 1-2 s, and performing a gRPC request can also take 1-2 s.
This results in Request/Close timeouts for the new/old Clients and so reveals the issues referred to below as [1]-[3].
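To make the effect of such per-element delays concrete, here is a self-contained Go sketch (not sdk code; the element count, delays, and the 5 s deadline are illustrative) showing how a chain of slow elements blows past a request's context deadline:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// element models a single chain element; each one spends ~1.5 s of work
// (matching the per-element delays observed under high load) before calling
// the next element in the chain.
type element func(ctx context.Context) error

func slowElement(next element) element {
	return func(ctx context.Context) error {
		select {
		case <-time.After(1500 * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err() // the deadline fired before this element finished
		}
		if next != nil {
			return next(ctx)
		}
		return nil
	}
}

func main() {
	// A chain of 5 elements needs ~7.5 s, but the request context allows only 5 s,
	// so the call fails with "context deadline exceeded" - a Request/Close timeout.
	var chain element
	for i := 0; i < 5; i++ {
		chain = slowElement(chain)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	fmt.Println(chain(ctx))
}
```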
So currently we have the following state for the NSM in the different periods of time (per-period tables omitted; the tracked states are `Requests are failing`, `Closes are failing`, `resources are leaking`, and `cmd-nsc` pod restart is failing).

So in general we have NSM still working and Client pods eventually connecting to the NSM after some retries even on high load, but we have problems with leaked resources even after the high load ends and the timeout happens.
Fixing [1, 2] would lead us to the following state (table omitted; same states as above).

Fixing [3] can lead us very close to the following state (table omitted; same states as above).
Actually, `resources are leaking` is fully caused by `Requests, Closes are failing`. Fixing [3] should even mean it is caused only by `Closes are failing`. It doesn't look like we can fully fix this issue, because in the worst case the `Close` event may not reach NSMgr until the context timeout happens (networkservicemesh/deployments-k8s#2085), but we can try to improve it in different ways (a rough sketch of the general idea follows the list):
- `closer` server to the NSMgr chain #1032
- `connect` client #1033
- `queue` server chain element #1034
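As a rough sketch of the general direction (retrying `Close` instead of dropping it after a single context timeout), not necessarily what #1032-#1034 implement; `closeFn` is a hypothetical stand-in for the actual call towards NSMgr:

```go
package main

import (
	"context"
	"log"
	"time"
)

// closeFn is a hypothetical stand-in for the real Close call towards NSMgr.
type closeFn func(ctx context.Context) error

// retryClose retries Close with a fresh per-attempt timeout until it succeeds
// or the overall budget expires, instead of giving up after a single context
// timeout and leaking the connection resources.
func retryClose(parent context.Context, closeConn closeFn, attemptTimeout time.Duration) error {
	backoff := 100 * time.Millisecond
	for {
		attemptCtx, cancel := context.WithTimeout(parent, attemptTimeout)
		err := closeConn(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		log.Printf("close failed, retrying in %s: %v", backoff, err)
		select {
		case <-time.After(backoff):
		case <-parent.Done():
			return parent.Err() // overall cleanup budget exhausted
		}
		if backoff < 5*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	// Illustrative usage: give the whole cleanup a generous overall budget.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	_ = retryClose(ctx, func(ctx context.Context) error { return nil }, 5*time.Second)
}
```

The same retry idea applies whether it lives in the client or in a dedicated server chain element.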