missing interface in NSE after relocation #9863
Comments
@ljkiraly Hi, I tried to reproduce the problem several times, but I didn't see any errors. I tested this setup on the main branch. Could you please test it using the main branch too?
@NikitaSkrynnik My apologies, I missed that I had made a change in my basic setup. I mentioned it in the description, but I forgot to add to the reproduction steps that two registry-k8s instances are used.
I will update the description as well. Maybe this behavior has the same root cause as the issue described at the last community call. On that setup two registry-k8s pods are also running, and the unregister request was sent to a registry which could not handle the request.
@ljkiraly Should be fixed in v1.11.0. Please let us know if it's still reproducible.
@denis-tingaikin Verification result is good. Thanks. |
@denis-tingaikin @NikitaSkrynnik Unfortunately, with NSM v1.11.2 the problem occurred again. I was able to reproduce it as well, using the steps described above. I attached the logs. Earlier our nightly tests succeeded with the previous NSM version; they started failing with NSM v1.11.2.
Hi @denis-tingaikin, @NikitaSkrynnik,
Thanks @ljkiraly,
@glazychev-art , Compressed and updating again. Thanks for the notice. |
Seems like this one is fixed. Please feel free to reopen if it's reproducing. |
Expected Behavior
The restoration time of interfaces in the NSE should be more deterministic.
Current Behavior
When multiple clients are deployed with two registries, some interfaces are missing on the NSE after NSE relocation.
The node where the NSE pod runs was cordoned and the NSE was relocated. After the NSE restarted on another node, one connection to an NSC failed to be restored. Based on the logs, the deleted NSE remains stored in the registry/etcd.
Failure Information (for bugs)
The restoration time varies: it sometimes takes more than 150 seconds, sometimes completes promptly, and sometimes never completes. Note that two instances of registry-k8s are used. An IPv6 address range is set on the NSE because the issue seemingly pops up more frequently with IPv6 addresses.
What is the expected NSE expiration time? How does it depend on NSM_MAX_TOKEN_LIFETIME? (Is the description in this PR still valid? networkservicemesh/sdk#1404)
Steps to Reproduce
cluster-config
NSE, NSC yamls
README.md
> cat README.md
# Remote NSE death
This example shows that NSM keeps working after the remote NSE death.
NSC and NSE are using the kernel mechanism to connect to their local forwarder. Forwarders are using the vxlan mechanism to connect with each other.
Requires
Make sure that you have completed steps from basic or memory setup.
Run
Deploy NSC and NSE:
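The apply command itself is missing from the paste; a minimal sketch, assuming the NSC/NSE yamls attached above are collected in a local kustomization (the `.` path is a placeholder, not from the original):

```shell
# Sketch only: apply the kustomization containing the NSC/NSE yamls
# listed above into the namespace used by the rest of the steps.
kubectl apply -k . -n ns-remote-nse-death
```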
Wait for the applications to be ready:
kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine -n ns-remote-nse-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine1 -n ns-remote-nse-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine2 -n ns-remote-nse-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=nse-kernel -n ns-remote-nse-death
Find NSE pod by label:
NSE=$(kubectl get pods -l app=nse-kernel -n ns-remote-nse-death --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
NSE:
Cordon NSE node:
New NSE:
Wait for new NSE to start:
kubectl wait --for=condition=ready --timeout=2m pod -l app=nse-kernel -n ns-remote-nse-death
Find new NSE pod:
NSE=$(kubectl get pods -l app=nse-kernel -n ns-remote-nse-death --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
Check the new NSE:
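The check command is missing from the paste; a sketch of what can be verified at this point (the exact interface layout is an assumption, not from the original):

```shell
# List the interfaces inside the new NSE pod; one kernel interface per
# connected NSC is expected. In the failure described in this issue,
# one of these interfaces is missing after relocation.
kubectl exec "$NSE" -n ns-remote-nse-death -- ip addr
```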
Cleanup
Delete ns:
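The delete command is missing from the paste; given the namespace used throughout the steps, the cleanup presumably deletes it:

```shell
# Remove the example namespace and everything deployed into it.
kubectl delete ns ns-remote-nse-death
```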
kubectl uncordon $NSE_NODE
Failure Logs
missing-interface.zip