Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes vl3 network interface is recreated #8351

Closed
d-uzlov opened this issue Dec 21, 2022 · 9 comments
Closed

Sometimes vl3 network interface is recreated #8351

d-uzlov opened this issue Dec 21, 2022 · 9 comments
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@d-uzlov
Copy link
Contributor

d-uzlov commented Dec 21, 2022

Expected Behavior

NSM network interface is created once and stays working until it is explicitly closed, usually when deleting the pod.

Current Behavior

I somehow got my cluster to a state when one of the nsm interfaces was regularly deleted and recreated, making it impossible to use.

Failure Information (for bugs)

I notices the issue when running tcpdump:

$ k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap                 [14:12:15]
tcpdump: listening on nsm-1, link-type RAW (Raw IP), snapshot length 262144 bytes
tcpdump: pcap_loop: The interface disappeared
8 packets captured
8 packets received by filter
0 packets dropped by kernel
command terminated with exit code 1

I tried to capture tcpdump on istiod pod.
The interface was stable at first.
Then I deployed a second pod, with istio-proxy, that tried to connect to istiod via the nsm interface.
Right after this second pod was created nsm interface got recreated (tcp dump said "The interface disappeared", but ip a still showed the nsm-1 interface in the list).
When I tried to run tcpdump again, I got the same error after a few seconds.
The interface became stable again after I deleted the pod with istio-proxy.

I tried this a few more times, and each time tcpdump complained about interface disappearing while istio-proxy pod existed.
Few times I even got the error that interface doesn't exist. Apparently, there is some time between interface deletion and recreation events.

Steps to Reproduce

Unfortunately I was unable to recreate this issue.
I was experimenting with istio+nsm integration, and one time I randomly got this behavior.
But I took the cluster-info dump logs.

Here are rough instructions of the setup I was testing:
https://github.com/d-uzlov/deployments-k8s/blob/deployment-issue-do-not-delete/examples/interdomain/nsm_istio_vl3/clean/
There are also uncompressed cluster-info dump logs in this branch.

I'm not sure if these instructions will be helpful because usually they don't produce the error described in this issue.

Context

  • Kubernetes Version: v1.25
  • NSM Version: v1.7.0-rc.2

Failure Logs

Logs file from kubectl cluster-info dump:
dump-1.zip

@denis-tingaikin denis-tingaikin added the bug Something isn't working label Dec 21, 2022
@d-uzlov
Copy link
Contributor Author

d-uzlov commented Dec 22, 2022

I got this issue again.

What I did:

I actually just refreshed my cluster, so there is not much.
I took a capture of my console command history:

 1401* ../delete-cluster.sh; ../create-cluster.sh; ../init-multicluster-2.sh
 1402  kubectl --kubeconfig=$KUBECONFIG1 apply -k ./vl3-dns\nsleep 0.5\nkubectl --kubeconfig=$KUBECONFIG1 -n ns-dns-vl3 wait --for=condition=ready --timeout=1m pod -l app=nse-vl3-vpp\nkubectl --kubeconfig=$KUBECONFIG1 -n ns-dns-vl3 wait --for=condition=ready --timeout=1m pod -l app=vl3-ipam\n
 1403  kubectl --kubeconfig=$KUBECONFIG1 apply -f istio-namespace.yaml\nistioctl install --set profile=minimal -y --kubeconfig=$KUBECONFIG1 --set meshConfig.accessLogFile=/dev/stdout\nk1 exec -n istio-system deployments/istiod -c cmd-nsc -- apk add tcpdump\n
 1404  WORK_DIR="$(git rev-parse --show-toplevel)/examples/interdomain/nsm_istio_vl3/clean/istio-vm-configs"\nVM_APP="vm-app"\nVM_NAMESPACE="vm-ns"\nSERVICE_ACCOUNT="serviceaccountvm"\nCLUSTER_NETWORK=""\nVM_NETWORK=""\nCLUSTER="Kubernetes"\n
 1405  kubectl --kubeconfig=$KUBECONFIG1 create namespace "${VM_NAMESPACE}"\nkubectl --kubeconfig=$KUBECONFIG1 create serviceaccount "${SERVICE_ACCOUNT}" -n "${VM_NAMESPACE}"\n
 1406  k1 exec -n istio-system deployments/istiod -c cmd-nsc -- ip a
 1407  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >1-istio-standard.pcap &\nsleep 1\nk1 apply -k ubuntu-standard\nsleep 0.5\nk1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu\nsleep 1\nkill -2 $!\nsleep 1\nk1 delete -k ubuntu-standard\ntshark -r 1-istio-standard.pcap\n
 1408  istioctl x workload entry configure -f workloadgroup.yaml -o "${WORK_DIR}" --clusterID "${CLUSTER}" --kubeconfig=$KUBECONFIG1 --ingressIP=172.16.0.2\n# sed -i '' 's/15012/15010/' "${WORK_DIR}/mesh.yaml"\nrm -rf ubuntu-standard/istio-vm-configs\nrm -rf ubuntu-hosts/istio-vm-configs\nrm -rf ubuntu-hosts-2/istio-vm-configs\nrm -rf ubuntu-hosts-2-vmlike/istio-vm-configs\ncp -r "${WORK_DIR}" ubuntu-standard/istio-vm-configs\ncp -r "${WORK_DIR}" ubuntu-hosts/istio-vm-configs\ncp -r "${WORK_DIR}" ubuntu-hosts-2/istio-vm-configs\ncp -r "${WORK_DIR}" ubuntu-hosts-2-vmlike/istio-vm-configs\n
 1409  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-1-nsm.pcap &\nsleep 1\nk1 apply -k ubuntu-hosts\nsleep 0.5\nk1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu\nk1 -n vl3-test get pod\n# k1 -n vl3-test exec deployments/ubuntu-deployment -c istio-proxy -- curl 172.16.0.2:8080/ready -vs\nsleep 1\nkill -2 $!\ntime k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap\nsleep 1\nk1 delete -k ubuntu-hosts\ntshark -r 4-istio-tcpdump-1-nsm.pcap
 1410  k1 delete -k ubuntu-hosts
 1411  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-1-nsm.pcap &\nsleep 1\nk1 apply -k ubuntu-hosts\nsleep 0.5\nk1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu\nk1 -n vl3-test get pod\nk1 -n vl3-test exec deployments/ubuntu-deployment -c istio-proxy -- curl 172.16.0.2:8080/ready -vs\nsleep 1\nkill -2 $!\n# time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap\nsleep 1\nk1 delete -k ubuntu-hosts\ntshark -r 4-istio-tcpdump-1-nsm.pcap\n# tshark -r 4-istio-tcpdump-2-nsm.pcap\n
 1412  jobs
 1413  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-1-nsm.pcap &\nsleep 1\nk1 apply -k ubuntu-hosts\nsleep 0.5\nk1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu\nk1 -n vl3-test get pod\nk1 -n vl3-test exec deployments/ubuntu-deployment -c istio-proxy -- curl 172.16.0.2:8080/ready -vs\nsleep 1\nkill -2 $!\n# time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap\nsleep 1\nk1 delete -k ubuntu-hosts
 1414  k1 cluster-info dump --output yaml --output-directory dump-2 --all-namespaces
 1415  cp ~/.bash_history .
 1416  cp ~/.zsh_history .
 1417  history 0

So, basically:
1401 — delete previous cluster, create a new one, initialize it with interdomain setup (2 clusters with metallb+dns+spire+nsm)
-- I waited some time after running this command, maybe about 0.5h
1402-1406 — deploy vl3 nse, deploy istiod client
1407 — test deployment that doesn't use nsm interface
1408 — more istio initialization
1409 — start tcpdump, deploy vl3 client with istio, wait for readiness, cleanup (kill tcpdump, delete deployment)
-- I canceled the command with ctrl+C (cleanup part didn't run)
1410 — delete deployment. I didn't stop tcpdump. Not sure if this impacted the issue.
1411 — repeat previous command
-- after command 1411 I saw some tcpdump errors from background, but I think it were different errors
-- tcp dump file was corrupted due to simultaneous writes from 2 different instances
1412 — run the same command

After 1412 I immediately saw the error from tcpdump that the interface has disappeared, and run cluster-info dump before doing anything else.

I'm not sure if I or someone else would be able to replicate this by running the commands as I described above. I myself haven't tried replicating this yet.
I suspect that this issue has stochastic nature, so it will be hard to reproduce reliably.
Still maybe this list of commands would be helpful.

In NSM logs I see repeated close, request, close, request, close, request, .... Both istiod and test client (istio-proxy) have this pattern in cmd-nsc sidecar.

Logs

The logs are aken right after I was the issue first time with this cluster, so they must be much cleaner than previous ones.
dump-2.zip

All related information can be found here:

https://github.com/d-uzlov/deployments-k8s/tree/deployment-issue-do-not-delete-2/examples/interdomain/nsm_istio_vl3/clean

@d-uzlov
Copy link
Contributor Author

d-uzlov commented Jan 10, 2023

I checked what happens if I disable NSC liveness check (I modified webhook to disable it on all clients).
It turns out the after we re-deploy a client in the second cluster this new client gets ip 172.16.0.2.
This ip was previously assigned to istiod.

I checked pings between vl3 clients.
Client in the second cluster .0.2 can't ping anyone (except itself).
NSE .0.1 can ping sample server .0.3.
Sample server .0.3 can ping NSE .0.1 and .0.2 (which is probably istiod)
Istiod .0.2 can ping sample server .0.3 but ping to NSE .0.1 fails.

Here are logs from both clusters in this state:
dump-3.zip

@d-uzlov
Copy link
Contributor Author

d-uzlov commented Jan 10, 2023

I noticed that vl3 NSE has restarted in the setup above.
I captured logs from its previous run.

00-1-nse-prev.log

It looks like it had crashed due to SIGSEGV in memif.
And apparently after it has restarted something went wrong.

@d-uzlov
Copy link
Contributor Author

d-uzlov commented Jan 10, 2023

And apparently after it has restarted something went wrong.

@glazychev-art hinted me that in the current version of NSM this case isn't handled properly yet, and here's the discussion:

@wazsone
Copy link
Contributor

wazsone commented Jan 17, 2023

Stack trace from restarted nse container from @d-uzlov

Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: memif_plugin: memif2868466484/0: memif_msg_receive_disconnect: disconnect received
Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: received signal SIGSEGV, PC 0x7fc2e6959299, faulting address 0x4
Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #0  0x00007fc2e672edf0 0x7fc2e672edf0
Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #1  0x00007fc2e66563c0 0x7fc2e66563c0
Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #2  0x00007fc2e6959299 vnet_hw_if_generate_rxq_int_poll_vector + 0x69
Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #3  0x00007fc2a366b2ea memif_input_node_fn_hsw + 0x3869a
Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #4  0x00007fc2e66cc355 0x7fc2e66cc355
Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #5  0x00007fc2e66cb285 0x7fc2e66cb285
Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #6  0x00007fc2e672e796 0x7fc2e672e796
Jan 10 06:44:46.612�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #7  0x00007fc2e65e8494 0x7fc2e65e8494

Stack traces from restarted nse and forwarder-vpp containers from my attempt to reproduce the issue

nse
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: memif_plugin: memif2557108433/0: memif_msg_receive_disconnect: disconnect received
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: received signal SIGSEGV, PC 0x7f81b70dc299, faulting address 0x4
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: #0  0x00007f81b6eb1df0 0x7f81b6eb1df0
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: #1  0x00007f81b6dd93c0 0x7f81b6dd93c0
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: #2  0x00007f81b70dc299 vnet_hw_if_generate_rxq_int_poll_vector + 0x69
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: #3  0x00007f8173eaa28c 0x7f8173eaa28c
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: #4  0x00007f81b6e4f355 0x7f81b6e4f355
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: #5  0x00007f81b6e4e285 0x7f81b6e4e285
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: #6  0x00007f81b6eb1796 0x7f81b6eb1796
Jan 17 04:03:32.570�[36m [INFO] [cmd:vpp] �[0mvpp[17]: #7  0x00007f81b6d6b494 0x7f81b6d6b494
forwarder-vpp
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: capo_sw_interface_add_del:103: unconfiguring policies for if 13 deleted
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: capo_sw_interface_add_del:110: error deleting caiop (output): -2
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: capo_sw_interface_add_del:116: error deleting caiop (input): -2
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: received signal SIGSEGV, PC 0x7fcff461c2a9, faulting address 0x4
time="2023-01-17T04:20:31Z" level=info msg="No subscription found for the notification message." msg_id=710 msg_size=19
Jan 17 04:20:31.613�[37m [DEBU] [id:aef89aec-b5e4-4e54-abfe-7708080f368d] [duration:6.025641ms] [swIfIndex:13] [type:networkService] [vppapi:MemifDelete] �[0m(58.1)                                                            completed
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: #0  0x00007fcff43f1df0 0x7fcff43f1df0
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: #1  0x00007fcff4319420 0x7fcff4319420
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: #2  0x00007fcff461c2a9 vnet_hw_if_generate_rxq_int_poll_vector + 0x69
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: #3  0x00007fcff4bf6fd7 virtio_input_node_fn + 0x22a97
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: #4  0x00007fcff438f355 0x7fcff438f355
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: #5  0x00007fcff438e285 0x7fcff438e285
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: #6  0x00007fcff43f1796 0x7fcff43f1796
Jan 17 03:57:17.459�[36m [INFO] [cmd:vpp] �[0mvpp[3376]: #7  0x00007fcff42ab494 0x7fcff42ab494

@denis-tingaikin
Copy link
Member

@edwarnicke Looks like the issue is on the vpp/govpp side. Could you have a look at the stack trace? ☝️

@denis-tingaikin denis-tingaikin added the help wanted Extra attention is needed label Jan 23, 2023
@wazsone
Copy link
Contributor

wazsone commented Jan 23, 2023

@wazsone
Copy link
Contributor

wazsone commented Feb 6, 2023

The issue is not reproduced with updated VPP. Seems like updating VPP has fixed the issue.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Release v1.8.0 Feb 6, 2023
@denis-tingaikin
Copy link
Member

Was fixed in networkservicemesh/cmd-forwarder-vpp#802

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
Status: Done
Development

No branches or pull requests

3 participants