
Add a timeout flag for the network command destroy subcommand #815

Closed · 5 tasks done · Tracked by #47 · Fixed by #821
jeromy-cannon opened this issue Nov 8, 2024 · 3 comments
Labels: P0 (an issue impacting production environments or impacting multiple releases or multiple individuals), Requested by Stakeholder (requested by an individual or team that uses Solo)

Comments


jeromy-cannon commented Nov 8, 2024

Per Alex Kuzmin, when there are pods currently in the Pending state, the following command will hang indefinitely:

```
solo network destroy --namespace "${SOLO_NAMESPACE}" --delete-pvcs --delete-secrets --force
```

Note: he is using the Taskfile clean target in the examples folder, so we should also update the [solo:network:destroy](https://github.com/hashgraph/solo/blob/ec63a659ae325ab1631409a76e6894deccdb0ed4/examples/custom-network-config/Taskfile.yml#L110-L110) target in examples/custom-network-config/Taskfile.yml with the recommended timeout.
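For illustration, the kind of guard such a flag could install is a simple race between the destroy work and a timer. This is only a sketch: `destroyNetworkResources`, `destroyWithTimeout`, and the flag wiring are assumptions for this example, not Solo's actual internals.

```typescript
// Hypothetical stand-in for Solo's real destroy logic (assumption for this sketch).
async function destroyNetworkResources(): Promise<void> {
  // ... delete network resources, PVCs, and secrets via the Kubernetes API ...
}

// Race the destroy work against a timer so the CLI fails fast with a clear
// error instead of waiting forever on pods stuck in Pending.
async function destroyWithTimeout(timeoutMs: number): Promise<void> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`network destroy timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  try {
    await Promise.race([destroyNetworkResources(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer); // don't keep the event loop alive
  }
}
```

A parsed timeout value from the CLI (whatever the flag ends up being named in #821) would feed `timeoutMs`, so on expiry the command exits with an error instead of waiting on Kubernetes indefinitely.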


jeromy-cannon added the P0 and Requested by Stakeholder labels on Nov 8, 2024
alex-kuzmin-hg (Contributor) commented:

New symptom: it is now hanging while cleaning healthy nodes. This step always worked fine before:

```
hashsphere1@s05:~/workspaces/10nodes/solo$ task -t Taskfile.yml clean
task: [solo:node:stop] npm run solo-test -- node stop --namespace "${SOLO_NAMESPACE}" --node-aliases-unparsed node0,node1,node2,node3,node4,node5,node6 
[solo:node:stop] 
[solo:node:stop] > @hashgraph/solo@0.99.0 solo-test
[solo:node:stop] > node --no-deprecation --no-warnings --loader ts-node/esm solo.ts node stop --namespace solo-hashsphere1 --node-aliases-unparsed node0,node1,node2,node3,node4,node5,node6
[solo:node:stop] 
[solo:node:stop] 
[solo:node:stop] ******************************* Solo *********************************************
[solo:node:stop] Version			: 0.99.0
[solo:node:stop] Kubernetes Context	: gke_hashsphere-staging_us-central1_sphere-load-test-us-central
[solo:node:stop] Kubernetes Cluster	: gke_hashsphere-staging_us-central1_sphere-load-test-us-central
[solo:node:stop] Kubernetes Namespace	: solo-hashsphere1
[solo:node:stop] **********************************************************************************
[solo:node:stop] ❯ Initialize
[solo:node:stop] ❯ Acquire lease
[solo:node:stop] ✔ Acquire lease - lease acquired successfully, attempt: 1/10
[solo:node:stop] ✔ Initialize
[solo:node:stop] ❯ Identify network pods
[solo:node:stop] ❯ Check network pod: node0
[solo:node:stop] ❯ Check network pod: node1
[solo:node:stop] ❯ Check network pod: node2
[solo:node:stop] ❯ Check network pod: node3
[solo:node:stop] ❯ Check network pod: node4
[solo:node:stop] ❯ Check network pod: node5
[solo:node:stop] ❯ Check network pod: node6
^\SIGQUIT: quit
PC=0x473721 m=0 sigcode=128

goroutine 7 gp=0xc000133c00 m=0 mp=0x135b960 [syscall]:
runtime.notetsleepg(0x13bc500, 0xffffffffffffffff)
	runtime/lock_futex.go:246 +0x29 fp=0xc00049a7a0 sp=0xc00049a778 pc=0x4105a9
os/signal.signal_recv()
	runtime/sigqueue.go:152 +0x29 fp=0xc00049a7c0 sp=0xc00049a7a0 pc=0x46e589
os/signal.loop()
	os/signal/signal_unix.go:23 +0x13 fp=0xc00049a7e0 sp=0xc00049a7c0 pc=0xa6c093
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc00049a7e8 sp=0xc00049a7e0 pc=0x471921
created by os/signal.Notify.func1.1 in goroutine 1
	os/signal/signal.go:151 +0x1f

goroutine 1 gp=0xc0000061c0 m=3 mp=0xc0000b3008 [syscall]:
syscall.Syscall6(0xf7, 0x1, 0xba98b, 0xc00010d978, 0x1000004, 0x0, 0x0)
	syscall/syscall_linux.go:91 +0x39 fp=0xc00010d940 sp=0xc00010d8e0 pc=0x4886f9
os.(*Process).blockUntilWaitable(0xc0003da3c0)
	os/wait_waitid.go:32 +0x76 fp=0xc00010da18 sp=0xc00010d940 pc=0x4f65b6
os.(*Process).wait(0xc0003da3c0)
	os/exec_unix.go:22 +0x25 fp=0xc00010da78 sp=0xc00010da18 pc=0x4f04a5
os.(*Process).Wait(...)
	os/exec.go:134
os/exec.(*Cmd).Wait(0xc00001e180)
	os/exec/exec.go:906 +0x45 fp=0xc00010dad8 sp=0xc00010da78 pc=0x6e0b45
```

(The SIGQUIT dump above comes from the Go task runner itself: goroutine 1 is blocked in os/exec.(*Cmd).Wait, i.e. task is still waiting on its npm run solo-test child process, which suggests the hang is inside Solo's node stop flow rather than in task.)


JeffreyDallas commented Nov 13, 2024

So it starts hanging at the node stop step, before the node destroy step?

Can you attach ~/.solo/logs/solo.log, and also use k9s to check the status of the network pods?
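For a non-interactive version of the k9s check, a plain kubectl query (standard kubectl flags, using the same namespace variable as the commands above) would show which pods are Pending versus Running:

```
kubectl get pods -n "${SOLO_NAMESPACE}" -o wide
```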

alex-kuzmin-hg (Contributor) replied:

[screenshot attachment: image (1)]
