Update demo #48

Open
wants to merge 1 commit into
base: main

yuanchen8911
Collaborator

This PR updates the demo script and regenerates the demo SVG file using the latest Helm chart for creating virtual nodes.

@@ -15,6 +17,10 @@ kubectl apply -f charts/overrides/kwok/pod-complete.yml
kubectl apply -f https://github.com/${KWOK_REPO}/raw/main/kustomize/stage/pod/chaos/pod-init-container-running-failed.yaml
kubectl apply -f https://github.com/${KWOK_REPO}/raw/main/kustomize/stage/pod/chaos/pod-container-running-failed.yaml

# Set up virtual nodes
Collaborator

We don't need to deploy nodes separately if we are using the Configure task.

Collaborator

Could you try to rebase and rebuild? I'm not seeing this issue.

$ ./bin/knavigator -tasks ./resources/tests/k8s/test-job.yml
I0520 12:08:39.910353 1099652 k8s_config.go:42] "Using external kubeconfig"
I0520 12:08:39.915986 1099652 main.go:84] "Starting test" name="test-k8s-job"
I0520 12:08:39.916034 1099652 engine.go:111] "Creating task" name="RegisterObj" id="register"
I0520 12:08:39.916580 1099652 engine.go:247] "Starting task" id="RegisterObj/register"
I0520 12:08:39.916600 1099652 engine.go:253] "Task completed" id="RegisterObj/register" duration="3.535µs"
I0520 12:08:39.916612 1099652 engine.go:111] "Creating task" name="Configure" id="configure"
I0520 12:08:39.916795 1099652 engine.go:247] "Starting task" id="Configure/configure"
I0520 12:08:40.802256 1099652 engine.go:253] "Task completed" id="Configure/configure" duration="885.42569ms"
I0520 12:08:40.802304 1099652 engine.go:111] "Creating task" name="SubmitObj" id="job"
I0520 12:08:40.802636 1099652 engine.go:247] "Starting task" id="SubmitObj/job"
I0520 12:08:40.850344 1099652 engine.go:253] "Task completed" id="SubmitObj/job" duration="47.67867ms"
I0520 12:08:40.850383 1099652 engine.go:111] "Creating task" name="CheckPod" id="status"
I0520 12:08:40.850559 1099652 engine.go:247] "Starting task" id="CheckPod/status"
I0520 12:08:40.850576 1099652 check_pod_task.go:158] "Create pod informer" #pod=2 timeout="5s"
I0520 12:08:40.971440 1099652 check_pod_task.go:256] "Accounted for all pods"
I0520 12:08:40.971488 1099652 engine.go:253] "Task completed" id="CheckPod/status" duration="120.910655ms"
I0520 12:08:40.971503 1099652 engine.go:259] "Reset Engine"
$ k get no
NAME                    STATUS   ROLES           AGE     VERSION
test-control-plane      Ready    control-plane   11d     v1.29.2
virtual-dgxa100.80g-0   Ready    agent           9d      fake
virtual-dgxa100.80g-1   Ready    agent           3h58m   fake
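
When comparing a passing run like the one above against a failing one, it can help to pull just the per-task timings out of the log. The following is a minimal sketch, not part of knavigator: the function name `task_durations` is hypothetical, and it assumes the log keeps the `"Task completed" id="..." duration="..."` layout shown in the output above.

```shell
# Print "<task id> <duration>" for every "Task completed" line read on stdin.
# Assumes the log line layout seen in the output above; other lines are ignored.
task_durations() {
  sed -n 's/.*"Task completed" id="\([^"]*\)" duration="\([^"]*\)".*/\1 \2/p'
}

# Three lines copied from the log above, used here as sample input.
log_sample='I0520 12:08:39.916795 1099652 engine.go:247] "Starting task" id="Configure/configure"
I0520 12:08:40.802256 1099652 engine.go:253] "Task completed" id="Configure/configure" duration="885.42569ms"
I0520 12:08:40.971488 1099652 engine.go:253] "Task completed" id="CheckPod/status" duration="120.910655ms"'

printf '%s\n' "$log_sample" | task_durations
```

Against a real run, the same function could be fed the tool's log output through a pipe instead of the sample variable.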

Collaborator Author

@yuanchen8911 May 20, 2024

I didn't realize you've added a Configure task to the test, but I've noticed the following issues with the latest virtual node configuration. Can you take a look?

  1. Virtual nodes created in the task are NotReady.
$ k get nodes
NAME                    STATUS     ROLES           AGE     VERSION
minikube                Ready      control-plane   2m12s   v1.30.0
virtual-dgxa100.80g-0   NotReady   agent           118s    fake
virtual-dgxa100.80g-1   NotReady   agent           118s    fake
  2. The job shows Running while the pods are Pending.
$ k get job
NAME   STATUS    COMPLETIONS   DURATION   AGE
job1   Running   0/2           15s        15s

$ k get pods
NAME           READY   STATUS    RESTARTS   AGE
job1-0-254qd   0/1     Pending   0          18s
job1-1-vsgkl   0/1     Pending   0          18s
  3. Running the test with the Configure task removes the virtual nodes that were previously created by helm.
  4. Deleting the test job makes all virtual nodes (from both helm and the task) become NotReady.
  5. Uninstalling the virtual node helm chart removes all virtual nodes, including those that were configured in a task.
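
To make the node-readiness symptom easy to re-check after each step (running the test, deleting the job, uninstalling the chart), the NotReady nodes can be filtered out of the node table. This is a rough sketch only: the function name `not_ready_nodes` is hypothetical, and it assumes the default `kubectl get nodes` table layout with STATUS as the second column.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads the `kubectl get nodes` table on stdin, so it also works on saved output.
# Note: a cordoned node ("Ready,SchedulingDisabled") is reported as well.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Sample input copied from the node listing in this comment.
nodes_sample='NAME                    STATUS     ROLES           AGE     VERSION
minikube                Ready      control-plane   2m12s   v1.30.0
virtual-dgxa100.80g-0   NotReady   agent           118s    fake
virtual-dgxa100.80g-1   NotReady   agent           118s    fake'

printf '%s\n' "$nodes_sample" | not_ready_nodes
```

On a live cluster the input would come from `kubectl get nodes | not_ready_nodes`; an empty result means every node reports Ready.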

Collaborator

Can you create a brand new kind cluster and try it?

Collaborator Author

@yuanchen8911 May 20, 2024

Same result on a brand new kind cluster: the nodes were Ready at first and then became NotReady, the pods are Pending, and the test failed. BTW, my earlier run was in minikube.

$ k get nodes
NAME                    STATUS     ROLES           AGE    VERSION
test-control-plane      Ready      control-plane   106s   v1.29.2
virtual-dgxa100.80g-0   NotReady   agent           48s    fake
virtual-dgxa100.80g-1   NotReady   agent           48s    fake

$ k get pods
NAME           READY   STATUS    RESTARTS   AGE
job1-0-892gm   0/1     Pending   0          61s
job1-1-r8gcf   0/1     Pending   0          61s

$ k get jobs
NAME   COMPLETIONS   DURATION   AGE
job1   0/2           65s        65s
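
For a quick view of how many pods are stuck, the pod table can be tallied by status. A small sketch under assumptions: `pods_by_status` is a hypothetical helper name, and STATUS is taken to be the third column, as in the default `kubectl get pods` output above.

```shell
# Count pods per STATUS value from a `kubectl get pods` table read on stdin.
# Prints one "<status> <count>" line per status, sorted by status name.
pods_by_status() {
  awk 'NR > 1 { count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the pod listing above.
pods_sample='NAME           READY   STATUS    RESTARTS   AGE
job1-0-892gm   0/1     Pending   0          61s
job1-1-r8gcf   0/1     Pending   0          61s'

printf '%s\n' "$pods_sample" | pods_by_status
```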

@yuanchen8911 requested a review from dmitsh May 20, 2024 19:23
Signed-off-by: Yuan Chen <yuanc@nvidia.com>