Execute ModuleRun tasks of the same weight in parallel #504

Merged on Sep 26, 2024 (3 commits)

Conversation

miklezzzz
Contributor

@miklezzzz miklezzzz commented Sep 2, 2024

Overview

ModuleRun tasks, and their corresponding ModuleHookRun tasks, for modules of the same order (weight) are executed concurrently in dedicated parallel queues. By default, the operator's queue set contains 10 parallel queues.
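To illustrate the idea of batching modules by weight, here is a minimal Go sketch. It is not the operator's code: the `Module` and `Batch` types, the `groupByWeight` function, and the weights in `main` are all hypothetical, and the sketch assumes the module list is already sorted by weight.

```go
package main

import (
	"fmt"
	"strings"
)

// Module is a hypothetical stand-in for a module with an order (weight).
type Module struct {
	Name   string
	Weight int
}

// Batch is a hypothetical unit of work for the main queue: one module maps
// to a plain ModuleRun, several modules map to a single parallel run.
type Batch struct {
	Weight  int
	Modules []string
}

// groupByWeight assumes mods is already sorted by Weight and merges
// neighbouring modules with equal weight into one Batch.
func groupByWeight(mods []Module) []Batch {
	var batches []Batch
	for _, m := range mods {
		if n := len(batches); n > 0 && batches[n-1].Weight == m.Weight {
			batches[n-1].Modules = append(batches[n-1].Modules, m.Name)
			continue
		}
		batches = append(batches, Batch{Weight: m.Weight, Modules: []string{m.Name}})
	}
	return batches
}

func main() {
	// The weights below are made up for illustration only.
	mods := []Module{
		{"deckhouse-tools", 10},
		{"documentation", 20},
		{"echo", 30},
		{"mcplay", 30},
	}
	for _, b := range groupByWeight(mods) {
		kind := "ModuleRun"
		if len(b.Modules) > 1 {
			kind = "ParallelModuleRun"
		}
		fmt.Printf("%s: %s\n", kind, strings.Join(b.Modules, ", "))
	}
}
```

With these made-up weights, only `echo` and `mcplay` share a weight, so they end up in one batch while the other two modules stay as single runs.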

What this PR does / why we need it

This PR adds a new task type, ParallelModuleRun. A task of this type represents a group of smaller ModuleRun and ModuleHookRun tasks that share the same order (weight). These subordinate tasks are executed in parallel in pre-created queues named parallel_queue_x, and all results and errors are propagated back to the corresponding ParallelModuleRun task, which updates its status accordingly.
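The fan-out and result propagation described above can be sketched roughly as follows. This is a simplified model, not the PR's implementation: `runParallel`, `result`, and `errDemo` are hypothetical names, each "queue" is modelled as a buffered channel drained by one goroutine, and the round-robin assignment of modules to queues is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// errDemo is a hypothetical error used only by this example.
var errDemo = errors.New("demo failure")

// result carries the outcome of one subordinate task back to the parent.
type result struct {
	module string
	err    error
}

// runParallel spreads modules over queueCount pre-created queues, runs them
// concurrently and returns the errors of the modules that failed.
func runParallel(modules []string, queueCount int, run func(string) error) map[string]error {
	if queueCount < 1 {
		queueCount = 1
	}
	queues := make([]chan string, queueCount)
	results := make(chan result, len(modules)) // buffered: workers never block
	var wg sync.WaitGroup
	for i := range queues {
		queues[i] = make(chan string, len(modules))
		wg.Add(1)
		go func(q <-chan string) { // one worker per queue
			defer wg.Done()
			for m := range q {
				results <- result{module: m, err: run(m)}
			}
		}(queues[i])
	}
	for i, m := range modules {
		queues[i%queueCount] <- m // round-robin assignment (an assumption)
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
	close(results)

	failed := make(map[string]error)
	for r := range results {
		if r.err != nil {
			failed[r.module] = r.err
		}
	}
	return failed
}

func main() {
	failed := runParallel([]string{"dashboard", "echo", "mcplay"}, 10, func(m string) error {
		if m == "mcplay" {
			return errDemo
		}
		return nil
	})
	names := make([]string, 0, len(failed))
	for m := range failed {
		names = append(names, m)
	}
	sort.Strings(names)
	for _, m := range names {
		fmt.Printf("- %s: %v\n", m, failed[m])
	}
}
```

Unlike this sketch, which runs every task once and returns, the queue dumps in this conversation show failed subordinate tasks staying in their queues and being retried.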

Special notes for your reviewer

@miklezzzz miklezzzz added the enhancement New feature or request label Sep 2, 2024
@miklezzzz miklezzzz self-assigned this Sep 2, 2024
@miklezzzz miklezzzz force-pushed the apply-modules-simultaneously branch 11 times, most recently from 033a95a to b6c26de Compare September 10, 2024 15:36
@miklezzzz miklezzzz requested a review from yalosev September 10, 2024 15:45
@miklezzzz miklezzzz marked this pull request as ready for review September 10, 2024 15:53
@miklezzzz miklezzzz changed the title Apply modules` releases simultaneously if possible Execute ModuleRun tasks of the same weight in parallel Sep 10, 2024
@miklezzzz
Contributor Author

An example of a grouped (parallel) run:

Queue 'main': length 31, status: 'run first task'

 1. GroupedModuleRun:main:Grouped run for cloud-data-crd, metallb-crd, operator-prometheus-crd, prometheus-crd, snapshot-controller-crd, user-authn-crd, vertical-pod-autoscaler-crd:OperatorStartup
 2. ModuleRun:main:flow-schema:doStartup:OperatorStartup
 3. ModuleRun:main:admission-policy-engine:doStartup:OperatorStartup
 4. ModuleRun:main:cloud-provider-openstack:doStartup:OperatorStartup
 5. ModuleRun:main:local-path-provisioner:doStartup:OperatorStartup
 6. ModuleRun:main:cni-flannel:doStartup:OperatorStartup
 7. ModuleRun:main:kube-proxy:doStartup:OperatorStartup
 8. ModuleRun:main:registry-packages-proxy:doStartup:OperatorStartup
 9. GroupedModuleRun:main:Grouped run for control-plane-manager, node-manager, terraform-manager:OperatorStartup
10. ModuleRun:main:kube-dns:doStartup:OperatorStartup
11. ModuleRun:main:snapshot-controller:doStartup:OperatorStartup
12. ModuleRun:main:cert-manager:doStartup:OperatorStartup
13. ModuleRun:main:user-authz:doStartup:OperatorStartup
14. ModuleRun:main:user-authn:doStartup:OperatorStartup
15. ModuleRun:main:operator-prometheus:doStartup:OperatorStartup
16. ModuleRun:main:prometheus:doStartup:OperatorStartup
17. ModuleRun:main:prometheus-metrics-adapter:doStartup:OperatorStartup
18. ModuleRun:main:vertical-pod-autoscaler:doStartup:OperatorStartup
19. GroupedModuleRun:main:Grouped run for extended-monitoring, monitoring-applications, monitoring-custom, monitoring-deckhouse, monitoring-kubernetes, monitoring-kubernetes-control-plane, monitoring-ping:OperatorStartup
20. ModuleRun:main:node-local-dns:doStartup:OperatorStartup
21. ModuleRun:main:ingress-nginx:doStartup:OperatorStartup
22. ModuleRun:main:log-shipper:doStartup:OperatorStartup
23. ModuleRun:main:pod-reloader:doStartup:OperatorStartup
24. ModuleRun:main:chrony:doStartup:OperatorStartup
25. GroupedModuleRun:main:Grouped run for dashboard, operator-trivy, upmeter:OperatorStartup
26. GroupedModuleRun:main:Grouped run for namespace-configurator, secret-copier:OperatorStartup
27. ModuleRun:main:deckhouse-tools:doStartup:OperatorStartup
28. ModuleRun:main:documentation:doStartup:OperatorStartup
29. GroupedModuleRun:main:Grouped run for echo, mcplay:OperatorStartup
30. ConvergeModules:main:::Operator-Startup
31. ModuleHookRun:main:kubernetes:002-deckhouse/hooks/change_host_ip.go:pod:Kubernetes

Queue 'group_queue_0': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_0:cloud-data-crd:doStartup:OperatorStartup

Queue 'group_queue_1': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_1:metallb-crd:doStartup:OperatorStartup

Queue 'group_queue_2': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_2:operator-prometheus-crd:doStartup:OperatorStartup

Queue 'group_queue_3': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_3:prometheus-crd:doStartup:OperatorStartup

Queue 'group_queue_4': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_4:snapshot-controller-crd:doStartup:OperatorStartup

Queue 'group_queue_5': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_5:user-authn-crd:doStartup:OperatorStartup

Queue 'group_queue_6': length 1, status: 'waiting for task 20s'

 1. ModuleRun:group_queue_6:vertical-pod-autoscaler-crd:doStartup:OperatorStartup

Summary:
- 'main' queue: 31 tasks.
- 14 other queues (7 active, 7 empty): 7 tasks.
- total 38 tasks to handle.

@miklezzzz
Contributor Author

A failed task in a grouped run:

Queue 'main': length 8, status: 'run first task'

 1. GroupedModuleRun:main:Grouped run for mcplay:OperatorStartup:failures 1:
	Errors:
	- mcplay: helm upgrade failed: cannot patch "mcplay" with kind Deployment: Deployment.apps "mcplay" is invalid: spec.template.spec.containers: Required value

 2. ConvergeModules:main:::Operator-Startup
 3. ModuleHookRun:main:kubernetes:002-deckhouse/hooks/change_host_ip.go:pod:Kubernetes
 4. ModuleRun:main:node-manager:Kubernetes-Change-ModuleValues
 5. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes
 6. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
 7. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
 8. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes

Queue 'group_queue_1': length 1, status: 'run first task'

 1. ModuleRun:group_queue_1:mcplay:doStartup:OperatorStartup:failures 1:helm upgrade failed: cannot patch "mcplay" with kind Deployment: Deployment.apps "mcplay" is invalid: spec.template.spec.containers: Required value


Summary:
- 'main' queue: 8 tasks.
- 99 other queues (1 active, 98 empty): 1 task.
- total 9 tasks to handle.

@miklezzzz
Contributor Author

yet another example:

Queue 'main': length 12, status: 'run first task'

 1. GroupedModuleRun:main:Grouped run for echo, mcplay:OperatorStartup:failures 11:
	Errors:
	- echo: helm upgrade failed: cannot patch "echo-server" with kind Deployment: Deployment.apps "echo-server" is invalid: spec.template.spec.containers: Required value
	- mcplay: helm upgrade failed: cannot patch "mcplay" with kind Deployment: Deployment.apps "mcplay" is invalid: spec.template.spec.containers: Required value

 2. ConvergeModules:main:::Operator-Startup
 3. ModuleHookRun:main:kubernetes:002-deckhouse/hooks/change_host_ip.go:pod:Kubernetes
 4. ModuleRun:main:node-manager:Kubernetes-Change-ModuleValues
 5. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes
 6. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
 7. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
 8. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes
 9. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes
10. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
11. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:statefulsets:Kubernetes
12. ModuleHookRun:main:kubernetes:340-extended-monitoring/hooks/alert_old_annotation.go:namespaces:Kubernetes

Queue 'group_queue_0': length 1, status: 'sleep after fail for 21.4s (1s left of 21s delay)'

 1. ModuleRun:group_queue_0:echo:doStartup:OperatorStartup:failures 6:helm upgrade failed: cannot patch "echo-server" with kind Deployment: Deployment.apps "echo-server" is invalid: spec.template.spec.containers: Required value


Queue 'group_queue_1': length 1, status: 'sleep after fail for 13.3s (3s left of 13s delay)'

 1. ModuleRun:group_queue_1:mcplay:doStartup:OperatorStartup:failures 5:helm upgrade failed: cannot patch "mcplay" with kind Deployment: Deployment.apps "mcplay" is invalid: spec.template.spec.containers: Required value


Summary:
- 'main' queue: 12 tasks.
- 99 other queues (2 active, 97 empty): 2 tasks.
- total 14 tasks to handle.
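In the dump above, the parent task's failure counter (11) equals the sum of the subordinate tasks' counters (6 + 5). One way to model that bookkeeping is sketched below. This is a guess at the behaviour based only on this output, not the actual implementation: `parallelRun`, `report`, `done`, `failing`, and `errDemo` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// errDemo is a hypothetical error used only by this example.
var errDemo = errors.New("demo failure")

// parallelRun is a hypothetical model of the parent task's bookkeeping.
type parallelRun struct {
	mu       sync.Mutex
	pending  map[string]bool   // modules that have not succeeded yet
	lastErr  map[string]string // most recent error per failing module
	failures int               // failed attempts reported by all children
}

func newParallelRun(modules ...string) *parallelRun {
	p := &parallelRun{pending: map[string]bool{}, lastErr: map[string]string{}}
	for _, m := range modules {
		p.pending[m] = true
	}
	return p
}

// report records the outcome of one attempt of a subordinate task.
func (p *parallelRun) report(module string, err error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if err != nil {
		p.failures++
		p.lastErr[module] = err.Error()
		return
	}
	delete(p.pending, module)
	delete(p.lastErr, module)
}

// done reports whether every module has succeeded.
func (p *parallelRun) done() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.pending) == 0
}

// failing returns the sorted names of modules whose last attempt failed.
func (p *parallelRun) failing() []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	names := make([]string, 0, len(p.lastErr))
	for m := range p.lastErr {
		names = append(names, m)
	}
	sort.Strings(names)
	return names
}

func main() {
	p := newParallelRun("echo", "mcplay")
	for i := 0; i < 6; i++ {
		p.report("echo", errDemo)
	}
	for i := 0; i < 5; i++ {
		p.report("mcplay", errDemo)
	}
	fmt.Println(p.failures, p.failing(), p.done()) // 11 [echo mcplay] false
	p.report("echo", nil)
	p.report("mcplay", nil)
	fmt.Println(p.failing(), p.done()) // [] true
}
```

The mutex matters because subordinate tasks report from different queues concurrently; the parent is considered finished only once every module has reported success.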

@raabdullaev raabdullaev requested a review from juev September 23, 2024 08:24
@diafour
Contributor

diafour commented Sep 23, 2024

"Group" makes it sound like a logical group of modules, e.g. "group of monitoring modules" or "group of cni modules". Why not name it according to the PR description: ParallelModuleRun?

Also, there is already a group parameter in Kubernetes subscriptions.

@miklezzzz
Contributor Author

makes sense

@miklezzzz
Contributor Author

[deckhouse] deckhouse@dev-master-0 /deckhouse $ deckhouse-controller queue list
Queue 'main': length 33, status: 'run first task'

 1. ParallelModuleRun:main:Parallel run for cloud-data-crd, metallb-crd, operator-prometheus-crd, prometheus-crd, snapshot-controller-crd, user-authn-crd, vertical-pod-autoscaler-crd:OperatorStartup
 2. ModuleRun:main:flow-schema:doStartup:OperatorStartup
 3. ModuleRun:main:admission-policy-engine:doStartup:OperatorStartup
 4. ModuleRun:main:cloud-provider-openstack:doStartup:OperatorStartup
 5. ModuleRun:main:local-path-provisioner:doStartup:OperatorStartup
 6. ModuleRun:main:cni-flannel:doStartup:OperatorStartup
 7. ModuleRun:main:kube-proxy:doStartup:OperatorStartup
 8. ModuleRun:main:registry-packages-proxy:doStartup:OperatorStartup
 9. ParallelModuleRun:main:Parallel run for control-plane-manager, node-manager, terraform-manager:OperatorStartup
10. ModuleRun:main:kube-dns:doStartup:OperatorStartup
11. ModuleRun:main:snapshot-controller:doStartup:OperatorStartup
12. ModuleRun:main:cert-manager:doStartup:OperatorStartup
13. ModuleRun:main:user-authz:doStartup:OperatorStartup
14. ModuleRun:main:user-authn:doStartup:OperatorStartup
15. ModuleRun:main:operator-prometheus:doStartup:OperatorStartup
16. ModuleRun:main:prometheus:doStartup:OperatorStartup
17. ModuleRun:main:prometheus-metrics-adapter:doStartup:OperatorStartup
18. ModuleRun:main:vertical-pod-autoscaler:doStartup:OperatorStartup
19. ParallelModuleRun:main:Parallel run for extended-monitoring, monitoring-applications, monitoring-custom, monitoring-deckhouse, monitoring-kubernetes, monitoring-kubernetes-control-plane, monitoring-ping:OperatorStartup
20. ModuleRun:main:node-local-dns:doStartup:OperatorStartup
21. ModuleRun:main:metallb:doStartup:OperatorStartup
22. ModuleRun:main:l2-load-balancer:doStartup:OperatorStartup
23. ModuleRun:main:ingress-nginx:doStartup:OperatorStartup
24. ModuleRun:main:log-shipper:doStartup:OperatorStartup
25. ModuleRun:main:pod-reloader:doStartup:OperatorStartup
26. ModuleRun:main:chrony:doStartup:OperatorStartup
27. ParallelModuleRun:main:Parallel run for dashboard, operator-trivy, upmeter:OperatorStartup
28. ParallelModuleRun:main:Parallel run for namespace-configurator, secret-copier:OperatorStartup
29. ModuleRun:main:deckhouse-tools:doStartup:OperatorStartup
30. ModuleRun:main:documentation:doStartup:OperatorStartup
31. ParallelModuleRun:main:Parallel run for echo, mcplay:OperatorStartup
32. ConvergeModules:main:::Operator-Startup
33. ModuleHookRun:main:kubernetes:002-deckhouse/hooks/change_host_ip.go:pod:Kubernetes

Queue 'parallel_queue_0': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_0:snapshot-controller-crd:doStartup:OperatorStartup

Queue 'parallel_queue_1': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_1:user-authn-crd:doStartup:OperatorStartup

Queue 'parallel_queue_2': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_2:vertical-pod-autoscaler-crd:doStartup:OperatorStartup

Queue 'parallel_queue_3': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_3:cloud-data-crd:doStartup:OperatorStartup

Queue 'parallel_queue_4': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_4:metallb-crd:doStartup:OperatorStartup

Queue 'parallel_queue_5': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_5:operator-prometheus-crd:doStartup:OperatorStartup

Queue 'parallel_queue_6': length 1, status: 'run first task'

 1. ModuleRun:parallel_queue_6:prometheus-crd:doStartup:OperatorStartup

Summary:
- 'main' queue: 33 tasks.
- 14 other queues (7 active, 7 empty): 7 tasks.
- total 40 tasks to handle.

@miklezzzz miklezzzz requested a review from yalosev September 24, 2024 15:59
Signed-off-by: Mikhail Scherba <mikhail.scherba@flant.com>

add group queues

Signed-off-by: Mikhail Scherba <mikhail.scherba@flant.com>
Signed-off-by: Mikhail Scherba <mikhail.scherba@flant.com>
@miklezzzz miklezzzz force-pushed the apply-modules-simultaneously branch from 8a2c374 to eecd106 Compare September 26, 2024 15:04
Signed-off-by: Mikhail Scherba <mikhail.scherba@flant.com>
@yalosev yalosev merged commit 9dbe32b into main Sep 26, 2024
8 checks passed
@yalosev yalosev deleted the apply-modules-simultaneously branch September 26, 2024 15:45