-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
api,agent,server,engine-schema: scalability improvements #9840
base: 4.20
Are you sure you want to change the base?
Conversation
2c750db
to
080e5af
Compare
Following changes and improvements have been added: - Improvements in handling of PingRoutingCommand 1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs. 2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch 3. Optimized scanning stalled VMs - Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers` - Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine - Added caching for account/use role API access with expiration after write can be configured using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0. - Added caching for account/use role API access with expiration after write set to 60 seconds. - Added caching for some recurring DB retrievals 1. CapacityManager - listing service offerings - beneficial in host capacity calculation 2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins 3. DownloadListener - hypervisors for zone - beneficial for host joins 5. VirtualMachineManagerImpl - VMs in progress- beneficial for processing stalled VMs during PingRoutingCommands - Optimized MS list retrieval for agent connect - Optimize finding ready systemvm template for zone - Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks - Changes in agent-agentmanager connection with NIO client-server classes 1. Optimized the use of the executor service 2. Refactore Agent class to better handle connections. 3. Do SSL handshakes within worker threads 5. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control number of new connections management server handles at a time. `agent.ssl.handshake.timeout` can be used to set number of seconds after which SSL handshake times out at MS end. 6. On agent side backoff and sslhandshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used. - Improvements in StatsCollection - minimize DB retrievals. - Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals. - Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools. - Minor improvements in resource limit calculations wrt DB retrievals Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com> Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com> Co-authored-by: Rohit Yadav <rohit.yadav@shapeblue.com>
080e5af
to
e3cf7fd
Compare
Honestly, I don't like PRs with thousand of lines doing thousand of things. It is hard to review and test. I encourage you to separate it in several minor PRs that address each one of the changes you are proposing. |
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
@blueorangutan package |
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11441 |
@blueorangutan package |
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11486 |
@blueorangutan test |
@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
[SF] Trillian test result (tid-11738)
|
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
@blueorangutan package |
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
@blueorangutan package |
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 11708 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't review thoroughly as there are a lot of changes, but I left some comments around the code. Overall, it looks good, however, as @GutoVeronezi mentioned, I think this PR would benefit being break into smaller ones, specially as not all changes are intertwined.
It will probably need a lot of testing for this much change, as to make sure no regression is introduced (that's why I think smaller PRs would make this easier to test and approve).
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Thanks @BryanMLima for the review. I've addressed your suggestions and responded to two queries. @blueorangutan package |
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11887 |
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch. |
@blueorangutan package |
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11920 |
@blueorangutan test matrix |
@shwstppr a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests |
[SF] Trillian Build Failed (tid-11992) |
Description
Following changes and improvements have been added:
Improvements in handling of PingRoutingCommand
vm.sync.power.state.transitioning
, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.Added option to set worker threads for capacity calculation using config -
capacity.calculate.workers
Added caching for account/use role API access with expiration after write can be configured using config -
dynamic.apichecker.cache.period
. If set to zero then there will be no caching. Default is 0.Added caching for account/use role API access with expiration after write set to 60 seconds.
Added caching for some recurring DB retrievals
Optimized MS list retrieval for agent connect
Optimize finding ready systemvm template for zone
Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks
Changes in agent-agentmanager connection with NIO client-server classes
agent.ssl.handshake.min.workers
andagent.ssl.handshake.max.workers
can be used to control number of new connections management server handles at a time.agent.ssl.handshake.timeout
can be used to set number of seconds after which SSL handshake times out at MS end.backoff.seconds
andssl.handshake.timeout
properties can be used.Improvements in StatsCollection - minimize DB retrievals.
Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
Improvements in hosts connection for a storage pool. Added config -
storage.pool.host.connect.workers
to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.Minor improvements in resource limit calculations wrt DB retrievals
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?