Automatic live-migration to balance load on cluster #1369

Merged
merged 10 commits into from
Nov 14, 2024

Collaborator

@presztak presztak commented Nov 12, 2024

This PR adds a feature that automatically moves some workloads to re-balance load across the cluster.
The logic works as follows:

  • Spawns a background task that runs on the current leader
  • Calculates a per-server score (based on CPU and memory)
  • Splits the servers by CPU architecture, then sorts them by score and checks the difference between the most and least busy server for each architecture, exiting the re-balance if the difference is less than the threshold
  • Pulls the list of instances from the busiest server and filters out those that cannot be live-migrated
  • Determines which instances should be migrated to equalize the score between the two servers, based on their current memory usage and CPU allocation, and migrates them
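
The per-architecture threshold check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Server` and `Pair` types, the `rebalanceCandidates` function and the idea of a single percentage-like `Score` value are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Server is a hypothetical, simplified view of a cluster member.
type Server struct {
	Name         string
	Architecture string
	Score        float64 // assumed load score derived from CPU and memory
}

// Pair holds the most and least busy servers of one architecture.
type Pair struct {
	Busiest Server
	Idlest  Server
}

// rebalanceCandidates groups servers by architecture, sorts each group by
// score and returns the busiest/idlest pair for every architecture whose
// score difference reaches the threshold.
func rebalanceCandidates(servers []Server, threshold float64) map[string]Pair {
	byArch := map[string][]Server{}
	for _, s := range servers {
		byArch[s.Architecture] = append(byArch[s.Architecture], s)
	}

	pairs := map[string]Pair{}
	for arch, group := range byArch {
		if len(group) < 2 {
			continue // a single server has nothing to balance against
		}

		sort.Slice(group, func(i, j int) bool { return group[i].Score < group[j].Score })
		idlest, busiest := group[0], group[len(group)-1]
		if busiest.Score-idlest.Score < threshold {
			continue // difference below the threshold, leave this architecture alone
		}

		pairs[arch] = Pair{Busiest: busiest, Idlest: idlest}
	}

	return pairs
}

func main() {
	servers := []Server{
		{"node1", "x86_64", 80},
		{"node2", "x86_64", 35},
		{"node3", "aarch64", 50},
		{"node4", "aarch64", 45},
	}

	for arch, p := range rebalanceCandidates(servers, 20) {
		fmt.Printf("%s: move load from %s to %s\n", arch, p.Busiest.Name, p.Idlest.Name)
	}
}
```

With these sample values only the `x86_64` group exceeds a threshold of 20, so the `aarch64` servers are left untouched.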

Fixes: #485
Closes: #835

@github-actions github-actions bot added the Documentation (Documentation needs updating) and API (Changes to the REST API) labels Nov 12, 2024
@stgraber
Member

A few initial tweaks I've made:

  • Updated the API extension commit to list the new config keys
  • Fixed config/config.go to have cluster.rebalance.frequency get a gendoc entry
  • Re-generated gendoc

@stgraber
Member

Going to bump the default cooldown to 6H as otherwise with a 1h frequency, it'd be a bit useless :)
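
To illustrate why a cooldown shorter than or equal to the run frequency would have little effect, here is a small sketch. It assumes the cooldown is the minimum time that must pass before an instance that was just moved may be moved again; that reading, the `canMove` helper and the two variables are assumptions for illustration, not taken from the PR's code.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical values matching the numbers in the comment above.
var (
	interval        = time.Hour     // how often the re-balance task runs
	defaultCooldown = 6 * time.Hour // assumed minimum time between moves of one instance
)

// canMove reports whether an instance that was last moved sinceLastMove ago
// may be moved again under the given cooldown.
func canMove(sinceLastMove, cooldown time.Duration) bool {
	return sinceLastMove >= cooldown
}

func main() {
	// Simulate the hourly runs that follow a migration.
	for run := 1; run <= 7; run++ {
		elapsed := time.Duration(run) * interval
		fmt.Printf("run %d (%v after the move): movable=%v\n", run, elapsed, canMove(elapsed, defaultCooldown))
	}
}
```

Under this assumption, a cooldown equal to the interval would let the same instance be moved on every run, whereas a 6h cooldown holds it in place for several consecutive runs.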

@stgraber
Member

I'm also making cluster.rebalance.frequency default to 0, as we shouldn't do this by default.
And making the default threshold 20.

@stgraber
Member

I'm also making the threshold be between 10% and 100%
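
A range check like the one described here could look roughly like the sketch below. The `validateThreshold` name, the inclusive bounds and the treatment of an empty value as "use the default" are assumptions for illustration; the actual validator in Incus may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// validateThreshold is a hypothetical validator for the re-balance threshold.
// It accepts an empty value (meaning "use the default") or an integer
// percentage between 10 and 100, bounds assumed to be inclusive.
func validateThreshold(value string) error {
	if value == "" {
		return nil
	}

	n, err := strconv.Atoi(value)
	if err != nil {
		return fmt.Errorf("invalid threshold %q: %w", value, err)
	}

	if n < 10 || n > 100 {
		return fmt.Errorf("threshold must be between 10 and 100, got %d", n)
	}

	return nil
}

func main() {
	for _, v := range []string{"20", "5", "100", "abc"} {
		fmt.Printf("%q -> %v\n", v, validateThreshold(v))
	}
}
```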

@stgraber stgraber force-pushed the live_migration_balance_load branch 2 times, most recently from 376284c to 121b587 Compare November 13, 2024 22:14
@stgraber
Member

I'm renaming frequency to interval for consistency with other config keys.

@stgraber
Member

Reducing the default batch size down to just 1 instance per batch as that's the most reliable, even if slowest. Also changing the batching to be per architecture rather than global.
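
The difference between a global batch and a per-architecture batch can be sketched as below. This is only an illustration under assumed names: the `Instance` type and `pickBatch` function are hypothetical, and the real code also takes memory usage and CPU allocation into account when choosing instances.

```go
package main

import "fmt"

// Instance is a hypothetical, simplified instance record.
type Instance struct {
	Name           string
	Architecture   string
	CanLiveMigrate bool
}

// pickBatch returns at most batchSize live-migratable instances for each
// architecture, keeping the input order. The limit is tracked per
// architecture rather than across the whole list.
func pickBatch(instances []Instance, batchSize int) map[string][]Instance {
	batches := map[string][]Instance{}
	for _, inst := range instances {
		if !inst.CanLiveMigrate {
			continue // skip instances that cannot be live-migrated
		}

		if len(batches[inst.Architecture]) >= batchSize {
			continue // this architecture's batch is already full
		}

		batches[inst.Architecture] = append(batches[inst.Architecture], inst)
	}

	return batches
}

func main() {
	instances := []Instance{
		{"c1", "x86_64", false},
		{"v1", "x86_64", true},
		{"v2", "x86_64", true},
		{"v3", "aarch64", true},
	}

	for arch, batch := range pickBatch(instances, 1) {
		for _, inst := range batch {
			fmt.Printf("%s: migrate %s\n", arch, inst.Name)
		}
	}
}
```

With a batch size of 1, each architecture still gets one migration per run, so a busy `x86_64` group does not starve the `aarch64` group.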

@stgraber stgraber force-pushed the live_migration_balance_load branch 2 times, most recently from fcab705 to eb5918f Compare November 14, 2024 06:08
stgraber and others added 9 commits November 14, 2024 01:09
@stgraber
Member

I did a few more tweaks on top of what I commented on before:

  • Extracted the instance resource logic we had in the placement logic and moved it to the instance package
  • Rebased the logic in the placement scriptlet to use that function
  • Replaced the logic in this PR with the shared logic
  • Renamed the main .go file to match the pattern for the clustering related files
  • Renamed some functions to keep them a bit better namespaced together
  • Optimized the migration code a bit

I've had it running for a bit on a test cluster now and it seems to be behaving as intended.

Signed-off-by: Stéphane Graber <stgraber@stgraber.org>
@stgraber stgraber merged commit 2f6386c into lxc:main Nov 14, 2024
30 checks passed
tmeijn pushed a commit to tmeijn/dotfiles that referenced this pull request Nov 22, 2024
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [lxc/incus](https://github.com/lxc/incus) | minor | `v6.6.0` -> `v6.7.0` |
