
VM CPU auto pinning causes slowdowns and stealtime #14133

Open
pkramme opened this issue Sep 19, 2024 · 4 comments

pkramme commented Sep 19, 2024

Required information

  • Distribution: Ubuntu
  • Distribution version: 20.04.5
  • The output of "snap list --all lxd core20 core22 core24 snapd":
    Name    Version      Rev    Tracking       Publisher   Notes
    core22  20240809     1586   latest/stable  canonical✓  base,disabled
    core22  20240823     1612   latest/stable  canonical✓  base
    lxd     6.1-efad198  29943  latest/stable  canonical✓  disabled
    lxd     6.1-78a3d8f  30130  latest/stable  canonical✓  -
    snapd   2.62         21465  latest/stable  canonical✓  snapd,disabled
    snapd   2.63         21759  latest/stable  canonical✓  snapd
    
  • The output of "lxc info":
     config:
     api_extensions:
     - storage_zfs_remove_snapshots
     - container_host_shutdown_timeout
     - container_stop_priority
     - container_syscall_filtering
     - auth_pki
     - container_last_used_at
     - etag
     - patch
     - usb_devices
     - https_allowed_credentials
     - image_compression_algorithm
     - directory_manipulation
     - container_cpu_time
     - storage_zfs_use_refquota
     - storage_lvm_mount_options
     - network
     - profile_usedby
     - container_push
     - container_exec_recording
     - certificate_update
     - container_exec_signal_handling
     - gpu_devices
     - container_image_properties
     - migration_progress
     - id_map
     - network_firewall_filtering
     - network_routes
     - storage
     - file_delete
     - file_append
     - network_dhcp_expiry
     - storage_lvm_vg_rename
     - storage_lvm_thinpool_rename
     - network_vlan
     - image_create_aliases
     - container_stateless_copy
     - container_only_migration
     - storage_zfs_clone_copy
     - unix_device_rename
     - storage_lvm_use_thinpool
     - storage_rsync_bwlimit
     - network_vxlan_interface
     - storage_btrfs_mount_options
     - entity_description
     - image_force_refresh
     - storage_lvm_lv_resizing
     - id_map_base
     - file_symlinks
     - container_push_target
     - network_vlan_physical
     - storage_images_delete
     - container_edit_metadata
     - container_snapshot_stateful_migration
     - storage_driver_ceph
     - storage_ceph_user_name
     - resource_limits
     - storage_volatile_initial_source
     - storage_ceph_force_osd_reuse
     - storage_block_filesystem_btrfs
     - resources
     - kernel_limits
     - storage_api_volume_rename
     - network_sriov
     - console
     - restrict_devlxd
     - migration_pre_copy
     - infiniband
     - maas_network
     - devlxd_events
     - proxy
     - network_dhcp_gateway
     - file_get_symlink
     - network_leases
     - unix_device_hotplug
     - storage_api_local_volume_handling
     - operation_description
     - clustering
     - event_lifecycle
     - storage_api_remote_volume_handling
     - nvidia_runtime
     - container_mount_propagation
     - container_backup
     - devlxd_images
     - container_local_cross_pool_handling
     - proxy_unix
     - proxy_udp
     - clustering_join
     - proxy_tcp_udp_multi_port_handling
     - network_state
     - proxy_unix_dac_properties
     - container_protection_delete
     - unix_priv_drop
     - pprof_http
     - proxy_haproxy_protocol
     - network_hwaddr
     - proxy_nat
     - network_nat_order
     - container_full
     - backup_compression
     - nvidia_runtime_config
     - storage_api_volume_snapshots
     - storage_unmapped
     - projects
     - network_vxlan_ttl
     - container_incremental_copy
     - usb_optional_vendorid
     - snapshot_scheduling
     - snapshot_schedule_aliases
     - container_copy_project
     - clustering_server_address
     - clustering_image_replication
     - container_protection_shift
     - snapshot_expiry
     - container_backup_override_pool
     - snapshot_expiry_creation
     - network_leases_location
     - resources_cpu_socket
     - resources_gpu
     - resources_numa
     - kernel_features
     - id_map_current
     - event_location
     - storage_api_remote_volume_snapshots
     - network_nat_address
     - container_nic_routes
     - cluster_internal_copy
     - seccomp_notify
     - lxc_features
     - container_nic_ipvlan
     - network_vlan_sriov
     - storage_cephfs
     - container_nic_ipfilter
     - resources_v2
     - container_exec_user_group_cwd
     - container_syscall_intercept
     - container_disk_shift
     - storage_shifted
     - resources_infiniband
     - daemon_storage
     - instances
     - image_types
     - resources_disk_sata
     - clustering_roles
     - images_expiry
     - resources_network_firmware
     - backup_compression_algorithm
     - ceph_data_pool_name
     - container_syscall_intercept_mount
     - compression_squashfs
     - container_raw_mount
     - container_nic_routed
     - container_syscall_intercept_mount_fuse
     - container_disk_ceph
     - virtual-machines
     - image_profiles
     - clustering_architecture
     - resources_disk_id
     - storage_lvm_stripes
     - vm_boot_priority
     - unix_hotplug_devices
     - api_filtering
     - instance_nic_network
     - clustering_sizing
     - firewall_driver
     - projects_limits
     - container_syscall_intercept_hugetlbfs
     - limits_hugepages
     - container_nic_routed_gateway
     - projects_restrictions
     - custom_volume_snapshot_expiry
     - volume_snapshot_scheduling
     - trust_ca_certificates
     - snapshot_disk_usage
     - clustering_edit_roles
     - container_nic_routed_host_address
     - container_nic_ipvlan_gateway
     - resources_usb_pci
     - resources_cpu_threads_numa
     - resources_cpu_core_die
     - api_os
     - container_nic_routed_host_table
     - container_nic_ipvlan_host_table
     - container_nic_ipvlan_mode
     - resources_system
     - images_push_relay
     - network_dns_search
     - container_nic_routed_limits
     - instance_nic_bridged_vlan
     - network_state_bond_bridge
     - usedby_consistency
     - custom_block_volumes
     - clustering_failure_domains
     - resources_gpu_mdev
     - console_vga_type
     - projects_limits_disk
     - network_type_macvlan
     - network_type_sriov
     - container_syscall_intercept_bpf_devices
     - network_type_ovn
     - projects_networks
     - projects_networks_restricted_uplinks
     - custom_volume_backup
     - backup_override_name
     - storage_rsync_compression
     - network_type_physical
     - network_ovn_external_subnets
     - network_ovn_nat
     - network_ovn_external_routes_remove
     - tpm_device_type
     - storage_zfs_clone_copy_rebase
     - gpu_mdev
     - resources_pci_iommu
     - resources_network_usb
     - resources_disk_address
     - network_physical_ovn_ingress_mode
     - network_ovn_dhcp
     - network_physical_routes_anycast
     - projects_limits_instances
     - network_state_vlan
     - instance_nic_bridged_port_isolation
     - instance_bulk_state_change
     - network_gvrp
     - instance_pool_move
     - gpu_sriov
     - pci_device_type
     - storage_volume_state
     - network_acl
     - migration_stateful
     - disk_state_quota
     - storage_ceph_features
     - projects_compression
     - projects_images_remote_cache_expiry
     - certificate_project
     - network_ovn_acl
     - projects_images_auto_update
     - projects_restricted_cluster_target
     - images_default_architecture
     - network_ovn_acl_defaults
     - gpu_mig
     - project_usage
     - network_bridge_acl
     - warnings
     - projects_restricted_backups_and_snapshots
     - clustering_join_token
     - clustering_description
     - server_trusted_proxy
     - clustering_update_cert
     - storage_api_project
     - server_instance_driver_operational
     - server_supported_storage_drivers
     - event_lifecycle_requestor_address
     - resources_gpu_usb
     - clustering_evacuation
     - network_ovn_nat_address
     - network_bgp
     - network_forward
     - custom_volume_refresh
     - network_counters_errors_dropped
     - metrics
     - image_source_project
     - clustering_config
     - network_peer
     - linux_sysctl
     - network_dns
     - ovn_nic_acceleration
     - certificate_self_renewal
     - instance_project_move
     - storage_volume_project_move
     - cloud_init
     - network_dns_nat
     - database_leader
     - instance_all_projects
     - clustering_groups
     - ceph_rbd_du
     - instance_get_full
     - qemu_metrics
     - gpu_mig_uuid
     - event_project
     - clustering_evacuation_live
     - instance_allow_inconsistent_copy
     - network_state_ovn
     - storage_volume_api_filtering
     - image_restrictions
     - storage_zfs_export
     - network_dns_records
     - storage_zfs_reserve_space
     - network_acl_log
     - storage_zfs_blocksize
     - metrics_cpu_seconds
     - instance_snapshot_never
     - certificate_token
     - instance_nic_routed_neighbor_probe
     - event_hub
     - agent_nic_config
     - projects_restricted_intercept
     - metrics_authentication
     - images_target_project
     - cluster_migration_inconsistent_copy
     - cluster_ovn_chassis
     - container_syscall_intercept_sched_setscheduler
     - storage_lvm_thinpool_metadata_size
     - storage_volume_state_total
     - instance_file_head
     - instances_nic_host_name
     - image_copy_profile
     - container_syscall_intercept_sysinfo
     - clustering_evacuation_mode
     - resources_pci_vpd
     - qemu_raw_conf
     - storage_cephfs_fscache
     - network_load_balancer
     - vsock_api
     - instance_ready_state
     - network_bgp_holdtime
     - storage_volumes_all_projects
     - metrics_memory_oom_total
     - storage_buckets
     - storage_buckets_create_credentials
     - metrics_cpu_effective_total
     - projects_networks_restricted_access
     - storage_buckets_local
     - loki
     - acme
     - internal_metrics
     - cluster_join_token_expiry
     - remote_token_expiry
     - init_preseed
     - storage_volumes_created_at
     - cpu_hotplug
     - projects_networks_zones
     - network_txqueuelen
     - cluster_member_state
     - instances_placement_scriptlet
     - storage_pool_source_wipe
     - zfs_block_mode
     - instance_generation_id
     - disk_io_cache
     - amd_sev
     - storage_pool_loop_resize
     - migration_vm_live
     - ovn_nic_nesting
     - oidc
     - network_ovn_l3only
     - ovn_nic_acceleration_vdpa
     - cluster_healing
     - instances_state_total
     - auth_user
     - security_csm
     - instances_rebuild
     - numa_cpu_placement
     - custom_volume_iso
     - network_allocations
     - storage_api_remote_volume_snapshot_copy
     - zfs_delegate
     - operations_get_query_all_projects
     - metadata_configuration
     - syslog_socket
     - event_lifecycle_name_and_project
     - instances_nic_limits_priority
     - disk_initial_volume_configuration
     - operation_wait
     - cluster_internal_custom_volume_copy
     - disk_io_bus
     - storage_cephfs_create_missing
     - instance_move_config
     - ovn_ssl_config
     - init_preseed_storage_volumes
     - metrics_instances_count
     - server_instance_type_info
     - resources_disk_mounted
     - server_version_lts
     - oidc_groups_claim
     - loki_config_instance
     - storage_volatile_uuid
     - import_instance_devices
     - instances_uefi_vars
     - instances_migration_stateful
     - container_syscall_filtering_allow_deny_syntax
     - access_management
     - vm_disk_io_limits
     - storage_volumes_all
     - instances_files_modify_permissions
     - image_restriction_nesting
     - container_syscall_intercept_finit_module
     - device_usb_serial
     - network_allocate_external_ips
     - explicit_trust_token
     api_status: stable
     api_version: "1.0"
     auth: trusted
     public: false
     auth_methods:
     - tls
     auth_user_name: root
     auth_user_method: unix
     environment:
       addresses:
       architectures:
       - x86_64
       - i686
       driver: lxc | qemu
       driver_version: 6.0.0 | 8.2.1
       instance_types:
       - container
       - virtual-machine
       firewall: nftables
       kernel: Linux
       kernel_architecture: x86_64
       kernel_features:
         idmapped_mounts: "true"
         netnsid_getifaddrs: "true"
         seccomp_listener: "true"
         seccomp_listener_continue: "true"
         uevent_injection: "true"
         unpriv_fscaps: "true"
       kernel_version: 6.6.1+441-dmf
       lxc_features:
         cgroup2: "true"
         core_scheduling: "true"
         devpts_fd: "true"
         idmapped_mounts_v2: "true"
         mount_injection_file: "true"
         network_gateway_device_route: "true"
         network_ipvlan: "true"
         network_l2proxy: "true"
         network_phys_macvlan_mtu: "true"
         network_veth_router: "true"
         pidfd: "true"
         seccomp_allow_deny_syntax: "true"
         seccomp_notify: "true"
         seccomp_proxy_send_notify_fd: "true"
       os_name: Ubuntu
       os_version: "20.04"
       project: default
       server: lxd
       server_clustered: false
       server_event_mode: full-mesh
       server_name: hyper4
       server_pid: 113133
       server_version: "6.1"
       server_lts: false
       storage: btrfs
       storage_version: 5.16.2
       storage_supported_drivers:
       - name: btrfs
         version: 5.16.2
         remote: false
       - name: ceph
         version: 17.2.7
         remote: true
       - name: cephfs
         version: 17.2.7
         remote: true
       - name: cephobject
         version: 17.2.7
         remote: true
       - name: dir
         version: "1"
         remote: false
       - name: lvm
         version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.48.0
         remote: false
       - name: powerflex
         version: 1.16 (nvme-cli)
         remote: true
     
    

Issue description

The introduction of automatic core scheduling has led to a significant decrease in performance in our infrastructure, with weird problems that make no sense if you are not aware of this behaviour, such as:

  • a sudden drop in performance after VM creation or other seemingly insignificant events that operators aren't looking at when diagnosing performance issues local to one VM
  • spikes in steal time that make no sense, such as a persistent 25% steal time
  • generally unpredictable performance, where hypervisor load and VM steal time appear unrelated (or at least the relationship isn't visible unless you trace core scheduling decisions)

LXD's CPU placement doesn't seem to understand hardware topology, which is really surprising, considering that many new CPUs are asymmetric and that on the kernel side a lot of work is going into putting workloads on "the best core for the job", with features like AMD Preferred Core and its equivalents, or new CPU schedulers like EEVDF.
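
For reference, the data such a placement policy would need is already exposed by the kernel; a couple of illustrative reads (standard Linux sysfs paths, shown here for cpu0 only):

cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list   # CPUs sharing cpu0's L3 (on AMD parts this usually maps to a CCD/CCX)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list  # cpu0's SMT sibling(s)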

It seems odd to put these placement decisions in LXD and turn them on by default, without an off switch, apparently without considering that this might cause significant problems. LXD simply does not have enough data, and static round-robin placement is too simplistic.

From our perspective this is a significant design error, and we ask that the feature is either:

  1. reworked so that hardware topology is accurately picked up, including L3 cache differences, CCD layouts, preferred-core data, etc.,
  2. enhanced with an option to turn it completely off,
  3. turned off by default, or
  4. removed.

Additionally, snap's auto-update mechanism introduced this new feature into our infrastructure (which by itself is fine), and we'd ask you to consider that features like this are continuously applied to real workloads and, even outside the LTS series, should at least not be harmful.
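
For anyone in a similar position, snapd can hold automatic refreshes; an illustrative workaround (needs a reasonably recent snapd), not a fix for the placement behaviour itself:

snap refresh --hold lxd        # hold automatic refreshes of the lxd snap indefinitely
snap refresh --hold=720h lxd   # or hold them for a fixed duration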

Information to attach

Our current hardware topology has two L3 domains of different sizes. Our VMs run typical web application workloads. The core load balancing has put multiple CPU-bound vCPUs on the same physical core, leading to the weird steal time described above.

# lstopo
Machine (125GB total)
  Package L#0
    NUMANode L#0 (P#0 125GB)
    L3 L#0 (96MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    L3 L#1 (32MB)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#24)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#25)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#26)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#27)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#28)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#29)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#30)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#31)

pkramme commented Sep 19, 2024

We threw together a quick script to visualize that problem:

0:	seg18-app1, seg18-mysql1
1:	seg17-app1, seg18-lb1
2:	seg19-app1, seg19-redis1
3:	seg19-mysql1
4:	seg17-app1
5:	seg17-mysql1, seg18-redis1
6:	seg18-redis1, seg19-mysql1
7:	seg19-app1, seg19-mysql1
8:	seg18-app1, seg19-app1
9:	seg18-app1, seg19-app1
10:	seg17-lb1, seg18-mysql1
11:	seg19-mysql1
12:	seg19-app1, seg19-redis1
13:	seg18-app1, seg18-mysql1
14:	seg17-app1, seg17-redis1
15:	seg18-mysql1
16:	seg19-mysql1
17:	seg19-app1, seg19-lb1
18:	seg17-app1, seg17-redis1
19:	seg17-mysql1, seg19-app1
20:	seg18-app1, seg18-mysql1
21:	seg19-mysql1
22:	seg18-app1, seg18-mysql1
23:	seg19-mysql1
24:	seg19-mysql1
25:	seg17-lb1, seg18-mysql1
26:	seg18-mysql1
27:	seg17-app1, seg18-app1
28:	seg17-app1, seg18-app1
29:	seg17-app1, seg19-app1
30:	seg17-app1, seg18-lb1
31:	seg19-lb1

Cores:

  • 0-7, 16-23 are fast
  • 8-15, 24-31 are slow
  • 16-31 are the SMT siblings: CPU N+16 shares a physical core with CPU N

The general rule on this system is:

  • seg17 has practically no usage
  • seg18 and seg19 are really important

This core placement puts very latency-critical systems on the two SMT threads of the same physical core, while leaving systems with no real load on cores of their own. Even if this were a completely symmetrical CPU, and even if none of these logical CPUs shared a physical core, static placement would still waste resources when the VMs aren't equally loaded.
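
A rough sketch of how such a host-CPU-to-VM map can be produced (this is not our original script; it assumes the snap's lxc client, that lxc list --format csv reports state and type as RUNNING and VIRTUAL-MACHINE, and that lxc info prints the QEMU PID of a running VM):

#!/bin/sh
# For every running VM, print "cpus-allowed <tab> vm-name" for each QEMU thread,
# so host CPUs that carry threads of several busy VMs become visible.
for vm in $(lxc list --format csv -c n,s,t | awk -F, '$2 == "RUNNING" && $3 == "VIRTUAL-MACHINE" {print $1}'); do
    pid=$(lxc info "$vm" | awk '/^PID:/ {print $2}')
    for status in /proc/"$pid"/task/*/status; do
        awk -v vm="$vm" '/^Cpus_allowed_list:/ {print $2 "\t" vm}' "$status"
    done
done | sort -u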

@tomponline tomponline added the Bug Confirmed to be a bug label Sep 19, 2024
@tomponline tomponline added this to the lxd-6.2 milestone Sep 19, 2024

tomponline commented Sep 19, 2024

Thanks for your detailed report!

Yeah this was an area of concern originally:

Note: On systems that have mixed performance and efficiency cores (P+E) you may find that VM performance is decreased due to the way LXD now pins some of the VM’s vCPUs to efficiency cores rather than letting the Linux scheduler dynamically schedule them. You can use the explicit CPU pinning feature if needed to avoid this.

https://discourse.ubuntu.com/t/lxd-6-1-has-been-released/46259#vm-automatic-core-pinning-load-balancing
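
For reference, the explicit pinning mentioned in that note is done by giving limits.cpu a CPU set instead of a bare count; an illustrative example (instance name and CPU numbers are placeholders):

lxc config set <instance> limits.cpu 0-7,16-23   # explicit pinning: vCPUs are pinned to these host CPUs
lxc config set <instance> limits.cpu 8           # a bare count: 8 vCPUs, placement left to LXD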

But we are considering options 2 and 3 of your suggestions.


pkramme commented Sep 19, 2024

Thank you for your quick response! Would it be possible to get a patch for the 6.1 series that would give us the option to turn this off? Otherwise we'd have to write tooling to repin the VMs based on things like steal time or CPU pressure. We'd much rather just let the kernel handle it.


tomponline commented Sep 19, 2024

Thank you for your quick response! Would it be possible to get a patch for the 6.1 series that would give us the option to turn this off? Otherwise we'd have to write tooling to repin the VMs based on things like steal time or CPU pressure. We'd much rather just let the kernel handle it.

The latest 5.21/stable LTS series does not have this feature (on purpose, because it changes the default behaviour), so you could try that. It's more suitable for production purposes anyway, as latest/stable is the moving feature-release channel and doesn't support downgrades.

See https://documentation.ubuntu.com/lxd/en/latest/installing/#installing-release

The 6.1 release won't get patches now, but will be replaced by 6.2, 6.3, etc. Hopefully we can land the new settings in one of those two releases, but 6.2 is imminent, so it might not make it into that one.
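
For completeness, a fresh deployment can track the LTS series from the start (an example only; moving an existing host from 6.1 back to 5.21 would be a downgrade, which as noted above isn't supported):

snap install lxd --channel=5.21/stable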

@tomponline tomponline assigned hamistao and kadinsayani and unassigned hamistao Sep 19, 2024