
VM CPU auto pinning causes slowdowns and stealtime #14133

Open
pkramme opened this issue Sep 19, 2024 · 4 comments

pkramme commented Sep 19, 2024

Required information

  • Distribution: Ubuntu
  • Distribution version: 20.04.5
  • The output of "snap list --all lxd core20 core22 core24 snapd":
    Name    Version      Rev    Tracking       Publisher   Notes
    core22  20240809     1586   latest/stable  canonical✓  base,disabled
    core22  20240823     1612   latest/stable  canonical✓  base
    lxd     6.1-efad198  29943  latest/stable  canonical✓  disabled
    lxd     6.1-78a3d8f  30130  latest/stable  canonical✓  -
    snapd   2.62         21465  latest/stable  canonical✓  snapd,disabled
    snapd   2.63         21759  latest/stable  canonical✓  snapd
    
  • The output of "lxc info":
     config:
     api_extensions:
     - storage_zfs_remove_snapshots
     - container_host_shutdown_timeout
     - container_stop_priority
     - container_syscall_filtering
     - auth_pki
     - container_last_used_at
     - etag
     - patch
     - usb_devices
     - https_allowed_credentials
     - image_compression_algorithm
     - directory_manipulation
     - container_cpu_time
     - storage_zfs_use_refquota
     - storage_lvm_mount_options
     - network
     - profile_usedby
     - container_push
     - container_exec_recording
     - certificate_update
     - container_exec_signal_handling
     - gpu_devices
     - container_image_properties
     - migration_progress
     - id_map
     - network_firewall_filtering
     - network_routes
     - storage
     - file_delete
     - file_append
     - network_dhcp_expiry
     - storage_lvm_vg_rename
     - storage_lvm_thinpool_rename
     - network_vlan
     - image_create_aliases
     - container_stateless_copy
     - container_only_migration
     - storage_zfs_clone_copy
     - unix_device_rename
     - storage_lvm_use_thinpool
     - storage_rsync_bwlimit
     - network_vxlan_interface
     - storage_btrfs_mount_options
     - entity_description
     - image_force_refresh
     - storage_lvm_lv_resizing
     - id_map_base
     - file_symlinks
     - container_push_target
     - network_vlan_physical
     - storage_images_delete
     - container_edit_metadata
     - container_snapshot_stateful_migration
     - storage_driver_ceph
     - storage_ceph_user_name
     - resource_limits
     - storage_volatile_initial_source
     - storage_ceph_force_osd_reuse
     - storage_block_filesystem_btrfs
     - resources
     - kernel_limits
     - storage_api_volume_rename
     - network_sriov
     - console
     - restrict_devlxd
     - migration_pre_copy
     - infiniband
     - maas_network
     - devlxd_events
     - proxy
     - network_dhcp_gateway
     - file_get_symlink
     - network_leases
     - unix_device_hotplug
     - storage_api_local_volume_handling
     - operation_description
     - clustering
     - event_lifecycle
     - storage_api_remote_volume_handling
     - nvidia_runtime
     - container_mount_propagation
     - container_backup
     - devlxd_images
     - container_local_cross_pool_handling
     - proxy_unix
     - proxy_udp
     - clustering_join
     - proxy_tcp_udp_multi_port_handling
     - network_state
     - proxy_unix_dac_properties
     - container_protection_delete
     - unix_priv_drop
     - pprof_http
     - proxy_haproxy_protocol
     - network_hwaddr
     - proxy_nat
     - network_nat_order
     - container_full
     - backup_compression
     - nvidia_runtime_config
     - storage_api_volume_snapshots
     - storage_unmapped
     - projects
     - network_vxlan_ttl
     - container_incremental_copy
     - usb_optional_vendorid
     - snapshot_scheduling
     - snapshot_schedule_aliases
     - container_copy_project
     - clustering_server_address
     - clustering_image_replication
     - container_protection_shift
     - snapshot_expiry
     - container_backup_override_pool
     - snapshot_expiry_creation
     - network_leases_location
     - resources_cpu_socket
     - resources_gpu
     - resources_numa
     - kernel_features
     - id_map_current
     - event_location
     - storage_api_remote_volume_snapshots
     - network_nat_address
     - container_nic_routes
     - cluster_internal_copy
     - seccomp_notify
     - lxc_features
     - container_nic_ipvlan
     - network_vlan_sriov
     - storage_cephfs
     - container_nic_ipfilter
     - resources_v2
     - container_exec_user_group_cwd
     - container_syscall_intercept
     - container_disk_shift
     - storage_shifted
     - resources_infiniband
     - daemon_storage
     - instances
     - image_types
     - resources_disk_sata
     - clustering_roles
     - images_expiry
     - resources_network_firmware
     - backup_compression_algorithm
     - ceph_data_pool_name
     - container_syscall_intercept_mount
     - compression_squashfs
     - container_raw_mount
     - container_nic_routed
     - container_syscall_intercept_mount_fuse
     - container_disk_ceph
     - virtual-machines
     - image_profiles
     - clustering_architecture
     - resources_disk_id
     - storage_lvm_stripes
     - vm_boot_priority
     - unix_hotplug_devices
     - api_filtering
     - instance_nic_network
     - clustering_sizing
     - firewall_driver
     - projects_limits
     - container_syscall_intercept_hugetlbfs
     - limits_hugepages
     - container_nic_routed_gateway
     - projects_restrictions
     - custom_volume_snapshot_expiry
     - volume_snapshot_scheduling
     - trust_ca_certificates
     - snapshot_disk_usage
     - clustering_edit_roles
     - container_nic_routed_host_address
     - container_nic_ipvlan_gateway
     - resources_usb_pci
     - resources_cpu_threads_numa
     - resources_cpu_core_die
     - api_os
     - container_nic_routed_host_table
     - container_nic_ipvlan_host_table
     - container_nic_ipvlan_mode
     - resources_system
     - images_push_relay
     - network_dns_search
     - container_nic_routed_limits
     - instance_nic_bridged_vlan
     - network_state_bond_bridge
     - usedby_consistency
     - custom_block_volumes
     - clustering_failure_domains
     - resources_gpu_mdev
     - console_vga_type
     - projects_limits_disk
     - network_type_macvlan
     - network_type_sriov
     - container_syscall_intercept_bpf_devices
     - network_type_ovn
     - projects_networks
     - projects_networks_restricted_uplinks
     - custom_volume_backup
     - backup_override_name
     - storage_rsync_compression
     - network_type_physical
     - network_ovn_external_subnets
     - network_ovn_nat
     - network_ovn_external_routes_remove
     - tpm_device_type
     - storage_zfs_clone_copy_rebase
     - gpu_mdev
     - resources_pci_iommu
     - resources_network_usb
     - resources_disk_address
     - network_physical_ovn_ingress_mode
     - network_ovn_dhcp
     - network_physical_routes_anycast
     - projects_limits_instances
     - network_state_vlan
     - instance_nic_bridged_port_isolation
     - instance_bulk_state_change
     - network_gvrp
     - instance_pool_move
     - gpu_sriov
     - pci_device_type
     - storage_volume_state
     - network_acl
     - migration_stateful
     - disk_state_quota
     - storage_ceph_features
     - projects_compression
     - projects_images_remote_cache_expiry
     - certificate_project
     - network_ovn_acl
     - projects_images_auto_update
     - projects_restricted_cluster_target
     - images_default_architecture
     - network_ovn_acl_defaults
     - gpu_mig
     - project_usage
     - network_bridge_acl
     - warnings
     - projects_restricted_backups_and_snapshots
     - clustering_join_token
     - clustering_description
     - server_trusted_proxy
     - clustering_update_cert
     - storage_api_project
     - server_instance_driver_operational
     - server_supported_storage_drivers
     - event_lifecycle_requestor_address
     - resources_gpu_usb
     - clustering_evacuation
     - network_ovn_nat_address
     - network_bgp
     - network_forward
     - custom_volume_refresh
     - network_counters_errors_dropped
     - metrics
     - image_source_project
     - clustering_config
     - network_peer
     - linux_sysctl
     - network_dns
     - ovn_nic_acceleration
     - certificate_self_renewal
     - instance_project_move
     - storage_volume_project_move
     - cloud_init
     - network_dns_nat
     - database_leader
     - instance_all_projects
     - clustering_groups
     - ceph_rbd_du
     - instance_get_full
     - qemu_metrics
     - gpu_mig_uuid
     - event_project
     - clustering_evacuation_live
     - instance_allow_inconsistent_copy
     - network_state_ovn
     - storage_volume_api_filtering
     - image_restrictions
     - storage_zfs_export
     - network_dns_records
     - storage_zfs_reserve_space
     - network_acl_log
     - storage_zfs_blocksize
     - metrics_cpu_seconds
     - instance_snapshot_never
     - certificate_token
     - instance_nic_routed_neighbor_probe
     - event_hub
     - agent_nic_config
     - projects_restricted_intercept
     - metrics_authentication
     - images_target_project
     - cluster_migration_inconsistent_copy
     - cluster_ovn_chassis
     - container_syscall_intercept_sched_setscheduler
     - storage_lvm_thinpool_metadata_size
     - storage_volume_state_total
     - instance_file_head
     - instances_nic_host_name
     - image_copy_profile
     - container_syscall_intercept_sysinfo
     - clustering_evacuation_mode
     - resources_pci_vpd
     - qemu_raw_conf
     - storage_cephfs_fscache
     - network_load_balancer
     - vsock_api
     - instance_ready_state
     - network_bgp_holdtime
     - storage_volumes_all_projects
     - metrics_memory_oom_total
     - storage_buckets
     - storage_buckets_create_credentials
     - metrics_cpu_effective_total
     - projects_networks_restricted_access
     - storage_buckets_local
     - loki
     - acme
     - internal_metrics
     - cluster_join_token_expiry
     - remote_token_expiry
     - init_preseed
     - storage_volumes_created_at
     - cpu_hotplug
     - projects_networks_zones
     - network_txqueuelen
     - cluster_member_state
     - instances_placement_scriptlet
     - storage_pool_source_wipe
     - zfs_block_mode
     - instance_generation_id
     - disk_io_cache
     - amd_sev
     - storage_pool_loop_resize
     - migration_vm_live
     - ovn_nic_nesting
     - oidc
     - network_ovn_l3only
     - ovn_nic_acceleration_vdpa
     - cluster_healing
     - instances_state_total
     - auth_user
     - security_csm
     - instances_rebuild
     - numa_cpu_placement
     - custom_volume_iso
     - network_allocations
     - storage_api_remote_volume_snapshot_copy
     - zfs_delegate
     - operations_get_query_all_projects
     - metadata_configuration
     - syslog_socket
     - event_lifecycle_name_and_project
     - instances_nic_limits_priority
     - disk_initial_volume_configuration
     - operation_wait
     - cluster_internal_custom_volume_copy
     - disk_io_bus
     - storage_cephfs_create_missing
     - instance_move_config
     - ovn_ssl_config
     - init_preseed_storage_volumes
     - metrics_instances_count
     - server_instance_type_info
     - resources_disk_mounted
     - server_version_lts
     - oidc_groups_claim
     - loki_config_instance
     - storage_volatile_uuid
     - import_instance_devices
     - instances_uefi_vars
     - instances_migration_stateful
     - container_syscall_filtering_allow_deny_syntax
     - access_management
     - vm_disk_io_limits
     - storage_volumes_all
     - instances_files_modify_permissions
     - image_restriction_nesting
     - container_syscall_intercept_finit_module
     - device_usb_serial
     - network_allocate_external_ips
     - explicit_trust_token
     api_status: stable
     api_version: "1.0"
     auth: trusted
     public: false
     auth_methods:
     - tls
     auth_user_name: root
     auth_user_method: unix
     environment:
       addresses:
       architectures:
       - x86_64
       - i686
       driver: lxc | qemu
       driver_version: 6.0.0 | 8.2.1
       instance_types:
       - container
       - virtual-machine
       firewall: nftables
       kernel: Linux
       kernel_architecture: x86_64
       kernel_features:
         idmapped_mounts: "true"
         netnsid_getifaddrs: "true"
         seccomp_listener: "true"
         seccomp_listener_continue: "true"
         uevent_injection: "true"
         unpriv_fscaps: "true"
       kernel_version: 6.6.1+441-dmf
       lxc_features:
         cgroup2: "true"
         core_scheduling: "true"
         devpts_fd: "true"
         idmapped_mounts_v2: "true"
         mount_injection_file: "true"
         network_gateway_device_route: "true"
         network_ipvlan: "true"
         network_l2proxy: "true"
         network_phys_macvlan_mtu: "true"
         network_veth_router: "true"
         pidfd: "true"
         seccomp_allow_deny_syntax: "true"
         seccomp_notify: "true"
         seccomp_proxy_send_notify_fd: "true"
       os_name: Ubuntu
       os_version: "20.04"
       project: default
       server: lxd
       server_clustered: false
       server_event_mode: full-mesh
       server_name: hyper4
       server_pid: 113133
       server_version: "6.1"
       server_lts: false
       storage: btrfs
       storage_version: 5.16.2
       storage_supported_drivers:
       - name: btrfs
         version: 5.16.2
         remote: false
       - name: ceph
         version: 17.2.7
         remote: true
       - name: cephfs
         version: 17.2.7
         remote: true
       - name: cephobject
         version: 17.2.7
         remote: true
       - name: dir
         version: "1"
         remote: false
       - name: lvm
         version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.48.0
         remote: false
       - name: powerflex
         version: 1.16 (nvme-cli)
         remote: true
     
    

Issue description

The introduction of automatic core scheduling has led to a significant decrease in performance in our infrastructure, with weird problems that make no sense if you are not aware of this behaviour, such as:

  • a sudden drop in performance after VM creation or other seemingly insignificant events that operators aren't looking at when diagnosing performance issues local to one VM
  • spikes in steal time that make no sense, such as a persistent 25% steal time
  • generally unpredictable performance, where hypervisor load and VM steal time appear unrelated (or at least the relationship isn't visible unless you trace core scheduling decisions)

LXD's CPU placement doesn't seem to understand hardware topology, which is really surprising, considering that many new CPUs are asymmetric and that on the kernel side a lot of work is going into putting workloads on "the best core for the job", with features like AMD Preferred Core and its equivalents, or new CPU schedulers like EEVDF.
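
For reference, the data such a placement policy would need is already exposed by the kernel; a couple of illustrative reads (standard Linux sysfs paths, shown here for cpu0 only):

cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list   # CPUs sharing cpu0's L3 (on AMD parts this usually maps to a CCD/CCX)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list  # cpu0's SMT sibling(s)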

It seems odd to put these placement decisions in LXD and turn them on by default, without an off switch, apparently without considering that this might cause significant problems. LXD simply does not have enough data, and static round-robin placement is too simplistic.

From our perspective this is a significant design error, and we ask that the feature is either:

  1. reworked so that hardware topology is accurately picked up, including L3 cache differences, CCD layouts, preferred-core data, etc.,
  2. enhanced with an option to turn it completely off,
  3. turned off by default, or
  4. removed.

Additionally, snap's auto-update mechanism introduced this new feature into our infrastructure (which by itself is fine), and we'd ask you to consider that features like this are continuously applied to real workloads and, even outside the LTS series, should at least not be harmful.
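
For anyone in a similar position, snapd can hold automatic refreshes; an illustrative workaround (needs a reasonably recent snapd), not a fix for the placement behaviour itself:

snap refresh --hold lxd        # hold automatic refreshes of the lxd snap indefinitely
snap refresh --hold=720h lxd   # or hold them for a fixed duration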

Information to attach

Our current hardware topology has two L3 domains of different sizes. Our VMs run typical web application workloads. The core load balancing has put multiple CPU-bound vCPUs on the same physical core, leading to the weird steal time described above.

# lstopo
Machine (125GB total)
  Package L#0
    NUMANode L#0 (P#0 125GB)
    L3 L#0 (96MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    L3 L#1 (32MB)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#24)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#25)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#26)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#27)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#28)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#29)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#30)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#31)

pkramme commented Sep 19, 2024

We threw together a quick script to visualize that problem:

0:	seg18-app1, seg18-mysql1
1:	seg17-app1, seg18-lb1
2:	seg19-app1, seg19-redis1
3:	seg19-mysql1
4:	seg17-app1
5:	seg17-mysql1, seg18-redis1
6:	seg18-redis1, seg19-mysql1
7:	seg19-app1, seg19-mysql1
8:	seg18-app1, seg19-app1
9:	seg18-app1, seg19-app1
10:	seg17-lb1, seg18-mysql1
11:	seg19-mysql1
12:	seg19-app1, seg19-redis1
13:	seg18-app1, seg18-mysql1
14:	seg17-app1, seg17-redis1
15:	seg18-mysql1
16:	seg19-mysql1
17:	seg19-app1, seg19-lb1
18:	seg17-app1, seg17-redis1
19:	seg17-mysql1, seg19-app1
20:	seg18-app1, seg18-mysql1
21:	seg19-mysql1
22:	seg18-app1, seg18-mysql1
23:	seg19-mysql1
24:	seg19-mysql1
25:	seg17-lb1, seg18-mysql1
26:	seg18-mysql1
27:	seg17-app1, seg18-app1
28:	seg17-app1, seg18-app1
29:	seg17-app1, seg19-app1
30:	seg17-app1, seg18-lb1
31:	seg19-lb1

Cores:

  • 0-7, 16-23 are fast
  • 8-15, 24-31 are slow
  • 16-31 are the SMT siblings: CPU N+16 shares a physical core with CPU N

The general rule on this system is:

  • seg17 has practically no usage
  • seg18 and seg19 are really important

This core placement puts very latency-critical systems on the two SMT threads of the same physical core, while leaving systems with no real load on cores of their own. Even if this were a completely symmetrical CPU, and even if none of these logical CPUs shared a physical core, static placement would still waste resources when the VMs aren't equally loaded.
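
A rough sketch of how such a host-CPU-to-VM map can be produced (this is not our original script; it assumes the snap's lxc client, that lxc list --format csv reports state and type as RUNNING and VIRTUAL-MACHINE, and that lxc info prints the QEMU PID of a running VM):

#!/bin/sh
# For every running VM, print "cpus-allowed <tab> vm-name" for each QEMU thread,
# so host CPUs that carry threads of several busy VMs become visible.
for vm in $(lxc list --format csv -c n,s,t | awk -F, '$2 == "RUNNING" && $3 == "VIRTUAL-MACHINE" {print $1}'); do
    pid=$(lxc info "$vm" | awk '/^PID:/ {print $2}')
    for status in /proc/"$pid"/task/*/status; do
        awk -v vm="$vm" '/^Cpus_allowed_list:/ {print $2 "\t" vm}' "$status"
    done
done | sort -u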

@tomponline tomponline added the Bug Confirmed to be a bug label Sep 19, 2024
@tomponline tomponline added this to the lxd-6.2 milestone Sep 19, 2024

tomponline commented Sep 19, 2024

Thanks for your detailed report!

Yeah this was an area of concern originally:

Note: On systems that have mixed performance and efficiency cores (P+E) you may find that VM performance is decreased due to the way LXD now pins some of the VM’s vCPUs to efficiency cores rather than letting the Linux scheduler dynamically schedule them. You can use the explicit CPU pinning feature if needed to avoid this.

https://discourse.ubuntu.com/t/lxd-6-1-has-been-released/46259#vm-automatic-core-pinning-load-balancing
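
For reference, the explicit pinning mentioned in that note is done by giving limits.cpu a CPU set instead of a bare count; an illustrative example (instance name and CPU numbers are placeholders):

lxc config set <instance> limits.cpu 0-7,16-23   # explicit pinning: vCPUs are pinned to these host CPUs
lxc config set <instance> limits.cpu 8           # a bare count: 8 vCPUs, placement left to LXD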

But we are considering options 2 and 3 of your suggestions.


pkramme commented Sep 19, 2024

Thank you for your quick response! Would it be possible to get a patch for the 6.1 series that would give us the option to turn this off? Otherwise we'd have to write tooling to repin the VMs based on things like steal time or CPU pressure. We'd much rather just let the kernel handle it.


tomponline commented Sep 19, 2024

Thank you for your quick response! Would it be possible to get a patch for the 6.1 series that would give us the option to turn this off? Otherwise we'd have to write tooling to repin the VMs based on things like steal time or CPU pressure. We'd much rather just let the kernel handle it.

The latest 5.21/stable LTS series does not have this feature (on purpose, because it changes the default behaviour), so you could try that. It's more suitable for production purposes anyway, as latest/stable is the moving feature-release channel and doesn't support downgrades.

See https://documentation.ubuntu.com/lxd/en/latest/installing/#installing-release

The 6.1 release won't get patches now, but will be replaced by 6.2, 6.3, etc. Hopefully we can land the new settings in one of those two releases, but 6.2 is imminent, so it might not make it into that one.
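
For completeness, a fresh deployment can track the LTS series from the start (an example only; moving an existing host from 6.1 back to 5.21 would be a downgrade, which as noted above isn't supported):

snap install lxd --channel=5.21/stable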

@tomponline tomponline assigned hamistao and kadinsayani and unassigned hamistao Sep 19, 2024