lxc start fails despite stopped state #13453

Closed
holmanb opened this issue May 6, 2024 · 10 comments
Labels
Bug Confirmed to be a bug
Milestone
lxd-6.1

Comments

@holmanb
Member

holmanb commented May 6, 2024

Required information

  • Distribution: Ubuntu
  • Distribution version: all
  • The output of "snap list --all lxd core20 core22 core24 snapd":
Name    Version         Rev    Tracking       Publisher   Notes
core22  20240111        1122   latest/stable  canonical✓  base,disabled
core22  20240408        1380   latest/stable  canonical✓  base
lxd     5.21.1-d46c406  28460  5.21/stable    canonical✓  -
snapd   2.61.2          21184  latest/stable  canonical✓  snapd,disabled
snapd   2.62            21465  latest/stable  canonical✓  snapd
  • The output of "lxc info" or if that fails:
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- storage_api_remote_volume_snapshot_copy
- zfs_delegate
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- init_preseed_storage_volumes
- metrics_instances_count
- server_instance_type_info
- resources_disk_mounted
- server_version_lts
- oidc_groups_claim
- loki_config_instance
- storage_volatile_uuid
- import_instance_devices
- instances_uefi_vars
- instances_migration_stateful
- container_syscall_filtering_allow_deny_syntax
- access_management
- vm_disk_io_limits
- storage_volumes_all
- instances_files_modify_permissions
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: holmanb
auth_user_method: unix
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIB3TCCAWOgAwIBAgIQBECwI03mRzmVUgAKkav1ejAKBggqhkjOPQQDAzAiMQww
    CgYDVQQKEwNMWEQxEjAQBgNVBAMMCXJvb3RAaml2ZTAeFw0yNDA1MDYwODMxNDFa
    Fw0zNDA1MDQwODMxNDFaMCIxDDAKBgNVBAoTA0xYRDESMBAGA1UEAwwJcm9vdEBq
    aXZlMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEC2ZKsbCIKcuVrBWpLCY8eaL13dBc
    bro6wgVAg4014UeIBfDpmNKb/mJKKt/DxlRIq9/w7kvxMHHpLa9+NPB+pr6H/R51
    Vcz24YlY7Gp+almRBnWJIVjBT2tbFUjp+0lco14wXDAOBgNVHQ8BAf8EBAMCBaAw
    EwYDVR0lBAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAnBgNVHREEIDAeggRq
    aXZlhwR/AAABhxAAAAAAAAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMDKI
    vl5f2CGnF+m6AilFUAEIYZk+HYL6zkFlc+vWBVAVqxRiuQu0AkrqWKa0j1hgxAIx
    AOdkEP6KYCB3XcXH6cw6b9o+yZkCB2S8lQnKyk1lH76dHCx8e2Ivu/mhiYRfr9lz
    Sw==
    -----END CERTIFICATE-----
  certificate_fingerprint: 4a5b695606afb8fa1b37f921e76fc748f910a84e6d21c22a868b8dc7c9b80090
  driver: lxc | qemu
  driver_version: 6.0.0 | 8.2.1
  instance_types:
  - container
  - virtual-machine
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.8.0-31-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "24.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: jive
  server_pid: 3915
  server_version: 5.21.1
  server_lts: true
  storage: dir
  storage_version: "1"
  storage_supported_drivers:
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.48.0
    remote: false
  - name: powerflex
    version: 1.16 (nvme-cli)
    remote: true
  - name: zfs
    version: 2.2.2-0ubuntu9
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false
  - name: ceph
    version: 17.2.7
    remote: true
  - name: cephfs
    version: 17.2.7
    remote: true
  - name: cephobject
    version: 17.2.7
    remote: true

Issue description

When an instance has recently shut down and reports a state of STOPPED, an attempt to start the instance may fail with an error: The instance is already running. I observe this occasionally while running commands manually, but the behavior never bothered me enough to report it. I've now found that it prevents us from running certain tests in our test framework, so I'm reporting it.

Steps to reproduce

I see this in an integration test that fails intermittently, and I can reproduce it locally by running the test in a loop. A trivial reproducer could launch an instance, shut it back down (our test uses guest-initiated shutdown), wait for the STOPPED state, and then run lxc start.
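
A minimal sketch of such a reproducer (the ubuntu:24.04 image and the instance name are assumptions; the full script I ended up using is in a comment below):

#!/bin/sh
# hedged sketch of a trivial reproducer; image and instance name are assumptions
NAME=repro
lxc launch ubuntu:24.04 "$NAME"

# wait until dbus is available in the guest (required for shutdown to work)
lxc exec "$NAME" -- sh -c 'while [ ! -S /run/dbus/system_bus_socket ]; do sleep 0.1; done'

# guest-initiated shutdown
lxc exec "$NAME" -- shutdown -H now

# wait for the reported state to become STOPPED
while [ "$(lxc list "$NAME" -cs --format csv)" != "STOPPED" ]; do
    sleep 0.1
done

# this occasionally fails with "The instance is already running"
lxc start "$NAME"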

Information to attach

Note that messages about instance shutdown appear both before and after the failed start, even though the code that initiates the lxc start does not run until the reported instance state is STOPPED.

@holmanb
Member Author

holmanb commented May 7, 2024

@tomponline thanks for brainstorming this issue earlier. I've collected some additional information to ease debugging and discovered a couple more relevant things:

  1. Like you suspected, immediately after the lxc start fails, lxc list shows the state RUNNING.

  2. This does not reproduce when running lxc stop manually rather than a guest-initiated shutdown

The following script reproduces the error while collecting lxc monitor logs.

For me this fails reliably in about 5 seconds.

#!/bin/sh

NAME=me
LOG=./reproduced.log

# stop if started
lxc stop $NAME 2> /dev/null || true

while true; do

    # monitor, wipe log if repro failed
    (lxc monitor "$NAME" --pretty > "$LOG") &

    echo "starting instance"
    lxc start $NAME

    # wait until dbus is available (required for shutdown to work)
    lxc exec $NAME -- sh -c 'while [ ! -S /run/dbus/system_bus_socket ]; do sleep 0.1; done'

    # shutdown from the host
    echo "shutting down the instance"
    lxc exec $NAME -- shutdown -H now
    while true; do

        status=$(lxc list "$NAME" -cs --format csv)
        if [ "$status" = "STOPPED" ]; then

            # this will fail sometimes
            lxc start "$NAME"
            rc="$?"
            if [ "$rc" -ne 0 ]; then
                echo "lxc start failed"
                rerun="$(lxc list -cs --format csv "$NAME")"
                echo "lxc state immediately after failure: $rerun"
                exit "$rc"
            fi

            # no error, retry
            echo "did not repro, retrying"
            lxc stop $NAME
            break
        fi
    done
done

@holmanb
Member Author

holmanb commented May 7, 2024

And a monitor log using --pretty:

reproduced.log

@tomponline tomponline added this to the lxd-6.1 milestone May 8, 2024
@tomponline
Member

  • Like you suspected, immediately after the lxc start fails, lxc list shows the state RUNNING.

  • This does not reproduce when running lxc stop manually rather than a guest-initiated shutdown

Perfect, thanks. That confirms my suspicion: there is a tiny window in which liblxc reports the container's state as stopped before it has notified LXD that the guest has self-stopped. That notification triggers LXD's stop cleanup operation, and while that operation runs LXD reports the container's status as running.

@MggMuggins
Contributor

MggMuggins commented Nov 4, 2024

Thanks for the script, Brett.

When shutting down a container, LXC sets the container state to STOPPING, runs the on-stop hook, and then closes the command socket (to avoid a different but similar race). When the command socket is closed, get_state (from a client) returns STOPPED instead of an error.

If GET /1.0/instances/{name}/state starts a get_state command before the stop operation has been set up, and lxc closes the command socket before it finishes responding to the request, the client ends up with STOPPED instead of STOPPING. The next request then picks up the operation and reports RUNNING until the operation has completed.
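
For illustration, the flip can in principle be observed from the client side by polling the state endpoint directly; a rough observer sketch (the instance name and the use of jq are assumptions) that would log each distinct status the API reports:

#!/bin/sh
# illustrative observer only: log every distinct status reported by
# /1.0/instances/{name}/state; instance name and jq are assumptions
NAME=me
prev=""
while true; do
    status="$(lxc query "/1.0/instances/${NAME}/state" | jq -r .status)"
    if [ "$status" != "$prev" ]; then
        echo "$(date +%H:%M:%S.%N) status: $status"
        prev="$status"
    fi
done

Around a guest-initiated shutdown this can briefly report the stopped status, then running while the stop operation completes, then stopped again.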

The LXC log for the container with raw.lxc: lxc.log.level = 0 makes this look like a race in LXC:

lxc me 20241104211249.598 TRACE    start - ../src/lxc/start.c:lxc_serve_state_clients:484 - Set container state to STOPPING
lxc me 20241104211249.598 TRACE    start - ../src/lxc/start.c:lxc_serve_state_clients:487 - No state clients registered
lxc me 20241104211249.598 TRACE    start - ../src/lxc/start.c:lxc_expose_namespace_environment:907 - Set environment variable LXC_USER_NS=/proc/59272/fd/18
lxc me 20241104211249.598 TRACE    start - ../src/lxc/start.c:lxc_expose_namespace_environment:907 - Set environment variable LXC_MNT_NS=/proc/59272/fd/19
lxc me 20241104211249.598 TRACE    start - ../src/lxc/start.c:lxc_expose_namespace_environment:907 - Set environment variable LXC_PID_NS=/proc/59272/fd/20
lxc me 20241104211249.598 TRACE    start - ../src/lxc/start.c:lxc_expose_namespace_environment:907 - Set environment variable LXC_UTS_NS=/proc/59272/fd/21
lxc me 20241104211249.598 TRACE    start - ../src/lxc/start.c:lxc_expose_namespace_environment:907 - Set environment variable LXC_IPC_NS=/proc/59272/fd/22
lxc me 20241104211249.598 TRACE    start - ../src/lxc/start.c:lxc_expose_namespace_environment:907 - Set environment variable LXC_NET_NS=/proc/59272/fd/4
lxc me 20241104211249.598 TRACE    start - ../src/lxc/start.c:lxc_expose_namespace_environment:907 - Set environment variable LXC_CGROUP_NS=/proc/59272/fd/23
lxc me 20241104211249.598 INFO     utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/home/wesley/.local/go/bin/lxd callhook /var/lib/lxd "default" "me" stopns" for container "me"
lxc me 20241104211249.598 TRACE    utils - ../src/lxc/utils.c:run_script_argv:633 - Set environment variable: LXC_HOOK_TYPE=stop
lxc me 20241104211249.598 TRACE    utils - ../src/lxc/utils.c:run_script_argv:638 - Set environment variable: LXC_HOOK_SECTION=lxc
lxc me 20241104211249.715 TRACE    cgfsng - ../src/lxc/cgroups/cgfsng.c:cgroup_tree_remove:491 - Removed cgroup tree 10(lxc.payload.me)
lxc me 20241104211249.715 TRACE    cgfsng - ../src/lxc/cgroups/cgfsng.c:__cgroup_tree_create:726 - Reusing 10(lxc.pivot) cgroup
lxc me 20241104211249.715 TRACE    cgfsng - ../src/lxc/cgroups/cgfsng.c:__cgroup_tree_create:741 - Opened cgroup lxc.pivot as 3
lxc me 20241104211249.731 TRACE    cgfsng - ../src/lxc/cgroups/cgfsng.c:cgfsng_monitor_destroy:927 - Removed cgroup tree 10(lxc.monitor.me)
lxc me 20241104211249.731 TRACE    start - ../src/lxc/start.c:lxc_end:964 - Closed command socket
lxc 20241104211249.731 ERROR    af_unix - ../src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc me 20241104211249.731 TRACE    start - ../src/lxc/start.c:lxc_end:975 - Set container state to "STOPPED"
lxc 20241104211249.731 ERROR    commands - ../src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_state"
lxc me 20241104211249.749 INFO     utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/usr/share/lxcfs/lxc.reboot.hook" for container "me"
lxc me 20241104211249.749 TRACE    utils - ../src/lxc/utils.c:run_script_argv:633 - Set environment variable: LXC_HOOK_TYPE=post-stop
lxc me 20241104211249.749 TRACE    utils - ../src/lxc/utils.c:run_script_argv:638 - Set environment variable: LXC_HOOK_SECTION=lxc
lxc me 20241104211250.255 INFO     utils - ../src/lxc/utils.c:run_script_argv:590 - Executing script "/home/wesley/.local/go/bin/lxd callhook /var/lib/lxd "default" "me" stop" for container "me"
lxc me 20241104211250.255 TRACE    utils - ../src/lxc/utils.c:run_script_argv:633 - Set environment variable: LXC_HOOK_TYPE=post-stop
lxc me 20241104211250.255 TRACE    utils - ../src/lxc/utils.c:run_script_argv:638 - Set environment variable: LXC_HOOK_SECTION=lxc

The socket is closed partway through handling of the get_state command.

If lxc could close the socket and wait for existing requests to complete before continuing with the container stop, then we wouldn't see this behavior. @mihalicyn might disagree, but I doubt that this is super straightforward/possible. My naive attempt with setsockopt(fd, SOL_SOCKET, SO_LINGER, ...) was not sufficient.

I've done a little testing and this doesn't appear to impact VM instances.

@holmanb
Member Author

holmanb commented Nov 12, 2024

Thanks for digging into this @MggMuggins!

The socket is closed partway through handling of the get_state command.

If lxc could close the socket and wait for existing requests to complete before continuing with the container stop, then we wouldn't see this behavior. @mihalicyn might disagree, but I doubt that this is super straightforward/possible.

Do you think that this deserves an upstream bug report to lxc?

My naive attempt with setsockopt(fd, SOL_SOCKET, SO_LINGER, ...) was not sufficient.

Digging around in lxc's source code, it looks like some timeouts are set as well - I'm not sure whether they have an effect here. I'd be curious to take a look if you have a public copy of your effort.

MggMuggins added a commit to MggMuggins/lxc that referenced this issue Nov 13, 2024
This doesn't work (I think) because SO_LINGER only prevents queued
messages from being dropped; lxc hasn't queued a response to the client's
request yet, so the race still exists.

See canonical/lxd#13453

Signed-off-by: Wesley Hershberger <wesley.hershberger@canonical.com>
@MggMuggins
Contributor

When I say naive, I mean really naive: MggMuggins/lxc@0efd5c6

I considered a bug report but dismissed it since lxc does report a consistent transition from RUNNING -> STOPPING -> STOPPED; with fresh eyes today that seems like a bad excuse. I'll see if I can throw something together.

@holmanb
Member Author

holmanb commented Nov 14, 2024

When I say naive, I mean really naive: MggMuggins/lxc@0efd5c6

:-)

I considered a bug report but dismissed it since lxc does report a consistent transition from RUNNING -> STOPPING -> STOPPED; with fresh eyes today that seems like a bad excuse. I'll see if I can throw something together.

Sounds great, thanks for digging further. Please let me know how it goes either way.

MggMuggins added a commit to MggMuggins/lxd that referenced this issue Nov 14, 2024
Fixes canonical#13453

Signed-off-by: Wesley Hershberger <wesley.hershberger@canonical.com>
@MggMuggins
Contributor

MggMuggins commented Nov 14, 2024

I think my "fresh eyes" were just "poor memory eyes"... 😅. I got as far as our ask of upstream:

In an ideal world, lxc would partially close the command socket (shutdown(fd, SHUT_WR)?) and then block until all requests have been handled before continuing with the shutdown process.

But that doesn't resolve this; it just punts the race down the road a ways. Even if lxc doesn't interrupt get_state it will still pass back a status that could be stale.

Fundamentally, LXD (and clients) cannot make race-free decisions based on the state of instances because LXD does not maintain a canonical source for instance state; it is a middle-man between lxc|qemu and the client. We've run into this a few times recently with other features; it would be useful to have a more robust system for making decisions based on instance state. For what it's worth, I would consider it a dependency for some of our longer-term roadmap.
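
As an illustration of what that means for clients today, a harness that needs start-after-stop essentially has to tolerate a stale status and retry; a rough, purely client-side sketch (the instance name and retry budget are arbitrary assumptions, and this is not the LXD-side change):

#!/bin/sh
# client-side workaround sketch only, not the LXD fix: retry `lxc start` a few
# times in case it raced with the in-flight stop operation
NAME=me
attempts=0
until lxc start "$NAME"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 10 ]; then
        echo "giving up after $attempts attempts" >&2
        exit 1
    fi
    sleep 0.5
done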

Without a bunch of design work I don't think it's feasible to truly fix this. However, checking for an ongoing operation after lxc returns significantly reduces the likelihood that get_state and the stop hook interleave. I've opened #14463 with this change.

I suspect that my initial assessment WRT VMs was wrong, they are likely affected by a similar race.

tomponline added a commit that referenced this issue Nov 16, 2024
If an instance self-stops while `statusCode()` is waiting for
`getLxcState()` to finish, `statusCode()` may return a stale instance
state.

This PR is a workaround for the use-case in #13453 and significantly
reduces the likelihood that `statusCode` returns a stale status.

In an ideal world, LXD would maintain a canonical cluster-wide view of
instance state. This would allow making race-free decisions based on
whether an instance is running or not. For example:
- Project CPU/RAM limits could be enforced at instance start instead of
at instance creation
- Volumes with content-type block could be attached to more than one
instance without `security.shared`; instance start could fail if another
instance with any shared block volumes is already running.
@holmanb
Member Author

holmanb commented Nov 16, 2024

Thanks so much for working on this @MggMuggins and @tomponline! I'd like to get an idea of how frequently this race still occurs. In your testing, did the reproducer still trigger eventually with the fix applied?

@MggMuggins
Contributor

I got 15-20 iterations with no race; for comparison, the script always reproduced it on the first try for me. I didn't see it again after the fix.

tomponline pushed a commit to tomponline/lxd that referenced this issue Dec 4, 2024
Fixes canonical#13453

Signed-off-by: Wesley Hershberger <wesley.hershberger@canonical.com>
(cherry picked from commit a7e88b0)
tomponline pushed a commit to tomponline/lxd that referenced this issue Dec 9, 2024
Fixes canonical#13453

Signed-off-by: Wesley Hershberger <wesley.hershberger@canonical.com>
(cherry picked from commit a7e88b0)