Instance stuck in starting on dogfood #8206

@iximeow

i tried to create an instance on dogfood after updating this week and it ended up stuck in Starting.

switch 0 on dogfood seems to be in some kind of nebulous unhappy state (issue to come; the short of it is that we accidentally filled the switch zone with a core file while copying an old one out, and everything there went sideways). separately, the instance-start saga for this instance seems stuck in instance_start.dpd_ensure:

root@oxz_switch1:~# /tmp/omdb-saga db sagas show 727d4812-9383-4df3-985b-0c1bce68d5ad
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
WARN: found schema version 144.0.0, expected 7.0.0
It's possible the database is running a version that's different from what this
tool understands.  This may result in errors or incorrect output.
 id                                   | time_created                   | name           | state
--------------------------------------+--------------------------------+----------------+--------------------------
 727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.112472 UTC | instance-start | SagaCachedState(Running)

                             saga id | event time                     | node id                                  | event type | data
------------------------------------ | ------------------------------ | ---------------------------------------- | ---------- | ---
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.120631 UTC |  10: start                               | started    |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.126739 UTC |  10: start                               | succeeded  |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.130584 UTC |   0: instance_start.generate_propolis_id | started    |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.134919 UTC |   0: instance_start.generate_propolis_id | succeeded  | "b5bf8281-09fc-43e1-b12c-c91c0bb18543"
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.138944 UTC |   1: instance_start.alloc_server         | started    |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.179417 UTC |   1: instance_start.alloc_server         | succeeded  | "b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88"
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.183787 UTC |   2: instance_start.alloc_propolis_ip    | started    |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.193104 UTC |   2: instance_start.alloc_propolis_ip    | succeeded  | "fd00:1122:3344:106::1:9b7"
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.196560 UTC |   3: instance_start.create_vmm_record    | started    |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.205571 UTC |   3: instance_start.create_vmm_record    | succeeded  | {"id":"b5bf8281-09fc-43e1-b12c-c91c0bb18543","instance_id":"bf2e1d9e-fcb4-47fe-9cc5-c2e9a268fda4","propolis_ip":"fd00:1122:3344:106::1:9b7/128","propolis_port":12400,"runtime":{"gen":1,"state":"Creating","time_state_updated":"2025-05-23T00:24:21.200216Z"},"sled_id":"b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88","time_created":"2025-05-23T00:24:21.200216Z","time_deleted":null}
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.208748 UTC |   4: instance_start.mark_as_starting     | started    |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.314530 UTC |   4: instance_start.mark_as_starting     | succeeded  | {"auto_restart":{"cooldown":null,"policy":null},"boot_disk_id":null,"hostname":"ixi-600g-mem","identity":{"description":"beeeeeg memory (shouldn't panic a sled, probably)","id":"bf2e1d9e-fcb4-47fe-9cc5-c2e9a268fda4","name":"ixi-600g-mem","time_created":"2025-05-23T00:24:19.204040Z","time_deleted":null,"time_modified":"2025-05-23T00:24:19.204040Z"},"intended_state":"Running","memory":644245094400,"ncpus":2,"project_id":"9c4152f9-4317-4269-9018-66142964d21c","runtime_state":{"dst_propolis_id":null,"gen":3,"migration_id":null,"nexus_state":"Vmm","propolis_id":"b5bf8281-09fc-43e1-b12c-c91c0bb18543","time_last_auto_restarted":null,"time_updated":"2025-05-23T00:24:19.204040Z"},"updater_gen":1,"updater_id":null,"user_data":[]}
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.318499 UTC |   5: instance_start.dpd_ensure           | started    |
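
for spotting this kind of thing faster in a long saga log, here is a minimal sketch that scans the event table for nodes with a `started` row and no terminal row. it is not part of omdb: `unfinished_nodes` and `TERMINAL` are hypothetical names, the column layout is assumed to match the output above, and `failed` as a terminal event type is an assumption (only `started` and `succeeded` appear here).

```python
from collections import OrderedDict

# Event types treated as "this node is done". Only "succeeded" appears in the
# output above; "failed" is an assumption made for this sketch.
TERMINAL = {"succeeded", "failed"}


def unfinished_nodes(table_text):
    """Return node labels that have a 'started' row but no terminal row."""
    done_by_node = OrderedDict()
    for line in table_text.splitlines():
        # Columns: saga id | event time | node id | event type | data.
        # maxsplit=4 keeps any '|' inside the data column from adding columns.
        cols = [c.strip() for c in line.split("|", 4)]
        if len(cols) < 4:
            continue
        node, event = cols[2], cols[3]
        if event == "started":
            done_by_node.setdefault(node, False)
        elif event in TERMINAL:
            done_by_node[node] = True
    return [node for node, done in done_by_node.items() if not done]


# A few rows copied from the saga output above.
sample = """\
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.120631 UTC |  10: start                               | started    |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.126739 UTC |  10: start                               | succeeded  |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.130584 UTC |   0: instance_start.generate_propolis_id | started    |
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.134919 UTC |   0: instance_start.generate_propolis_id | succeeded  | "b5bf8281-09fc-43e1-b12c-c91c0bb18543"
727d4812-9383-4df3-985b-0c1bce68d5ad | 2025-05-23 00:24:21.318499 UTC |   5: instance_start.dpd_ensure           | started    |
"""

print(unfinished_nodes(sample))  # ['5: instance_start.dpd_ensure']
```

header and separator rows fall through untouched because their event column is neither `started` nor a terminal type.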

very unfortunately, enough of the instance's state had already been set up (VMM record created, instance marked as starting) that we began by looking for a Propolis issue, and came up blank for a while, even though omdb makes it look convincingly like one:

root@oxz_switch1:~# omdb db instance info bf2e1d9e-fcb4-47fe-9cc5-c2e9a268fda4
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (144.0.0)

== INSTANCE ====================================================================
                        ID: bf2e1d9e-fcb4-47fe-9cc5-c2e9a268fda4
                project ID: 9c4152f9-4317-4269-9018-66142964d21c
                      name: ixi-600g-mem
               description: beeeeeg memory (shouldn't panic a sled, probably)
                created at: 2025-05-23 00:24:19.204040 UTC
          last modified at: 2025-05-23 00:24:19.204040 UTC

== CONFIGURATION ===============================================================
                     vCPUs: 2
                    memory: 600 GiB
                  hostname: ixi-600g-mem
                 boot disk: None
              auto-restart:
                  InstanceAutoRestart {
                      policy: None,
                      cooldown: None,
                  }

== RUNTIME STATE ===============================================================
               nexus state: Vmm
(i)     external API state: Starting
            intended state: running
           last updated at: 2025-05-23T00:24:19.204040Z (generation 3)
       needs reincarnation: false
             karmic status: saṃsāra (reincarnation enabled)
      last reincarnated at: None
             active VMM ID: Some(b5bf8281-09fc-43e1-b12c-c91c0bb18543)
             target VMM ID: None
              migration ID: None
              updater lock: UNLOCKED at generation: 1

== ACTIVE VMM ==================================================================
                        ID: b5bf8281-09fc-43e1-b12c-c91c0bb18543
               instance ID: bf2e1d9e-fcb4-47fe-9cc5-c2e9a268fda4
                created at: 2025-05-23 00:24:21.200216 UTC
                     state: creating
                updated at: 2025-05-23T00:24:21.200216Z (generation 1)
          propolis address: fd00:1122:3344:106::1:9b7:12400
                   sled ID: b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88

at the very least, we probably should have timed out and failed the instance start at some point?
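
to make the ask concrete, here is a rough sketch of the "bound the step with a deadline and fail it" idea. it is a Python illustration only, not how Nexus sagas are implemented (those are Rust), and it ignores saga unwinding entirely. `run_step_with_deadline`, `StepTimedOut`, and `hung_dpd_ensure` are hypothetical names, and the 0.05s timeout is just there so the example finishes quickly.

```python
import asyncio


class StepTimedOut(Exception):
    """Raised when a step does not finish within its deadline."""


async def run_step_with_deadline(name, step, timeout_secs):
    # `step` is a zero-argument coroutine function, a hypothetical stand-in
    # for a saga action such as dpd_ensure.
    try:
        return await asyncio.wait_for(step(), timeout=timeout_secs)
    except asyncio.TimeoutError:
        raise StepTimedOut(
            f"{name} did not finish within {timeout_secs}s"
        ) from None


async def hung_dpd_ensure():
    # Stand-in for a request to a service that never answers.
    await asyncio.Event().wait()


async def main():
    try:
        await run_step_with_deadline("dpd_ensure", hung_dpd_ensure, 0.05)
    except StepTimedOut as err:
        # A real saga would unwind here; this sketch only reports the failure.
        return f"failed: {err}"
    return "succeeded"


result = asyncio.run(main())
print(result)  # failed: dpd_ensure did not finish within 0.05s
```

the point is only that the caller gets an error it can act on instead of waiting forever; what the right deadline is, and whether to retry before failing, is a separate question.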
