Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

oxide instance stop -> Instance stopping forever #1126

Open
smklein opened this issue May 27, 2022 · 0 comments
Open

oxide instance stop -> Instance stopping forever #1126

smklein opened this issue May 27, 2022 · 0 comments
Labels
bug Something that isn't working. Sled Agent Related to the Per-Sled Configuration and Management virtualization Propolis Integration & VM Management

Comments

@smklein
Copy link
Collaborator

smklein commented May 27, 2022

"I wish I knew how to quit you"

For some context, see: #1124

Basically, when running a looping script of "instance start, stop, destroy" in a loop, after ~500 iterations I saw a hang in the request to instance stop.

From the CLI:

+ oxide instance stop --confirm --organization myorg --project myproject myinstance529                                                                                                                                                        
⡀ Waiting for instance status to be `stopped` 
^C

 $ oxide instance get --organization myorg --project myproject myinstance529
 time_run_state_updated | 14 hours ago                         
 time_modified          | 14 hours ago                         
 time_created           | 14 hours ago                         
 run_state              | stopping                             
 project_id             | 96812850-4387-4b4a-8c15-e7b405d588f5 
 ncpus                  | 2                                    
 memory                 | 1073741824                           
 hostname               | myinstance                           
 description            | description                          
 name                   | myinstance529                        
 id                     | cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb

The Propolis zone still appears to exist.

Propolis logs, within the propolis zone:

$ cat $(svcs -L propolis-server)
May 27 00:08:01.536 INFO Calling initialize_virtio_block, uri: /instance, method: PUT, req_id: 687b2b06-85a8-4de1-b709-8b1d74067b71, remote_addr: [fd00:1122:3344:101::1]:54316, local_addr: [fd00:1122:3344:101::227]:12400
May 27 00:08:01.537 INFO cloud-init disk created, uri: /instance, method: PUT, req_id: 687b2b06-85a8-4de1-b709-8b1d74067b71, remote_addr: [fd00:1122:3344:101::1]:54316, local_addr: [fd00:1122:3344:101::227]:12400
May 27 00:08:01.537 INFO request completed, response_code: 201, uri: /instance, method: PUT, req_id: 687b2b06-85a8-4de1-b709-8b1d74067b71, remote_addr: [fd00:1122:3344:101::1]:54316, local_addr: [fd00:1122:3344:101::227]:12400
May 27 00:08:01.537 INFO request completed, response_code: 204, uri: /instance/state, method: PUT, req_id: d1c4f129-6f8c-41b7-bf56-3640841a0ecf, remote_addr: [fd00:1122:3344:101::1]:54316, local_addr: [fd00:1122:3344:101::227]:12400
May 27 00:08:01.537 INFO Instance transition, state_target: Some(Run), state_prev: Initialize, state: Boot, task: instance-driver, component: vmm
May 27 00:08:01.538 INFO Instance transition, state_target: Some(Run), state_prev: Boot, state: Run, task: instance-driver, component: vmm
May 27 00:08:01.538 INFO request completed, response_code: 200, uri: /instance/state-monitor, method: GET, req_id: 9e7c3427-ca12-4f2a-bc22-91568716372d, remote_addr: [fd00:1122:3344:101::1]:54316, local_addr: [fd00:1122:3344:101::227]:12400
May 27 00:08:01.636 INFO unhandled PIO, bytes: 4, port: 44800, op: out, sync_task: vcpu-0, component: vmm
May 27 00:08:01.636 INFO unhandled PIO, bytes: 4, port: 44800, op: out, sync_task: vcpu-0, component: vmm
May 27 00:08:01.636 INFO unhandled PIO, bytes: 1, port: 44805, op: out, sync_task: vcpu-0, component: vmm
May 27 00:08:01.636 INFO unhandled PIO, bytes: 4, port: 44800, op: in, sync_task: vcpu-0, component: vmm
May 27 00:08:01.642 INFO accepted connection, remote_addr: [fd00:1122:3344:101::1]:49044, local_addr: [fd00:1122:3344:101::227]:12400
May 27 00:08:01.642 INFO Instance transition, state_target: Some(Halt), state_prev: Run, state: Quiesce, task: instance-driver, component: vmm
May 27 00:08:01.642 INFO request completed, response_code: 204, uri: /instance/state, method: PUT, req_id: 2d0a2675-d427-41f0-ac56-fd09dc1e7370, remote_addr: [fd00:1122:3344:101::1]:49044, local_addr: [fd00:1122:3344:101::227]:12400

Sled Agent logs:

[2022-05-27T00:08:01.540212385Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Observing new propolis state: Running (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:01.540561549Z]  INFO: SledAgent/InstanceManager/6692 on thelio: New state: Running, action: None (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:01.555240667Z]  INFO: SledAgent/InstanceManager/6692 on thelio: GET request to http://[fd00:1122:3344:101::227]:12400/instance/state-monitor (propolis_client address=[fd00:1122:3344:101::227]:12400, instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:01.642352035Z]  INFO: SledAgent/dropshot (SledAgent)/6692 on thelio: accepted connection (local_addr=[fd00:1122:3344:101::1]:12345, remote_addr=[fd00:1122:3344:101::3]:64101)
[2022-05-27T00:08:01.642892056Z]  INFO: SledAgent/InstanceManager/6692 on thelio: instance_ensure cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb -> InstanceRuntimeStateRequested { run_state: Stopped, migration_params: None }
[2022-05-27T00:08:01.643274629Z]  INFO: SledAgent/InstanceManager/6692 on thelio: instance already exists
[2022-05-27T00:08:01.643642965Z]  INFO: SledAgent/InstanceManager/6692 on thelio: transition to InstanceRuntimeStateRequested { run_state: Stopped, migration_params: None }; action: Stop (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:01.644092145Z]  INFO: SledAgent/InstanceManager/6692 on thelio: Taking action: Stop (instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:01.644521605Z]  INFO: SledAgent/InstanceManager/6692 on thelio: PUT request to http://[fd00:1122:3344:101::227]:12400/instance/state (propolis_client address=[fd00:1122:3344:101::227]:12400, instance_id=cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb)
[2022-05-27T00:08:01.645027377Z]  INFO: SledAgent/dropshot (SledAgent)/6692 on thelio: request completed (req_id=776154d6-a921-42f4-8f5d-b4e182f56b59, uri=/instances/cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb, method=PUT, remote_addr=[fd00:1122:3344:101::3]:64101, local_addr=[fd00:1122:3344:101::1]:12345, response_code=200)
[2022-05-27T00:21:03.111197687Z]  INFO: SledAgent/dropshot (SledAgent)/6692 on thelio: accepted connection (local_addr=[fd00:1122:3344:101::1]:12345, remote_addr=[fd00:1122:3344:101::3]:45080)
[2022-05-27T00:21:03.111581481Z]  INFO: SledAgent/InstanceManager/6692 on thelio: instance_ensure cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb -> InstanceRuntimeStateRequested { run_state: Stopped, migration_params: None }
[2022-05-27T00:21:03.111849981Z]  INFO: SledAgent/InstanceManager/6692 on thelio: instance already exists
[2022-05-27T00:21:03.112111437Z]  INFO: SledAgent/dropshot (SledAgent)/6692 on thelio: request completed (req_id=b9aebcf4-d90c-463b-b0d0-5ca21db6e9c3, uri=/instances/cd7cbd04-5507-4c53-a1e0-1f1e6016e7bb, method=PUT, remote_addr=[fd00:1122:3344:101::3]:45080, local_addr=[fd00:1122:3344:101::1]:12345, response_code=200)
@smklein smklein added bug Something that isn't working. Sled Agent Related to the Per-Sled Configuration and Management virtualization Propolis Integration & VM Management labels May 27, 2022
leftwo pushed a commit that referenced this issue Feb 9, 2024
Crucible changes:
Remove unused fields in IOop (#1149)
New downstairs clone subcommand. (#1129)
Simplify the do_work_task loop (#1150)
Move `Guest` stuff into a module (#1125)
Bump nix to 0.27.1 and use new safer Fd APIs (#1110)
Move `FramedWrite` work to a separate task (#1145)
Use fewer borrows in ExtentInner API (#1147)
Update Rust crate reedline to 0.28.0 (#1141)
Update Rust crate tokio to 1.36 (#1143)
Update Rust crate slog-bunyan to 2.5.0 (#1139)
Update Rust crate rayon to 1.8.1 (#1138)
Update Rust crate itertools to 0.12.1 (#1137)
Update Rust crate byte-unit to 5.1.4 (#1136)
Update Rust crate base64 to 0.21.7 (#1135)
Update Rust crate async-trait to 0.1.77 (#1134)
Discard deferred msgs (#1131)
Minor Downstairs cleanup (#1127)
Update test_fail_live_repair to support pstop (#1128)
Ignore client messages after stopping the IO task (#1126)
Move client IO task into a struct (#1124)
Bump Rust to 1.75 and fix new Clippy lints (#1123)

Propolis changes:
PHD: convert to async (#633)
PHD: assume specialized Windows images (#636)
propolis-standalone-config needn't be a crate
standalone: Use tar for snapshot/restore
phd: use latest "lab-2.0-opte" target, not a specific version (#637)
PHD: add tests for migration of running processes (#623)
PHD: fix `cargo xtask phd` tidy not doing anything (#630)
PHD: add documentation for `cargo xtask phd` (#629)
standalone: improve virtual device creation errors (#632)
phd: add Windows Server 2019 guest adapter (#627)
PHD: add `cargo xtask phd` to make using PHD nicer (#619)
leftwo added a commit that referenced this issue Feb 9, 2024
Crucible changes:
Remove unused fields in IOop (#1149)
New downstairs clone subcommand. (#1129)
Simplify the do_work_task loop (#1150)
Move `Guest` stuff into a module (#1125)
Bump nix to 0.27.1 and use new safer Fd APIs (#1110) Move `FramedWrite`
work to a separate task (#1145) Use fewer borrows in ExtentInner API
(#1147)
Update Rust crate reedline to 0.28.0 (#1141)
Update Rust crate tokio to 1.36 (#1143)
Update Rust crate slog-bunyan to 2.5.0 (#1139)
Update Rust crate rayon to 1.8.1 (#1138)
Update Rust crate itertools to 0.12.1 (#1137)
Update Rust crate byte-unit to 5.1.4 (#1136)
Update Rust crate base64 to 0.21.7 (#1135)
Update Rust crate async-trait to 0.1.77 (#1134)
Discard deferred msgs (#1131)
Minor Downstairs cleanup (#1127)
Update test_fail_live_repair to support pstop (#1128) Ignore client
messages after stopping the IO task (#1126) Move client IO task into a
struct (#1124)
Bump Rust to 1.75 and fix new Clippy lints (#1123)

Propolis changes:
PHD: convert to async (#633)
PHD: assume specialized Windows images (#636)
propolis-standalone-config needn't be a crate
standalone: Use tar for snapshot/restore
phd: use latest "lab-2.0-opte" target, not a specific version (#637)
PHD: add tests for migration of running processes (#623) PHD: fix `cargo
xtask phd` tidy not doing anything (#630) PHD: add documentation for
`cargo xtask phd` (#629) standalone: improve virtual device creation
errors (#632) phd: add Windows Server 2019 guest adapter (#627)
PHD: add `cargo xtask phd` to make using PHD nicer (#619)

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that isn't working. Sled Agent Related to the Per-Sled Configuration and Management virtualization Propolis Integration & VM Management
Projects
None yet
Development

No branches or pull requests

1 participant