Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sled-agent not serving vmm requests #6911

Closed
askfongjojo opened this issue Oct 21, 2024 · 16 comments
Closed

sled-agent not serving vmm requests #6911

askfongjojo opened this issue Oct 21, 2024 · 16 comments
Milestone

Comments

@askfongjojo
Copy link

The issue was seen on sled 13 of rack3. There was a problem with mupdate during that time and we're not sure if it's related at all. Other sleds do not have this particular issue AFAICT. https://github.com/oxidecomputer/colo/issues/88#issuecomment-2423940805 is where we noted the problem, also cloned below:

Looking at sled agent logs it looks like calls to the /vmms/{id}/state endpoint are timing out on sled 13:

15:00:47.203Z WARN SledAgent (dropshot (SledAgent)): request handling cancelled (client disconnected)
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:887
    latency_us = 60005110
    local_addr = [fd00:1122:3344:104::1]:12345
    method = GET
    remote_addr = [fd00:1122:3344:10e::3]:36498
    req_id = ccf79204-80fb-4814-a778-59a48ab273a4
    uri = /vmms/90676c5f-a315-4930-a90f-43790839069b/state

I don't know for certain if this is related but it seems suspicious.

The recent sled-agent logs and core files have been uploaded to /staff/rack3/colo-88.

@gjcolombo
Copy link
Contributor

gjcolombo commented Oct 21, 2024

Probably related to #6904. I found a relevant-looking sled agent log snippet (from oxide-sled-agent:default.log.1729284364); this is the point where the system crossed over from successfully servicing VMM requests to failing to do so:

20:38:35.072Z INFO SledAgent (dropshot (SledAgent)): request completed
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 365
    local_addr = [fd00:1122:3344:104::1]:12345
    method = GET
    remote_addr = [fd00:1122:3344:10e::3]:61554
    req_id = 27622099-2c46-440c-b189-386adcebecea
    response_code = 200
    uri = /vmms/3bfe0e53-b1e9-4473-8a92-298984d08e17/state
20:38:35.072Z INFO SledAgent (dropshot (SledAgent)): request completed
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 375
    local_addr = [fd00:1122:3344:104::1]:12345
    method = GET
    remote_addr = [fd00:1122:3344:10e::3]:64130
    req_id = 616faeea-d021-4145-9ba0-b004bfaf66be
    response_code = 200
    uri = /vmms/08704209-ff3e-4d96-a6e9-901b9b288134/state
20:38:35.098Z INFO SledAgent (dropshot (SledAgent)): request completed
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 210
    local_addr = [fd00:1122:3344:104::1]:12345
    method = GET
    remote_addr = [fd00:1122:3344:10e::3]:64130
    req_id = a7997c8e-8f69-414e-83e1-8bf905229a39
    response_code = 200
    uri = /vmms/677d7236-412a-4141-8876-f09cc7dbee44/state
20:38:35.134Z INFO SledAgent (dropshot (SledAgent)): request completed
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950
    latency_us = 127
    local_addr = [fd00:1122:3344:104::1]:12345
    method = GET
    remote_addr = [fd00:1122:3344:10e::3]:64130
    req_id = 2bc807ce-f42e-4921-87a4-930a0b23b513
    response_code = 200
    uri = /vmms/90676c5f-a315-4930-a90f-43790839069b/state

// gjcolombo: 10 seconds elapse

// gjcolombo: note remote port number
20:38:45.544Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:1105
    local_addr = [fd00:1122:3344:104::1]:12345
    remote_addr = [fd00:1122:3344:111::3]:46877

// gjcolombo: 7 more seconds elapse

20:38:52.095Z INFO SledAgent (dropshot (BootstrapAgent)): accepted connection
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:1105
    local_addr = [fdb0:a840:2504:350::1]:80
    remote_addr = [fdb0:a840:2504:4d0::2]:48619
20:38:52.929Z INFO SledAgent (dropshot (BootstrapAgent)): accepted connection
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:1105
    local_addr = [fdb0:a840:2504:350::1]:80
    remote_addr = [fdb0:a840:2504:1d4::2]:41366

// gjcolombo: several successful requests redacted; these hit the /eip-gateways and /vpc-routes endpoints

// gjcolombo: 60-second client timeout; turns out this was a snapshot request...
20:39:45.265Z WARN SledAgent (dropshot (SledAgent)): request handling cancelled (client disconnected)
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:887
    latency_us = 59718455
    local_addr = [fd00:1122:3344:104::1]:12345
    method = POST
    remote_addr = [fd00:1122:3344:111::3]:46877
    req_id = b52dafd4-82b6-4d6c-b044-c72619bd0078
    uri = /vmms/3bfe0e53-b1e9-4473-8a92-298984d08e17/disks/31c1a97b-db69-4185-a198-78679c0f1f36/snapshot

I suspect what's happened is that this specific snapshot request is stuck. It's not timing out because it's being serviced via a call to Propolis, and sled agent's Propolis clients don't configure a timeout:

// We use a custom client builder here because the default progenitor
// one has a timeout of 15s but we want to be able to wait indefinitely.
let reqwest_client = reqwest::ClientBuilder::new().build().unwrap();
let client = Arc::new(PropolisClient::new_with_client(
&format!("http://{}", &self.propolis_addr),
reqwest_client,
));

Things to follow up on:

  1. Why is this snapshot request stuck? Does this disk (31c1a97b-db69-4185-a198-78679c0f1f36) have a downstairs on another sled that's in bad repair? @askfongjojo, @augustuswm: Propolis logs from VMM 3bfe0e53-b1e9-4473-8a92-298984d08e17 might be helpful here, if we don't have them already and we're not immediately sure that something is wrong with this specific Crucible disk.
  2. Allocating guest memory is (or should be) much faster than it was when we configured this Propolis client. Should we really be running this client without a timeout? (Another related question: is it safe to do that, or will timing out a Propolis operation throw something off between Propolis and sled agent?)
  3. Because each individual instance has its own driver task, it seems like we could (and should) rework InstanceManager so that it doesn't block waiting for each individual reply for a specific instance. That is, functions like issue_snapshot_request seem like they could spawn a separate tokio task that sends a message to the Instance in question, awaits the reply, and then forwards the reply back to the task that originally handled the relevant HTTP request. I say "seems" here because I haven't looked too hard, and this may be harder than it first appears!

@gjcolombo
Copy link
Contributor

I should add that I also took a very quick peek at the core file, but all it reveals at first glance is that all of the tokio threads in the agent are currently idle. That's consistent with the observations above. I don't know if we have any techniques available to parse the tokio task wheel manually to figure out where the InstanceManager task most recently awaited.

@hawkw
Copy link
Member

hawkw commented Oct 21, 2024

@gjcolombo

Because each individual instance has its own driver task, it seems like we could (and should) rework InstanceManager so that it doesn't block waiting for each individual reply for a specific instance. That is, functions like issue_snapshot_request seem like they could spawn a separate tokio task that sends a message to the Instance in question, awaits the reply, and then forwards the reply back to the task that originally handled the relevant HTTP request. I say "seems" here because I haven't looked too hard, and this may be harder than it first appears!

I think this seems quite doable and is almost certainly a good idea. The only thing that should be synchronous in the InstanceManager ought to be the actual routing of requests to instances and management of the "table of running instances" state. Potentially, I think we might not even need to spawn a separate task here, although I'll have to take another look at the code.

Happy to take a crack at this change unless you were planning to imminently.

@gjcolombo
Copy link
Contributor

Happy to take a crack at this change unless you were planning to imminently.

It's all yours if you want it! Let me know if you'd like me to pitch in.

@gjcolombo
Copy link
Contributor

One additional note re this and #6904: Although transferring requests from the InstanceManager to an individual Instance without blocking would have allowed requests to other VMMs to go through, requests to the snapshotting VM could still block (since the InstanceRunner still won't be responding). An operation like InstanceManagerRunner::use_only_these_disks would therefore still block indefinitely if the InstanceManager decided that it needed to dispatch a call to the affected Instance. Making Instance calls not block the InstanceManager is still righteous, but we still need to figure out what to do about seemingly-stuck calls to Propolis.

@askfongjojo
Copy link
Author

  1. Why is this snapshot request stuck? Does this disk (31c1a97b-db69-4185-a198-78679c0f1f36) have a downstairs on another sled that's in bad repair?

Yes, it indeed has a downstairs on sled 9 (BRM42220063) so that would explain why the snapshot task was stuck:

root@oxz_switch1:~# omdb db disks info 31c1a97b-db69-4185-a198-78679c0f1f36
HOST_SERIAL DISK_NAME                                          INSTANCE_NAME   PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE 
BRM42220064 issue-collector-21dce8b6-7be8-4f31-8d52-958-f9cad9 issue-collector oxz_propolis-server_3bfe0e53-b1e9-4473-8a92-298984d08e17 b209a8e3-9dca-41b7-aa9d-9a7d232c602f attached   
HOST_SERIAL REGION                               ZONE                                              PHYSICAL_DISK                        
BRM44220017 f4112406-22c4-46d2-a6b5-9f22bd49a90d oxz_crucible_1455c069-853c-49cd-853a-3ea81b89acd4 c5058411-d64a-42d6-acda-903e65171d25 
BRM42220063 a5f0cf81-3686-471f-87da-22efbb533a78 oxz_crucible_816f26a7-4c28-4a39-b9ad-a036678520ab a7787335-6d18-4a96-910b-4d841f72dab4 
BRM42220073 25db1691-243f-4d92-bafd-8af0dc724af0 oxz_crucible_cf667caf-304c-40c4-acce-f0eb05d011ef e1f7c8aa-c45b-412c-9466-f3a490877b0a 

@askfongjojo
Copy link
Author

Before we have a fix to avoid the blocking behavior, what would be a reasonable workaround in this case? Would restarting sled-agent free up InstanceManager to serve other requests? Would the snapshot task retry indefinitely because it's a saga (not sure at what point it would give up and unwind)?

@gjcolombo
Copy link
Contributor

Would restarting sled-agent free up InstanceManager to serve other requests?

Good call--I think this should work, at least as far as instance management is concerned. All the existing Propolis zones will be reaped when the new sled agent starts, but any affected instances should go to Failed and then be restarted. The new sled agent will otherwise have a brand new instance manager task with an empty instance table.

Would the snapshot task retry indefinitely because it's a saga (not sure at what point it would give up and unwind)?

The next snapshot saga retry should produce a permanent error, since this request is directed at a particular VMM, and sled agent will now return "no such VMM" instead of stalling and timing out. Assuming the saga recognizes that it now has a permanent error it should unwind at that point.

@askfongjojo
Copy link
Author

Thanks @gjcolombo. @augustuswm - There is no rush but it sounds like we can put sled 13 back in service with a sled-agent restart.

@augustuswm
Copy link
Contributor

Sounds good. I will do that now.

@augustuswm
Copy link
Contributor

That looks to have worked. VMs on sled 13 were replaced, and then stop + start operations afterwards were successful.

hawkw added a commit that referenced this issue Oct 22, 2024
Sled-agent's `InstanceManager` task is responsible for managing the
table of all instances presently running on the sled. When the
sled-agent receives a request relating to an individual instance on the
sled, it's sent to the `InstanceManager` over a `tokio::sync::mpsc`
channel, and is then dispatched by the `InstanceManager` to the
`InstanceRunner` task responsible for that individual instance by
sending it over a *second* `tokio::sync::mpsc` channel. This is where
things start to get interesting.[^1]

`tokio::sync::mpsc` is a *bounded* channel: there is a maximum number of
messages which may be queued by a given MPSC channel at any given time.
The `mpsc::Sender::send` method is an `async fn`, and if the channel is
at capacity, that method will _wait_ until there is once again space in
the channel to send the message being sent. Presently,
`mpsc::Sender::send` is called by the `InstanceManager`'s main run loop
when dispatching a request to an individual instance. As you may have
already started to piece together, this means that if a given
`InstanceRunner` task is not able to process requests fast enough to
drain its channel, the entire `InstanceManager` loop will wait when
dispatching a request to that instance until the queue has been drained.
This means that if one instance's runner task has gotten stuck on
something, like waiting for a Crucible flush that will never complete
(as seen in #6911), that instance will prevent requests being dispatched
to *any other instance* managed by the sled-agent. This is quite
unfortunate!

This commit fixes this behavior by changing the functions that send
requests to an individual instance's task to instead *shed load* when
that instance's request queue is full. We now use the
`mpsc::Sender::try_send` method, rather than `mpsc::Sender::send`, which
does not wait and instead immediately returns an error when the channel
is full. This allows the `InstanceManager` to instead return an error to
the client indicating the channel is full, and move on to processing
requests to other instances which may not be stuck. Thus, a single stuck
instance can no longer block requests from being dispatched to other,
perfectly fine instances.

The error returned when the channel is at capacity is converted to an
HTTP 503 Service Unavailable error by the API. This indicates to the
client that their request to that instance was not able to be processed
at this time, but that it may be processed successfully in the
future.[^2] Now, we can shed load while allowing clients to retry later,
which seems much better than the present situation.

[^1]: In the sense of "may you live in interesting times", naturally.
[^2]: I also considered returning 429 Too Many Requests here, but my
    understanding is that that status code is supposed to indicate that
    too many requests have been received from *that specific client*. In
    this case, we haven't hit a per-client rate limit; we're just
    overloaded by requests more broadly, so it's not that particular
    client's fault.
hawkw added a commit that referenced this issue Oct 25, 2024
Sled-agent's `InstanceManager` task is responsible for managing the
table of all instances presently running on the sled. When the
sled-agent receives a request relating to an individual instance on the
sled, it's sent to the `InstanceManager` over a `tokio::sync::mpsc`
channel, and is then dispatched by the `InstanceManager` to the
`InstanceRunner` task responsible for that individual instance by
sending it over a *second* `tokio::sync::mpsc` channel. This is where
things start to get interesting.[^1]

`tokio::sync::mpsc` is a *bounded* channel: there is a maximum number of
messages which may be queued by a given MPSC channel at any given time.
The `mpsc::Sender::send` method is an `async fn`, and if the channel is
at capacity, that method will _wait_ until there is once again space in
the channel to send the message being sent. Presently,
`mpsc::Sender::send` is called by the `InstanceManager`'s main run loop
when dispatching a request to an individual instance. As you may have
already started to piece together, this means that if a given
`InstanceRunner` task is not able to process requests fast enough to
drain its channel, the entire `InstanceManager` loop will wait when
dispatching a request to that instance until the queue has been drained.
This means that if one instance's runner task has gotten stuck on
something, like waiting for a Crucible flush that will never complete
(as seen in #6911), that instance will prevent requests being dispatched
to *any other instance* managed by the sled-agent. This is quite
unfortunate!

This commit fixes this behavior by changing the functions that send
requests to an individual instance's task to instead *shed load* when
that instance's request queue is full. We now use the
`mpsc::Sender::try_send` method, rather than `mpsc::Sender::send`, which
does not wait and instead immediately returns an error when the channel
is full. This allows the `InstanceManager` to instead return an error to
the client indicating the channel is full, and move on to processing
requests to other instances which may not be stuck. Thus, a single stuck
instance can no longer block requests from being dispatched to other,
perfectly fine instances.

The error returned when the channel is at capacity is converted to an
HTTP 503 Service Unavailable error by the API. This indicates to the
client that their request to that instance was not able to be processed
at this time, but that it may be processed successfully in the
future.[^2] Now, we can shed load while allowing clients to retry later,
which seems much better than the present situation.

[^1]: In the sense of "may you live in interesting times", naturally.
[^2]: I also considered returning 429 Too Many Requests here, but my
    understanding is that that status code is supposed to indicate that
    too many requests have been received from *that specific client*. In
    this case, we haven't hit a per-client rate limit; we're just
    overloaded by requests more broadly, so it's not that particular
    client's fault.
@morlandi7 morlandi7 added this to the 12 milestone Oct 29, 2024
hawkw added a commit that referenced this issue Oct 30, 2024
Stacked on top of #6913 

Presently, sled-agent sends requests to terminate an instance to the
`InstanceRunner` task over the same `tokio::sync::mpsc` request channel
as all other requests sent to that instance. This means that the
`InstanceRunner` will attempt to terminate the instance only once other
requests received before the termination request have been processed,
and an instance cannot be terminated if its request channel has filled
up. Similarly, if an instance's `InstanceRunner` task is waiting for an
in-flight request to the VMM to complete, the request to terminate the
instance will not be seen until the current request to Propolis has
returned. This means that if the instance has gotten stuck for some
reason --- e.g., because it is attempting a Crucible snapshot that
cannot complete because a physical disk has gone missing, as seen in
#6911 --- the instance cannot be terminated. Sadly, in this case, the
only way to resolve the stuck request is to terminate the instance, but
we cannot do so *because* the instance is stuck.

This seems unfortunate: Ii we try to kill an instance because it's doing
something that it will never be able to finish, it shouldn't be able to
say "no, you can't kill me, I'm too *busy* to die!". Instead, requests
to terminate the instance should be prioritized over other requests.
This commit does that.

Rather than sending termination requests to the `InstanceRunner` over
the same channel as all other requests, we instead introduce a separate
channel that's *just* for termination requests, which is preferred over
the request channel in the biased `tokio::select!` in the
`InstanceRunner` run loop. This means that a full request channel cannot
stop a termination request from being sent. When a request to the VMM is
in flight, the future that awaits that request's completion is now one
branch of a similar `tokio::select!` with the termination channel. This
way, if a termination request comes in while the `InstanceRunner` is
awaiting an in-flight instance operation, it will still be notified
immediately of the termination request, cancel whatever operation it's
waiting for, and go ahead and terminate the VMM immediately. This is the
correct behavior here, since the terminate operation is intended to
forcefully terminate the VMM *now*, and is used internally for purposes
such as `use_only_these_disks` killing instances that are using a
no-longer-extant disk, or the control plane requesting that the
sled-agent forcibly unregister the instance. "Normal" requests to stop
the instance gracefully will go through the `instance_put_state` API
instead, sending requests through the normal request channel and
allowing in flight operations to complete.
@hawkw
Copy link
Member

hawkw commented Oct 30, 2024

PR #6913 fixes the issue where a sled-agent with one instance which has gotten stuck will not handle requests for any other instances on the sled. PR #6915 (which just merged) which fixes the sled-agent not being able to forcefully terminate instances when it is waiting for an in-flight operation to complete.

Collectively, those two changes fix this issue as far as the sled-agent is concerned. Unless we want to keep this issue open until the underlying Crucible issue is resolved, I think we can close this --- and I think #6932 picked up the Crucible fix, but I'd want @leftwo to confirm?

@askfongjojo
Copy link
Author

Unless we want to keep this issue open until the underlying Crucible issue is resolved, I think we can close this

The underlying issue was because a sled has gone away (potentially an electrical issue), not a crucible problem. We can close this ticket.

@hawkw
Copy link
Member

hawkw commented Oct 30, 2024

Unless we want to keep this issue open until the underlying Crucible issue is resolved, I think we can close this

The underlying issue was because a sled has gone away (potentially an electrical issue), not a crucible problem. We can close this ticket.

Hm, I thought it was a Crucible problem insofar as the instance had gotten stuck because it was trying to take a snapshot while in a degraded state because one of its downstairs replicas was on the sled that had gone away, and that condition was being handled by retrying the snapshot in a loop forever? In any case, happy to close this!

@hawkw hawkw closed this as completed Oct 30, 2024
@askfongjojo
Copy link
Author

The situation was a bit unusal as we chose not to expunge the sled. If we had gone through expungement, the missing region downstairs would have been replaced on another sled and brought the disk back to a healthy state which in turn allowed the snapshot to complete. If @leftwo thinks we should handle the prolonged degraded state in a better way, we probably want to have a new crucible ticket filed for that (so closing this ticket is still the right thing to do!).

@hawkw
Copy link
Member

hawkw commented Oct 30, 2024

Yup, that makes sense! Thanks for clarifying!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants