MGS: Improve error messages when we timeout waiting for an SP #953

Merged: 3 commits merged into main from mgs-test-fail-timeout on Apr 21, 2022

Conversation

@jgallagher (Contributor) commented Apr 20, 2022

This should address #951:

  1. Timeout log messages now include both the timeout duration and the SP we timed out waiting for; what was previously
error_message_internal="timeout elapsed"

is now something like

error_message_internal="timeout (500ms) elapsed communicating with SpIdentifier { typ: Switch, slot: 0 }"
  2. Increase the timeouts for communication with simulated SPs during integration tests.

I don't love the latter of those at all - "flaky test with a timeout? increase the timeout" feels awful. But I'm not sure what a reasonable alternative is in this case. Since MGS is communicating with SPs over UDP, it has to have timeouts waiting for responses from those SPs, and the integration tests therefore inherently have those timeouts. This PR feels like it bumps them up to "surely we'll never hit these even in CI runners" territory, but if there are better alternatives I'd be happy to pursue them.

Instead of only "timeout elapsed", we now include both the timeout
duration and the SP we were waiting for; e.g.,

```
error_message_internal: timeout (5ms) elapsed communicating with SpIdentifier { typ: Switch, slot: 0 }
```
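
For illustration only, here is a minimal sketch (using hypothetical stand-in types, not MGS's actual error or identifier definitions) of how a timeout error can carry both the duration and the SP it was waiting for; formatting the `Duration` with its `Debug` impl is what yields strings like `500ms`:

```rust
use std::fmt;
use std::time::Duration;

// Hypothetical stand-ins for the real gateway types, shown only to
// illustrate carrying the timeout duration and SP identity in the error.
#[derive(Debug, Clone, Copy)]
enum SpType {
    Switch,
    Sled,
}

#[derive(Debug, Clone, Copy)]
struct SpIdentifier {
    typ: SpType,
    slot: u32,
}

#[derive(Debug)]
struct SpTimeoutError {
    timeout: Duration,
    sp: SpIdentifier,
}

impl fmt::Display for SpTimeoutError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Duration's Debug impl renders e.g. "500ms", so this produces
        // "timeout (500ms) elapsed communicating with SpIdentifier { typ: Switch, slot: 0 }".
        write!(
            f,
            "timeout ({:?}) elapsed communicating with {:?}",
            self.timeout, self.sp
        )
    }
}
```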
@jgallagher jgallagher requested a review from davepacheco April 20, 2022 14:29
@andrewjstone (Contributor)

> I don't love the latter of those at all - "flaky test with a timeout? increase the timeout" feels awful. But I'm not sure what a reasonable alternative is in this case.

I'm not sure if you can make it work here, but wait_for_condition is useful when something may time out in a test. If you can configure the timeout to be approximately poll_interval, and make your test use a wrapper closure, then you may be able to use this function.
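
A rough sketch of that polling shape, assuming a small per-check interval and a generous overall budget (this is not omicron's actual `wait_for_condition` signature, just the pattern):

```rust
use std::time::{Duration, Instant};
use tokio::time::sleep;

/// Simplified polling helper: re-check a condition every `poll_interval`
/// until it holds or the overall `poll_max` budget runs out. The per-check
/// wait stays small; only the overall budget is generous.
async fn poll_until<F>(
    mut condition: F,
    poll_interval: Duration,
    poll_max: Duration,
) -> Result<(), &'static str>
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    loop {
        if condition() {
            return Ok(());
        }
        if start.elapsed() >= poll_max {
            return Err("condition not met before poll_max elapsed");
        }
        sleep(poll_interval).await;
    }
}
```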

@davepacheco (Collaborator)

> I don't love the latter of those at all - "flaky test with a timeout? increase the timeout" feels awful. But I'm not sure what a reasonable alternative is in this case. Since MGS is communicating with SPs over UDP, it has to have timeouts waiting for responses from those SPs, and the integration tests therefore inherently have those timeouts. This PR feels like it bumps them up to "surely we'll never hit these even in CI runners" territory, but if there are better alternatives I'd be happy to pursue them.

I mentioned most of this in chat, but I'm putting it here for the record. Yes, timeouts are always problematic. And it's definitely an anti-pattern to "bump the timeout in the tests" to make a flaky test less flaky. The question is why are we hitting the timeout? If the underlying problem is that 1 out of 1000 times there was a dropped notification and the system was never going to do the thing we were waiting for, then bumping the timeout is obviously wrong. If the problem is that GitHub Actions runs our tests on a heavily multi-tenant system where we might be off-CPU for the better part of 500ms, then I'd say bumping the timeout is totally reasonable. To handle the latter case, my rule of thumb (which came from @wesolows 10+ years ago) is to choose a timeout for tests such that if you hit the timeout, that by itself means the system is not working as intended. Say, if it took an hour for an SP to respond, even if it did respond correctly, we'd consider that broken. I don't know where the line is for this case but 10 seconds seems fine. 30 seconds even seems okay. Timeouts in tests are usually just to prevent the test suite from hanging indefinitely so I think it's okay to be pretty generous.

But I realize in this case it's not actually a test-specific timeout that's firing and I think you're asking for thoughts on the timeout in the general case. I think this depends on what might cause these things to take a long time, who the callers are, and what they're doing. I imagine possible reasons for this taking a while might include:

  1. Transient network issue: the request UDP packet never made it to the SP phy or the response never made it back to MGS. But the problem was transient -- if we try again, it's likely to succeed.
  2. The SP is a little busier than usual and takes a little while to get around to answering this request.
  3. The SP is pathologically busy -- it might get to this, but it'll take so long it may as well be not functioning.
  4. The SP is in some state where it will never respond to this message (e.g., a bug in the request handler causes it to drop the request without responding)
  5. The SP is in some state where it will never respond to any message (e.g., offline, physically partitioned, CPU stuck in a loop).

Of course ideally in case 1, we'd retransmit and wait a sec; in case 2, we'd just wait a sec; in cases 3-5, we'd fail immediately. But we don't know what case we're in.

How long does the caller want to wait? Doesn't this depend on the caller? Most of the use cases I can think of are not latency sensitive: e.g., a background software update. In the special case of an Oxide support rep querying state, they might want to wait a really long time if it would help get the actual state. I could imagine being latency-sensitive if we're doing this in the context of an external API call (e.g., to get the current power state)...but I think that might be better done by polling such state in the background, saving it to the database, and reporting cached state from the API.

I guess I'm making an end-to-end argument that MGS doesn't know how long it should wait. Maybe we should (1) retransmit the SP message periodically to deal with transient drops, and then (2) not have a timeout, but let the client decide how long to wait (by closing their TCP connection to MGS when they want to give up)? Is that a terrible idea?
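
For concreteness, a rough sketch of that retransmit-without-deadline idea (assumed details, not MGS's implementation; it presumes a `UdpSocket` already connected to the SP's address):

```rust
use std::io;
use std::time::Duration;
use tokio::net::UdpSocket;
use tokio::time::timeout;

/// "Retransmit, but don't give up": resend the request whenever a
/// retransmit interval passes without a response, with no overall deadline.
/// The caller bounds the total wait by dropping this future, e.g. when its
/// HTTP connection to MGS is closed.
async fn request_with_retransmit(
    sock: &UdpSocket,
    request: &[u8],
    retransmit_every: Duration,
) -> io::Result<Vec<u8>> {
    let mut buf = [0u8; 1024];
    loop {
        // Covers case 1 (transient drops): the request is re-sent each pass.
        sock.send(request).await?;
        match timeout(retransmit_every, sock.recv(&mut buf)).await {
            // Got a response: return it.
            Ok(recv_result) => {
                let n = recv_result?;
                return Ok(buf[..n].to_vec());
            }
            // No response within one retransmit interval; send again.
            Err(_elapsed) => continue,
        }
    }
}
```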

There's another issue, which is you might not want one stuck SP to hold up a request for information about all SPs...but I think that's an argument for clients making separate requests for the state of each SP.

@jgallagher (Contributor, Author) commented Apr 21, 2022

> I mentioned most of this in chat, but I'm putting it here for the record. Yes, timeouts are always problematic. And it's definitely an anti-pattern to "bump the timeout in the tests" to make a flaky test less flaky. The question is why are we hitting the timeout? If the underlying problem is that 1 out of 1000 times there was a dropped notification and the system was never going to do the thing we were waiting for, then bumping the timeout is obviously wrong. If the problem is that GitHub Actions runs our tests on a heavily multi-tenant system where we might be off-CPU for the better part of 500ms, then I'd say bumping the timeout is totally reasonable. To handle the latter case, my rule of thumb (which came from @wesolows 10+ years ago) is to choose a timeout for tests such that if you hit the timeout, that by itself means the system is not working as intended. Say, if it took an hour for an SP to respond, even if it did respond correctly, we'd consider that broken. I don't know where the line is for this case but 10 seconds seems fine. 30 seconds even seems okay. Timeouts in tests are usually just to prevent the test suite from hanging indefinitely so I think it's okay to be pretty generous.

In this case we're in the latter; I'm inclined to merge this as-is to (presumably) eliminate the flakiness, and then follow up with changes based on the below (or further discussion) in a later PR.

> But I realize in this case it's not actually a test-specific timeout that's firing and I think you're asking for thoughts on the timeout in the general case. I think this depends on what might cause these things to take a long time, who the callers are, and what they're doing. I imagine possible reasons for this taking a while might include:
>
>   1. Transient network issue: the request UDP packet never made it to the SP phy or the response never made it back to MGS. But the problem was transient -- if we try again, it's likely to succeed.
>   2. The SP is a little busier than usual and takes a little while to get around to answering this request.
>   3. The SP is pathologically busy -- it might get to this, but it'll take so long it may as well be not functioning.
>   4. The SP is in some state where it will never respond to this message (e.g., a bug in the request handler causes it to drop the request without responding)
>   5. The SP is in some state where it will never respond to any message (e.g., offline, physically partitioned, CPU stuck in a loop).
>
> Of course ideally in case 1, we'd retransmit and wait a sec; in case 2, we'd just wait a sec; in cases 3-5, we'd fail immediately. But we don't know what case we're in.
>
> How long does the caller want to wait? Doesn't this depend on the caller? Most of the use cases I can think of are not latency sensitive: e.g., a background software update. In the special case of an Oxide support rep querying state, they might want to wait a really long time if it would help get the actual state. I could imagine being latency-sensitive if we're doing this in the context of an external API call (e.g., to get the current power state)...but I think that might be better done by polling such state in the background, saving it to the database, and reporting cached state from the API.
>
> I guess I'm making an end-to-end argument that MGS doesn't know how long it should wait. Maybe we should (1) retransmit the SP message periodically to deal with transient drops, and then (2) not have a timeout, but let the client decide how long to wait (by closing their TCP connection to MGS when they want to give up)? Is that a terrible idea?

Thanks for writing all this out. This doesn't seem like a terrible idea at all, but I worry a bit that it might be punting the problem up a layer. If the requests are coming from a human, letting them decide how long to wait seems fine. If the requests are coming from nexus in some automated way, it's less clear (at least to me) that it would know how long it wants to wait. Or put another way, if we take the timeouts out of MGS, do we end up putting something very similar to them back into nexus? Maybe the right thing to do for now is to take the timeouts out of MGS and then when we integrate it into nexus decide what to do. Does that seem reasonable?

> There's another issue, which is you might not want one stuck SP to hold up a request for information about all SPs...but I think that's an argument for clients making separate requests for the state of each SP.

Maybe, but this also feels like it punts the problem up a layer. If I'm RSS and need to query all SPs to get an inventory to show to the user, if MGS doesn't expose some kind of bulk status API, it now has to do the bulk collection itself. FWIW the current implementation does try to deal with the "one stuck SP" problem: the bulk status API is paginated, and there's a separate timeout for "we've waited long enough; go ahead and return what we have so far as a page", allowing the client to call back later to get subsequent pages (during which time MGS is continuing to wait for the slow SP(s)).
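
Roughly, that page-assembly behavior looks like the following sketch (hypothetical types and channel plumbing, not the actual MGS code): assemble one page until either every SP on it has answered or the page timeout fires, then return whatever has arrived.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{sleep_until, Instant};

// Hypothetical status type for illustration; not MGS's actual API.
#[derive(Debug, Clone)]
struct SpStatus {
    slot: u32,
    state: String,
}

/// Collect one page of bulk status: stop when every SP expected on this
/// page has answered or when `page_timeout` fires, returning the partial
/// results. SPs that haven't answered yet show up on later pages while the
/// per-SP tasks keep waiting in the background.
async fn collect_page(
    mut responses: mpsc::Receiver<SpStatus>,
    expected: usize,
    page_timeout: Duration,
) -> Vec<SpStatus> {
    let deadline = Instant::now() + page_timeout;
    let mut page = Vec::with_capacity(expected);
    while page.len() < expected {
        tokio::select! {
            maybe_status = responses.recv() => match maybe_status {
                Some(status) => page.push(status),
                // All per-SP tasks have finished; nothing more will arrive.
                None => break,
            },
            // Page timeout: return the partial page now.
            _ = sleep_until(deadline) => break,
        }
    }
    page
}
```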

@jgallagher merged commit cc7127d into main Apr 21, 2022
@jgallagher deleted the mgs-test-fail-timeout branch April 21, 2022 15:07
@jgallagher (Contributor, Author)

I merged this to hopefully kill the test flakiness for now, but happy to continue discussing here or elsewhere.

@davepacheco (Collaborator)

> Or put another way, if we take the timeouts out of MGS, do we end up putting something very similar to them back into nexus? Maybe the right thing to do for now is to take the timeouts out of MGS and then when we integrate it into nexus decide what to do. Does that seem reasonable?

I think that makes sense.

For what it's worth: my intuition is that Nexus could fetch SP state on some interval in the background (and possibly proactively when needed for some operation), cache it in the database, and use the cached state when possible. Then the problem isn't "how long do I wait" -- you wait forever, as long as we're retransmitting the request from MGS to the SP. It's "how out of date is my information about SP X?" If it gets too long, we can raise an alert, maybe take some repair action (power cycle?), or whatever. But we also still have the last-known information, which may be useful for some purposes. (And if it's not useful for some purpose, we can fail immediately with "this SP is out to lunch", rather than waiting some arbitrary time to re-discover what we already know.) A similar approach might work for some non-read actions, like updating firmware: you try on some interval, alarm if it hasn't responded for too long, but you still keep trying indefinitely or until directed to do otherwise.

> Maybe, but this also feels like it punts the problem up a layer. If I'm RSS and need to query all SPs to get an inventory to show to the user, if MGS doesn't expose some kind of bulk status API, it now has to do the bulk collection itself.

Totally -- but I think that should be pretty straightforward (or at least no harder in the consumer than in MGS)? And it gives the consumer a lot more control over which ones to fetch with what frequency, how long to wait, what to do with partial failures, etc. The bulk API is fine too. I'd just suggest that if we run into hard questions because these requests are combined, it's worth considering whether it would be easier for the consumer to be in control.

@andrewjstone (Contributor)

> For what it's worth: my intuition is that Nexus could fetch SP state on some interval in the background (and possibly proactively when needed for some operation), cache it in the database, and use the cached state when possible.

I've used this strategy successfully in the past and it's really nice for limiting load. I actually used it at one point just across cores to limit contention. My only suggestion is to also include a timestamp with the info so the user can see how stale it is.
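
A minimal sketch of that cache-plus-timestamp shape (hypothetical types; the real record would live in Nexus's database, not in memory):

```rust
use std::time::{Duration, SystemTime};

// Hypothetical cached record, keeping the timestamp alongside the data so
// consumers can see how stale it is.
struct CachedSpState {
    power_state: String,
    last_updated: SystemTime,
}

impl CachedSpState {
    /// How old is this record right now?
    fn age(&self) -> Duration {
        SystemTime::now()
            .duration_since(self.last_updated)
            .unwrap_or(Duration::ZERO)
    }

    /// Callers (or an alerting job) decide what "too stale" means for their
    /// purpose, rather than MGS picking one timeout for everyone.
    fn is_stale(&self, max_age: Duration) -> bool {
        self.age() > max_age
    }
}
```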
