MGS: Improve error messages when we timeout waiting for an SP #953
Conversation
Instead of only "timeout elapsed", we now include both the timeout duration and the SP we were waiting for; e.g.:

```
error_message_internal: timeout (5ms) elapsed communicating with SpIdentifier { typ: Switch, slot: 0 }
```
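For illustration, here's a minimal sketch of how an error like this might carry both pieces of information; the types and names below (`SpTimeoutError`, the `SpIdentifier` stand-in) are hypothetical and not the actual MGS code:

```rust
use std::fmt;
use std::time::Duration;

// Hypothetical stand-ins for the real MGS types.
#[derive(Debug, Clone, Copy)]
enum SpType {
    Switch,
    Sled,
}

#[derive(Debug, Clone, Copy)]
struct SpIdentifier {
    typ: SpType,
    slot: u32,
}

// A timeout error that carries both the elapsed duration and the target SP,
// so the rendered message says exactly what we were waiting for and how long.
#[derive(Debug)]
struct SpTimeoutError {
    timeout: Duration,
    sp: SpIdentifier,
}

impl fmt::Display for SpTimeoutError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "timeout ({:?}) elapsed communicating with {:?}",
            self.timeout, self.sp
        )
    }
}

fn main() {
    let err = SpTimeoutError {
        timeout: Duration::from_millis(5),
        sp: SpIdentifier { typ: SpType::Switch, slot: 0 },
    };
    // Prints: timeout (5ms) elapsed communicating with SpIdentifier { typ: Switch, slot: 0 }
    println!("{err}");
}
```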
I'm not sure if you can make it work here, but wait_for_condition is useful for when something may time out in a test. If you can configure the timeout to be approximately
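For context, here's a rough sketch of what a `wait_for_condition`-style helper does; this is a hypothetical stand-in, not the real omicron test utility or its actual signature:

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a `wait_for_condition`-style test helper: poll
/// `cond` every `poll_interval` until it returns true or `timeout` elapses.
fn wait_for_condition<F>(
    mut cond: F,
    poll_interval: Duration,
    timeout: Duration,
) -> Result<(), String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if cond() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err(format!("condition not met within {timeout:?}"));
        }
        std::thread::sleep(poll_interval);
    }
}

// Example: a generous overall timeout with a tight poll interval keeps the
// test fast in the common case but tolerant of a slow CI runner.
fn main() {
    let result = wait_for_condition(
        || true, // a real test would check, e.g., simulated-SP state here
        Duration::from_millis(50),
        Duration::from_secs(30),
    );
    assert!(result.is_ok());
}
```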
I mentioned most of this in chat, but I'm putting it here for the record. Yes, timeouts are always problematic. And it's definitely an anti-pattern to "bump the timeout in the tests" to make a flaky test less flaky. The question is why we're hitting the timeout. If the underlying problem is that 1 out of 1000 times there was a dropped notification and the system was never going to do the thing we were waiting for, then bumping the timeout is obviously wrong. If the problem is that GitHub Actions runs our tests on a heavily multi-tenant system where we might be off-CPU for the better part of 500ms, then I'd say bumping the timeout is totally reasonable.

To handle the latter case, my rule of thumb (which came from @wesolows 10+ years ago) is to choose a timeout for tests such that if you hit the timeout, that by itself means the system is not working as intended. Say, if it took an hour for an SP to respond, even if it did respond correctly, we'd consider that broken. I don't know where the line is for this case, but 10 seconds seems fine; 30 seconds even seems okay. Timeouts in tests are usually just there to prevent the test suite from hanging indefinitely, so I think it's okay to be pretty generous.

But I realize in this case it's not actually a test-specific timeout that's firing, and I think you're asking for thoughts on the timeout in the general case. I think this depends on what might cause these things to take a long time, who the callers are, and what they're doing. I imagine possible reasons for this taking a while might include:
Of course, ideally in case 1 we'd retransmit and wait a sec; in case 2 we'd just wait a sec; in cases 3-5 we'd fail immediately. But we don't know what case we're in. How long does the caller want to wait? Doesn't this depend on the caller? Most of the use cases I can think of are not latency sensitive: e.g., a background software update. In the special case of an Oxide support rep querying state, they might want to wait a really long time if it would help get the actual state. I could imagine being latency-sensitive if we're doing this in the context of an external API call (e.g., to get the current power state)...but I think that might be better done by polling such state in the background, saving it to the database, and reporting cached state from the API.

I guess I'm making an end-to-end argument that MGS doesn't know how long it should wait. Maybe we should (1) retransmit the SP message periodically to deal with transient drops, and then (2) not have a timeout, but let the client decide how long to wait (by closing their TCP connection to MGS when they want to give up)? Is that a terrible idea? There's another issue, which is that you might not want one stuck SP to hold up a request for information about all SPs...but I think that's an argument for clients making separate requests for the state of each SP.
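A rough sketch of the "retransmit periodically, no internal timeout" idea, assuming a tokio-based MGS task; `send_request` and `recv_response` are hypothetical stand-ins for the real UDP plumbing, and the caller bounds the wait by dropping the future (e.g., when the client closes its connection to MGS):

```rust
use std::time::Duration;
use tokio::time::interval;

// Hypothetical request/response types standing in for the MGS <-> SP protocol.
struct SpRequest;
struct SpResponse;

async fn send_request(_req: &SpRequest) {
    // Send one UDP datagram to the SP (omitted).
}

async fn recv_response() -> Option<SpResponse> {
    // Wait briefly for a matching response; `None` means "nothing yet".
    None
}

/// Retransmit `req` periodically and wait indefinitely for a response.  There
/// is no timeout here: the caller bounds the wait by dropping this future,
/// e.g. when the HTTP client closes its connection to MGS.
async fn request_with_retransmit(req: SpRequest) -> SpResponse {
    let mut ticker = interval(Duration::from_millis(500));
    loop {
        ticker.tick().await;      // first tick fires immediately
        send_request(&req).await; // re-send to cover transient packet drops
        if let Some(resp) = recv_response().await {
            return resp;
        }
    }
}
```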
In this case we're in the latter; I'm inclined to merge this as-is to (presumably) eliminate the flakiness, and then follow up with changes based on the below (or further discussion) in a later PR.
Thanks for writing all this out. This doesn't seem like a terrible idea at all, but I worry a bit that it might be punting the problem up a layer. If the requests are coming from a human, letting them decide how long to wait seems fine. If the requests are coming from Nexus in some automated way, it's less clear (at least to me) that it would know how long it wants to wait. Or, put another way, if we take the timeouts out of MGS, do we end up putting something very similar to them back into Nexus? Maybe the right thing to do for now is to take the timeouts out of MGS and then decide what to do when we integrate it into Nexus. Does that seem reasonable?
Maybe, but this also feels like it punts the problem up a layer. If I'm RSS and need to query all SPs to get an inventory to show to the user, and MGS doesn't expose some kind of bulk status API, then RSS has to do the bulk collection itself. FWIW, the current implementation does try to deal with the "one stuck SP" problem: the bulk status API is paginated, and there's a separate timeout for "we've waited long enough; go ahead and return what we have so far as a page", allowing the client to call back later to get subsequent pages (during which time MGS is continuing to wait for the slow SP(s)).
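A hedged sketch of that "return what we have so far as a page" behavior; the channel-based collection and the names here are hypothetical, not the actual MGS pagination code:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{timeout_at, Instant};

// Hypothetical per-SP state arriving from a background collection task.
struct SpState {
    slot: u32,
}

/// Collect SP states from `rx` until either `page_size` entries have arrived
/// or `page_timeout` elapses, then return whatever we have so far.  A slow SP
/// doesn't hold up the page; the client calls back for subsequent pages while
/// the collection task keeps waiting for the stragglers.
async fn collect_page(
    rx: &mut mpsc::Receiver<SpState>,
    page_size: usize,
    page_timeout: Duration,
) -> Vec<SpState> {
    let deadline = Instant::now() + page_timeout;
    let mut page = Vec::new();
    while page.len() < page_size {
        match timeout_at(deadline, rx.recv()).await {
            Ok(Some(state)) => page.push(state),
            // Channel closed (every SP has reported) or deadline hit: return
            // the partial page rather than waiting for stragglers.
            Ok(None) | Err(_) => break,
        }
    }
    page
}
```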
I merged this to hopefully kill the test flakiness for now, but happy to continue discussing here or elsewhere.
I think that makes sense. For what it's worth: my intuition is that Nexus could fetch SP state on some interval in the background (and possibly proactively when needed for some operation), cache it in the database, and use the cached state when possible. Then the problem isn't "how long do I wait" -- you wait forever, as long as we're retransmitting the request from MGS to the SP. It's "how out of date is my information about SP X?" If it gets too long, we can raise an alert, maybe take some repair action (power cycle?), or whatever. But we also still have the last-known information, which may be useful for some purposes. (And if it's not useful for some purpose, we can fail immediately with "this SP is out to lunch", rather than waiting some arbitrary time to re-discover what we already know.)

A similar approach might work for some non-read actions, like updating firmware: you try on some interval, alarm if it hasn't responded for too long, but you still keep trying indefinitely or until directed to do otherwise.
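A rough sketch of that background-polling pattern; everything here (`fetch_sp_state_from_mgs`, `save_cached_state`, the 30-second interval) is hypothetical, not actual Nexus code:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical SP state plus data-access stubs; the real thing would talk to
// MGS and to the database.
#[derive(Clone)]
struct SpState;

async fn fetch_sp_state_from_mgs(_slot: u32) -> SpState {
    SpState
}

async fn save_cached_state(_slot: u32, _state: SpState, _retrieved_at: SystemTime) {
    // Write the state and its retrieval time to the database (omitted).
}

/// Background task: refresh the cached state for one SP on a fixed interval.
/// Readers always use the cache, so the question becomes "how stale is my
/// information?" rather than "how long do I wait?".
async fn poll_sp_state(slot: u32) {
    let mut ticker = tokio::time::interval(Duration::from_secs(30));
    loop {
        ticker.tick().await;
        let state = fetch_sp_state_from_mgs(slot).await;
        save_cached_state(slot, state, SystemTime::now()).await;
        // A fuller version would handle fetch failures, track the age of the
        // last successful refresh, and alert (or take repair action) when the
        // data gets too stale.
    }
}
```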
Totally -- but I think that should be pretty straightforward (or at least no harder in the consumer than in MGS)? And it gives the consumer a lot more control over which ones to fetch with what frequency, how long to wait, what to do with partial failures, etc. The bulk API is fine too. I'd just suggest that if we run into hard questions because these requests are combined, it's worth considering whether it would be easier for the consumer to be in control.
I've used this strategy successfully in the past and it's really nice for limiting load. I actually used it at one point just across cores to limit contention. My only suggestion is to also include a timestamp with the info so the user can see how stale it is.
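Along those lines, a small hypothetical sketch of a cached record that carries its retrieval timestamp so staleness is easy to surface:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical cached record: the state plus when it was retrieved, so both a
// human reader and automated consumers can see (and act on) its staleness.
struct Cached<T> {
    state: T,
    retrieved_at: SystemTime,
}

impl<T> Cached<T> {
    fn age(&self) -> Duration {
        self.retrieved_at.elapsed().unwrap_or(Duration::ZERO)
    }

    /// True if the cached data is older than `max_age`, e.g. to decide whether
    /// to raise an alert or fail fast with "this SP is out to lunch".
    fn is_stale(&self, max_age: Duration) -> bool {
        self.age() > max_age
    }
}
```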
This should address #951:
is now something like
I don't love the latter of those at all - "flaky test with a timeout? increase the timeout" feels awful. But I'm not sure what a reasonable alternative is in this case. Since MGS is communicating with SPs over UDP, it has to have timeouts waiting for responses from those SPs, and the integration tests therefore inherently have those timeouts. This PR feels like it bumps them up to "surely we'll never hit these even in CI runners" territory, but if there are better alternatives I'd be happy to pursue them.