Need prototype reporting of persistent sled faults #1366

Closed
bnaecker opened this issue Jul 7, 2022 · 2 comments

Comments

@bnaecker
Collaborator

bnaecker commented Jul 7, 2022

During investigation of #1364, Josh brought up the general point of fault reporting. See this comment thread for context. This issue tracks adding some prototype or preliminary reporting of persistent faults on a sled. In that particular issue, a failure to delete an OPTE port means that the sled cannot be used further, at least for hosting that particular guest instance. We'd like a simple way to track that fact, ideally in CockroachDB, and use that knowledge in Nexus to direct instances (or Oxide services, potentially) to other sleds.

cc @jclulow
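
The tracking idea described in this comment could be sketched roughly as follows. This is only an illustration, not Omicron's implementation: `SledId`, `FaultKind`, `FaultRegistry`, and `pick_sled` are hypothetical names, an in-memory map stands in for the CockroachDB table, and a fault is treated as disqualifying the whole sled, which is a simplification of the per-instance case mentioned above.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical sled identifier; a real system would likely use a UUID.
type SledId = u32;

/// Hypothetical set of persistent fault kinds a sled can report.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum FaultKind {
    /// An OPTE port could not be deleted (the failure that motivated this issue).
    OptePortDeleteFailed,
}

/// In-memory stand-in for a persistent fault table (assumed to live in
/// CockroachDB in the real design).
#[derive(Default)]
struct FaultRegistry {
    faults: BTreeMap<SledId, BTreeSet<FaultKind>>,
}

impl FaultRegistry {
    /// Record a persistent fault against a sled. Reporting the same fault
    /// twice is a no-op because faults are stored in a set.
    fn report(&mut self, sled: SledId, kind: FaultKind) {
        self.faults.entry(sled).or_default().insert(kind);
    }

    /// Returns true if the sled has at least one recorded fault.
    fn is_faulted(&self, sled: SledId) -> bool {
        self.faults.get(&sled).map_or(false, |f| !f.is_empty())
    }

    /// Pick the first candidate sled with no recorded faults, if any.
    fn pick_sled(&self, candidates: &[SledId]) -> Option<SledId> {
        candidates.iter().copied().find(|s| !self.is_faulted(*s))
    }
}
```

With something like this, a placement routine would call `report` when a sled agent signals an unrecoverable failure and `pick_sled` when choosing where to start an instance, so faulted sleds are skipped.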

@andrewjstone
Contributor

I, as well as @rmustacc, have lots of thoughts on this subject. While I can't speak directly for Robert, please include me in any relevant discussion.

@bnaecker
Collaborator Author

This is really a dupe of #612, closing in favor of that.

leftwo pushed a commit that referenced this issue Jun 26, 2024
Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts.  These scripts are extracted into
the global zone at /opt/oxide/crucible_dtrace/

The latest Crucible includes these updates:
Clean up dependency checking, fixing space leak (#1372)
Make a DTrace package (#1367)
Use a single context in all messages (#1363)
Remove `DownstairsWork`, because it's redundant (#1371)
Remove `WorkState`, because it's implicit (#1370)
Do work immediately upon receipt of a job, if possible (#1366)
Move 'do work for one job' into a helper function (#1365)
Remove `DownstairsWork` from map when handling it (#1361)
Using `block_in_place` for IO operations (#1357)
update omicron deps; use re-exported dropshot types in oximeter-producer configuration (#1369)
Parameterize more tests (#1364)
Misc cleanup, remove sqlite references. (#1360)
Fix `Extent::close` docstring (#1359)
Make many `Region` functions synchronous (#1356)
Remove `Workstate::Done` (unused) (#1355)
Return a sorted `VecDeque` directly (#1354)
Combine `proc_frame` and `do_work_for` (#1351)
Move `do_work_for` and `do_work` into `ActiveConnection` (#1350)
Support arbitrary Volumes during replace compare (#1349)
Remove the SQLite backend (#1352)
Add a custom timeout for buildomat tests (#1344)
Move `proc_frame` into `ActiveConnection` (#1348)
Remove `UpstairsConnection` from `DownstairsWork` (#1341)
Move Work into ConnectionState (#1340)
Make `ConnectionState` an enum type (#1339)
Parameterize `test_repair.sh` directories (#1345)
Remove `Arc<Mutex<Downstairs>>` (#1338)
Send message to Downstairs directly (#1336)
Consolidate `on_disconnected` and `remove_connection` (#1333)
Move disconnect logic to the Downstairs (#1332)
Remove invalid DTrace probes. (#1335)
Fix outdated comments (#1331)
Use message passing when a new connection starts (#1330)
Move cancellation into Downstairs, using a token to kill IO tasks (#1329)
Make the Downstairs own per-connection state (#1328)
Move remaining local state into a `struct ConnectionState` (#1327)
Consolidate negotiation + IO operations into one loop (#1322)
Allow replacement of a target in a read_only_parent (#1281)
Do all IO through IO tasks (#1321)
Make `reqwest_client` only present if it's used (#1326)
Move negotiation into Downstairs as well (#1320)
Update Rust crate clap to v4.5.4 (#1301)
Reuse a reqwest client when creating Nexus clients (#1317)
Reuse a reqwest client when creating repair client (#1324)
Add % to keep buildomat happy (#1323)
Downstairs task cleanup (#1313)
Update crutest replace test, and mismatch printing. (#1314)
Added more DTrace scripts. (#1309)
Update Rust crate async-trait to 0.1.80 (#1298)
leftwo added a commit that referenced this issue Jun 26, 2024
Update Crucible and Propolis to the latest

Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts. These scripts are extracted into the 
global zone at /opt/oxide/crucible_dtrace/

Crucible latest includes these updates:
Clean up dependency checking, fixing space leak (#1372)
Make a DTrace package (#1367)
Use a single context in all messages (#1363)
Remove `DownstairsWork`, because it's redundant (#1371)
Remove `WorkState`, because it's implicit (#1370)
Do work immediately upon receipt of a job, if possible (#1366)
Move 'do work for one job' into a helper function (#1365)
Remove `DownstairsWork` from map when handling it (#1361)
Using `block_in_place` for IO operations (#1357)
update omicron deps; use re-exported dropshot types in oximeter-producer configuration (#1369)
Parameterize more tests (#1364)
Misc cleanup, remove sqlite references. (#1360)
Fix `Extent::close` docstring (#1359)
Make many `Region` functions synchronous (#1356)
Remove `Workstate::Done` (unused) (#1355)
Return a sorted `VecDeque` directly (#1354)
Combine `proc_frame` and `do_work_for` (#1351)
Move `do_work_for` and `do_work` into `ActiveConnection` (#1350)
Support arbitrary Volumes during replace compare (#1349)
Remove the SQLite backend (#1352)
Add a custom timeout for buildomat tests (#1344)
Move `proc_frame` into `ActiveConnection` (#1348)
Remove `UpstairsConnection` from `DownstairsWork` (#1341)
Move Work into ConnectionState (#1340)
Make `ConnectionState` an enum type (#1339)
Parameterize `test_repair.sh` directories (#1345)
Remove `Arc<Mutex<Downstairs>>` (#1338)
Send message to Downstairs directly (#1336)
Consolidate `on_disconnected` and `remove_connection` (#1333)
Move disconnect logic to the Downstairs (#1332)
Remove invalid DTrace probes. (#1335)
Fix outdated comments (#1331)
Use message passing when a new connection starts (#1330)
Move cancellation into Downstairs, using a token to kill IO tasks (#1329)
Make the Downstairs own per-connection state (#1328)
Move remaining local state into a `struct ConnectionState` (#1327)
Consolidate negotiation + IO operations into one loop (#1322)
Allow replacement of a target in a read_only_parent (#1281)
Do all IO through IO tasks (#1321)
Make `reqwest_client` only present if it's used (#1326)
Move negotiation into Downstairs as well (#1320)
Update Rust crate clap to v4.5.4 (#1301)
Reuse a reqwest client when creating Nexus clients (#1317)
Reuse a reqwest client when creating repair client (#1324)
Add % to keep buildomat happy (#1323)
Downstairs task cleanup (#1313)
Update crutest replace test, and mismatch printing. (#1314)
Added more DTrace scripts. (#1309)
Update Rust crate async-trait to 0.1.80 (#1298)

Propolis just has this one update:
Allow boot order config in propolis-standalone
---------

Co-authored-by: Alan Hanson <alan@oxide.computer>