-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flaky build failure: "remote wall time is too far ahead" #962
Labels
development
Bugs, paper cuts, feature requests, or other thoughts on making omicron development better
Test Flake
Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Comments
We saw this again today: This time it was in buildomat. That rules out the GitHub actions CPU starvation idea. I guess it's conceivable that general CPU starvation could cause it. In case we lose the logs:
|
davepacheco
added
the
Test Flake
Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
label
Jul 15, 2022
jordanhendricks
added
the
development
Bugs, paper cuts, feature requests, or other thoughts on making omicron development better
label
Aug 11, 2023
leftwo
pushed a commit
that referenced
this issue
Oct 4, 2023
Crucible updates all Crucible connections should set TCP_NODELAY (#983) Use a fixed size for tag and nonce (#957) Log crucible opts on start, order crutest options (#974) Lock the Downstairs less (#966) Cache dirty flag locally, reducing SQLite operations (#970) Make stats mutex synchronous (#961) Optimize requeue during flow control conditions (#962) Update Rust crate base64 to 0.21.4 (#950) Do less in control (#949) Fix --flush-per-blocks (#959) Fast dependency checking (#916) Update actions/checkout action to v4 (#960) Use `cargo hakari` for better workspace deps (#956) Update actions/checkout digest to 8ade135 (#939) Cache block size in Guest (#947) Update Rust crate ringbuffer to 0.15.0 (#954) Update Rust crate toml to 0.8 (#955) Update Rust crate reedline to 0.24.0 (#953) Update Rust crate libc to 0.2.148 (#952) Update Rust crate indicatif to 0.17.7 (#951) Remove unused async (#943) Use a synchronous mutex for bw/iop_tokens (#946) Make flush ID non-locking (#945) Use `oneshot` channels instead of `mpsc` for notification (#918) Use a strong type for upstairs negotiation (#941) Add a "dynamometer" option to crucible-downstairs (#931) Get new work and active count in one lock (#938) A bunch of misc test cleanup stuff (#937) Wait for a snapshot to finish on all downstairs (#920) dsc and clippy cleanup. (#935) No need to sort ackable_work (#934) Use a strong type for repair ID (#928) Keep new jobs sorted (#929) Remove state_count function on Downstairs (#927) Small cleanup to IOStateCount (#932) let cmon and IOStateCount use ClientId (#930) Fast return for zero length IOs (#926) Use a strong type for client ID (#925) A few Crucible Agent fixes (#922) Use a newtype for `JobId` (#919) Don't pass MutexGuard into functions (#917) Crutest updates, rename tests, new options (#911) Propolis updates Update tungstenite crates to 0.20 Use `strum` crate for enum-related utilities Wire up bits for CPUID customization PHD: improve artifact store (#529) Revert abort-on-panic in 'dev' cargo profile
leftwo
added a commit
that referenced
this issue
Oct 5, 2023
Crucible updates all Crucible connections should set TCP_NODELAY (#983) Use a fixed size for tag and nonce (#957) Log crucible opts on start, order crutest options (#974) Lock the Downstairs less (#966) Cache dirty flag locally, reducing SQLite operations (#970) Make stats mutex synchronous (#961) Optimize requeue during flow control conditions (#962) Update Rust crate base64 to 0.21.4 (#950) Do less in control (#949) Fix --flush-per-blocks (#959) Fast dependency checking (#916) Update actions/checkout action to v4 (#960) Use `cargo hakari` for better workspace deps (#956) Update actions/checkout digest to 8ade135 (#939) Cache block size in Guest (#947) Update Rust crate ringbuffer to 0.15.0 (#954) Update Rust crate toml to 0.8 (#955) Update Rust crate reedline to 0.24.0 (#953) Update Rust crate libc to 0.2.148 (#952) Update Rust crate indicatif to 0.17.7 (#951) Remove unused async (#943) Use a synchronous mutex for bw/iop_tokens (#946) Make flush ID non-locking (#945) Use `oneshot` channels instead of `mpsc` for notification (#918) Use a strong type for upstairs negotiation (#941) Add a "dynamometer" option to crucible-downstairs (#931) Get new work and active count in one lock (#938) A bunch of misc test cleanup stuff (#937) Wait for a snapshot to finish on all downstairs (#920) dsc and clippy cleanup. (#935) No need to sort ackable_work (#934) Use a strong type for repair ID (#928) Keep new jobs sorted (#929) Remove state_count function on Downstairs (#927) Small cleanup to IOStateCount (#932) let cmon and IOStateCount use ClientId (#930) Fast return for zero length IOs (#926) Use a strong type for client ID (#925) A few Crucible Agent fixes (#922) Use a newtype for `JobId` (#919) Don't pass MutexGuard into functions (#917) Crutest updates, rename tests, new options (#911) Propolis updates Update tungstenite crates to 0.20 Use `strum` crate for enum-related utilities Wire up bits for CPUID customization PHD: improve artifact store (#529) Revert abort-on-panic in 'dev' cargo profile --------- Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
development
Bugs, paper cuts, feature requests, or other thoughts on making omicron development better
Test Flake
Tests that work. Wait, no. Actually yes. Hang on. Something is broken.
Here's where we saw it:
https://github.com/oxidecomputer/omicron/runs/6133927246?check_suite_focus=true#step:11:1956
This message appears to come from CockroachDB, which has a
poll-show-jobs
thing and a component with this error message. What's weird is that this is a single-node CockroachDB cluster and there's only one system here. My first thought was maybe the system clock jumped during the test, but that seems unlikely. I wonder if this is a symptom of CPU starvation due to GitHub Actions workers being starved. That is: maybe CockroachDB gets a timestamp on the client, gets one from the server, and compares them, and hits this message when they're too far apart. On a single system, you could still fail that check if you were stuck off-CPU for a while between the calls to get timestamps.The text was updated successfully, but these errors were encountered: