
build-and-test jobs time out after 8 hours, possibly a deadlock? #1576

Closed
iliana opened this issue Aug 11, 2022 · 6 comments
Labels
development — Bugs, paper cuts, feature requests, or other thoughts on making omicron development better
Test Flake — Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

iliana (Contributor) commented Aug 11, 2022

https://github.com/oxidecomputer/omicron/pull/1564/checks?check_run_id=7729390898

While running the nexus test_all target, some tests seem to deadlock with each other sometimes?

| test integration_tests::disks::test_disk_reject_total_size_less_than_one_gibibyte ... ok
| test integration_tests::disks::test_disk_region_creation_failure ... ok
| test integration_tests::basic::test_projects_list ... ok
| test integration_tests::images::test_global_image_create ... ok
| test integration_tests::disks::test_disk_reject_total_size_not_divisible_by_block_size ... ok
| test integration_tests::disks::test_disk_reject_total_size_not_divisible_by_min_disk_size ... ok
| test integration_tests::images::test_global_image_create_bad_content_length ... ok
| test integration_tests::images::test_global_image_create_bad_image_size ... ok
| test integration_tests::images::test_global_image_create_bad_url ... ok
| test integration_tests::images::test_global_image_create_url_404 ... ok
| test integration_tests::images::test_make_disk_from_global_image ... ok
| test integration_tests::images::test_make_disk_from_global_image_too_small ... ok
| test integration_tests::disks::test_disk_create_attach_detach_delete has been running for over 60 se [...]
| test integration_tests::instances::test_cannot_attach_nine_disks_to_instance ... ok
| test integration_tests::disks::test_disk_metrics has been running for over 60 seconds
| test integration_tests::disks::test_disk_metrics_paginated has been running for over 60 seconds
| test integration_tests::disks::test_disk_move_between_instances has been running for over 60 seconds
| test integration_tests::instances::test_cannot_attach_faulted_disks ... ok
| test integration_tests::instances::test_attach_eight_disks_to_instance has been running for over 60  [...]
| test integration_tests::instances::test_attach_one_disk_to_instance has been running for over 60 sec [...]
| test integration_tests::instances::test_disks_detached_when_instance_destroyed has been running for  [...]
| test integration_tests::instances::test_instance_create_delete_network_interface has been running fo [...]
control: job duration 28823 exceeds 28800 seconds; aborting
control: worker failed without completing job
control: task 4 was incomplete, marked failed
iliana added the Test Flake label Aug 11, 2022
davepacheco (Collaborator) commented:

I haven't looked at this example at all, but another possibility might be #808. If these all use a saga that blows an assertion (or otherwise panics), I think you'd see this as well. This should be evident from the log files created by the tests. I've seen buildomat include them before, but I don't see them for this run. Maybe it doesn't save them when the test times out?
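The failure mode described above can be sketched in a minimal, self-contained Rust example (this is not Omicron code; the names and the 1-second deadline are invented for illustration). A worker that panics before signalling completion leaves anything waiting on it without the signal it's blocked on; here the waiter uses a channel with a deadline so the hang is observable rather than lasting 8 hours:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Spawn a "saga" worker that panics before signalling completion,
/// then wait (with a deadline) for a signal that never arrives.
/// Returns true if the completion signal was received.
fn run_saga_and_wait() -> bool {
    let (tx, rx) = mpsc::channel::<()>();

    let worker = thread::spawn(move || {
        // The worker blows an assertion partway through...
        panic!("assertion blown mid-saga");
        // ...so this completion signal is never sent. (Dropping `tx`
        // during unwind at least lets a channel waiter observe
        // disconnection; a waiter parked on a Condvar would hang.)
        #[allow(unreachable_code)]
        tx.send(()).unwrap();
    });

    // The "test" waits for completion, but with a deadline instead
    // of blocking forever the way a deadlocked test would.
    let got_signal = rx.recv_timeout(Duration::from_secs(1)).is_ok();
    assert!(worker.join().is_err()); // confirm the worker panicked
    got_signal
}

fn main() {
    println!("completion signal received: {}", run_saga_and_wait());
}
```

If the panicking worker held a lock or other shared resource instead, the remaining tests would pile up behind it, which would match the cluster of "running for over 60 seconds" messages in the log.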

iliana (Contributor, author) commented Aug 11, 2022

Given all the "running for over 60 seconds" tests are disk-related, I think that makes sense.

bnaecker (Collaborator) commented:

I see some network-interface-related tests too, so it's not all disk-related. The commonality between those is the instance creation saga, I think, though there may be other shared code too.

davepacheco (Collaborator) commented:

I assume this is unrelated but in case it's not: I ran into a couple of runs that also timed out unexpectedly:

https://github.com/oxidecomputer/omicron/pull/1123/checks?check_run_id=7796250815
https://github.com/oxidecomputer/omicron/runs/7797083227

In both of these, it appears to be a new test added in those PRs that's still running. But I've never seen that test take longer than 5 minutes locally or in the Helios or GitHub Ubuntu CI. I imagine this is unrelated to the above but I thought I'd mention it in case there's some other systemic issue going on.

davepacheco (Collaborator) commented:

The runs I linked above were due to a deadlock in Oso; I'm going to pull in the fix with #1123. It's conceivable that's what happened here too, but it's hard to say. If we see this again and have either live state, a core file from the test process, or log files from the test process, that would help answer the question.

iliana (Contributor, author) commented Aug 23, 2022

Haven't seen this issue in a while; #1123 probably fixed it.

iliana closed this as completed Aug 23, 2022
jordanhendricks added the development label Aug 11, 2023