spawn_test: prolong termination time to be more tolerant. #1483

Merged on Feb 12, 2023 (1 commit)

Contributor

@balusch balusch commented Feb 9, 2023

pidfd_open(2) can fail on old kernels for several reasons (e.g. the
syscall or the O_NONBLOCK flag is not supported), in which case
process::wait() falls back to waitpid(2) with backoff (at least 20ms),
so the wait may take longer than 10ms.

Here we add the minimal backoff to the 10ms limit so that the test can
pass on old kernels.

Signed-off-by: Jianyong Chen <baluschch@gmail.com>
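
The fallback decision described above can be sketched roughly as follows. This is a minimal illustration and not Seastar's actual implementation: `pidfd_probe`, `probe_pidfd`, `pidfd_failure` and `classify_pidfd_failure` are hypothetical names, the raw `syscall(2)` is used on the assumption that the libc may not provide a `pidfd_open` wrapper, and the fallback syscall number is an assumption that holds on x86-64 and most other architectures.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434  // assumption: value on x86-64 and most other architectures
#endif

// Hypothetical result type: fd >= 0 means a pidfd was obtained; otherwise
// err holds the errno and the caller is expected to poll with waitpid(2).
struct pidfd_probe {
    int fd;
    int err;
};

// Try to open a non-blocking pidfd for `pid`. PIDFD_NONBLOCK has the same
// value as O_NONBLOCK, so O_NONBLOCK is passed here.
inline pidfd_probe probe_pidfd(pid_t pid) {
    assert(pid > 0);
    long fd = ::syscall(SYS_pidfd_open, pid, O_NONBLOCK);
    if (fd >= 0) {
        return {static_cast<int>(fd), 0};
    }
    return {-1, errno};
}

enum class pidfd_failure { no_syscall, bad_flags, other };

// Map the errno of a failed probe (made with a valid pid) to a likely cause.
inline pidfd_failure classify_pidfd_failure(int err) {
    switch (err) {
    case ENOSYS: return pidfd_failure::no_syscall;  // kernel lacks pidfd_open(2)
    case EINVAL: return pidfd_failure::bad_flags;   // kernel rejects the flag
    default:     return pidfd_failure::other;
    }
}
```

A caller would use the returned fd when `fd >= 0`, and otherwise switch to polling with waitpid(2); either failure cause leads to the same fallback.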

Contributor Author

balusch commented Feb 9, 2023

@tchaikov Hi, kefu, I think I've found the reason.

My ArchLinux is hosted on a machine with kernel 5.4, which supports pidfd_open(2) but not the PIDFD_NONBLOCK flag (an alias of O_NONBLOCK, introduced in 5.10). So pidfd_open(2) returns EINVAL and process::wait falls back to waitpid(2), which returns 0 at first (the state of the sleep process has not changed yet); after a 20ms backoff it retries and finally succeeds.
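
To make the timing concrete, here is a rough standalone sketch of polling with waitpid(2) and a backoff. It is not the Seastar code: `reap_result`, `wait_with_backoff` and `kill_and_reap_ms` are hypothetical names, and the doubling of the backoff and the 160ms cap are assumptions made for illustration; only the 20ms initial backoff comes from the analysis above.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <thread>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct reap_result {
    bool ok;       // false if waitpid(2) failed
    int status;    // raw status reported by waitpid(2)
    int attempts;  // number of waitpid(2) calls made
};

// Poll `pid` with waitpid(WNOHANG), sleeping `backoff` between attempts.
// The backoff doubles up to `max_backoff` (assumption, for illustration).
inline reap_result wait_with_backoff(
        pid_t pid,
        std::chrono::milliseconds backoff = std::chrono::milliseconds(20),
        std::chrono::milliseconds max_backoff = std::chrono::milliseconds(160)) {
    assert(pid > 0);
    int attempts = 0;
    for (;;) {
        int status = 0;
        pid_t r = ::waitpid(pid, &status, WNOHANG);
        ++attempts;
        if (r == pid) {
            return {true, status, attempts};
        }
        if (r < 0) {
            if (errno == EINTR) {
                continue;
            }
            return {false, 0, attempts};
        }
        // r == 0: the child's state has not changed yet, so back off and retry.
        std::this_thread::sleep_for(backoff);
        backoff = std::min(backoff * 2, max_backoff);
    }
}

// Fork a child that sleeps, kill it, and return how many milliseconds
// passed until it was reaped (-1 on error).
inline long kill_and_reap_ms() {
    pid_t pid = ::fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        ::sleep(10);
        ::_exit(0);
    }
    auto start = std::chrono::steady_clock::now();
    ::kill(pid, SIGKILL);
    reap_result r = wait_with_backoff(pid);
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    if (!r.ok || !WIFSIGNALED(r.status)) {
        return -1;
    }
    return static_cast<long>(elapsed.count());
}
```

Whenever the first waitpid(2) call returns 0, the elapsed time is at least one backoff (20ms), which is why a 10ms bound cannot hold on this path.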

Contributor

tchaikov commented Feb 9, 2023

@balusch hi Jianyong, thanks for your excellent analysis. that explains the symptom in your environment!

the test failures are not related to this change:


54/74 Test #54: Seastar.unit.stall_detector ...................***Failed    6.10 sec
[0/1] cd /home/circleci/project/build/release/tests/unit && /home/circleci/project/build/release/tests/unit/stall_detector_test -- -c 2
Running 5 test cases...
INFO  2023-02-09 04:15:22,535 seastar - Reactor backend: linux-aio
WARN  2023-02-09 04:15:22,535 [shard 0] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:13, perf_event_open() failed: Permission denied)
INFO  2023-02-09 04:15:22,536 [shard 0] seastar - Created fair group io-queue-0, capacity rate 2147483:2147483, limit 12582912, rate 16777216 (factor 1), threshold 2000
INFO  2023-02-09 04:15:22,536 [shard 0] seastar - IO queue uses 0.75ms latency goal for device 0
INFO  2023-02-09 04:15:22,536 [shard 0] seastar - Created io group dev(0), length limit 4194304:4194304, rate 2147483647:2147483647
INFO  2023-02-09 04:15:22,536 [shard 0] seastar - Created io queue dev(0) capacities: 512:2000:2000 1024:3000:3000 2048:5000:5000 4096:9000:9000 8192:17000:17000 16384:33000:33000 32768:65000:65000 65536:129000:129000 131072:257000:257000
WARN  2023-02-09 04:15:22,550 [shard 1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:13, perf_event_open() failed: Permission denied)
random-seed=1797987249
INFO  2023-02-09 04:15:22,551 [shard 0] seastar - updated: blocked-reactor-notify-ms=10
/home/circleci/project/tests/unit/stall_detector_test.cc(92): fatal error: in "normal_case": critical check reports == 0 has failed [1 != 0]
INFO  2023-02-09 04:15:23,551 [shard 0] seastar - updated: blocked-reactor-notify-ms=25
INFO  2023-02-09 04:15:23,551 [shard 0] seastar - updated: blocked-reactor-notify-ms=10
INFO  2023-02-09 04:15:24,853 [shard 0] seastar - updated: blocked-reactor-notify-ms=25
INFO  2023-02-09 04:15:24,853 [shard 0] seastar - updated: blocked-reactor-notify-ms=10
INFO  2023-02-09 04:15:25,861 [shard 0] seastar - updated: blocked-reactor-notify-ms=25
INFO  2023-02-09 04:15:25,861 [shard 0] testlog - Starting spin test: userspace

BOOST_CHECK_LE(ms, 10);
// sleep should be terminated in 10ms.
// pidfd_open(2) may fail and thus p.wait() falls back to
// waitpid(2) with backoff(at least 20ms).
Contributor

nit, please add a space before " (at".

Contributor Author

done

// sleep should be terminated in 10ms.
// pidfd_open(2) may fail and thus p.wait() falls back to
// waitpid(2) with backoff(at least 20ms).
// here we allow one backoff to be more tolerant.
Contributor

@tchaikov tchaikov Feb 9, 2023

Suggested change
- // here we allow one backoff to be more tolerant.
+ // the minimal backoff is added to 10ms, so the test can pass on
+ // older kernels as well.

Contributor

tchaikov commented Feb 9, 2023

the spawn_test failure is a known one and is not fixed by this change. i was looking at it, but have no clues so far. probably i should drop it.

the stall_detector one might be a flaky test. let's note it down just in case.

@balusch balusch force-pushed the spawn_test-kill branch 2 times, most recently from 0d2b708 to 307d02f Compare February 9, 2023 11:00
Contributor

@tchaikov tchaikov left a comment

lgtm

Contributor

tchaikov commented Feb 9, 2023

@scylladb/seastar-maint could you help merge this one as well?

@avikivity avikivity merged commit dccc827 into scylladb:master Feb 12, 2023