Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_spawn_input fails randomly #1320

Closed
tchaikov opened this issue Nov 30, 2022 · 12 comments · Fixed by #1555
Closed

test_spawn_input fails randomly #1320

tchaikov opened this issue Nov 30, 2022 · 12 comments · Fixed by #1555

Comments

@tchaikov
Copy link
Contributor

./run ./configure.py --compiler g++-12 --c++-standard 20 ./run ninja -C build/debug ./run ./test.py --mode=debug 
...
72/74 Test #72: Seastar.unit.spawn ............................***Failed    1.10 sec
[0/1] cd /home/circleci/project/build/debug/tests/unit && /home/circleci/project/build/debug/tests/unit/spawn_test -- -c 2
Running 5 test cases...
WARNING: debug mode. Not for benchmarking or production
INFO  2022-11-30 06:52:59,248 seastar - Reactor backend: linux-aio
WARN  2022-11-30 06:52:59,259 [shard 0] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:13, perf_event_open() failed: Permission denied)
WARN  2022-11-30 06:52:59,259 [shard 1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:13, perf_event_open() failed: Permission denied)
INFO  2022-11-30 06:52:59,273 [shard 1] seastar - Created fair group io-queue-0, capacity rate 2147483:2147483, limit 12582912, rate 16777216 (factor 1), threshold 2000
INFO  2022-11-30 06:52:59,275 [shard 1] seastar - IO queue uses 0.75ms latency goal for device 0
INFO  2022-11-30 06:52:59,277 [shard 1] seastar - Created io group dev(0), length limit 4194304:4194304, rate 2147483647:2147483647
INFO  2022-11-30 06:52:59,278 [shard 0] seastar - Created io queue dev(0) capacities: 512:2000:2000 1024:3000:3000 2048:5000:5000 4096:9000:9000 8192:17000:17000 16384:33000:33000 32768:65000:65000 65536:129000:129000 131072:257000:257000
random-seed=4260918692
unknown location(0): fatal error: in "test_spawn_input": std::system_error: Broken pipe
/home/circleci/project/src/testing/seastar_test.cc(43): last checkpoint

not able to reproduce this failure locally with

$ for i in `seq 1000`; do
  tests/unit/spawn_test -- -c 2 || break
done
  • on rocky linux 9 + amd64 + linux 6.0.0 + gcc (GCC) 12.1.1 20220628 (Red Hat 12.1.1-3)
  • on debian sid + amd64 + linux 6.0.6 + g++-12 (Debian 12.2.0-9) 12.2.0
@tchaikov
Copy link
Contributor Author

tchaikov commented Dec 4, 2022

looks like an issue related to kernel 5.11 [0]. i am not able to reproduce it with exactly the same docker image on linux 6.0.0-5-amd64 .

testing with kernel 5.11, i've repeated the test for 613 times. still not able to reproduce it.


[0] https://discuss.circleci.com/t/linux-machine-executor-images-2022-january-q1-update/42831

@tchaikov
Copy link
Contributor Author

ERROR 2022-12-09 16:21:09,318 [shard 0] testlog - failed to write to stdin: std::system_error (error system:32, Broken pipe)
ERROR 2022-12-09 16:21:09,318 [shard 0] testlog - failed to read from stdout: std::system_error (error system:32, Broken pipe)
unknown location(0): fatal error: in "test_spawn_input": std::system_error: Broken pipe

see https://app.circleci.com/pipelines/github/tchaikov/seastar/289/workflows/b88e8c00-45f5-49b2-99ce-8036b372b728/jobs/3438

@tchaikov
Copy link
Contributor Author

tchaikov commented Jan 8, 2023

now that a4e1508 has been merged. i am closing this issue. will reopen it if this failure surfaces again.

@tchaikov tchaikov closed this as completed Jan 8, 2023
@tchaikov tchaikov reopened this Jan 8, 2023
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 12, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. `spawn_process()` is still marked
experimental. so as an intermediate solution, let's just drop this
test. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 13, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 13, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 13, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 13, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 13, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 13, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 14, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 14, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 14, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 14, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 17, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Feb 17, 2023
because test_spawn_input fails randomly, and we are not able to
fix it in a timely manner. spawn_process() is still marked
experimental. so as a temporary measure, let's mark the
tests `test_spawn_input` as known failures. so up to 3 failures
are allowed when testing it. this should help to unblock some PRs.

See also scylladb#1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
avikivity added a commit that referenced this issue Mar 8, 2023
… test_spawn_input' from Kefu Chai

* testing/entry_point: do not support Boost <= 1.58
* testing: set test name and file in constructor
* testing: do not use global vector for collecting tests
* testing: add boost unit test decorator support
* tests: do not fail spawn_test if up to 3 tests fail

See also #1320

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #1498

* github.com:scylladb/seastar:
  tests: do not fail spawn_test if less or equal to 3 tests fail
  testing: add boost unit test decorator support
  testing: do not use global vector for collecting tests
  testing: set line number for each test
  testing: set test name and file in constructor
  testing/entry_point: do not support Boost <= 1.58
@balusch
Copy link
Contributor

balusch commented Mar 13, 2023

Hi, kefu. I'm also curious about the 'Broken pipe' issue and have made some investigations last weekend.

Since the error is caused by cat closing the pipe fd prematurely, which shouldn't have happened in our simple test, but was not the case indeed. So I was assured that something went wrong with cat, and I guessed that cat itself may leave some clues to us and then I added some code to read the stderr of the subprocess after stdin.flush():

return stdin.write(text).then([&stdin] {
    return stdin.flush();
}).finally([&stderr]() {
    return stderr.read_up_to(1024).then([](temporary_buffer<char> err) {
        seastar_logger.error("read error from /bin/cat: {}", std::string(err.begin(), err.size()));
        return make_ready_future<>();
    });
});

And luckily I reproduced the issue on my machine(thank God!):

% for i in `seq 65535`; do ./build/debug/tests/unit/spawn_test | grep -i 'Broken pipe' && break; done;
WARNING: debug mode. Not for benchmarking or production
INFO  2023-03-13 02:56:52,063 seastar - Reactor backend: linux-aio
WARN  2023-03-13 02:56:52,084 [shard  6] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  0] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  5] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  4] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  2] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  9] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 11] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 10] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 13] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  7] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  8] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 12] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  3] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 15] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 14] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
INFO  2023-03-13 02:56:52,118 [shard  0] seastar - Created fair group io-queue-0, capacity rate 2147483:2147483, limit 12582912, rate 16777216 (factor 1), threshold 2000
INFO  2023-03-13 02:56:52,119 [shard  0] seastar - IO queue uses 0.75ms latency goal for device 0
INFO  2023-03-13 02:56:52,120 [shard  0] seastar - Created io group dev(0), length limit 4194304:4194304, rate 2147483647:2147483647
INFO  2023-03-13 02:56:52,120 [shard  0] seastar - Created io queue dev(0) capacities: 512:2000:2000 1024:3000:3000 2048:5000:5000 4096:9000:9000 8192:17000:17000 16384:33000:33000 32768:65000:65000 65536:129000:129000 131072:257000:257000
ERROR 2023-03-13 02:56:52,264 [shard  0] seastar - read error from /bin/cat: /bin/cat: -: Resource temporarily unavailable

/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(111): error: in "test_spawn_input": failed to write to stdin: std::system_error (error system:32, Broken pipe)

*** No errors detected

ERROR 2023-03-13 02:56:52,264 [shard 0] seastar - read error from /bin/cat: /bin/cat: -: Resource temporarily unavailable

And after searching google, I found several similar issue reports and this one gave me a hint:

Node.js sets its stdio pipe ends into non-blocking mode to work with async I/O, but cat expects blocking pipes.

We did make pipe fds non-blocking!

Maybe we should set fds used in subprocess blocking by default and leave the right to enable non-blocking to the child itself?

int nonblocking = 0;
std::get<pipefd_read_end>(cin_pipe).ioctl(FIONBIO, &nonblocking);
std::get<pipefd_write_end>(cout_pipe).ioctl(FIONBIO, &nonblocking);
std::get<pipefd_write_end>(cerr_pipe).ioctl(FIONBIO, &nonblocking);

tchaikov added a commit to tchaikov/seastar that referenced this issue Mar 13, 2023
it turns out tools like "cat" expect fd opened blocking mode. while
the pipe fds are always created in non-blocking mode. so in order to
appease these tools, let's set the fds passed to the spawned process
to blocking mode.

Fixes scylladb#1320
Signed-off-by: Jianyong Chen <baluschch@gmail.com>
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
@tchaikov
Copy link
Contributor Author

@balusch hi Jianyong, thank you very much for looking into this issue! both your analysis and fix make sense to me. so i took the liberty of creating a PR based on your proposal -- the only contribution from me is to verify the fix and adapt your reasoning to a commit message. could you help review it?

tchaikov added a commit to tchaikov/seastar that referenced this issue Mar 13, 2023
it turns out tools like "cat" expect fd opened blocking mode. while
the pipe fds are always created in non-blocking mode. so in order to
appease these tools, let's set the fds passed to the spawned process
to blocking mode.

Fixes scylladb#1320
Signed-off-by: Jianyong Chen <baluschch@gmail.com>
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
tchaikov added a commit to tchaikov/seastar that referenced this issue Mar 13, 2023
* pass fds opened in blocking mode to spawned process
* do not tolerate test failures in test_spawn_input

it turns out tools like "cat" expect fd opened blocking mode. while
the pipe fds are always created in non-blocking mode. so in order to
appease these tools, let's set the fds passed to the spawned process
to blocking mode.

Fixes scylladb#1320
Signed-off-by: Jianyong Chen <baluschch@gmail.com>
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
@balusch
Copy link
Contributor

balusch commented Mar 15, 2023

...
WARN  2023-03-15 11:16:33,958 [shard  1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARNING: unable to mbind shard memory; performance may suffer: Operation not permitted
WARN  2023-03-15 11:16:34,002 [shard 15] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
random-seed=2719424098
ERROR 2023-03-15 11:16:34,408 [shard  0] seastar - read error from /bin/cat: /bin/cat: -: Resource temporarily unavailable

/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(108): error: in "test_spawn_input": failed to write to stdin: std::system_error (error system:32, Broken pipe)
/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(117): error: in "test_spawn_input": check sstring(echo.get(), echo.size()) == text has failed [ != hello world
]

*** 2 failures are detected in the test module "Master Test Suite"

also reproduced on my machine after an era, but fotunatelly we have some clues this time -- yes, the same error message as before, so I think the problem is still caused by the nonblockingness of pipe fds used by the child, and we just didn't fix it correctly.

I start to guess maybe the fds used by child are still nonblocking, which I cannot check since /bin/cat is a utility without source code, so I write my own version:

// mycat.c

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>

static void test_nonblocking();

int
main(int argc, char **argv)
{
    test_nonblocking();
    return 0;
}

static void
test_nonblocking()
{

#define TEST(desc, fd)                              \
do {                                                \
    int status = fcntl(fd, F_GETFL, 0);             \
    assert(status >= 0);                            \
    if (status & O_NONBLOCK) {                      \
        printf(#desc " is nonblocking\n");          \
    } else {                                        \
        printf(#desc " is blocking\n");             \
    }                                               \
} while (0)

    TEST(stdin, STDIN_FILENO);
    TEST(stdout, STDOUT_FILENO);
    TEST(stderr, STDERR_FILENO);

#undef TEST
}

it simply prints the blockingness of stdin/stdout/stderr through stdout, without echoing data back.

And replaced /bin/cat in test_spawn_input, re-ran it and got the result:

/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(117): error: in "test_spawn_input": check sstring(echo.get(), echo.size()) == text has failed [stdin is non != hello world

although the message read from stdout of the child is incomplete, we still know it's nonblocking!

Finally I found the reason: we misused the file_desc::ioctl method.

There are lots of versions of ioctl in file_desc:

    int ioctl(int request) {
        return ioctl(request, 0);
    }
    int ioctl(int request, int value) {
        int r = ::ioctl(_fd, request, value);
        throw_system_error_on(r == -1, "ioctl");
        return r;
    }
    int ioctl(int request, unsigned int value) {
        int r = ::ioctl(_fd, request, value);
        throw_system_error_on(r == -1, "ioctl");
        return r;
    }
    template <class X>
    int ioctl(int request, X& data) {
        int r = ::ioctl(_fd, request, &data);
        throw_system_error_on(r == -1, "ioctl");
        return r;
    }
    template <class X>
    int ioctl(int request, X&& data) {
        int r = ::ioctl(_fd, request, &data);
        throw_system_error_on(r == -1, "ioctl");
        return r;
    }

what we need is the last two, and std::get<pipefd_read_end>(cin_pipe).ioctl(FIONBIO, &nonblocking); have chosen the last version actually, but with data = &nonblocking != 0, which is not what we intend to do.

So we could replace previous fix with

std::get<pipefd_read_end>(cin_pipe).template ioctl<int>(FIONBIO, 0);
std::get<pipefd_write_end>(cout_pipe).template ioctl<int>(FIONBIO, 0);
std::get<pipefd_write_end>(cerr_pipe).template ioctl<int>(FIONBIO, 0);

it should work.

@tchaikov
Copy link
Contributor Author

tchaikov commented Mar 15, 2023

actually, i was reading cat's source code couple months ago. but ended up being clueless and gave up: https://github.com/coreutils/coreutils/blob/master/src/cat.c

@tchaikov
Copy link
Contributor Author

@balusch hi Jianyolng, could you prepare a patch? i think it's you who did all the heavy lift.

@balusch
Copy link
Contributor

balusch commented Mar 15, 2023

could you prepare a patch? i think it's you who did all the heavy lift.

Sure, my pleasure.

@tchaikov
Copy link
Contributor Author

thank you!

@balusch
Copy link
Contributor

balusch commented Mar 15, 2023

actually, i was reading cat's source code couple months ago. but ended up being clueless and gave up: coreutils/coreutils@master/src/cat.c

I intended to read it last weekend, but suddenly thought that cat may leave us some information through stderr. 😃

avikivity added a commit that referenced this issue Mar 16, 2023
…rom Jianyong Chen

It turns out tools like "cat" expect fd opened blocking mode,
while the pipe fds are always created in non-blocking mode.
So in order to appease these tools, let's set the fds passed
to the spawned process to blocking mode.

Fixes #1320

Signed-off-by: Kefu Chai kefu.chai@scylladb.com
Signed-off-by: Jianyong Chen <baluschch@gmail.com>

Closes #1555

* github.com:scylladb/seastar:
  spawn_test: fix /bin/cat stuck in reading input.
  reactor: pass fd opened in blocking mode to spawned process
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants