
CI Failure (random consumer timed out) in TieredStorageCacheStressTest.tiny_cache_test #13648

Closed
abhijat opened this issue Sep 25, 2023 · 47 comments · Fixed by #14096 or #23006
Labels
area/cloud-storage Shadow indexing subsystem ci-failure kind/bug Something isn't working sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low.

Comments


abhijat commented Sep 25, 2023

https://buildkite.com/redpanda/vtools/builds/9591
https://buildkite.com/redpanda/vtools/builds/9594
https://buildkite.com/redpanda/vtools/builds/9598
https://buildkite.com/redpanda/vtools/builds/9601

Module: rptest.scale_tests.tiered_storage_cache_stress_test
Class: TieredStorageCacheStressTest
Method: tiny_cache_test
test_id:    TieredStorageCacheStressTest.tiny_cache_test
status:     FAIL
run time:   138.107 seconds

TimeoutError("KgoVerifierRandomConsumer-0-140055370647680 didn't complete in 49.99999952316284 seconds")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/tiered_storage_cache_stress_test.py", line 344, in tiny_cache_test
    KgoVerifierRandomConsumer.oneshot(
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 106, in oneshot
    inst.wait(**timeout_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 287, in wait
    if not self.wait_node(node, end - now):
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 205, in wait_node
    return self._do_wait_node(node, timeout_sec)
  File "/home/ubuntu/redpanda/tests/rptest/services/kgo_verifier_services.py", line 241, in _do_wait_node
    self._redpanda.wait_until(
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1149, in wait_until
    wait_until(wrapped,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: KgoVerifierRandomConsumer-0-140055370647680 didn't complete in 49.99999952316284 seconds

JIRA Link: CORE-1464

@abhijat abhijat added kind/bug Something isn't working ci-failure area/cloud-storage Shadow indexing subsystem labels Sep 25, 2023

abhijat commented Sep 25, 2023

Looking at one instance of the test failure, the consumer had read up to 78 messages. The expectation is to read 100 messages, so the consumer appears to be slow but not stuck:

[INFO  - 2023-09-23 13:48:04,020 - kgo_verifier_services - _ingest_status - lineno:368]: Worker KgoVerifierRandomConsumer-0-140055370647680 status: ConsumerStatus<True, 0, ValidatorStatus<78 0 0>>


andijcr commented Oct 6, 2023

From the kgo-verifier log, towards the end:

time="2023-09-25T02:15:36Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:36Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-000
time="2023-09-25T02:15:36Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {47179 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-000
time="2023-09-25T02:15:36Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:36Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-016
time="2023-09-25T02:15:36Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {105623 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-016
time="2023-09-25T02:15:36Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:36Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:36Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:36Z" level=info msg="Validator status: {\"name\":\"\",\"valid_reads\":5,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0,\"max_offsets_consumed\":{\"0\":188858}}"
time="2023-09-25T02:15:36Z" level=info msg="Validator status: {\"name\":\"\",\"valid_reads\":5,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0,\"max_offsets_consumed\":{\"0\":188858}}"
time="2023-09-25T02:15:36Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-015
time="2023-09-25T02:15:36Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {59696 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-015
time="2023-09-25T02:15:36Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:36Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:36Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-006
time="2023-09-25T02:15:36Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {89594 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-006
time="2023-09-25T02:15:36Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:36Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:40Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-005
time="2023-09-25T02:15:40Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {68864 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-005
time="2023-09-25T02:15:40Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:40Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:40Z" level=info msg="Printing stack on remote request:"
goroutine profile: total 63
8 @ 0x4a454 0x5b014 0x2e54e0 0x7b0a4
#	0x2e54df	github.com/twmb/franz-go/pkg/kgo.(*Client).reapConnectionsLoop+0xef	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/broker.go:557

8 @ 0x4a454 0x5b014 0x332b58 0x7b0a4
#	0x332b57	github.com/twmb/franz-go/pkg/kgo.(*Client).updateMetadataLoop+0x157	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/metadata.go:173

7 @ 0x4a454 0x431c0 0x75490 0xcad08 0xcba48 0xcba39 0x1651c8 0x16fb24 0x9e178 0x2ea610 0x2ea5dd 0x7b0a4
#	0x7548f		internal/poll.runtime_pollWait+0x9f					/usr/local/go/src/runtime/netpoll.go:305
#	0xcad07		internal/poll.(*pollDesc).wait+0x27					/usr/local/go/src/internal/poll/fd_poll_runtime.go:84
#	0xcba47		internal/poll.(*pollDesc).waitRead+0x1e7				/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
#	0xcba38		internal/poll.(*FD).Read+0x1d8						/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x1651c7	net.(*netFD).Read+0x27							/usr/local/go/src/net/fd_posix.go:55
#	0x16fb23	net.(*conn).Read+0x33							/usr/local/go/src/net/net.go:183
#	0x9e177		io.ReadAtLeast+0xa7							/usr/local/go/src/io/io.go:332
#	0x2ea60f	io.ReadFull+0x14f							/usr/local/go/src/io/io.go:351
#	0x2ea5dc	github.com/twmb/franz-go/pkg/kgo.(*brokerCxn).readConn.func2+0x11c	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/broker.go:1126

7 @ 0x4a454 0x5b014 0x2ea188 0x2eaed0 0x2ecbe4 0x2ec99c 0x7b0a4
#	0x2ea187	github.com/twmb/franz-go/pkg/kgo.(*brokerCxn).readConn+0x327	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/broker.go:1142
#	0x2eaecf	github.com/twmb/franz-go/pkg/kgo.(*brokerCxn).readResponse+0x5f	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/broker.go:1217
#	0x2ecbe3	github.com/twmb/franz-go/pkg/kgo.(*brokerCxn).handleResp+0xc3	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/broker.go:1448
#	0x2ec99b	github.com/twmb/franz-go/pkg/kgo.(*brokerCxn).handleResps+0x9b	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/broker.go:1438

7 @ 0x4a454 0x5b014 0x310990 0x395564 0x3999e8 0x7b0a4
#	0x31098f	github.com/twmb/franz-go/pkg/kgo.(*Client).PollRecords+0x1ef					/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/consumer.go:511
#	0x395563	github.com/redpanda-data/kgo-verifier/pkg/worker/verifier.(*RandomReadWorker).Wait+0x5e3	/opt/kgo-verifier/pkg/worker/verifier/random_read_worker.go:136
#	0x3999e7	main.main.func6+0x37										/opt/kgo-verifier/cmd/kgo-verifier/main.go:263

7 @ 0x4a454 0x5b014 0x314cb4 0x7b0a4
#	0x314cb3	github.com/twmb/franz-go/pkg/kgo.(*consumerSession).manageFetchConcurrency+0x103	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/consumer.go:1465

7 @ 0x4a454 0x5b014 0x34fae4 0x34f40c 0x7b0a4
#	0x34fae3	github.com/twmb/franz-go/pkg/kgo.(*source).fetch+0x273		/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/source.go:636
#	0x34f40b	github.com/twmb/franz-go/pkg/kgo.(*source).loopFetch+0x30b	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/source.go:547

7 @ 0x4a454 0x77030 0x77011 0x81b40 0x310f24 0x7b0a4
#	0x77010		sync.runtime_notifyListWait+0x140					/usr/local/go/src/runtime/sema.go:517
#	0x81b3f		sync.(*Cond).Wait+0xcf							/usr/local/go/src/sync/cond.go:70
#	0x310f23	github.com/twmb/franz-go/pkg/kgo.(*Client).PollRecords.func2+0x103	/root/go/pkg/mod/github.com/twmb/franz-go@v1.14.3/pkg/kgo/consumer.go:500

1 @ 0x1c3f4 0x774b0 0x2223ec 0x7b0a4
#	0x774af		os/signal.signal_recv+0x2f	/usr/local/go/src/runtime/sigqueue.go:152
#	0x2223eb	os/signal.loop+0x1b		/usr/local/go/src/os/signal/signal_unix.go:23

1 @ 0x3fc7c 0x74f54 0x228664 0x228480 0x225a8c 0x397974 0x210fa8 0x212620 0x21376c 0x210034 0x7b0a4
#	0x74f53		runtime/pprof.runtime_goroutineProfileWithLabels+0x23	/usr/local/go/src/runtime/mprof.go:846
#	0x228663	runtime/pprof.writeRuntimeProfile+0xb3			/usr/local/go/src/runtime/pprof/pprof.go:723
#	0x22847f	runtime/pprof.writeGoroutine+0x4f			/usr/local/go/src/runtime/pprof/pprof.go:683
#	0x225a8b	runtime/pprof.(*Profile).WriteTo+0x14b			/usr/local/go/src/runtime/pprof/pprof.go:330
#	0x397973	main.main.func5+0x73					/opt/kgo-verifier/cmd/kgo-verifier/main.go:204
#	0x210fa7	net/http.HandlerFunc.ServeHTTP+0x37			/usr/local/go/src/net/http/server.go:2109
#	0x21261f	net/http.(*ServeMux).ServeHTTP+0x13f			/usr/local/go/src/net/http/server.go:2487
#	0x21376b	net/http.serverHandler.ServeHTTP+0x2cb			/usr/local/go/src/net/http/server.go:2947
#	0x210033	net/http.(*conn).serve+0x543				/usr/local/go/src/net/http/server.go:1991

1 @ 0x4a454 0x431c0 0x75490 0xcad08 0xcba48 0xcba39 0x1651c8 0x16fb24 0x20a434 0x7b0a4
#	0x7548f		internal/poll.runtime_pollWait+0x9f		/usr/local/go/src/runtime/netpoll.go:305
#	0xcad07		internal/poll.(*pollDesc).wait+0x27		/usr/local/go/src/internal/poll/fd_poll_runtime.go:84
#	0xcba47		internal/poll.(*pollDesc).waitRead+0x1e7	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
#	0xcba38		internal/poll.(*FD).Read+0x1d8			/usr/local/go/src/internal/poll/fd_unix.go:167
#	0x1651c7	net.(*netFD).Read+0x27				/usr/local/go/src/net/fd_posix.go:55
#	0x16fb23	net.(*conn).Read+0x33				/usr/local/go/src/net/net.go:183
#	0x20a433	net/http.(*connReader).backgroundRead+0x43	/usr/local/go/src/net/http/server.go:678

1 @ 0x4a454 0x431c0 0x75490 0xcad08 0xcc3f8 0xcc3e9 0x166038 0x176db8 0x17613c 0x213c24 0x2138e4 0x21467c 0x7b0a4
#	0x7548f		internal/poll.runtime_pollWait+0x9f		/usr/local/go/src/runtime/netpoll.go:305
#	0xcad07		internal/poll.(*pollDesc).wait+0x27		/usr/local/go/src/internal/poll/fd_poll_runtime.go:84
#	0xcc3f7		internal/poll.(*pollDesc).waitRead+0x1d7	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
#	0xcc3e8		internal/poll.(*FD).Accept+0x1c8		/usr/local/go/src/internal/poll/fd_unix.go:614
#	0x166037	net.(*netFD).accept+0x27			/usr/local/go/src/net/fd_unix.go:172
#	0x176db7	net.(*TCPListener).accept+0x27			/usr/local/go/src/net/tcpsock_posix.go:142
#	0x17613b	net.(*TCPListener).Accept+0x2b			/usr/local/go/src/net/tcpsock.go:288
#	0x213c23	net/http.(*Server).Serve+0x2f3			/usr/local/go/src/net/http/server.go:3070
#	0x2138e3	net/http.(*Server).ListenAndServe+0x83		/usr/local/go/src/net/http/server.go:2999
#	0x21467b	net/http.ListenAndServe+0x9b			/usr/local/go/src/net/http/server.go:3255

1 @ 0x4a454 0x5bf3c 0x5bf19 0x76d18 0x84c90 0x3991c8 0x4a084 0x7b0a4
#	0x76d17		sync.runtime_Semacquire+0x27	/usr/local/go/src/runtime/sema.go:62
#	0x84c8f		sync.(*WaitGroup).Wait+0x7f	/usr/local/go/src/sync/waitgroup.go:139
#	0x3991c7	main.main+0x1837		/opt/kgo-verifier/cmd/kgo-verifier/main.go:273
#	0x4a083		runtime.main+0x253		/usr/local/go/src/runtime/proc.go:250

time="2023-09-25T02:15:40Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-011
time="2023-09-25T02:15:40Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {173229 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-011
time="2023-09-25T02:15:40Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:40Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:41Z" level=warning msg="Gap detected in consumed offsets. Expected 124223, but got 128968"
time="2023-09-25T02:15:41Z" level=info msg="Validator status: {\"name\":\"\",\"valid_reads\":3,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0,\"max_offsets_consumed\":{\"0\":128968}}"
time="2023-09-25T02:15:41Z" level=warning msg="Gap detected in consumed offsets. Expected 128969, but got 132425"
time="2023-09-25T02:15:41Z" level=info msg="Validator status: {\"name\":\"\",\"valid_reads\":5,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0,\"max_offsets_consumed\":{\"0\":132425}}"
time="2023-09-25T02:15:41Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-012
time="2023-09-25T02:15:41Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {141934 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-012
time="2023-09-25T02:15:41Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:41Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:41Z" level=warning msg="Gap detected in consumed offsets. Expected 145878, but got 204202"
time="2023-09-25T02:15:41Z" level=info msg="Validator status: {\"name\":\"\",\"valid_reads\":3,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0,\"max_offsets_consumed\":{\"0\":204202}}"
time="2023-09-25T02:15:41Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-000
time="2023-09-25T02:15:41Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {121680 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-000
time="2023-09-25T02:15:41Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:41Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:41Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-016
time="2023-09-25T02:15:41Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {153847 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-016
time="2023-09-25T02:15:41Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:41Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."
time="2023-09-25T02:15:41Z" level=error msg="Error reading from partition :0: context deadline exceeded" tag=random-015
time="2023-09-25T02:15:41Z" level=error msg="Reloading offsets on empty response reading from partition 0 at {195256 0 -1 0 %!d(bool=false) %!d(bool=false)}" tag=random-015
time="2023-09-25T02:15:41Z" level=info msg="Loading offsets for topic tiny-cache t=-1..."
time="2023-09-25T02:15:41Z" level=info msg="Loading offsets for topic tiny-cache t=-2..."

I wonder why the stack trace appears.

Edit: it's explicitly requested as part of the test lifecycle:
[INFO - 2023-09-25 02:15:40,738 - kgo_verifier_services - _remote - lineno:184]: KgoVerifierRandomConsumer-0-281473120795568 remote call: http://ip-172-31-47-228:8080/print_stack
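The stack dump is therefore not a hang indicator: the harness's `_remote` helper (named in the log line above) fetches it over the worker's HTTP admin endpoint. A minimal sketch of that call, with a hypothetical helper name and the host taken from the log:

```python
# Sketch of the remote stack-dump request made by the test harness.
# print_stack_url is a hypothetical helper; the host/port are from the log.
from urllib.request import urlopen
from urllib.error import URLError

def print_stack_url(host: str, port: int = 8080) -> str:
    # Build the admin URL for the kgo-verifier /print_stack endpoint.
    return f"http://{host}:{port}/print_stack"

url = print_stack_url("ip-172-31-47-228")
print(url)
try:
    # Only succeeds against a live kgo-verifier node; fails harmlessly here.
    urlopen(url, timeout=2)
except (URLError, OSError):
    pass
```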


andijcr commented Oct 6, 2023

The timeout is computed as:


        # Use a random consumer to pathologically overload the ability of the cache to
        # handle incoming reads.
        # This should be slow, but:
        #  - Eventually complete
        #  - Do not overshoot cache size limits

        # Runtime will be as long as it takes to cycle each read chunk through
        # the cache, multiplied by the trim throttle interval (5s)
        expect_consume_time_s = self.runtime_grace_factor * (
            (consume_parallel * consume_messages_per_worker * 5) /
            (cache_size / chunk_size))

where self.runtime_grace_factor is already 2.

Increasing it is easy and probably the fastest fix, but if that works I'll add a ticket to study whether there is a trend of worsening performance.

It would be cool if it were possible to mark a test result as "passed, but these metrics are not green".
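To make the formula concrete, here is a self-contained sketch of the timeout computation with illustrative (hypothetical) parameter values; the real test derives these from the cluster and topic configuration:

```python
# Hedged sketch of expect_consume_time_s from the test, with made-up inputs.
runtime_grace_factor = 2         # value noted in the comment above
consume_parallel = 16            # hypothetical: number of parallel readers
consume_messages_per_worker = 8  # hypothetical: messages each worker reads
trim_throttle_interval_s = 5     # trim throttle interval from the code comment
cache_size = 20 * 1024 * 1024    # hypothetical: 20 MiB tiny cache
chunk_size = 4 * 1024 * 1024     # hypothetical: 4 MiB read chunk

# Each read may wait one trim-throttle interval, amortized over how many
# chunks the cache can hold at once (cache_size / chunk_size).
expect_consume_time_s = runtime_grace_factor * (
    (consume_parallel * consume_messages_per_worker * trim_throttle_interval_s)
    / (cache_size / chunk_size))
print(expect_consume_time_s)  # 256.0 with these example values
```

With these numbers the cache holds 5 chunks, so 2 * (16 * 8 * 5) / 5 = 256 seconds; the failing run's 50-second budget shows how sensitive the bound is to the chosen parameters.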

@piyushredpanda

@andijcr what is next here?


andijcr commented Oct 12, 2023

@andijcr what is next here?

Waiting for CDT here: #14096

@dotnwat dotnwat added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Apr 5, 2024

andijcr commented Apr 23, 2024

These failures seem to point, as before, to a slow kgo-verifier that is not stuck:

[INFO - 2024-04-22 13:26:53,316 - kgo_verifier_services - _ingest_status - lineno:398]: Worker KgoVerifierRandomConsumer-0-132699595391568 status: ConsumerStatus<True, 0, ValidatorStatus<valid_reads=167, invalid_reads=0, out_of_scope_invalid_reads=0, lost_offsets={}>>

but we already increased the grace factor once. Investigating, but this does not seem sev/medium.


vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Aug 23, 2024
This is similar to the fix applied in
redpanda-data#22796 commit 7763669
which is needed after carryover trim was disabled and we hit cache
trimming throttling making the test slower than our threshold

Fixes redpanda-data#13648
CORE-1464

(cherry picked from commit b5faa6d)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Sep 3, 2024
This is similar to the fix applied in
redpanda-data#22796 commit 7763669
which is needed after carryover trim was disabled and we hit cache
trimming throttling making the test slower than our threshold

Fixes redpanda-data#13648
CORE-1464

(cherry picked from commit b5faa6d)