CI Failure (key symptom) in ShadowIndexingCacheSpaceLeakTest.test_si_cache #21597

Closed
vbotbuildovich opened this issue Jul 23, 2024 · 10 comments · Fixed by #22796
Labels
auto-triaged (used to know which issues have been opened from a CI job), ci-failure

Comments

@vbotbuildovich
Collaborator

vbotbuildovich commented Jul 23, 2024

https://buildkite.com/redpanda/redpanda/builds/51904

Module: rptest.tests.test_si_cache_space_leak
Class: ShadowIndexingCacheSpaceLeakTest
Method: test_si_cache
Arguments: {
    "concurrency": 2,
    "message_size": 10000,
    "num_messages": 100000
}
test_id:    ShadowIndexingCacheSpaceLeakTest.test_si_cache
status:     FAIL
run time:   640.675 seconds

TimeoutError("KgoVerifierRandomConsumer-0-139809235584384 didn't complete in 599.9999992847443 seconds")
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/test_si_cache_space_leak.py", line 148, in test_si_cache
    self._consumer.wait()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/services/service.py", line 287, in wait
    if not self.wait_node(node, end - now):
  File "/root/tests/rptest/services/kgo_verifier_services.py", line 251, in wait_node
    return self._do_wait_node(node, timeout_sec)
  File "/root/tests/rptest/services/kgo_verifier_services.py", line 287, in _do_wait_node
    self._redpanda.wait_until(
  File "/root/tests/rptest/services/redpanda.py", line 1054, in wait_until
    wait_until(wrapped,
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: KgoVerifierRandomConsumer-0-139809235584384 didn't complete in 599.9999992847443 seconds
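
The fractional timeout in the error message is consistent with the `end - now` argument visible in the traceback: the caller appears to compute one overall deadline and pass the remaining time down to each wait. A minimal sketch of that pattern, assuming a simple polling loop (the names `wait_until` and `wait_for_all` here are illustrative and are not ducktape's or rptest's actual implementation):

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll `condition` until it returns True or `timeout_sec` elapses."""
    stop = time.monotonic() + timeout_sec
    while time.monotonic() < stop:
        if condition():
            return
        time.sleep(backoff_sec)
    # err_msg may be a string or a zero-argument callable producing one.
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg)


def wait_for_all(checks, timeout_sec):
    """Wait for each (name, check) pair in turn under one shared deadline."""
    end = time.monotonic() + timeout_sec
    for name, check in checks:
        # Remaining budget: the original timeout minus time already spent,
        # which is why the reported value is not a round number.
        remaining = end - time.monotonic()
        wait_until(check, remaining,
                   err_msg=f"{name} didn't complete in {remaining} seconds")
```

With a check that never succeeds, for example `wait_for_all([("consumer-0", lambda: False)], timeout_sec=600)`, the error reports a value at or just under 600, much like the message above.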

JIRA Link: CORE-5760

vbotbuildovich added the auto-triaged and ci-failure labels Jul 23, 2024
@nvartolomei
Contributor

The failure is caused by PR #21556. The test now takes 3x longer and sometimes fails due to timeouts.

@abhijat @Lazin @jcipar any insights on why disabling trim carryover makes this test so slow? Are cache puts getting blocked? Is this expected? Should we just bump the timeout? Reverting the above-mentioned PR makes the test run just fine.

@ztlpn
Contributor

ztlpn commented Sep 5, 2024

@nvartolomei Should we backport the fix? Seeing the error in 24.2.

@nvartolomei
Contributor

@ztlpn actually, #23179

redpanda-data deleted a comment from vbotbuildovich Sep 5, 2024
@ztlpn
Contributor

ztlpn commented Sep 5, 2024

@nvartolomei Hmm, but #23006 (which was backported as #23179 and #23024) fixes the scale test.

I was talking about #22796 (which wasn't backported)

@nvartolomei
Contributor

Got confused. Backporting now.

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this issue Sep 5, 2024
This test is too slow with the default configuration, which makes it
flaky. Instead of raising the timeouts, I'm reducing the cache eviction
throttling, which makes the test 3x faster.

The test became flaky after in-memory trim was introduced in
redpanda-data#21556.

The main insight was provided by https://github.com/abhijat in a private
exchange:

> I think it might be the extra throttling. With the carry over
> disabled, we always have to do a trim when reserving space, which
> results in a lot more throttling and sleep:
>
> ```
> $ grep -Ri "Cache trimming throttled" * | grep -c cache
> 139
> ```
>
> With the carryover list in place, about half of the calls to reserve
> space end up in an early return because the list provides enough room
> to clear up space, which does not cause the trimming to be throttled
> as much:
>
> ```
> $ grep -Ri "Cache trimming throttled" * | grep -c cache
> 63
> ```
>
> Although that doesn't explain how this test used to work before, IIRC
> carryover is a fairly new feature

Fixes redpanda-data#21597

(cherry picked from commit 7763669)
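
The two `grep` counts quoted above can be reproduced with a short script when comparing runs. This is a rough Python equivalent, assuming the logs from one test run sit under a single directory; the function name `count_throttled` and the directory layout are hypothetical, and only the log phrase comes from the quoted commands:

```python
import os


def count_throttled(root, phrase="Cache trimming throttled"):
    """Count log lines under `root` that mention the throttling phrase."""
    needle = phrase.lower()
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as f:
                for line in f:
                    # First grep (-i): case-insensitive phrase match.
                    # Second grep: lowercase "cache" must appear somewhere
                    # in the "path:line" text that grep -R would print.
                    if needle in line.lower() and "cache" in f"{path}:{line}":
                        total += 1
    return total
```

Running it once on the logs of a run with carryover enabled and once on a run with it disabled gives the two numbers to compare.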