redpanda:testing: partition_movement_test failing in Github CI #1426

Closed
emaxerrno opened this issue May 19, 2021 · 2 comments · Fixed by #1432

@emaxerrno (Contributor):
2021-05-19T21:17:04.4985740Z Errors while running CTest
2021-05-19T21:17:04.4986085Z 
2021-05-19T21:17:04.4986975Z 98% tests passed, 1 tests failed out of 50
2021-05-19T21:17:04.4987382Z 
2021-05-19T21:17:04.4987718Z Label Time Summary:
2021-05-19T21:17:04.4988164Z archival          =   8.12 sec*proc (2 tests)
2021-05-19T21:17:04.4988621Z bytes             =   0.48 sec*proc (1 test)
2021-05-19T21:17:04.4989055Z cluster           = 295.39 sec*proc (3 tests)
2021-05-19T21:17:04.4989536Z compression       =   0.41 sec*proc (1 test)
2021-05-19T21:17:04.4990032Z config            =   0.40 sec*proc (1 test)
2021-05-19T21:17:04.4990458Z coproc            =  25.76 sec*proc (1 test)
2021-05-19T21:17:04.4990908Z hashing           =   0.10 sec*proc (2 tests)
2021-05-19T21:17:04.4991605Z http              =   0.92 sec*proc (1 test)
2021-05-19T21:17:04.4992045Z json              =   0.35 sec*proc (1 test)
2021-05-19T21:17:04.4992487Z kafka             =  76.01 sec*proc (9 tests)
2021-05-19T21:17:04.4992938Z kafka_protocol    =   0.08 sec*proc (1 test)
2021-05-19T21:17:04.4993419Z model             =   0.12 sec*proc (2 tests)
2021-05-19T21:17:04.4993891Z pandaproxy        =  10.76 sec*proc (4 tests)
2021-05-19T21:17:04.4994590Z raft              =  51.88 sec*proc (3 tests)
2021-05-19T21:17:04.4995035Z random            =   0.06 sec*proc (1 test)
2021-05-19T21:17:04.4995670Z reflection        =   0.45 sec*proc (2 tests)
2021-05-19T21:17:04.4996130Z rpc               =   2.93 sec*proc (3 tests)
2021-05-19T21:17:04.4996687Z s3                =   1.30 sec*proc (1 test)
2021-05-19T21:17:04.4997137Z ssx               =   0.39 sec*proc (1 test)
2021-05-19T21:17:04.4997627Z storage           =  26.54 sec*proc (6 tests)
2021-05-19T21:17:04.4998125Z utils             =  10.72 sec*proc (3 tests)
2021-05-19T21:17:04.4998431Z 
2021-05-19T21:17:04.4998808Z Total Test time (real) = 514.49 sec
2021-05-19T21:17:04.4999114Z 
2021-05-19T21:17:04.4999692Z The following tests FAILED:
2021-05-19T21:17:04.5004283Z 	 37 - partition_moving_test_rpunit (Failed)

https://pipelines.actions.githubusercontent.com/XSI7xPi23IdMcR1QOMsb00ALQsxoSJhYpNBOdFaptPX1RGgaWB/_apis/pipelines/1/runs/16391/signedlogcontent/3?urlExpires=2021-05-19T21%3A50%3A32.5588711Z&urlSigningMethod=HMACV1&urlSignature=9z5cm6zU710JKfGY49qSJ%2BhR0KUyZPE%2BEP16zmK%2BUfg%3D

@mmaslankaprv (Member) commented May 20, 2021:

I did a deep dive into this issue. The whole problem is caused by one of the nodes (there are 5 of them) not responding to controller (raft0) consensus RPC requests; they all time out.

2021-05-19T21:15:44.2895764Z TRACE 2021-05-19 21:14:40,046 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:175 - Error append entries response from 3, rpc::errc::client_request_timeout

This effectively prevents that node from receiving controller (raft0) updates.

The test asserts state on every node. The node that does not receive the updates is unable to execute the partition configuration update operation, hence the test cannot continue and fails.

I still have to investigate why one of the nodes times out all requests during a 30-second window.
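
For context, the test's assertion is essentially a bounded wait over all nodes, so a single stalled node is enough to fail the whole run. A minimal sketch of that shape (the helper and predicate names below are illustrative assumptions, not the actual test code):

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>
#include <vector>

using namespace std::chrono_literals;

// Wait until `predicate` holds on every node, or give up after `timeout`.
// One node that never applies the partition configuration update is enough
// to exhaust the timeout and fail the whole test.
void wait_on_all_nodes(
  const std::vector<int>& node_ids,
  const std::function<bool(int)>& predicate,
  std::chrono::seconds timeout = 30s) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (int id : node_ids) {
        while (!predicate(id)) {
            if (std::chrono::steady_clock::now() > deadline) {
                throw std::runtime_error(
                  "node " + std::to_string(id)
                  + " did not apply the configuration update");
            }
            std::this_thread::sleep_for(100ms);
        }
    }
}
```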

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue May 20, 2021
Fixed a bug leading to a situation in which a caller had to wait for an
operation requiring a log write until readers were evicted from the cache
by the eviction timer.

The problem was related to the fact that when a `cached_reader`'s
`entry_guard` was destroyed, it did not schedule disposal of readers that
were marked as invalid, but only of those which were no longer reusable.
This could lead to a situation in which a reader that was in use during a
truncate call was not disposed immediately after it was destroyed, even
though it had been marked as invalid by the `readers_cache::evict_if`
method. The reader was reinserted into the readers cache, preventing the
truncation from finishing; it was only removed in the next eviction timer
iteration.

Example sequence triggering the bug:

1) create a reader
2) truncate the log in a range for which the reader holds a lock
3) destroy the reader
4) truncation should finish immediately after the reader is destroyed, but
   with the bug it only completed once the eviction timer removed the reader

Fixes: redpanda-data#1426

Signed-off-by: Michal Maslanka <michal@vectorized.io>
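
In other words, the fix makes the guard's destruction schedule disposal for invalidated readers as well, not only for non-reusable ones. A toy model of that decision (all names are illustrative, not the actual redpanda `readers_cache` implementation):

```cpp
#include <memory>
#include <vector>

// Toy model of the cache behaviour described above.
struct cached_reader {
    bool valid = true;    // cleared by readers_cache::evict_if
    bool reusable = true; // can this reader serve another request?
};

struct readers_cache {
    std::vector<std::unique_ptr<cached_reader>> cache;
    std::vector<std::unique_ptr<cached_reader>> to_dispose;

    // Models what happens when an entry_guard is destroyed.
    void release(std::unique_ptr<cached_reader> r) {
        // Before the fix the condition was effectively only `!r->reusable`,
        // so a reader invalidated by evict_if (e.g. during truncation) was
        // put back into the cache and kept its lock until the eviction
        // timer fired, blocking the pending truncate in the meantime.
        if (!r->valid || !r->reusable) {
            to_dispose.push_back(std::move(r)); // dispose immediately
        } else {
            cache.push_back(std::move(r)); // still valid: keep for reuse
        }
    }
};
```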
@mmaslankaprv (Member):

This problem is related to the readers cache. More details in:

#1432

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue May 21, 2021
Fixed a bug leading to a situation in which a caller had to wait for an
operation requiring a log write until readers were evicted from the cache
by the eviction timer.