redpanda:testing: partition_movement_test failing in Github CI #1426

Closed
emaxerrno opened this issue May 19, 2021 · 2 comments · Fixed by #1432

@emaxerrno (Contributor):
2021-05-19T21:17:04.4985740Z Errors while running CTest
2021-05-19T21:17:04.4986085Z 
2021-05-19T21:17:04.4986975Z 98% tests passed, 1 tests failed out of 50
2021-05-19T21:17:04.4987382Z 
2021-05-19T21:17:04.4987718Z Label Time Summary:
2021-05-19T21:17:04.4988164Z archival          =   8.12 sec*proc (2 tests)
2021-05-19T21:17:04.4988621Z bytes             =   0.48 sec*proc (1 test)
2021-05-19T21:17:04.4989055Z cluster           = 295.39 sec*proc (3 tests)
2021-05-19T21:17:04.4989536Z compression       =   0.41 sec*proc (1 test)
2021-05-19T21:17:04.4990032Z config            =   0.40 sec*proc (1 test)
2021-05-19T21:17:04.4990458Z coproc            =  25.76 sec*proc (1 test)
2021-05-19T21:17:04.4990908Z hashing           =   0.10 sec*proc (2 tests)
2021-05-19T21:17:04.4991605Z http              =   0.92 sec*proc (1 test)
2021-05-19T21:17:04.4992045Z json              =   0.35 sec*proc (1 test)
2021-05-19T21:17:04.4992487Z kafka             =  76.01 sec*proc (9 tests)
2021-05-19T21:17:04.4992938Z kafka_protocol    =   0.08 sec*proc (1 test)
2021-05-19T21:17:04.4993419Z model             =   0.12 sec*proc (2 tests)
2021-05-19T21:17:04.4993891Z pandaproxy        =  10.76 sec*proc (4 tests)
2021-05-19T21:17:04.4994590Z raft              =  51.88 sec*proc (3 tests)
2021-05-19T21:17:04.4995035Z random            =   0.06 sec*proc (1 test)
2021-05-19T21:17:04.4995670Z reflection        =   0.45 sec*proc (2 tests)
2021-05-19T21:17:04.4996130Z rpc               =   2.93 sec*proc (3 tests)
2021-05-19T21:17:04.4996687Z s3                =   1.30 sec*proc (1 test)
2021-05-19T21:17:04.4997137Z ssx               =   0.39 sec*proc (1 test)
2021-05-19T21:17:04.4997627Z storage           =  26.54 sec*proc (6 tests)
2021-05-19T21:17:04.4998125Z utils             =  10.72 sec*proc (3 tests)
2021-05-19T21:17:04.4998431Z 
2021-05-19T21:17:04.4998808Z Total Test time (real) = 514.49 sec
2021-05-19T21:17:04.4999114Z 
2021-05-19T21:17:04.4999692Z The following tests FAILED:
2021-05-19T21:17:04.5004283Z 	 37 - partition_moving_test_rpunit (Failed)

https://pipelines.actions.githubusercontent.com/XSI7xPi23IdMcR1QOMsb00ALQsxoSJhYpNBOdFaptPX1RGgaWB/_apis/pipelines/1/runs/16391/signedlogcontent/3?urlExpires=2021-05-19T21%3A50%3A32.5588711Z&urlSigningMethod=HMACV1&urlSignature=9z5cm6zU710JKfGY49qSJ%2BhR0KUyZPE%2BEP16zmK%2BUfg%3D

@mmaslankaprv (Member) commented May 20, 2021:

I did a deep dive into this issue. The whole problem is caused by one of the nodes (there are 5 of them) not responding to controller (raft0) consensus RPC requests; they all time out.

2021-05-19T21:15:44.2895764Z TRACE 2021-05-19 21:14:40,046 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:175 - Error append entries response from 3, rpc::errc::client_request_timeout

This effectively prevents that node from receiving controller (raft0) updates.

The test asserts state on every node. The node that does not receive the updates is unable to execute the partition configuration update operation, hence the test cannot continue and fails.

I still have to investigate why one of the nodes times out all requests during a 30-second window.
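
For context, the test's assertion is essentially a bounded wait over all nodes, so a single stalled node is enough to fail the whole run. A minimal sketch of that shape (the helper and predicate names below are illustrative assumptions, not the actual test code):

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>
#include <vector>

using namespace std::chrono_literals;

// Wait until `predicate` holds on every node, or give up after `timeout`.
// One node that never applies the partition configuration update is enough
// to exhaust the timeout and fail the whole test.
void wait_on_all_nodes(
  const std::vector<int>& node_ids,
  const std::function<bool(int)>& predicate,
  std::chrono::seconds timeout = 30s) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (int id : node_ids) {
        while (!predicate(id)) {
            if (std::chrono::steady_clock::now() > deadline) {
                throw std::runtime_error(
                  "node " + std::to_string(id)
                  + " did not apply the configuration update");
            }
            std::this_thread::sleep_for(100ms);
        }
    }
}
```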

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue May 20, 2021
Fixed a bug leading to a situation in which a caller had to wait for an
operation requiring a log write until readers were evicted from the cache
by the eviction timer.

The problem was related to the fact that when a `cached_reader`'s
`entry_guard` was destroyed, it did not schedule disposal of readers that
were marked as invalid, but only of those which were no longer reusable.
This could lead to a situation in which a reader that was in use during a
truncate call was not disposed immediately after it was destroyed, even
though it had been marked as invalid by the `readers_cache::evict_if`
method. The reader was reinserted into the readers cache, preventing the
truncation from finishing; it was only removed in the next eviction timer
iteration.

Example sequence triggering the bug:

1) create a reader
2) truncate the log in a range for which the reader holds a lock
3) destroy the reader
4) truncation should finish immediately after the reader is destroyed, but
   with the bug it only completed once the eviction timer removed the reader

Fixes: redpanda-data#1426

Signed-off-by: Michal Maslanka <michal@vectorized.io>
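
In other words, the fix makes the guard's destruction schedule disposal for invalidated readers as well, not only for non-reusable ones. A toy model of that decision (all names are illustrative, not the actual redpanda `readers_cache` implementation):

```cpp
#include <memory>
#include <vector>

// Toy model of the cache behaviour described above.
struct cached_reader {
    bool valid = true;    // cleared by readers_cache::evict_if
    bool reusable = true; // can this reader serve another request?
};

struct readers_cache {
    std::vector<std::unique_ptr<cached_reader>> cache;
    std::vector<std::unique_ptr<cached_reader>> to_dispose;

    // Models what happens when an entry_guard is destroyed.
    void release(std::unique_ptr<cached_reader> r) {
        // Before the fix the condition was effectively only `!r->reusable`,
        // so a reader invalidated by evict_if (e.g. during truncation) was
        // put back into the cache and kept its lock until the eviction
        // timer fired, blocking the pending truncate in the meantime.
        if (!r->valid || !r->reusable) {
            to_dispose.push_back(std::move(r)); // dispose immediately
        } else {
            cache.push_back(std::move(r)); // still valid: keep for reuse
        }
    }
};
```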
@mmaslankaprv (Member):

This problem is related to the readers cache. More details in:

#1432

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue May 21, 2021
Fixed a bug leading to a situation in which a caller had to wait for an
operation requiring a log write until readers were evicted from the cache
by the eviction timer.