-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[v24.1.x] cloud_storage: avoid possible double stop in segment reader #18241
Merged
piyushredpanda
merged 1 commit into
redpanda-data:v24.1.x
from
vbotbuildovich:backport-pr-18229-v24.1.x-528
May 3, 2024
Merged
[v24.1.x] cloud_storage: avoid possible double stop in segment reader #18241
piyushredpanda
merged 1 commit into
redpanda-data:v24.1.x
from
vbotbuildovich:backport-pr-18229-v24.1.x-528
May 3, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
It's possible that when a client disconnects, the abort source subscription associated with the client connection ends up evicting the current segment reader. 1. Remote partition reader is in the process of reading a segment 2. The read eventually scans past the LSO, and stops. This stop() is asynchonous, closing IO streams, etc. 3. The Kafka client disconnects, triggering the abort source subscription callback to call evict_segment_reader(), moving the segment reader to the remote partition's eviction list 4. The remote partition's eviction loop sees this reader and runs stop(), and destructs the underlying parser 5. The read fiber, still stopping, resumes control, but the underlying parser is destructed This is roughly the sequence of events seen below. ``` DEBUG 2024-05-02 14:55:06,231 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1197 - [{{0.0.0.0:0}}] accept_batch_start skip because record batch type is batch_type::raft_configuration DEBUG 2024-05-02 14:55:06,231 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1271 - [{{0.0.0.0:0}}] skip_batch_start called for 1398 DEBUG 2024-05-02 14:55:06,231 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1288 - added offset translation gap [1398-1398], current state: {{base offset/delta: {1397}/1221, map size: 1, last delta: 1221}} DEBUG 2024-05-02 14:55:06,232 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1197 - [{{0.0.0.0:0}}] accept_batch_start skip because record batch type is batch_type::archival_metadata DEBUG 2024-05-02 14:55:06,232 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1271 - [{{0.0.0.0:0}}] skip_batch_start called for 1399 DEBUG 2024-05-02 14:55:06,232 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1288 - added offset translation gap [1399-1400], current state: {{base offset/delta: {1397}/1221, map size: 2, last delta: 1222}} DEBUG 2024-05-02 14:55:06,232 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1197 - [{{0.0.0.0:0}}] accept_batch_start skip because record batch type is batch_type::archival_metadata DEBUG 2024-05-02 14:55:06,232 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1271 - [{{0.0.0.0:0}}] skip_batch_start called for 1401 DEBUG 2024-05-02 14:55:06,232 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1288 - added offset translation gap [1401-1401], current state: {{base offset/delta: {1397}/1221, map size: 3, last delta: 1224}} DEBUG 2024-05-02 14:55:06,232 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1197 - [{{0.0.0.0:0}}] accept_batch_start skip because record batch type is batch_type::tx_fence DEBUG 2024-05-02 14:55:06,232 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1271 - [{{0.0.0.0:0}}] skip_batch_start called for 1402 DEBUG 2024-05-02 14:55:06,234 [shard 2:fetc] cloud_storage - [fiber848~132~149~0 kafka/topic/0] - remote_segment.cc:1185 - [{{0.0.0.0:0}}] accept_batch_start stop parser because 1403 > 177(kafka offset) DEBUG 2024-05-02 14:55:06,234 [shard 2:fetc] cloud_storage - [fiber41664 kafka/topic/0] - remote_partition.cc:435 - No results, current rp offset: 1403, current kafka offset: 178, max rp offset: 177 DEBUG 2024-05-02 14:55:06,234 [shard 2:fetc] cloud_storage - [fiber848~132~149 kafka/topic/0] - remote_segment.cc:1476 - [{{0.0.0.0:0}}] remote_segment_batch_reader::stop DEBUG 2024-05-02 14:55:06,234 [shard 2:fetc] cloud_storage - [fiber848~132~149 kafka/topic/0] - remote_segment.cc:1482 - [{{0.0.0.0:0}}] remote_segment_batch_reader::stop - parser-close DEBUG 2024-05-02 14:55:06,234 [shard 2:main] cloud_storage - [fiber41664 kafka/topic/0] - remote_partition.cc:247 - abort requested via config.abort_source DEBUG 2024-05-02 14:55:06,234 [shard 2:fetc] cloud_storage - [fiber41665 defa47c0/kafka/topic/0_18762/1398-1405-846-134-v1.log.137] - segment_chunk_data_source.cc:48 - chunk data source destroyed DEBUG 2024-05-02 14:55:06,235 [shard 2:main] cloud_storage - [fiber848~132~149 kafka/topic/0] - remote_segment.cc:1476 - [{{0.0.0.0:0}}] remote_segment_batch_reader::stop DEBUG 2024-05-02 14:55:06,235 [shard 2:main] cloud_storage - [fiber848~132~149 kafka/topic/0] - remote_segment.cc:1482 - [{{0.0.0.0:0}}] remote_segment_batch_reader::stop - parser-close ERROR 2024-05-02 14:55:06,235 [shard 2:fetc] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-arm64-builders-i-009787d74859cb582-1/redpanda/redpanda/src/v/storage/segment_reader.cc:210) '_parent == nullptr' Must close before destroying ``` Note, the tx_fence is assigned a Kafka offset 178, but isn't reflected by the LSO-1=177 calculation used as an upper bound for the log reader config, hence the early return. This commit fixes the issue by stopping the segment reader before stopping, ensuring only one concurrent caller of stop(). This also adds test simulates a failure where a client disconnect racing with the natural stopping of a segment reader led to a crash. Without the fix, and adding some sleeps in remote_segment_batch_reader::stop(), this would crash fairly consistently, albeit with slightly different backtraces than reported in the issue. Fixes redpanda-data#18213 (cherry picked from commit 226c43e)
nvartolomei
approved these changes
May 3, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport of PR #18229
Fixes: #18240,