[CORE-7058]: storage: fix race condition in segment::release_appender_in_background() #24483
Non-flaky failures in https://buildkite.com/redpanda/redpanda/builds/59427#01939f0d-e5b3-4448-b9d1-51dae0694acb:
Retry command for Build#59427: please wait until all jobs are finished before running the slash command.
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/59427#01939fe0-ce7f-4dc5-b353-d352e6575a7a
Force-pushed from 1999901 to 1cb2bcc.
Can you add to the description which fibers are performing which actions? For example:
F1: `_gate.close` in `segment::close`
F2: `auto a = std::exchange` in `segment::release_appender_in_background()`
F1: ...
One lingering question: if the interleaving exists, is it really a race, or should it not be possible at all? For instance, if the segment is being closed, where does the concurrency originate that allows other fibers to operate on it?
The race seems plausible, but there doesn't seem to be enough information in the cover letter to retrace the path.
For example, assume that a fiber reached `release_appender_in_background(readers_cache);` via `disk_log_impl::apply_segment_ms`, which invokes `last->release_appender`. This fiber would be holding `_compaction_housekeeping_gate`.
So where did `segment::close()` originate from? One option would be `disk_log_impl::close`. But the first thing it does is close the `_compaction_housekeeping_gate`.
But I don't think this interleaving is possible. Either closing the gate blocks because the release is in progress, or the releasing fiber isn't allowed to enter the gate. Does that sound right?
In order to fully evaluate the fix, it's useful to be able to understand how the race occurred.
Can you provide that information in the cover letter?
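The argument above rests on how a gate behaves: a fiber that entered before `close()` keeps the close pending until it leaves, and a fiber that tries to enter afterwards is refused. A minimal single-threaded sketch of those two cases follows. `toy_gate` and both helper functions are hypothetical names for illustration, not Seastar's `gate` (whose `close()` returns a future rather than exposing a flag).

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a Seastar-style gate; not the real API.
class toy_gate {
public:
    // A fiber enters before starting work; refused once close() was requested.
    void enter() {
        if (_close_requested) {
            throw std::runtime_error("gate closed");
        }
        ++_holders;
    }
    // A fiber leaves when its work is done.
    void leave() {
        assert(_holders > 0);
        --_holders;
    }
    // Request the close. The toy only records the request.
    void close() { _close_requested = true; }
    bool is_closed() const { return _close_requested; }
    // The close only completes once every holder has left.
    bool close_completed() const { return _close_requested && _holders == 0; }

private:
    int _holders = 0;
    bool _close_requested = false;
};

// Case 1: the releasing fiber is already inside, so the close stays pending.
inline bool close_waits_for_holder() {
    toy_gate g;
    g.enter();                          // releasing fiber holds the gate
    g.close();                          // closing fiber requests the close
    bool pending = !g.close_completed();
    g.leave();                          // release finishes
    return pending && g.close_completed();
}

// Case 2: the gate is closed first, so the releasing fiber cannot enter.
inline bool late_entry_is_refused() {
    toy_gate g;
    g.close();
    try {
        g.enter();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

Both helpers return `true` in this toy model. Whether the real code path holds the same gate on both sides is exactly the open question in this thread.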
From the logs we have here, it originates from
Will add more detail to the cover letter after I finish sketching the sequence of events out further.
^ I think the above is close to the situation we are seeing here.
- auto next_offset = offsets().dirty_offset + model::offset(1);
+ auto next_offset = model::next_offset(offsets().dirty_offset);
how is this related to the race?
Unrelated to the race; it is related to the test case I wrote, in which I call `force_roll()` on a log that has nothing produced to it. There, `offsets.dirty_offset() == -9223372036854775808`, and the offset value `-9223372036854775808 + 1` is going to trigger an assert in `disk_log_impl::new_segment()`.
If, instead, we use `model::next_offset(-9223372036854775808) == 0`, we are okay 👍
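To make the arithmetic concrete, here is a small sketch using a plain `int64_t` in place of `model::offset`. `offset_t`, `naive_next`, and `next_offset_sketch` are hypothetical names. The only behaviour confirmed in this thread is that `model::next_offset` maps the minimum value to `0`; clamping every negative offset and saturating at the maximum are assumptions of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for model::offset: a plain 64-bit signed integer.
using offset_t = std::int64_t;

// Default value of a dirty offset that was never updated, as quoted above.
constexpr offset_t uninitialized_offset = std::numeric_limits<offset_t>::min();

// What `dirty_offset + model::offset(1)` amounts to.
constexpr offset_t naive_next(offset_t o) { return o + 1; }

// Assumed behaviour of model::next_offset (only the min -> 0 case is confirmed).
constexpr offset_t next_offset_sketch(offset_t o) {
    if (o < 0) {
        return 0; // assumption: any negative offset starts the log at 0
    }
    if (o == std::numeric_limits<offset_t>::max()) {
        return o; // assumption: saturate instead of overflowing
    }
    return o + 1;
}

// min + 1 is still negative, which is what trips the `offset >= 0` style assert.
static_assert(naive_next(uninitialized_offset) < 0);
// The clamping variant yields a valid first offset.
static_assert(next_offset_sketch(uninitialized_offset) == 0);
// Ordinary offsets are simply incremented.
static_assert(next_offset_sketch(41) == 42);
```

The `static_assert` lines need C++17 or later because they omit the message argument.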
ahh, got it. i forgot that force_roll is test-only code. yolo!
…ound()`

This function caused race conditions between `segment::close()` and a segment roll.

Consider the following sequence of events:
1. `_gate.close()` called in `segment::close()`
2. `auto a = std::exchange(_appender, nullptr)` called in `segment::release_appender_in_background()`
3. `ssx::spawn_with_gate()` called in `segment::release_appender_in_background()`
4. `return ignore_shutdown_exceptions()` in `ssx::spawn_with_gate()`
5. rest of `release_appender_in_background()` is ignored
6. `a` goes out of scope in `release_appender_in_background()` without `close()`ing the `appender`
7. one sad panda

Add an explicit check to `_gate.check()` in `release_appender_in_background()` to throw an exception in case the gate is closed, and defer the closing of the appender to `segment::close()` in order to avoid the potential race condition here.
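The sequence in this commit message can be replayed in a toy, single-threaded model. Everything below is hypothetical: `toy_appender`, `toy_segment`, and the `begin_close()` / `finish_close()` split (standing in for `segment::close()` suspending after closing its gate) are illustration only, not the Redpanda or Seastar types. An appender destroyed without `close()` is counted as a leak.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical appender: records when it is destroyed without close().
struct toy_appender {
    explicit toy_appender(int& leaks) : _leaks(leaks) {}
    ~toy_appender() {
        if (!_closed) {
            ++_leaks; // dropped without close(): the bug being modelled
        }
    }
    void close() { _closed = true; }

private:
    int& _leaks;
    bool _closed = false;
};

// Hypothetical segment with a gate flag and an owned appender.
struct toy_segment {
    explicit toy_segment(int& leaks)
      : _appender(std::make_unique<toy_appender>(leaks)) {}

    // Step 1: close() closes the gate, then yields before touching the appender.
    void begin_close() { _gate_closed = true; }
    // Later: close() resumes and closes the appender, if it still owns one.
    void finish_close() {
        if (_appender) {
            _appender->close();
            _appender.reset();
        }
    }

    // Steps 2-6: take the appender first, then find out the gate is closed.
    void release_buggy() {
        auto a = std::exchange(_appender, nullptr);
        if (_gate_closed) {
            return; // background work is skipped; `a` is destroyed unclosed
        }
        if (a) {
            a->close();
        }
    }

    // Fix as described: check the gate before taking the appender.
    void release_checked() {
        if (_gate_closed) {
            throw std::runtime_error("gate closed");
        }
        auto a = std::exchange(_appender, nullptr);
        if (a) {
            a->close();
        }
    }

private:
    bool _gate_closed = false;
    std::unique_ptr<toy_appender> _appender;
};

// Replays the interleaving with the buggy release.
inline int leaks_with_buggy_release() {
    int leaks = 0;
    {
        toy_segment s(leaks);
        s.begin_close();
        s.release_buggy();
        s.finish_close(); // nothing left to close
    }
    return leaks;
}

// Replays the same interleaving with the gate check in place.
inline int leaks_with_checked_release() {
    int leaks = 0;
    {
        toy_segment s(leaks);
        s.begin_close();
        try {
            s.release_checked();
        } catch (const std::runtime_error&) {
            // release refused; the appender stays with the segment
        }
        s.finish_close(); // close() still owns the appender and closes it
    }
    return leaks;
}
```

In this model `leaks_with_buggy_release()` returns `1` and `leaks_with_checked_release()` returns `0`: with the check in front of the `std::exchange`, the appender stays with the segment, so `close()` can still close it.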
If the dirty offset is default initialized to `int64_t::min()` and not updated before a segment is force rolled, an assert will fail in `disk_log_impl::new_segment()` for the offset being < 0. Use `model::next_offset()` instead of simply adding `1` to the dirty offset to avoid this case.
To test race conditions between `segment::close()` and a segment roll, particularly one which goes through `release_appender_in_background()`.
Force-pushed from 1cb2bcc to 24ec834.
/ci-repeat 5
/ci-repeat 5
/backport v24.3.x
/backport v24.2.x
This function presented a race between `segment::close()` and a segment roll.
Consider the following sequence of events:
1. `_gate.close()` called in `segment::close()`
2. `auto a = std::exchange(_appender, nullptr)` called in `segment::release_appender_in_background()`
3. `ssx::spawn_with_gate()` called in `segment::release_appender_in_background()`
4. `return ignore_shutdown_exceptions()` in `ssx::spawn_with_gate()`
5. rest of `release_appender_in_background()` is ignored
6. `a` goes out of scope in `release_appender_in_background()` without `close()`ing the `appender`
Add an explicit `_gate.is_closed()` check in `release_appender_in_background()` to defer the closing of the appender to `segment::close()` and avoid the potential race condition here.
Backports Required
Release Notes
Bug Fixes
vassert
.