
Fix timequery returning wrong offset after trim-prefix which could lead to stuck consumers #18112

Merged 8 commits from nv/trim-prefix-timequery into redpanda-data:dev on May 7, 2024

Conversation


@nvartolomei nvartolomei commented Apr 26, 2024

The fix works only if compression is not in use. We need follow-up work to decompress the batches and find the exact offset to return, or (!) we could prevent the trim offset from landing inside a batch in that case.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fix a scenario where list_offset with a timestamp could return a lower offset than partition start after a trim-prefix command. This could lead to consumers being stuck with an out-of-range-offset exception if they began consuming from an offset below the one which was used in the trim-prefix command.

@nvartolomei (Contributor, Author):

/dt


@nvartolomei nvartolomei force-pushed the nv/trim-prefix-timequery branch from d880351 to 21d876c Compare April 27, 2024 23:46
@github-actions github-actions bot added area/redpanda area/wasm WASM Data Transforms labels Apr 27, 2024
@nvartolomei nvartolomei force-pushed the nv/trim-prefix-timequery branch from ed92b11 to 7121fa8 Compare April 27, 2024 23:52
@nvartolomei nvartolomei requested review from andrwng and dotnwat April 28, 2024 12:49
@nvartolomei nvartolomei marked this pull request as ready for review April 28, 2024 12:50
@nvartolomei nvartolomei requested a review from andijcr April 28, 2024 12:51
@nvartolomei nvartolomei changed the title rptest: test trim-prefix with timequery rptest: trim-prefix with timequery Apr 29, 2024
src/v/storage/log_reader.h (resolved)
src/v/cluster/partition.cc (outdated, resolved)
src/v/storage/lock_manager.cc (outdated, resolved)
src/v/storage/log_reader.cc (outdated, resolved)
@andrwng (Contributor) left a comment:

Great find!

Probably worth updating the PR title to reflect that this is a bug fix

Also should this be backported to 23.2 as well? IIRC DeleteRecords support landed then

src/v/storage/types.cc (outdated, resolved)
src/v/storage/disk_log_impl.cc (resolved)
src/v/kafka/server/replicated_partition.cc (outdated, resolved)
src/v/storage/log_reader.cc (outdated, resolved)
@nvartolomei nvartolomei force-pushed the nv/trim-prefix-timequery branch from 7121fa8 to 383804b Compare April 30, 2024 12:42
@nvartolomei nvartolomei changed the title rptest: trim-prefix with timequery Fix timequery returning -1 after trim-prefix which could lead to stuck consumers Apr 30, 2024
@nvartolomei nvartolomei changed the title Fix timequery returning -1 after trim-prefix which could lead to stuck consumers Fix timequery returning wrong offset after trim-prefix which could lead to stuck consumers Apr 30, 2024
@nvartolomei nvartolomei force-pushed the nv/trim-prefix-timequery branch 2 times, most recently from 518d7fd to fe96ed2 Compare April 30, 2024 15:02
@nvartolomei (Contributor, Author):

  • Rebased on dev to resolve conflicts after a PR introduced by Willem to fix an unrelated timequery bug.
  • Addressed reviewer's comments.
  • Renamed offset_range to bounded_offset_range to avoid misuse. It is less useful and more useful at the same time!

Let's see what CI says.

src/v/storage/disk_log_impl.cc (resolved)
src/v/storage/log_reader.h (outdated, resolved)
src/v/cluster/partition.cc (outdated, resolved)
@nvartolomei nvartolomei force-pushed the nv/trim-prefix-timequery branch 2 times, most recently from 9a01673 to 2369538 Compare April 30, 2024 19:19
@nvartolomei (Contributor, Author):

Force pushed:

  • improve commit message as requested by @andrwng
  • add a fix for an edge case where cloud storage shouldn't be read at all f906e24
  • commented out the trim-prefix with tiered storage test cases, as they run into an (existing) edge case which will be addressed in another PR

@nvartolomei nvartolomei force-pushed the nv/trim-prefix-timequery branch from 2369538 to ed19202 Compare April 30, 2024 20:21
@nvartolomei (Contributor, Author):

Force push:

  • Fix off-by-one error in reader max offset
  • Rename bounded_offset_range to bounded_offset_interval and redesign the API to make it easier to use correctly/harder to misuse

@nvartolomei nvartolomei force-pushed the nv/trim-prefix-timequery branch 2 times, most recently from 548ca53 to 0da4cf9 Compare April 30, 2024 20:40
@nvartolomei (Contributor, Author):

The last two force-pushes fix some typos in text.

Encapsulates common operations on offset intervals. For now, although it
is named bounded, the maximum offset can still be set to
`model::offset::max()`. I will likely change this in the future as it
requires changing quite a few call sites, most likely only tests.

This little data structure tries to be very lightweight and impose
minimal overhead on basic interval operations like intersection or
inclusion. It is also quite hard to use incorrectly thanks to the
checked construction variant and debug assertions.

Later, we might want to refactor things like log_reader to use this
instead of min and max offsets as they do today. Once that is done,
the checked variant needs to be called only once at the kafka layer. For
everyone else it becomes a ~0 cost abstraction.
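The checked/unchecked construction idea described in this commit message can be sketched as a standalone type. This is illustrative only: plain `long` stands in for `model::offset`, the names are simplified, and it is not the actual Redpanda code.

```cpp
#include <cassert>
#include <stdexcept>

// Standalone sketch of the interval type described above. Only the
// operations mentioned in the thread are shown.
class bounded_offset_interval {
public:
    // Unchecked: the caller guarantees 0 <= min <= max. Meant for hot
    // paths where the invariant was already established.
    static bounded_offset_interval unchecked(long min, long max) noexcept {
        return {min, max};
    }

    // Checked: validates the invariant once, e.g. at the Kafka layer.
    static bounded_offset_interval checked(long min, long max) {
        if (min < 0 || max < 0 || min > max) {
            throw std::invalid_argument("expected 0 <= min <= max");
        }
        return {min, max};
    }

    bool contains(long o) const noexcept { return o >= _min && o <= _max; }

    bool overlaps(const bounded_offset_interval& other) const noexcept {
        return _min <= other._max && _max >= other._min;
    }

private:
    bounded_offset_interval(long min, long max) noexcept
      : _min(min)
      , _max(max) {}

    long _min;
    long _max;
};
```

The private constructor forces every caller through one of the two named factories, which is what makes the "cost" of validation explicit at each call site.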
The Kafka ListOffsets request allows querying offsets by timestamp. The
result of such a request is the first Kafka offset in the log that has a
timestamp equal to or greater than the requested timestamp, or -1 if no
such record can be found.

The implementation we have today assumes that the start of the physical
log matches the start of the log as it is seen by external users/Kafka
clients.

However, this is not always true. In particular, when [trim-prefix][1]
(prefix truncation) is used. There are 2 sources of problems:

  1) trim-prefix is synchronously applied at the cluster layer, where it
     changes the visibility of the log from the client point of view,
     but it is asynchronously applied to the consensus log/physical log/
     disk_log_impl class and cloud storage.

  2) trim-prefix is executed with an offset of a record that is in the
     middle of a batch.

As a result, in these scenarios, if a client sends a Kafka Fetch
request with the received offset, it will get an OffsetOutOfRange
error.

This commit changes how such queries are implemented at the lower levels
of the system, by carrying the information about the visible start and
end of the log together with the timestamp. Then, at the lower levels,
we use these to limit our search only to that range.

Although this commit does not change the implementation of the tiered
storage timequery, it does fix the trim-prefix problem there too in the
general case, because of the check and "bump" added in
redpanda-data#11579.

Tiered Storage timequeries have some additional problems which I plan to
address in redpanda-data#18097.

[1]: https://docs.redpanda.com/current/reference/rpk/rpk-topic/rpk-topic-trim-prefix/
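The bounded search described in this commit message can be sketched with a hypothetical, simplified model. The record layout, function names, and the linear scan are illustrative only, not Redpanda's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of a bounded timequery: the search is clamped to
// the client-visible [min_offset, max_offset] interval, so it can never
// return an offset below the trim-prefix point.
struct record {
    int64_t offset;
    int64_t timestamp;
};

std::optional<int64_t> bounded_timequery(
  const std::vector<record>& log,
  int64_t ts,
  int64_t min_offset,   // visible start of the log (after trim-prefix)
  int64_t max_offset) { // inclusive visible end of the log
    for (const auto& r : log) {
        if (r.offset < min_offset || r.offset > max_offset) {
            continue; // outside the client-visible interval
        }
        if (r.timestamp >= ts) {
            return r.offset; // first visible record at or after ts
        }
    }
    return std::nullopt; // the Kafka layer would answer -1 here
}

// Hypothetical sample log: offsets 0..5 with timestamp = offset * 10.
std::vector<record> sample_log() {
    return {{0, 0}, {1, 10}, {2, 20}, {3, 30}, {4, 40}, {5, 50}};
}
```

With the sample log above, an unbounded query for timestamp 10 would answer offset 1; after a trim-prefix at offset 3, the bounded query answers offset 3, which a subsequent Fetch can actually read.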
The previous code contained a bug which is masked by the retry logic in
replicated_partition:

    const auto kafka_start_override = _partition->kafka_start_offset_override();
    if (
      !kafka_start_override.has_value()
      || kafka_start_override.value() <= res.value().offset) {
        // The start override doesn't affect the result of the timequery.
        co_return res;
    }
    vlog(
      klog.debug,
      "{} timequery result {} clamped by start override, fetching result at "
      "start {}",
      ntp(),
      res->offset,
      kafka_start_override.value());

    ...
It is not easy to test that this is right, so I am not going to for now.
@nvartolomei nvartolomei force-pushed the nv/trim-prefix-timequery branch from 0da4cf9 to 8f2de96 Compare April 30, 2024 20:41
@andijcr andijcr self-requested a review May 1, 2024 07:37
@andrwng (Contributor) left a comment:

Some unit testing would be nice for some of the lower level functions, but otherwise this pretty much LGTM

src/v/model/offset_interval.h Show resolved Hide resolved
Comment on lines +26 to +42

    static bounded_offset_interval
    unchecked(model::offset min, model::offset max) noexcept {
        return {min, max};
    }

    static bounded_offset_interval
    checked(model::offset min, model::offset max) {
        if (min < model::offset(0) || max < model::offset(0) || min > max) {
            throw std::invalid_argument(fmt::format(
              "Invalid arguments for constructing a non-empty bounded offset "
              "interval: min({}) <= max({})",
              min,
              max));
        }

        return {min, max};
    }
Reviewer (Contributor):
nit: maybe consider create() and create_checked()? Or create_unchecked() and create() if the idea is to make checked be the default and only expose unchecked for the noexcept? "checked" doesn't quite convey that this is a constructor IMO

Reply from @nvartolomei (author):
I'll continue iterating on this if we are to adopt it in more places. I intentionally named them unchecked/checked so that the name communicates to the caller that there is a "cost" involved. ::create/constructor would make it too easy to choose the "expensive" variant.

I'm still not sure whether the checked/unchecked constructor variants are enough, or whether they should be two separate structs that "propagate" the checked/unchecked property as a data structure "attribute", i.e. all operations on that struct would be checked.

Deferring this for now hoping that it will become clear when additional use-cases appear.

Comment on lines +510 to +511

      && cfg.min_offset < kafka::offset_cast(
        _cloud_storage_partition->next_kafka_offset());
Reviewer (Contributor):
I'm wondering if there's a good place to unit test this bug, given it was masked by higher level code

Reply from @nvartolomei (author):
I will think about it. I'm proposing to remove the masking code in the next PR.

Comment on lines +462 to +463
# @parametrize(cloud_storage=True, spillover=False)
# @parametrize(cloud_storage=True, spillover=True)
Reviewer (Contributor):
nit: remove?

Reply from @nvartolomei (author):
I will re-enable this in the next PR #18097

@@ -132,7 +132,7 @@ static ss::future<list_offset_partition_response> list_offsets_partition(
     auto res = co_await kafka_partition->timequery(storage::timequery_config{
       kafka_partition->start_offset(),
       timestamp,
-      offset,
+      model::prev_offset(offset),
Reviewer (Contributor):
Yikes, nice catch!

Reviewer (Contributor):
I'm wondering what the implications of this are. Presumably if the offset doesn't exist, the timequery will just return early, and if it does exist, it won't affect the result of a timequery. Is that right?

Maybe a good idea to add any implications (or lack thereof) to the commit message.

Reply from @nvartolomei (author):
> and if it does exist, it won't affect the result of a timequery. Is that right?

You mean prior to this commit? It would have returned a potentially non-committed offset, even with acks=all. Normally, Redpanda doesn't expose such offsets unless write caching is in use. Fetch would have failed with out-of-range too, causing the consumer to reset.
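To illustrate the off-by-one under discussion: the offset handed over by the Kafka layer names the *next* offset to be assigned (an exclusive bound), while the timequery config expects an *inclusive* max offset, hence `model::prev_offset`. A minimal sketch, assuming the helper clamps at -1 for offset 0 (the exact semantics of the real `model::prev_offset` are not shown in this thread):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for model::prev_offset; clamping at -1 for
// offset 0 is an assumption for this sketch.
int64_t prev_offset(int64_t o) { return o > 0 ? o - 1 : -1; }

// The log-end offset is an exclusive upper bound (next offset to be
// written). Without prev_offset, a timequery could match the
// not-yet-readable offset at the log end.
int64_t timequery_max_offset(int64_t log_end_exclusive) {
    return prev_offset(log_end_exclusive);
}
```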

@nvartolomei (Contributor, Author):

Merging this to unblock #18097. Will address comments as follow ups.

@nvartolomei nvartolomei merged commit 6c630bd into redpanda-data:dev May 7, 2024
18 checks passed
@vbotbuildovich (Collaborator):

/backport v24.1.x

@vbotbuildovich (Collaborator):

/backport v23.3.x

@vbotbuildovich (Collaborator):

Failed to create a backport PR to v23.3.x branch. I tried:

    git remote add upstream https://github.com/redpanda-data/redpanda.git
    git fetch --all
    git checkout -b backport-pr-18112-v23.3.x-984 remotes/upstream/v23.3.x
    git cherry-pick -x 99d2bec5f7ee765cb3de88b446278262f6dae84f d97d61fb8a9b06fbde9dcf7ff03799bf200b561b 4f87afa392201c493f24f21af3e1cd7f0727649f f13bfa6c490490487d9a926c9a5d4e441adc3ca6 76a1ea2452b09a5730f2574646fc06ab2b8b8e32 f9ed5cabe479b355d370bec6bc9b693ad2928f3c a40999d2a09e0586c3fa81521c4d5fb5d0abc9dc 8f2de964c0e915f4f10ae8eb74400e6288c5680f

Workflow run logs.

Comment on lines +326 to +328
* //-------+-------------+------------+-------------+-------//
* \\...9 | 10...14 | 15..15 | 16.....22 | 23...\\
* //-------+-------------+------------+-------------+-------//
Reviewer (Member):
this is awesome

Comment on lines +44 to +48

    inline bool overlaps(const bounded_offset_interval& other) const noexcept {
        return _min <= other._max && _max >= other._min;
    }

    inline bool contains(model::offset o) const noexcept {
Reviewer (Member):
inline is unnecessary here. it's implied when the definition is defined in the class like this

Reply from @nvartolomei (author):
TIL

    bounded_offset_interval(model::offset min, model::offset max) noexcept
      : _min(min)
      , _max(max) {
    #ifndef NDEBUG
Reviewer (Member):
is it impactful enough to performance to disable the check?

Reply from @nvartolomei (author), Jul 29, 2024:

My intention is to use this in hot-loops and I made the assumption that it will be impactful enough.

E.g. I want checks like the one in

https://github.com/redpanda-data/redpanda/pull/18112/files#diff-adc3b4388ce88f3bd5b5de62a42ebc298737e9da94fa94ea9cde18342e246435R48

or

if (record_t >= t && query_interval.contains(record_o)) {

to be compiled to nothing more than a single if statement that one would write manually.

The bounded_offset_interval::checked "static constructor" always checks the invariants. The ones in the constructor are redundant in that case; they are relevant for the ::unchecked construction, however, which is meant to be used in hot loops:

    checked(model::offset min, model::offset max) {
        if (min < model::offset(0) || max < model::offset(0) || min > max) {
            throw std::invalid_argument(fmt::format(
              "Invalid arguments for constructing a non-empty bounded offset "
              "interval: min({}) <= max({})",
              min,
              max));
        }

        return {min, max};
    }
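The debug-only invariant check being discussed can be sketched in isolation as follows. This is illustrative, not the exact Redpanda code: in release builds (NDEBUG defined) the constructor body compiles to nothing, so the unchecked hot-loop construction pays no cost, while debug builds still verify the invariant.

```cpp
#include <cassert>

// Isolated sketch of a debug-only constructor check.
struct interval {
    long min;
    long max;

    interval(long mn, long mx) noexcept
      : min(mn)
      , max(mx) {
#ifndef NDEBUG
        assert(mn >= 0 && mn <= mx); // checked only in debug builds
#endif
    }

    bool contains(long o) const noexcept { return o >= min && o <= max; }
};
```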

@@ -132,7 +132,7 @@ static ss::future<list_offset_partition_response> list_offsets_partition(
     auto res = co_await kafka_partition->timequery(storage::timequery_config{
       kafka_partition->start_offset(),
       timestamp,
-      offset,
+      model::prev_offset(offset),
Reviewer (Member):
> kafka: fix off-by-one in timequery
>
> Not easy to test that this is right so not going to for now.

@nvartolomei please include in commit message more detail. the commit by inspection implies an off-by-one error, but there are critically useful pieces of information a reader in the future may want. for example, how was it discovered, what are symptoms of it being wrong, etc...
