Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE-1912] mvlog: add basic segment reader #18245

Merged
merged 5 commits into from
May 11, 2024

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented May 3, 2024

Introduces a few abstractions to be used as a part of the upcoming
new log_reader implementation. These abstractions are:

  • entry_stream: A utility class to parse input_streams as a contiguous run of
    entries.
  • readable_segment: A long-lived representation of a segment file, used to
    construct segment_readers.
  • segment_reader: A short-lived accessor to the file used to construct
    input_streams. This doesn't enforce any lifecycle relationships
    between segment_readers and input_streams, but it is expected that the
    readers will outlive the streams.

These abstractions are pretty bare bones compared to the existing
storage::segment, storage::segment_reader, and reader handle. There is
some basic accounting done for the readers, but the complexity of a
ref-counted handle doesn't seem critical for now.

Some tests are added to exercise appends and reading record batch entries.

Fixes #17255

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@andrwng andrwng changed the title mvlost: add basic segment reader mvlog: add basic segment reader May 3, 2024
@andrwng andrwng changed the title mvlog: add basic segment reader [CORE-1912] mvlog: add basic segment reader May 3, 2024
WillemKauf
WillemKauf previously approved these changes May 9, 2024
Copy link
Contributor

@WillemKauf WillemKauf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this LGTM, still groking the details from RFC, but I am liking the simplified interfaces.

jcipar
jcipar previously approved these changes May 9, 2024
Copy link
Contributor

@jcipar jcipar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. Just two minor things.

// Parses and collects the record batches from the given entry stream.
// Ignores other kinds of entries.
ss::future<result<collect_stream_outcome, errc>>
collect_batches_from_stream(entry_stream* entries, batch_collector* collector);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed a few places where this uses raw pointers instead of references or managed pointers. Is there a reason for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally prefer raw pointers for in-out and out parameters, a la Google style guide.

As for member pointer variables, unless there's some complicated lifetime relationship between classes, I prefer having a single owner being denoted by a unique_ptr and passing around raw pointers. I'm hoping for the lifetimes of these obejcts to be straightforward enough that it's easily verifiable (e.g. readable_segment is long lived, segment_readers must not outlive the readable_segment, etc), but if that ends up not being the case, I wouldn't be against some kind of smart pointer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrwng I see google style guide recommending references https://google.github.io/styleguide/cppguide.html#Inputs_and_Outputs

non-optional output and input/output parameters should usually be references (which cannot be null)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I'll be! Let me fix

nvartolomei
nvartolomei previously approved these changes May 10, 2024
auto header_buf = co_await read_iobuf_exactly(
stream_, packed_entry_header_size);
if (header_buf.size_bytes() != packed_entry_header_size) {
co_return std::nullopt;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this wouldn't be an error? It is an unexpected end of stream. Unless size_bytes() == 0, which might be an acceptable end of stream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this could be an error. Will fix

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wouldn't we hard stop / crash the system or not check at all? what i mean is that the contract of read_iobuf_exactly is to guarantee this or throw, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i'm reading correctly, read_iobuf_exactly doesn't have this guarantee. If the stream is too short, it has no choice but to return a shorter buffer.

Maybe adding to the confusion, input_stream::read_up_to and input_stream::read_exactly both exist, but they actually mean "read some amount as long as it's under n" and "read as much as you can up to n". The contract of read_iobuf_exactly appears identical to the latter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, i wasn't remembering correctly.

src/v/storage/mvlog/entry_stream.cc Show resolved Hide resolved
ss::future<result<std::optional<entry>, errc>> next();

private:
ss::input_stream<char> stream_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this missing a close() method? It is not clear if it is safe to destroy an ss::input_stream without calling close first.

/// Detaches the \c input_stream from the underlying data source.
///
/// Waits for any background operations (for example, read-ahead) to
/// complete, so that the any resources the stream is using can be
/// safely destroyed.  An example is a \ref file resource used by
/// the stream returned by make_file_input_stream().
///
/// \return a future that becomes ready when this stream no longer
///         needs the data source.
future<> close() noexcept {
    return _fd.close();
}

https://github.com/scylladb/seastar/blob/c799845803ea4e5672a19970be1de373e2eb6fa1/include/seastar/core/iostream.hh#L329-L340

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very possible. I'll add a close in subsequent PRs when this gets plugged into more stuff.

// Parses and collects the record batches from the given entry stream.
// Ignores other kinds of entries.
ss::future<result<collect_stream_outcome, errc>>
collect_batches_from_stream(entry_stream* entries, batch_collector* collector);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrwng I see google style guide recommending references https://google.github.io/styleguide/cppguide.html#Inputs_and_Outputs

non-optional output and input/output parameters should usually be references (which cannot be null)

src/v/storage/mvlog/batch_collecting_stream_utils.cc Outdated Show resolved Hide resolved
nvartolomei

This comment was marked as outdated.

Adds an wrapper around an input_stream<char> to pull out entries. The
expectation here is that the stream will contain a dense collection of
entries.

This, along with the batch_collector, will be used to build an
equivalent to the log_reader.
@andrwng
Copy link
Contributor Author

andrwng commented May 10, 2024

@nvartolomei hope you don't mind I'd like to merge this and address feedback in a followup change

andrwng added 4 commits May 10, 2024 10:08
Adds a couple utility methods for parsing record batch entries from an
entry_stream.
Introduces a couple of abstractions to be used as a part of the upcoming
new log_reader implementation. These abstractions are:
- readable_segment: A long-lived representation of a segment file, used
  to construct segment_readers.
- segment_reader: A short-lived accessor to the file used to construct
  input_streams. This doesn't enforce any lifecycle relationships
  between segment_readers and input_streams, but it is expected that the
  readers will outlive the streams.

These abstractions are pretty bare bones compared to the existing
storage::segment, storage::segment_reader, and reader handle. There is
some basic accounting done for the readers, but the complexity of a
ref-counted handle doesn't seem critical for now.
Adds a test that uses the segment appender and segment readers
end-to-end, exercising various batch collecting utilities along the way.
This serves as an example of how a future log reader may use these
methods to read.
A short read can be indicative of corruption and shouldn't be ignored.
Adds a specific error code for this.
@andrwng andrwng dismissed stale reviews from jcipar and WillemKauf via 1ee2e4f May 10, 2024 17:09
@andrwng andrwng force-pushed the mvlog-segment-reader branch from a03ac97 to 1ee2e4f Compare May 10, 2024 17:09
@andrwng
Copy link
Contributor Author

andrwng commented May 10, 2024

@nvartolomei hope you don't mind I'd like to merge this and address feedback in a followup change

Nevermind, the changes ended up being not very intrusive.

Latest force push:

  • added a commit to introduce a short read error code
  • changed signatures for batch_collecting_stream_utils to use references
  • tweaked comments in batch_collecting_stream_utils

// Parses and collects the record batches from the given entry stream.
// Ignores other kinds of entries.
ss::future<result<collect_stream_outcome, errc>>
collect_batches_from_stream(entry_stream& entries, batch_collector& collector);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we pass raw pointers to avoid clang-tidy warnings on reference parameters to coroutines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that reminds me this is actually partially the reason I had pointers in the first place. I opted to change them to abide by GSG, as called out by others, though maybe it's worth an exception for coroutines... 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dotnwat the benefit would be that the need for explicit & will make you think twice about the lifetimes? Or something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it can also prevent cases where a temporary would be otherwise destroyed?

i.e. co_await collect_batches_from_stream(entry_stream{input}, c)? does entry_stream here gets destroyed on suspension?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the clang-tidy warning is a good callout. Allowing passing a temporary as a reference is a crash waiting to happen.

Copy link
Contributor

@nvartolomei nvartolomei May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, pointer is cleaner for coroutine. i mean we could also disable the clang-tidy warning.

@vbotbuildovich
Copy link
Collaborator

@andrwng
Copy link
Contributor Author

andrwng commented May 11, 2024

CI failure: #18329

@piyushredpanda piyushredpanda merged commit 72d7a14 into redpanda-data:dev May 11, 2024
14 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segment entry reader as a building block for record_batch_reader
7 participants