virtual_file: take a `Slice` in the read APIs, eliminate `read_exact_at_n`, fix UB for engine `std-fs` #8186

problame · 2024-06-27T15:04:31Z

part of #7418

I reviewed what the VirtualFile API's read methods look like and came to the conclusion that we've been using IoBufMut / BoundedBufMut / Slice wrong.

This patch rectifies the situation.

Change 1: take `tokio_epoll_uring::Slice` in the read APIs

Before, we took an IoBufMut, which is too low-level a primitive. While it seems convenient to be able to pass in a Vec<u8> without any fuss, it's actually very unclear at the call site that we're going to fill that Vec up to its capacity(), because that's what IoBuf::bytes_total() returns and that's what VirtualFile::read_exact_at fills.

By passing a Slice instead, a caller that "just wants to read into a Vec" is forced to be explicit about it, adding either slice_full() or slice(x..y), and these methods panic if the read is outside of the bounds of the Vec::capacity().

Last, passing slices is more similar to what the std::io APIs look like.
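To make the call-site difference concrete, here is a minimal std-only sketch. `fill_to_capacity` and `fill_range` are hypothetical stand-ins that I made up for illustration; they are not the VirtualFile or tokio_epoll_uring API and only mimic "callee fills up to capacity()" versus "caller names the range".

```rust
use std::ops::Range;

/// Old style (hypothetical): the callee decides how much to write, namely
/// everything up to `capacity()`, regardless of the Vec's current `len()`.
fn fill_to_capacity(mut buf: Vec<u8>, byte: u8) -> Vec<u8> {
    let cap = buf.capacity();
    buf.clear();
    buf.resize(cap, byte); // does not reallocate, because cap == capacity()
    buf
}

/// New style (hypothetical): the caller names the range explicitly, and a
/// range beyond `capacity()` panics instead of silently growing the buffer.
fn fill_range(mut buf: Vec<u8>, range: Range<usize>, byte: u8) -> Vec<u8> {
    assert!(range.start <= range.end, "inverted range");
    assert!(range.end <= buf.capacity(), "range exceeds capacity");
    if buf.len() < range.end {
        buf.resize(range.end, 0); // zero-initialize up to the end of the range
    }
    buf[range].fill(byte);
    buf
}

fn main() {
    // A freshly allocated Vec has len() == 0 but a non-zero capacity().
    let buf: Vec<u8> = Vec::with_capacity(64);
    println!("len={} capacity={}", buf.len(), buf.capacity());

    // Nothing at this call site says "fill all of the capacity".
    let filled = fill_to_capacity(buf, 0xAB);
    println!("old style: len after fill = {}", filled.len());

    // Here the amount is visible where the call is made.
    let buf: Vec<u8> = Vec::with_capacity(64);
    let filled = fill_range(buf, 0..16, 0xAB);
    println!("new style: len after fill = {}", filled.len());
}
```

In the real API the explicit form would be `slice_full()` or `slice(x..y)` on the buffer, as described above.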

Change 2: fix UB in `virtual_file_io_engine=std-fs`

While reviewing call sites, I noticed that the io_engine::IoEngine::read_at method for StdFs mode has been constructing an &mut[u8] from raw parts that were uninitialized.

We then used std::fs::File::read_exact to initialize that memory, but, IIUC, we must not even construct an &mut[u8] where some of the memory isn't initialized.

So, stop doing that and add a helper ext trait on Slice to do the zero-initialization.
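As an illustration of the zero-initialize-first idea, here is a small std-only sketch. It does not reproduce the ext trait added in this PR; `read_exact_zeroed` is a hypothetical helper, and a `Cursor` stands in for the file.

```rust
use std::io::{Cursor, Read};

/// Hypothetical helper: read exactly `n` bytes from `src` into `buf`.
///
/// The problematic pattern would be to build the destination with something
/// like `std::slice::from_raw_parts_mut(buf.as_mut_ptr(), n)` while those
/// bytes are still uninitialized. Here the bytes are zeroed first, so the
/// `&mut [u8]` handed to `read_exact` only ever covers initialized memory.
fn read_exact_zeroed<R: Read>(
    src: &mut R,
    mut buf: Vec<u8>,
    n: usize,
) -> std::io::Result<Vec<u8>> {
    assert!(n <= buf.capacity(), "read exceeds capacity");
    buf.clear();
    // Writes `n` zero bytes; no reallocation because n <= capacity().
    buf.resize(n, 0);
    src.read_exact(&mut buf[..n])?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    let mut src = Cursor::new(b"hello world".to_vec());
    let buf = read_exact_zeroed(&mut src, Vec::with_capacity(16), 5)?;
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```

The zeroing costs one extra pass over the destination bytes; the sketch accepts that cost in exchange for never exposing uninitialized memory through a reference.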

Change 3: eliminate `read_exact_at_n`

The read_exact_at_n method doesn't make sense because the caller can just

slice = buf.slice() the exact memory it wants to fill
slice = read_exact_at(slice)
buf = slice.into_inner()

Again, the std::io APIs specify the length of the read via the Rust slice length.
We should do the same for the owned buffers IO APIs, i.e., via Slice::bytes_total().
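The three steps above can be sketched as follows. This is only a model of the idea: `OwnedSlice` and this `read_exact_at` are hypothetical, loosely inspired by `tokio_epoll_uring::Slice` and not its real API, and the "file" is an in-memory byte slice.

```rust
use std::io;
use std::ops::Range;

/// Hypothetical owned slice: a Vec plus the byte range an IO may fill.
struct OwnedSlice {
    buf: Vec<u8>,
    range: Range<usize>,
}

impl OwnedSlice {
    /// Panics if the range does not fit into the Vec's capacity.
    fn new(mut buf: Vec<u8>, range: Range<usize>) -> Self {
        assert!(range.start <= range.end && range.end <= buf.capacity());
        if buf.len() < range.end {
            buf.resize(range.end, 0); // keep the selected bytes initialized
        }
        OwnedSlice { buf, range }
    }

    /// The length of the read is the length of the slice.
    fn bytes_total(&self) -> usize {
        self.range.end - self.range.start
    }

    fn into_inner(self) -> Vec<u8> {
        self.buf
    }
}

/// Hypothetical read: fills exactly `slice.bytes_total()` bytes from `file`
/// starting at `offset`, and hands the slice back to the caller.
fn read_exact_at(file: &[u8], mut slice: OwnedSlice, offset: usize) -> io::Result<OwnedSlice> {
    let n = slice.bytes_total();
    let src = offset
        .checked_add(n)
        .and_then(|end| file.get(offset..end))
        .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "short read"))?;
    let range = slice.range.clone();
    slice.buf[range].copy_from_slice(src);
    Ok(slice)
}

fn main() -> io::Result<()> {
    let file = b"0123456789";
    let buf = Vec::with_capacity(8);
    let slice = OwnedSlice::new(buf, 0..4); // 1. select the exact memory to fill
    let slice = read_exact_at(file, slice, 3)?; // 2. read; no separate length argument
    let buf = slice.into_inner(); // 3. take the buffer back
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```

With this shape there is nothing left for a separate `_n` variant to express, since the byte count is already carried by the slice.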

Change 4: simplify filling of `PageWriteGuard`

The PageWriteGuardBuf::init_up_to was never necessary.
Remove it. See changes to doc comment for more details.

Reviewers should probably look at the added test case first; it illustrates my case a bit.

github-actions · 2024-06-27T15:15:12Z

2940 tests run: 2823 passed, 0 failed, 117 skipped (full report)

Code coverage* (full report)

functions: 32.7% (6912 of 21137 functions)
lines: 50.1% (54174 of 108142 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 9fa4f47 at 2024-06-27T18:57:04.856Z

pageserver/src/virtual_file/io_engine.rs

VladLazar

+1 for the Slice based api

pageserver/src/virtual_file.rs

part of #7418

# Motivation

(reproducing #7418)

When we do an `InMemoryLayer::write_to_disk`, there is a tremendous amount of random read I/O, as deltas from the ephemeral file (written in LSN order) are written out to the delta layer in key order.

In benchmarks (#7409) we can see that this delta layer writing phase is substantially more expensive than the initial ingest of data, and that within the delta layer write a significant amount of the CPU time is spent traversing the page cache.

# High-Level Changes

Add a new mode for L0 flush that works as follows:

* Read the full ephemeral file into memory -- layers are much smaller than total memory, so this is affordable.
* Do all the random reads directly from this in-memory buffer instead of using blob IO/page cache/disk reads.
* Add a semaphore to limit how many timelines may concurrently do this (limit peak memory).
* Make the semaphore configurable via PS config.

# Implementation Details

The new `BlobReaderRef::Slice` is a temporary hack until we can ditch `blob_io` for `InMemoryLayer` => Plan for this is laid out in #8183

# Correctness

The correctness of this change is quite obvious to me: we do what we did before (`blob_io`) but read from memory instead of going to disk.

The highest bug potential is in doing owned-buffers IO. I refactored the API a bit in preliminary PR #8186 to make it less error-prone, but still, careful review is requested.

# Performance

I manually measured single-client ingest performance from `pgbench -i ...`.

Full report: https://neondatabase.notion.site/2024-06-28-benchmarking-l0-flush-performance-e98cff3807f94cb38f2054d8c818fe84?pvs=4

tl;dr:

* no speed improvements during ingest, but
* significantly lower pressure on PS PageCache (eviction rate drops to 1/3)
  * (that's why I'm working on this)
* noticeable but modestly lower CPU time

This is good enough for merging this PR because the changes require opt-in.

We'll do more testing in staging & pre-prod.

# Stability / Monitoring

**memory consumption**: there's no _hard_ limit on max `InMemoryLayer` size (aka "checkpoint distance"), hence there's no hard limit on the memory allocation we do for flushing. In practice, we a) [log a warning](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L5741-L5743) when we flush oversized layers, so we'd know which tenant is to blame and b) if we were to put a hard limit in place, we would have to decide what to do if there is an InMemoryLayer that exceeds the limit. It seems like a better option to guarantee a max size for frozen layer, dependent on `checkpoint_distance`. Then limit concurrency based on that.

**metrics**: we do have the [flush_time_histo](https://github.com/neondatabase/neon/blob/23827c6b0d400cbb9a972d4d05d49834816c40d1/pageserver/src/tenant/timeline.rs#L3725-L3726), but that includes the wait time for the semaphore. We could add a separate metric for the time spent after acquiring the semaphore, so one can infer the wait time. Seems unnecessary at this point, though.

problame added 7 commits June 25, 2024 17:39

read_exact_at_impl: accept a BoundedBuf

282a633

WIP

79a33dc

WIP

c7fc169

it compiles

27cae3c

Merge branch 'main' into problame/virtualfile-use-boundedbuf

1e5c126

Merge branch 'main' into problame/virtualfile-use-boundedbuf

f49b32f

get rid of read_exact_at alltogether

647f084

problame added 3 commits June 27, 2024 15:31

re-add read_exact_at

928c1dc

finish & fix some ub with std-fs (will pull this into a preliminary)

df56595

Merge branch 'main' into problame/virtualfile-use-boundedbuf

b481740

problame changed the title ~~[DO NOT REVIEW YET]: refactor(virtual_file): read_exact_at to accept a BoundedBufMut~~ virtual_file: take a Slice in the read APIs, eliminate read_exact_at_n, fix UB for engine std-fs Jun 27, 2024

fix test and pretty up

98d8721

problame commented Jun 27, 2024

pageserver/src/virtual_file/io_engine.rs (resolved)

problame marked this pull request as ready for review June 27, 2024 18:00

problame requested a review from a team as a code owner June 27, 2024 18:00

problame requested review from skyzh and VladLazar June 27, 2024 18:00

problame mentioned this pull request Jun 27, 2024

bypass PageCache for L0 flush #7418

Closed

Merge branch 'main' into problame/virtualfile-use-boundedbuf

9fa4f47

problame mentioned this pull request Jun 27, 2024

L0 flush: opt-in mechanism to bypass PageCache reads and writes #8190

Merged

VladLazar approved these changes Jun 28, 2024

pageserver/src/virtual_file.rs (resolved)

problame merged commit deec3bc into main Jun 28, 2024
64 checks passed

problame deleted the problame/virtualfile-use-boundedbuf branch June 28, 2024 09:20

arpad-m mentioned this pull request Jul 1, 2024

Use Slice<_> in write path instead of B: BoundedBuf<...> #8225

Closed
