Add Layer::overlaps method and use it in count_deltas to avoid unnecessary image layer generation #3348
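For orientation, a minimal, self-contained sketch of the idea in the title: a layer only "overlaps" a key range if that range is not entirely covered by one of the layer's holes, and count_deltas skips non-overlapping layers, so a stack of deltas with a hole over a partition no longer forces image-layer creation. All type names, signatures, and numbers below are made up for illustration; the real types live in the pageserver crate and differ.

```rust
use std::ops::Range;

type Key = u64;

struct DeltaLayer {
    key_range: Range<Key>,
    // Gaps inside key_range where the layer stores no keys.
    holes: Vec<Range<Key>>,
}

impl DeltaLayer {
    /// Hypothetical shape of a Layer::overlaps check: true only if the layer
    /// actually stores data inside `range`, i.e. the range is not contained
    /// in one of the layer's holes.
    fn overlaps(&self, range: &Range<Key>) -> bool {
        if range.end <= self.key_range.start || range.start >= self.key_range.end {
            return false;
        }
        !self
            .holes
            .iter()
            .any(|h| h.start <= range.start && range.end <= h.end)
    }
}

/// Simplified count_deltas: layers that do not overlap the partition are skipped,
/// so they no longer count toward the threshold that triggers image layer creation.
fn count_deltas(layers: &[DeltaLayer], partition: &Range<Key>) -> usize {
    layers.iter().filter(|l| l.overlaps(partition)).count()
}

fn main() {
    let layers = vec![
        DeltaLayer { key_range: 0..100, holes: vec![40..60] },
        DeltaLayer { key_range: 0..100, holes: vec![] },
    ];
    // Partition 45..55 falls inside the first layer's hole, so only one delta counts.
    assert_eq!(count_deltas(&layers, &(45..55)), 1);
    println!("deltas overlapping 45..55: {}", count_deltas(&layers, &(45..55)));
}
```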

Closed. Wants to merge 44 commits.

Commits (44)
121b5e9
Add Layer::overlaps method and use it in count_deltas to avoid unness…
knizhnik Jan 15, 2023
7d1e6ac
Remove warnings
knizhnik Jan 15, 2023
c14cc5b
Store information about largest holes in layer map
knizhnik Jan 20, 2023
9b2321e
Merge with main
knizhnik Jan 26, 2023
8f23a5e
Store information about holes in delta layers in S3 index file
knizhnik Jan 27, 2023
652fd5a
Remove excessive clone() when merging holes
knizhnik Jan 27, 2023
aa89e26
Fix unit tests building
knizhnik Jan 27, 2023
119afc3
Make clippy happy
knizhnik Jan 27, 2023
f63c663
Add traces to debug physical layer size calculation
knizhnik Jan 28, 2023
8b85811
Temporary disarm overlaps
knizhnik Jan 28, 2023
4099101
Revert "Temporary disarm overlaps"
knizhnik Jan 29, 2023
38754b2
Revert "Make clippy happy"
knizhnik Jan 29, 2023
4f38224
Add delay to test_on_demand_download
knizhnik Jan 29, 2023
c2be42b
Revert "Add traces to debug physical layer size calculation"
knizhnik Jan 29, 2023
3100984
Make clippy happy
knizhnik Jan 27, 2023
08f450e
Ignore errors in find_lsn_for_timestamp
knizhnik Jan 29, 2023
d7c09ca
Revert "Ignore errors in find_lsn_for_timestamp"
knizhnik Jan 30, 2023
c05095f
Temp: do not use information about holes in GC
knizhnik Jan 30, 2023
0b32d14
Fix LayerMap::search method
knizhnik Jan 30, 2023
0824044
Trnsacte index_part.json file in test_tenant_upgrades_index_json_from_v0
knizhnik Jan 30, 2023
3dc4ab3
Add holes information while upgrading layer metadata
knizhnik Jan 31, 2023
eef2ab6
Make clippy happy
knizhnik Jan 31, 2023
6c54d0d
Provide custom serializer for Hole
knizhnik Jan 31, 2023
d63a745
Propagate RequestContext
knizhnik Feb 1, 2023
7c3c271
Store historic layers in separate set
knizhnik Feb 4, 2023
a58b703
Remove all occupied segments in layer map
knizhnik Feb 4, 2023
161e0d8
Remove all occupied segments in layer map
knizhnik Feb 4, 2023
5e2cea2
Rebase with main
knizhnik Feb 4, 2023
e020247
Cleanup after merge with main
knizhnik Feb 4, 2023
88fc5c3
Minor refactoring
knizhnik Feb 4, 2023
6aab1a9
Minor refactoring
knizhnik Feb 4, 2023
d89e5a7
Ignore load layer error in get_occupied_ranges
knizhnik Feb 4, 2023
be24282
Ignore load layer error in get_occupied_ranges
knizhnik Feb 4, 2023
689ef9e
Sort layers in LayerMapInfo
knizhnik Feb 4, 2023
e068e48
Sort layers in LayerMapInfo
knizhnik Feb 4, 2023
8051f94
Update pageserver/src/tenant/layer_map.rs
knizhnik Feb 6, 2023
04c81fc
Add test for format version 2 of index_part.json
knizhnik Feb 6, 2023
4b3fe74
Try to use wait_for_upload in test_ondemand_download_timetravel
knizhnik Feb 6, 2023
ed44d66
Restore v1_indexpart_is_parsed test
knizhnik Feb 7, 2023
ad4d678
Merge with main
knizhnik Feb 7, 2023
284d167
Restore sleep in test_ondemand_download.py
knizhnik Feb 7, 2023
e9b0e43
Update pageserver/src/tenant/layer_map.rs
knizhnik Feb 9, 2023
6e5efc5
Treat Arc as raw pointers in hash implementation for LayerRef
knizhnik Feb 9, 2023
d78fbb0
Make clippy happy
knizhnik Feb 9, 2023
1 change: 1 addition & 0 deletions libs/pageserver_api/src/models.rs
@@ -241,6 +241,7 @@ pub struct LayerMapInfo {
#[repr(usize)]
pub enum LayerAccessKind {
GetValueReconstructData,
ExtractHoles,
Iter,
KeyIter,
Dump,
18 changes: 12 additions & 6 deletions pageserver/benches/bench_layer_map.rs
@@ -1,5 +1,7 @@
use pageserver::context::{DownloadBehavior, RequestContext};
use pageserver::keyspace::{KeyPartitioning, KeySpace};
use pageserver::repository::Key;
use pageserver::task_mgr::TaskKind;
use pageserver::tenant::layer_map::LayerMap;
use pageserver::tenant::storage_layer::{Layer, LayerDescriptor, LayerFileName};
use rand::prelude::{SeedableRng, SliceRandom, StdRng};
@@ -16,6 +18,7 @@ use utils::lsn::Lsn;
use criterion::{criterion_group, criterion_main, Criterion};

fn build_layer_map(filename_dump: PathBuf) -> LayerMap<LayerDescriptor> {
let ctx = RequestContext::new(TaskKind::Benchmark, DownloadBehavior::Error);
let mut layer_map = LayerMap::<LayerDescriptor>::default();

let mut min_lsn = Lsn(u64::MAX);
@@ -33,7 +36,7 @@ fn build_layer_map(filename_dump: PathBuf) -> LayerMap<LayerDescriptor> {
min_lsn = min(min_lsn, lsn_range.start);
max_lsn = max(max_lsn, Lsn(lsn_range.end.0 - 1));

updates.insert_historic(Arc::new(layer));
updates.insert_historic(Arc::new(layer), &ctx).unwrap();

Contributor:

We can plumb ctx everywhere if it's needed. But I wonder what particular change in this PR required plumbing ctx. Why is it unavoidable?

Contributor:

If we really need to add ctx to all these places, please extract that into a separate commit for easier review.

@koivunej (Member), Feb 7, 2023:

Or a separate PR, but I think this depends on the question "should LayerMap receive layers with full hole information, or can it calculate them on demand?". I think it would be possible to side-step this by requiring that the layers have the metadata before being put into the LayerMap.

See: https://github.com/neondatabase/neon/pull/3348/files#r1097689494

@bojanserafimov (Contributor), Feb 7, 2023:

> I do not understand why "ctx" in LayerMap is a red flag. RequestContext is actually just a way to propagate task-specific context. From my point of view, it should be everywhere (in each of our methods) or never (use LS when it is actually needed).

First reason: The layer map is just a data structure. It should be possible to separate it into its own lib crate. Why? When I rewrote the layer map I spent 1 week writing layer map code and 2 months dealing with the lack of separation between the layer map and the caller (no tests, no documented assumptions, too many requirements of a minimum viable implementation, etc.) What we're doing here with ctx is equivalent to plumbing ctx into the HashMap implementation because we want to read a delta file inside DeltaLayer::hash. It's bloated. It's not modular.

Second reason: The need to plumb ctx for this PR reveals that the layer map needed access to some kind of expensive operation worth tracking. When someone reads this code, their guess would be that it either:
a) reads the holes from the header of the delta layer,
b) reads the entire delta layer to construct the holes, or
c) downloads layers (pls no!)

A paranoid reader will assume it's (c) and will have to read a lot more code just to make sure. A careless reader will assume it's (a) and move on. In reality it's (b), so nobody is right. Reading code should not be an adventure. The code should make it obvious what's going on and what can be improved. The reader's attention should be on the important bits. Given how complicated the layer map already is, I'd try to remove ctx completely so that there would be no doubt, but if we have to hack something I'd at least be very explicit with comments and TODOs explaining how it can be done better.

Overall, I agree that if we have ctx we should plumb it in all methods proactively. But there are exceptions:

  1. We should plumb it in all pageserver code, not all libraries. The layer map is not a library, but it really needs to be (see my reasoning above)
  2. In theory, more context is always better. But passing the nuclear launch codes into a get_timezone function will raise red flags. We're now passing error handling logic into a function that should never attempt to do any of those things that need the error handling.

It seems like an avoidable problem given that we actually compute the holes explicitly whenever we create a delta layer and add it to the layer map. Why not keep or pass that information?

https://github.com/neondatabase/neon/blob/56af67b5931569525b80256b4a9549d05e2fbc9a/pageserver/src/tenant/timeline.rs#L2422-L2429

I'd be fine with at least documenting // HACK this can be done a lot better and moving on, if the solution is actually a lot more involved than I think.
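As a rough illustration of the alternative suggested above (the caller computes the holes once, when it writes the delta layer, and hands them to the layer map, so insert_historic stays a pure in-memory operation with no RequestContext), here is a hedged sketch with stand-in types; none of these names or signatures are the actual pageserver API.

```rust
use std::ops::Range;
use std::sync::Arc;

type Key = u64;

/// A gap in a delta layer's key space, computed once when the layer is written.
#[derive(Clone, Debug)]
struct Hole(Range<Key>);

struct LayerDescriptor {
    key_range: Range<Key>,
    // Precomputed by the caller, e.g. by whatever get_holes() does today.
    holes: Vec<Hole>,
}

#[derive(Default)]
struct LayerMap {
    historic: Vec<Arc<LayerDescriptor>>,
}

impl LayerMap {
    /// Purely in-memory bookkeeping: no file reads, no downloads, no RequestContext.
    fn insert_historic(&mut self, layer: Arc<LayerDescriptor>) {
        self.historic.push(layer);
    }
}

fn main() {
    // The caller already computed the holes while writing the delta layer,
    // so it just hands them over together with the descriptor.
    let layer = LayerDescriptor {
        key_range: 0..1_000,
        holes: vec![Hole(200..300)],
    };
    let mut map = LayerMap::default();
    map.insert_historic(Arc::new(layer));
    println!(
        "historic layers: {}, first key range: {:?}",
        map.historic.len(),
        map.historic[0].key_range
    );
}
```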

Contributor:

> But passing the nuclear launch codes into a get_timezone function will raise red flags.

😆

FWIW, I agree 100% on your reasoning.

@bojanserafimov (Contributor), Feb 8, 2023:

> Information about holes is not yet calculated.

Line 2242 that I linked to says `let holes = new_delta.get_holes(ctx)?;`. Is this not calculating holes?

> It is calculated only in one place: in the delta layer, by traversing the disk B-Tree.

Let's just not traverse the layer B-Tree while holding the layer map write lock. That's all I'm asking. If it can't be done now because it's difficult, add a TODO and issue for later.

Let's address the "reading 256MB under layer map write lock" problem, and leave the ctx issue for another day. I'll take out the context from the layer map later if necessary.
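A small sketch of the ordering being asked for here, with stand-in types (compute_holes below is only a placeholder for the B-Tree walk that get_holes(ctx) performs in the real code): do the expensive work before taking the layer map write lock, and hold the lock only for the cheap in-memory insert.

```rust
use std::ops::Range;
use std::sync::{Arc, RwLock};

type Key = u64;

struct DeltaLayer {
    key_range: Range<Key>,
}

impl DeltaLayer {
    /// Stand-in for the expensive step: in the real code this walks the
    /// layer's on-disk index.
    fn compute_holes(&self) -> Vec<Range<Key>> {
        vec![self.key_range.start + 10..self.key_range.start + 20]
    }
}

#[derive(Default)]
struct LayerMap {
    layers: Vec<(Arc<DeltaLayer>, Vec<Range<Key>>)>,
}

fn add_layer(map: &RwLock<LayerMap>, layer: Arc<DeltaLayer>) {
    // 1. Do the expensive work first, with no lock held.
    let holes = layer.compute_holes();
    // 2. Take the write lock only for the cheap in-memory insert.
    map.write().unwrap().layers.push((layer, holes));
}

fn main() {
    let map = RwLock::new(LayerMap::default());
    add_layer(&map, Arc::new(DeltaLayer { key_range: 0..100 }));
    println!("layers: {}", map.read().unwrap().layers.len());
}
```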

knizhnik (PR author):

> Line 2242 that I linked to says `let holes = new_delta.get_holes(ctx)?;`. Is this not calculating holes?

Yes, it does. But it is stored in the layer's inner, so there is no need to pass it anywhere. Or maybe I do not understand you.

> Let's just not traverse the layer B-Tree while holding the layer map write lock. …

It is not done under the layer map lock. As I mentioned above, it is really done in `new_delta.get_holes(ctx)?`, where the layer map lock is not held.

> Let's address the "reading 256MB under layer map write lock" problem, …

There is no such problem. Moreover, traversing the B-Tree doesn't mean reading all 256 MB; the layer index is only about 40 MB.

Contributor:

> It is not done under the layer map lock.

Then why do we need ctx under the write lock? (Rhetorical question; hopefully it makes it clear where the code readability problem is.)

Ok it's not an actual issue, just a readability issue. If it took 10 messages of back and forth to reach this conclusion, it deserves at least a comment in the code. I don't think it's at all obvious.

@bojanserafimov (Contributor), Feb 8, 2023:

Seems like: "insert_historic would usually read the layer to compute holes, but in practice that information would already be cached because the caller happens to call new_delta.get_holes."

Is that the explanation? Or something else?

Even if that's the explanation, it's very fragile to line-reordering refactors, and needs an extra comment.

knizhnik (PR author):

> Ok it's not an actual issue, just a readability issue. If it took 10 messages of back and forth to reach this conclusion, it deserves at least a comment in the code. I don't think it's at all obvious.

Sorry, it looks like this was just a misunderstanding: I didn't understand your concerns, and you didn't understand my explanations. Since almost all reviewers complained about the large number of changes related to propagating RequestContext and concentrated on that first, I did not notice your main concern: that we might perform an expensive operation (extracting holes) under the layer map write lock. That does not actually happen, but only unintentionally; I had not realized this problem. I will add a comment explaining it. But I want to note that this problem is not related to propagating request contexts at all: it would be present even if the DeltaLayer::load() method did not require a request context and I did not need to propagate it everywhere.

}

println!("min: {min_lsn}, max: {max_lsn}");
@@ -135,6 +138,7 @@ fn bench_from_captest_env(c: &mut Criterion) {
// Benchmark using metadata extracted from a real project that was taknig
// too long processing layer map queries.
fn bench_from_real_project(c: &mut Criterion) {
let ctx = RequestContext::new(TaskKind::Benchmark, DownloadBehavior::Error);
// Init layer map
let now = Instant::now();
let layer_map = build_layer_map(PathBuf::from("benches/odd-brook-layernames.txt"));
@@ -157,12 +161,13 @@ fn bench_from_real_project(c: &mut Criterion) {
println!("running correctness check");

let now = Instant::now();
let result_bruteforce = layer_map.get_difficulty_map_bruteforce(latest_lsn, &partitioning);
let result_bruteforce =
layer_map.get_difficulty_map_bruteforce(latest_lsn, &partitioning, &ctx);
assert!(result_bruteforce.len() == partitioning.parts.len());
println!("Finished bruteforce in {:?}", now.elapsed());

let now = Instant::now();
let result_fast = layer_map.get_difficulty_map(latest_lsn, &partitioning, None);
let result_fast = layer_map.get_difficulty_map(latest_lsn, &partitioning, None, &ctx);
assert!(result_fast.len() == partitioning.parts.len());
println!("Finished fast in {:?}", now.elapsed());

@@ -189,14 +194,15 @@ fn bench_from_real_project(c: &mut Criterion) {
});
group.bench_function("get_difficulty_map", |b| {
b.iter(|| {
layer_map.get_difficulty_map(latest_lsn, &partitioning, Some(3));
layer_map.get_difficulty_map(latest_lsn, &partitioning, Some(3), &ctx);
});
});
group.finish();
}

// Benchmark using synthetic data. Arrange image layers on stacked diagonal lines.
fn bench_sequential(c: &mut Criterion) {
let ctx = RequestContext::new(TaskKind::Benchmark, DownloadBehavior::Error);
// Init layer map. Create 100_000 layers arranged in 1000 diagonal lines.
//
// TODO This code is pretty slow and runs even if we're only running other
@@ -206,7 +212,7 @@ fn bench_sequential(c: &mut Criterion) {
let now = Instant::now();
let mut layer_map = LayerMap::default();
let mut updates = layer_map.batch_update();
for i in 0..100_000 {
for i in 1..100_000 {
let i32 = (i as u32) % 100;
let zero = Key::from_hex("000000000000000000000000000000000000").unwrap();
let layer = LayerDescriptor {
@@ -215,7 +221,7 @@ fn bench_sequential(c: &mut Criterion) {
is_incremental: false,
short_id: format!("Layer {}", i),
};
updates.insert_historic(Arc::new(layer));
updates.insert_historic(Arc::new(layer), &ctx).unwrap();
}
updates.flush();
println!("Finished layer map init in {:?}", now.elapsed());
3 changes: 2 additions & 1 deletion pageserver/src/http/routes.rs
@@ -570,14 +570,15 @@ async fn layer_download_handler(request: Request<Body>) -> Result<Response<Body>
}

async fn evict_timeline_layer_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let ctx = RequestContext::new(TaskKind::MgmtRequest, DownloadBehavior::Error);
let tenant_id: TenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let timeline_id: TimelineId = parse_request_param(&request, "timeline_id")?;
let layer_file_name = get_request_param(&request, "layer_file_name")?;

let timeline = active_timeline_of_active_tenant(tenant_id, timeline_id).await?;
let evicted = timeline
.evict_layer(layer_file_name)
.evict_layer(layer_file_name, &ctx)
.await
.map_err(ApiError::InternalServerError)?;

2 changes: 2 additions & 0 deletions pageserver/src/task_mgr.rs
@@ -264,6 +264,8 @@ pub enum TaskKind {

DebugTool,

Benchmark,

#[cfg(test)]
UnitTest,
}
26 changes: 14 additions & 12 deletions pageserver/src/tenant.rs
@@ -176,9 +176,9 @@ impl UninitializedTimeline<'_> {
///
/// The new timeline is initialized in Active state, and its background jobs are
/// started
pub fn initialize(self, _ctx: &RequestContext) -> anyhow::Result<Arc<Timeline>> {
pub fn initialize(self, ctx: &RequestContext) -> anyhow::Result<Arc<Timeline>> {
let mut timelines = self.owning_tenant.timelines.lock().unwrap();
self.initialize_with_lock(&mut timelines, true, true)
self.initialize_with_lock(&mut timelines, true, true, ctx)
}

/// Like `initialize`, but the caller is already holding lock on Tenant::timelines.
@@ -191,6 +191,7 @@ impl UninitializedTimeline<'_> {
timelines: &mut HashMap<TimelineId, Arc<Timeline>>,
load_layer_map: bool,
activate: bool,
ctx: &RequestContext,
) -> anyhow::Result<Arc<Timeline>> {
let timeline_id = self.timeline_id;
let tenant_id = self.owning_tenant.tenant_id;
@@ -211,7 +212,7 @@ impl UninitializedTimeline<'_> {
Entry::Vacant(v) => {
if load_layer_map {
new_timeline
.load_layer_map(new_disk_consistent_lsn)
.load_layer_map(new_disk_consistent_lsn, ctx)
.with_context(|| {
format!(
"Failed to load layermap for timeline {tenant_id}/{timeline_id}"
@@ -459,7 +460,7 @@ impl Tenant {
local_metadata: Option<TimelineMetadata>,
ancestor: Option<Arc<Timeline>>,
first_save: bool,
_ctx: &RequestContext,
ctx: &RequestContext,
) -> anyhow::Result<()> {
let tenant_id = self.tenant_id;

@@ -494,7 +495,7 @@ impl Tenant {
// Do not start walreceiver here. We do need loaded layer map for reconcile_with_remote
// But we shouldnt start walreceiver before we have all the data locally, because working walreceiver
// will ingest data which may require looking at the layers which are not yet available locally
match timeline.initialize_with_lock(&mut timelines_accessor, true, false) {
match timeline.initialize_with_lock(&mut timelines_accessor, true, false, ctx) {
Ok(new_timeline) => new_timeline,
Err(e) => {
error!("Failed to initialize timeline {tenant_id}/{timeline_id}: {e:?}");
@@ -528,6 +529,7 @@ impl Tenant {
.reconcile_with_remote(
up_to_date_metadata,
remote_startup_data.as_ref().map(|r| &r.index_part),
ctx,
)
.await
.context("failed to reconcile with remote")?
@@ -1954,7 +1956,7 @@ impl Tenant {
// made.
break;
}
let result = timeline.gc().await?;
let result = timeline.gc(ctx).await?;
totals += result;
}

@@ -2078,7 +2080,7 @@ impl Tenant {
src_timeline: &Arc<Timeline>,
dst_id: TimelineId,
start_lsn: Option<Lsn>,
_ctx: &RequestContext,
ctx: &RequestContext,
) -> anyhow::Result<Arc<Timeline>> {
let src_id = src_timeline.timeline_id;

@@ -2171,7 +2173,7 @@ impl Tenant {
false,
Some(Arc::clone(src_timeline)),
)?
.initialize_with_lock(&mut timelines, true, true)?;
.initialize_with_lock(&mut timelines, true, true, ctx)?;
drop(timelines);
info!("branched timeline {dst_id} from {src_id} at {start_lsn}");

@@ -2272,7 +2274,7 @@ impl Tenant {

let timeline = {
let mut timelines = self.timelines.lock().unwrap();
raw_timeline.initialize_with_lock(&mut timelines, false, true)?
raw_timeline.initialize_with_lock(&mut timelines, false, true, ctx)?
};

info!(
@@ -3426,7 +3428,7 @@ mod tests {
.await?;
tline.freeze_and_flush().await?;
tline.compact(&ctx).await?;
tline.gc().await?;
tline.gc(&ctx).await?;
}

Ok(())
@@ -3498,7 +3500,7 @@ mod tests {
.await?;
tline.freeze_and_flush().await?;
tline.compact(&ctx).await?;
tline.gc().await?;
tline.gc(&ctx).await?;
}

Ok(())
@@ -3582,7 +3584,7 @@ mod tests {
.await?;
tline.freeze_and_flush().await?;
tline.compact(&ctx).await?;
tline.gc().await?;
tline.gc(&ctx).await?;
}

Ok(())