Post initial `tenant_size` issues #2748

koivunej · 2022-11-03T12:03:58Z

Initial PR: #2714. Condensing the review comments, remaining TODO's, and observations here.

- Incremental size update ideas for both:
  - libs/tenant_size_model
  - pageserver::tenant::size::calculate_logical_size
  - in discussions: might not be a blocker for production usage
- test_get_tenant_size_with_multiple_branches is flaky: tenant size mismatch #2962
- Make Postgres 15 default #2809 -- test failure blocking

#2817 will change a lot, but while that rewrite has been in progress, many of the issues are now handled.

Done or irrelevant now given all of the post-initial changes:

- Current panicking vs. fallible operations in libs/tenant_size_model (More tenant size fixes #3410)
- Behaviour below tenant's gc_horizon (currently just zero)
  - Fix tenant size modeling code to include WAL at end of branch #2781
  - overhauled in Fix tenant size orphans #3377 to be non-zero (about initdb_lsn)
- Behaviour with regards to next_gc_cutoff being on the ancestor_timeline (currently filtered out)
  - This seems ok
- The added and then removed test, in which the timeline is attempted to be broken by deleting its layer file (start and stop), raised questions
  - Size remained the same, but data was lost in postgres
  - Timeline was not broken in pageserver's view, because of the degenerate local-only test case
- Interaction with on-demand downloads (comment, Epic: on-demand S3 downloads #2029)
  - Related find: assertion failed: `self.historic_layers.remove(&LayerRTreeObject::new(layer)).is_some()` #3387, which has not been reproduced and is unlikely to be caused by tenant_size
- Wrongly calculated and poorly tested retention_period parameter
  - altered in Tenant size calculation: refactor, rewrite, and add SVG #2817, assuming it's now handled


Tenant size information is gathered using existing parts of `Tenant::gc_iteration`, which are now separated out as `Tenant::refresh_gc_info`. `Tenant::refresh_gc_info` collects branch points and invokes `Timeline::update_gc_info`; nothing was supposed to change there. The gathered branch points (through Timeline's `GcInfo::retain_lsns`), `GcInfo::horizon_cutoff`, and `GcInfo::pitr_cutoff` are used to build a Vec of updates fed into `libs/tenant_size_model` to calculate the history size.

The gathered information is now exposed through `GET /v1/tenant/{tenant_id}/size`, which responds with the actual calculated size. Initially the idea was to deliver this as a tenant background task and export it via a metric, but it might be too computationally expensive to run periodically, as we don't yet know whether the returned values are any good.

Adds one new metric:
- pageserver_storage_operations_seconds with label `logical_size`
  - separating from original `init_logical_size`

Adds a pageserver wide configuration variable:
- `concurrent_tenant_size_logical_size_queries` with default 1

This leaves a lot of TODO's, tracked on issue #2748.
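As a rough illustration of the gathering step described above, here is a minimal sketch of collecting the LSNs of interest for a single timeline. It is not the pageserver code: `Lsn` as a plain `u64`, the simplified `GcInfo` struct, the `interesting_lsns` helper, and the choice of `min(horizon_cutoff, pitr_cutoff)` as the effective cutoff are all assumptions made for the example.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the pageserver's `Lsn` type.
type Lsn = u64;

/// Hypothetical, simplified view of the per-timeline GC information.
/// Only the three fields named in the description are modelled.
struct GcInfo {
    /// Branch points of child timelines.
    retain_lsns: Vec<Lsn>,
    horizon_cutoff: Lsn,
    pitr_cutoff: Lsn,
}

/// Collect the sorted, de-duplicated LSNs at which a size update would be
/// recorded for one timeline: every branch point, the effective GC cutoff
/// (assumed here to be the smaller of the two cutoffs), and the end of the
/// branch.
fn interesting_lsns(gc: &GcInfo, last_record_lsn: Lsn) -> Vec<Lsn> {
    let mut lsns: BTreeSet<Lsn> = gc.retain_lsns.iter().copied().collect();
    lsns.insert(gc.horizon_cutoff.min(gc.pitr_cutoff));
    lsns.insert(last_record_lsn);
    lsns.into_iter().collect()
}
```

With branch points at 100 and 300, cutoffs of 200 and 150, and a branch ending at 400, this yields the points 100, 150, 300, and 400. A real implementation would also attach a logical size to each point before handing the updates to the size model.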

LizardWizzard · 2022-11-03T13:00:45Z

Another related point, tied to on-demand download: in the current on-demand model we need to download a significant number of layers to calculate the size. It would be good if we could make the incremental calculation reliable across restarts; if not, we'll probably need some tweaking to place the sizes in one layer (or in metadata) so that they are cheaper to obtain on startup.

koivunej · 2022-11-07T10:03:34Z

On #2755, my selection of next_gc_cutoff to represent the non-disk_consistent_lsn oldest retained Lsn raised questions, as did the _cutoff name. Originally Heikki proposed last_retain_lsn for this, but I went with next_gc_cutoff because it's calculated so similarly. My interpretation of how it should be calculated could be equally wrong.

With a more realistic selection of gc_horizon in tests, there is an immediate failure when trying to query logical size with lsn < initdb_lsn. Fixes that, and adds an illustration that came out of explaining this tenant size calculation to more people. Cc: #2748, #2599.
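One way to avoid querying below initdb_lsn is to clamp the query LSN before asking for the logical size. The sketch below only illustrates that idea and is not necessarily what #2755 does; `Lsn` as a `u64` and the `clamp_to_initdb` helper are hypothetical.

```rust
/// Hypothetical stand-in for the pageserver's `Lsn` type.
type Lsn = u64;

/// Never return an LSN older than the timeline's initdb_lsn, since no
/// logical size can be computed before the initial database exists.
fn clamp_to_initdb(query_lsn: Lsn, initdb_lsn: Lsn) -> Lsn {
    query_lsn.max(initdb_lsn)
}
```

A cutoff that falls below initdb_lsn is raised to initdb_lsn, while any later LSN passes through unchanged.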

koivunej · 2023-01-19T19:11:29Z

#3377 fixes bugs, adds test cases (some with skipped test annotations), and changes how the initial tenant size is calculated to how it should have been all along.

koivunej mentioned this issue Nov 4, 2022

fix: logical size query at before initdb_lsn #2755

Merged

neondatabase-bot bot added this to the 2022/12 milestone Nov 16, 2022

jcsp closed this as completed Apr 4, 2024
