
pageserver: latest_gc_cutoff can go backwards after restart #10208

Closed
skyzh opened this issue Dec 19, 2024 · 1 comment · Fixed by #10209
Labels
c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug)

Comments

skyzh commented Dec 19, 2024

Still root-causing #10192, but the underlying issue seems to be that gc_cutoff can go backwards:

```
2024-12-17T20:25:23.079668Z  INFO gc_loop{tenant_id=12fd6e6d7a50bf7dd96154ec39b8b7c8 shard_id=0000}:run:gc_timeline{timeline_id=9136e295b2647dae2fc5e2a2abbb1dc6 cutoff=0/E4B96D18}: keeping 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000E4B706C9-00000000E4B96D19 because it's newer than space_cutoff 0/E4B96D18
2024-12-17T20:40:41.170702Z  INFO compaction_loop{tenant_id=12fd6e6d7a50bf7dd96154ec39b8b7c8 shard_id=0000}:run:scheduled_compact_timeline{timeline_id=9136e295b2647dae2fc5e2a2abbb1dc6}: picked 30 layers for compaction (0 layers need rewriting) with max_layer_lsn=0/E4B96D19 min_layer_lsn=0/14EE9E8 gc_cutoff=0/E4B96D18 lowest_retain_lsn=0/E4B96D18, key_range=000000000000000000000000000000000000..FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, has_data_below=false
2024-12-17T20:46:56.028412Z  INFO loading tenant configuration from /storage/pageserver/data/tenants/12fd6e6d7a50bf7dd96154ec39b8b7c8/config-v1
2024-12-17T20:47:13.726136Z ERROR synthetic_size_worker: failed to calculate synthetic size for tenant 12fd6e6d7a50bf7dd96154ec39b8b7c8: could not find data for key 010000000000000000000000000000000000 (shard ShardNumber(0)) at LSN 0/E4B839F1, request LSN 0/E4B839F0, ancestor 0/0
```

Looking at the current index_part.json, it still records "latest_gc_cutoff_lsn": "0/E4B706C8".

This means we never persisted latest_gc_cutoff_lsn=0/E4B96D18 to index_part.json. After the pageserver restarts, it loads the stale latest_gc_cutoff_lsn, and tasks like synthetic size calculation then try to read data at LSNs that GC has already removed.

Therefore, a correct implementation of legacy GC / gc-compaction must operate against the persisted latest gc cutoff; in other words, we should upload latest_gc_cutoff to index_part.json before starting GC, i.e., before removing any layer files.
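
For illustration, here is a minimal, self-contained Rust sketch of the ordering this implies. All types and function names are made up for the example (they are not the actual pageserver API); the point is only that the new cutoff must be durable in index_part.json before any layer files below it are deleted.

```rust
// Illustrative model only: all names/types are hypothetical, not the real
// pageserver code. It captures the invariant that the cutoff persisted in
// index_part.json must never lag behind the cutoff GC already acted on.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Lsn(u64);

struct Timeline {
    /// Cutoff currently recorded in the remote index_part.json.
    persisted_gc_cutoff: Lsn,
    /// Cutoff GC is operating with in memory.
    in_memory_gc_cutoff: Lsn,
}

impl Timeline {
    /// Step 1: upload an index-only update carrying the new cutoff and wait
    /// for it to become durable (modeled here as a plain assignment).
    fn persist_gc_cutoff(&mut self, new_cutoff: Lsn) {
        self.persisted_gc_cutoff = self.persisted_gc_cutoff.max(new_cutoff);
    }

    /// Step 2: only after the cutoff is persisted is it safe to remove layer
    /// files below it. Skipping step 1 (what gc-compaction did) reproduces
    /// the bug: a restart loads the stale cutoff from index_part.json.
    fn gc_compact(&mut self, new_cutoff: Lsn) {
        self.persist_gc_cutoff(new_cutoff);
        self.in_memory_gc_cutoff = new_cutoff;
        // ... delete layer files below `new_cutoff` here ...
    }
}

fn main() {
    let mut tl = Timeline {
        persisted_gc_cutoff: Lsn(0xE4B7_06C8),
        in_memory_gc_cutoff: Lsn(0xE4B7_06C8),
    };
    tl.gc_compact(Lsn(0xE4B9_6D18));
    // After a restart, the cutoff read back from index_part.json is never
    // older than the data GC removed.
    assert!(tl.persisted_gc_cutoff >= tl.in_memory_gc_cutoff);
}
```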

skyzh added the c/storage/pageserver and t/bug labels on Dec 19, 2024
skyzh self-assigned this on Dec 19, 2024

skyzh commented Dec 19, 2024

Confirmed that legacy GC schedules an index-only update before removing the files, so this is only a problem with gc-compaction.

github-merge-queue bot pushed a commit that referenced this issue Dec 19, 2024
…10209)

## Problem

close #10208
part of #9114 

## Summary of changes

* Ensure remote `latest_gc_cutoff` is up-to-date before removing any
files for gc-compaction.

Signed-off-by: Alex Chi Z <chi@neon.tech>