Investigate safekeeper eviction errors #8758
Is the check over-sensitive?
Context: generally, non-zero bytes after the last record are not impossible (you can get them by killing a compute while it has sent half a record to the safekeepers), but they are expected to disappear eventually, because the elected walproposer zeros out WAL after its initial position, which is a record boundary. So we wanted to understand how this could happen. I collected debug dumps and did a more thorough analysis. Overall there are 23 timelines like that (querying non-evicted timelines which were last modified more than a couple of days ago, don't have a lagging remote_consistent_lsn, and don't have other issues):
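A sketch of that filtering over the collected debug dumps; the JSON field names here are illustrative assumptions, not the actual dump schema:

```python
# Hedged sketch: filter safekeeper debug-dump JSON for suspicious timelines.
# Assumes one dump file per safekeeper with a "timelines" array; the field
# names ("evicted", "last_modified_ts", etc.) are illustrative only.
import json
import sys
import time

COUPLE_OF_DAYS = 2 * 24 * 3600

def suspicious_timelines(dump_path: str):
    with open(dump_path) as f:
        dump = json.load(f)
    now = time.time()
    for tli in dump.get("timelines", []):
        # Skip already-evicted timelines and recently active ones.
        if tli.get("evicted"):
            continue
        if now - tli.get("last_modified_ts", now) < COUPLE_OF_DAYS:
            continue
        # Skip timelines whose remote_consistent_lsn still lags commit_lsn.
        if tli.get("remote_consistent_lsn") != tli.get("commit_lsn"):
            continue
        yield tli["timeline_id"]

if __name__ == "__main__":
    for tid in suspicious_timelines(sys.argv[1]):
        print(tid)
```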
Last modification times of these: the latest is at
I looked more closely at ~5 of these, and they expose the same pattern:
So, the 7719-byte record size hints that the partial record is one carrying a full page image of some pg_proc page. pg_waldump doesn't give anything useful for partial/invalid WAL records. Moreover, it is in principle hard to extract info from corrupted records because of the WAL record structure (xlogrecord.h): first goes XLogRecord, then XLogRecordBlockHeader(s), then XLogRecordDataHeader[Short|Long], then the blocks (page images), and only then the main record data. In our case, when we have only ~300 bytes of the whole 7719, we most likely don't have the main record data at all (most of those 7719 bytes is likely the page image). I patched pg_waldump to provide at least
Here are its results and xxd dumps for some timelines:
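For reference, the fixed 24-byte XLogRecord header is decodable even from a truncated record, which is how the zero xid and the valid xl_prev below can be read off a hexdump. A minimal Python sketch (little-endian layout per xlogrecord.h; the file/offset handling is illustrative):

```python
# Hedged sketch: decode the fixed XLogRecord header from the first 24 bytes
# of a partial record -- enough to see xl_tot_len (7719 here), the zero
# xl_xid, and the xl_prev pointer.
import struct
import sys

# Layout per PostgreSQL xlogrecord.h: uint32 xl_tot_len, TransactionId xl_xid,
# XLogRecPtr xl_prev, uint8 xl_info, RmgrId xl_rmid, 2 bytes padding,
# pg_crc32c xl_crc.
XLOG_RECORD = struct.Struct("<IIQBBxxI")

def decode_header(raw: bytes) -> dict:
    tot_len, xid, prev, info, rmid, crc = XLOG_RECORD.unpack_from(raw)
    return {
        "xl_tot_len": tot_len,
        "xl_xid": xid,  # 0 for xid-less records, e.g. heap2 vacuum records
        "xl_prev": f"{prev >> 32:X}/{prev & 0xFFFFFFFF:X}",  # LSN of previous record
        "xl_info": hex(info),
        "xl_rmid": rmid,  # 9 = Heap2, 10 = Heap, per rmgrlist.h
        "xl_crc": hex(crc),
    }

if __name__ == "__main__":
    # Usage: decode_hdr.py <segment file> <byte offset of the partial record>
    with open(sys.argv[1], "rb") as f:
        f.seek(int(sys.argv[2], 0))
        print(decode_header(f.read(24)))
```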
Initially I was confused by the zero xid, thinking it looked like corruption in the middle (xl_prev follows it and is definitely valid), but looking at heapdesc.c and the vacuum code, there are xid-less heap2 records. Moreover, looking at the last operations for the affected timelines, it was check_availability. That means there was no usual 5 minutes of idleness: just some check (transaction) and a shutdown, so such a race between commit (XLogFlush) and vacuum is even more likely. Why don't we have more occurrences of this? On 11.03 we merged #6712, which started flushing all outstanding WAL to safekeepers before shutdown when pg is stopped in non-immediate mode. And we do

So I conclude that these bytes are not corruption and are quite normal. However, given that we shouldn't see more of this, I'm thinking of leaving this over-sensitive check in place and force-waking the affected tenants to remove the tail. I automated segment fetching / cmp a bit with
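A rough reconstruction of what that fetch/cmp automation could look like (hostnames, on-disk paths, and the scp transport are all assumptions):

```python
# Hedged sketch: copy the same WAL segment from every safekeeper and
# byte-compare the copies, like a scripted `cmp`.
import filecmp
import subprocess
import sys
import tempfile

def fetch_and_compare(timeline_id: str, segname: str, safekeepers: list[str]) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        local = []
        for i, host in enumerate(safekeepers):
            dst = f"{tmp}/{i}_{segname}"
            # Assumed on-disk layout <data dir>/<timeline_id>/<segment>.
            subprocess.run(
                ["scp", f"{host}:/storage/safekeeper/{timeline_id}/{segname}", dst],
                check=True,
            )
            local.append(dst)
        # shallow=False forces an actual byte-by-byte comparison.
        return all(filecmp.cmp(local[0], other, shallow=False) for other in local[1:])

if __name__ == "__main__":
    ok = fetch_and_compare(sys.argv[1], sys.argv[2], sys.argv[3:])
    print("segments identical" if ok else "segments DIFFER")
```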
Force started these 23 timelines with a script.
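Waking a tenant can be as simple as opening a connection to its endpoint; a hypothetical sketch of what such a script might look like (psycopg2 and the DSN list are assumptions, not the actual script):

```python
# Hedged sketch: force-wake affected timelines so the elected walproposer
# zeros out the garbage tail after the last record boundary.
import psycopg2  # assumed available

ENDPOINTS = [
    # one connection string per affected timeline; placeholder value
    "postgres://user:pass@ep-example-123.eu-west-1.aws.neon.tech/neondb",
]

for dsn in ENDPOINTS:
    # Opening a connection starts the compute; the elected walproposer then
    # truncates WAL after its initial position, removing the partial record.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT 1")
    print("woke", dsn)
```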
Repeated the same with staging eu-west-1.
AFAIS there is no such problem in staging eu-central-1.
https://neondb.slack.com/archives/C0756RKTCNR/p1723624642168319
https://neondb.slack.com/archives/C04KGFVUWUQ/p1725512770462649