Fix whatever is causing the pg_wal folder to grow so large in prod cluster! #338

Open
Venryx opened this issue Jul 25, 2024 · 3 comments

Venryx commented Jul 25, 2024

For now, when the wal folder grows too large, use the following fix route:

  • 1) Increase the size of the out-of-space PVC(s), using Lens. (instead of the default 10000Mi, set it to 10100Mi or something) A scripted equivalent of steps 1-3 is sketched after this list.
  • 2) Kill the pods that are using those PVC(s), so they can finish resizing. (kill the "...instance1..." and "repo-host-0" pods, OR do "Restart" on the two corresponding StatefulSets)
    • Note: The resizing may fail to take place on the first attempt, for one or both PVCs; repeat the process until it works. (you can confirm by checking df in the two pods mentioned above, or more easily, by running describe on the PVCs, as seen here, except done for the purpose of seeing the events rather than the pods using the PVC; you can also just observe the events in the Lens UI)
    • Note: In some cases, this step appears unnecessary. (it wasn't necessary one time when I increased the size of just the repo pvc)
  • 3) Open a shell in the "instance1" and/or "repo-host-0" pods, and check whether "df" still shows 100% on any of the filesystems. If they are all lower than 100% now, that's good.
  • 4) IF the database was not corrupted by the space running out (the first time this issue happened, the db got corrupted to some extent, requiring a full scp -> ... -> pgdump import process; the second time it didn't), then usage on the main database PVC should drop a lot as the WAL segments get cleared out. (this doesn't seem to happen for the repo PVC as well, unfortunately; EDIT: see the second comment in this thread for the apparent explanation)
  • 5) Restart the app-server, to confirm that it works again. (it should, if step 4 succeeded)
  • 6) You should now try to do a pgdump of the contents. See: https://github.com/debate-map/app#pg-dump
    • Note: Atm the option-1 pgdump approach (nodejs script) is failing for the prod cluster, since the database has too much data for the HTTP request to complete before nginx times it out. Need to fix this. For now, use option 4.
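
For reference, here's a minimal sketch of steps 1-3 done through the Kubernetes Python client instead of the Lens UI. The namespace, PVC names, and pod names are placeholder assumptions, not the cluster's real object names.

```python
# Minimal sketch of steps 1-3 using the Kubernetes Python client instead of Lens.
# The namespace, PVC names, and pod names are placeholders; look up the real ones first.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "postgres-operator"                                   # assumption
PVCS = ["debate-map-instance1-xxxx-pgdata", "debate-map-repo1"]   # hypothetical names
PODS = ["debate-map-instance1-xxxx-0", "debate-map-repo-host-0"]  # hypothetical names

# Step 1: bump each PVC's requested size (e.g. from 10000Mi to 10100Mi).
for pvc in PVCS:
    core.patch_namespaced_persistent_volume_claim(
        name=pvc, namespace=NAMESPACE,
        body={"spec": {"resources": {"requests": {"storage": "10100Mi"}}}})

# Step 2: kill the pods using those PVCs, so the filesystem resize can complete.
for pod in PODS:
    core.delete_namespaced_pod(name=pod, namespace=NAMESPACE)

# Step 3 (partial): check each PVC's events to confirm the resize took effect
# (the scripted equivalent of running describe on the PVCs).
for pvc in PVCS:
    events = core.list_namespaced_event(
        NAMESPACE, field_selector=f"involvedObject.name={pvc}")
    for event in events.items:
        print(pvc, event.reason, event.message)
```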

Also see: #331 (comment)

Other misc. info from DM

Btw: The issue of the PVC getting to 100% space usage happened again. Thankfully this time it did not corrupt the database, so I was able to fix it by simply increasing the PVC's size, restarting the database pod, and then restarting the app-server. After that, the 100% usage (from the pg_wal folder, like before) went down to ~20%, presumably because whatever was causing the WAL to stick around got disconnected, letting the WAL segments get cleaned up.

However, this is of course a terrible thing to keep happening.

Some remediation plans:

  1. Detection: Make space usage more observable. I want to get emails set up at some point, but for now I added this little display to my custom taskbar panel: (it updates by sending a graphql query to the monitor backend once per minute) [image]
  2. Root cause: Discover whatever is causing the database to keep its WAL segments from being cleaned up, and resolve it.
    Possibly it is my logical-replication slot, maybe after an app-server crash or something.
    But possibly it's some side-effect of the pgbackrest backups getting broken. (I discovered that after we restored from backup on June 25th, the next day the pgbackrest backups started working normally again. They kept working until July 20th. Maybe that's the point where postgres realized the backups were failing, and so it started keeping all WAL segments until the pgbackrest backups could complete, similar to here: https://www.crunchydata.com/blog/postgres-is-out-of-disk-and-how-to-recover-the-dos-and-donts#broken-archives) Some diagnostic queries for both suspects are sketched after this list.
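
To narrow down between those two suspects, the relevant Postgres system views are pg_replication_slots (an inactive or lagging replication slot pins all WAL after its restart_lsn) and pg_stat_archiver (failed WAL archiving, e.g. pgbackrest archive-push errors, also forces WAL to be kept). A minimal diagnostic sketch, with the connection parameters as placeholder assumptions:

```python
# Check the two usual suspects for WAL retention, against the main db instance.
# The connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect("host=localhost port=5432 dbname=debate-map user=postgres")  # assumption
cur = conn.cursor()

# Suspect 1: a logical-replication slot (e.g. the app-server's) that is inactive
# or far behind keeps every WAL segment after its restart_lsn.
cur.execute("""
    SELECT slot_name, slot_type, active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots
    ORDER BY restart_lsn;
""")
for row in cur.fetchall():
    print("slot:", row)

# Suspect 2: broken WAL archiving (pgbackrest archive-push failing) also forces
# Postgres to keep WAL until archiving succeeds again.
cur.execute("""
    SELECT archived_count, last_archived_wal, last_archived_time,
           failed_count, last_failed_wal, last_failed_time
    FROM pg_stat_archiver;
""")
print("archiver:", cur.fetchone())

conn.close()
```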

Other notes:

  • The PVC size-increasing worked for the main database pod+pvc, but it was more complicated for the "repo-host" / in-cluster database copy, as seen in the screenshot above. (not exactly sure what that repo1 is, but anyway, its large folder is /pgbackrest/archive rather than the /pgdata/pg_wal seen on the main postgres database pod)
    • More specifically, the size increase worked, but the WAL data did not clear out in that repo-host PVC like it did for the main database PVC.

Venryx commented Aug 5, 2024

Update: After I fixed the 2nd crash (mentioned in the original post above) by increasing the PVC size by ~500mb, the pgbackrest backups appear to have started up again the next day (July 26th).

My current understanding of the problem is the following: (edited as I've learned more)

  • 1) The "repo" PVC gets 100% filled. (I think this is the "repo1" in the yaml)
    • Reason: The in-cluster "repo1" had no backup schedule or retention policy set, so the defaults were used.
      • The default backup schedule is just to make one initial full-backup at the time of pgo-init/pvc-reset. (along with WAL segments that just get added as changes are made)
      • The default retention policy is just to keep all WAL segments since the last non-expired full-backup. Combined with the default backup schedule above, this means keeping all WAL segments for all db changes since pgo-init/pvc-reset.
  • 2) The main db PVC sees that its backup to the "repo" pod/PVC is failing, so it starts buffering up all WAL segments. (so that no history is lost for the repo-pvc backup) [this could also/instead be a side-effect of 3 below, i.e. the remote pgbackrest backup failing]
  • 3) Due to either 1 or 2, the pgbackrest backups to the remote pgbackrest repo (I think the "repo2" in the yaml) start failing. (there's a small chance that it is "making a backup" but sees no changes in repo1 to back up to repo2; most likely it's just failing though)

So the biggest red flag to resolve atm seems to be that the "repo" PVC (repo1) is taking up way more storage space than the main db PVC itself (even after allowing plenty of time for it to "do its thing" and clear out unneeded data).

  • EDIT: with newer understanding, this makes sense; the main db can clear out its WAL since it only needs to keep it long enough to send it to the "repo" pod/pvc, whereas the "repo" pod/pvc keeps those WAL segments forever, due to the lack of sane settings for its backup-schedule + retention-policy. (a quick way to inspect what the repo is actually retaining is sketched below)
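
For reference, one way to inspect what the repo is actually retaining (which backups exist, and how wide the retained WAL archive range is) is to run pgbackrest info inside the repo-host pod. A minimal sketch using the Kubernetes Python client's exec support; the namespace, pod, and container names are placeholder assumptions:

```python
# Sketch: run `pgbackrest info` in the repo-host pod to see the backups and the
# WAL archive range it is retaining. Namespace/pod/container names are assumptions.
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core = client.CoreV1Api()

output = stream(
    core.connect_get_namespaced_pod_exec,
    name="debate-map-repo-host-0",   # assumption
    namespace="postgres-operator",   # assumption
    container="pgbackrest",          # assumption (PGO repo-host's pgbackrest container)
    command=["pgbackrest", "info"],
    stderr=True, stdin=False, stdout=True, tty=False)
print(output)
```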

Venryx commented Sep 10, 2024

More relevant links:

Venryx commented Sep 10, 2024

Update: I tried adding a valid backup-schedule and retention-policy for the in-cluster "repo1", and this fixed the ballooning of its PVC's usage! (Within a minute of the new full-backup completing [which I triggered manually using Lens], the PVC's usage dropped from ~11gb to 625.5mb! This roughly matches what I would expect, since a [less compact] pgdump is ~800mb atm.) A sketch of that sort of config change is included below.
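
For reference, a minimal sketch of what that kind of change can look like on a Crunchy-PGO-managed PostgresCluster: the schedule lives under spec.backups.pgbackrest.repos[].schedules, and retention under spec.backups.pgbackrest.global. The cluster/namespace names, cron schedule, and retention count below are placeholder assumptions rather than the exact values used.

```python
# Sketch: add a backup schedule + retention policy for repo1 on a Crunchy-PGO
# PostgresCluster. Cluster/namespace names and the schedule/retention values
# are placeholder assumptions.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "postgres-operator.crunchydata.com", "v1beta1", "postgresclusters"
NAME, NAMESPACE = "debate-map", "postgres-operator"  # assumptions

cluster = custom.get_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME)
pgbackrest = cluster["spec"]["backups"]["pgbackrest"]

# Retention: keep only the 2 most recent full backups (and the WAL needed for them).
pgbackrest.setdefault("global", {}).update({
    "repo1-retention-full": "2",
    "repo1-retention-full-type": "count",
})

# Schedule: e.g. a full backup of repo1 every night at 01:00 (cron syntax).
for repo in pgbackrest["repos"]:
    if repo["name"] == "repo1":
        repo["schedules"] = {"full": "0 1 * * *"}

# Patch the whole fetched-and-modified object back, so the repos list (which also
# contains repo2) isn't clobbered by a partial merge patch.
custom.patch_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME, cluster)
```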

Now I'm curious why the main db pvc is at 2.4gb while the repo1 pvc is so much lower... (this is within expectations though, since the main db may need extra space for storing indexes, have other WAL keep-alive requirements, etc.)

I think this means this thread's issue is now fixed. But I will keep it open for several more months first, to see if it happens again.

Venryx added a commit that referenced this issue Sep 10, 2024
…orage usage of the db itself's pvc), by setting a valid backup-schedule and retention-policy for it. See: #338 (comment)

This probably fixes issue 338, but I'll wait to mark it as fixed until some more time has passed without incident.