[Bug] pg_rewind: fatal: could not find previous WAL record #803

Open
antoniopetrole opened this issue Dec 19, 2024 · 7 comments
Labels
type: Bug Something isn't working

Comments

@antoniopetrole
Member

Apache Cloudberry version

1.6.0

What happened

Hey all, our cluster has had a number of segments go down over the past month. Typically this isn't a major issue since we can just run gprecoverseg incrementally and things are back up within the hour. However, almost every time segments have gone down, the incremental recovery has failed with the error

pg_rewind: fatal: could not find previous WAL record

I'm not sure if the default WAL settings are insufficient, in the sense that WAL files are being recycled too quickly, or if this is a bug. To provide some additional context, here's a timeline of events for the segments that crashed on 12/16:

12/16/2024 22:08:13: Segment 171 crashes due to a segfault from a query (I'll be opening another issue for this today)
12/16/2024 22:09:33: Segment 51 crashes due to a segfault from the same query
12/16/2024 22:18:38: We run gprecoverseg (no args); it shows it's only going to recover segment 51, and it fails with the following error (see the sketch after this timeline for a way to check whether the missing WAL file is really gone)

pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 22DB/87CEBDC0 on timeline 3
pg_rewind: error: could not open file "/data4/primary/gpseg51/pg_wal/00000003000022DB0000001B": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 22DB/6FFFFFF8

12/16/2024 22:19:29: We try running gprecoverseg (no args) again and receive the exact same error as above. It also says it's only recovering segment 51
12/16/2024 22:28:01: We try running gprecoverseg -F -a and get the following error

pg_basebackup: error: connection to server at "10.9.1.232", port 5003 failed: FATAL:  the database system is resetting

12/16/2024 22:40:04: We try running gprecoverseg (no args) again, and now it shows it's going to try to recover both segment 51 and segment 171, but it ends up with the same exact error as before
12/16/2024 22:48:33: 6 more segments go down due to a segfault from the same query (segments 187,19,138,108,143,57)
12/16/2024 22:56:33: 23 more segments go down due to a segfault from the same query (segments 60,129,53,144,131,62,63,175,54,107,71,105,45,160,12,47,177,161,180,70,178,61,147)
12/16/2024 23:42:40: We successfully recover segment 70 with incremental (the duration was 34 minutes)
12/17/2024 1:20:45: We successfully recover 8 more segments incrementally (segments 60,53,62,54,71,107,12,47)
12/17/2024 1:30:12: We try recovering 19 more segments incrementally (19,57,61,63,105,108,129,131,138,143,144,147,160,161,171,175,177,178,180) and receive the following error

pg_rewind: connected to server
pg_rewind: executing "/usr/local/cloudberry-db-1.6.0/bin/postgres" for target server to complete crash recovery
LOG:  gp_role forced to 'utility' in single-user mode
2024-12-17 01:30:34.368316 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","database system was interrupted; last known up at 2024-12-16 23:36:51 UTC",,,,,,,,"StartupXLOG","xlog.c",6819,
2024-12-17 01:30:34.383430 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","Synchronization of the wal directory starts.",,,,,,,,"SyncAllXLogFiles","fd.c",3446,
2024-12-17 01:30:34.383713 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","synchronization of the wal directory finishes.",,,,,,,,"SyncAllXLogFiles","fd.c",3448,
2024-12-17 01:30:34.513928 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,,"StartupXLOG","xlog.c",7388,
2024-12-17 01:31:00.783898 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","redo starts at 22D8/8C750D40",,,,,,,,"StartupXLOG","xlog.c",7677,
2024-12-17 01:31:00.783968 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","invalid record length at 22D8/8C750D78: wanted 24, got 0",,,,,,,,"ReadRecord","xlog.c",4484,
2024-12-17 01:31:00.784030 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","redo done at 22D8/8C750D40 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s",,,,,,,,"StartupXLOG","xlog.c",7966,

PostgreSQL stand-alone backend 14.4
backend> pg_rewind: servers diverged at WAL location 22D8/8C73A0B8 on timeline 1
pg_rewind: error: could not open file "/data1/primary/gpseg108/pg_wal/00000001000022D80000001D": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 22D8/77FFFFA8

12/17/2024 3:02:46: We end up recovering 13 specified segments incrementally according to the logs (61,63,105,131,144,147,160,161,171,175,177,178,180)
12/17/2024 4:09:44: On our next run we try to recover more segments and end up recovering duplicates from our last run (it seems like they didn't recover successfully even though gprecoverseg said they did? I need to double-check this): (12,47,53,54,60,61,62,63,70,71,105,107,131,144,147,160,161,171,175,177,178,180)
12/17/2024 4:18:41: We kick off a FULL recovery for the remaining segments, which works after running for 10 hours (segments 19,45,51,57,108,129,138,143,187)
12/17/2024 19:02:03: We run a rebalance and the cluster is back in a healthy state
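
As referenced in the timeline above, here's a minimal sketch of the kind of check that could confirm whether the WAL file pg_rewind wanted had actually been recycled. The paths, timeline, and LSN are taken from the seg51 error above and would differ per segment; these are standard PostgreSQL tools, not something we necessarily ran during the incident.

# On the failed primary's host, see whether the file pg_rewind complained about still exists
ls -l /data4/primary/gpseg51/pg_wal/ | grep 00000003000022DB

# If it does, dump a few WAL records starting at the divergence LSN pg_rewind reported
pg_waldump -p /data4/primary/gpseg51/pg_wal -t 3 -s 22DB/87CEBDC0 -n 10

If pg_waldump fails in the same way pg_rewind does, that would point at the file having been recycled before the incremental recovery started.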

I know this is a lot of information to digest; I've linked a spreadsheet below that shows the timeline of events along with the error messages. The main thing I want to call out is that we received this WAL error on segments 51, 187, 45, 108, 57, and 19, yet a bunch of the other segments that crashed at the same time were able to do an incremental recovery with no issues. We effectively had to eliminate potential problem segments from our gprecoverseg input file one at a time to figure out what was going wrong. What's also strange is that our first incremental recovery attempt came only 9 minutes after the segment went down and we still hit this issue, whereas the final incremental that worked ran almost 5 hours after those segments went down, so I'm not sure how valid the WAL rotation theory is.
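
On the rotation theory, the retention-related settings are easy to inspect; here's a sketch using standard Greenplum/Cloudberry and PostgreSQL 14 tooling (these commands only display the current values):

# Show WAL retention settings for the coordinator and segments
gpconfig -s wal_keep_size
gpconfig -s max_wal_size
gpconfig -s max_slot_wal_keep_size

# Or check a single value through the coordinator
psql -d postgres -c 'SHOW wal_keep_size;'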

I'm happy to provide any clarifications/logs that might be needed for this; here is a more detailed timeline for review:
https://docs.google.com/spreadsheets/d/1qJNC88rFAD6OzF2b0fEeNgAX57QdnKY8SZhnhbg9q6g/edit?usp=sharing

What you think should happen instead

The expected behavior here would be that there aren't inconsistencies in WAL file availability among a group of segments that all failed at the same time

How to reproduce

Unfortunately I'm not sure how to even reproduce this

Operating System

Rocky Linux 8.10 (Green Obsidian)

Anything else

This has happened 3 times in the past month, requiring a full recovery for either all downed segments or a subset of them

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

@antoniopetrole antoniopetrole added the type: Bug Something isn't working label Dec 19, 2024
@edespino
Contributor

FYI: technically, this is a pre-Apache Cloudberry (Incubating) release. The project has not had its initial Apache release yet.

@edespino
Contributor

@antoniopetrole - the error messages indicate there were segfaults. Are your systems configured to capture core files?
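
(In case it helps, a quick way to check that core capture is configured; these are standard Linux/systemd commands, nothing Cloudberry-specific, and the exact setup is system-dependent:)

ulimit -c                           # core size limit for the user running the segments; 0 means disabled
cat /proc/sys/kernel/core_pattern   # where cores are written, or whether they're piped to abrt/systemd-coredump
coredumpctl list postgres           # if systemd-coredump is in use (common on Rocky Linux)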

@antoniopetrole
Member Author

@edespino Yes, we have specific directories where core files are dumped, and I believe we have the core files for every time segments have gone down since we deployed the cluster. I'm going to include more information about that in my next issue regarding the segfault itself.
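
For reference, a minimal sketch of pulling a backtrace out of one of those cores; the core file path below is a placeholder for an actual dump from our core directory, and the postgres binary path is the one from the pg_rewind output above:

# <core-file> is a placeholder; substitute an actual dump from the core directory
gdb -batch -ex 'bt full' /usr/local/cloudberry-db-1.6.0/bin/postgres <core-file>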

@edespino
Contributor

@antoniopetrole I am preparing a tool to assist with the core file review - extracting relevant information. I should have a version to share shortly.

@edespino
Contributor

To assist with core file analysis, I created the following repository containing apache_cloudberry_core_analyzer.py:

https://github.com/cloudberry-contrib/support-tools

@my-ship-it
Contributor

Please hold on; engineers will follow up to help with this.

@foreyes
Contributor

foreyes commented Dec 20, 2024

@antoniopetrole Thanks for reporting this complicated issue; we are going to investigate it soon.
BTW, could you please @ me when the other issue is opened?
