Skip to content

Commit 63f4182

Browse files
author
Heikki Linnakangas
committed
Fix initialization of the WAL buffer at startup
There were two bugs: 1. We initialized the page headers on the allocated temporary WAL page even if the end-of-log was precisely at page boundary. That means we wrote beyond the end of the allocation. That readily gives an assertion failure on debug-enabled builds. Add test case and fix. 2. We initialize the WAL buffer by copying the contents of the WAL read buffer. The idea is that when we stop reading WAL, the last buffer in the reader becomes the new WAL buffer we'll write to. That's how PostgreSQL does it too, see code around comment "Tricky point here" in StartupXLOG(). However, that doesn't work with Neon. The startup procedure is a little different: we don't do normal WAL recovery and we don't read any WAL at startup, except when promoting a read replica. Vanilla PostgreSQL always reads WAL: it reads the last checkpoint record from the WAL if nothing else, but in Neon we don't necessarily read even that. In that case, the xlogreader's read buffer is still uninitialized by the time that we copy it. That's relatively harmless, the only consequence is that the initial WAL segment on local disk can contain garbage before the first WAL record that we write. That's why we haven't noticed until now. Furthermore, it seems that the uninitialized memory just happens to be all-zeros. However, it now caused the test_pg_waldump.py test to fail with the new communicator implementation. That was very coincidental - the new communicator process isn't even running yet when the WAL buffer is initialized. It seems to have changed the memory allocation just so that the uninitialized memory is no longer all-zeros. That's normally harmless too, but it makes the pg_waldump test to fail: pg_waldump, with the --ignore option, starts reading the WAL from the first non-zero bytes, so when the uninitialized portion was filled with garbage rather than zeros, it fails. This little patch to poison the allocated buffer with garbage was helpful while debugging, to make the test fail in a repeatable fashion with or without the new communicator: ``` diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 988be3f..2f4844c2b86 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -97,6 +97,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, */ state->readBuf = (char *) palloc_extended(XLOG_BLCKSZ, MCXT_ALLOC_NO_OOM); + memset(state->readBuf, 0x7e, XLOG_BLCKSZ); if (!state->readBuf) { pfree(state); ```
1 parent 46fed80 commit 63f4182

File tree

1 file changed

+57
-62
lines changed

1 file changed

+57
-62
lines changed

src/backend/access/transam/xlogrecovery.c

Lines changed: 57 additions & 62 deletions
Original file line numberDiff line numberDiff line change
@@ -1570,6 +1570,16 @@ FinishWalRecovery(void)
15701570
Assert(!WalRcvStreaming());
15711571
StandbyMode = false;
15721572

1573+
/*
1574+
* We cannot start generating new WAL if we don't have a valid prev-LSN
1575+
* to use for the first new WAL record. (Shouldn't happen.)
1576+
*/
1577+
if (NeonRecoveryRequested &&!neonWriteOk)
1578+
ereport(ERROR,
1579+
(errmsg("cannot start in read-write mode from this base backup")));
1580+
1581+
// FIXME: should we unlink neon.signal?
1582+
15731583
/*
15741584
* Determine where to start writing WAL next.
15751585
*
@@ -1633,83 +1643,68 @@ FinishWalRecovery(void)
16331643
}
16341644
}
16351645

1636-
/*
1637-
* When starting from a neon base backup, we don't have WAL. Initialize
1638-
* the WAL page where we will start writing new records from scratch,
1639-
* instead.
1640-
*/
1641-
if (NeonRecoveryRequested)
1642-
{
1643-
if (!neonWriteOk)
1644-
{
1645-
/*
1646-
* We cannot start generating new WAL if we don't have a valid prev-LSN
1647-
* to use for the first new WAL record. (Shouldn't happen.)
1648-
*/
1649-
ereport(ERROR,
1650-
(errmsg("cannot start in read-write mode from this base backup")));
1651-
}
1652-
else
1653-
{
1654-
int offs = endOfLog % XLOG_BLCKSZ;
1655-
XLogRecPtr pageBeginPtr = endOfLog - offs;
1656-
bool isLongHeader = (pageBeginPtr % wal_segment_size) == 0;
1657-
int lastPageSize = isLongHeader ? SizeOfXLogLongPHD : SizeOfXLogShortPHD;
1658-
char *page = palloc0(offs);
1659-
XLogPageHeader xlogPageHdr = (XLogPageHeader) page;
1660-
1661-
memcpy(page, xlogreader->readBuf, offs);
1662-
if (xlogPageHdr->xlp_magic != XLOG_PAGE_MAGIC)
1663-
{
1664-
xlogPageHdr->xlp_pageaddr = pageBeginPtr;
1665-
xlogPageHdr->xlp_magic = XLOG_PAGE_MAGIC;
1666-
xlogPageHdr->xlp_tli = recoveryTargetTLI;
1667-
xlogPageHdr->xlp_info = 0;
1668-
/*
1669-
* If we start writing with offset from page beginning, pretend in
1670-
* page header there is a record ending where actual data will
1671-
* start.
1672-
*/
1673-
xlogPageHdr->xlp_rem_len = offs - lastPageSize;
1674-
if (xlogPageHdr->xlp_rem_len > 0)
1675-
xlogPageHdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
1676-
readOff = XLogSegmentOffset(pageBeginPtr, wal_segment_size);
1677-
1678-
if (isLongHeader)
1679-
{
1680-
XLogLongPageHeader longHdr = (XLogLongPageHeader) page;
1681-
1682-
longHdr->xlp_sysid = GetSystemIdentifier();
1683-
longHdr->xlp_seg_size = wal_segment_size;
1684-
longHdr->xlp_xlog_blcksz = XLOG_BLCKSZ;
1646+
elog(LOG, "Continue writing WAL at %X/%X", LSN_FORMAT_ARGS(endOfLog));
16851647

1686-
xlogPageHdr->xlp_info |= XLP_LONG_HEADER;
1687-
}
1688-
}
1689-
result->lastPageBeginPtr = pageBeginPtr;
1690-
result->lastPage = page;
1691-
elog(LOG, "Continue writing WAL at %X/%X", LSN_FORMAT_ARGS(xlogreader->EndRecPtr));
1692-
1693-
// FIXME: should we unlink neon.signal?
1694-
}
1695-
}
16961648
/*
16971649
* Copy the last partial block to the caller, for initializing the WAL
16981650
* buffer for appending new WAL.
16991651
*/
1700-
else if (endOfLog % XLOG_BLCKSZ != 0)
1652+
if (endOfLog % XLOG_BLCKSZ != 0)
17011653
{
17021654
char *page;
17031655
int len;
17041656
XLogRecPtr pageBeginPtr;
17051657

17061658
pageBeginPtr = endOfLog - (endOfLog % XLOG_BLCKSZ);
1707-
Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
17081659

17091660
/* Copy the valid part of the last block */
17101661
len = endOfLog % XLOG_BLCKSZ;
17111662
page = palloc(len);
1712-
memcpy(page, xlogreader->readBuf, len);
1663+
1664+
/*
1665+
* With neon, it's possible that we start without having read any WAL
1666+
* whatsoever. In that case, initialize the WAL page where we will
1667+
* start writing new records from scratch, instead.
1668+
*/
1669+
if (NeonRecoveryRequested && endOfLog == RedoStartLSN)
1670+
{
1671+
bool isLongHeader = (pageBeginPtr % wal_segment_size) == 0;
1672+
int lastPageSize = isLongHeader ? SizeOfXLogLongPHD : SizeOfXLogShortPHD;
1673+
XLogPageHeader xlogPageHdr = (XLogPageHeader) page;
1674+
1675+
Assert(len >= lastPageSize);
1676+
1677+
xlogPageHdr->xlp_pageaddr = pageBeginPtr;
1678+
xlogPageHdr->xlp_magic = XLOG_PAGE_MAGIC;
1679+
xlogPageHdr->xlp_tli = recoveryTargetTLI;
1680+
xlogPageHdr->xlp_info = 0;
1681+
/*
1682+
* If we start writing with offset from page beginning, pretend in
1683+
* page header there is a record ending where actual data will
1684+
* start.
1685+
*/
1686+
xlogPageHdr->xlp_rem_len = len - lastPageSize;
1687+
if (xlogPageHdr->xlp_rem_len > 0)
1688+
xlogPageHdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
1689+
readOff = XLogSegmentOffset(pageBeginPtr, wal_segment_size);
1690+
1691+
if (isLongHeader)
1692+
{
1693+
XLogLongPageHeader longHdr = (XLogLongPageHeader) page;
1694+
1695+
longHdr->xlp_sysid = GetSystemIdentifier();
1696+
longHdr->xlp_seg_size = wal_segment_size;
1697+
longHdr->xlp_xlog_blcksz = XLOG_BLCKSZ;
1698+
1699+
xlogPageHdr->xlp_info |= XLP_LONG_HEADER;
1700+
}
1701+
}
1702+
else
1703+
{
1704+
Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
1705+
1706+
memcpy(page, xlogreader->readBuf, len);
1707+
}
17131708

17141709
result->lastPageBeginPtr = pageBeginPtr;
17151710
result->lastPage = page;

0 commit comments

Comments
 (0)