Skip to content

Commit 4bcd647

Browse files
committed
Improve stability of btree page split on ERRORs
This improves the stability of VACUUM when processing btree indexes, which was previously able to trigger an assertion failure in _bt_lock_subtree_parent() when an error was previously thrown outside the scope of _bt_split() when splitting a btree page. VACUUM would consider the index as in a corrupted state as the right page would not be zeroed for the error thrown (allocation failure is one pattern). In a non-assert build, VACUUM is able to succeed, reporting what it sees as a corruption while attempting to fix the index. This would manifest as a LOG message, as of: LOG: failed to re-find parent key in index "idx" for deletion target page N CONTEXT: while vacuuming index "idx" of relation "public.tab" This commit improves the code to rely on two PGAlignedBlocks that are used as a temporary space for the left and right pages. The main change concerns the right page, whose contents are now copied into the "temporary" PGAlignedBlock page while its original space is zeroed. Its contents are moved from the PGAlignedBlock page back to the page once we enter in the critical section used for the split. This simplifies the split logic, as it is not necessary to zero the right page before throwing an error anymore. Hence errors can now be thrown outside the split code. For the left page, this shaves one allocation, with PageGetTempPage() being previously used. The previous logic originates from commit 8fa30f9, at a point where PGAlignedBlock did not exist yet. This could be argued as something that should be backpatched, but the lack of complaints indicates that it may not be necessary. Author: Konstantin Knizhnik <knizhnik@garret.ru> Discussion: https://postgr.es/m/566dacaf-5751-47e4-abc6-73de17a5d42a@garret.ru
1 parent 88e17d7 commit 4bcd647

File tree

1 file changed

+30
-21
lines changed

1 file changed

+30
-21
lines changed

src/backend/access/nbtree/nbtinsert.c

Lines changed: 30 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1469,6 +1469,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
14691469
Page origpage;
14701470
Page leftpage,
14711471
rightpage;
1472+
PGAlignedBlock leftpage_buf,
1473+
rightpage_buf;
14721474
BlockNumber origpagenumber,
14731475
rightpagenumber;
14741476
BTPageOpaque ropaque,
@@ -1542,8 +1544,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
15421544
firstrightoff = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
15431545
newitem, &newitemonleft);
15441546

1545-
/* Allocate temp buffer for leftpage */
1546-
leftpage = PageGetTempPage(origpage);
1547+
/* Use temporary buffer for leftpage */
1548+
leftpage = leftpage_buf.data;
15471549
_bt_pageinit(leftpage, BufferGetPageSize(buf));
15481550
lopaque = (BTPageOpaque) PageGetSpecialPointer(leftpage);
15491551

@@ -1706,19 +1708,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
17061708

17071709
/*
17081710
* Acquire a new right page to split into, now that left page has a new
1709-
* high key. From here on, it's not okay to throw an error without
1710-
* zeroing rightpage first. This coding rule ensures that we won't
1711-
* confuse future VACUUM operations, which might otherwise try to re-find
1712-
* a downlink to a leftover junk page as the page undergoes deletion.
1711+
* high key.
17131712
*
1714-
* It would be reasonable to start the critical section just after the new
1715-
* rightpage buffer is acquired instead; that would allow us to avoid
1716-
* leftover junk pages without bothering to zero rightpage. We do it this
1717-
* way because it avoids an unnecessary PANIC when either origpage or its
1718-
* existing sibling page are corrupt.
1719-
*/
1713+
* To not confuse future VACUUM operations, we zero the right page and
1714+
* work on an in-memory copy of it before writing WAL, then copy its
1715+
* contents back to the actual page once we start the critical section
1716+
* work. This simplifies the split work, so as there is no need to zero
1717+
* the right page before throwing an error.
1718+
*/
17201719
rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
1721-
rightpage = BufferGetPage(rbuf);
1720+
rightpage = rightpage_buf.data;
1721+
1722+
/*
1723+
* Copy the contents of the right page into its temporary location, and
1724+
* zero the original space.
1725+
*/
1726+
memcpy(rightpage, BufferGetPage(rbuf), BLCKSZ);
1727+
memset(BufferGetPage(rbuf), 0, BLCKSZ);
17221728
rightpagenumber = BufferGetBlockNumber(rbuf);
17231729
/* rightpage was initialized by _bt_getbuf */
17241730
ropaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
@@ -1767,7 +1773,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
17671773
if (PageAddItem(rightpage, (Item) righthighkey, itemsz, afterrightoff,
17681774
false, false) == InvalidOffsetNumber)
17691775
{
1770-
memset(rightpage, 0, BufferGetPageSize(rbuf));
17711776
elog(ERROR, "failed to add high key to the right sibling"
17721777
" while splitting block %u of index \"%s\"",
17731778
origpagenumber, RelationGetRelationName(rel));
@@ -1815,7 +1820,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
18151820
if (!_bt_pgaddtup(leftpage, newitemsz, newitem, afterleftoff,
18161821
false))
18171822
{
1818-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18191823
elog(ERROR, "failed to add new item to the left sibling"
18201824
" while splitting block %u of index \"%s\"",
18211825
origpagenumber, RelationGetRelationName(rel));
@@ -1828,7 +1832,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
18281832
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
18291833
afterrightoff == minusinfoff))
18301834
{
1831-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18321835
elog(ERROR, "failed to add new item to the right sibling"
18331836
" while splitting block %u of index \"%s\"",
18341837
origpagenumber, RelationGetRelationName(rel));
@@ -1842,7 +1845,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
18421845
{
18431846
if (!_bt_pgaddtup(leftpage, itemsz, dataitem, afterleftoff, false))
18441847
{
1845-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18461848
elog(ERROR, "failed to add old item to the left sibling"
18471849
" while splitting block %u of index \"%s\"",
18481850
origpagenumber, RelationGetRelationName(rel));
@@ -1854,7 +1856,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
18541856
if (!_bt_pgaddtup(rightpage, itemsz, dataitem, afterrightoff,
18551857
afterrightoff == minusinfoff))
18561858
{
1857-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18581859
elog(ERROR, "failed to add old item to the right sibling"
18591860
" while splitting block %u of index \"%s\"",
18601861
origpagenumber, RelationGetRelationName(rel));
@@ -1875,7 +1876,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
18751876
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
18761877
afterrightoff == minusinfoff))
18771878
{
1878-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18791879
elog(ERROR, "failed to add new item to the right sibling"
18801880
" while splitting block %u of index \"%s\"",
18811881
origpagenumber, RelationGetRelationName(rel));
@@ -1895,7 +1895,6 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
18951895
sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
18961896
if (sopaque->btpo_prev != origpagenumber)
18971897
{
1898-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18991898
ereport(ERROR,
19001899
(errcode(ERRCODE_INDEX_CORRUPTED),
19011900
errmsg_internal("right sibling's left-link doesn't match: "
@@ -1938,9 +1937,19 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
19381937
* original. We need to do this before writing the WAL record, so that
19391938
* XLogInsert can WAL log an image of the page if necessary.
19401939
*/
1941-
PageRestoreTempPage(leftpage, origpage);
1940+
memcpy(origpage, leftpage, BLCKSZ);
19421941
/* leftpage, lopaque must not be used below here */
19431942

1943+
/*
1944+
* Move the contents of the right page from its temporary location to the
1945+
* destination buffer, before writing the WAL record. Unlike the left
1946+
* page, the right page and its opaque area are still needed to complete
1947+
* the update of the page, so reinitialize them.
1948+
*/
1949+
rightpage = BufferGetPage(rbuf);
1950+
memcpy(rightpage, rightpage_buf.data, BLCKSZ);
1951+
ropaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
1952+
19441953
MarkBufferDirty(buf);
19451954
MarkBufferDirty(rbuf);
19461955

0 commit comments

Comments
 (0)