Skip to content

Commit 165f042

Browse files
author
Konstantin Knizhnik
committed
Improve stability of btree page split on ERRORs
This improves the stability of VACUUM when processing btree indexes, which was previously able to trigger an assertion failure in _bt_lock_subtree_parent() when an error was previously thrown outside the scope of _bt_split() when splitting a btree page. VACUUM would consider the index as in a corrupted state as the right page would not be zeroed for the error thrown (allocation failure is one pattern). In a non-assert build, VACUUM is able to succeed, reporting what it sees as a corruption while attempting to fix the index. This would manifest as a LOG message, as of: LOG: failed to re-find parent key in index "idx" for deletion target page N CONTEXT: while vacuuming index "idx" of relation "public.tab" This commit improves the code to rely on two PGAlignedBlocks that are used as a temporary space for the left and right pages. The main change concerns the right page, whose contents are now copied into the "temporary" PGAlignedBlock page while its original space is zeroed. Its contents are moved from the PGAlignedBlock page back to the page once we enter in the critical section used for the split. This simplifies the split logic, as it is not necessary to zero the right page before throwing an error anymore. Hence errors can now be thrown outside the split code. For the left page, this shaves one allocation, with PageGetTempPage() being previously used. The previous logic originates from commit 8fa30f9, at a point where PGAlignedBlock did not exist yet. This could be argued as something that should be backpatched, but the lack of complaints indicates that it may not be necessary. Author: Konstantin Knizhnik <knizhnik@garret.ru> Discussion: https://postgr.es/m/566dacaf-5751-47e4-abc6-73de17a5d42a@garret.ru
1 parent 10b394a commit 165f042

File tree

1 file changed

+29
-20
lines changed

1 file changed

+29
-20
lines changed

src/backend/access/nbtree/nbtinsert.c

Lines changed: 29 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -1472,6 +1472,8 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
14721472
Page origpage;
14731473
Page leftpage,
14741474
rightpage;
1475+
PGAlignedBlock leftpage_buf,
1476+
rightpage_buf;
14751477
BlockNumber origpagenumber,
14761478
rightpagenumber;
14771479
BTPageOpaque ropaque,
@@ -1545,8 +1547,8 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
15451547
firstrightoff = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
15461548
newitem, &newitemonleft);
15471549

1548-
/* Allocate temp buffer for leftpage */
1549-
leftpage = PageGetTempPage(origpage);
1550+
/* Use temporary buffer for leftpage */
1551+
leftpage = leftpage_buf.data;
15501552
_bt_pageinit(leftpage, BufferGetPageSize(buf));
15511553
lopaque = BTPageGetOpaque(leftpage);
15521554

@@ -1709,19 +1711,23 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
17091711

17101712
/*
17111713
* Acquire a new right page to split into, now that left page has a new
1712-
* high key. From here on, it's not okay to throw an error without
1713-
* zeroing rightpage first. This coding rule ensures that we won't
1714-
* confuse future VACUUM operations, which might otherwise try to re-find
1715-
* a downlink to a leftover junk page as the page undergoes deletion.
1714+
* high key.
17161715
*
1717-
* It would be reasonable to start the critical section just after the new
1718-
* rightpage buffer is acquired instead; that would allow us to avoid
1719-
* leftover junk pages without bothering to zero rightpage. We do it this
1720-
* way because it avoids an unnecessary PANIC when either origpage or its
1721-
* existing sibling page are corrupt.
1716+
* To not confuse future VACUUM operations, we zero the right page and
1717+
* work on an in-memory copy of it before writing WAL, then copy its
1718+
* contents back to the actual page once we start the critical section
1719+
* work. This simplifies the split work, so as there is no need to zero
1720+
* the right page before throwing an error.
17221721
*/
17231722
rbuf = _bt_allocbuf(rel, heaprel);
1724-
rightpage = BufferGetPage(rbuf);
1723+
rightpage = rightpage_buf.data;
1724+
1725+
/*
1726+
* Copy the contents of the right page into its temporary location, and
1727+
* zero the original space.
1728+
*/
1729+
memcpy(rightpage, BufferGetPage(rbuf), BLCKSZ);
1730+
memset(BufferGetPage(rbuf), 0, BLCKSZ);
17251731
rightpagenumber = BufferGetBlockNumber(rbuf);
17261732
/* rightpage was initialized by _bt_getbuf */
17271733
ropaque = BTPageGetOpaque(rightpage);
@@ -1770,7 +1776,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
17701776
if (PageAddItem(rightpage, (Item) righthighkey, itemsz, afterrightoff,
17711777
false, false) == InvalidOffsetNumber)
17721778
{
1773-
memset(rightpage, 0, BufferGetPageSize(rbuf));
17741779
elog(ERROR, "failed to add high key to the right sibling"
17751780
" while splitting block %u of index \"%s\"",
17761781
origpagenumber, RelationGetRelationName(rel));
@@ -1818,7 +1823,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
18181823
if (!_bt_pgaddtup(leftpage, newitemsz, newitem, afterleftoff,
18191824
false))
18201825
{
1821-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18221826
elog(ERROR, "failed to add new item to the left sibling"
18231827
" while splitting block %u of index \"%s\"",
18241828
origpagenumber, RelationGetRelationName(rel));
@@ -1831,7 +1835,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
18311835
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
18321836
afterrightoff == minusinfoff))
18331837
{
1834-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18351838
elog(ERROR, "failed to add new item to the right sibling"
18361839
" while splitting block %u of index \"%s\"",
18371840
origpagenumber, RelationGetRelationName(rel));
@@ -1845,7 +1848,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
18451848
{
18461849
if (!_bt_pgaddtup(leftpage, itemsz, dataitem, afterleftoff, false))
18471850
{
1848-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18491851
elog(ERROR, "failed to add old item to the left sibling"
18501852
" while splitting block %u of index \"%s\"",
18511853
origpagenumber, RelationGetRelationName(rel));
@@ -1857,7 +1859,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
18571859
if (!_bt_pgaddtup(rightpage, itemsz, dataitem, afterrightoff,
18581860
afterrightoff == minusinfoff))
18591861
{
1860-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18611862
elog(ERROR, "failed to add old item to the right sibling"
18621863
" while splitting block %u of index \"%s\"",
18631864
origpagenumber, RelationGetRelationName(rel));
@@ -1878,7 +1879,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
18781879
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
18791880
afterrightoff == minusinfoff))
18801881
{
1881-
memset(rightpage, 0, BufferGetPageSize(rbuf));
18821882
elog(ERROR, "failed to add new item to the right sibling"
18831883
" while splitting block %u of index \"%s\"",
18841884
origpagenumber, RelationGetRelationName(rel));
@@ -1898,7 +1898,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
18981898
sopaque = BTPageGetOpaque(spage);
18991899
if (sopaque->btpo_prev != origpagenumber)
19001900
{
1901-
memset(rightpage, 0, BufferGetPageSize(rbuf));
19021901
ereport(ERROR,
19031902
(errcode(ERRCODE_INDEX_CORRUPTED),
19041903
errmsg_internal("right sibling's left-link doesn't match: "
@@ -1941,9 +1940,19 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
19411940
* original. We need to do this before writing the WAL record, so that
19421941
* XLogInsert can WAL log an image of the page if necessary.
19431942
*/
1944-
PageRestoreTempPage(leftpage, origpage);
1943+
memcpy(origpage, leftpage, BLCKSZ);
19451944
/* leftpage, lopaque must not be used below here */
19461945

1946+
/*
1947+
* Move the contents of the right page from its temporary location to the
1948+
* destination buffer, before writing the WAL record. Unlike the left
1949+
* page, the right page and its opaque area are still needed to complete
1950+
* the update of the page, so reinitialize them.
1951+
*/
1952+
rightpage = BufferGetPage(rbuf);
1953+
memcpy(rightpage, rightpage_buf.data, BLCKSZ);
1954+
ropaque = BTPageGetOpaque(rightpage);
1955+
19471956
MarkBufferDirty(buf);
19481957
MarkBufferDirty(rbuf);
19491958

0 commit comments

Comments
 (0)