
WIP Fix race condition in get_reconstruct_data #1447

Closed · wants to merge 1 commit

Conversation

knizhnik
Contributor

Refers to #1433

@hlinnaka
Contributor

I'd like to understand the issue better before we put a bandaid over it. We don't have a repro, so we don't really know if this even fixes it.

@stepashka
Member

@lubennikovaav is looking for repro steps

@stepashka stepashka changed the title Fix race condition in get_reconstruct_data WIP Fix race condition in get_reconstruct_data Apr 7, 2022
@knizhnik
Contributor Author

knizhnik commented Apr 7, 2022

@lubennikovaav is looking for repro steps

Just want to note that I tried to insert random sleeps before obtaining the layers lock, but still failed to reproduce the problem. Maybe we need to create branches...

@stepashka
Member

stepashka commented Apr 21, 2022

so far @lubennikovaav cannot reproduce this

@hlinnaka
Contributor

hlinnaka commented May 3, 2022

I'm confused on where we stand on this. The bug that this was supposed to bandaid over, is it the same that was fixed by #1601? Can we close this?

@knizhnik
Contributor Author

knizhnik commented May 4, 2022

I'm confused on where we stand on this. The bug that this was supposed to bandaid over, is it the same that was fixed by #1601? Can we close this?

I still think that there is one more race condition.
Please refer to the original description of the problem by @lubennikovaav.
Maybe I am missing something, but there is no lock held in LayeredTimeline::get_reconstruct_data between loop iterations.
So in each loop iteration we obtain the layers lock and process one layer. But what happens if the layer map is changed between iterations (some open layers are frozen, frozen layers are flushed, ...)? Is there any guarantee that we correctly collect the replay sequence in this case?
The comment to this function says:

    ///
    /// This function takes the current timeline's locked LayerMap as an argument,
    /// so callers can avoid potential race conditions.
    fn get_reconstruct_data(
        &self,
        key: Key,
        request_lsn: Lsn,

But this is not true! The lock is obtained and released on each loop iteration.
This is why my proposal is to hold the lock during the whole traversal.
Since the number of traversed layers is expected to be small (I think one in most cases), such a change will not reduce concurrency or have a negative impact on performance.
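To make the concern concrete, here is a minimal sketch contrasting the two locking strategies. The types and function names are hypothetical stand-ins, not the actual pageserver code: the real layer map and traversal logic are far more involved.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for a layer in the layer map.
struct Layer {
    id: u32,
}

// Current pattern: the lock is re-acquired on every iteration, so another
// thread may mutate the layer map between iterations (freeze an open layer,
// flush a frozen one, ...), and the traversal can observe an inconsistent
// sequence of layers.
fn traverse_relocking(layers: &Arc<Mutex<Vec<Layer>>>) -> Vec<u32> {
    let mut seen = Vec::new();
    let mut idx = 0;
    loop {
        let guard = layers.lock().unwrap(); // acquired each iteration
        match guard.get(idx) {
            Some(layer) => seen.push(layer.id),
            None => break,
        }
        idx += 1;
        // `guard` is dropped here; the map may change before the next iteration
    }
    seen
}

// Proposed fix: take the lock once and hold it for the whole traversal,
// so the layer map cannot change underneath us.
fn traverse_holding_lock(layers: &Arc<Mutex<Vec<Layer>>>) -> Vec<u32> {
    let guard = layers.lock().unwrap(); // held until the function returns
    guard.iter().map(|l| l.id).collect()
}

fn main() {
    let layers = Arc::new(Mutex::new(vec![Layer { id: 1 }, Layer { id: 2 }]));
    assert_eq!(traverse_relocking(&layers), vec![1, 2]);
    assert_eq!(traverse_holding_lock(&layers), vec![1, 2]);
    println!("ok");
}
```

If the traversal usually touches only one layer, the second form holds the lock for barely longer than the first while closing the window between iterations.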

@lubennikovaav
Contributor

@knizhnik, why do you expect that the number of traversed layers will be 1? Is it because most of the data is cold and materialization is active enough?

I suggest we play it safe and commit this fix. We can always optimize it later and remove the lock if we decide that it is totally safe. @hlinnaka, any objections?

@knizhnik
Contributor Author

knizhnik commented May 5, 2022

Is it because most of the data is cold and materialization is active enough?

Ideally a page should be reconstructed before it is accessed: saved in an image layer, so no reconstruction is needed on get_page_at_lsn. And when reconstruction is needed, the chain of WAL records to be applied should not be too long, and so fits in one delta layer.

@knizhnik
Contributor Author

knizhnik commented May 8, 2022

After thinking a lot about possible scenarios, I also do not see a source of a race condition here.
I hope that #1601 explains all the observed issues.
So I am closing this PR now.

@knizhnik knizhnik closed this May 8, 2022
@bayandin bayandin deleted the reconstruct_rc branch May 19, 2023 13:06
Successfully merging this pull request may close these issues.

"could not find layer with more data for key" error in CI