
WIP Fix race condition in get_reconstruct_data #1447

Closed · wants to merge 1 commit

Conversation

knizhnik
Contributor

Refers to #1433

@hlinnaka
Contributor

I'd like to understand the issue better before we put a bandaid over it. We don't have a repro, so we don't really know if this even fixes it.

@stepashka
Member

@lubennikovaav is looking for repro steps

@stepashka stepashka changed the title Fix race condition in get_reconstruct_data WIP Fix race condition in get_reconstruct_data Apr 7, 2022
@knizhnik
Contributor Author

knizhnik commented Apr 7, 2022

@lubennikovaav is looking for repro steps

Just want to note that I tried to insert random sleeps before obtaining the layers lock, but still failed to reproduce the problem. Maybe we need to create branches...

@stepashka
Member

stepashka commented Apr 21, 2022

so far @lubennikovaav cannot reproduce this

@hlinnaka
Contributor

hlinnaka commented May 3, 2022

I'm confused on where we stand on this. The bug that this was supposed to bandaid over, is it the same that was fixed by #1601? Can we close this?

@knizhnik
Contributor Author

knizhnik commented May 4, 2022

I'm confused on where we stand on this. The bug that this was supposed to bandaid over, is it the same that was fixed by #1601? Can we close this?

I still think that there is one more race condition.
Please refer to the original description of the problem by @lubennikovaav.
Maybe I am missing something, but there is no lock held in LayeredTimeline::get_reconstruct_data between loop iterations.
So in each loop iteration we obtain the layers lock and process one layer. But what happens if the layer map is changed between iterations (some open layers are frozen, frozen layers are flushed, ...)? Is there any guarantee that we correctly collect the replay sequence in this case?
The comment to this function says:

    ///
    /// This function takes the current timeline's locked LayerMap as an argument,
    /// so callers can avoid potential race conditions.
    fn get_reconstruct_data(
        &self,
        key: Key,
        request_lsn: Lsn,

But this is not true! The lock is obtained and released on each loop iteration.
This is why my proposal is to hold the lock during the whole traversal.
Since the number of traversed layers is expected to be small (I think one in most cases), such a change will not reduce concurrency or have a negative impact on performance.
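To make the concern concrete, here is a minimal sketch contrasting the two locking strategies. The types and function names are hypothetical stand-ins, not the actual pageserver code: the real layer map and traversal logic are far more involved.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for a layer in the layer map.
struct Layer {
    id: u32,
}

// Current pattern: the lock is re-acquired on every iteration, so another
// thread may mutate the layer map between iterations (freeze an open layer,
// flush a frozen one, ...), and the traversal can observe an inconsistent
// sequence of layers.
fn traverse_relocking(layers: &Arc<Mutex<Vec<Layer>>>) -> Vec<u32> {
    let mut seen = Vec::new();
    let mut idx = 0;
    loop {
        let guard = layers.lock().unwrap(); // acquired each iteration
        match guard.get(idx) {
            Some(layer) => seen.push(layer.id),
            None => break,
        }
        idx += 1;
        // `guard` is dropped here; the map may change before the next iteration
    }
    seen
}

// Proposed fix: take the lock once and hold it for the whole traversal,
// so the layer map cannot change underneath us.
fn traverse_holding_lock(layers: &Arc<Mutex<Vec<Layer>>>) -> Vec<u32> {
    let guard = layers.lock().unwrap(); // held until the function returns
    guard.iter().map(|l| l.id).collect()
}

fn main() {
    let layers = Arc::new(Mutex::new(vec![Layer { id: 1 }, Layer { id: 2 }]));
    assert_eq!(traverse_relocking(&layers), vec![1, 2]);
    assert_eq!(traverse_holding_lock(&layers), vec![1, 2]);
    println!("ok");
}
```

If the traversal usually touches only one layer, the second form holds the lock for barely longer than the first while closing the window between iterations.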

@lubennikovaav
Contributor

@knizhnik, why do you expect that the number of traversed layers will be 1? Is it because most of the data is cold and materialization is active enough?

I suggest we play it safe and commit this fix. We can always optimize it later and remove the lock if we decide that it is totally safe. @hlinnaka, any objections?

@knizhnik
Contributor Author

knizhnik commented May 5, 2022

Is it because most of the data is cold and materialization is active enough?

Ideally a page should be reconstructed before it is accessed: saved in an image layer, so no reconstruction is needed on get_page_at_lsn. And when reconstruction is needed, the chain of WAL records to be applied should not be too long, and so fits in one delta layer.

@knizhnik
Contributor Author

knizhnik commented May 8, 2022

After thinking a lot about possible scenarios, I also do not see a source of a race condition here.
I hope that #1601 explains all the observed issues.
So I am closing this PR now.

@knizhnik knizhnik closed this May 8, 2022
@bayandin bayandin deleted the reconstruct_rc branch May 19, 2023 13:06
Successfully merging this pull request may close these issues.

"could not find layer with more data for key" error in CI