Add test for cascade branching #1569
I didn't plan to create a test. My intention was to measure the overhead of maintaining multiple branches.
Should I investigate it, or have you already found the problem?
Yes, #1552 fixes the problem.
So it is not easy to say what the overhead of versioning is based on these results: the dispersion is very large. The main difference is space overhead. I had to limit this test to only 50 iterations because disk space was exhausted on my machine: without branches: with branches: So while the WAL size is almost the same, the repository is about 3 times larger.
For some reason the following test always fails on my system after a few iterations with a waitseq error.
I could not reproduce the problem with my test with #1447
Should we commit this patch, or can you propose something better?
Can you reproduce this, now that #1601 has been committed?
@knizhnik @hlinnaka I updated the test to make it more scalable. The test doesn't pass locally for me [1], which is good because there are probably some bugs in our branching code, or there is a bug in the test itself 😅. The test does look good to me though. I will do some investigation. [1]: the error:
Probably something related to GC + branching 🤔
I failed to reproduce this error locally, with either release or debug builds. In this particular test the branching point is not explicitly specified. It means that the last_record_lsn of the ancestor branch is used as the starting point:
But there is no critical section between obtaining last_record_lsn and creating the new branch. So there is no guarantee that new data will not be inserted into the ancestor timeline and that GC will not run in between, moving gc_cutoff_lsn past the specified branching position. Whether that happens is determined only by insertion speed and GC parameters. In this particular example the following GC parameters are used:
It means that there should be at least 5 seconds between issuing the command to create a new branch and the moment when this branch is actually created. What can delay creation of the branch for 5 seconds? Too high a system load? Since I cannot reproduce the problem, I cannot answer this question. But I want to note once again that the "invalid branch start lsn" message doesn't necessarily mean there is a bug in the pageserver code. It can be perfectly legal behavior.
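To make the sequence described above concrete, here is a minimal sketch of how a branch point that was valid when it was read can become invalid before it is used. The `Timeline` struct, its `ingest` and `gc` methods, the byte-based `horizon` argument, and the wording of the error string are all hypothetical stand-ins for illustration; they are not the pageserver's actual types or API, and the real GC parameters used by the test are not reproduced here.

```rust
/// Log sequence number, modelled as a plain u64 (an assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// Hypothetical stand-in for the ancestor timeline's state.
pub struct Timeline {
    pub last_record_lsn: Lsn,
    pub gc_cutoff_lsn: Lsn,
}

impl Timeline {
    /// Simulate WAL ingestion: advances last_record_lsn by `bytes`.
    pub fn ingest(&mut self, bytes: u64) {
        self.last_record_lsn = Lsn(self.last_record_lsn.0 + bytes);
    }

    /// Simulate a GC pass that keeps only `horizon` bytes of history
    /// behind last_record_lsn. The cutoff never moves backwards.
    pub fn gc(&mut self, horizon: u64) {
        let cutoff = Lsn(self.last_record_lsn.0.saturating_sub(horizon));
        self.gc_cutoff_lsn = self.gc_cutoff_lsn.max(cutoff);
    }

    /// A branch may only start at or after the GC cutoff.
    pub fn check_branch_start(&self, start_lsn: Lsn) -> Result<(), String> {
        if start_lsn < self.gc_cutoff_lsn {
            Err(format!(
                "invalid branch start lsn: {:?} is below GC cutoff {:?}",
                start_lsn, self.gc_cutoff_lsn
            ))
        } else {
            Ok(())
        }
    }
}
```

If the start LSN is read first, and ingestion plus a GC pass with a small horizon happen before the check, the check fails even though no start LSN was ever given explicitly by the caller.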
That's weird. I can reproduce it quite consistently (once every 2-3 runs) with a debug build. CI also returns the same error: https://circleci.com/api/v1.1/project/github/neondatabase/neon/79558/output/106/0?file=true&allocation-id=62c5fd9c9f05123110b220e3-0-build%2F69C32525.
What causes the confusion is that the command to create a new branch doesn't specify any starting LSN, but the message tells us that we used an invalid one. The page server uses the end of WAL if no ancestor LSN is specified. In the code:

    if start_lsn == Lsn(0) {
        // Find end of WAL on the old timeline
        let end_of_wal = ancestor_timeline.get_last_record_lsn();
        info!("branching at end of WAL: {}", end_of_wal);
        start_lsn = end_of_wal;
    } else {
        // Wait for the WAL to arrive and be processed on the parent branch up
        // to the requested branch point. The repository code itself doesn't
        // require it, but if we start to receive WAL on the new timeline,
        // decoding the new WAL might need to look up previous pages, relation
        // sizes etc. and that would get confused if the previous page versions
        // are not in the repository yet.
        ancestor_timeline.wait_lsn(start_lsn)?;
    }

That's not true. 5 seconds is just the interval at which GC is triggered in the ancestor branch. It's possible to make a "create a child branch" request right before GC is due in the ancestor. I guess this is the reason why the error happens. Between the moment the request determines the end-of-WAL LSN and the moment it starts branching (acquires the GC lock), some updates happen and then GC starts, which removes the data at the end-of-WAL LSN because of the small GC horizon.
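One possible way to close that window, sketched here only as an illustration and not as what this PR or the pageserver actually does, is to resolve the default start LSN after the GC lock has been acquired, so that GC cannot advance the cutoff between reading the end of WAL and validating it. The `Repo` type, its fields, and its methods are hypothetical names introduced for this sketch.

```rust
use std::sync::Mutex;

/// Log sequence number, modelled as a plain u64 (an assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

struct TimelineState {
    last_record_lsn: Lsn,
    gc_cutoff_lsn: Lsn,
}

/// Hypothetical repository with one lock that serialises GC and branching.
pub struct Repo {
    gc_lock: Mutex<()>,
    state: Mutex<TimelineState>,
}

impl Repo {
    pub fn new(last_record_lsn: u64) -> Self {
        Repo {
            gc_lock: Mutex::new(()),
            state: Mutex::new(TimelineState {
                last_record_lsn: Lsn(last_record_lsn),
                gc_cutoff_lsn: Lsn(0),
            }),
        }
    }

    /// WAL ingestion does not take the GC lock.
    pub fn ingest(&self, bytes: u64) {
        let mut s = self.state.lock().unwrap();
        s.last_record_lsn = Lsn(s.last_record_lsn.0 + bytes);
    }

    /// GC holds the GC lock while it advances the cutoff.
    pub fn gc(&self, horizon: u64) {
        let _gc = self.gc_lock.lock().unwrap();
        let mut s = self.state.lock().unwrap();
        let cutoff = Lsn(s.last_record_lsn.0.saturating_sub(horizon));
        s.gc_cutoff_lsn = s.gc_cutoff_lsn.max(cutoff);
    }

    /// Branching takes the GC lock first, then resolves "end of WAL".
    /// GC cannot run until `_gc` is dropped, so a defaulted start LSN
    /// cannot fall below the cutoff.
    pub fn branch(&self, requested: Option<Lsn>) -> Result<Lsn, String> {
        let _gc = self.gc_lock.lock().unwrap();
        let s = self.state.lock().unwrap();
        let start = requested.unwrap_or(s.last_record_lsn);
        if start < s.gc_cutoff_lsn {
            Err(format!("invalid branch start lsn: {:?}", start))
        } else {
            Ok(start)
        }
    }
}
```

With this ordering, a request without an explicit LSN always succeeds, while an explicitly requested LSN that GC has already passed is still rejected, which keeps the error meaningful for callers that did specify a branch point.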
Please note that this test uses very specific GC settings. With the default or any other reasonable values of these parameters, such an error is not possible.
I mean another parameter -
I see. Thanks for the clarification. I added some logs, and it seems that at the time the error happened, the last GC iteration took 16s. So it is possible for the elapsed time between the command to create a new branch and the actual branch creation to be > 5s, because a single GC lock must be acquired for each timeline branching. Note that such a big GC delay is caused by either
Looks like we can optimize this, but it will need a separate PR/issue.
I agree. I still think it's better to be safe in this case to avoid possible confusion. Also note that when testing GC, we need to set the PITR interval to be reasonably small to get a meaningful GC interaction, so fixing this issue for small values should be a net improvement.
There are 85 warnings in pageserver.log produced during this test execution:
It seems to have no relation to branching at all: due to the specific GC parameters used in this test, GC is simply performed more frequently than in other tests. But this warning is also present in other tests. I am not very familiar with the shutdown logic (please note that this warning is produced during normal operation, not at shutdown).
Will take a look. Thanks for pointing that out.
The code for the above warning:

    CURRENT_THREAD.with(|ct| {
        if let Some(ct) = ct.borrow().as_ref() {
            ct.shutdown_requested.load(Ordering::Relaxed)
        } else {
            if !cfg!(test) {
                warn!("is_shutdown_requested() called in an unexpected thread");
            }
            false
        }
    })

Looks like it's because GC is not running in a pageserver thread. It seems to be related to the recent changes in #1933. I don't think it's related to the changes in this PR.
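The behaviour behind that warning can be illustrated with a small, self-contained sketch of a thread-local registry. Only the `CURRENT_THREAD` name and the `shutdown_requested` field come from the snippet above; `ThreadInfo`, `spawn_registered`, and `shutdown_state` are hypothetical helpers invented here, and the real pageserver thread manager may be structured differently.

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Per-thread bookkeeping (hypothetical layout).
struct ThreadInfo {
    shutdown_requested: AtomicBool,
}

thread_local! {
    // Each OS thread gets its own slot; it stays None unless the thread
    // was started through the registering helper below.
    static CURRENT_THREAD: RefCell<Option<ThreadInfo>> = RefCell::new(None);
}

/// Hypothetical helper: spawn a thread and register it before running `f`.
pub fn spawn_registered<F, T>(f: F) -> thread::JoinHandle<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(move || {
        CURRENT_THREAD.with(|ct| {
            *ct.borrow_mut() = Some(ThreadInfo {
                shutdown_requested: AtomicBool::new(false),
            });
        });
        f()
    })
}

/// Returns Some(flag) on a registered thread and None on any other thread.
/// The None case corresponds to the branch that logs the warning.
pub fn shutdown_state() -> Option<bool> {
    CURRENT_THREAD.with(|ct| {
        ct.borrow()
            .as_ref()
            .map(|info| info.shutdown_requested.load(Ordering::Relaxed))
    })
}
```

A thread started with plain `std::thread::spawn` never fills its slot, so any shutdown check made from it takes the "unexpected thread" path. That would be consistent with GC running on a thread that was not started through the registering mechanism.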
Made suggestions for comments. Other than that, looks good to me.
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>