Add test for cascade branching #1569

Merged: 12 commits merged into main from branching_test on Jul 12, 2022
Conversation

@knizhnik (Contributor, Author)

No description provided.

@hlinnaka (Contributor)

See also #1552, @dhammika created a similar test there.

@knizhnik (Contributor, Author)

> See also #1552, @dhammika created a similar test there.

I didn't plan to create a test; my intention was to measure the overhead of maintaining multiple branches.
But I failed to do even that, even with GC disabled.
In my case the error is:

Error: could not find layer with more data for key 000000000000000000000000000000000000 at LSN 0/25A6A58, request LSN 0/25A6A58, ancestor 0/169AD60

@knizhnik (Contributor, Author)

Should I investigate it, or have you already found the problem?

@hlinnaka (Contributor)

> Should I investigate it, or have you already found the problem?

I suspect it's the bug that PR #1552 fixes. Does it happen with the fix from #1552?

@knizhnik (Contributor, Author)

Yes, #1552 fixes the problem.
I got the following results with pgbench at scale 100 (the statement timeout needs to be adjusted to make this test pass):
pgbench-N: pgbench -c 10 -N -T 10 -M prepared
pgbench-S: pgbench -c 10 -S -T 10 -M prepared
pgbench-branching-*: the same, but each iteration runs in a separate branch

pgbench-N pgbench-S pgbench-branching-N pgbench-branching-S
3659.679766 5799.035997 4096.113175 7033.988099
3411.610071 3745.583584 4879.264109 6866.284569
1243.158013 3328.822076 3434.285304 6586.347698
3667.244253 5989.222085 3729.312646 5817.914103
3174.557201 6203.412202 3683.717148 5613.968130
3324.629982 4436.722000 3170.684827 5630.288131
2016.809421 2158.615581 3048.013874 4766.886534
2348.957977 4238.340368 3717.156693 5328.268193
2555.701022 4113.181723 3494.957395 5324.218766
2095.486686 5210.213019 3153.178326 4525.540618
3740.493680 4737.210328 3129.817911 4583.439861
2819.108306 2883.230853 3023.859452 4630.586041
1629.900369 1145.820054 3327.016164 5399.938743
1547.075802 6422.298587 3350.022598 4730.168638
3507.577219 4722.432761 3429.416777 4526.292162
3354.957665 4458.728275 3302.257720 4590.323886
3535.285042 4094.957232 3532.010605 4632.028494
1871.365790 1561.230760 3651.497624 4279.151357
1661.103432 1216.078389 3602.751842 4335.726286
3347.524561 5167.328428 3086.730702 3869.362080
2754.882554 4935.236263 1951.371437 4547.524205
3350.340999 4405.187078 3072.209591 3274.802554
2598.451079 3016.624598 3493.345613 4271.918819
2422.449215 1866.724733 2880.223375 3419.336434
979.424246 1226.674731 2787.478583 3030.033206
1302.292477 1074.340414 2578.448592 2795.747509
1087.887194 3053.220832 2574.163016 2860.318668
3060.143983 3544.295032 2677.333768 2766.276634
2374.969178 3646.774255 2525.692086 3435.803541
4207.121783 5973.657599 2681.591777 2047.542351
3592.524597 3033.288100 2507.907888 3909.958078
2132.516472 1818.548125 2555.217954 2204.472378
1241.752051 3062.043510 2845.891674 2511.422455
4149.157832 5401.221000 3016.396372 2760.599435
3404.515522 4709.625501 2379.936458 2277.665283
2136.711076 4194.408300 3212.757447 3092.994152
3123.187209 4226.875111 2956.065867 1674.153808
1757.762534 1210.354943 3316.649131 2581.986020
1012.934126 1712.694497 2858.836826 3283.484110
1542.401048 5725.734568 2633.491551 2221.758105
2998.581370 4183.616975 3007.630412 3148.307623
3255.278734 4064.835901 1701.935917 3327.987278
3136.983172 2675.636450 2677.809464 2846.964252
2586.130511 2090.899264 2520.920101 2491.644437
1190.204360 1152.868711 2737.503026 2030.882911
1088.184614 935.609860 2327.155920 3494.381880
805.798292 2903.179769 2675.321267 1856.222328
2748.616037 3658.617502 2832.369971 3372.239396
3640.957640 5196.668707 2305.577982 2557.281869
4628.517141 6277.577937 2801.495865 2347.324249

So it is not easy to estimate the overhead of versioning from these results: the dispersion is very large.
But as you can see, the maximal update TPS is almost unchanged after 50 iterations (about 4k), while with branching it drops to 2.8k. For read-only the difference is even larger: 6.3k vs 3.3k.

But the main difference is the space overhead. I had to limit this test to only 50 iterations because disk space was exhausted on my machine:

without branches:
26G test_output/test_branching/repo/pgdatadirs
26G test_output/test_branching/repo/safekeepers
8,8G test_output/test_branching/repo/tenants

with branches:
30G test_output/test_branching/repo/pgdatadirs/
30G test_output/test_branching/repo/safekeepers/
29G test_output/test_branching/repo/tenants/

So while the WAL size is almost the same, the repository is about 3 times larger.
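
For reference, a minimal sketch of the per-iteration branching benchmark loop described above, driving neon_local directly from Python. The `pg start` subcommand, host/port, database name, and paths are assumptions, not the actual test code:

    # Hypothetical sketch of the benchmark: each iteration branches off the
    # previous branch (cascade branching) and runs pgbench against it.
    import subprocess

    NEON_LOCAL = "target/debug/neon_local"  # assumed path to the CLI binary
    ITERATIONS = 50

    parent = "main"
    for i in range(ITERATIONS):
        child = f"b{i}"
        # Create a child branch at the ancestor's current end of WAL.
        subprocess.run(
            [NEON_LOCAL, "timeline", "branch",
             "--branch-name", child, "--ancestor-branch-name", parent],
            check=True,
        )
        # Start a compute node on the new branch (subcommand name is assumed).
        subprocess.run([NEON_LOCAL, "pg", "start", child], check=True)
        # Run the update-only (-N) and read-only (-S) pgbench variants.
        for mode in ("-N", "-S"):
            subprocess.run(
                ["pgbench", "-c", "10", mode, "-T", "10", "-M", "prepared",
                 "-h", "127.0.0.1", "-p", "55432", "postgres"],
                check=True,
            )
        parent = child  # the next iteration branches off this branch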

@knizhnik (Contributor, Author)

For some reason, the following test always fails on my system after a few iterations with a waitseq error.
I failed to reproduce it using a Python script, although it seems to me that I am doing exactly the same thing.
Is there some special pageserver configuration applied when the test is spawned by pytest?
multibranch.zip

@knizhnik (Contributor, Author)

I could not reproduce the problem with my test on top of #1447
(I just rebased it). So we have:

  1. A logical race condition discovered by @alubennikova
  2. A trivial fix for it
  3. Two tests reproducing the problem (@dhammika's and mine)

Should we commit this patch, or can you propose something better?

@hlinnaka (Contributor)

hlinnaka commented May 5, 2022

Can you reproduce this, now that #1601 has been committed?

@aome510 (Contributor)

aome510 commented Jul 6, 2022

@knizhnik @hlinnaka I updated the test to make it more scalable. The test doesn't pass locally for me [1], which is good, because there are probably some bugs in our branching code, or there is a bug in the test itself 😅. The test does look good to me, though. Will do some investigation.

[1]: the error:

E           Exception:             Run ['/Users/aome510/Projects/Work/NeonDB/neon/target/debug/neon_local', 'timeline', 'branch', '--branch-name', 'b5', '--tenant-id', '53f48a3101484e9da4003e584de38924', '--ancestor-branch-name', 'b4'] failed:
E                         stdout:
E                         stderr: command failed: Error: Error: invalid branch start lsn

Probably something related to GC + branching 🤔

@knizhnik (Contributor, Author)

knizhnik commented Jul 7, 2022

> @knizhnik @hlinnaka I updated the test to make it more scalable. The test doesn't pass locally for me [1], which is good, because there are probably some bugs in our branching code, or there is a bug in the test itself 😅. The test does look good to me, though. Will do some investigation.
>
> [1]: the error:
>
> E           Exception:             Run ['/Users/aome510/Projects/Work/NeonDB/neon/target/debug/neon_local', 'timeline', 'branch', '--branch-name', 'b5', '--tenant-id', '53f48a3101484e9da4003e584de38924', '--ancestor-branch-name', 'b4'] failed:
> E                         stdout:
> E                         stderr: command failed: Error: Error: invalid branch start lsn
>
> Probably something related to GC + branching 🤔

I failed to reproduce this error locally, neither with release nor with debug builds.
Actually, the "invalid branch start lsn" message is produced when the specified branch start LSN is beyond the GC horizon.
If the user chooses the branching point manually, it is always possible that they set it too far in the past, so this error is "normal".

In this particular test the branching point is not explicitly specified. That means that last_record_lsn of the ancestor branch is used as the starting point:

            if start_lsn == Lsn(0) {
                // Find end of WAL on the old timeline
                let end_of_wal = ancestor_timeline.get_last_record_lsn();
                info!("branching at end of WAL: {}", end_of_wal);
                start_lsn = end_of_wal;

But there is no critical section between obtaining last_record_lsn and the creation of the new branch, so there is no guarantee that new data will not be inserted into the ancestor timeline and GC performed in between, moving gc_cutoff_lsn past the chosen branching position. Whether this happens is determined only by the insertion speed and the GC parameters.

In this particular example the following GC parameters are used:

'gc_period': '5 s',
'gc_horizon': f'{1024 ** 2}',
'pitr_interval': '5 s'

It means that, for this error to occur, at least 5 seconds should pass between issuing the command to create a new branch and the moment when the branch is actually created. What can delay the creation of a branch for 5 seconds? Too high system load? Since I cannot reproduce the problem, I cannot answer this question.

But I want to note once again that the "invalid branch start lsn" message does not necessarily indicate a bug in the pageserver code. It can be absolutely legal behavior.
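
For context, a hedged sketch of how such aggressive GC settings might be wired into the pytest test; the neon_env_builder fixture and the create_tenant signature are assumptions about the test harness, not a quote of the actual test:

    # Hypothetical sketch: create a tenant with the aggressive GC settings
    # discussed above so that GC runs every few seconds during the test.
    def test_branching_with_aggressive_gc(neon_env_builder):
        env = neon_env_builder.init_start()
        tenant_id, _timeline_id = env.neon_cli.create_tenant(
            conf={
                "gc_period": "5 s",            # trigger GC every 5 seconds
                "gc_horizon": f"{1024 ** 2}",  # keep only ~1 MB of recent WAL
                "pitr_interval": "5 s",        # keep only 5 seconds of history
            }
        )
        # ... create cascading branches and run the workload for this tenant ...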

@aome510 (Contributor)

aome510 commented Jul 7, 2022

> I failed to reproduce this error locally, neither with release nor with debug builds.

That's weird. I can reproduce it quite consistently (once every 2-3 runs) with a debug build. CI also returns the same error: https://circleci.com/api/v1.1/project/github/neondatabase/neon/79558/output/106/0?file=true&allocation-id=62c5fd9c9f05123110b220e3-0-build%2F69C32525.

Actually "invalid branch start lsn" message is produced when specified branch start LSN is beyond GC horizon.
If user choose branching point manually - it is always possible that he set it too far in the past. So this error is "normal".

But I just want to once again notice that "invalid branch start lsn" message doesn't actually means some bug in pageserver code. It can be absolutely legal behavior.

What causes the confusion is that the command to create a new branch doesn't specify any starting LSN, yet the message tells us we used an invalid one. The pageserver uses the end of WAL if no ancestor LSN is specified, in timeline::create_timeline:

if start_lsn == Lsn(0) {
    // Find end of WAL on the old timeline
    let end_of_wal = ancestor_timeline.get_last_record_lsn();
    info!("branching at end of WAL: {}", end_of_wal);
    start_lsn = end_of_wal;
} else {
    // Wait for the WAL to arrive and be processed on the parent branch up
    // to the requested branch point. The repository code itself doesn't
    // require it, but if we start to receive WAL on the new timeline,
    // decoding the new WAL might need to look up previous pages, relation
    // sizes etc. and that would get confused if the previous page versions
    // are not in the repository yet.
    ancestor_timeline.wait_lsn(start_lsn)?;
}

> It means that, for this error to occur, at least 5 seconds should pass between issuing the command to create a new branch and the moment when the branch is actually created. What can delay the creation of a branch for 5 seconds? Too high system load? Since I cannot reproduce the problem, I cannot answer this question.

That's not quite true. 5 seconds is just the interval at which GC is triggered on the ancestor branch. It's possible to make a "create a child branch" request right before GC is due to run on the ancestor. I guess this is the reason the error happens: in the window between the request determining the end-of-WAL LSN and the start of branching (acquiring the GC lock), some updates happen and GC then runs, garbage-collecting the end-of-WAL LSN because of the small GC horizon.
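
As an illustration only (a hypothetical test-side workaround, not what #2053 actually implements), a test could tolerate this race by retrying branch creation when the chosen start LSN has already been garbage-collected; the create_branch helper and the error-text matching are assumptions about the harness:

    # Hypothetical test-side workaround: retry the branch creation if the
    # ancestor's end-of-WAL LSN was garbage-collected between picking it and
    # creating the branch.
    import time

    def create_branch_with_retry(env, child, ancestor, attempts=5):
        for attempt in range(attempts):
            try:
                return env.neon_cli.create_branch(child, ancestor)
            except Exception as err:
                if "invalid branch start lsn" not in str(err) or attempt == attempts - 1:
                    raise
                time.sleep(1)  # back off; the next attempt picks a fresh LSN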

@knizhnik (Contributor, Author)

knizhnik commented Jul 7, 2022

> What causes the confusion is that the command to create a new branch doesn't specify any starting LSN, yet the message tells us we used an invalid one.

Please notice that this test uses very specific GC settings; with default or any other reasonable values of these parameters, such an error is not possible.

> That's not quite true. 5 seconds is just the interval at which GC is triggered on the ancestor branch.

I mean another parameter: pitr_interval, which is also set to 5 seconds.
Currently the GC cutoff is determined by the maximum of gc_horizon and pitr_interval.
It means that with the parameters used in this test, GC should keep the last 1 MB of WAL or the last 5 seconds of work, whichever is larger.
That is what I mean by saying that at least 5 seconds should pass between the command to create a new branch and the actual branch creation.
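
An illustrative sketch of the retention rule just described (the function and argument names are invented; the real pageserver logic is in Rust and more involved):

    # Illustrative only: the effective GC cutoff keeps everything covered by
    # EITHER the size-based horizon (gc_horizon bytes of WAL) or the time-based
    # PITR window (pitr_interval), i.e. it takes the older of the two LSNs.
    def effective_gc_cutoff(horizon_cutoff_lsn, pitr_cutoff_lsn):
        # horizon_cutoff_lsn: last_record_lsn minus gc_horizon bytes of WAL
        # pitr_cutoff_lsn: the last_record_lsn observed pitr_interval seconds ago
        return min(horizon_cutoff_lsn, pitr_cutoff_lsn)

    # With this test's settings (1 MB horizon, 5 s PITR window), whichever of
    # the two reaches further back wins, so at least the last 1 MB of WAL and
    # the last 5 seconds of history are retained.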

@aome510 (Contributor)

aome510 commented Jul 7, 2022

> I mean another parameter: pitr_interval, which is also set to 5 seconds.
> Currently the GC cutoff is determined by the maximum of gc_horizon and pitr_interval.
> It means that with the parameters used in this test, GC should keep the last 1 MB of WAL or the last 5 seconds of work, whichever is larger.
> That is what I mean by saying that at least 5 seconds should pass between the command to create a new branch and the actual branch creation.

I see. Thanks for the clarification.

I added some logs, and it seems that at the time the error happened, the last GC iteration took 16s, so it's possible for the elapsed time between the command to create a new branch and the actual branch creation to be > 5s, since a single GC lock is required for each timeline branching.

Note that such a big GC delay is caused by either

  1. the test being run in debug mode, or
  2. the GC iteration for each timeline being performed sequentially.

Looks like we can optimize this, but it will need a separate PR/issue.

> Please notice that this test uses very specific GC settings; with default or any other reasonable values of these parameters, such an error is not possible.

I agree. I still think it's better to be safe in this case to avoid possible confusion. Also note that when testing GC, we need to set the PITR interval to be reasonably small to get a meaningful GC interaction, so fixing this issue for small values should be a net improvement.

@aome510 (Contributor)

aome510 commented Jul 7, 2022

> I agree. I still think it's better to be safe in this case to avoid possible confusion. Also note that when testing GC, we need to set the PITR interval to be reasonably small to get a meaningful GC interaction, so fixing this issue for small values should be a net improvement.

@knizhnik created a patch to fix this issue: #2053

@knizhnik (Contributor, Author)

knizhnik commented Jul 8, 2022

There are 85 warnings in pageserver.log produced during this test execution:

2022-07-08T12:43:31.817237Z  WARN gc iteration{tenant=0596214083344da3980b0c1771f8eab7 timeline=None}: is_shutdown_requested() called in an unexpected thread

It seems to have no relation to branching at all; it's just that due to the specific GC parameters used in this test, GC is performed more frequently than in other tests. But this warning is also present in other tests.

I am not so familiar with the shutdown logic (note that this warning is produced during normal work, not at shutdown).
But something seems to be wrong here.
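
For reference, a hedged sketch of how the warning count mentioned above could be collected from the test's pageserver log (the log path is an assumption about the test output layout):

    # Hypothetical helper: count occurrences of the GC warning in pageserver.log.
    from pathlib import Path

    def count_unexpected_thread_warnings(repo_dir):
        log = Path(repo_dir) / "pageserver.log"
        needle = "is_shutdown_requested() called in an unexpected thread"
        return sum(needle in line for line in log.read_text().splitlines())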

@aome510 (Contributor)

aome510 commented Jul 8, 2022

> There are 85 warnings in pageserver.log produced during this test execution:
>
> 2022-07-08T12:43:31.817237Z  WARN gc iteration{tenant=0596214083344da3980b0c1771f8eab7 timeline=None}: is_shutdown_requested() called in an unexpected thread
>
> It seems to have no relation to branching at all; it's just that due to the specific GC parameters used in this test, GC is performed more frequently than in other tests. But this warning is also present in other tests.
>
> I am not so familiar with the shutdown logic (note that this warning is produced during normal work, not at shutdown). But something seems to be wrong here.

Will take a look. Thanks for pointing that out

@aome510 (Contributor)

aome510 commented Jul 8, 2022

> There are 85 warnings in pageserver.log produced during this test execution:
>
> 2022-07-08T12:43:31.817237Z  WARN gc iteration{tenant=0596214083344da3980b0c1771f8eab7 timeline=None}: is_shutdown_requested() called in an unexpected thread

The code for the above warning:

    CURRENT_THREAD.with(|ct| {
        if let Some(ct) = ct.borrow().as_ref() {
            ct.shutdown_requested.load(Ordering::Relaxed)
        } else {
            if !cfg!(test) {
                warn!("is_shutdown_requested() called in an unexpected thread");
            }
            false
        }
    })

Looks like it's because GC is not running in a pageserver-managed thread, so the CURRENT_THREAD thread-local is not set. Seems to be related to the recent changes in #1933. I don't think it's related to the changes in this PR.

@aome510 (Contributor)

aome510 commented Jul 8, 2022

@knizhnik @hlinnaka with #2053 merged into main, I think this PR is ready for review now. Can you review it?

@aome510 requested a review from hlinnaka on July 8, 2022 at 21:15
@knizhnik (Contributor, Author)

knizhnik commented Jul 9, 2022

> @knizhnik @hlinnaka with #2053 merged into main, I think this PR is ready for review now. Can you review it?

Since I created this PR, I cannot add a review to it myself. But I have nothing against merging your version of the test.

@hlinnaka (Contributor) left a review:

Made suggestions for comments. Other than that, looks good to me.

aome510 and others added 4 commits on July 12, 2022 at 10:31
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
@aome510 force-pushed the branching_test branch 2 times, most recently from 37f7477 to cb2d408 on July 12, 2022 at 16:07
@aome510 merged commit 5cf94a5 into main on Jul 12, 2022
@aome510 deleted the branching_test branch on July 12, 2022 at 19:01