
Need for partial merge #459

Closed · martinsumner opened this issue Nov 26, 2024 · 2 comments · Fixed by #460

martinsumner (Owner) commented Nov 26, 2024

When a level in the ledger is over capacity, a file is picked almost at random (in some cases grooming can influence the choice), and that file is fully merged into the level below.

Take Level 4 and Level 5, say: there are 384 files at Level 4 and 2304 files at Level 5, an average of 6 files at Level 5 for every one at Level 4. So on average one would expect a merge to write into 6 files below.

But consider an extreme scenario. Imagine a situation with three object buckets <<"B0">>, <<"B1">> and <<"B2">>. At the start of the database's life all objects loaded had keys in <<"B1">>, but after a point all updates are a 50/50 mix across buckets <<"B0">> and <<"B2">>.

There could now be a scenario where there are 2000 files at Level 5 which contain only <<"B1">> objects, and amongst the files at Level 4 there are files that contain only <<"B0">> objects, files that contain only <<"B2">> objects - and one file in between (as each level is key-ordered) which contains some <<"B0">> objects and some <<"B2">> objects, and whose key range therefore spans every <<"B1">> file below.

If the file chosen for compaction is that mixed file, it will need to merge into more than 2000 files below - even though most of those merge steps will simply be taking keys continuously from the same (lower) side in succession. This will take a long time, and if updates are ongoing it will lead to a significant number of slow_offer backend pauses.
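
To make the scale of the problem concrete, here is a small illustrative function (not leveled code - key ranges are modelled as simple {LowKey, HighKey} tuples) counting how many sink-level files the key span of a single source file overlaps:

%% Illustrative only - not taken from leveled. A merge must rewrite
%% every sink file whose key range overlaps the source file's range.
%% In the healthy case this is ~6 files; a source file spanning
%% <<"B0">> to <<"B2">> overlaps every <<"B1">>-only sink file, so in
%% the scenario above it would return > 2000.
overlap_count({SrcLow, SrcHigh}, SinkRanges) ->
    length(
        [R || {SinkLow, SinkHigh} = R <- SinkRanges,
              SinkHigh >= SrcLow, SinkLow =< SrcHigh]).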

Such events have been rare so far, but running a perf_SUITE test with 32M objects (much larger than normal) did trigger one ... and a load which should have taken 10 minutes took 6 hours. Such severe fluctuations are very much against the principles of what leveled should achieve, so although this may be a very rare event, it should still be considered a HIGH priority bug.

martinsumner (Owner, Author) commented

Currently the merge is managed through the do_merge/9 loop in leveled_pclerk:

do_merge([], [], SinkLevel, _SinkB, _RP, NewSQN, _MaxSQN, _Opts, Additions) ->
    leveled_log:log(pc011, [NewSQN, SinkLevel, length(Additions)]),
    lists:reverse(Additions);
do_merge(KL1, KL2, SinkLevel, SinkB, RP, NewSQN, MaxSQN, OptsSST, Additions) ->
    FileName =
        leveled_penciller:sst_filename(
            NewSQN, SinkLevel, length(Additions)
        ),
    leveled_log:log(pc012, [NewSQN, FileName, SinkB]),
    TS1 = os:timestamp(),
    case leveled_sst:sst_newmerge(RP, FileName,
                                    KL1, KL2, SinkB, SinkLevel, MaxSQN,
                                    OptsSST) of
        empty ->
            leveled_log:log(pc013, [FileName]),
            do_merge(
                [], [],
                SinkLevel, SinkB,
                RP, NewSQN, MaxSQN,
                OptsSST, 
                Additions
            );                        
        {ok, Pid, Reply, Bloom} ->
            {{KL1Rem, KL2Rem}, SmallestKey, HighestKey} = Reply,
            Entry =
                leveled_pmanifest:new_entry(
                    SmallestKey, HighestKey, Pid, FileName, Bloom),
            leveled_log:log_timer(pc015, [], TS1),
            do_merge(
                KL1Rem, KL2Rem,
                SinkLevel, SinkB,
                RP, NewSQN, MaxSQN,
                OptsSST,
                [Entry|Additions]
            )
    end.

A MaxAdditions limit can be added so that when the loop comes around with either KL1 or KL2 non-empty and length(Additions) == MaxAdditions, the standard loop can be terminated early.
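
As a sketch, such a guard could be added as a new clause between the base clause and the standard recursive clause (?MAX_ADDITIONS and complete_partial_merge/9 are illustrative names here, not the actual implementation):

%% Sketch only. The base clause above matches when both key lists are
%% empty, so reaching this clause means at least one of KL1 and KL2 is
%% non-empty - cut the loop short and hand over to a (hypothetical)
%% helper that writes the remainders for both levels.
do_merge(KL1, KL2, SinkLevel, SinkB, RP, NewSQN, MaxSQN, OptsSST, Additions)
        when length(Additions) == ?MAX_ADDITIONS ->
    complete_partial_merge(
        KL1, KL2, SinkLevel, SinkB, RP, NewSQN, MaxSQN, OptsSST, Additions);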

In this case, there is still a need to write the remainders for both the higher and lower level.

For the higher level this is trivial, as only one file was passed into the merge at that level - so the remainder can be written using leveled_sst:sst_newmerge/8 with an empty KL2. For the lower level this is harder, as we need to terminate when we reach an sst_pointer to a next file - so that all subsequent files are not re-written, but are left unmodified.
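
For the higher-level remainder, a minimal sketch of that trivial case, mirroring the loop above (writing the remainder back at SinkLevel - 1, passing false for the basement flag, and the filename sequence number are assumptions for illustration):

%% Sketch only: flush the unmerged higher-level keys (KL1Rem) into a
%% single new file by merging against an empty KL2. Note sst_newmerge
%% may also return 'empty', which would need handling as in the loop.
SrcLevel = SinkLevel - 1,
RemFileName = leveled_penciller:sst_filename(NewSQN, SrcLevel, 0),
{ok, RemPid, RemReply, RemBloom} =
    leveled_sst:sst_newmerge(
        RP, RemFileName,
        KL1Rem, [],        % empty KL2 - nothing to merge against
        false,             % assumption: this level is not the basement
        SrcLevel, MaxSQN, OptsSST),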

The update of the penciller manifest is then complicated by the truncation of the merge.

martinsumner (Owner, Author) commented

This change needs to be back-ported to develop-3.1 to be added into Riak KV 3.2.3.
