Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yield CPU for concurrent flush and concurrent mergeDelta (#5410) #5424

Merged

Conversation

ti-chi-bot
Copy link
Member

This is an automated cherry-pick of #5410

What problem does this PR solve?

Issue Number: close #5409

What is changed and how it works?

Add sleep for flushCache and mergeDeltaBySegment:

  • When flushCache is retrying, wait backoff will be 5ms ~ 100ms (considering that existing flushCache usually takes short time to finish).

  • When mergeDeltaBySegment is retrying, wait backoff will be 50ms ~ 1s (considering that split-prepare could take several seconds to finish).

Check List

Tests

  • Unit test
  • Integration test
  • Manual test

To test with the fix, I introduced a splitEachSegment debug function locally to manually trigger a split:

void DeltaMergeStore::splitEachSegment(const Context & db_context)
{
    auto dm_context = newDMContext(db_context, db_context.getSettingsRef(), /*tracing_id*/ "split_each_segment");
    std::vector<SegmentPtr> current_segments;
    {
        std::shared_lock lock(read_write_mutex);
        for (const auto & segment_it : segments)
        {
            current_segments.push_back(segment_it.second);
        }
    }
    for (const auto & segment : current_segments)
    {
        segmentSplit(*dm_context, segment, true);
    }
}

The test case is to trigger the split for a 1GB segment, and then perform a mergeDelta at the same time.

Before the fix (using release v6.1):

when there are both split (takes 10s) and mergeDelta (takes 20s in total, blocked by split for 10s), there are 211K retries in 10s when the mergeDelta is blocked:

❯ cat ~/.tiup/data/my_test_r61_no_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment" | head
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:42.057 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]

❯ cat ~/.tiup/data/my_test_r61_no_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment" | tail
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]
[2022/07/19 21:55:52.171 +08:00] [DEBUG] [DeltaMergeStore.cpp:2162] ["DeltaMergeStore:db_1.t_69 Give up merge delta, segment [1]"] [thread_id=30]

❯ cat ~/.tiup/data/my_test_r61_no_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment" | wc -l
  211848

The CPU usage is around 200% during the split+mergeDelta:

5.7   tiflash
128.0 tiflash  <--- split begin, mergeDelta begin
196.4 tiflash
185.1 tiflash
152.7 tiflash
184.9 tiflash
202.0 tiflash
191.3 tiflash
197.7 tiflash
192.1 tiflash
202.6 tiflash  <--- split is finished
115.7 tiflash
102.6 tiflash
103.0 tiflash
101.3 tiflash
101.0 tiflash
103.9 tiflash
100.8 tiflash
102.0 tiflash
102.5 tiflash
68.2  tiflash
4.7   tiflash
0.0   tiflash

After the fix:

there are only 14 retry attempts with exp backoff:

❯ cat ~/.tiup/data/my_test_fix/tiflash-0/log/tiflash.log | grep "Give up merge delta, segment"
[2022/07/19 20:58:44.240 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:44.295 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:44.399 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:44.604 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:45.007 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:45.812 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:46.815 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:47.821 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:48.825 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:49.826 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:50.831 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:51.832 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:52.838 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]
[2022/07/19 20:58:53.844 +08:00] [DEBUG] [DeltaMergeStore.cpp:2218] ["DeltaMergeStore:db_1.t_72 Give up merge delta, segment [1]"] [thread_id=25]

The CPU usage keeps around 100% (first 11s for split, next 10s for mergeDelta):

4.3  tiflash
21.4 tiflash  <--- split begin, mergeDelta begin
100.2 tiflash
103.1 tiflash
95.2  tiflash
89.1  tiflash
99.4  tiflash
100.8 tiflash
101.7 tiflash
103.2 tiflash
102.4 tiflash
97.8  tiflash  <--- split is finished
71.7  tiflash
102.8 tiflash
99.1  tiflash
102.6 tiflash
102.5 tiflash
95.5  tiflash
100.7 tiflash
105.1 tiflash
101.5 tiflash
103.4 tiflash
49.9  tiflash
4.2   tiflash

Note: As there is maximum 1s backoff, the CPU usage dropped for a short while when split was finished and the mergeDelta was not yet started.

  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Reduce unnecessary CPU usage in some edge cases

@ti-chi-bot
Copy link
Member Author

ti-chi-bot commented Jul 20, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • JaySon-Huang
  • breezewish

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/cherry-pick-not-approved release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 20, 2022
@ti-chi-bot ti-chi-bot added the type/cherry-pick-for-release-6.1 This PR is cherry-picked to release-6.1 from a source PR. label Jul 20, 2022
@breezewish
Copy link
Member

The merge conflict is caused by #5296 not merged.

@VelocityLight VelocityLight added cherry-pick-approved Cherry pick PR approved by release team. and removed do-not-merge/cherry-pick-not-approved labels Jul 20, 2022
@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 20, 2022
@JaySon-Huang
Copy link
Contributor

@breezewish can you resolve the conflicts and make this pr merged?

@breezewish breezewish closed this Aug 17, 2022
@breezewish breezewish reopened this Aug 17, 2022
@breezewish breezewish force-pushed the cherry-pick-5410-to-release-6.1 branch from fba64c5 to af90192 Compare August 17, 2022 17:14
@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2022
@breezewish breezewish force-pushed the cherry-pick-5410-to-release-6.1 branch from af90192 to f1025db Compare August 17, 2022 17:28
@breezewish
Copy link
Member

/run-all-tests

@sre-bot
Copy link
Collaborator

sre-bot commented Aug 17, 2022

Coverage for changed files

Filename                                                                                                        Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/home/jenkins/agent/workspace/tiflash-build-common/tiflash/dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp        1457               511    64.93%          67                 5    92.54%        2049               464    77.35%         842               387    54.04%
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                                              1457               511    64.93%          67                 5    92.54%        2049               464    77.35%         842               387    54.04%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
18253      9782             46.41%    204761  98030        52.12%

full coverage report (for internal network access only)

Copy link
Contributor

@JaySon-Huang JaySon-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Aug 18, 2022
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Aug 18, 2022
@breezewish
Copy link
Member

/merge

@ti-chi-bot
Copy link
Member Author

@breezewish: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member Author

This pull request has been accepted and is ready to merge.

Commit hash: f1025db

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Aug 18, 2022
@ti-chi-bot
Copy link
Member Author

@ti-chi-bot: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@JaySon-Huang
Copy link
Contributor

/run-unit-test

@breezewish
Copy link
Member

/run-all-tests

@sre-bot
Copy link
Collaborator

sre-bot commented Aug 18, 2022

Coverage for changed files

Filename                                                                                                        Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
/home/jenkins/agent/workspace/tiflash-build-common/tiflash/dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp        1457               511    64.93%          67                 5    92.54%        2049               464    77.35%         842               389    53.80%
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                                              1457               511    64.93%          67                 5    92.54%        2049               464    77.35%         842               389    53.80%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
18256      9781             46.42%    204829  97999        52.16%

full coverage report (for internal network access only)

@ti-chi-bot ti-chi-bot merged commit 4f6cf29 into pingcap:release-6.1 Aug 18, 2022
@jayl1e jayl1e added this to the v6.1.1 milestone Aug 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-approved Cherry pick PR approved by release team. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/cherry-pick-for-release-6.1 This PR is cherry-picked to release-6.1 from a source PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants