
ddl: Fix the physically drop storage instance may block removing regions #9442

Merged

Conversation


@JaySon-Huang JaySon-Huang commented Sep 19, 2024

What problem does this PR solve?

Issue Number: close #9437

Problem Summary:

When a table has been dropped in TiDB and its drop timestamp exceeds the gc_safepoint, TiFlash generates an InterpreterDropQuery to physically remove the table's data from the TiFlash instance. The "DropQuery" calls DeltaMergeStore::dropAllSegments, which removes the segments in the DeltaTree storage one by one. Because the table contains many segments, running DeltaMergeStore::dropAllSegments is slow.
https://github.com/pingcap/tiflash/blob/v7.1.3/dbms/src/TiDB/Schema/SchemaSyncService.cpp#L216-L227
https://github.com/pingcap/tiflash/blob/v7.1.3/dbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp#L342-L349
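For reference, below is a minimal, simplified C++ sketch (not the actual TiFlash code; the type and function names are placeholders) of why this path is slow: segments are dropped strictly one by one, so the wall-clock time of the physical drop grows with the number of segments in the table.

```cpp
#include <cstdint>
#include <map>

// Placeholder for a DeltaTree segment; the real Segment holds delta/stable data.
struct Segment
{
    uint64_t segment_id;
};

// Simplified stand-in for DeltaMergeStore::dropAllSegments: in the real
// implementation every iteration also writes metadata changes (WAL / page
// storage), so clearing ~30,000 segments takes a long time.
void dropAllSegmentsSketch(std::map<uint64_t, Segment> & segments)
{
    for (auto it = segments.begin(); it != segments.end();)
        it = segments.erase(it); // drop one segment at a time
}
```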

[2024/09/16 21:29:29.604 +00:00] [INFO] [SchemaSyncService.cpp:170] ["Performing GC using safe point 452597098965106688"] [source="keyspace=4294967295"] [thread_id=4314]
[2024/09/16 21:29:29.610 +00:00] [INFO] [SchemaSyncService.cpp:216] ["Physically dropping table test_db(694).test_tbl(563080)"] [source="keyspace=4294967295"] [thread_id=4314]
[2024/09/16 21:29:30.446 +00:00] [INFO] [DeltaMergeStore.cpp:413] ["Clear DeltaMerge segments data"] [source="keyspace_id=4294967295 table_id=563080"] [thread_id=4314]
[2024/09/16 21:29:30.465 +00:00] [INFO] [Segment.cpp:2004] ["Finish segment drop its next segment, segment=<segment_id=292469 epoch=2 range=[?,?) next_segment_id=292472 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_1174749 stable_rows=98304 stable_bytes=576748236 dmf_rows=98304 dmf_bytes=576748236 dmf_packs=12>"] [source="keyspace_id=4294967295 table_id=563080 segment_id=292469 epoch=2"] [thread_id=4314]
[2024/09/16 21:29:30.553 +00:00] [INFO] [Segment.cpp:2004] ["Finish segment drop its next segment, segment=<segment_id=292223 epoch=3 range=[?,?) next_segment_id=292469 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_1205953 stable_rows=86440 stable_bytes=507435226 dmf_rows=86440 dmf_bytes=507435226 dmf_packs=11>"] [source="keyspace_id=4294967295 table_id=563080 segment_id=292223 epoch=3"] [thread_id=4314]
...

However, the physical drop does not check that all Regions of the table have already been removed from TiFlash before it starts, so Region-removal actions run concurrently with it.
Every raftstore thread that tries to remove a Region runs into removeObsoleteDataInStorage, which calls DeltaMergeStore::flushCache and tries to acquire a read latch on table_id=563080. In the end, all the raftstore threads are blocked by the thread that is executing DeltaMergeStore::dropAllSegments.
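The contention can be illustrated with the following hedged sketch (the latch and function names are hypothetical stand-ins, not the real TiFlash API): the drop thread holds the table's latch exclusively for the entire segment-removal loop, while every raftstore thread that removes a Region needs a shared latch on the same table and therefore blocks until the drop finishes.

```cpp
#include <shared_mutex>

std::shared_mutex table_latch; // per-table latch; table_id=563080 in the logs above

// Runs for minutes when the table has ~30,000 segments.
void physicalDropTableSketch()
{
    std::unique_lock<std::shared_mutex> write_lock(table_latch);
    // ... drop segments one by one (see the previous sketch) ...
}

// Called by each raftstore thread via removeObsoleteDataInStorage.
void removeObsoleteDataSketch()
{
    std::shared_lock<std::shared_mutex> read_lock(table_latch); // blocks until the drop finishes
    // ... DeltaMergeStore::flushCache / delete the range of the removed Region ...
}
```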


Meanwhile, more raft messages keep arriving at the TiFlash instance, so memory usage grows and eventually causes OOM kills. After each restart, the TiFlash instance runs into the same blocking again. Eventually, all the segments (around 30,000 in total) are removed from TiFlash, and TiFlash starts to catch up on the raft messages.

[2024/09/16 23:25:37.681 +00:00] [INFO] [DeltaMergeStore.cpp:413] ["Clear DeltaMerge segments data"] [source="keyspace_id=4294967295 table_id=563080"] [thread_id=4143]
...
[2024/09/16 23:46:11.570 +00:00] [INFO] [Segment.cpp:2004] ["Finish segment drop its next segment, segment=<segment_id=281216 epoch=2 range=[?,?) next_segment_id=281738 delta_rows=0 delta_bytes=0 delta_deletes=1 stable_file=dmf_1205957 stable_rows=59514 stable_bytes=348855913 dmf_rows=59514 dmf_bytes=348855913 dmf_packs=8>"] [source="keyspace_id=4294967295 table_id=563080 segment_id=281216 epoch=2"] [thread_id=4143]
[2024/09/16 23:46:11.714 +00:00] [INFO] [DeltaMergeStore.cpp:419] ["Clear DeltaMerge segments data done"] [source="keyspace_id=4294967295 table_id=563080"] [thread_id=4143]


What is changed and how it works?

ddl: Fix the issue that physically dropping a storage instance may block removing regions.
Make sure the storage instance is physically dropped only after all of its related regions have been removed (see the sketch below).
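The following rough sketch (assuming hypothetical names such as RegionTable and getRegionCountOfTable; the real check lives in the schema GC path) shows the idea behind the fix: the physical drop is only performed once no Region of the table remains on this TiFlash node, so Region removal can no longer be blocked behind DeltaMergeStore::dropAllSegments.

```cpp
#include <cstdint>
#include <map>

using TableID = int64_t;

// Hypothetical view of how many Regions of each table are still on this node.
struct RegionTable
{
    std::map<TableID, size_t> region_count_by_table;

    size_t getRegionCountOfTable(TableID table_id) const
    {
        auto it = region_count_by_table.find(table_id);
        return it == region_count_by_table.end() ? 0 : it->second;
    }
};

void physicallyDropTable(TableID /*table_id*/)
{
    // drop all segments and remove the storage instance (slow, see above)
}

// Called from the schema GC loop for each table whose tombstone passed the gc_safepoint.
void gcDroppedTable(const RegionTable & region_table, TableID table_id)
{
    if (region_table.getRegionCountOfTable(table_id) != 0)
        return; // not yet safe: retry on the next GC round instead of blocking raftstore threads
    physicallyDropTable(table_id); // safe: no concurrent Region removal for this table
}
```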

Pick PRs:

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Fix the issue that TiFlash may OOM after running `DROP TABLE` on a table with a large number of rows

@ti-chi-bot bot added the do-not-merge/work-in-progress, release-note-none, do-not-merge/cherry-pick-not-approved, and size/L labels on Sep 19, 2024
@JaySon-Huang
Contributor Author

/run-all-tests

@ti-chi-bot bot added the size/XL label and removed the size/L label on Sep 20, 2024
@JaySon-Huang JaySon-Huang changed the title [WIP] ddl: Fix the physically drop storage instance may block removing regions ddl: Fix the physically drop storage instance may block removing regions Sep 20, 2024
@ti-chi-bot bot removed the do-not-merge/work-in-progress label on Sep 20, 2024
@JaySon-Huang
Contributor Author

/run-all-tests

Member

@CalvinNeo CalvinNeo left a comment


lgtm

@ti-chi-bot bot added the needs-1-more-lgtm and approved labels on Sep 20, 2024
Contributor

ti-chi-bot bot commented Sep 20, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CalvinNeo, Lloyd-Pottiger

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [CalvinNeo,Lloyd-Pottiger]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label on Sep 20, 2024
Contributor

ti-chi-bot bot commented Sep 20, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-09-20 08:06:30.130596991 +0000 UTC m=+1207659.871020925: ☑️ agreed by CalvinNeo.
  • 2024-09-20 08:33:56.790137941 +0000 UTC m=+1209306.530561881: ☑️ agreed by Lloyd-Pottiger.

@ti-chi-bot bot added the cherry-pick-approved label and removed the do-not-merge/cherry-pick-not-approved label on Sep 20, 2024
@ti-chi-bot ti-chi-bot bot merged commit 62809fe into pingcap:release-7.1 Sep 20, 2024
5 checks passed
@JaySon-Huang JaySon-Huang deleted the fix_remove_region_block_71 branch September 20, 2024 09:46
@ti-chi-bot bot added the release-note label and removed the release-note-none label on Nov 13, 2024
Labels
approved, cherry-pick-approved, lgtm, release-note, size/XL
Development

Successfully merging this pull request may close these issues.

Removing regions may be blocked for a long time when dropping a table with large amount of data
3 participants