Enhance metrics for KVStore write #8472

CalvinNeo · 2023-12-06T14:36:33Z

What problem does this PR solve?

Issue Number: close #8471

Problem Summary:

Add tiflash_raft_throughput_bytes: type_write for writes from raftstore, type_write_committed for writes to delta merge
(Add a timeseries panel in Raft)
Add tiflash_raft_raft_frequent_events_count.type_write_commit for every write event to delta merge
Add tiflash_raft_write_flow_bytes: type_ingest_uncommitted for uncommitted write bytes after ingestsst, type_write_committed for committed write bytes, type_net_write for net write bytes (region's size after the write minus region's size before)
(Add two histogram panels in Raft)
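
As a rough illustration of how the three tiflash_raft_write_flow_bytes types relate, here is a minimal Python sketch. It is not TiFlash code: the class WriteFlowRecorder and its method names are hypothetical, and the sizes in the usage lines are made up. The only thing taken from the description above is the definition of net write as the region's size after minus its size before. Under that definition a value can be negative when a write shrinks the in-memory region, which is an assumption about behavior the description does not spell out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WriteFlowRecorder:
    """Hypothetical stand-in for the three write-flow series (not TiFlash code)."""

    ingest_uncommitted: List[int] = field(default_factory=list)
    write_committed: List[int] = field(default_factory=list)
    net_write: List[int] = field(default_factory=list)

    def on_ingest_sst(self, uncommitted_bytes: int) -> None:
        # Bytes still uncommitted in the region after an SST ingest.
        self.ingest_uncommitted.append(uncommitted_bytes)

    def on_write(self, size_before: int, size_after: int, committed_bytes: int) -> None:
        # Bytes that this write event committed.
        self.write_committed.append(committed_bytes)
        # Net write: region size after the write minus region size before it.
        self.net_write.append(size_after - size_before)


rec = WriteFlowRecorder()
# A write that only adds uncommitted data: the region grows by 500 bytes.
rec.on_write(size_before=100, size_after=600, committed_bytes=0)
# A later write that commits those 500 bytes: the region shrinks again.
rec.on_write(size_before=600, size_after=100, committed_bytes=500)
```

In this toy run the net-write series shows the growth and the later shrink, while the committed series only records the second event.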

The KVStore memory tracker test

tiup bench ycsb load tidb -p tidb.instances="127.0.0.1:4000" -p recordcount=10000
tiup bench ycsb run tidb -p tidb.instances="10.2.12.81:5711" -p operationcount=10000

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

JinheLin · 2023-12-07T06:47:41Z

I think we need to add a metric for how much region data has not been committed, since this data is all resident in memory.

JaySon-Huang · 2023-12-07T06:58:25Z

Can you commit the changes to tiflash_summary.json? Or provide a preview URL of the changed Grafana panels.

CalvinNeo · 2023-12-07T08:08:19Z

> I think we need to add a metric for how much region data has not been committed, since this data is all resident in memory.

There is already a metric for this, tiflash_raft_region_flush_bytes.type_unflushed, which is updated on every unaccepted CompactLog to show how much data is in the Region without being flushed.

However, could some other metrics be more suitable and useful than this one?

A metric like type_unflushed but updated more frequently, maybe on every write/commit (tx), so that when a big tx comes in, we can read from this metric that many uncommitted key-values are stuck in this Region. However, this metric seems weird to me, because we can't tell the difference between a single frequently reported region and many infrequently reported regions. Both will paint half of the histogram red above a certain horizontal line, which is misleading. To build a histogram we need a view of the memory taken by all regions at a certain timestamp, or we could have a sum-up line chart.
A metric that provides a sum-up line chart, showing how much memory (roughly, maybe only counting RegionData) all regions in KVStore take. I think this is important, as "Avoid exit for ProxyFFI when alloc failure due to memory exceeds limit #8464" has also pointed out that we need a mechanism to monitor the memory taken by all storage components. If we do so, I think we need a design first.
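
To make the ambiguity concrete, here is a minimal Python sketch with made-up numbers. The threshold, the region sizes and both helper functions are hypothetical and exist only for illustration. It compares a per-report histogram count with a sum over the latest size of each region.

```python
THRESHOLD = 64 * 1024 * 1024  # hypothetical "large region" line, 64 MiB


def reports_above_threshold(reports):
    """Count reports above THRESHOLD; each report is (region_id, size_bytes)."""
    return sum(1 for _, size in reports if size > THRESHOLD)


def total_memory_snapshot(reports):
    """Sum the most recently reported size of every region."""
    latest = {}
    for region_id, size in reports:
        latest[region_id] = size  # later reports overwrite earlier ones
    return sum(latest.values())


big = 100 * 1024 * 1024
# Scenario A: one region reports ten times.
one_hot_region = [(1, big)] * 10
# Scenario B: ten different regions report once each.
many_regions = [(region_id, big) for region_id in range(10)]

same_histogram = reports_above_threshold(one_hot_region) == reports_above_threshold(many_regions)
ratio = total_memory_snapshot(many_regions) // total_memory_snapshot(one_hot_region)
```

Both scenarios put the same number of samples above the line, yet the second one holds ten times the memory, which is the case for a sum-up view.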

In conclusion, I think we should have a metric for how much memory KVStore takes, but it could be handled in the PR for #8464. In this PR, we are trying to provide a more convenient way to diagnose big-transaction problems, especially "invisible" uncommitted big transactions that happen between one CompactLog and the next. I introduced "Net Write In" for this, so the diagnosis could be:

Check memory metrics, and find OOM
[Strong evidence] Check "Raft Events QPS" and see there is a low rate of write_commit
Check "Raft Log Gap" and see that the panel is covered with data points, which means there have been many logs since the last CompactLog (heavy write)
[Strong evidence] Check "Raft Entry Batch Size Heatmap" and see large size of every entry batch
[Strong evidence] Check "Raft handled bytes" and see a large gap between "write_committed" and "write". Note that write - write_committed is not the size of uncommitted data we want, since a Rollback put is created when a transaction rolls back. It will be committed with no effect.
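
The last check can be sketched as a small Python helper. It is an illustration only: the function name, the sample layout and the 0.5 cutoff are hypothetical, and the byte counts are invented rather than read from a real dashboard. As noted above, the gap is a hint and not an exact size of uncommitted data, because Rollback puts are also written.

```python
def suspicious_windows(samples, min_ratio=0.5):
    """Return labels of windows where committed bytes lag far behind written bytes.

    samples: list of (label, write_bytes, write_committed_bytes) per time window.
    min_ratio: hypothetical cutoff for committed/written; tune per workload.
    """
    flagged = []
    for label, written, committed in samples:
        if written == 0:
            continue  # nothing was written, so there is no gap to judge
        if committed / written < min_ratio:
            flagged.append(label)
    return flagged


samples = [
    ("t0", 1000, 950),  # almost everything written was committed
    ("t1", 4000, 200),  # large gap: candidate for a big uncommitted transaction
    ("t2", 0, 0),       # idle window
]
flagged = suspicious_windows(samples)
```

A flagged window is only a reason to look at the other panels in the list, not a verdict on its own.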

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

JinheLin · 2023-12-11T09:35:28Z

tiflash_raft_region_flush_bytes.type_unflushed does not seem able to tell the memory usage of region data, because the flushed uncommitted data can still use a lot of memory.

CalvinNeo · 2023-12-11T13:37:50Z

> tiflash_raft_region_flush_bytes.type_unflushed does not seem able to tell the memory usage of region data, because the flushed uncommitted data can still use a lot of memory.

I think the name is ambiguous and misleading here. What I actually mean is how much memory this region takes when an "unflushing" event takes place, whereas type_flushed is how much memory this region takes when a "flushing" event takes place.

However, I can't come up with a name that is short enough and still conveys what I mean... Maybe it could be type_unflushed_compact_log, but there are too many things related to compact logs, which is a concept from TiKV rather than TiFlash and can also be confusing IMO.

JinheLin · 2023-12-12T02:43:25Z

I know what you mean.

The problems of tiflash_raft_region_flush_bytes are:

Its reporting is triggered by the compact log command, so it is not particularly timely.
It is about the memory usage of each region, not of all regions.

I think we need a MemoryTracker object to track the memory usage of all regions' data.

CalvinNeo · 2023-12-12T07:13:02Z

> I know what you mean.
>
> The problems of tiflash_raft_region_flush_bytes are:
>
> Its reporting is triggered by the compact log command, so it is not particularly timely.
> It is about the memory usage of each region, not of all regions.
>
> I think we need a MemoryTracker object to track the memory usage of all regions' data.

The problem is that it cannot give a "global view" of the memory consumption of all regions, so we can only record the metrics on events like CompactLog.

In TiKV, there is a module named MemoryTrace in https://github.com/tikv/tikv/blob/master/components/tikv_alloc/src/trace.rs. I think maybe we can have something like this to observe all memory-consuming objects.

I suggest we have a design on this.
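
As a starting point for such a design, here is a minimal Python sketch of a global tracker that keeps a running total across regions. It is neither TiFlash's MemoryTracker nor TiKV's MemoryTrace: the class RegionMemoryTracker and its methods are hypothetical and only show the accounting idea, where each update adjusts the total by the change in one region's size.

```python
import threading


class RegionMemoryTracker:
    """Hypothetical global accounting of region data sizes (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._per_region = {}  # region_id -> last known size in bytes
        self._total = 0

    def update(self, region_id, new_bytes):
        # Apply only the delta so the total is always current.
        with self._lock:
            old = self._per_region.get(region_id, 0)
            self._per_region[region_id] = new_bytes
            self._total += new_bytes - old

    def remove(self, region_id):
        # Drop a region, for example after it is destroyed or merged.
        with self._lock:
            self._total -= self._per_region.pop(region_id, 0)

    def total(self):
        with self._lock:
            return self._total


tracker = RegionMemoryTracker()
tracker.update(1, 100)
tracker.update(2, 50)
tracker.update(1, 30)  # region 1 shrank after a flush
```

Because the total is maintained on every update, it can be sampled at any timestamp instead of only on CompactLog.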

JinheLin · 2023-12-13T06:44:47Z

TiFlash has class MemoryTracker for this purpose.

CalvinNeo · 2023-12-13T09:37:02Z

> MemoryTracker

Good idea, I will look into that

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

JaySon-Huang · 2023-12-20T10:43:32Z

Also push the grafana changes to this PR?

CalvinNeo · 2023-12-20T10:49:24Z

> Also push the grafana changes to this PR?

This PR is on hold until the Grafana changes are pushed.

If everything else is OK, could you LGTM this PR first? It took a lot of work to build such a metric with 4+ new panels.

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

CalvinNeo · 2023-12-20T14:11:26Z

/run-all-tests

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

CalvinNeo · 2023-12-20T16:06:04Z

CalvinNeo · 2023-12-20T16:06:20Z

/unhold

CalvinNeo · 2023-12-20T16:06:26Z

/run-all-tests

JaySon-Huang

LGTM

ti-chi-bot · 2023-12-21T03:35:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JaySon-Huang, JinheLin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JaySon-Huang,JinheLin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2023-12-21T03:35:03Z

[LGTM Timeline notifier]

Timeline:

2023-12-19 08:56:13.203748166 +0000 UTC m=+951264.240975077: ☑️ agreed by JinheLin.
2023-12-21 03:35:02.185310355 +0000 UTC m=+1104793.222537304: ☑️ agreed by JaySon-Huang.

ti-chi-bot · 2023-12-21T03:41:11Z

@CalvinNeo: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

trigger some heavy tests which will not run always when PR updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

close pingcap#8471

initial

40345fc

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 6, 2023

CalvinNeo requested review from JaySon-Huang, JinheLin and Lloyd-Pottiger December 7, 2023 02:16

Merge branch 'master' into add-metrics-for-kvstore-write

4767d87

This comment was marked as outdated.

CalvinNeo changed the title ~~Enhance metrics for KVStore write~~ WIP Enhance metrics for KVStore write Dec 7, 2023

ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 7, 2023

CalvinNeo requested a review from flowbehappy December 7, 2023 04:01

CalvinNeo and others added 3 commits December 7, 2023 12:05

enhance

63a09e0

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

Merge branch 'add-metrics-for-kvstore-write' of github.com:CalvinNeo/…

ec5feb3

…tics into add-metrics-for-kvstore-write

Merge branch 'master' into add-metrics-for-kvstore-write

447b8b6

CalvinNeo changed the title ~~WIP Enhance metrics for KVStore write~~ Enhance metrics for KVStore write Dec 7, 2023

ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 7, 2023

CalvinNeo added 2 commits December 7, 2023 17:10

introduce tiflash_raft_throughput_bytes.type_write

a131e07

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

Merge branch 'add-metrics-for-kvstore-write' of github.com:CalvinNeo/…

d712290

…tics into add-metrics-for-kvstore-write

ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 7, 2023

mem tracker

bc93b9e

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

CalvinNeo and others added 3 commits December 20, 2023 20:09

f

e45db5b

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

Merge branch 'add-metrics-for-kvstore-write' of github.com:CalvinNeo/…

c201bcd

…tics into add-metrics-for-kvstore-write

Merge branch 'master' into add-metrics-for-kvstore-write

5f2d42c

CalvinNeo added 2 commits December 20, 2023 23:05

grafana

7d7f30c

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

Merge branch 'add-metrics-for-kvstore-write' of github.com:CalvinNeo/…

a136d58

…tics into add-metrics-for-kvstore-write

ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 20, 2023

CalvinNeo added 3 commits December 20, 2023 23:18

fix

00f4ecf

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

add more panels

1f31be8

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

fix

de186d6

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>

CalvinNeo requested a review from JaySon-Huang December 20, 2023 16:06

ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 20, 2023

JaySon-Huang approved these changes Dec 21, 2023

ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 21, 2023

Merge branch 'master' into add-metrics-for-kvstore-write

afe243f

ti-chi-bot bot merged commit cc76e87 into pingcap:master Dec 21, 2023
6 checks passed

CalvinNeo added a commit to CalvinNeo/tiflash that referenced this pull request Dec 21, 2023

Enhance metrics for KVStore write (pingcap#8472)

ddfcc12

close pingcap#8471

yibin87 pushed a commit to yibin87/tiflash that referenced this pull request Dec 22, 2023

Enhance metrics for KVStore write (pingcap#8472)

5fda92e

close pingcap#8471

JaySon-Huang pushed a commit to JaySon-Huang/tiflash that referenced this pull request Dec 27, 2023

Enhance metrics for KVStore write (pingcap#8472)

4f8a963

close pingcap#8471

CalvinNeo mentioned this pull request Jan 30, 2024

Encryption: fix read encryption key from page storage & add unit tests #8738

Merged

12 tasks
