-
Notifications
You must be signed in to change notification settings - Fork 288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics(ticdc): add some log and metrics to owner and processorManage… #4402
metrics(ticdc): add some log and metrics to owner and processorManage… #4402
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
/run-all-tests |
10e35c4
to
ced7a7d
Compare
/run-all-tests |
cdc/owner/owner.go
Outdated
if costTime > changefeedLogsWarnDuration { | ||
log.Warn("changefeed tick took too long", zap.String("changefeed", changefeedID), zap.Duration("duration", costTime)) | ||
} | ||
changefeedReactorTickDuration.WithLabelValues(changefeedID).Observe(costTime.Seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid using WithLabelValues
to create metric observer in hot path, which will cost too much CPU.
A better solution is to pre allocate prometheus labels and reuse the observer, such as what we use in processor
tiflow/cdc/processor/processor.go
Lines 240 to 248 in 99389d6
metricResolvedTsGauge: resolvedTsGauge.WithLabelValues(changefeedID, advertiseAddr), | |
metricResolvedTsLagGauge: resolvedTsLagGauge.WithLabelValues(changefeedID, advertiseAddr), | |
metricMinResolvedTableIDGuage: resolvedTsMinTableIDGauge.WithLabelValues(changefeedID, advertiseAddr), | |
metricCheckpointTsGauge: checkpointTsGauge.WithLabelValues(changefeedID, advertiseAddr), | |
metricCheckpointTsLagGauge: checkpointTsLagGauge.WithLabelValues(changefeedID, advertiseAddr), | |
metricMinCheckpointTableIDGuage: checkpointTsMinTableIDGauge.WithLabelValues(changefeedID, advertiseAddr), | |
metricSyncTableNumGauge: syncTableNumGauge.WithLabelValues(changefeedID, advertiseAddr), | |
metricProcessorErrorCounter: processorErrorCounter.WithLabelValues(changefeedID, advertiseAddr), | |
metricSchemaStorageGcTsGauge: processorSchemaStorageGcTsGauge.WithLabelValues(changefeedID, advertiseAddr), |
ced7a7d
to
ad7df9d
Compare
d279324
to
c55d844
Compare
/run-all-tests |
/run-kafka-integration-tests |
cdc/owner/changefeed.go
Outdated
@@ -128,6 +132,17 @@ func (c *changefeed) Tick(ctx cdcContext.Context, state *orchestrator.Changefeed | |||
Message: err.Error(), | |||
}) | |||
c.releaseResources(ctx) | |||
} else { | |||
// c.initialize(ctx) | |||
costTime := time.Since(startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be it is better to rewrite the whole if block
as:
err := c.tick()
costTime = time.Since(startTime)
// warn log here
if err != nil {
...
}
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[2022-01-25T02:45:17.902Z] === Failed
[2022-01-25T02:45:17.902Z] === FAIL: pkg/etcd TestGetOwnerRevision (11.99s)
[2022-01-25T02:45:17.902Z] etcd_test.go:418:
[2022-01-25T02:45:17.902Z] Error Trace: etcd_test.go:418
[2022-01-25T02:45:17.902Z] asm_amd64.s:1371
[2022-01-25T02:45:17.902Z] Error: "etcdserver: request timed out" does not contain "context canceled"
[2022-01-25T02:45:17.902Z] Test: TestGetOwnerRevision
[2022-01-25T02:45:17.902Z] etcd_test.go:418:
[2022-01-25T02:45:17.902Z] Error Trace: etcd_test.go:418
[2022-01-25T02:45:17.902Z] asm_amd64.s:1371
[2022-01-25T02:45:17.902Z] Error: "rpc error: code = Unknown desc = context deadline exceeded" does not contain "context canceled"
[2022-01-25T02:45:17.902Z] Test: TestGetOwnerRevision
[2022-01-25T02:45:17.902Z] ==================
[2022-01-25T02:45:17.902Z] WARNING: DATA RACE
[2022-01-25T02:45:17.902Z] Write at 0x00c000f03ea1 by goroutine 163:
[2022-01-25T02:45:17.902Z] testing.(*common).FailNow()
[2022-01-25T02:45:17.902Z] /usr/local/go/src/testing/testing.go:740 +0x4f
[2022-01-25T02:45:17.902Z] testing.(*T).FailNow()
[2022-01-25T02:45:17.902Z] <autogenerated>:1 +0x44
[2022-01-25T02:45:17.902Z] github.com/stretchr/testify/require.Contains()
[2022-01-25T02:45:17.902Z] /go/pkg/mod/github.com/stretchr/testify@v1.7.0/require/require.go:50 +0x124
[2022-01-25T02:45:17.902Z] github.com/pingcap/tiflow/pkg/etcd.TestGetOwnerRevision.func1()
@@ -65,13 +69,34 @@ var ( | |||
Name: "status", | |||
Help: "The status of changefeeds", | |||
}, []string{"changefeed"}) | |||
changefeedTickDuration = prometheus.NewHistogramVec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a screenshot after the metric applied make things more clear, the dashboard template will be added later ?
34108fc
to
8975c79
Compare
/run-all-tests |
/run-all-tests |
Codecov Report
Flags with carried forward coverage won't be shown. Click here to find out more. @@ Coverage Diff @@
## master #4402 +/- ##
================================================
- Coverage 55.6402% 55.4329% -0.2073%
================================================
Files 494 498 +4
Lines 61283 61826 +543
================================================
+ Hits 34098 34272 +174
- Misses 23750 24109 +359
- Partials 3435 3445 +10 |
/merge |
This pull request has been accepted and is ready to merge. Commit hash: 8ef0148
|
@CharlesCheung96: Your PR was out of date, I have automatically updated it for you. At the same time I will also trigger all tests for you: /run-all-tests If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
/run-kafka-integration-tests |
What problem does this PR solve?
Issue Number: close #3884
What is changed and how it works?
add some log and metrics to owner and processorManager to detect stuck scenario
Check List
Code changes
Side effects
Related changes
Release note