Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

txn: Slow txn log #41864

Merged
merged 12 commits into from
Mar 6, 2023
Merged

txn: Slow txn log #41864

merged 12 commits into from
Mar 6, 2023

Conversation

ekexium
Copy link
Member

@ekexium ekexium commented Mar 2, 2023

What problem does this PR solve?

Issue Number: close #41863

Problem Summary:

What is changed and how it works?

Add a session variable tidb_slow_txn_log_threshold. Log transaction events observed by the transaction manager of any transaction that takes longer than the threshold. Setting it to 0 means disabling the log.

The log looks like

[2023/03/02 16:12:19.644 +08:00] [INFO] [txnmanager.go:194] ["slow transaction"] [duration=1.5955ms] [conn=1] [txnStartTS=439810717028974592] [events="[\"{event=\\\"enter txn\\\",gap=0s}\",\"{event=\\\"update pessimistic set v = 2 where v = 1\\\",gap=62.458µs}\",\"{event=\\\"stmt commit\\\",gap=1.157125ms}\",\"{event=\\\"txn end\\\",gap=375.167µs}\"]"]

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Add a session variable `tidb_slow_txn_log_threshold` to log slow transactions.

ekexium added 4 commits March 1, 2023 16:58
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium requested a review from a team as a code owner March 2, 2023 07:31
@ti-chi-bot
Copy link
Member

ti-chi-bot commented Mar 2, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • cfzjywxk
  • zyguan

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 2, 2023
ekexium added 2 commits March 2, 2023 15:39
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
@ti-chi-bot ti-chi-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 2, 2023
@ekexium ekexium requested review from zyguan and cfzjywxk March 2, 2023 08:13
@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 2, 2023
@cfzjywxk
Copy link
Contributor

cfzjywxk commented Mar 2, 2023

Is this change intended to resolve issues like #41471, if so we may need find out a way to show the long-running transactions like show processlist does.

@ti-chi-bot ti-chi-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 2, 2023
@ekexium
Copy link
Member Author

ekexium commented Mar 2, 2023

Is this change intended to resolve issues like #41471, if so we may need find out a way to show the long-running transactions like show processlist does.

No. I'd rather treat them as different problems. This PR is a counterpart of slow queries. I intend to use it as a tool to diagnose tail latencies. Though finished long-running transaction could be logged by it as well.

Signed-off-by: ekexium <eke@fastmail.com>
@cfzjywxk
Copy link
Contributor

cfzjywxk commented Mar 2, 2023

No. I'd rather treat them as different problems. This PR is a counterpart of slow queries. I intend to use it as a tool to diagnose tail latencies. Though finished long-running transaction could logged by it as well.

How to diagnose tail latencies using the long-txn log, I don't quite get it. Could you give an example of it?

@ekexium
Copy link
Member Author

ekexium commented Mar 2, 2023

How to diagnose tail latencies using the long-txn log, I don't quite get it. Could you give an example of it?

For example we see high tail latencies for transactions and high max idle durations in metrics. But we cannot guarantee that the tail latency is caused by idle duration.
With slow txn log like

...,\"{event=\\\"stmt end\\\",gap=18.89µs}\",\"{event=\\\"insert bs_accum_lmt_detail (haveAccumCnt,haveAccumLmt,lastLmtDt,lmtNo,lmtObj,lmtPeriodType,lmtSubSeqNo,rgstBrc,rgstDt,rgstOperr,rgstTm,transSrlNo) values (?,?,?,?,?,?,?,?,?,?,?,?)\\\",gap=218.345562ms}\",...

We can tell that there is an idle duration of 218ms before the INSERT stmt comes. And it might help further diagnosis in client side.

@@ -2175,6 +2175,7 @@ func (s *session) ExecuteStmt(ctx context.Context, stmtNode ast.StmtNode) (sqlex
cmd32 := atomic.LoadUint32(&s.GetSessionVars().CommandValue)
s.SetProcessInfo(stmtNode.Text(), time.Now(), byte(cmd32), 0)
s.txn.onStmtStart(digest.String())
defer sessiontxn.GetTxnManager(s).OnStmtEnd()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to merge the two OnStmtEnd into one #41122 but it looks difficult..

@@ -38,6 +38,8 @@ const (
DefaultLogFormat = "text"
// DefaultSlowThreshold is the default slow log threshold in millisecond.
DefaultSlowThreshold = 300
// DefaultSlowTxnThreshold is the default slow txn log threshold in ms.
DefaultSlowTxnThreshold = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to choose a default-enabled value?

Copy link
Member Author

@ekexium ekexium Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default to 0. It's totally dependent on the application logic to determine what is "slow". And defaulting to 0 will prevent noisy logs from inner transactions.
And that's why this should be a SESSION-scoped variable.

session/txnmanager.go Show resolved Hide resolved
Copy link
Contributor

@zyguan zyguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that query_dur > stmt_dur, that is {event="select ...",gap=100ms} may be caused by slow parsing, we cannot confirm that it's caused by client idle time without the slow query log of that query.

session/txnmanager.go Outdated Show resolved Hide resolved
session/txnmanager.go Outdated Show resolved Hide resolved
// used for slow transaction logs
events []event
lastInstant time.Time
enterTxnInstant time.Time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe lastInstant is unnecessary, we can record the start instant and the duration (since the start instant) of each event, then the gaps can be inferred from these info.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I've considered the alternative. I think they are basically equivalent and either is fine. If you prefer this style I can change it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it’s fine. A field of type time.Time here is not expensive at all.

Signed-off-by: ekexium <eke@fastmail.com>
@ekexium
Copy link
Member Author

ekexium commented Mar 3, 2023

Note that query_dur > stmt_dur, that is {event="select ...",gap=100ms} may be caused by slow parsing, we cannot confirm that it's caused by client idle time without the slow query log of that query.

That's true. And the stat doesn't count the time spent in TiDB after OnTxnEnd. Precisely it's counting from the "txn manager's view". But that helps, too. It's better than nothing. What do you think?

In real cases we should use it together with slow query logs and stmt info. For things like {event="select ...",gap=100ms} we can almost assure it's an idle duration since long parsing/compilation is kinda rare.

@cfzjywxk
Copy link
Contributor

cfzjywxk commented Mar 3, 2023

BTW I think we've added some "event tracking" things in the consistency check enhancement, should we also include the inconsistency events and merge them into a single event tracking framework?

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 3, 2023
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 4, 2023
@cfzjywxk
Copy link
Contributor

cfzjywxk commented Mar 4, 2023

@ekexium
Need to resolve the conflicts.

@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2023
Signed-off-by: ekexium <eke@fastmail.com>
@ekexium
Copy link
Member Author

ekexium commented Mar 6, 2023

/retest

@ekexium
Copy link
Member Author

ekexium commented Mar 6, 2023

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: af9d4df

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 6, 2023
@ti-chi-bot ti-chi-bot merged commit 9e27518 into pingcap:master Mar 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Log slow transactions
4 participants