
RFC: Traffic Capture and Replay #643

Open · wants to merge 2 commits into base: main
Conversation

djshow832 (Collaborator)

What problem does this PR solve?

Issue Number: ref #642

Problem Summary:
There are several cases in which users want to capture the traffic on a production cluster and replay it on a testing cluster:

  • A new TiDB version may introduce compatibility breakers, such as statements failing, running slower, or returning different query results.
  • When the cluster behaves unexpectedly, users want to capture the traffic so that they can investigate it later by replaying it.
  • Users want to test the maximum throughput of a scaled-up or scaled-down cluster using the real production workload instead of standard benchmark tools.

What is changed and how it works:
This PR proposes a design for capturing traffic on the production cluster and replaying it on a testing cluster to verify the SQL compatibility and performance of the new cluster.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Notable changes

  • Has configuration change
  • Has HTTP API interfaces change
  • Has tiproxyctl change
  • Other user behavior changes

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot requested review from bb7133 and xhebox August 27, 2024 09:58
@ti-chi-bot ti-chi-bot bot added the size/L label Aug 27, 2024
@djshow832 djshow832 changed the title RFC: traffic capture and replay RFC: Traffic Capture and Replay Aug 27, 2024
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload a report for BASE (main@34459ec).

@@           Coverage Diff           @@
##             main     #643   +/-   ##
=======================================
  Coverage        ?   69.10%
  Files           ?       86
  Lines           ?     8052
  Branches        ?        0
  Hits            ?     5564
  Misses          ?     2090
  Partials        ?      398
=======================================

Flag   Coverage Δ
unit   69.10% <ø> (?)

@djshow832 djshow832 mentioned this pull request Aug 28, 2024

ti-chi-bot bot commented Aug 29, 2024

@djshow832: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name       Commit    Required  Rerun command
pull-unit-test  4e23b06   true      /test unit-test


To replay the statements in the same order, they must be ordered by start time, so the field `Time` records the start time instead of the end time. To order by start time, a statement must stay in memory until all previously started statements have finished. If a statement runs too long, it is logged without query result information, and the replay skips checking its result and duration.

Losing a small amount of traffic during a failure is acceptable, so TiProxy doesn't need to flush the traffic synchronously. To reduce the effect on the production cluster, TiProxy flushes the traffic asynchronously in batches using double buffering. Similarly, TiProxy loads traffic with double buffering during replay. If the IO is too slow and the buffer backlog grows too large, the capture should stop immediately and the status should be reported in the UI and Grafana.
Member

Could we estimate the performance impact of captures/replays to TiProxy?

djshow832 (Collaborator, Author) commented Aug 30, 2024

Yes. For capture, the latency regression should be under 1% and the throughput regression under 5%. For replay, the QPS should be higher than when the clients send the requests directly.
But I think this should be specified in the spec, not the design.
