RFC: Traffic Capture and Replay #643
Codecov Report: all modified and coverable lines are covered by tests ✅

@@           Coverage Diff            @@
##             main     #643   +/-   ##
========================================
  Coverage        ?   69.10%
========================================
  Files           ?       86
  Lines           ?     8052
  Branches        ?        0
========================================
  Hits            ?     5564
  Misses          ?     2090
  Partials        ?      398
To replay the statements in the same order, the statements should be ordered by start time, so the field `Time` records the start time instead of the end time. To write records in start-time order, a statement must stay in memory until all statements that started before it have finished. If a statement runs for too long, it is logged without its query result, and the replay skips checking its result and duration.
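To make the ordering step concrete, here is a minimal Go sketch; every name in it (`Record`, `Orderer`, `flush`, etc.) is hypothetical and not TiProxy's actual API. A min-heap keyed by start time holds each statement until every statement that started earlier has finished, then emits records in start-time order:

```go
package capture

import (
	"container/heap"
	"time"
)

// Record is a hypothetical captured-statement entry.
type Record struct {
	Time     time.Time // start time, used as the ordering key
	Digest   string
	Finished bool
}

// byStartTime is a min-heap of records ordered by start time.
type byStartTime []*Record

func (h byStartTime) Len() int           { return len(h) }
func (h byStartTime) Less(i, j int) bool { return h[i].Time.Before(h[j].Time) }
func (h byStartTime) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *byStartTime) Push(x any)        { *h = append(*h, x.(*Record)) }
func (h *byStartTime) Pop() any {
	old := *h
	n := len(old)
	r := old[n-1]
	*h = old[:n-1]
	return r
}

// Orderer buffers records until all earlier-started statements finish,
// then hands them to the flush callback in start-time order.
type Orderer struct {
	pending byStartTime
	flush   func(*Record)
}

// Start registers a statement when it begins executing.
func (o *Orderer) Start(r *Record) {
	heap.Push(&o.pending, r)
}

// Finish marks a statement done and flushes every record at the head of
// the heap whose predecessors have all completed.
func (o *Orderer) Finish(r *Record) {
	r.Finished = true
	for o.pending.Len() > 0 && o.pending[0].Finished {
		o.flush(heap.Pop(&o.pending).(*Record))
	}
}
```

A timeout path (not shown) would mark a long-running record as finished without its result, so it stops blocking the records queued behind it, matching the skip-checking behavior described above.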
Losing some traffic on failure is acceptable, so TiProxy doesn't need to flush the traffic synchronously. To reduce the impact on the production cluster, TiProxy flushes the traffic in batches asynchronously with double buffering. Similarly, TiProxy loads traffic with double buffering during replay. If IO is too slow and the buffer backlog grows too large, the capture should stop immediately and the status should be reported in the UI and Grafana.
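As an illustration of the double-buffering scheme, here is a minimal Go sketch; `DoubleBuffer`, `maxBufSize`, and the other names are assumptions for this example, not TiProxy's actual implementation. The connection path appends to the active buffer and never blocks on IO; a background goroutine writes a full buffer out while the next one fills, and a failed hand-off signals that capture should stop:

```go
package capture

import (
	"errors"
	"io"
	"sync"
)

const maxBufSize = 1 << 20 // assumed 1 MiB flush threshold

// ErrOverrun signals that IO cannot keep up and capture should stop
// immediately rather than stall production traffic.
var ErrOverrun = errors.New("traffic buffer overrun, stopping capture")

type DoubleBuffer struct {
	mu      sync.Mutex
	active  []byte      // buffer currently accepting writes
	flushCh chan []byte // unbuffered: hands a full buffer to the writer
	errCh   chan error
}

func NewDoubleBuffer(w io.Writer) *DoubleBuffer {
	b := &DoubleBuffer{
		active:  make([]byte, 0, maxBufSize),
		flushCh: make(chan []byte),
		errCh:   make(chan error, 1),
	}
	// Background writer: flushes one buffer at a time.
	go func() {
		for buf := range b.flushCh {
			if _, err := w.Write(buf); err != nil {
				b.errCh <- err
				return
			}
		}
	}()
	return b
}

// Append adds captured bytes without ever blocking on IO. The hand-off
// on the unbuffered channel succeeds only when the writer goroutine is
// idle, so at most one buffer is being written while the other fills.
// (A production version would recycle the two buffers instead of
// allocating a fresh one.)
func (b *DoubleBuffer) Append(p []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	select {
	case err := <-b.errCh: // surface a background write error
		return err
	default:
	}
	b.active = append(b.active, p...)
	if len(b.active) < maxBufSize {
		return nil
	}
	select {
	case b.flushCh <- b.active:
		b.active = make([]byte, 0, maxBufSize)
		return nil
	default:
		return ErrOverrun // the previous buffer is still being flushed
	}
}
```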
Could we estimate the performance impact of capture/replay on TiProxy?
Yes. For capture, the latency regression should be under 1% and the throughput regression under 5%. For replay, the QPS should be higher than when clients send the requests directly.
But I think these targets should be specified in the spec, not in the design.
What problem does this PR solve?
Issue Number: ref #642
Problem Summary:
There are cases where users want to capture the traffic on a production cluster and replay it on a testing cluster.
What is changed and how it works:
This PR proposes a design for capturing traffic on the production cluster and replaying it on a testing cluster to verify the SQL compatibility and performance of the new cluster.
Check List
Tests
Notable changes
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.