
RFC: Traffic Capture and Replay #643

Open · wants to merge 2 commits into base: main
Conversation

djshow832 (Collaborator)

What problem does this PR solve?

Issue Number: ref #642

Problem Summary:
There are several cases in which users want to capture the traffic on a production cluster and replay it on a testing cluster:

  • A new TiDB version may introduce compatibility breakers, such as statements failing, running slower, or returning different query results.
  • When the cluster behaves unexpectedly, users want to capture the traffic so that they can investigate it later by replaying it.
  • Users want to test the maximum throughput of a scaled-up or scaled-down cluster using the real production workload instead of standard benchmark tools.

What is changed and how it works:
This PR proposes a design for capturing traffic on the production cluster and replaying it on a testing cluster to verify the SQL compatibility and performance of the new cluster.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Notable changes

  • Has configuration change
  • Has HTTP API interfaces change
  • Has tiproxyctl change
  • Other user behavior changes

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot requested review from bb7133 and xhebox August 27, 2024 09:58
@ti-chi-bot ti-chi-bot bot added the size/L label Aug 27, 2024
@djshow832 djshow832 changed the title RFC: traffic capture and replay RFC: Traffic Capture and Replay Aug 27, 2024
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload a report for BASE (main@34459ec).

@@           Coverage Diff           @@
##             main     #643   +/-   ##
=======================================
  Coverage        ?   69.10%
  Files           ?       86
  Lines           ?     8052
  Branches        ?        0
  Hits            ?     5564
  Misses          ?     2090
  Partials        ?      398
=======================================

Flag   Coverage Δ
unit   69.10% <ø> (?)

@djshow832 djshow832 mentioned this pull request Aug 28, 2024

ti-chi-bot bot commented Aug 29, 2024

@djshow832: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name       Commit    Required  Rerun command
pull-unit-test  4e23b06   true      /test unit-test


To replay the statements in the same order, they must be ordered by start time, so the field `Time` records the start time instead of the end time. To order by start time, a statement must stay in memory until all previously started statements have finished. If a statement runs too long, it is logged without query result information, and the replay skips checking its result and duration.

Losing a small amount of traffic during a failure is acceptable, so TiProxy doesn't need to flush the traffic synchronously. To reduce the effect on the production cluster, TiProxy flushes the traffic asynchronously in batches using double buffering. Similarly, TiProxy loads traffic with double buffering during replay. If the IO is too slow and the buffer backlog grows too large, the capture should stop immediately and the status should be reported in the UI and Grafana.
Member

Could we estimate the performance impact of captures/replays to TiProxy?

djshow832 (Collaborator, Author) commented Aug 30, 2024

Yes. For capture, the latency regression should be under 1% and the throughput regression under 5%. For replay, the QPS should be higher than when the clients send the requests directly.
But I think this should be specified in the spec, not the design.
