Support Multi-tenant and Sharded TSO #5895
Comments
ref #5895 Signed-off-by: Ryan Leung <rleungx@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ref #5895 Some code refinements for `serviceModeKeeper`. Signed-off-by: JmPotato <ghzpotato@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ref #5895 Signed-off-by: Ryan Leung <rleungx@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Meeting notes of TSO wg sync-up (attendees: @rleungx , @lhy1024 , @hnes , @binshi-bing ):
ref #5895 Signed-off-by: Ryan Leung <rleungx@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
After internal sync within the PD TSO Working Group, here is the roadmap to land on TSO microservice:
ref #5895 Fix the problem described above by serializing the tso stream creation. Signed-off-by: Bin Shi <binshi.bing@gmail.com> Co-authored-by: lhy1024 <admin@liudos.us>
ref #5895 Add general tso forward/dispatcher for independent pd(tso)/tso services and cross cluster forwarding. Signed-off-by: Bin Shi <binshi.bing@gmail.com>
ref #5895 Support basic functions of multi-keyspace-group management Signed-off-by: Bin Shi <binshi.bing@gmail.com>
ref #5895 Signed-off-by: Ryan Leung <rleungx@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ref #5895 - Refine the TSO allocator manager parameters. - Always run `tsoAllocatorLoop` to advance the Global TSO. Signed-off-by: JmPotato <ghzpotato@gmail.com>
ref tikv#5895 Add benchmarks for keyspace assignment patrol. Signed-off-by: JmPotato <ghzpotato@gmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895 mcs, tso: change keyspace group primary path. The path for non-default keyspace group primary election changes from "/ms/{cluster_id}/tso/{group}/primary" to "/ms/{cluster_id}/tso/keyspace_groups/election/{group}/primary". Default keyspace group keeps /ms/{cluster_id}/tso/00000/primary. Signed-off-by: Bin Shi <binshi.bing@gmail.com>
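The path rule described in this commit can be sketched as a small helper. This is only an illustration: `primaryElectionPath` and `defaultKeyspaceGroupID` are hypothetical names, and the five-digit zero padding of `{group}` is an assumption inferred from the `00000` segment of the default path, not something the commit message states for non-default groups.

```go
package main

import "fmt"

// defaultKeyspaceGroupID is assumed to be 0, matching the "00000"
// segment of the default keyspace group path quoted in the commit.
const defaultKeyspaceGroupID uint32 = 0

// primaryElectionPath is a hypothetical helper that builds the primary
// election path for a keyspace group. The default group keeps the old
// layout; every other group moves under "keyspace_groups/election".
// Zero-padding the group ID to five digits is an assumption.
func primaryElectionPath(clusterID uint64, groupID uint32) string {
	if groupID == defaultKeyspaceGroupID {
		return fmt.Sprintf("/ms/%d/tso/%05d/primary", clusterID, groupID)
	}
	return fmt.Sprintf("/ms/%d/tso/keyspace_groups/election/%05d/primary", clusterID, groupID)
}
```

Under these assumptions, group 0 of cluster 42 maps to `/ms/42/tso/00000/primary`, while group 7 maps to `/ms/42/tso/keyspace_groups/election/00007/primary`.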
ref tikv#5895 Add TestUpgradingAPIandTSOClusters to test that the TSO service can still serve TSO requests normally after the API cluster and then the TSO cluster are restarted. Signed-off-by: Bin Shi <binshi.bing@gmail.com>
ref tikv#5895 Signed-off-by: Ryan Leung <rleungx@gmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895 Signed-off-by: Ryan Leung <rleungx@gmail.com>
ref tikv#5895 Improve tso proxy reliability.
1. Add protection mechanisms to TSO Proxy.
   a. Throttle the concurrency of TSO Proxy streamings. Default 5000.
   b. If TSO Proxy hasn't received a TSO request from the client for 1 hour, close the stream.
2. Optimize forceLoad lock with RW lock.
3. Enable stress test.
4. Add a deadline for API leader requests forwarded to the TSO service.
5. Make the tso response channel safer.
6. Move the tso proxy stress test out of the test suite, as it impacts other test cases.
7. Fix the grpc client connection pool (server side) resource leak problem.
8. Make MaxConcurrentTSOProxyStreamings (5000 as default) and TSOProxyClientRecvTimeout (1 hour as default) configurable.
9. Add metrics tsoProxyHandleDuration, tsoProxyBatchSize and tsoProxyForwardTimeoutCounter.
Signed-off-by: Bin Shi <binshi.bing@gmail.com>
ref tikv#5895 Add failure test cases. Signed-off-by: Bin Shi <binshi.bing@gmail.com>
…eyspace movement state change in the persistent store (tikv#6596) ref tikv#5895 Fix a potential inconsistency caused by applying the state change to the persistent store non-atomically in the following cases: 1. Keyspace group split/merge. 2. Keyspace movement across keyspace groups. Signed-off-by: Bin Shi <binshi.bing@gmail.com>
ref tikv#5895 Signed-off-by: Ryan Leung <rleungx@gmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895 Add more debugging info to time fallback log. [2023/06/27 10:50:54.196 -07:00] [PANIC] [tso_dispatcher.go:764] ["[tso] timestamp fallback"] [dc-location=global] [keyspace=4294967295] [last-ts="(1687888254152, 1)"] [cur-ts="(1687888254052, 2)"] [last-tso-server=127.0.0.1:3380] [cur-tso-server=127.0.0.1:3380] [last-keyspace-group-in-request=0] [cur-keyspace-group-in-request=0] [last-keyspace-group-in-response=0] [cur-keyspace-group-in-response=0] [last-response-received-at=2023/06/27 10:50:54.195 -07:00] [cur-response-received-at=2023/06/27 10:50:54.196 -07:00] Signed-off-by: Bin Shi <binshi.bing@gmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
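The `last-ts` and `cur-ts` fields in this log are (physical, logical) pairs, and the panic fires because the current pair is not greater than the last one. A minimal sketch of such a check follows; `timestamp` and `hasFallenBack` are hypothetical names, and treating an equal timestamp as a fallback is an assumption based on TSO values needing to be strictly increasing.

```go
package main

// timestamp is a hypothetical (physical, logical) pair, mirroring the
// "(1687888254152, 1)" notation used in the log line.
type timestamp struct {
	physical int64
	logical  int64
}

// hasFallenBack reports whether cur fails to be strictly greater than
// last. The physical part is compared first; the logical part breaks
// ties. Treating equality as a fallback is an assumption.
func hasFallenBack(last, cur timestamp) bool {
	if cur.physical != last.physical {
		return cur.physical < last.physical
	}
	return cur.logical <= last.logical
}
```

For the values in the log, the current physical part (1687888254052) is 100 ms behind the last one (1687888254152), so the check reports a fallback even though the logical part increased from 1 to 2.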
ref tikv#5895 Signed-off-by: Ryan Leung <rleungx@gmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895 Signed-off-by: lhy1024 <admin@liudos.us> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895 Signed-off-by: Ryan Leung <rleungx@gmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#5895 Signed-off-by: Ryan Leung <rleungx@gmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Feature Request
Describe your feature request related problem
The current TSO solution is known for its poor scalability and single point of failure. As we move to the cloud to provide DBaaS and serverless computing, supporting multi-tenancy is one of the primary goals, which brings more requirements to the TSO service, including:
Describe the feature you'd like
By sharding TSO service across tenants, we aim to achieve the following goals:
Big Picture
For more details, please refer to the RFC (TODO: add link to the RFC).
Milestone 1 - (goal) deliver single-group tso microservice (Completed)
All work is tracked in #5836
Milestone 2 - (goal) code complete for multi-group tso microservice (4/21/2023) (Completed)
Milestone 3 - (goal) deploy multi-group tso microservice in dev env (5/5/2023) (Note: May 1st - 3rd, holiday)
Milestone 4 - (goal) deploy multi-group tso microservice in staging env (5/29/2023)
- GetMinTS interface tidbcloud/kvproto#16 @rleungx
- GetMinTS interface pingcap/kvproto#1116 @rleungx
(Milestone 4 landing plan) timeline breakdown
Serverless service team confirmed that there are existing E2E test cases to cover BR & GCSafePoint scenarios.
Risk: unexpected compatibility issue
- [x] TSO server stuck issue caused PD out of service during EKS upgrading: "TSO Server Close() gets stuck sometimes" #6530, "Fix tso server close stuck issue" #6529, "Add test case to simulate EKS upgrading (restart the entire API cluster then TSO cluster)" #6534 @binshi-bing
- [x] mcs, tso: change keyspace group primary path. #6526 @binshi-bing
- [x] The api leader got stuck at tso requests forwarding #6549 @binshi-bing
- [x] "Not enough replicas" caused keyspace group split to fail randomly #6550 @rleungx @lhy1024
Milestone 5 - (goal) deploy multi-group tso microservice in prod env (ETA: N/A) (the exact time will be decided by Serverless service team)
Describe alternatives you've considered
Please see "The Alternative Architectures Considered" in the RFC (TODO: add link to the RFC).
Teachability, Documentation, Adoption, Migration Strategy