---
title: TiDB Best Practices on Public Cloud
summary: Learn about the best practices for deploying TiDB on public cloud.
---

# TiDB Best Practices on Public Cloud

Public cloud infrastructure has become an increasingly popular choice for deploying and managing TiDB. However, deploying TiDB on public cloud requires careful consideration of several critical factors, including performance tuning, cost optimization, reliability, and scalability.

This document covers various essential best practices for deploying TiDB on public cloud, such as reducing compaction I/O flow in KV RocksDB, using a dedicated disk for Raft Engine, optimizing costs for cross-AZ traffic, mitigating Google Cloud live migration events, and fine-tuning the PD server in large clusters. By following these best practices, you can maximize the performance, cost efficiency, reliability, and scalability of your TiDB deployment on public cloud.

## Reduce compaction I/O flow in KV RocksDB

As the storage engine of TiKV, RocksDB is used to store user data. Because the provisioned IO throughput on cloud EBS is usually limited due to cost considerations, RocksDB might exhibit high write amplification, and the disk throughput might become the bottleneck for the workload. As a result, the total number of pending compaction bytes grows over time and triggers flow control, which indicates that TiKV lacks sufficient disk bandwidth to keep up with the foreground write flow.
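For reference, flow control is triggered by the pending compaction bytes limits in the TiKV configuration. The following is a minimal sketch of where these thresholds live; the values shown are typical defaults and might differ in your TiKV version:

```toml
[storage.flow-control]
# Enable TiKV-side flow control (it replaces the RocksDB write stall mechanism).
enable = true
# Foreground writes start being throttled once pending compaction bytes exceed this soft limit.
soft-pending-compaction-bytes-limit = "192GB"
# Writes are rejected once pending compaction bytes exceed this hard limit.
hard-pending-compaction-bytes-limit = "1000GB"
```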

To alleviate the bottleneck caused by limited disk throughput, you can improve performance by enabling Titan. If your average row size is smaller than 512 bytes, Titan is not applicable; in that case, you can instead increase all the compression levels.
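To check which case applies to your workload, you can approximate the average row size from table statistics. The following is a rough sketch based on `information_schema`; the `AVG_ROW_LENGTH` values depend on the statistics collected, so treat them as an estimate:

```sql
-- Approximate average row size (in bytes) of each user table, based on table statistics.
SELECT TABLE_SCHEMA, TABLE_NAME, AVG_ROW_LENGTH
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'INFORMATION_SCHEMA', 'PERFORMANCE_SCHEMA', 'METRICS_SCHEMA')
ORDER BY AVG_ROW_LENGTH DESC;
```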

### Enable Titan

Titan is a high-performance RocksDB plugin for key-value separation, which can reduce write amplification in RocksDB when large values are used.

If your average row size is larger than 512 bytes, you can enable Titan to reduce the compaction I/O flow as follows, with `min-blob-size` set to `"512B"` or `"1KB"` and `blob-file-compression` set to `"zstd"`:

```toml
[rocksdb.titan]
enabled = true
[rocksdb.defaultcf.titan]
min-blob-size = "1KB"
blob-file-compression = "zstd"
```

> **Note:**
>
> When Titan is enabled, there might be a slight performance degradation for range scans on the primary key. For more information, see Impact of `min-blob-size` on performance.

### Increase all the compression levels

If your average row size is smaller than 512 bytes, you can increase all the compression levels of the default column family to `"zstd"` as follows:

```toml
[rocksdb.defaultcf]
compression-per-level = ["zstd", "zstd", "zstd", "zstd", "zstd", "zstd", "zstd"]
```
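Using `"zstd"` on every level trades some additional compression CPU for smaller SST files, which reduces the compaction I/O flow that the limited cloud disk has to absorb.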

## Use a dedicated disk for Raft Engine

The Raft Engine in TiKV plays a critical role similar to that of a write-ahead log (WAL) in traditional databases. To achieve optimal performance and stability, it is crucial to allocate a dedicated disk for the Raft Engine when you deploy TiDB on public cloud. The following `iostat` output shows the I/O characteristics on a TiKV node with a write-heavy workload:

```
Device            r/s     rkB/s       w/s     wkB/s      f/s  aqu-sz  %util
sdb           1649.00 209030.67   1293.33 304644.00    13.33    5.09  48.37
sdd           1033.00   4132.00   1141.33  31685.33   571.00    0.94 100.00
```

The device `sdb` is used for KV RocksDB, while `sdd` is used to store Raft Engine logs. Note that `sdd` has a significantly higher `f/s` value, which represents the number of flush requests completed per second for the device. In Raft Engine, when a write in a batch is marked as synchronous, the batch leader calls `fdatasync()` after writing, which guarantees that the buffered data is flushed to the storage. By using a dedicated disk for Raft Engine, TiKV reduces the average queue length of requests, thereby ensuring optimal and stable write latency.
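For example, in a self-managed deployment, you can point Raft Engine at the dedicated disk through the TiKV configuration. The following is a minimal sketch that assumes the dedicated disk is mounted at `/mnt/raft-engine`, which is a placeholder path; replace it with your own mount point:

```toml
[raft-engine]
enable = true
# Store Raft Engine logs on the dedicated disk (the mount point below is an assumed example).
dir = "/mnt/raft-engine/raft-engine"
```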

Different cloud providers offer various disk types with different performance characteristics, such as IOPS and MBPS. Therefore, it is important to choose an appropriate cloud provider, disk type, and disk size based on your workload.

### Choose appropriate disks for Raft Engine on public clouds

This section outlines best practices for choosing appropriate disks for Raft Engine on different public clouds. Depending on performance requirements, two types of recommended disks are available.

#### Middle-range disk

The following are recommended middle-range disks for different public clouds:

- On AWS, gp3 is recommended. The gp3 volume offers a free allocation of 3000 IOPS and 125 MB/s throughput, regardless of the volume size, which is usually sufficient for the Raft Engine.

- On Google Cloud, pd-ssd is recommended. The IOPS and MBPS vary depending on the allocated disk size. To meet performance requirements, it is recommended to allocate 200 GB for Raft Engine. Although Raft Engine does not require such a large space, it ensures optimal performance.

- On Azure, Premium SSD v2 is recommended. Similar to AWS gp3, Premium SSD v2 provides a free allocation of 3000 IOPS and 125 MB/s throughput, regardless of the volume size, which is usually sufficient for Raft Engine.

#### High-end disk

If you expect an even lower latency for Raft Engine, consider using high-end disks. The following are recommended high-end disks for different public clouds:

- On AWS, io2 is recommended. Disk size and IOPS can be provisioned according to your specific requirements.

- On Google Cloud, pd-extreme is recommended. Disk size, IOPS, and MBPS can be provisioned, but it is only available on instances with more than 64 CPU cores.

- On Azure, ultra disk is recommended. Disk size, IOPS, and MBPS can be provisioned according to your specific requirements.

### Example 1: Run a social network workload on AWS

AWS offers 3000 IOPS and 125 MB/s throughput for a 20 GB gp3 volume.

By using a dedicated 20 GB gp3 Raft Engine disk on AWS for a write-intensive social network application workload, the following improvements are observed, while the estimated cost increases by only 0.4%:

- a 17.5% increase in QPS (queries per second)
- an 18.7% decrease in average latency for insert statements
- a 45.6% decrease in p99 latency for insert statements

| Metric | Shared Raft Engine disk | Dedicated Raft Engine disk | Difference (%) |
| ------ | ----------------------- | -------------------------- | -------------- |
| QPS (K/s) | 8.0 | 9.4 | 17.5 |
| AVG Insert Latency (ms) | 11.3 | 9.2 | -18.7 |
| P99 Insert Latency (ms) | 29.4 | 16.0 | -45.6 |

### Example 2: Run TPC-C/Sysbench workload on Azure

By using a dedicated 32 GB ultra disk for Raft Engine on Azure, the following improvements are observed:

- Sysbench `oltp_read_write` workload: a 17.8% increase in QPS and a 15.6% decrease in average latency.
- TPC-C workload: a 27.6% increase in QPS and a 23.1% decrease in average latency.

| Metric | Workload | Shared Raft Engine disk | Dedicated Raft Engine disk | Difference (%) |
| ------ | -------- | ----------------------- | -------------------------- | -------------- |
| QPS (K/s) | Sysbench oltp_read_write | 60.7 | 71.5 | 17.8 |
| QPS (K/s) | TPC-C | 23.9 | 30.5 | 27.6 |
| AVG Latency (ms) | Sysbench oltp_read_write | 4.5 | 3.8 | -15.6 |
| AVG Latency (ms) | TPC-C | 3.9 | 3.0 | -23.1 |

### Example 3: Attach a dedicated pd-ssd disk on Google Cloud for Raft Engine in the TiKV manifest

The following TiKV configuration example shows how to attach an additional 512 GB pd-ssd disk to a cluster on Google Cloud deployed by TiDB Operator, with `raft-engine.dir` configured to store Raft Engine logs on this dedicated disk.

```yaml
tikv:
    config: |
      [raft-engine]
        dir = "/var/lib/raft-pv-ssd/raft-engine"
        enable = true
        enable-log-recycle = true
    requests:
      storage: 4Ti
    storageClassName: pd-ssd
    storageVolumes:
    - mountPath: /var/lib/raft-pv-ssd
      name: raft-pv-ssd
      storageSize: 512Gi
```
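In this manifest, `raft-engine.dir` points to a subdirectory under the `mountPath` of the `raft-pv-ssd` volume, so keep the two settings consistent if you change the mount point.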

## Optimize cost for cross-AZ network traffic

Deploying TiDB across multiple availability zones (AZs) can lead to increased costs due to cross-AZ data transfer fees. To optimize costs, it is important to reduce cross-AZ network traffic.

To reduce cross-AZ read traffic, you can enable the Follower Read feature, which allows TiDB to prioritize selecting replicas in the same availability zone. To enable this feature, set the `tidb_replica_read` variable to `closest-replicas` or `closest-adaptive`.
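For example, the following statement enables same-AZ reads for the whole cluster:

```sql
-- Prefer replicas in the same availability zone as the TiDB node that handles the query.
SET GLOBAL tidb_replica_read = 'closest-replicas';
```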

To reduce cross-AZ write traffic in TiKV instances, you can enable the gRPC compression feature, which compresses data before transmitting it over the network. The following configuration example shows how to enable `gzip` gRPC compression for TiKV.

```yaml
server_configs:
  tikv:
    server.grpc-compression-type: gzip
```
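Note that gzip compression reduces cross-AZ network traffic at the cost of additional CPU usage on TiKV, so monitor CPU utilization after enabling it.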

To reduce the network traffic caused by the data shuffle of TiFlash MPP tasks, it is recommended to deploy multiple TiFlash instances in the same availability zone (AZ). Starting from v6.6.0, compression exchange is enabled by default, which reduces the network traffic caused by MPP data shuffle.

## Mitigate live migration maintenance events on Google Cloud

The Live Migration feature of Google Cloud enables VMs to be seamlessly migrated between hosts without causing downtime. However, these migration events, although infrequent, can significantly impact the performance of VMs, including those running in a TiDB cluster. During such events, affected VMs might experience reduced performance, leading to longer query processing times in the TiDB cluster.

To detect live migration events initiated by Google Cloud and mitigate the performance impact of these events, TiDB provides a watching script based on Google's metadata example. You can deploy this script on TiDB, TiKV, and PD nodes to detect maintenance events. When a maintenance event is detected, appropriate actions can be taken automatically as follows to minimize disruption and optimize the cluster behavior:

- TiDB: Takes the TiDB node offline by cordoning it and deleting the TiDB pod. This assumes that the node pool of the TiDB instance is set to auto-scale and dedicated to TiDB. Other pods running on the node might experience interruptions, and the cordoned node is expected to be reclaimed by the auto-scaler.
- TiKV: Evicts leaders on the affected TiKV store during maintenance.
- PD: Resigns the leader if the current PD instance is the PD leader.

It is important to note that this watching script is specifically designed for TiDB clusters deployed using TiDB Operator, which offers enhanced management functionalities for TiDB in Kubernetes environments.

By utilizing the watching script and taking necessary actions during maintenance events, TiDB clusters can better handle live migration events on Google Cloud and ensure smoother operations with minimal impact on query processing and response times.

## Tune PD for a large-scale TiDB cluster with high QPS

In a TiDB cluster, a single active Placement Driver (PD) server is used to handle crucial tasks such as serving the TSO (Timestamp Oracle) and processing requests. However, relying on a single active PD server can limit the scalability of TiDB clusters.

### Symptoms of PD limitation

The following diagrams show the symptoms of a large-scale TiDB cluster consisting of three PD servers, each equipped with 56 CPUs. From these diagrams, it is observed that when the queries per second (QPS) exceed 1 million and the TSO (Timestamp Oracle) requests per second exceed 162,000, the CPU utilization reaches approximately 4,600%. This high CPU utilization indicates that the PD leader is experiencing a significant load and is running out of available CPU resources.

(Figures: PD server CPU utilization and PD server TSO metrics)

### Tune PD performance

To address the high CPU utilization issue in the PD server, you can make the following tuning adjustments:

#### Adjust PD configuration

`tso-update-physical-interval`: This parameter controls the interval at which the PD server updates the physical TSO batch. By reducing the interval, the PD server can allocate TSO batches more frequently, thereby reducing the waiting time for the next allocation.

```toml
tso-update-physical-interval = "10ms" # default: 50ms
```

#### Adjust a TiDB global variable

In addition to the PD configuration, enabling the TSO client batch wait feature can further optimize the TSO client's behavior. To enable this feature, set the global variable `tidb_tso_client_batch_max_wait_time` to a non-zero value.

```sql
SET GLOBAL tidb_tso_client_batch_max_wait_time = 2; # default: 0
```
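A non-zero value lets the TSO client in TiDB wait briefly so that more TSO requests are batched into a single RPC, which lowers the TSO request load on the PD leader at the cost of a small amount of extra wait time per request.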

#### Adjust TiKV configuration

To reduce the number of Regions and alleviate the heartbeat overhead on the system, you can moderately increase the Region size in the TiKV configuration. For details, refer to Adjust Region size.

```toml
[coprocessor]
  region-split-size = "288MiB"
```

### After tuning

After the tuning, the following effects can be observed:

- The TSO requests per second are decreased to 64,800.
- The CPU utilization is significantly reduced from approximately 4,600% to 1,400%.
- The P999 value of PD server TSO handle time is decreased from 2 ms to 0.5 ms.

These improvements indicate that the tuning adjustments have successfully reduced the CPU utilization of the PD server while maintaining stable TSO handling performance.

(Figures: PD server CPU utilization and PD server TSO metrics after tuning)