Acquire read lock in LockKeys to avoid data race #585

Merged: 3 commits into tikv:master on Oct 10, 2022

Conversation

sticnarf (Collaborator)
This PR tries to reduce the risk of data races in memDB, such as pingcap/tidb#26832, by acquiring the read lock whenever memDB needs to be read in the execution path of LockKeys. The same RWMutex is used in memDB.set, so concurrent reads and writes can no longer happen.

In our observations of similar problems, most of them involve locking keys while the membuf is being updated. So, although this PR does not thoroughly solve the memDB data race problem, it should greatly reduce the chance of hitting it in the real world.

The locking granularity is relatively coarse to keep the performance impact as small as possible. extractKeyExistsErr is not in the critical path, so the lock is acquired in each call.
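
As a minimal sketch of the idea (not the actual client-go code; the memDB type, field, and method names below are simplified stand-ins), writers such as Set take the write lock while the LockKeys execution path holds the read lock across its membuf reads:

package membuf

import "sync"

type memDB struct {
	sync.RWMutex
	data map[string][]byte
}

// Set writes a key-value pair under the exclusive write lock.
func (db *memDB) Set(key, value []byte) {
	db.Lock()
	defer db.Unlock()
	db.data[string(key)] = value
}

// readForLockKeys models the read side used while building pessimistic-lock
// mutations: one coarse read lock covers all lookups of the batch, so no
// concurrent Set can run in the middle of it.
func (db *memDB) readForLockKeys(keys [][]byte) [][]byte {
	db.RLock()
	defer db.RUnlock()
	values := make([][]byte, 0, len(keys))
	for _, k := range keys {
		values = append(values, db.data[string(k)])
	}
	return values
}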

@@ -86,6 +86,7 @@ func (actionPessimisticRollback) tiKVTxnRegionsNumHistogram() prometheus.Observe
func (action actionPessimisticLock) handleSingleBatch(c *twoPhaseCommitter, bo *retry.Backoffer, batch batchMutations) error {
m := batch.mutations
mutations := make([]*kvrpcpb.Mutation, m.Len())
c.txn.GetMemBuffer().RLock()
Contributor

If contention happens here, could the actionPessimisticLock tasks end up being processed one by one because of how the RWLock works? I'm not quite sure whether this may introduce extra risks.

sticnarf (Collaborator, Author)

Do you mean that all later RLock requests will wait behind a queued write-lock request? Could you point out which write lock could cause them to run one by one?

Contributor

For example:

Update  // Tries to acquire the write lock.
    Some possible parallel executors (like the parallel projection in the earlier issue)
              selectLockExec // Tries to acquire read locks.

The update execution may try to Set new key-value pairs into the memory buffer, which requires acquiring the write lock, while the parallel actionPessimisticLock operations triggered by LockKeys are still being processed. If one actionPessimisticLock acquires the read lock first and the write-lock acquisition is left pending, the other actionPessimisticLock requests may have to wait behind it.
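
To illustrate the queuing effect, here is a toy standalone example (not code from this PR; the goroutines merely stand in for the Update and actionPessimisticLock paths): with Go's sync.RWMutex, once a writer is waiting, later RLock calls block until that writer has acquired and released the lock.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // an in-flight actionPessimisticLock holds the read lock

	go func() {
		mu.Lock() // the Update path requests the write lock and must wait
		time.Sleep(50 * time.Millisecond)
		mu.Unlock()
	}()
	time.Sleep(10 * time.Millisecond) // let the writer start waiting

	start := time.Now()
	go func() {
		mu.RLock() // a later reader is blocked behind the pending writer
		fmt.Println("second reader waited", time.Since(start))
		mu.RUnlock()
	}()
	time.Sleep(10 * time.Millisecond)

	mu.RUnlock() // the first reader finishes; the writer runs, then the blocked reader
	time.Sleep(200 * time.Millisecond)
}

So later readers queue behind the pending writer, although once the writer releases the lock they can all proceed concurrently again.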

sticnarf (Collaborator, Author)

Got it. Still, I think the impact will be small if all these operations are fast. The goroutines blocked behind the write lock will acquire the read lock together once it is released, so they can run concurrently as before.

I will try to design a worst case to see how much impact this brings.

@cfzjywxk (Contributor)

We may need a test case to verify that the panic can be avoided; some benchmark tests are needed too.

Signed-off-by: Yilin Chen <sticnarf@gmail.com>
@sticnarf (Collaborator, Author)

Table schema:

CREATE TABLE `t` (
  `id` bigint(20) NOT NULL,
  `v` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
PARTITION BY HASH (`id`) PARTITIONS 10

Workload:

BEGIN;
UPDATE t SET v = v + 1 WHERE id BETWEEN {x} AND {x} + 1000
ROLLBACK;

Execution plan:

id                         	task     	estRows	operator info                                                             	actRows	execution info                                                                                                                                                                                                                                                                                       	memory 	disk
Update_5                   	root     	0      	N/A                                                                       	0      	time:440.4ms, loops:5, lock_keys: {time:3.39ms, region:10, keys:1001, slowest_rpc: {total: 0.002s, region_id: 100075, store: 127.0.0.1:30160, tikv_wall_time: 1.83ms, scan_detail: {get_snapshot_time: 2.99µs, rocksdb: {block: {cache_hit_count: 501}}}, }, lock_rpc:12.518293ms, rpc_count:10}    	38.4 KB	N/A
└─Projection_18            	root     	10000  	test.t.id, test.t.v                                                       	5005   	time:421.8ms, loops:55, Concurrency:5                                                                                                                                                                                                                                                                	26.7 KB	N/A
  └─SelectLock_19          	root     	10000  	for update 0                                                              	5005   	time:436ms, loops:55                                                                                                                                                                                                                                                                                 	N/A    	N/A
    └─PartitionUnion_20    	root     	10000  	                                                                          	5005   	time:152.7ms, loops:55                                                                                                                                                                                                                                                                               	N/A    	N/A
      ├─TableReader_22     	root     	1000   	data:TableRangeScan_21                                                    	505    	time:55.8ms, loops:10, cop_task: {num: 5, max: 29ms, min: 403.8µs, avg: 11ms, p95: 29ms, max_proc_keys: 101, p95_proc_keys: 101, rpc_num: 9, rpc_time: 80.7ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                                 	2.66 KB	N/A
      │ └─TableRangeScan_21	cop[tikv]	1000   	table:t, partition:p0, range:[78180,79180], keep order:false, stats:pseudo	505    	tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:15, tasks:5}, scan_detail: {total_process_keys: 505, total_process_keys_size: 20200, total_keys: 510, get_snapshot_time: 19µs, rocksdb: {delete_skipped_count: 505, key_skipped_count: 2210, block: {cache_hit_count: 25}}}          	N/A    	N/A
      ├─TableReader_24     	root     	1000   	data:TableRangeScan_23                                                    	500    	time:31.5ms, loops:10, cop_task: {num: 5, max: 16.8ms, min: 690.4µs, avg: 6.12ms, p95: 16.8ms, max_proc_keys: 100, p95_proc_keys: 100, rpc_num: 9, rpc_time: 44.3ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                           	2.63 KB	N/A
      │ └─TableRangeScan_23	cop[tikv]	1000   	table:t, partition:p1, range:[78180,79180], keep order:false, stats:pseudo	500    	tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:15, tasks:5}, scan_detail: {total_process_keys: 500, total_process_keys_size: 20000, total_keys: 505, get_snapshot_time: 15.9µs, rocksdb: {delete_skipped_count: 500, key_skipped_count: 2282, block: {cache_hit_count: 20}}}        	N/A    	N/A
      ├─TableReader_26     	root     	1000   	data:TableRangeScan_25                                                    	500    	time:44.3ms, loops:10, cop_task: {num: 5, max: 28.9ms, min: 332.4µs, avg: 8.68ms, p95: 28.9ms, max_proc_keys: 100, p95_proc_keys: 100, rpc_num: 9, rpc_time: 77.4ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                           	2.64 KB	N/A
      │ └─TableRangeScan_25	cop[tikv]	1000   	table:t, partition:p2, range:[78180,79180], keep order:false, stats:pseudo	500    	tikv_task:{proc max:4ms, min:0s, avg: 800µs, p80:4ms, p95:4ms, iters:15, tasks:5}, scan_detail: {total_process_keys: 500, total_process_keys_size: 20000, total_keys: 505, get_snapshot_time: 16.2µs, rocksdb: {delete_skipped_count: 500, key_skipped_count: 2191, block: {cache_hit_count: 20}}} 	N/A    	N/A
      ├─TableReader_28     	root     	1000   	data:TableRangeScan_27                                                    	500    	time:44.3ms, loops:10, cop_task: {num: 5, max: 22.1ms, min: 241.2µs, avg: 8.69ms, p95: 22.1ms, max_proc_keys: 100, p95_proc_keys: 100, rpc_num: 9, rpc_time: 70.4ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                           	2.64 KB	N/A
      │ └─TableRangeScan_27	cop[tikv]	1000   	table:t, partition:p3, range:[78180,79180], keep order:false, stats:pseudo	500    	tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:15, tasks:5}, scan_detail: {total_process_keys: 500, total_process_keys_size: 20000, total_keys: 505, get_snapshot_time: 15.6µs, rocksdb: {delete_skipped_count: 500, key_skipped_count: 2191, block: {cache_hit_count: 20}}}        	N/A    	N/A
      ├─TableReader_30     	root     	1000   	data:TableRangeScan_29                                                    	500    	time:54.8ms, loops:10, cop_task: {num: 5, max: 30.5ms, min: 463.6µs, avg: 10.8ms, p95: 30.5ms, max_proc_keys: 100, p95_proc_keys: 100, rpc_num: 9, rpc_time: 77.4ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                           	2.64 KB	N/A
      │ └─TableRangeScan_29	cop[tikv]	1000   	table:t, partition:p4, range:[78180,79180], keep order:false, stats:pseudo	500    	tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:15, tasks:5}, scan_detail: {total_process_keys: 500, total_process_keys_size: 20000, total_keys: 1005, get_snapshot_time: 13.7µs, rocksdb: {delete_skipped_count: 500, key_skipped_count: 2694, block: {cache_hit_count: 20}}}       	N/A    	N/A
      ├─TableReader_32     	root     	1000   	data:TableRangeScan_31                                                    	500    	time:26.2ms, loops:10, cop_task: {num: 5, max: 11.3ms, min: 423.1µs, avg: 4.52ms, p95: 11.3ms, max_proc_keys: 100, p95_proc_keys: 100, rpc_num: 9, rpc_time: 35.3ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                           	2.64 KB	N/A
      │ └─TableRangeScan_31	cop[tikv]	1000   	table:t, partition:p5, range:[78180,79180], keep order:false, stats:pseudo	500    	tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:15, tasks:5}, scan_detail: {total_process_keys: 500, total_process_keys_size: 20000, total_keys: 1005, get_snapshot_time: 17.3µs, rocksdb: {delete_skipped_count: 500, key_skipped_count: 2789, block: {cache_hit_count: 20}}}       	N/A    	N/A
      ├─TableReader_34     	root     	1000   	data:TableRangeScan_33                                                    	500    	time:26.4ms, loops:10, cop_task: {num: 5, max: 12.8ms, min: 376.7µs, avg: 5.16ms, p95: 12.8ms, max_proc_keys: 100, p95_proc_keys: 100, rpc_num: 9, rpc_time: 38.8ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                           	2.64 KB	N/A
      │ └─TableRangeScan_33	cop[tikv]	1000   	table:t, partition:p6, range:[78180,79180], keep order:false, stats:pseudo	500    	tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:15, tasks:5}, scan_detail: {total_process_keys: 500, total_process_keys_size: 20000, total_keys: 1005, get_snapshot_time: 18.3µs, rocksdb: {delete_skipped_count: 500, key_skipped_count: 2694, block: {cache_hit_count: 20}}}       	N/A    	N/A
      ├─TableReader_36     	root     	1000   	data:TableRangeScan_35                                                    	500    	time:29.8ms, loops:10, cop_task: {num: 5, max: 17.5ms, min: 420µs, avg: 5.32ms, p95: 17.5ms, max_proc_keys: 100, p95_proc_keys: 100, rpc_num: 9, rpc_time: 45.8ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                             	2.64 KB	N/A
      │ └─TableRangeScan_35	cop[tikv]	1000   	table:t, partition:p7, range:[78180,79180], keep order:false, stats:pseudo	500    	tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:15, tasks:5}, scan_detail: {total_process_keys: 500, total_process_keys_size: 20000, total_keys: 1005, get_snapshot_time: 17.9µs, rocksdb: {delete_skipped_count: 500, key_skipped_count: 2787, block: {cache_hit_count: 20}}}       	N/A    	N/A
      ├─TableReader_38     	root     	1000   	data:TableRangeScan_37                                                    	500    	time:16.6ms, loops:10, cop_task: {num: 5, max: 7.3ms, min: 308.6µs, avg: 2.69ms, p95: 7.3ms, max_proc_keys: 100, p95_proc_keys: 100, rpc_num: 9, rpc_time: 22.1ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                             	2.64 KB	N/A
      │ └─TableRangeScan_37	cop[tikv]	1000   	table:t, partition:p8, range:[78180,79180], keep order:false, stats:pseudo	500    	tikv_task:{proc max:0s, min:0s, avg: 0s, p80:0s, p95:0s, iters:15, tasks:5}, scan_detail: {total_process_keys: 500, total_process_keys_size: 20000, total_keys: 1005, get_snapshot_time: 17.5µs, rocksdb: {delete_skipped_count: 500, key_skipped_count: 2787, block: {cache_hit_count: 20}}}       	N/A    	N/A
      └─TableReader_40     	root     	1000   	data:TableRangeScan_39                                                    	500    	time:64.4ms, loops:10, cop_task: {num: 5, max: 34.7ms, min: 193.5µs, avg: 12.6ms, p95: 34.7ms, max_proc_keys: 100, p95_proc_keys: 100, rpc_num: 9, rpc_time: 91.4ms, copr_cache_hit_ratio: 0.00, distsql_concurrency: 15}                                                                           	2.64 KB	N/A
        └─TableRangeScan_39	cop[tikv]	1000   	table:t, partition:p9, range:[78180,79180], keep order:false, stats:pseudo	500    	tikv_task:{proc max:4ms, min:0s, avg: 800µs, p80:4ms, p95:4ms, iters:15, tasks:5}, scan_detail: {total_process_keys: 500, total_process_keys_size: 20000, total_keys: 1005, get_snapshot_time: 15.7µs, rocksdb: {delete_skipped_count: 500, key_skipped_count: 2687, block: {cache_hit_count: 20}}}	N/A    	N/A

I reverted pingcap/tidb#30290 to allow concurrency for the projection executor in this plan.

I limited TiDB to only 4 CPU cores to make TiDB the bottleneck.

Threads           	4            	100
Master w/o 30290  	Panic quickly	285 TPS
This PR w/o 30290 	287 TPS      	271 TPS
Master w/ 30290   	278 TPS      	278 TPS

In the benchbot TPC-C 400-thread test, I find no regression, though the TiDB CPU is not 100% utilized. This PR: 69682 QPS, baseline: 68184 QPS. (The PR should not actually improve performance; the difference is likely just environment noise.)

I'll also benchmark other cases where TiDB is the bottleneck.

@sticnarf (Collaborator, Author)

I also tested sysbench read_write in the benchbot, which pushes the TiDB CPU close to full utilization.

(Sorry, the links below are not publicly accessible)
http://perf.pingcap.net/d/MB_crJk4z/oltp-bench-details?orgId=1&var-oltp_summary_log_id=4020030 (baseline) vs http://perf.pingcap.net/d/MB_crJk4z/oltp-bench-details?orgId=1&var-oltp_summary_log_id=4050030 (with this PR)

There seems to be no big difference.


cfzjywxk merged commit 8dfd76b into tikv:master on Oct 10, 2022
sticnarf added a commit to sticnarf/client-go that referenced this pull request Oct 12, 2022
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
cfzjywxk pushed a commit that referenced this pull request Oct 12, 2022
* Clear intersecting regions in the cache when inserting a region (#566)

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

* Acquire read lock in LockKeys to avoid data race (#585)

Signed-off-by: Yilin Chen <sticnarf@gmail.com>
guo-shaoge pushed a commit to guo-shaoge/client-go that referenced this pull request Oct 13, 2022
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>