selftests/bpf: Add benchmark for local_storage get #474
Conversation
Add benchmarks to demonstrate the performance cliff for local_storage get as the number of local_storage maps increases beyond the current local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of which do many bpf_task_storage_get calls on sets of task local_storage maps of various counts, while considering a single specific map to be 'important' and counting task_storage_gets to the important map separately in addition to the normal 'hits' count of all gets. The goal here is to mimic a scenario where a particular program using one map - the important one - is running on a system where many other local_storage maps exist and are accessed often.

While the "sequential get" benchmark does bpf_task_storage_get for maps 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark interleaves 4 bpf_task_storage_gets for the important map for every 10 map gets. This is meant to highlight performance differences when the important map is accessed far more frequently than the non-important maps.

A "hashmap control" benchmark is also included for easy comparison of standard bpf hashmap lookup vs local_storage get. It is identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH instead of local storage.

Addition of this benchmark is inspired by a conversation with Alexei in a previous patchset's thread [0], which highlighted the need for such a benchmark to motivate and validate improvements to the local_storage implementation. My approach in that series focused on improving performance for explicitly-marked 'important' maps and was rejected with feedback to make more generally-applicable improvements while avoiding explicitly marking maps as important. The benchmark therefore reports both general and important-map-focused metrics, so the effect of future work on both is clear.
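To make the structure concrete, here is a minimal sketch of what the "sequential get" BPF program looks like. The map, variable, and attach-point names here (map0, hits, important_hits, the getpgid tracepoint) are illustrative assumptions, not the actual selftest source:

```c
// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only; the real benchmark generates its maps and
 * per-map get sequences, and its names/attach point differ. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, long);
} map0 SEC(".maps");
/* ... map1 .. mapN declared identically ... */

long hits = 0;           /* all successful gets */
long important_hits = 0; /* gets against the 'important' map only */

SEC("tp/syscalls/sys_enter_getpgid")
int seq_get(void *ctx)
{
	struct task_struct *task = bpf_get_current_task_btf();
	long *val;

	/* map0 is the 'important' map; it counts toward both metrics. */
	val = bpf_task_storage_get(&map0, task, 0,
				   BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (val) {
		__sync_fetch_and_add(&hits, 1);
		__sync_fetch_and_add(&important_hits, 1);
	}
	/* ... then map1 .. mapN in order, incrementing only hits ... */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```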
Regarding the benchmark results, on a powerful system (Skylake, 20 cores, 256gb ram):

Local Storage
=============

Hashmap Control w/ 500 maps
  hashmap (control) sequential get:
    hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op,
    important_hits throughput: 0.139 ± 0.002 M ops/s

num_maps: 1
  local_storage cache sequential get:
    hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op,
    important_hits throughput: 3.849 ± 0.035 M ops/s
  local_storage cache interleaved get:
    hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op,
    important_hits throughput: 6.881 ± 0.110 M ops/s

num_maps: 10
  local_storage cache sequential get:
    hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op,
    important_hits throughput: 2.034 ± 0.044 M ops/s
  local_storage cache interleaved get:
    hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op,
    important_hits throughput: 8.003 ± 0.217 M ops/s

num_maps: 16
  local_storage cache sequential get:
    hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op,
    important_hits throughput: 1.527 ± 0.070 M ops/s
  local_storage cache interleaved get:
    hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op,
    important_hits throughput: 8.544 ± 0.262 M ops/s

num_maps: 17
  local_storage cache sequential get:
    hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op,
    important_hits throughput: 1.421 ± 0.013 M ops/s
  local_storage cache interleaved get:
    hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op,
    important_hits throughput: 7.981 ± 0.061 M ops/s

num_maps: 24
  local_storage cache sequential get:
    hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op,
    important_hits throughput: 0.701 ± 0.012 M ops/s
  local_storage cache interleaved get:
    hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op,
    important_hits throughput: 5.396 ± 0.060 M ops/s

num_maps: 32
  local_storage cache sequential get:
    hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op,
    important_hits throughput: 0.375 ± 0.010 M ops/s
  local_storage cache interleaved get:
    hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op,
    important_hits throughput: 3.972 ± 0.074 M ops/s

num_maps: 100
  local_storage cache sequential get:
    hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op,
    important_hits throughput: 0.058 ± 0.002 M ops/s
  local_storage cache interleaved get:
    hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op,
    important_hits throughput: 1.874 ± 0.081 M ops/s

num_maps: 1000
  local_storage cache sequential get:
    hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op,
    important_hits throughput: 0.000 ± 0.000 M ops/s
  local_storage cache interleaved get:
    hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op,
    important_hits throughput: 0.135 ± 0.001 M ops/s

Looking at the "sequential get" results, it's clear that as the number of task local_storage maps grows beyond the current cache size (16), there is a significant reduction in hits throughput. Note that the current local_storage implementation assigns a cache_idx to maps as they are created. Since "sequential get" creates maps 0..n in order and then does bpf_task_storage_get calls in the same order, the benchmark effectively ensures that a map will not be in cache when the program tries to access it. For the "interleaved get" results, important-map hits throughput is greatly increased, as the important map is more likely to be in cache by virtue of being accessed far more frequently.
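For context on why the cliff begins near 16 maps, below is a simplified sketch of the lookup path with its fixed-size cache. The struct layout and the slow-path helper are pared-down approximations of kernel/bpf/bpf_local_storage.c, not the exact kernel code:

```c
/* Simplified approximation of the local_storage lookup path;
 * types and helpers here are illustrative, not the kernel's code. */
#define BPF_LOCAL_STORAGE_CACHE_SIZE 16

struct bpf_local_storage_map {
	unsigned short cache_idx;	/* assigned at map creation */
};

struct bpf_local_storage_data {
	struct bpf_local_storage_map *smap;
	/* ... stored value bytes follow ... */
};

struct bpf_local_storage {
	/* One slot per cache_idx; slots must be shared once more than
	 * BPF_LOCAL_STORAGE_CACHE_SIZE local_storage maps exist. */
	struct bpf_local_storage_data *cache[BPF_LOCAL_STORAGE_CACHE_SIZE];
	/* ... plus a list of all storages attached to the task ... */
};

/* Slow path: linear walk of the task's storage list (body omitted). */
static struct bpf_local_storage_data *
storage_lookup_slow(struct bpf_local_storage *ls,
		    struct bpf_local_storage_map *smap);

static struct bpf_local_storage_data *
storage_lookup(struct bpf_local_storage *ls,
	       struct bpf_local_storage_map *smap)
{
	struct bpf_local_storage_data *sdata = ls->cache[smap->cache_idx];

	/* Fast path: this map still owns its cache slot. */
	if (sdata && sdata->smap == smap)
		return sdata;

	/* Otherwise fall back to an O(num_maps) walk, which dominates
	 * at large map counts and produces the cliff seen above. */
	return storage_lookup_slow(ls, smap);
}
```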
Even with interleaving, though, throughput still drops as the number of maps increases. As evidenced by the unintuitive-looking results for the smaller num_maps runs, overhead which is amortized across larger num_maps runs dominates when there are fewer maps. To get a sense of the overhead, I commented out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.h and ran the benchmark on the same host as the 'real' run. Results:

Local Storage
=============

Hashmap Control w/ 500 maps
  hashmap (control) sequential get:
    hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op,
    important_hits throughput: 0.257 ± 0.003 M ops/s

num_maps: 1
  local_storage cache sequential get:
    hits throughput: 4.135 ± 0.069 M ops/s, hits latency: 241.831 ns/op,
    important_hits throughput: 4.135 ± 0.069 M ops/s
  local_storage cache interleaved get:
    hits throughput: 7.693 ± 0.039 M ops/s, hits latency: 129.982 ns/op,
    important_hits throughput: 7.693 ± 0.039 M ops/s

num_maps: 10
  local_storage cache sequential get:
    hits throughput: 33.044 ± 0.232 M ops/s, hits latency: 30.262 ns/op,
    important_hits throughput: 3.304 ± 0.023 M ops/s
  local_storage cache interleaved get:
    hits throughput: 36.525 ± 1.545 M ops/s, hits latency: 27.378 ns/op,
    important_hits throughput: 13.045 ± 0.552 M ops/s

num_maps: 16
  local_storage cache sequential get:
    hits throughput: 45.502 ± 1.429 M ops/s, hits latency: 21.977 ns/op,
    important_hits throughput: 2.844 ± 0.089 M ops/s
  local_storage cache interleaved get:
    hits throughput: 47.741 ± 1.115 M ops/s, hits latency: 20.946 ns/op,
    important_hits throughput: 15.190 ± 0.355 M ops/s

num_maps: 17
  local_storage cache sequential get:
    hits throughput: 47.177 ± 0.617 M ops/s, hits latency: 21.197 ns/op,
    important_hits throughput: 2.775 ± 0.036 M ops/s
  local_storage cache interleaved get:
    hits throughput: 50.005 ± 0.463 M ops/s, hits latency: 19.998 ns/op,
    important_hits throughput: 15.219 ± 0.141 M ops/s

num_maps: 24
  local_storage cache sequential get:
    hits throughput: 58.076 ± 0.507 M ops/s, hits latency: 17.219 ns/op,
    important_hits throughput: 2.420 ± 0.021 M ops/s
  local_storage cache interleaved get:
    hits throughput: 57.731 ± 0.500 M ops/s, hits latency: 17.322 ns/op,
    important_hits throughput: 16.237 ± 0.141 M ops/s

num_maps: 32
  local_storage cache sequential get:
    hits throughput: 68.266 ± 0.234 M ops/s, hits latency: 14.649 ns/op,
    important_hits throughput: 2.133 ± 0.007 M ops/s
  local_storage cache interleaved get:
    hits throughput: 62.138 ± 2.695 M ops/s, hits latency: 16.093 ns/op,
    important_hits throughput: 17.341 ± 0.752 M ops/s

num_maps: 100
  local_storage cache sequential get:
    hits throughput: 103.735 ± 2.874 M ops/s, hits latency: 9.640 ns/op,
    important_hits throughput: 1.037 ± 0.029 M ops/s
  local_storage cache interleaved get:
    hits throughput: 85.950 ± 1.619 M ops/s, hits latency: 11.635 ns/op,
    important_hits throughput: 22.450 ± 0.423 M ops/s

num_maps: 1000
  local_storage cache sequential get:
    hits throughput: 133.551 ± 1.915 M ops/s, hits latency: 7.488 ns/op,
    important_hits throughput: 0.134 ± 0.002 M ops/s
  local_storage cache interleaved get:
    hits throughput: 97.579 ± 1.415 M ops/s, hits latency: 10.248 ns/op,
    important_hits throughput: 24.505 ± 0.355 M ops/s

Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:

  hashmap_control:     ~6.6 ns
  sequential_get_1:    ~17.9 ns
  sequential_get_10:   ~18.9 ns
  sequential_get_16:   ~19.0 ns
  sequential_get_17:   ~20.2 ns
  sequential_get_24:   ~42.2 ns
  sequential_get_32:   ~68.7 ns
  sequential_get_100:  ~163.3 ns
  sequential_get_1000: ~2200 ns

These adjusted numbers clearly demonstrate a cliff.
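The adjusted figures are simply the measured latency minus the matching no-op latency from the overhead run; for example, hashmap control works out to 14.358 - 7.770 ≈ 6.6 ns/op, and sequential_get_32 to 83.347 - 14.649 ≈ 68.7 ns/op.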
When running the benchmarks it may be necessary to bump the 'open files' ulimit for a successful run.

[0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=643820 expired. Closing PR.
The purpose of btrfs_bbio_propagate_error() is to propagate an error from a split bio to its original btrfs_bio and report the error to the upper layer. However, it does not work well in some cases.

* Case 1. Immediate (or quick) end_bio with an error

When btrfs sends a btrfs_bio to mirrored devices, btrfs calls btrfs_bio_end_io() when all the mirroring bios are completed. If that btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to increase the error count.

That works well in most cases. However, if the end_io is called fast enough, the orig_bbio's (the remaining part after the split) bio context may not be properly set at that time. Since the bio context is set when the orig_bbio (the last btrfs_bio) is sent to devices, that can be too late for an earlier split btrfs_bio's completion. That results in a NULL pointer dereference.

The bug is easily reproducible by running btrfs/146 on zoned devices [1] and it shows the following trace.

[1] You need the raid-stripe-tree feature, as it creates a "-d raid0 -m raid1" FS.

BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-5)
RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs]
BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
RSP: 0018:ffffc9000006f248 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc
RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080
RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001
R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58
R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158
FS:  0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __die_body.cold+0x19/0x26
 ? page_fault_oops+0x13e/0x2b0
 ? _printk+0x58/0x73
 ? do_user_addr_fault+0x5f/0x750
 ? exc_page_fault+0x76/0x240
 ? asm_exc_page_fault+0x22/0x30
 ? btrfs_bio_end_io+0xae/0xc0 [btrfs]
 ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs]
 btrfs_orig_write_end_io+0x51/0x90 [btrfs]
 dm_submit_bio+0x5c2/0xa50 [dm_mod]
 ? find_held_lock+0x2b/0x80
 ? blk_try_enter_queue+0x90/0x1e0
 __submit_bio+0xe0/0x130
 ? ktime_get+0x10a/0x160
 ? lockdep_hardirqs_on+0x74/0x100
 submit_bio_noacct_nocheck+0x199/0x410
 btrfs_submit_bio+0x7d/0x150 [btrfs]
 btrfs_submit_chunk+0x1a1/0x6d0 [btrfs]
 ? lockdep_hardirqs_on+0x74/0x100
 ? __folio_start_writeback+0x10/0x2c0
 btrfs_submit_bbio+0x1c/0x40 [btrfs]
 submit_one_bio+0x44/0x60 [btrfs]
 submit_extent_folio+0x13f/0x330 [btrfs]
 ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs]
 extent_writepage_io+0x18b/0x360 [btrfs]
 extent_write_locked_range+0x17c/0x340 [btrfs]
 ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
 run_delalloc_cow+0x71/0xd0 [btrfs]
 btrfs_run_delalloc_range+0x176/0x500 [btrfs]
 ? find_lock_delalloc_range+0x119/0x260 [btrfs]
 writepage_delalloc+0x2ab/0x480 [btrfs]
 extent_write_cache_pages+0x236/0x7d0 [btrfs]
 btrfs_writepages+0x72/0x130 [btrfs]
 do_writepages+0xd4/0x240
 ? find_held_lock+0x2b/0x80
 ? wbc_attach_and_unlock_inode+0x12c/0x290
 ? wbc_attach_and_unlock_inode+0x12c/0x290
 __writeback_single_inode+0x5c/0x4c0
 ? do_raw_spin_unlock+0x49/0xb0
 writeback_sb_inodes+0x22c/0x560
 __writeback_inodes_wb+0x4c/0xe0
 wb_writeback+0x1d6/0x3f0
 wb_workfn+0x334/0x520
 process_one_work+0x1ee/0x570
 ? lock_is_held_type+0xc6/0x130
 worker_thread+0x1d1/0x3b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xee/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x30/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl
CR2: 0000000000000020

* Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios

btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is called last among the split bios. In that case, btrfs_orig_write_end_io() sets bio->bi_status to BLK_STS_IOERR by seeing bioc->error [2]. Otherwise, the increased orig_bbio's bioc->error is not checked by anyone and BLK_STS_OK is returned to the upper layer.

[2] Actually, this is not true. Because we only increase orig_bioc->errors by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors" is still not met if only one split btrfs_bio fails.

* Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios

In contrast to the above case, btrfs_bbio_propagate_error() does not work well if the un-mirrored orig_bbio is completed last. It sets orig_bbio->bio.bi_status to the split btrfs_bio's error, but that is easily overwritten by orig_bbio's own completion status. If that status is BLK_STS_OK, the upper layer will never learn of the failure.

* Solution

Considering the above cases, we can only save the error status in the orig_bbio (the remaining part after the split) itself, as it is always available. Also, the saved error status should be propagated when all the split btrfs_bios are finished (i.e., bbio->pending_ios == 0).

This commit introduces a "status" member in btrfs_bio and saves the first error of the split bios to the original btrfs_bio's "status" variable. When all the split bios are finished, the saved status is loaded into the original btrfs_bio's status.

With this commit, btrfs/146 on zoned devices no longer hits the NULL pointer dereference.

Fixes: 852eee6 ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
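The shape of the described fix can be sketched as follows. This is a hedged illustration based only on the commit text above (the "status" member, the first-error rule, and the pending_ios == 0 condition); the function names and surrounding details are assumptions, not the actual patch:

```c
/* Hedged sketch of the approach described above, not the actual diff.
 * Save the first split-bio error in the original btrfs_bio itself and
 * apply it only once the last split bio has completed. */
struct btrfs_bio {
	/* ... existing fields, including the embedded struct bio ... */
	struct bio bio;
	atomic_t pending_ios;	/* outstanding split bios */
	blk_status_t status;	/* first error seen among split bios */
	void (*end_io)(struct btrfs_bio *bbio);	/* simplified signature */
};

/* Called as each split bio completes (name is illustrative). */
static void btrfs_bbio_save_error(struct btrfs_bio *bbio,
				  struct btrfs_bio *orig_bbio)
{
	/* Record only the first error, and never write bi_status here:
	 * a later BLK_STS_OK completion of orig_bbio would overwrite
	 * it (Case 3 above). */
	if (bbio->bio.bi_status && !orig_bbio->status)
		orig_bbio->status = bbio->bio.bi_status;
}

/* Runs when any split bio (or orig_bbio itself) finishes. */
static void btrfs_orig_bbio_end_io(struct btrfs_bio *orig_bbio)
{
	if (atomic_dec_and_test(&orig_bbio->pending_ios)) {
		/* All split bios are done: propagate the saved status
		 * to the upper layer exactly once. */
		orig_bbio->bio.bi_status = orig_bbio->status;
		orig_bbio->end_io(orig_bbio);
	}
}
```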
Pull request for series with
subject: selftests/bpf: Add benchmark for local_storage get
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562