
selftests/bpf: Add benchmark for local_storage get #474

Closed
wants to merge 2 commits

Conversation

kernel-patches-bot

Pull request for series with
subject: selftests/bpf: Add benchmark for local_storage get
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562

@kernel-patches-bot

Master branch: 68084a1
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: 70a1b25
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: 834650b
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: 841d8bc
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: 7aa424e
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: 96af42c
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: b23316a
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: fa37686
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: 2dc323b
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: 9794976
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: c272e25
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=642562
version: 2

@kernel-patches-bot

Master branch: c272e25
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=643820
version: 3

@kernel-patches-bot added V3 and removed V2 labels May 21, 2022
@kernel-patches-bot

Master branch: a56672f
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=643820
version: 3

@kernel-patches-bot

Master branch: ff20959
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=643820
version: 3

@kernel-patches-bot

Master branch: 4050764
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=643820
version: 3

Kernel Patches Daemon and others added 2 commits May 23, 2022 13:10
Add benchmarks to demonstrate the performance cliff for local_storage
get as the number of local_storage maps increases beyond the current
local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of
which do many bpf_task_storage_get calls on sets of task local_storage
maps of various counts, while considering a single specific map to be
'important' and counting task_storage_gets to the important map
separately in addition to normal 'hits' count of all gets. Goal here is
to mimic scenario where a particular program using one map - the
important one - is running on a system where many other local_storage
maps exist and are accessed often.

While "sequential get" benchmark does bpf_task_storage_get for map 0, 1,
..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4
bpf_task_storage_gets for the important map for every 10 map gets. This
is meant to highlight performance differences when important map is
accessed far more frequently than non-important maps.
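
A minimal sketch of the shape of the benchmark's BPF side (map and
program names here are assumptions, and the real selftest declares many
maps and unrolls the gets; this is not the actual patch code):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  /* One of the task local_storage maps; the benchmark creates many. */
  struct {
          __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, long);
  } map_0 SEC(".maps");

  long hits = 0;
  long important_hits = 0;

  /* x86_64 attach point, chosen only for illustration */
  SEC("fentry/__x64_sys_getpgid")
  int BPF_PROG(sequential_get)
  {
          struct task_struct *task = bpf_get_current_task_btf();
          long *val;

          val = bpf_task_storage_get(&map_0, task, NULL,
                                     BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (val) {
                  __sync_fetch_and_add(&hits, 1);
                  /* map_0 is the "important" map in this sketch */
                  __sync_fetch_and_add(&important_hits, 1);
          }
          return 0;
  }

  char _license[] SEC("license") = "GPL";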

A "hashmap control" benchmark is also included for easy comparison of
standard bpf hashmap lookup vs local_storage get. The benchmark is
identical to "sequential get", but creates and uses BPF_MAP_TYPE_HASH
instead of local storage.
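
A corresponding sketch of the control, with the same caveats as above;
only the map type and the lookup differ (userspace would pre-populate
hash_0 with each task's key so that lookups hit):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1000);
          __type(key, int);
          __type(value, long);
  } hash_0 SEC(".maps");

  long hits = 0;

  SEC("fentry/__x64_sys_getpgid")
  int BPF_PROG(hashmap_get)
  {
          /* the lower 32 bits of the return value are the calling tid */
          int key = (int)bpf_get_current_pid_tgid();
          long *val = bpf_map_lookup_elem(&hash_0, &key);

          if (val)
                  __sync_fetch_and_add(&hits, 1);
          return 0;
  }

  char _license[] SEC("license") = "GPL";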

The addition of this benchmark was inspired by a conversation with
Alexei in a previous patchset's thread [0], which highlighted the need
for such a benchmark to motivate and validate improvements to the
local_storage implementation. My approach in that series focused on
improving performance for explicitly-marked 'important' maps and was
rejected with feedback to make more generally-applicable improvements
while avoiding explicitly marking maps as important. Thus the benchmark
reports both general and important-map-focused metrics, so the effect
of future work on both is clear.

Benchmark results on a powerful system (Skylake, 20 cores, 256 GB RAM):

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 69.649 ± 1.207 M ops/s, hits latency: 14.358 ns/op, important_hits throughput: 0.139 ± 0.002 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 3.849 ± 0.035 M ops/s, hits latency: 259.803 ns/op, important_hits throughput: 3.849 ± 0.035 M ops/s
local_storage cache interleaved get:  hits throughput: 6.881 ± 0.110 M ops/s, hits latency: 145.324 ns/op, important_hits throughput: 6.881 ± 0.110 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 20.339 ± 0.442 M ops/s, hits latency: 49.167 ns/op, important_hits throughput: 2.034 ± 0.044 M ops/s
local_storage cache interleaved get:  hits throughput: 22.408 ± 0.606 M ops/s, hits latency: 44.627 ns/op, important_hits throughput: 8.003 ± 0.217 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 24.428 ± 1.120 M ops/s, hits latency: 40.937 ns/op, important_hits throughput: 1.527 ± 0.070 M ops/s
local_storage cache interleaved get:  hits throughput: 26.853 ± 0.825 M ops/s, hits latency: 37.240 ns/op, important_hits throughput: 8.544 ± 0.262 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 24.158 ± 0.222 M ops/s, hits latency: 41.394 ns/op, important_hits throughput: 1.421 ± 0.013 M ops/s
local_storage cache interleaved get:  hits throughput: 26.223 ± 0.201 M ops/s, hits latency: 38.134 ns/op, important_hits throughput: 7.981 ± 0.061 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 16.820 ± 0.294 M ops/s, hits latency: 59.451 ns/op, important_hits throughput: 0.701 ± 0.012 M ops/s
local_storage cache interleaved get:  hits throughput: 19.185 ± 0.212 M ops/s, hits latency: 52.125 ns/op, important_hits throughput: 5.396 ± 0.060 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 11.998 ± 0.310 M ops/s, hits latency: 83.347 ns/op, important_hits throughput: 0.375 ± 0.010 M ops/s
local_storage cache interleaved get:  hits throughput: 14.233 ± 0.265 M ops/s, hits latency: 70.259 ns/op, important_hits throughput: 3.972 ± 0.074 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 5.780 ± 0.250 M ops/s, hits latency: 173.003 ns/op, important_hits throughput: 0.058 ± 0.002 M ops/s
local_storage cache interleaved get:  hits throughput: 7.175 ± 0.312 M ops/s, hits latency: 139.381 ns/op, important_hits throughput: 1.874 ± 0.081 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 0.456 ± 0.011 M ops/s, hits latency: 2192.982 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get:  hits throughput: 0.539 ± 0.005 M ops/s, hits latency: 1855.508 ns/op, important_hits throughput: 0.135 ± 0.001 M ops/s

Looking at the "sequential get" results, it's clear that as the number
of task local_storage maps grows beyond the current cache size (16),
there's a significant reduction in hits throughput. Note that the
current local_storage implementation assigns a cache_idx to each map as
it is created. Since "sequential get" creates maps 0..n in order and
then does bpf_task_storage_get calls in the same order, the benchmark
effectively ensures that a map will not be in cache when the program
tries to access it.
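
To see why, here is a stand-alone user-space model of the cache
behavior (a deliberate simplification: the kernel picks a least-used
cache_idx at map creation, which for sequentially created maps behaves
like the round-robin assignment below):

  #include <stdio.h>

  #define CACHE_SIZE 16

  int main(void)
  {
          int nr_maps = 32;          /* > CACHE_SIZE */
          int cache[CACHE_SIZE];     /* which map each slot caches */
          long hits = 0, misses = 0;

          for (int i = 0; i < CACHE_SIZE; i++)
                  cache[i] = -1;

          for (int round = 0; round < 1000; round++) {
                  for (int map = 0; map < nr_maps; map++) {
                          int idx = map % CACHE_SIZE;

                          if (cache[idx] == map)
                                  hits++;
                          else
                                  misses++;
                          cache[idx] = map; /* accessing caches the map */
                  }
          }
          /* With nr_maps > 16, maps sharing a slot evict each other
           * before being reused, so every single access misses. */
          printf("hits=%ld misses=%ld\n", hits, misses);
          return 0;
  }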

For "interleaved get" results, important-map hits throughput is greatly
increased as the important map is more likely to be in cache by virtue
of being accessed far more frequently. Throughput still reduces as #
maps increases, though.

As evidenced by the unintuitive-looking results for the smaller
num_maps runs, overhead that is amortized across larger num_maps runs
dominates when there are fewer maps. To get a sense of this overhead, I
commented out bpf_task_storage_get/bpf_map_lookup_elem in
local_storage_bench.h and reran the benchmark on the same host as the
'real' run. Results:

Local Storage
=============
        Hashmap Control w/ 500 maps
hashmap (control) sequential    get:  hits throughput: 128.699 ± 1.267 M ops/s, hits latency: 7.770 ns/op, important_hits throughput: 0.257 ± 0.003 M ops/s

        num_maps: 1
local_storage cache sequential  get:  hits throughput: 4.135 ± 0.069 M ops/s, hits latency: 241.831 ns/op, important_hits throughput: 4.135 ± 0.069 M ops/s
local_storage cache interleaved get:  hits throughput: 7.693 ± 0.039 M ops/s, hits latency: 129.982 ns/op, important_hits throughput: 7.693 ± 0.039 M ops/s

        num_maps: 10
local_storage cache sequential  get:  hits throughput: 33.044 ± 0.232 M ops/s, hits latency: 30.262 ns/op, important_hits throughput: 3.304 ± 0.023 M ops/s
local_storage cache interleaved get:  hits throughput: 36.525 ± 1.545 M ops/s, hits latency: 27.378 ns/op, important_hits throughput: 13.045 ± 0.552 M ops/s

        num_maps: 16
local_storage cache sequential  get:  hits throughput: 45.502 ± 1.429 M ops/s, hits latency: 21.977 ns/op, important_hits throughput: 2.844 ± 0.089 M ops/s
local_storage cache interleaved get:  hits throughput: 47.741 ± 1.115 M ops/s, hits latency: 20.946 ns/op, important_hits throughput: 15.190 ± 0.355 M ops/s

        num_maps: 17
local_storage cache sequential  get:  hits throughput: 47.177 ± 0.617 M ops/s, hits latency: 21.197 ns/op, important_hits throughput: 2.775 ± 0.036 M ops/s
local_storage cache interleaved get:  hits throughput: 50.005 ± 0.463 M ops/s, hits latency: 19.998 ns/op, important_hits throughput: 15.219 ± 0.141 M ops/s

        num_maps: 24
local_storage cache sequential  get:  hits throughput: 58.076 ± 0.507 M ops/s, hits latency: 17.219 ns/op, important_hits throughput: 2.420 ± 0.021 M ops/s
local_storage cache interleaved get:  hits throughput: 57.731 ± 0.500 M ops/s, hits latency: 17.322 ns/op, important_hits throughput: 16.237 ± 0.141 M ops/s

        num_maps: 32
local_storage cache sequential  get:  hits throughput: 68.266 ± 0.234 M ops/s, hits latency: 14.649 ns/op, important_hits throughput: 2.133 ± 0.007 M ops/s
local_storage cache interleaved get:  hits throughput: 62.138 ± 2.695 M ops/s, hits latency: 16.093 ns/op, important_hits throughput: 17.341 ± 0.752 M ops/s

        num_maps: 100
local_storage cache sequential  get:  hits throughput: 103.735 ± 2.874 M ops/s, hits latency: 9.640 ns/op, important_hits throughput: 1.037 ± 0.029 M ops/s
local_storage cache interleaved get:  hits throughput: 85.950 ± 1.619 M ops/s, hits latency: 11.635 ns/op, important_hits throughput: 22.450 ± 0.423 M ops/s

        num_maps: 1000
local_storage cache sequential  get:  hits throughput: 133.551 ± 1.915 M ops/s, hits latency: 7.488 ns/op, important_hits throughput: 0.134 ± 0.002 M ops/s
local_storage cache interleaved get:  hits throughput: 97.579 ± 1.415 M ops/s, hits latency: 10.248 ns/op, important_hits throughput: 24.505 ± 0.355 M ops/s

Adjusting for overhead (subtracting each overhead-run latency from the
corresponding 'real' latency, e.g. 59.451 - 17.219 ~= 42.2ns for
sequential_get_24), the latency numbers for "hashmap control" and
"sequential get" are:

hashmap_control:     ~6.6ns
sequential_get_1:    ~17.9ns
sequential_get_10:   ~18.9ns
sequential_get_16:   ~19.0ns
sequential_get_17:   ~20.2ns
sequential_get_24:   ~42.2ns
sequential_get_32:   ~68.7ns
sequential_get_100:  ~163.3ns
sequential_get_1000: ~2200ns

Clearly demonstrating a cliff.

When running the benchmarks, it may be necessary to bump the 'open
files' ulimit for a successful run.
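
Each BPF map holds an open file descriptor, so the 1000-map runs can
exhaust the default limit. One way to raise it from a harness (an
illustrative sketch, not part of the patch; "ulimit -n" from the shell
works just as well):

  #include <stdio.h>
  #include <sys/resource.h>

  int main(void)
  {
          /* raising the hard limit may require privileges */
          struct rlimit rl = { .rlim_cur = 8192, .rlim_max = 8192 };

          if (setrlimit(RLIMIT_NOFILE, &rl))
                  perror("setrlimit(RLIMIT_NOFILE)");
          return 0;
  }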

  [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
@kernel-patches-bot

Master branch: f9a3eca
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=643820
version: 3

@kernel-patches-bot

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=643820 expired. Closing PR.

@kernel-patches-bot deleted the series/639579=>bpf-next branch May 26, 2022 04:13
kernel-patches-daemon-bpf-rc bot pushed a commit that referenced this pull request Nov 5, 2024
The purpose of btrfs_bbio_propagate_error() is to propagate an error
from a split bio to its original btrfs_bio and report the error to the
upper layer. However, it does not work correctly in some cases.

* Case 1. Immediate (or quick) end_bio with an error

When btrfs sends a btrfs_bio to mirrored devices, btrfs calls
btrfs_bio_end_io() when all the mirroring bios are completed. If that
btrfs_bio was split, it is from btrfs_clone_bioset and its end_io
function is btrfs_orig_write_end_io. For this case,
btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to
increase the error count.

That works well in most cases. However, if end_io is called quickly
enough, the bio context of the orig_bbio (the remaining part after the
split) may not yet be properly set. Since the bio context is set when
the orig_bbio (the last btrfs_bio) is sent to the devices, that can be
too late for an earlier split btrfs_bio's completion. This results in a
NULL pointer dereference.

This bug is easily reproduced by running btrfs/146 on zoned devices [1],
which produces the following trace.

[1] You need the raid-stripe-tree feature, as the test creates a
"-d raid0 -m raid1" FS.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-btrfs-5)
  RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs]
  BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
  RSP: 0018:ffffc9000006f248 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc
  RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080
  RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001
  R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58
  R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158
  FS:  0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   ? __die_body.cold+0x19/0x26
   ? page_fault_oops+0x13e/0x2b0
   ? _printk+0x58/0x73
   ? do_user_addr_fault+0x5f/0x750
   ? exc_page_fault+0x76/0x240
   ? asm_exc_page_fault+0x22/0x30
   ? btrfs_bio_end_io+0xae/0xc0 [btrfs]
   ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs]
   btrfs_orig_write_end_io+0x51/0x90 [btrfs]
   dm_submit_bio+0x5c2/0xa50 [dm_mod]
   ? find_held_lock+0x2b/0x80
   ? blk_try_enter_queue+0x90/0x1e0
   __submit_bio+0xe0/0x130
   ? ktime_get+0x10a/0x160
   ? lockdep_hardirqs_on+0x74/0x100
   submit_bio_noacct_nocheck+0x199/0x410
   btrfs_submit_bio+0x7d/0x150 [btrfs]
   btrfs_submit_chunk+0x1a1/0x6d0 [btrfs]
   ? lockdep_hardirqs_on+0x74/0x100
   ? __folio_start_writeback+0x10/0x2c0
   btrfs_submit_bbio+0x1c/0x40 [btrfs]
   submit_one_bio+0x44/0x60 [btrfs]
   submit_extent_folio+0x13f/0x330 [btrfs]
   ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs]
   extent_writepage_io+0x18b/0x360 [btrfs]
   extent_write_locked_range+0x17c/0x340 [btrfs]
   ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   run_delalloc_cow+0x71/0xd0 [btrfs]
   btrfs_run_delalloc_range+0x176/0x500 [btrfs]
   ? find_lock_delalloc_range+0x119/0x260 [btrfs]
   writepage_delalloc+0x2ab/0x480 [btrfs]
   extent_write_cache_pages+0x236/0x7d0 [btrfs]
   btrfs_writepages+0x72/0x130 [btrfs]
   do_writepages+0xd4/0x240
   ? find_held_lock+0x2b/0x80
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   __writeback_single_inode+0x5c/0x4c0
   ? do_raw_spin_unlock+0x49/0xb0
   writeback_sb_inodes+0x22c/0x560
   __writeback_inodes_wb+0x4c/0xe0
   wb_writeback+0x1d6/0x3f0
   wb_workfn+0x334/0x520
   process_one_work+0x1ee/0x570
   ? lock_is_held_type+0xc6/0x130
   worker_thread+0x1d1/0x3b0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xee/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl
  CR2: 0000000000000020

* Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios

btrfs_bbio_propagate_error() assumes the end_io function for the
orig_bbio is called last among the split bios. In that case,
btrfs_orig_write_end_io() sets bio->bi_status to BLK_STS_IOERR by
checking bioc->error [2]. Otherwise, the incremented orig_bbio
bioc->error is never checked by anyone, and BLK_STS_OK is returned to
the upper layer.

[2] Actually, this is not true. Because we only increase
orig_bioc->errors by max_errors, the condition
"atomic_read(&bioc->error) > bioc->max_errors" is still not met if only
one split btrfs_bio fails.

* Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios

In contrast to the above case, btrfs_bbio_propagate_error() does not
work well if the un-mirrored orig_bbio is completed last. It sets
orig_bbio->bio.bi_status to the split btrfs_bio's error, but that is
easily overwritten by the orig_bbio's own completion status. If that
status is BLK_STS_OK, the upper layer will not learn of the failure.

* Solution

Considering the above cases, we can only save the error status in the
orig_bbio (the remaining part after the split) itself, as it is always
available. Also, the saved error status should be propagated when all
the split btrfs_bios are finished (i.e., bbio->pending_ios == 0).

This commit introduces "status" to btrfs_bbio and saves the first error
of the split bios to the original btrfs_bio's "status" variable. When
all the split bios are finished, the saved status is loaded into the
original btrfs_bio's status.

With this commit, btrfs/146 on zoned devices does not hit the NULL pointer
dereference anymore.

Fixes: 852eee6 ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>