Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport CStr improvement to remove need for a nightly feature #6

Merged
merged 1 commit into from
Sep 18, 2020

Conversation

alex
Copy link
Member

@alex alex commented Sep 12, 2020

This is abc979140d6a070867c9e78e251c2c689782274f from the other repo.

@ojeda
Copy link
Member

ojeda commented Sep 12, 2020

Thanks for the port!

@alex
Copy link
Member Author

alex commented Sep 12, 2020

Rebased for conflicts.

@ojeda ojeda merged commit 7f60ab1 into rust Sep 18, 2020
@ojeda ojeda deleted the cstr-improvement branch September 18, 2020 17:56
ojeda pushed a commit that referenced this pull request Nov 28, 2020
I got the following lockdep splat while testing:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc7-00172-g021118712e59 #932 Not tainted
  ------------------------------------------------------
  btrfs/229626 is trying to acquire lock:
  ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450

  but task is already holding lock:
  ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_scrub_dev+0x11c/0x630
	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
	 btrfs_ioctl+0x2799/0x30a0
	 ksys_ioctl+0x83/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_run_dev_stats+0x49/0x480
	 commit_cowonly_roots+0xb5/0x2a0
	 btrfs_commit_transaction+0x516/0xa60
	 sync_filesystem+0x6b/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0xe/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x29/0x60
	 cleanup_mnt+0xb8/0x140
	 task_work_run+0x6d/0xb0
	 __prepare_exit_to_usermode+0x1cc/0x1e0
	 do_syscall_64+0x5c/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_commit_transaction+0x4bb/0xa60
	 sync_filesystem+0x6b/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0xe/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x29/0x60
	 cleanup_mnt+0xb8/0x140
	 task_work_run+0x6d/0xb0
	 __prepare_exit_to_usermode+0x1cc/0x1e0
	 do_syscall_64+0x5c/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 btrfs_record_root_in_trans+0x43/0x70
	 start_transaction+0xd1/0x5d0
	 btrfs_dirty_inode+0x42/0xd0
	 touch_atime+0xa1/0xd0
	 btrfs_file_mmap+0x3f/0x60
	 mmap_region+0x3a4/0x640
	 do_mmap+0x376/0x580
	 vm_mmap_pgoff+0xd5/0x120
	 ksys_mmap_pgoff+0x193/0x230
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
	 __might_fault+0x68/0x90
	 _copy_to_user+0x1e/0x80
	 perf_read+0x141/0x2c0
	 vfs_read+0xad/0x1b0
	 ksys_read+0x5f/0xe0
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 perf_event_init_cpu+0x88/0x150
	 perf_event_init+0x1db/0x20b
	 start_kernel+0x3ae/0x53c
	 secondary_startup_64+0xa4/0xb0

  -> #1 (pmus_lock){+.+.}-{3:3}:
	 __mutex_lock+0x9f/0x930
	 perf_event_init_cpu+0x4f/0x150
	 cpuhp_invoke_callback+0xb1/0x900
	 _cpu_up.constprop.26+0x9f/0x130
	 cpu_up+0x7b/0xc0
	 bringup_nonboot_cpus+0x4f/0x60
	 smp_init+0x26/0x71
	 kernel_init_freeable+0x110/0x258
	 kernel_init+0xa/0x103
	 ret_from_fork+0x1f/0x30

  -> #0 (cpu_hotplug_lock){++++}-{0:0}:
	 __lock_acquire+0x1272/0x2310
	 lock_acquire+0x9e/0x360
	 cpus_read_lock+0x39/0xb0
	 alloc_workqueue+0x378/0x450
	 __btrfs_alloc_workqueue+0x15d/0x200
	 btrfs_alloc_workqueue+0x51/0x160
	 scrub_workers_get+0x5a/0x170
	 btrfs_scrub_dev+0x18c/0x630
	 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
	 btrfs_ioctl+0x2799/0x30a0
	 ksys_ioctl+0x83/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0x90
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->scrub_lock);
				 lock(&fs_devs->device_list_mutex);
				 lock(&fs_info->scrub_lock);
    lock(cpu_hotplug_lock);

   *** DEADLOCK ***

  2 locks held by btrfs/229626:
   #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
   #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

  stack backtrace:
  CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
  Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x165/0x180
   __lock_acquire+0x1272/0x2310
   lock_acquire+0x9e/0x360
   ? alloc_workqueue+0x378/0x450
   cpus_read_lock+0x39/0xb0
   ? alloc_workqueue+0x378/0x450
   alloc_workqueue+0x378/0x450
   ? rcu_read_lock_sched_held+0x52/0x80
   __btrfs_alloc_workqueue+0x15d/0x200
   btrfs_alloc_workqueue+0x51/0x160
   scrub_workers_get+0x5a/0x170
   btrfs_scrub_dev+0x18c/0x630
   ? start_transaction+0xd1/0x5d0
   btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
   btrfs_ioctl+0x2799/0x30a0
   ? do_sigaction+0x102/0x250
   ? lockdep_hardirqs_on_prepare+0xca/0x160
   ? _raw_spin_unlock_irq+0x24/0x30
   ? trace_hardirqs_on+0x1c/0xe0
   ? _raw_spin_unlock_irq+0x24/0x30
   ? do_sigaction+0x102/0x250
   ? ksys_ioctl+0x83/0xc0
   ksys_ioctl+0x83/0xc0
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x50/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because we're allocating the scrub workqueues under the
scrub and device list mutex, which brings in a whole host of other
dependencies.

Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in turns
needs the device_list_mutex, it can lead to a deadlock. A different
problem for which this fix is a solution.

Fix this by moving the actual allocation outside of the
scrub lock, and then only take the lock once we're ready to actually
assign them to the fs_info.  We'll now have to cleanup the workqueues in
a few more places, so I've added a helper to do the refcount dance to
safely free the workqueues.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ojeda pushed a commit that referenced this pull request Nov 28, 2020
…s metrics" test

Linux 5.9 introduced perf test case "Parse and process metrics" and
on s390 this test case always dumps core:

  [root@t35lp67 perf]# ./perf test -vvvv -F 67
  67: Parse and process metrics                             :
  --- start ---
  metric expr inst_retired.any / cpu_clk_unhalted.thread for IPC
  parsing metric: inst_retired.any / cpu_clk_unhalted.thread
  Segmentation fault (core dumped)
  [root@t35lp67 perf]#

I debugged this core dump and gdb shows this call chain:

  (gdb) where
   #0  0x000003ffabc3192a in __strnlen_c_1 () from /lib64/libc.so.6
   #1  0x000003ffabc293de in strcasestr () from /lib64/libc.so.6
   #2  0x0000000001102ba2 in match_metric(list=0x1e6ea20 "inst_retired.any",
            n=<optimized out>)
       at util/metricgroup.c:368
   #3  find_metric (map=<optimized out>, map=<optimized out>,
           metric=0x1e6ea20 "inst_retired.any")
      at util/metricgroup.c:765
   #4  __resolve_metric (ids=0x0, map=<optimized out>, metric_list=0x0,
           metric_no_group=<optimized out>, m=<optimized out>)
      at util/metricgroup.c:844
   #5  resolve_metric (ids=0x0, map=0x0, metric_list=0x0,
          metric_no_group=<optimized out>)
      at util/metricgroup.c:881
   #6  metricgroup__add_metric (metric=<optimized out>,
        metric_no_group=metric_no_group@entry=false, events=<optimized out>,
        events@entry=0x3ffd84fb878, metric_list=0x0,
        metric_list@entry=0x3ffd84fb868, map=0x0)
      at util/metricgroup.c:943
   #7  0x00000000011034ae in metricgroup__add_metric_list (map=0x13f9828 <map>,
        metric_list=0x3ffd84fb868, events=0x3ffd84fb878,
        metric_no_group=<optimized out>, list=<optimized out>)
      at util/metricgroup.c:988
   #8  parse_groups (perf_evlist=perf_evlist@entry=0x1e70260,
          str=str@entry=0x12f34b2 "IPC", metric_no_group=<optimized out>,
          metric_no_merge=<optimized out>,
          fake_pmu=fake_pmu@entry=0x1462f18 <perf_pmu.fake>,
          metric_events=0x3ffd84fba58, map=0x1)
      at util/metricgroup.c:1040
   #9  0x0000000001103eb2 in metricgroup__parse_groups_test(
  	evlist=evlist@entry=0x1e70260, map=map@entry=0x13f9828 <map>,
  	str=str@entry=0x12f34b2 "IPC",
  	metric_no_group=metric_no_group@entry=false,
  	metric_no_merge=metric_no_merge@entry=false,
  	metric_events=0x3ffd84fba58)
      at util/metricgroup.c:1082
   #10 0x00000000010c84d8 in __compute_metric (ratio2=0x0, name2=0x0,
          ratio1=<synthetic pointer>, name1=0x12f34b2 "IPC",
  	vals=0x3ffd84fbad8, name=0x12f34b2 "IPC")
      at tests/parse-metric.c:159
   #11 compute_metric (ratio=<synthetic pointer>, vals=0x3ffd84fbad8,
  	name=0x12f34b2 "IPC")
      at tests/parse-metric.c:189
   #12 test_ipc () at tests/parse-metric.c:208
.....
..... omitted many more lines

This test case was added with
commit 218ca91 ("perf tests: Add parse metric test for frontend metric").

When I compile with make DEBUG=y it works fine and I do not get a core dump.

It turned out that the above listed function call chain worked on a struct
pmu_event array which requires a trailing element with zeroes which was
missing. The marco map_for_each_event() loops over that array tests for members
metric_expr/metric_name/metric_group being non-NULL. Adding this element fixes
the issue.

Output after:

  [root@t35lp46 perf]# ./perf test 67
  67: Parse and process metrics                             : Ok
  [root@t35lp46 perf]#

Committer notes:

As Ian remarks, this is not s390 specific:

<quote Ian>
  This also shows up with address sanitizer on all architectures
  (perhaps change the patch title) and perhaps add a "Fixes: <commit>"
  tag.

  =================================================================
  ==4718==ERROR: AddressSanitizer: global-buffer-overflow on address
  0x55c93b4d59e8 at pc 0x55c93a1541e2 bp 0x7ffd24327c60 sp
  0x7ffd24327c58
  READ of size 8 at 0x55c93b4d59e8 thread T0
      #0 0x55c93a1541e1 in find_metric tools/perf/util/metricgroup.c:764:2
      #1 0x55c93a153e6c in __resolve_metric tools/perf/util/metricgroup.c:844:9
      #2 0x55c93a152f18 in resolve_metric tools/perf/util/metricgroup.c:881:9
      #3 0x55c93a1528db in metricgroup__add_metric
  tools/perf/util/metricgroup.c:943:9
      #4 0x55c93a151996 in metricgroup__add_metric_list
  tools/perf/util/metricgroup.c:988:9
      #5 0x55c93a1511b9 in parse_groups tools/perf/util/metricgroup.c:1040:8
      #6 0x55c93a1513e1 in metricgroup__parse_groups_test
  tools/perf/util/metricgroup.c:1082:9
      #7 0x55c93a0108ae in __compute_metric tools/perf/tests/parse-metric.c:159:8
      #8 0x55c93a010744 in compute_metric tools/perf/tests/parse-metric.c:189:9
      #9 0x55c93a00f5ee in test_ipc tools/perf/tests/parse-metric.c:208:2
      #10 0x55c93a00f1e8 in test__parse_metric
  tools/perf/tests/parse-metric.c:345:2
      #11 0x55c939fd7202 in run_test tools/perf/tests/builtin-test.c:410:9
      #12 0x55c939fd6736 in test_and_print tools/perf/tests/builtin-test.c:440:9
      #13 0x55c939fd58c3 in __cmd_test tools/perf/tests/builtin-test.c:661:4
      #14 0x55c939fd4e02 in cmd_test tools/perf/tests/builtin-test.c:807:9
      #15 0x55c939e4763d in run_builtin tools/perf/perf.c:313:11
      #16 0x55c939e46475 in handle_internal_command tools/perf/perf.c:365:8
      #17 0x55c939e4737e in run_argv tools/perf/perf.c:409:2
      #18 0x55c939e45f7e in main tools/perf/perf.c:539:3

  0x55c93b4d59e8 is located 0 bytes to the right of global variable
  'pme_test' defined in 'tools/perf/tests/parse-metric.c:17:25'
  (0x55c93b4d54a0) of size 1352
  SUMMARY: AddressSanitizer: global-buffer-overflow
  tools/perf/util/metricgroup.c:764:2 in find_metric
  Shadow bytes around the buggy address:
    0x0ab9a7692ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  =>0x0ab9a7692b30: 00 00 00 00 00 00 00 00 00 00 00 00 00[f9]f9 f9
    0x0ab9a7692b40: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b50: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
    0x0ab9a7692b60: f9 f9 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
    0x0ab9a7692b70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    0x0ab9a7692b80: f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9 f9
  Shadow byte legend (one shadow byte represents 8 application bytes):
    Addressable:           00
    Partially addressable: 01 02 03 04 05 06 07
    Heap left redzone:	   fa
    Freed heap region:	   fd
    Stack left redzone:	   f1
    Stack mid redzone:	   f2
    Stack right redzone:     f3
    Stack after return:	   f5
    Stack use after scope:   f8
    Global redzone:          f9
    Global init order:	   f6
    Poisoned by user:        f7
    Container overflow:	   fc
    Array cookie:            ac
    Intra object redzone:    bb
    ASan internal:           fe
    Left alloca redzone:     ca
    Right alloca redzone:    cb
    Shadow gap:              cc
</quote>

I'm also adding the missing "Fixes" tag and setting just .name to NULL,
as doing it that way is more compact (the compiler will zero out
everything else) and the table iterators look for .name being NULL as
the sentinel marking the end of the table.

Fixes: 0a507af ("perf tests: Add parse metric test for ipc metric")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Link: http://lore.kernel.org/lkml/20200825071211.16959-1-tmricht@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Nov 28, 2020
If we hit the UINT_MAX limit of bio->bi_iter.bi_size and so we are anyway
not merging this page in this bio, then it make sense to make same_page
also as false before returning.

Without this patch, we hit below WARNING in iomap.
This mostly happens with very large memory system and / or after tweaking
vm dirty threshold params to delay writeback of dirty data.

WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
 CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G        W         5.8.0-rc3 #6
 Call Trace:
  __remove_mapping+0x154/0x320 (unreliable)
  iomap_releasepage+0x80/0x180
  try_to_release_page+0x94/0xe0
  invalidate_inode_page+0xc8/0x110
  invalidate_mapping_pages+0x1dc/0x540
  generic_fadvise+0x3c8/0x450
  xfs_file_fadvise+0x2c/0xe0 [xfs]
  vfs_fadvise+0x3c/0x60
  ksys_fadvise64_64+0x68/0xe0
  sys_fadvise64+0x28/0x40
  system_call_exception+0xf8/0x1c0
  system_call_common+0xf0/0x278

Fixes: cc90bc6 ("block: fix "check bi_size overflow before merge"")
Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ojeda pushed a commit that referenced this pull request Nov 28, 2020
The aliases were never released causing the following leaks:

  Indirect leak of 1224 byte(s) in 9 object(s) allocated from:
    #0 0x7feefb830628 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x107628)
    #1 0x56332c8f1b62 in __perf_pmu__new_alias util/pmu.c:322
    #2 0x56332c8f401f in pmu_add_cpu_aliases_map util/pmu.c:778
    #3 0x56332c792ce9 in __test__pmu_event_aliases tests/pmu-events.c:295
    #4 0x56332c792ce9 in test_aliases tests/pmu-events.c:367
    #5 0x56332c76a09b in run_test tests/builtin-test.c:410
    #6 0x56332c76a09b in test_and_print tests/builtin-test.c:440
    #7 0x56332c76ce69 in __cmd_test tests/builtin-test.c:695
    #8 0x56332c76ce69 in cmd_test tests/builtin-test.c:807
    #9 0x56332c7d2214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #10 0x56332c6701a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #11 0x56332c6701a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #12 0x56332c6701a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #13 0x7feefb359cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 956a783 ("perf test: Test pmu-events aliases")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-11-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Nov 28, 2020
The evsel->unit borrows a pointer of pmu event or alias instead of
owns a string.  But tool event (duration_time) passes a result of
strdup() caused a leak.

It was found by ASAN during metric test:

  Direct leak of 210 byte(s) in 70 object(s) allocated from:
    #0 0x7fe366fca0b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
    #1 0x559fbbcc6ea3 in add_event_tool util/parse-events.c:414
    #2 0x559fbbcc6ea3 in parse_events_add_tool util/parse-events.c:1414
    #3 0x559fbbd8474d in parse_events_parse util/parse-events.y:439
    #4 0x559fbbcc95da in parse_events__scanner util/parse-events.c:2096
    #5 0x559fbbcc95da in __parse_events util/parse-events.c:2141
    #6 0x559fbbc28555 in check_parse_id tests/pmu-events.c:406
    #7 0x559fbbc28555 in check_parse_id tests/pmu-events.c:393
    #8 0x559fbbc28555 in check_parse_cpu tests/pmu-events.c:415
    #9 0x559fbbc28555 in test_parsing tests/pmu-events.c:498
    #10 0x559fbbc0109b in run_test tests/builtin-test.c:410
    #11 0x559fbbc0109b in test_and_print tests/builtin-test.c:440
    #12 0x559fbbc03e69 in __cmd_test tests/builtin-test.c:695
    #13 0x559fbbc03e69 in cmd_test tests/builtin-test.c:807
    #14 0x559fbbc691f4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #15 0x559fbbb071a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #16 0x559fbbb071a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #17 0x559fbbb071a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #18 0x7fe366b68cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: f0fbb11 ("perf stat: Implement duration_time as a proper event")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-6-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Nov 28, 2020
The test_generic_metric() missed to release entries in the pctx.  Asan
reported following leak (and more):

  Direct leak of 128 byte(s) in 1 object(s) allocated from:
    #0 0x7f4c9396980e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x55f7e748cc14 in hashmap_grow (/home/namhyung/project/linux/tools/perf/perf+0x90cc14)
    #2 0x55f7e748d497 in hashmap__insert (/home/namhyung/project/linux/tools/perf/perf+0x90d497)
    #3 0x55f7e7341667 in hashmap__set /home/namhyung/project/linux/tools/perf/util/hashmap.h:111
    #4 0x55f7e7341667 in expr__add_ref util/expr.c:120
    #5 0x55f7e7292436 in prepare_metric util/stat-shadow.c:783
    #6 0x55f7e729556d in test_generic_metric util/stat-shadow.c:858
    #7 0x55f7e712390b in compute_single tests/parse-metric.c:128
    #8 0x55f7e712390b in __compute_metric tests/parse-metric.c:180
    #9 0x55f7e712446d in compute_metric tests/parse-metric.c:196
    #10 0x55f7e712446d in test_dcache_l2 tests/parse-metric.c:295
    #11 0x55f7e712446d in test__parse_metric tests/parse-metric.c:355
    #12 0x55f7e70be09b in run_test tests/builtin-test.c:410
    #13 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
    #14 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
    #15 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
    #16 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #17 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #18 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #19 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #20 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 6d432c4 ("perf tools: Add test_generic_metric function")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-8-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Nov 28, 2020
The metricgroup__add_metric() can find multiple match for a metric group
and it's possible to fail.  Also it can fail in the middle like in
resolve_metric() even for single metric.

In those cases, the intermediate list and ids will be leaked like:

  Direct leak of 3 byte(s) in 1 object(s) allocated from:
    #0 0x7f4c938f40b5 in strdup (/lib/x86_64-linux-gnu/libasan.so.5+0x920b5)
    #1 0x55f7e71c1bef in __add_metric util/metricgroup.c:683
    #2 0x55f7e71c31d0 in add_metric util/metricgroup.c:906
    #3 0x55f7e71c3844 in metricgroup__add_metric util/metricgroup.c:940
    #4 0x55f7e71c488d in metricgroup__add_metric_list util/metricgroup.c:993
    #5 0x55f7e71c488d in parse_groups util/metricgroup.c:1045
    #6 0x55f7e71c60a4 in metricgroup__parse_groups_test util/metricgroup.c:1087
    #7 0x55f7e71235ae in __compute_metric tests/parse-metric.c:164
    #8 0x55f7e7124650 in compute_metric tests/parse-metric.c:196
    #9 0x55f7e7124650 in test_recursion_fail tests/parse-metric.c:318
    #10 0x55f7e7124650 in test__parse_metric tests/parse-metric.c:356
    #11 0x55f7e70be09b in run_test tests/builtin-test.c:410
    #12 0x55f7e70be09b in test_and_print tests/builtin-test.c:440
    #13 0x55f7e70c101a in __cmd_test tests/builtin-test.c:661
    #14 0x55f7e70c101a in cmd_test tests/builtin-test.c:807
    #15 0x55f7e7126214 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #16 0x55f7e6fc41a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #17 0x55f7e6fc41a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #18 0x55f7e6fc41a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #19 0x7f4c93492cc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: 83de0b7 ("perf metric: Collect referenced metrics in struct metric_ref_node")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-9-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Nov 28, 2020
The following leaks were detected by ASAN:

  Indirect leak of 360 byte(s) in 9 object(s) allocated from:
    #0 0x7fecc305180e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x560578f6dce5 in perf_pmu__new_format util/pmu.c:1333
    #2 0x560578f752fc in perf_pmu_parse util/pmu.y:59
    #3 0x560578f6a8b7 in perf_pmu__format_parse util/pmu.c:73
    #4 0x560578e07045 in test__pmu tests/pmu.c:155
    #5 0x560578de109b in run_test tests/builtin-test.c:410
    #6 0x560578de109b in test_and_print tests/builtin-test.c:440
    #7 0x560578de401a in __cmd_test tests/builtin-test.c:661
    #8 0x560578de401a in cmd_test tests/builtin-test.c:807
    #9 0x560578e49354 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #10 0x560578ce71a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #11 0x560578ce71a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #12 0x560578ce71a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #13 0x7fecc2b7acc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: cff7f95 ("perf tests: Move pmu tests into separate object")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-12-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Nov 28, 2020
Current neigh update event handler implementation takes reference to
neighbour structure, assigns it to nhe->n, tries to schedule workqueue task
and releases the reference if task was already enqueued. This results
potentially overwriting existing nhe->n pointer with another neighbour
instance, which causes double release of the instance (once in neigh update
handler that failed to enqueue to workqueue and another one in neigh update
workqueue task that processes updated nhe->n pointer instead of original
one):

[ 3376.512806] ------------[ cut here ]------------
[ 3376.513534] refcount_t: underflow; use-after-free.
[ 3376.521213] Modules linked in: act_skbedit act_mirred act_tunnel_key vxlan ip6_udp_tunnel udp_tunnel nfnetlink act_gact cls_flower sch_ingress openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_ib mlx5_core mlxfw pci_hyperv_intf ptp pps_core nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd
 grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_umad ib_ipoib ib_iser rdma_cm ib_cm iw_cm rfkill ib_uverbs ib_core sunrpc kvm_intel kvm iTCO_wdt iTCO_vendor_support virtio_net irqbypass net_failover crc32_pclmul lpc_ich i2c_i801 failover pcspkr i2c_smbus mfd_core ghash_clmulni_intel sch_fq_codel drm i2c
_core ip_tables crc32c_intel serio_raw [last unloaded: mlxfw]
[ 3376.529468] CPU: 8 PID: 22756 Comm: kworker/u20:5 Not tainted 5.9.0-rc5+ #6
[ 3376.530399] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 3376.531975] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
[ 3376.532820] RIP: 0010:refcount_warn_saturate+0xd8/0xe0
[ 3376.533589] Code: ff 48 c7 c7 e0 b8 27 82 c6 05 0b b6 09 01 01 e8 94 93 c1 ff 0f 0b c3 48 c7 c7 88 b8 27 82 c6 05 f7 b5 09 01 01 e8 7e 93 c1 ff <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13
[ 3376.536017] RSP: 0018:ffffc90002a97e30 EFLAGS: 00010286
[ 3376.536793] RAX: 0000000000000000 RBX: ffff8882de30d648 RCX: 0000000000000000
[ 3376.537718] RDX: ffff8882f5c28f20 RSI: ffff8882f5c18e40 RDI: ffff8882f5c18e40
[ 3376.538654] RBP: ffff8882cdf56c00 R08: 000000000000c580 R09: 0000000000001a4d
[ 3376.539582] R10: 0000000000000731 R11: ffffc90002a97ccd R12: 0000000000000000
[ 3376.540519] R13: ffff8882de30d600 R14: ffff8882de30d640 R15: ffff88821e000900
[ 3376.541444] FS:  0000000000000000(0000) GS:ffff8882f5c00000(0000) knlGS:0000000000000000
[ 3376.542732] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3376.543545] CR2: 0000556e5504b248 CR3: 00000002c6f10005 CR4: 0000000000770ee0
[ 3376.544483] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3376.545419] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3376.546344] PKRU: 55555554
[ 3376.546911] Call Trace:
[ 3376.547479]  mlx5e_rep_neigh_update.cold+0x33/0xe2 [mlx5_core]
[ 3376.548299]  process_one_work+0x1d8/0x390
[ 3376.548977]  worker_thread+0x4d/0x3e0
[ 3376.549631]  ? rescuer_thread+0x3e0/0x3e0
[ 3376.550295]  kthread+0x118/0x130
[ 3376.550914]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 3376.551675]  ret_from_fork+0x1f/0x30
[ 3376.552312] ---[ end trace d84e8f46d2a77eec ]---

Fix the bug by moving work_struct to dedicated dynamically-allocated
structure. This enabled every event handler to work on its own private
neighbour pointer and removes the need for handling the case when task is
already enqueued.

Fixes: 232c001 ("net/mlx5e: Add support to neighbour update flow")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
I met below warning when cating a small size(about 80bytes) txt file
on 9pfs(msize=2097152 is passed to 9p mount option), the reason is we
miss iov_iter_advance() if the read count is 0 for zerocopy case, so
we didn't truncate the pipe, then iov_iter_pipe() thinks the pipe is
full. Fix it by removing the exception for 0 to ensure to call
iov_iter_advance() even on empty read for zerocopy case.

[    8.279568] WARNING: CPU: 0 PID: 39 at lib/iov_iter.c:1203 iov_iter_pipe+0x31/0x40
[    8.280028] Modules linked in:
[    8.280561] CPU: 0 PID: 39 Comm: cat Not tainted 5.11.0+ #6
[    8.281260] RIP: 0010:iov_iter_pipe+0x31/0x40
[    8.281974] Code: 2b 42 54 39 42 5c 76 22 c7 07 20 00 00 00 48 89 57 18 8b 42 50 48 c7 47 08 b
[    8.283169] RSP: 0018:ffff888000cbbd80 EFLAGS: 00000246
[    8.283512] RAX: 0000000000000010 RBX: ffff888000117d00 RCX: 0000000000000000
[    8.283876] RDX: ffff88800031d600 RSI: 0000000000000000 RDI: ffff888000cbbd90
[    8.284244] RBP: ffff888000cbbe38 R08: 0000000000000000 R09: ffff8880008d2058
[    8.284605] R10: 0000000000000002 R11: ffff888000375510 R12: 0000000000000050
[    8.284964] R13: ffff888000cbbe80 R14: 0000000000000050 R15: ffff88800031d600
[    8.285439] FS:  00007f24fd8af600(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
[    8.285844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.286150] CR2: 00007f24fd7d7b90 CR3: 0000000000c97000 CR4: 00000000000406b0
[    8.286710] Call Trace:
[    8.288279]  generic_file_splice_read+0x31/0x1a0
[    8.289273]  ? do_splice_to+0x2f/0x90
[    8.289511]  splice_direct_to_actor+0xcc/0x220
[    8.289788]  ? pipe_to_sendpage+0xa0/0xa0
[    8.290052]  do_splice_direct+0x8b/0xd0
[    8.290314]  do_sendfile+0x1ad/0x470
[    8.290576]  do_syscall_64+0x2d/0x40
[    8.290818]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    8.291409] RIP: 0033:0x7f24fd7dca0a
[    8.292511] Code: c3 0f 1f 80 00 00 00 00 4c 89 d2 4c 89 c6 e9 bd fd ff ff 0f 1f 44 00 00 31 8
[    8.293360] RSP: 002b:00007ffc20932818 EFLAGS: 00000206 ORIG_RAX: 0000000000000028
[    8.293800] RAX: ffffffffffffffda RBX: 0000000001000000 RCX: 00007f24fd7dca0a
[    8.294153] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
[    8.294504] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[    8.294867] R10: 0000000001000000 R11: 0000000000000206 R12: 0000000000000003
[    8.295217] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[    8.295782] ---[ end trace 63317af81b3ca24b ]---

Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
The evlist has the maps with its own refcounts so we don't need to set
the pointers to NULL.  Otherwise following error was reported by Asan.

  # perf test -v 4
   4: Read samples using the mmap interface      :
  --- start ---
  test child forked, pid 139782
  mmap size 528384B

  =================================================================
  ==139782==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7f1f76daee8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x564ba21a0fea in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
    #2 0x564ba21a1a0f in perf_cpu_map__read /home/namhyung/project/linux/tools/lib/perf/cpumap.c:149
    #3 0x564ba21a21cf in cpu_map__read_all_cpu_map /home/namhyung/project/linux/tools/lib/perf/cpumap.c:166
    #4 0x564ba21a21cf in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:181
    #5 0x564ba1e48298 in test__basic_mmap tests/mmap-basic.c:55
    #6 0x564ba1e278fb in run_test tests/builtin-test.c:428
    #7 0x564ba1e278fb in test_and_print tests/builtin-test.c:458
    #8 0x564ba1e29a53 in __cmd_test tests/builtin-test.c:679
    #9 0x564ba1e29a53 in cmd_test tests/builtin-test.c:825
    #10 0x564ba1e95cb4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #11 0x564ba1d1fa88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #12 0x564ba1d1fa88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #13 0x564ba1d1fa88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #14 0x7f1f768e4d09 in __libc_start_main ../csu/libc-start.c:308

    ...
  test child finished with 1
  ---- end ----
  Read samples using the mmap interface: FAILED!
  failed to open shell test directory: /home/namhyung/libexec/perf-core/tests/shell

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lore.kernel.org/r/20210301140409.184570-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
The evlist has the maps with its own refcounts so we don't need to set
the pointers to NULL.  Otherwise following error was reported by Asan.

Also change the goto label since it doesn't need to have two.

  # perf test -v 24
  24: Number of exit events of a simple workload :
  --- start ---
  test child forked, pid 145915
  mmap size 528384B

  =================================================================
  ==145915==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7fc44e50d1f8 in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x561cf50f4d2e in perf_thread_map__realloc /home/namhyung/project/linux/tools/lib/perf/threadmap.c:23
    #2 0x561cf4eeb949 in thread_map__new_by_tid util/thread_map.c:63
    #3 0x561cf4db7fd2 in test__task_exit tests/task-exit.c:74
    #4 0x561cf4d798fb in run_test tests/builtin-test.c:428
    #5 0x561cf4d798fb in test_and_print tests/builtin-test.c:458
    #6 0x561cf4d7ba53 in __cmd_test tests/builtin-test.c:679
    #7 0x561cf4d7ba53 in cmd_test tests/builtin-test.c:825
    #8 0x561cf4de7d04 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #9 0x561cf4c71a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #10 0x561cf4c71a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #11 0x561cf4c71a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #12 0x7fc44e042d09 in __libc_start_main ../csu/libc-start.c:308

    ...
  test child finished with 1
  ---- end ----
  Number of exit events of a simple workload: FAILED!

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210301140409.184570-4-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
The evlist has the maps with its own refcounts so we don't need to set
the pointers to NULL.  Otherwise following error was reported by Asan.

Also change the goto label since it doesn't need to have two.

  # perf test -v 25
  25: Software clock events period values        :
  --- start ---
  test child forked, pid 149154
  mmap size 528384B
  mmap size 528384B

  =================================================================
  ==149154==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7fef5cd071f8 in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x56260d5e8b8e in perf_thread_map__realloc /home/namhyung/project/linux/tools/lib/perf/threadmap.c:23
    #2 0x56260d3df7a9 in thread_map__new_by_tid util/thread_map.c:63
    #3 0x56260d2ac6b2 in __test__sw_clock_freq tests/sw-clock.c:65
    #4 0x56260d26d8fb in run_test tests/builtin-test.c:428
    #5 0x56260d26d8fb in test_and_print tests/builtin-test.c:458
    #6 0x56260d26fa53 in __cmd_test tests/builtin-test.c:679
    #7 0x56260d26fa53 in cmd_test tests/builtin-test.c:825
    #8 0x56260d2dbb64 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #9 0x56260d165a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #10 0x56260d165a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #11 0x56260d165a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #12 0x7fef5c83cd09 in __libc_start_main ../csu/libc-start.c:308

    ...
  test child finished with 1
  ---- end ----
  Software clock events period values      : FAILED!

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210301140409.184570-5-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
The evlist and the cpu/thread maps should be released together.
Otherwise following error was reported by Asan.

Note that this test still has memory leaks in DSOs so it still fails
even after this change.  I'll take a look at that too.

  # perf test -v 26
  26: Object code reading                        :
  --- start ---
  test child forked, pid 154184
  Looking at the vmlinux_path (8 entries long)
  symsrc__init: build id mismatch for vmlinux.
  symsrc__init: cannot get elf header.
  Using /proc/kcore for kernel data
  Using /proc/kallsyms for symbols
  Parsing event 'cycles'
  mmap size 528384B
  ...
  =================================================================
  ==154184==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 439 byte(s) in 1 object(s) allocated from:
    #0 0x7fcb66e77037 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x55ad9b7e821e in dso__new_id util/dso.c:1256
    #2 0x55ad9b8cfd4a in __machine__addnew_vdso util/vdso.c:132
    #3 0x55ad9b8cfd4a in machine__findnew_vdso util/vdso.c:347
    #4 0x55ad9b845b7e in map__new util/map.c:176
    #5 0x55ad9b8415a2 in machine__process_mmap2_event util/machine.c:1787
    #6 0x55ad9b8fab16 in perf_tool__process_synth_event util/synthetic-events.c:64
    #7 0x55ad9b8fab16 in perf_event__synthesize_mmap_events util/synthetic-events.c:499
    #8 0x55ad9b8fbfdf in __event__synthesize_thread util/synthetic-events.c:741
    #9 0x55ad9b8ff3e3 in perf_event__synthesize_thread_map util/synthetic-events.c:833
    #10 0x55ad9b738585 in do_test_code_reading tests/code-reading.c:608
    #11 0x55ad9b73b25d in test__code_reading tests/code-reading.c:722
    #12 0x55ad9b6f28fb in run_test tests/builtin-test.c:428
    #13 0x55ad9b6f28fb in test_and_print tests/builtin-test.c:458
    #14 0x55ad9b6f4a53 in __cmd_test tests/builtin-test.c:679
    #15 0x55ad9b6f4a53 in cmd_test tests/builtin-test.c:825
    #16 0x55ad9b760cc4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #17 0x55ad9b5eaa88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #18 0x55ad9b5eaa88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #19 0x55ad9b5eaa88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #20 0x7fcb669acd09 in __libc_start_main ../csu/libc-start.c:308

    ...
  SUMMARY: AddressSanitizer: 471 byte(s) leaked in 2 allocation(s).
  test child finished with 1
  ---- end ----
  Object code reading: FAILED!

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210301140409.184570-6-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
The evlist and the cpu/thread maps should be released together.
Otherwise following error was reported by Asan.

  $ perf test -v 28
  28: Use a dummy software event to keep tracking:
  --- start ---
  test child forked, pid 156810
  mmap size 528384B

  =================================================================
  ==156810==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7f637d2bce8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x55cc6295cffa in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
    #2 0x55cc6295da1f in perf_cpu_map__read /home/namhyung/project/linux/tools/lib/perf/cpumap.c:149
    #3 0x55cc6295e1df in cpu_map__read_all_cpu_map /home/namhyung/project/linux/tools/lib/perf/cpumap.c:166
    #4 0x55cc6295e1df in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:181
    #5 0x55cc626287cf in test__keep_tracking tests/keep-tracking.c:84
    #6 0x55cc625e38fb in run_test tests/builtin-test.c:428
    #7 0x55cc625e38fb in test_and_print tests/builtin-test.c:458
    #8 0x55cc625e5a53 in __cmd_test tests/builtin-test.c:679
    #9 0x55cc625e5a53 in cmd_test tests/builtin-test.c:825
    #10 0x55cc62651cc4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #11 0x55cc624dba88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #12 0x55cc624dba88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #13 0x55cc624dba88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #14 0x7f637cdf2d09 in __libc_start_main ../csu/libc-start.c:308

  SUMMARY: AddressSanitizer: 72 byte(s) leaked in 2 allocation(s).
  test child finished with 1
  ---- end ----
  Use a dummy software event to keep tracking: FAILED!

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210301140409.184570-7-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
The evlist and cpu/thread maps should be released together.
Otherwise the following error was reported by Asan.

  $ perf test -v 35
  35: Track with sched_switch                    :
  --- start ---
  test child forked, pid 159287
  Using CPUID GenuineIntel-6-8E-C
  mmap size 528384B
  1295 events recorded

  =================================================================
  ==159287==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7fa28d9a2e8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x5652f5a5affa in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
    #2 0x5652f5a5ba1f in perf_cpu_map__read /home/namhyung/project/linux/tools/lib/perf/cpumap.c:149
    #3 0x5652f5a5c1df in cpu_map__read_all_cpu_map /home/namhyung/project/linux/tools/lib/perf/cpumap.c:166
    #4 0x5652f5a5c1df in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:181
    #5 0x5652f5723bbf in test__switch_tracking tests/switch-tracking.c:350
    #6 0x5652f56e18fb in run_test tests/builtin-test.c:428
    #7 0x5652f56e18fb in test_and_print tests/builtin-test.c:458
    #8 0x5652f56e3a53 in __cmd_test tests/builtin-test.c:679
    #9 0x5652f56e3a53 in cmd_test tests/builtin-test.c:825
    #10 0x5652f574fcc4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #11 0x5652f55d9a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #12 0x5652f55d9a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #13 0x5652f55d9a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #14 0x7fa28d4d8d09 in __libc_start_main ../csu/libc-start.c:308

  SUMMARY: AddressSanitizer: 72 byte(s) leaked in 2 allocation(s).
  test child finished with 1
  ---- end ----
  Track with sched_switch: FAILED!

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210301140409.184570-8-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
It missed to call perf_thread_map__put() after using the map.

  $ perf test -v 43
  43: Synthesize thread map                      :
  --- start ---
  test child forked, pid 162640

  =================================================================
  ==162640==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7fd48cdaa1f8 in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164
    #1 0x563e6d5f8d0e in perf_thread_map__realloc /home/namhyung/project/linux/tools/lib/perf/threadmap.c:23
    #2 0x563e6d3ef69a in thread_map__new_by_pid util/thread_map.c:46
    #3 0x563e6d2cec90 in test__thread_map_synthesize tests/thread-map.c:97
    #4 0x563e6d27d8fb in run_test tests/builtin-test.c:428
    #5 0x563e6d27d8fb in test_and_print tests/builtin-test.c:458
    #6 0x563e6d27fa53 in __cmd_test tests/builtin-test.c:679
    #7 0x563e6d27fa53 in cmd_test tests/builtin-test.c:825
    #8 0x563e6d2ebce4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #9 0x563e6d175a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #10 0x563e6d175a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #11 0x563e6d175a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #12 0x7fd48c8dfd09 in __libc_start_main ../csu/libc-start.c:308

  SUMMARY: AddressSanitizer: 8224 byte(s) leaked in 2 allocation(s).
  test child finished with 1
  ---- end ----
  Synthesize thread map: FAILED!

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210301140409.184570-9-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
It should be released after printing the map.

  $ perf test -v 52
  52: Print cpu map                              :
  --- start ---
  test child forked, pid 172233

  =================================================================
  ==172233==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 156 byte(s) in 1 object(s) allocated from:
    #0 0x7fc472518e8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x55e63b378f7a in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
    #2 0x55e63b37a05c in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:237
    #3 0x55e63b056d16 in cpu_map_print tests/cpumap.c:102
    #4 0x55e63b056d16 in test__cpu_map_print tests/cpumap.c:120
    #5 0x55e63afff8fb in run_test tests/builtin-test.c:428
    #6 0x55e63afff8fb in test_and_print tests/builtin-test.c:458
    #7 0x55e63b001a53 in __cmd_test tests/builtin-test.c:679
    #8 0x55e63b001a53 in cmd_test tests/builtin-test.c:825
    #9 0x55e63b06dc44 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #10 0x55e63aef7a88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #11 0x55e63aef7a88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #12 0x55e63aef7a88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #13 0x7fc47204ed09 in __libc_start_main ../csu/libc-start.c:308
  ...

  SUMMARY: AddressSanitizer: 448 byte(s) leaked in 7 allocation(s).
  test child finished with 1
  ---- end ----
  Print cpu map: FAILED!

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210301140409.184570-11-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
It should release the maps at the end.

  $ perf test -v 71
  71: Convert perf time to TSC                   :
  --- start ---
  test child forked, pid 178744
  mmap size 528384B
  1st event perf time 59207256505278 tsc 13187166645142
  rdtsc          time 59207256542151 tsc 13187166723020
  2nd event perf time 59207256543749 tsc 13187166726393

  =================================================================
  ==178744==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7faf601f9e8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x55b620cfc00a in cpu_map__trim_new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:79
    #2 0x55b620cfca2f in perf_cpu_map__read /home/namhyung/project/linux/tools/lib/perf/cpumap.c:149
    #3 0x55b620cfd1ef in cpu_map__read_all_cpu_map /home/namhyung/project/linux/tools/lib/perf/cpumap.c:166
    #4 0x55b620cfd1ef in perf_cpu_map__new /home/namhyung/project/linux/tools/lib/perf/cpumap.c:181
    #5 0x55b6209ef1b2 in test__perf_time_to_tsc tests/perf-time-to-tsc.c:73
    #6 0x55b6209828fb in run_test tests/builtin-test.c:428
    #7 0x55b6209828fb in test_and_print tests/builtin-test.c:458
    #8 0x55b620984a53 in __cmd_test tests/builtin-test.c:679
    #9 0x55b620984a53 in cmd_test tests/builtin-test.c:825
    #10 0x55b6209f0cd4 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #11 0x55b62087aa88 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #12 0x55b62087aa88 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #13 0x55b62087aa88 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #14 0x7faf5fd2fd09 in __libc_start_main ../csu/libc-start.c:308

  SUMMARY: AddressSanitizer: 72 byte(s) leaked in 2 allocation(s).
  test child finished with 1
  ---- end ----
  Convert perf time to TSC: FAILED!

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210301140409.184570-12-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Mar 15, 2021
I got a segfault when using -r option with event groups.  The option
makes it run the workload multiple times and it will reuse the evlist
and evsel for each run.

While most of resources are allocated and freed properly, the id hash
in the evlist was not and it resulted in the bug.  You can see it with
the address sanitizer like below:

  $ perf stat -r 100 -e '{cycles,instructions}' true
  =================================================================
  ==693052==ERROR: AddressSanitizer: heap-use-after-free on
      address 0x6080000003d0 at pc 0x558c57732835 bp 0x7fff1526adb0 sp 0x7fff1526ada8
  WRITE of size 8 at 0x6080000003d0 thread T0
    #0 0x558c57732834 in hlist_add_head /home/namhyung/project/linux/tools/include/linux/list.h:644
    #1 0x558c57732834 in perf_evlist__id_hash /home/namhyung/project/linux/tools/lib/perf/evlist.c:237
    #2 0x558c57732834 in perf_evlist__id_add /home/namhyung/project/linux/tools/lib/perf/evlist.c:244
    #3 0x558c57732834 in perf_evlist__id_add_fd /home/namhyung/project/linux/tools/lib/perf/evlist.c:285
    #4 0x558c5747733e in store_evsel_ids util/evsel.c:2765
    #5 0x558c5747733e in evsel__store_ids util/evsel.c:2782
    #6 0x558c5730b717 in __run_perf_stat /home/namhyung/project/linux/tools/perf/builtin-stat.c:895
    #7 0x558c5730b717 in run_perf_stat /home/namhyung/project/linux/tools/perf/builtin-stat.c:1014
    #8 0x558c5730b717 in cmd_stat /home/namhyung/project/linux/tools/perf/builtin-stat.c:2446
    #9 0x558c57427c24 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:313
    #10 0x558c572b1a48 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:365
    #11 0x558c572b1a48 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:409
    #12 0x558c572b1a48 in main /home/namhyung/project/linux/tools/perf/perf.c:539
    #13 0x7fcadb9f7d09 in __libc_start_main ../csu/libc-start.c:308
    #14 0x558c572b60f9 in _start (/home/namhyung/project/linux/tools/perf/perf+0x45d0f9)

Actually the nodes in the hash table are struct perf_stream_id and
they were freed in the previous run.  Fix it by resetting the hash.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20210225035148.778569-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
JoseTeuttli pushed a commit to JoseTeuttli/linux that referenced this pull request Jun 14, 2021
ASan reported a memory leak caused by info_linear not being deallocated.

The info_linear was allocated during in perf_event__synthesize_one_bpf_prog().

This patch adds the corresponding free() when bpf_prog_info_node
is freed in perf_env__purge_bpf().

  $ sudo ./perf record -- sleep 5
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.025 MB perf.data (8 samples) ]

  =================================================================
  ==297735==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 7688 byte(s) in 19 object(s) allocated from:
      #0 0x4f420f in malloc (/home/user/linux/tools/perf/perf+0x4f420f)
      #1 0xc06a74 in bpf_program__get_prog_info_linear /home/user/linux/tools/lib/bpf/libbpf.c:11113:16
      #2 0xb426fe in perf_event__synthesize_one_bpf_prog /home/user/linux/tools/perf/util/bpf-event.c:191:16
      Rust-for-Linux#3 0xb42008 in perf_event__synthesize_bpf_events /home/user/linux/tools/perf/util/bpf-event.c:410:9
      Rust-for-Linux#4 0x594596 in record__synthesize /home/user/linux/tools/perf/builtin-record.c:1490:8
      Rust-for-Linux#5 0x58c9ac in __cmd_record /home/user/linux/tools/perf/builtin-record.c:1798:8
      Rust-for-Linux#6 0x58990b in cmd_record /home/user/linux/tools/perf/builtin-record.c:2901:8
      Rust-for-Linux#7 0x7b2a20 in run_builtin /home/user/linux/tools/perf/perf.c:313:11
      Rust-for-Linux#8 0x7b12ff in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8
      Rust-for-Linux#9 0x7b2583 in run_argv /home/user/linux/tools/perf/perf.c:409:2
      Rust-for-Linux#10 0x7b0d79 in main /home/user/linux/tools/perf/perf.c:539:3
      Rust-for-Linux#11 0x7fa357ef6b74 in __libc_start_main /usr/src/debug/glibc-2.33-8.fc34.x86_64/csu/../csu/libc-start.c:332:16

Signed-off-by: Riccardo Mancini <rickyman7@gmail.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Link: http://lore.kernel.org/lkml/20210602224024.300485-1-rickyman7@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
AltF02 pushed a commit to AltF02/linux that referenced this pull request Oct 5, 2021
It's later supposed to be either a correct address or NULL. Without the
initialization, it may contain an undefined value which results in the
following segmentation fault:

  # perf top --sort comm -g --ignore-callees=do_idle

terminates with:

  #0  0x00007ffff56b7685 in __strlen_avx2 () from /lib64/libc.so.6
  Rust-for-Linux#1  0x00007ffff55e3802 in strdup () from /lib64/libc.so.6
  Rust-for-Linux#2  0x00005555558cb139 in hist_entry__init (callchain_size=<optimized out>, sample_self=true, template=0x7fffde7fb110, he=0x7fffd801c250) at util/hist.c:489
  Rust-for-Linux#3  hist_entry__new (template=template@entry=0x7fffde7fb110, sample_self=sample_self@entry=true) at util/hist.c:564
  Rust-for-Linux#4  0x00005555558cb4ba in hists__findnew_entry (hists=hists@entry=0x5555561d9e38, entry=entry@entry=0x7fffde7fb110, al=al@entry=0x7fffde7fb420,
      sample_self=sample_self@entry=true) at util/hist.c:657
  Rust-for-Linux#5  0x00005555558cba1b in __hists__add_entry (hists=hists@entry=0x5555561d9e38, al=0x7fffde7fb420, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0,
      sample=sample@entry=0x7fffde7fb4b0, sample_self=true, ops=0x0, block_info=0x0) at util/hist.c:288
  Rust-for-Linux#6  0x00005555558cbb70 in hists__add_entry (sample_self=true, sample=0x7fffde7fb4b0, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=0x5555561d9e38)
      at util/hist.c:1056
  Rust-for-Linux#7  iter_add_single_cumulative_entry (iter=0x7fffde7fb460, al=<optimized out>) at util/hist.c:1056
  Rust-for-Linux#8  0x00005555558cc8a4 in hist_entry_iter__add (iter=iter@entry=0x7fffde7fb460, al=al@entry=0x7fffde7fb420, max_stack_depth=<optimized out>, arg=arg@entry=0x7fffffff7db0)
      at util/hist.c:1231
  Rust-for-Linux#9  0x00005555557cdc9a in perf_event__process_sample (machine=<optimized out>, sample=0x7fffde7fb4b0, evsel=<optimized out>, event=<optimized out>, tool=0x7fffffff7db0)
      at builtin-top.c:842
  Rust-for-Linux#10 deliver_event (qe=<optimized out>, qevent=<optimized out>) at builtin-top.c:1202
  Rust-for-Linux#11 0x00005555558a9318 in do_flush (show_progress=false, oe=0x7fffffff80e0) at util/ordered-events.c:244
  Rust-for-Linux#12 __ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP, timestamp=timestamp@entry=0) at util/ordered-events.c:323
  Rust-for-Linux#13 0x00005555558a9789 in __ordered_events__flush (timestamp=<optimized out>, how=<optimized out>, oe=<optimized out>) at util/ordered-events.c:339
  Rust-for-Linux#14 ordered_events__flush (how=OE_FLUSH__TOP, oe=0x7fffffff80e0) at util/ordered-events.c:341
  Rust-for-Linux#15 ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP) at util/ordered-events.c:339
  Rust-for-Linux#16 0x00005555557cd631 in process_thread (arg=0x7fffffff7db0) at builtin-top.c:1114
  Rust-for-Linux#17 0x00007ffff7bb817a in start_thread () from /lib64/libpthread.so.0
  Rust-for-Linux#18 0x00007ffff5656dc3 in clone () from /lib64/libc.so.6

If you look at the frame Rust-for-Linux#2, the code is:

488	 if (he->srcline) {
489          he->srcline = strdup(he->srcline);
490          if (he->srcline == NULL)
491              goto err_rawdata;
492	 }

If he->srcline is not NULL (it is not NULL if it is uninitialized rubbish),
it gets strdupped and strdupping a rubbish random string causes the problem.

Also, if you look at the commit 1fb7d06, it adds the srcline property
into the struct, but not initializing it everywhere needed.

Committer notes:

Now I see, when using --ignore-callees=do_idle we end up here at line
2189 in add_callchain_ip():

2181         if (al.sym != NULL) {
2182                 if (perf_hpp_list.parent && !*parent &&
2183                     symbol__match_regex(al.sym, &parent_regex))
2184                         *parent = al.sym;
2185                 else if (have_ignore_callees && root_al &&
2186                   symbol__match_regex(al.sym, &ignore_callees_regex)) {
2187                         /* Treat this symbol as the root,
2188                            forgetting its callees. */
2189                         *root_al = al;
2190                         callchain_cursor_reset(cursor);
2191                 }
2192         }

And the al that doesn't have the ->srcline field initialized will be
copied to the root_al, so then, back to:

1211 int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al,
1212                          int max_stack_depth, void *arg)
1213 {
1214         int err, err2;
1215         struct map *alm = NULL;
1216
1217         if (al)
1218                 alm = map__get(al->map);
1219
1220         err = sample__resolve_callchain(iter->sample, &callchain_cursor, &iter->parent,
1221                                         iter->evsel, al, max_stack_depth);
1222         if (err) {
1223                 map__put(alm);
1224                 return err;
1225         }
1226
1227         err = iter->ops->prepare_entry(iter, al);
1228         if (err)
1229                 goto out;
1230
1231         err = iter->ops->add_single_entry(iter, al);
1232         if (err)
1233                 goto out;
1234

That al at line 1221 is what hist_entry_iter__add() (called from
sample__resolve_callchain()) saw as 'root_al', and then:

        iter->ops->add_single_entry(iter, al);

will go on with al->srcline with a bogus value, I'll add the above
sequence to the cset and apply, thanks!

Signed-off-by: Michael Petlan <mpetlan@redhat.com>
CC: Milian Wolff <milian.wolff@kdab.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Fixes: 1fb7d06 ("perf report Use srcline from callchain for hist entries")
Link: https //lore.kernel.org/r/20210719145332.29747-1-mpetlan@redhat.com
Reported-by: Juri Lelli <jlelli@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
AltF02 pushed a commit to AltF02/linux that referenced this pull request Oct 5, 2021
FD uses xyarray__entry that may return NULL if an index is out of
bounds. If NULL is returned then a segv happens as FD unconditionally
dereferences the pointer. This was happening in a case of with perf
iostat as shown below. The fix is to make FD an "int*" rather than an
int and handle the NULL case as either invalid input or a closed fd.

  $ sudo gdb --args perf stat --iostat  list
  ...
  Breakpoint 1, perf_evsel__alloc_fd (evsel=0x5555560951a0, ncpus=1, nthreads=1) at evsel.c:50
  50      {
  (gdb) bt
   #0  perf_evsel__alloc_fd (evsel=0x5555560951a0, ncpus=1, nthreads=1) at evsel.c:50
   Rust-for-Linux#1  0x000055555585c188 in evsel__open_cpu (evsel=0x5555560951a0, cpus=0x555556093410,
      threads=0x555556086fb0, start_cpu=0, end_cpu=1) at util/evsel.c:1792
   Rust-for-Linux#2  0x000055555585cfb2 in evsel__open (evsel=0x5555560951a0, cpus=0x0, threads=0x555556086fb0)
      at util/evsel.c:2045
   Rust-for-Linux#3  0x000055555585d0db in evsel__open_per_thread (evsel=0x5555560951a0, threads=0x555556086fb0)
      at util/evsel.c:2065
   Rust-for-Linux#4  0x00005555558ece64 in create_perf_stat_counter (evsel=0x5555560951a0,
      config=0x555555c34700 <stat_config>, target=0x555555c2f1c0 <target>, cpu=0) at util/stat.c:590
   Rust-for-Linux#5  0x000055555578e927 in __run_perf_stat (argc=1, argv=0x7fffffffe4a0, run_idx=0)
      at builtin-stat.c:833
   Rust-for-Linux#6  0x000055555578f3c6 in run_perf_stat (argc=1, argv=0x7fffffffe4a0, run_idx=0)
      at builtin-stat.c:1048
   Rust-for-Linux#7  0x0000555555792ee5 in cmd_stat (argc=1, argv=0x7fffffffe4a0) at builtin-stat.c:2534
   Rust-for-Linux#8  0x0000555555835ed3 in run_builtin (p=0x555555c3f540 <commands+288>, argc=3,
      argv=0x7fffffffe4a0) at perf.c:313
   Rust-for-Linux#9  0x0000555555836154 in handle_internal_command (argc=3, argv=0x7fffffffe4a0) at perf.c:365
   Rust-for-Linux#10 0x000055555583629f in run_argv (argcp=0x7fffffffe2ec, argv=0x7fffffffe2e0) at perf.c:409
   Rust-for-Linux#11 0x0000555555836692 in main (argc=3, argv=0x7fffffffe4a0) at perf.c:539
  ...
  (gdb) c
  Continuing.
  Error:
  The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (uncore_iio_0/event=0x83,umask=0x04,ch_mask=0xF,fc_mask=0x07/).
  /bin/dmesg | grep -i perf may provide additional information.

  Program received signal SIGSEGV, Segmentation fault.
  0x00005555559b03ea in perf_evsel__close_fd_cpu (evsel=0x5555560951a0, cpu=1) at evsel.c:166
  166                     if (FD(evsel, cpu, thread) >= 0)

v3. fixes a bug in perf_evsel__run_ioctl where the sense of a branch was
    backward.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20210918054440.2350466-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
mdaverde pushed a commit to mdaverde/linux that referenced this pull request Nov 7, 2021
After commit 9298e63 ("bpf/tests: Add exhaustive tests of ALU
operand magnitudes"), when modprobe test_bpf.ko with JIT on mips64,
there exists segment fault due to the following reason:

  [...]
  ALU64_MOV_X: all register value magnitudes jited:1
  Break instruction in kernel code[Rust-for-Linux#1]
  [...]

It seems that the related JIT implementations of some test cases
in test_bpf() have problems. At this moment, I do not care about
the segment fault while I just want to verify the test cases of
tail calls.

Based on the above background and motivation, add the following
module parameter test_suite to the test_bpf.ko:

  test_suite=<string>: only the specified test suite will be run, the
  string can be "test_bpf", "test_tail_calls" or "test_skb_segment".

If test_suite is not specified, but test_id, test_name or test_range
is specified, set 'test_bpf' as the default test suite. This is useful
to only test the corresponding test suite when specifying the valid
test_suite string.

Any invalid test suite will result in -EINVAL being returned and no
tests being run. If the test_suite is not specified or specified as
empty string, it does not change the current logic, all of the test
cases will be run.

Here are some test results:

 # dmesg -c
 # modprobe test_bpf
 # dmesg | grep Summary
 test_bpf: Summary: 1009 PASSED, 0 FAILED, [0/997 JIT'ed]
 test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
 test_bpf: test_skb_segment: Summary: 2 PASSED, 0 FAILED

 # rmmod test_bpf
 # dmesg -c
 # modprobe test_bpf test_suite=test_bpf
 # dmesg | tail -1
 test_bpf: Summary: 1009 PASSED, 0 FAILED, [0/997 JIT'ed]

 # rmmod test_bpf
 # dmesg -c
 # modprobe test_bpf test_suite=test_tail_calls
 # dmesg
 test_bpf: #0 Tail call leaf jited:0 21 PASS
 [...]
 test_bpf: Rust-for-Linux#7 Tail call error path, index out of range jited:0 32 PASS
 test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]

 # rmmod test_bpf
 # dmesg -c
 # modprobe test_bpf test_suite=test_skb_segment
 # dmesg
 test_bpf: #0 gso_with_rx_frags PASS
 test_bpf: Rust-for-Linux#1 gso_linear_no_head_frag PASS
 test_bpf: test_skb_segment: Summary: 2 PASSED, 0 FAILED

 # rmmod test_bpf
 # dmesg -c
 # modprobe test_bpf test_id=1
 # dmesg
 test_bpf: test_bpf: set 'test_bpf' as the default test_suite.
 test_bpf: Rust-for-Linux#1 TXA jited:0 54 51 50 PASS
 test_bpf: Summary: 1 PASSED, 0 FAILED, [0/1 JIT'ed]

 # rmmod test_bpf
 # dmesg -c
 # modprobe test_bpf test_suite=test_bpf test_name=TXA
 # dmesg
 test_bpf: Rust-for-Linux#1 TXA jited:0 54 50 51 PASS
 test_bpf: Summary: 1 PASSED, 0 FAILED, [0/1 JIT'ed]

 # rmmod test_bpf
 # dmesg -c
 # modprobe test_bpf test_suite=test_tail_calls test_range=6,7
 # dmesg
 test_bpf: Rust-for-Linux#6 Tail call error path, NULL target jited:0 41 PASS
 test_bpf: Rust-for-Linux#7 Tail call error path, index out of range jited:0 32 PASS
 test_bpf: test_tail_calls: Summary: 2 PASSED, 0 FAILED, [0/2 JIT'ed]

 # rmmod test_bpf
 # dmesg -c
 # modprobe test_bpf test_suite=test_skb_segment test_id=1
 # dmesg
 test_bpf: Rust-for-Linux#1 gso_linear_no_head_frag PASS
 test_bpf: test_skb_segment: Summary: 1 PASSED, 0 FAILED

By the way, the above segment fault has been fixed in the latest bpf-next
tree which contains the mips64 JIT rework.

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Link: https://lore.kernel.org/bpf/1635384321-28128-1-git-send-email-yangtiezhu@loongson.cn
fbq pushed a commit that referenced this pull request Dec 28, 2023
Andrii Nakryiko says:

====================
BPF token support in libbpf's BPF object

Add fuller support for BPF token in high-level BPF object APIs. This is the
most frequently used way to work with BPF using libbpf, so supporting BPF
token there is critical.

Patch #1 is improving kernel-side BPF_TOKEN_CREATE behavior by rejecting to
create "empty" BPF token with no delegation. This seems like saner behavior
which also makes libbpf's caching better overall. If we ever want to create
BPF token with no delegate_xxx options set on BPF FS, we can use a new flag to
enable that.

Patches #2-#5 refactor libbpf internals, mostly feature detection code, to
prepare it from BPF token FD.

Patch #6 adds options to pass BPF token into BPF object open options. It also
adds implicit BPF token creation logic to BPF object load step, even without
any explicit involvement of the user. If the environment is setup properly,
BPF token will be created transparently and used implicitly. This allows for
all existing application to gain BPF token support by just linking with
latest version of libbpf library. No source code modifications are required.
All that under assumption that privileged container management agent properly
set up default BPF FS instance at /sys/bpf/fs to allow BPF token creation.

Patches #7-#8 adds more selftests, validating BPF object APIs work as expected
under unprivileged user namespaced conditions in the presence of BPF token.

Patch #9 extends libbpf with LIBBPF_BPF_TOKEN_PATH envvar knowledge, which can
be used to override custom BPF FS location used for implicit BPF token
creation logic without needing to adjust application code. This allows admins
or container managers to mount BPF token-enabled BPF FS at non-standard
location without the need to coordinate with applications.
LIBBPF_BPF_TOKEN_PATH can also be used to disable BPF token implicit creation
by setting it to an empty value. Patch #10 tests this new envvar functionality.

v2->v3:
  - move some stray feature cache refactorings into patch #4 (Alexei);
  - add LIBBPF_BPF_TOKEN_PATH envvar support (Alexei);
v1->v2:
  - remove minor code redundancies (Eduard, John);
  - add acks and rebase.
====================

Link: https://lore.kernel.org/r/20231213190842.3844987-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
fbq pushed a commit that referenced this pull request Dec 28, 2023
Ido Schimmel says:

====================
Add MDB bulk deletion support

This patchset adds MDB bulk deletion support, allowing user space to
request the deletion of matching entries instead of dumping the entire
MDB and issuing a separate deletion request for each matching entry.
Support is added in both the bridge and VXLAN drivers in a similar
fashion to the existing FDB bulk deletion support.

The parameters according to which bulk deletion can be performed are
similar to the FDB ones, namely: Destination port, VLAN ID, state (e.g.,
"permanent"), routing protocol, source / destination VNI, destination IP
and UDP port. Flushing based on flags (e.g., "offload", "fast_leave",
"added_by_star_ex", "blocked") is not currently supported, but can be
added in the future, if a use case arises.

Patch #1 adds a new uAPI attribute to allow specifying the state mask
according to which bulk deletion will be performed, if any.

Patch #2 adds a new policy according to which bulk deletion requests
(with 'NLM_F_BULK' flag set) will be parsed.

Patches #3-#4 add a new NDO for MDB bulk deletion and invoke it from the
rtnetlink code when a bulk deletion request is made.

Patches #5-#6 implement the MDB bulk deletion NDO in the bridge and
VXLAN drivers, respectively.

Patch #7 allows user space to issue MDB bulk deletion requests by no
longer rejecting the 'NLM_F_BULK' flag when it is set in 'RTM_DELMDB'
requests.

Patches #8-#9 add selftests for both drivers, for both good and bad
flows.

iproute2 changes can be found here [1].

https://github.com/idosch/iproute2/tree/submit/mdb_flush_v1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
fbq pushed a commit that referenced this pull request Dec 28, 2023
Wen Gu says:

====================
net/smc: implement SMCv2.1 virtual ISM device support

The fourth edition of SMCv2 adds the SMC version 2.1 feature updates for
SMC-Dv2 with virtual ISM. Virtual ISM are created and supported mainly by
OS or hypervisor software, comparable to IBM ISM which is based on platform
firmware or hardware.

With the introduction of virtual ISM, SMCv2.1 makes some updates:

- Introduce feature bitmask to indicate supplemental features.
- Reserve a range of CHIDs for virtual ISM.
- Support extended GIDs (128 bits) in CLC handshake.

So this patch set aims to implement these updates in Linux kernel. And it
acts as the first part of SMC-D virtual ISM extension & loopback-ism [1].

[1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/

v8->v7:
- Patch #7: v7 mistakenly changed the type of gid_ext in
  smc_clc_msg_accept_confirm to u64 instead of __be64 as previous versions
  when fixing the rebase conflicts. So fix this mistake.

v7->v6:
Link: https://lore.kernel.org/netdev/20231219084536.8158-1-guwen@linux.alibaba.com/
- Collect the Reviewed-by tag in v6;
- Patch #3: redefine the struct smc_clc_msg_accept_confirm;
- Patch #7: Because that the Patch #3 already adds '__packed' to
  smc_clc_msg_accept_confirm, so Patch #7 doesn't need to do the same thing.
  But this is a minor change, so I kept the 'Reviewed-by' tag.

Other changes in previous versions but not yet acked:
- Patch #1: Some minor changes in subject and fix the format issue
  (length exceeds 80 columns) compared to v3.
- Patch #5: removes useless ini->feature_mask assignment in __smc_connect()
  and smc_listen_v2_check() compared to v4.
- Patch #8: new added, compared to v3.

v6->v5:
Link: https://lore.kernel.org/netdev/1702371151-125258-1-git-send-email-guwen@linux.alibaba.com/
- Add 'Reviewed-by' label given in the previous versions:
  * Patch #4, #6, #9, #10 have nothing changed since v3;
- Patch #2:
  * fix the format issue (Alignment should match open parenthesis) compared to v5;
  * remove useless clc->hdr.length assignment in smcr_clc_prep_confirm_accept()
    compared to v5;
- Patch #3: new added compared to v5.
- Patch #7: some minor changes like aclc_v2->aclc or clc_v2->clc compared to v5
  due to the introduction of Patch #3. Since there were no major changes, I kept
  the 'Reviewed-by' label.

Other changes in previous versions but not yet acked:
- Patch #1: Some minor changes in subject and fix the format issue
  (length exceeds 80 columns) compared to v3.
- Patch #5: removes useless ini->feature_mask assignment in __smc_connect()
  and smc_listen_v2_check() compared to v4.
- Patch #8: new added, compared to v3.

v5->v4:
Link: https://lore.kernel.org/netdev/1702021259-41504-1-git-send-email-guwen@linux.alibaba.com/
- Patch #6: improve the comment of SMCD_CLC_MAX_V2_GID_ENTRIES;
- Patch #4: remove useless ini->feature_mask assignment;

v4->v3:
https://lore.kernel.org/netdev/1701920994-73705-1-git-send-email-guwen@linux.alibaba.com/
- Patch #6: use SMCD_CLC_MAX_V2_GID_ENTRIES to indicate the max gid
  entries in CLC proposal and using SMC_MAX_V2_ISM_DEVS to indicate the
  max devices to propose;
- Patch #6: use i and i+1 in smc_find_ism_v2_device_serv();
- Patch #2: replace the large if-else block in smc_clc_send_confirm_accept()
  with 2 subfunctions;
- Fix missing byte order conversion of GID and token in CLC handshake,
  which is in a separate patch sending to net:
  https://lore.kernel.org/netdev/1701882157-87956-1-git-send-email-guwen@linux.alibaba.com/
- Patch #7: add extended GID in SMC-D lgr netlink attribute;

v3->v2:
https://lore.kernel.org/netdev/1701343695-122657-1-git-send-email-guwen@linux.alibaba.com/
- Rename smc_clc_fill_fce as smc_clc_fill_fce_v2x;
- Remove ISM_IDENT_MASK from drivers/s390/net/ism.h;
- Add explicitly assigning 'false' to ism_v2_capable in ism_dev_init();
- Remove smc_ism_set_v2_capable() helper for now, and introduce it in
  later loopback-ism implementation;

v2->v1:
- Fix sparse complaint;
- Rebase to the latest net-next;
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
ojeda pushed a commit that referenced this pull request Jan 20, 2024
…te_call_indirect

kprobe_emulate_call_indirect currently uses int3_emulate_call to emulate
indirect calls. However, int3_emulate_call always assumes the size of
the call to be 5 bytes when calculating the return address. This is
incorrect for register-based indirect calls in x86, which can be either
2 or 3 bytes depending on whether REX prefix is used. At kprobe runtime,
the incorrect return address causes control flow to land onto the wrong
place after return -- possibly not a valid instruction boundary. This
can lead to a panic like the following:

[    7.308204][    C1] BUG: unable to handle page fault for address: 000000000002b4d8
[    7.308883][    C1] #PF: supervisor read access in kernel mode
[    7.309168][    C1] #PF: error_code(0x0000) - not-present page
[    7.309461][    C1] PGD 0 P4D 0
[    7.309652][    C1] Oops: 0000 [#1] SMP
[    7.309929][    C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.7.0-rc5-trace-for-next #6
[    7.310397][    C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[    7.311068][    C1] RIP: 0010:__common_interrupt+0x52/0xc0
[    7.311349][    C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3
[    7.312512][    C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046
[    7.312899][    C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001
[    7.313334][    C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4
[    7.313702][    C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482
[    7.314146][    C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023
[    7.314509][    C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000
[    7.314951][    C1] FS:  0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000
[    7.315396][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.315691][    C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0
[    7.316153][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    7.316508][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    7.316948][    C1] Call Trace:
[    7.317123][    C1]  <IRQ>
[    7.317279][    C1]  ? __die_body+0x64/0xb0
[    7.317482][    C1]  ? page_fault_oops+0x248/0x370
[    7.317712][    C1]  ? __wake_up+0x96/0xb0
[    7.317964][    C1]  ? exc_page_fault+0x62/0x130
[    7.318211][    C1]  ? asm_exc_page_fault+0x22/0x30
[    7.318444][    C1]  ? __cfi_native_send_call_func_single_ipi+0x10/0x10
[    7.318860][    C1]  ? default_idle+0xb/0x10
[    7.319063][    C1]  ? __common_interrupt+0x52/0xc0
[    7.319330][    C1]  common_interrupt+0x78/0x90
[    7.319546][    C1]  </IRQ>
[    7.319679][    C1]  <TASK>
[    7.319854][    C1]  asm_common_interrupt+0x22/0x40
[    7.320082][    C1] RIP: 0010:default_idle+0xb/0x10
[    7.320309][    C1] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 66 90 0f 00 2d 09 b9 3b 00 fb f4 <fa> c3 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 b8 0c 67 40 a5 e9
[    7.321449][    C1] RSP: 0018:ffffc9000009bee8 EFLAGS: 00000256
[    7.321808][    C1] RAX: ffff88813bca8b68 RBX: 0000000000000001 RCX: 000000000001ef0c
[    7.322227][    C1] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000001ef0c
[    7.322656][    C1] RBP: ffffc9000009bef8 R08: 8000000000000000 R09: 00000000000008c2
[    7.323083][    C1] R10: 0000000000000000 R11: ffffffff81058e70 R12: 0000000000000000
[    7.323530][    C1] R13: ffff8881002b30c0 R14: 0000000000000000 R15: 0000000000000000
[    7.323948][    C1]  ? __cfi_lapic_next_deadline+0x10/0x10
[    7.324239][    C1]  default_idle_call+0x31/0x50
[    7.324464][    C1]  do_idle+0xd3/0x240
[    7.324690][    C1]  cpu_startup_entry+0x25/0x30
[    7.324983][    C1]  start_secondary+0xb4/0xc0
[    7.325217][    C1]  secondary_startup_64_no_verify+0x179/0x17b
[    7.325498][    C1]  </TASK>
[    7.325641][    C1] Modules linked in:
[    7.325906][    C1] CR2: 000000000002b4d8
[    7.326104][    C1] ---[ end trace 0000000000000000 ]---
[    7.326354][    C1] RIP: 0010:__common_interrupt+0x52/0xc0
[    7.326614][    C1] Code: 01 00 4d 85 f6 74 39 49 81 fe 00 f0 ff ff 77 30 4c 89 f7 4d 8b 5e 68 41 ba 91 76 d8 42 45 03 53 fc 74 02 0f 0b cc ff d3 65 48 <8b> 05 30 c7 ff 7e 65 4c 89 3d 28 c7 ff 7e 5b 41 5c 41 5e 41 5f c3
[    7.327570][    C1] RSP: 0018:ffffc900000e0fd0 EFLAGS: 00010046
[    7.327910][    C1] RAX: 0000000000000001 RBX: 0000000000000023 RCX: 0000000000000001
[    7.328273][    C1] RDX: 00000000000003cd RSI: 0000000000000001 RDI: ffff888100d302a4
[    7.328632][    C1] RBP: 0000000000000001 R08: 0ef439818636191f R09: b1621ff338a3b482
[    7.329223][    C1] R10: ffffffff81e5127b R11: ffffffff81059810 R12: 0000000000000023
[    7.329780][    C1] R13: 0000000000000000 R14: ffff888100d30200 R15: 0000000000000000
[    7.330193][    C1] FS:  0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000
[    7.330632][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.331050][    C1] CR2: 000000000002b4d8 CR3: 0000000003028003 CR4: 0000000000370ef0
[    7.331454][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    7.331854][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    7.332236][    C1] Kernel panic - not syncing: Fatal exception in interrupt
[    7.332730][    C1] Kernel Offset: disabled
[    7.333044][    C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

The relevant assembly code is (from objdump, faulting address
highlighted):

ffffffff8102ed9d:       41 ff d3                  call   *%r11
ffffffff8102eda0:       65 48 <8b> 05 30 c7 ff    mov    %gs:0x7effc730(%rip),%rax

The emulation incorrectly sets the return address to be ffffffff8102ed9d
+ 0x5 = ffffffff8102eda2, which is the 8b byte in the middle of the next
mov. This in turn causes incorrect subsequent instruction decoding and
eventually triggers the page fault above.

Instead of invoking int3_emulate_call, perform push and jmp emulation
directly in kprobe_emulate_call_indirect. At this point we can obtain
the instruction size from p->ainsn.size so that we can calculate the
correct return address.

Link: https://lore.kernel.org/all/20240102233345.385475-1-jinghao7@illinois.edu/

Fixes: 6256e66 ("x86/kprobes: Use int3 instead of debug trap for single-step")
Cc: stable@vger.kernel.org
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
ojeda pushed a commit that referenced this pull request Jan 22, 2024
Petr Machata says:

====================
mlxsw: Miscellaneous fixes

This patchset is a bric-a-brac of fixes for bugs impacting mlxsw.

- Patches #1 and #2 fix issues in ACL handling error paths.
- Patch #3 fixes stack corruption in ACL code that a recent FW update
  has uncovered.

- Patch #4 fixes an issue in handling of IPIP next hops.

- Patch #5 fixes a typo in a the qos_pfc selftest
- Patch #6 fixes the same selftest to work with 8-lane ports.
====================

Link: https://lore.kernel.org/r/cover.1705502064.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
metaspace pushed a commit to metaspace/linux that referenced this pull request Apr 16, 2024
The driver creates /sys/kernel/debug/dri/0/mob_ttm even when the
corresponding ttm_resource_manager is not allocated.
This leads to a crash when trying to read from this file.

Add a check to create mob_ttm, system_mob_ttm, and gmr_ttm debug file
only when the corresponding ttm_resource_manager is allocated.

crash> bt
PID: 3133409  TASK: ffff8fe4834a5000  CPU: 3    COMMAND: "grep"
 #0 [ffffb954506b3b20] machine_kexec at ffffffffb2a6bec3
 #1 [ffffb954506b3b78] __crash_kexec at ffffffffb2bb598a
 Rust-for-Linux#2 [ffffb954506b3c38] crash_kexec at ffffffffb2bb68c1
 Rust-for-Linux#3 [ffffb954506b3c50] oops_end at ffffffffb2a2a9b1
 Rust-for-Linux#4 [ffffb954506b3c70] no_context at ffffffffb2a7e913
 Rust-for-Linux#5 [ffffb954506b3cc8] __bad_area_nosemaphore at ffffffffb2a7ec8c
 Rust-for-Linux#6 [ffffb954506b3d10] do_page_fault at ffffffffb2a7f887
 Rust-for-Linux#7 [ffffb954506b3d40] page_fault at ffffffffb360116e
    [exception RIP: ttm_resource_manager_debug+0x11]
    RIP: ffffffffc04afd11  RSP: ffffb954506b3df0  RFLAGS: 00010246
    RAX: ffff8fe41a6d1200  RBX: 0000000000000000  RCX: 0000000000000940
    RDX: 0000000000000000  RSI: ffffffffc04b4338  RDI: 0000000000000000
    RBP: ffffb954506b3e08   R8: ffff8fee3ffad000   R9: 0000000000000000
    R10: ffff8fe41a76a000  R11: 0000000000000001  R12: 00000000ffffffff
    R13: 0000000000000001  R14: ffff8fe5bb6f3900  R15: ffff8fe41a6d1200
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 Rust-for-Linux#8 [ffffb954506b3e00] ttm_resource_manager_show at ffffffffc04afde7 [ttm]
 Rust-for-Linux#9 [ffffb954506b3e30] seq_read at ffffffffb2d8f9f3
    RIP: 00007f4c4eda8985  RSP: 00007ffdbba9e9f8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 000000000037e000  RCX: 00007f4c4eda8985
    RDX: 000000000037e000  RSI: 00007f4c41573000  RDI: 0000000000000003
    RBP: 000000000037e000   R8: 0000000000000000   R9: 000000000037fe30
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007f4c41573000
    R13: 0000000000000003  R14: 00007f4c41572010  R15: 0000000000000003
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
Fixes: af4a25b ("drm/vmwgfx: Add debugfs entries for various ttm resource managers")
Cc: <stable@vger.kernel.org>
Reviewed-by: Zack Rusin <zack.rusin@broadcom.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240312093551.196609-1-jfalempe@redhat.com
metaspace pushed a commit to metaspace/linux that referenced this pull request Apr 16, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 unlike early commit path stage which triggers a call to abort,
         an explicit release of the batch is required on abort, otherwise
         mutex is released and commit_list remains in place.

Patch Rust-for-Linux#2 release mutex after nft_gc_seq_end() in commit path, otherwise
         async GC worker could collect expired objects.

Patch Rust-for-Linux#3 flush pending destroy work in module removal path, otherwise UaF
         is possible.

Patch Rust-for-Linux#4 and Rust-for-Linux#6 restrict the table dormant flag with basechain updates
	 to fix state inconsistency in the hook registration.

Patch Rust-for-Linux#5 adds missing RCU read side lock to flowtable type to avoid races
	 with module removal.

* tag 'nf-24-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
====================

Link: https://lore.kernel.org/r/20240404104334.1627-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ojeda pushed a commit that referenced this pull request Apr 29, 2024
…git/netfilter/nf

netfilter pull request 24-04-11

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patches #1 and #2 add missing rcu read side lock when iterating over
expression and object type list which could race with module removal.

Patch #3 prevents promisc packet from visiting the bridge/input hook
	 to amend a recent fix to address conntrack confirmation race
	 in br_netfilter and nf_conntrack_bridge.

Patch #4 adds and uses iterate decorator type to fetch the current
	 pipapo set backend datastructure view when netlink dumps the
	 set elements.

Patch #5 fixes removal of duplicate elements in the pipapo set backend.

Patch #6 flowtable validates pppoe header before accessing it.

Patch #7 fixes flowtable datapath for pppoe packets, otherwise lookup
         fails and pppoe packets follow classic path.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
ojeda pushed a commit that referenced this pull request Apr 29, 2024
vhost_worker will call tun call backs to receive packets. If too many
illegal packets arrives, tun_do_read will keep dumping packet contents.
When console is enabled, it will costs much more cpu time to dump
packet and soft lockup will be detected.

net_ratelimit mechanism can be used to limit the dumping rate.

PID: 33036    TASK: ffff949da6f20000  CPU: 23   COMMAND: "vhost-32980"
 #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
 #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
 #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
 #3 [fffffe00003fced0] do_nmi at ffffffff8922660d
 #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
    [exception RIP: io_serial_in+20]
    RIP: ffffffff89792594  RSP: ffffa655314979e8  RFLAGS: 00000002
    RAX: ffffffff89792500  RBX: ffffffff8af428a0  RCX: 0000000000000000
    RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff8af428a0
    RBP: 0000000000002710   R8: 0000000000000004   R9: 000000000000000f
    R10: 0000000000000000  R11: ffffffff8acbf64f  R12: 0000000000000020
    R13: ffffffff8acbf698  R14: 0000000000000058  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffa655314979e8] io_serial_in at ffffffff89792594
 #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
 #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
 #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
 #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
 #10 [ffffa65531497ac8] console_unlock at ffffffff89316124
 #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
 #12 [ffffa65531497b68] printk at ffffffff89318306
 #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
 #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
 #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
 #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
 #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
 #18 [ffffa65531497f10] kthread at ffffffff892d2e72
 #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f

Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors")
Signed-off-by: Lei Chen <lei.chen@smartx.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20240415020247.2207781-1-lei.chen@smartx.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ojeda pushed a commit that referenced this pull request May 27, 2024
…l/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
     means we now have six free FMODE_* bits in total (but bit #6
     already got used for FMODE_WRITE_RESTRICTED)

   - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)

   - Add fd_raw cleanup class so we can make use of automatic cleanup
     provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well

   - Optimize seq_puts()

   - Simplify __seq_puts()

   - Add new anon_inode_getfile_fmode() api to allow specifying f_mode
     instead of open-coding it in multiple places

   - Annotate struct file_handle with __counted_by() and use
     struct_size()

   - Warn in get_file() whether f_count resurrection from zero is
     attempted (epoll/drm discussion)

   - Folio-sophize aio

   - Export the subvolume id in statx() for both btrfs and bcachefs

   - Relax linkat(AT_EMPTY_PATH) requirements

   - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
     for dup*() equality replacing kcmp()

  Cleanups:

   - Compile out swapfile inode checks when swap isn't enabled

   - Use (1 << n) notation for FMODE_* bitshifts for clarity

   - Remove redundant variable assignment in fs/direct-io

   - Cleanup uses of strncpy in orangefs

   - Speed up and cleanup writeback

   - Move fsparam_string_empty() helper into header since it's currently
     open-coded in multiple places

   - Add kernel-doc comments to proc_create_net_data_write()

   - Don't needlessly read dentry->d_flags twice

  Fixes:

   - Fix out-of-range warning in nilfs2

   - Fix ecryptfs overflow due to wrong encryption packet size
     calculation

   - Fix overly long line in xfs file_operations (follow-up to FMODE_*
     cleanup)

   - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
     to FMODE_* cleanup)

   - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
     cleanup)

   - Fix stable offset api to prevent endless loops

   - Fix afs file server rotations

   - Prevent xattr node from overflowing the eraseblock in jffs2

   - Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
     operation instead of .open() operation since this caused userspace
     regressions"

* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  afs: Fix fileserver rotation getting stuck
  selftests: add F_DUPDFD_QUERY selftests
  fcntl: add F_DUPFD_QUERY fcntl()
  file: add fd_raw cleanup class
  fs: WARN when f_count resurrection is attempted
  seq_file: Simplify __seq_puts()
  seq_file: Optimize seq_puts()
  proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
  fs: Create anon_inode_getfile_fmode()
  xfs: don't call xfs_file_open from xfs_dir_open
  xfs: drop fop_flags for directories
  xfs: fix overly long line in the file_operations
  shmem: Fix shmem_rename2()
  libfs: Add simple_offset_rename() API
  libfs: Fix simple_offset_rename_exchange()
  jffs2: prevent xattr node from overflowing the eraseblock
  vfs, swap: compile out IS_SWAPFILE() on swapless configs
  vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
  fs/direct-io: remove redundant assignment to variable retval
  fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
  ...
ojeda pushed a commit that referenced this pull request May 27, 2024
…rnel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

Patch #1 skips transaction if object type provides no .update interface.

Patch #2 skips NETDEV_CHANGENAME which is unused.

Patch #3 enables conntrack to handle Multicast Router Advertisements and
	 Multicast Router Solicitations from the Multicast Router Discovery
	 protocol (RFC4286) as untracked opposed to invalid packets.
	 From Linus Luessing.

Patch #4 updates DCCP conntracker to mark invalid as invalid, instead of
	 dropping them, from Jason Xing.

Patch #5 uses NF_DROP instead of -NF_DROP since NF_DROP is 0,
	 also from Jason.

Patch #6 removes reference in netfilter's sysctl documentation on pickup
	 entries which were already removed by Florian Westphal.

Patch #7 removes check for IPS_OFFLOAD flag to disable early drop which
	 allows to evict entries from the conntrack table,
	 also from Florian.

Patches #8 to #16 updates nf_tables pipapo set backend to allocate
	 the datastructure copy on-demand from preparation phase,
	 to better deal with OOM situations where .commit step is too late
	 to fail. Series from Florian Westphal.

Patch #17 adds a selftest with packetdrill to cover conntrack TCP state
	 transitions, also from Florian.

Patch #18 use GFP_KERNEL to clone elements from control plane to avoid
	 quick atomic reserves exhaustion with large sets, reporter refers
	 to million entries magnitude.

* tag 'nf-next-24-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: allow clone callbacks to sleep
  selftests: netfilter: add packetdrill based conntrack tests
  netfilter: nft_set_pipapo: remove dirty flag
  netfilter: nft_set_pipapo: move cloning of match info to insert/removal path
  netfilter: nft_set_pipapo: prepare pipapo_get helper for on-demand clone
  netfilter: nft_set_pipapo: merge deactivate helper into caller
  netfilter: nft_set_pipapo: prepare walk function for on-demand clone
  netfilter: nft_set_pipapo: prepare destroy function for on-demand clone
  netfilter: nft_set_pipapo: make pipapo_clone helper return NULL
  netfilter: nft_set_pipapo: move prove_locking helper around
  netfilter: conntrack: remove flowtable early-drop test
  netfilter: conntrack: documentation: remove reference to non-existent sysctl
  netfilter: use NF_DROP instead of -NF_DROP
  netfilter: conntrack: dccp: try not to drop skb in conntrack
  netfilter: conntrack: fix ct-state for ICMPv6 Multicast Router Discovery
  netfilter: nf_tables: remove NETDEV_CHANGENAME from netdev chain event handler
  netfilter: nf_tables: skip transaction if update object is not implemented
====================

Link: https://lore.kernel.org/r/20240512161436.168973-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ojeda pushed a commit that referenced this pull request May 27, 2024
Using completion_done to determine whether the caller has gone
away only works after a complete call.  Furthermore it's still
possible that the caller has not yet called wait_for_completion,
resulting in another potential UAF.

Fix this by making the caller use cancel_work_sync and then freeing
the memory safely.

Fixes: 7d42e09 ("crypto: qat - resolve race condition during AER recovery")
Cc: <stable@vger.kernel.org> #6.8+
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
ojeda pushed a commit that referenced this pull request Jun 11, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 syzbot reports that nf_reinject() could be called without
         rcu_read_lock() when flushing pending packets at nfnetlink
         queue removal, from Eric Dumazet.

Patch #2 flushes ipset list:set when canceling garbage collection to
         reference to other lists to fix a race, from Jozsef Kadlecsik.

Patch #3 restores q-in-q matching with nft_payload by reverting
         f6ae9f1 ("netfilter: nft_payload: add C-VLAN support").

Patch #4 fixes vlan mangling in skbuff when vlan offload is present
         in skbuff, without this patch nft_payload corrupts packets
         in this case.

Patch #5 fixes possible nul-deref in tproxy no IP address is found in
         netdevice, reported by syzbot and patch from Florian Westphal.

Patch #6 removes a superfluous restriction which prevents loose fib
         lookups from input and forward hooks, from Eric Garver.

My assessment is that patches #1, #2 and #5 address possible kernel
crash, anything else in this batch fixes broken features.

netfilter pull request 24-05-29

* tag 'nf-24-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_fib: allow from forward/input without iif selector
  netfilter: tproxy: bail out if IP has been disabled on the device
  netfilter: nft_payload: skbuff vlan metadata mangle support
  netfilter: nft_payload: restore vlan q-in-q match support
  netfilter: ipset: Add list flush to cancel_gc
  netfilter: nfnetlink_queue: acquire rcu_read_lock() in instance_destroy_rcu()
====================

Link: https://lore.kernel.org/r/20240528225519.1155786-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ojeda pushed a commit that referenced this pull request Jun 11, 2024
With commit c4cb231 ("iommu/amd: Add support for enable/disable IOPF")
we are hitting below issue. This happens because in IOPF enablement path
it holds spin lock with irq disable and then tries to take mutex lock.

dmesg:
-----
[    0.938739] =============================
[    0.938740] [ BUG: Invalid wait context ]
[    0.938742] 6.10.0-rc1+ #1 Not tainted
[    0.938745] -----------------------------
[    0.938746] swapper/0/1 is trying to lock:
[    0.938748] ffffffff8c9f01d8 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x78/0x4a0
[    0.938767] other info that might help us debug this:
[    0.938768] context-{5:5}
[    0.938769] 7 locks held by swapper/0/1:
[    0.938772]  #0: ffff888101a91310 (&group->mutex){+.+.}-{4:4}, at: bus_iommu_probe+0x70/0x160
[    0.938790]  #1: ffff888101d1f1b8 (&domain->lock){....}-{3:3}, at: amd_iommu_attach_device+0xa5/0x700
[    0.938799]  #2: ffff888101cc3d18 (&dev_data->lock){....}-{3:3}, at: amd_iommu_attach_device+0xc5/0x700
[    0.938806]  #3: ffff888100052830 (&iommu->lock){....}-{2:2}, at: amd_iommu_iopf_add_device+0x3f/0xa0
[    0.938813]  #4: ffffffff8945a340 (console_lock){+.+.}-{0:0}, at: _printk+0x48/0x50
[    0.938822]  #5: ffffffff8945a390 (console_srcu){....}-{0:0}, at: console_flush_all+0x58/0x4e0
[    0.938867]  #6: ffffffff82459f80 (console_owner){....}-{0:0}, at: console_flush_all+0x1f0/0x4e0
[    0.938872] stack backtrace:
[    0.938874] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.10.0-rc1+ #1
[    0.938877] Hardware name: HP HP EliteBook 745 G3/807E, BIOS N73 Ver. 01.39 04/16/2019

Fix above issue by re-arranging code in attach device path:
  - move device PASID/IOPF enablement outside lock in AMD IOMMU driver.
    This is safe as core layer holds group->mutex lock before calling
    iommu_ops->attach_dev.

Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Fixes: c4cb231 ("iommu/amd: Add support for enable/disable IOPF")
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240530084801.10758-1-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
ojeda pushed a commit that referenced this pull request Jun 11, 2024
…PLES event"

This reverts commit 7d1405c.

This causes segfaults in some cases, as reported by Milian:

  ```
  sudo /usr/bin/perf record -z --call-graph dwarf -e cycles -e
  raw_syscalls:sys_enter ls
  ...
  [ perf record: Woken up 3 times to write data ]
  malloc(): invalid next size (unsorted)
  Aborted
  ```

  Backtrace with GDB + debuginfod:

  ```
  malloc(): invalid next size (unsorted)

  Thread 1 "perf" received signal SIGABRT, Aborted.
  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
  no_tid=no_tid@entry=0) at pthread_kill.c:44
  Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c
  44            return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO
  (ret) : 0;
  (gdb) bt
  #0  __pthread_kill_implementation (threadid=<optimized out>,
  signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
  #1  0x00007ffff6ea8eb3 in __pthread_kill_internal (threadid=<optimized out>,
  signo=6) at pthread_kill.c:78
  #2  0x00007ffff6e50a30 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/
  raise.c:26
  #3  0x00007ffff6e384c3 in __GI_abort () at abort.c:79
  #4  0x00007ffff6e39354 in __libc_message_impl (fmt=fmt@entry=0x7ffff6fc22ea
  "%s\n") at ../sysdeps/posix/libc_fatal.c:132
  #5  0x00007ffff6eb3085 in malloc_printerr (str=str@entry=0x7ffff6fc5850
  "malloc(): invalid next size (unsorted)") at malloc.c:5772
  #6  0x00007ffff6eb657c in _int_malloc (av=av@entry=0x7ffff6ff6ac0
  <main_arena>, bytes=bytes@entry=368) at malloc.c:4081
  #7  0x00007ffff6eb877e in __libc_calloc (n=<optimized out>,
  elem_size=<optimized out>) at malloc.c:3754
  #8  0x000055555569bdb6 in perf_session.do_write_header ()
  #9  0x00005555555a373a in __cmd_record.constprop.0 ()
  #10 0x00005555555a6846 in cmd_record ()
  #11 0x000055555564db7f in run_builtin ()
  #12 0x000055555558ed77 in main ()
  ```

  Valgrind memcheck:
  ```
  ==45136== Invalid write of size 8
  ==45136==    at 0x2B38A5: perf_event__synthesize_id_sample (in /usr/bin/perf)
  ==45136==    by 0x157069: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
  ==45136== Syscall param write(buf) points to unaddressable byte(s)
  ==45136==    at 0x575953D: __libc_write (write.c:26)
  ==45136==    by 0x575953D: write (write.c:24)
  ==45136==    by 0x35761F: ion (in /usr/bin/perf)
  ==45136==    by 0x357778: writen (in /usr/bin/perf)
  ==45136==    by 0x1548F7: record__write (in /usr/bin/perf)
  ==45136==    by 0x15708A: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==  Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd
  ==45136==    at 0x4849BF3: calloc (vg_replace_malloc.c:1675)
  ==45136==    by 0x3574AB: zalloc (in /usr/bin/perf)
  ==45136==    by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf)
  ==45136==    by 0x15A845: cmd_record (in /usr/bin/perf)
  ==45136==    by 0x201B7E: run_builtin (in /usr/bin/perf)
  ==45136==    by 0x142D76: main (in /usr/bin/perf)
  ==45136==
 -----

Closes: https://lore.kernel.org/linux-perf-users/23879991.0LEYPuXRzz@milian-workstation/
Reported-by: Milian Wolff <milian.wolff@kdab.com>
Tested-by: Milian Wolff <milian.wolff@kdab.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@kernel.org # 6.8+
Link: https://lore.kernel.org/lkml/Zl9ksOlHJHnKM70p@x1
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ojeda pushed a commit that referenced this pull request Jun 11, 2024
We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ojeda pushed a commit that referenced this pull request Jul 8, 2024
…play

During inode logging (and log replay too), we are holding a transaction
handle and we often need to call btrfs_iget(), which will read an inode
from its subvolume btree if it's not loaded in memory and that results in
allocating an inode with GFP_KERNEL semantics at the btrfs_alloc_inode()
callback - and this may recurse into the filesystem in case we are under
memory pressure and attempt to commit the current transaction, resulting
in a deadlock since the logging (or log replay) task is holding a
transaction handle open.

Syzbot reported this with the following stack traces:

  WARNING: possible circular locking dependency detected
  6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0 Not tainted
  ------------------------------------------------------
  syz-executor.1/9919 is trying to acquire lock:
  ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:334 [inline]
  ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:3891 [inline]
  ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:3981 [inline]
  ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020

  but task is already holding lock:
  ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&ei->log_mutex){+.+.}-{3:3}:
         __mutex_lock_common kernel/locking/mutex.c:608 [inline]
         __mutex_lock+0x175/0x9c0 kernel/locking/mutex.c:752
         btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481
         btrfs_log_inode_parent+0x8cb/0x2a90 fs/btrfs/tree-log.c:7079
         btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
         btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
         vfs_fsync_range+0x141/0x230 fs/sync.c:188
         generic_write_sync include/linux/fs.h:2794 [inline]
         btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
         new_sync_write fs/read_write.c:497 [inline]
         vfs_write+0x6b6/0x1140 fs/read_write.c:590
         ksys_write+0x12f/0x260 fs/read_write.c:643
         do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
         __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
         do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
         entry_SYSENTER_compat_after_hwframe+0x84/0x8e

  -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}:
         join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315
         start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700
         btrfs_commit_super+0xa1/0x110 fs/btrfs/disk-io.c:4170
         close_ctree+0xcb0/0xf90 fs/btrfs/disk-io.c:4324
         generic_shutdown_super+0x159/0x3d0 fs/super.c:642
         kill_anon_super+0x3a/0x60 fs/super.c:1226
         btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2096
         deactivate_locked_super+0xbe/0x1a0 fs/super.c:473
         deactivate_super+0xde/0x100 fs/super.c:506
         cleanup_mnt+0x222/0x450 fs/namespace.c:1267
         task_work_run+0x14e/0x250 kernel/task_work.c:180
         resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
         exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
         __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
         syscall_exit_to_user_mode+0x278/0x2a0 kernel/entry/common.c:218
         __do_fast_syscall_32+0x80/0x120 arch/x86/entry/common.c:389
         do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
         entry_SYSENTER_compat_after_hwframe+0x84/0x8e

  -> #1 (btrfs_trans_num_writers){++++}-{0:0}:
         __lock_release kernel/locking/lockdep.c:5468 [inline]
         lock_release+0x33e/0x6c0 kernel/locking/lockdep.c:5774
         percpu_up_read include/linux/percpu-rwsem.h:99 [inline]
         __sb_end_write include/linux/fs.h:1650 [inline]
         sb_end_intwrite include/linux/fs.h:1767 [inline]
         __btrfs_end_transaction+0x5ca/0x920 fs/btrfs/transaction.c:1071
         btrfs_commit_inode_delayed_inode+0x228/0x330 fs/btrfs/delayed-inode.c:1301
         btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291
         evict+0x2ed/0x6c0 fs/inode.c:667
         iput_final fs/inode.c:1741 [inline]
         iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
         iput+0x5c/0x80 fs/inode.c:1757
         dentry_unlink_inode+0x295/0x480 fs/dcache.c:400
         __dentry_kill+0x1d0/0x600 fs/dcache.c:603
         dput.part.0+0x4b1/0x9b0 fs/dcache.c:845
         dput+0x1f/0x30 fs/dcache.c:835
         ovl_stack_put+0x60/0x90 fs/overlayfs/util.c:132
         ovl_destroy_inode+0xc6/0x190 fs/overlayfs/super.c:182
         destroy_inode+0xc4/0x1b0 fs/inode.c:311
         iput_final fs/inode.c:1741 [inline]
         iput.part.0+0x5a8/0x7f0 fs/inode.c:1767
         iput+0x5c/0x80 fs/inode.c:1757
         dentry_unlink_inode+0x295/0x480 fs/dcache.c:400
         __dentry_kill+0x1d0/0x600 fs/dcache.c:603
         shrink_kill fs/dcache.c:1048 [inline]
         shrink_dentry_list+0x140/0x5d0 fs/dcache.c:1075
         prune_dcache_sb+0xeb/0x150 fs/dcache.c:1156
         super_cache_scan+0x32a/0x550 fs/super.c:221
         do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435
         shrink_slab_memcg mm/shrinker.c:548 [inline]
         shrink_slab+0xa87/0x1310 mm/shrinker.c:626
         shrink_one+0x493/0x7c0 mm/vmscan.c:4790
         shrink_many mm/vmscan.c:4851 [inline]
         lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951
         shrink_node mm/vmscan.c:5910 [inline]
         kswapd_shrink_node mm/vmscan.c:6720 [inline]
         balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911
         kswapd+0x5ea/0xbf0 mm/vmscan.c:7180
         kthread+0x2c1/0x3a0 kernel/kthread.c:389
         ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
         ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  -> #0 (fs_reclaim){+.+.}-{0:0}:
         check_prev_add kernel/locking/lockdep.c:3134 [inline]
         check_prevs_add kernel/locking/lockdep.c:3253 [inline]
         validate_chain kernel/locking/lockdep.c:3869 [inline]
         __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
         lock_acquire kernel/locking/lockdep.c:5754 [inline]
         lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
         __fs_reclaim_acquire mm/page_alloc.c:3801 [inline]
         fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815
         might_alloc include/linux/sched/mm.h:334 [inline]
         slab_pre_alloc_hook mm/slub.c:3891 [inline]
         slab_alloc_node mm/slub.c:3981 [inline]
         kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
         btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
         alloc_inode+0x5d/0x230 fs/inode.c:261
         iget5_locked fs/inode.c:1235 [inline]
         iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
         btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
         btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
         btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
         add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline]
         copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928
         btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592
         log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline]
         btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718
         btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline]
         btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141
         btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
         btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
         vfs_fsync_range+0x141/0x230 fs/sync.c:188
         generic_write_sync include/linux/fs.h:2794 [inline]
         btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
         do_iter_readv_writev+0x504/0x780 fs/read_write.c:741
         vfs_writev+0x36f/0xde0 fs/read_write.c:971
         do_pwritev+0x1b2/0x260 fs/read_write.c:1072
         __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline]
         __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline]
         __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210
         do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
         __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
         do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
         entry_SYSENTER_compat_after_hwframe+0x84/0x8e

  other info that might help us debug this:

  Chain exists of:
    fs_reclaim --> btrfs_trans_num_extwriters --> &ei->log_mutex

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&ei->log_mutex);
                                 lock(btrfs_trans_num_extwriters);
                                 lock(&ei->log_mutex);
    lock(fs_reclaim);

   *** DEADLOCK ***

  7 locks held by syz-executor.1/9919:
   #0: ffff88802be20420 (sb_writers#23){.+.+}-{0:0}, at: do_pwritev+0x1b2/0x260 fs/read_write.c:1072
   #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: inode_lock include/linux/fs.h:791 [inline]
   #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: btrfs_inode_lock+0xc8/0x110 fs/btrfs/inode.c:385
   #2: ffff888065c0f778 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_inode_lock+0xee/0x110 fs/btrfs/inode.c:388
   #3: ffff88802be20610 (sb_internal#4){.+.+}-{0:0}, at: btrfs_sync_file+0x95b/0xe10 fs/btrfs/file.c:1952
   #4: ffff8880546323f0 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290
   #5: ffff888054632418 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290
   #6: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481

  stack backtrace:
  CPU: 2 PID: 9919 Comm: syz-executor.1 Not tainted 6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114
   check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187
   check_prev_add kernel/locking/lockdep.c:3134 [inline]
   check_prevs_add kernel/locking/lockdep.c:3253 [inline]
   validate_chain kernel/locking/lockdep.c:3869 [inline]
   __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137
   lock_acquire kernel/locking/lockdep.c:5754 [inline]
   lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
   __fs_reclaim_acquire mm/page_alloc.c:3801 [inline]
   fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815
   might_alloc include/linux/sched/mm.h:334 [inline]
   slab_pre_alloc_hook mm/slub.c:3891 [inline]
   slab_alloc_node mm/slub.c:3981 [inline]
   kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020
   btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411
   alloc_inode+0x5d/0x230 fs/inode.c:261
   iget5_locked fs/inode.c:1235 [inline]
   iget5_locked+0x1c9/0x2c0 fs/inode.c:1228
   btrfs_iget_locked fs/btrfs/inode.c:5590 [inline]
   btrfs_iget_path fs/btrfs/inode.c:5607 [inline]
   btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636
   add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline]
   copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928
   btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592
   log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline]
   btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718
   btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline]
   btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141
   btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180
   btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959
   vfs_fsync_range+0x141/0x230 fs/sync.c:188
   generic_write_sync include/linux/fs.h:2794 [inline]
   btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705
   do_iter_readv_writev+0x504/0x780 fs/read_write.c:741
   vfs_writev+0x36f/0xde0 fs/read_write.c:971
   do_pwritev+0x1b2/0x260 fs/read_write.c:1072
   __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline]
   __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline]
   __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210
   do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
   __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386
   do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411
   entry_SYSENTER_compat_after_hwframe+0x84/0x8e
  RIP: 0023:0xf7334579
  Code: b8 01 10 06 03 (...)
  RSP: 002b:00000000f5f265ac EFLAGS: 00000292 ORIG_RAX: 000000000000017b
  RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000200002c0
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Fix this by ensuring we are under a NOFS scope whenever we call
btrfs_iget() during inode logging and log replay.

Reported-by: syzbot+8576cfa84070dce4d59b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000274a3a061abbd928@google.com/
Fixes: 712e36c ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ojeda pushed a commit that referenced this pull request Jul 8, 2024
The code in ocfs2_dio_end_io_write() estimates number of necessary
transaction credits using ocfs2_calc_extend_credits().  This however does
not take into account that the IO could be arbitrarily large and can
contain arbitrary number of extents.

Extent tree manipulations do often extend the current transaction but not
in all of the cases.  For example if we have only single block extents in
the tree, ocfs2_mark_extent_written() will end up calling
ocfs2_replace_extent_rec() all the time and we will never extend the
current transaction and eventually exhaust all the transaction credits if
the IO contains many single block extents.  Once that happens a
WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in
jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to
this error.  This was actually triggered by one of our customers on a
heavily fragmented OCFS2 filesystem.

To fix the issue make sure the transaction always has enough credits for
one extent insert before each call of ocfs2_mark_extent_written().

Heming Zhao said:

------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"

PID: xxx  TASK: xxxx  CPU: 5  COMMAND: "SubmitThread-CA"
  #0 machine_kexec at ffffffff8c069932
  #1 __crash_kexec at ffffffff8c1338fa
  #2 panic at ffffffff8c1d69b9
  #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
  #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
  #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
  #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
  #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
  #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
  #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
#11 dio_complete at ffffffff8c2b9fa7
#12 do_blockdev_direct_IO at ffffffff8c2bc09f
#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
#14 generic_file_direct_write at ffffffff8c1dcf14
#15 __generic_file_write_iter at ffffffff8c1dd07b
#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
#17 aio_write at ffffffff8c2cc72e
#18 kmem_cache_alloc at ffffffff8c248dde
#19 do_io_submit at ffffffff8c2ccada
#20 do_syscall_64 at ffffffff8c004984
#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba

Link: https://lkml.kernel.org/r/20240617095543.6971-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240614145243.8837-1-jack@suse.cz
Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ojeda pushed a commit that referenced this pull request Aug 5, 2024
The pen ID, 0x80842, was not the correct ID for wacom driver to
treat. The ID was corrected to 0x8842.
Also, 0x4200 was not the expected ID used on any Wacom device.
Therefore, 0x4200 was removed.

Signed-off-by: Tatsunosuke Tobita <tatsunosuke.tobita@wacom.com>
Signed-off-by: Tatsunosuke Tobita <tatsunosuke.wacom@gmail.com>
Fixes: bfdc750 ("HID: wacom: add three styli to wacom_intuos_get_tool_type")
Cc: stable@kernel.org #6.2
Reviewed-by: Ping Cheng <ping.cheng@wacom.com>
Link: https://patch.msgid.link/20240709055729.17158-1-tatsunosuke.wacom@gmail.com
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Darksonn pushed a commit to Darksonn/linux that referenced this pull request Aug 14, 2024
When l2tp tunnels use a socket provided by userspace, we can hit
lockdep splats like the below when data is transmitted through another
(unrelated) userspace socket which then gets routed over l2tp.

This issue was previously discussed here:
https://lore.kernel.org/netdev/87sfialu2n.fsf@cloudflare.com/

The solution is to have lockdep treat socket locks of l2tp tunnel
sockets separately than those of standard INET sockets. To do so, use
a different lockdep subclass where lock nesting is possible.

  ============================================
  WARNING: possible recursive locking detected
  6.10.0+ Rust-for-Linux#34 Not tainted
  --------------------------------------------
  iperf3/771 is trying to acquire lock:
  ffff8881027601d8 (slock-AF_INET/1){+.-.}-{2:2}, at: l2tp_xmit_skb+0x243/0x9d0

  but task is already holding lock:
  ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(slock-AF_INET/1);
    lock(slock-AF_INET/1);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  10 locks held by iperf3/771:
   #0: ffff888102650258 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendmsg+0x1a/0x40
   Rust-for-Linux#1: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
   Rust-for-Linux#2: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
   Rust-for-Linux#3: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x28b/0x9f0
   Rust-for-Linux#4: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0xf9/0x260
   Rust-for-Linux#5: ffff888102650d98 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1848/0x1e10
   Rust-for-Linux#6: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x4b/0xbc0
   Rust-for-Linux#7: ffffffff822ac220 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x17a/0x1130
   Rust-for-Linux#8: ffffffff822ac1e0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0xcc/0x1450
   Rust-for-Linux#9: ffff888101f33258 (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock#2){+...}-{2:2}, at: __dev_queue_xmit+0x513/0x1450

  stack backtrace:
  CPU: 2 UID: 0 PID: 771 Comm: iperf3 Not tainted 6.10.0+ Rust-for-Linux#34
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
   <IRQ>
   dump_stack_lvl+0x69/0xa0
   dump_stack+0xc/0x20
   __lock_acquire+0x135d/0x2600
   ? srso_alias_return_thunk+0x5/0xfbef5
   lock_acquire+0xc4/0x2a0
   ? l2tp_xmit_skb+0x243/0x9d0
   ? __skb_checksum+0xa3/0x540
   _raw_spin_lock_nested+0x35/0x50
   ? l2tp_xmit_skb+0x243/0x9d0
   l2tp_xmit_skb+0x243/0x9d0
   l2tp_eth_dev_xmit+0x3c/0xc0
   dev_hard_start_xmit+0x11e/0x420
   sch_direct_xmit+0xc3/0x640
   __dev_queue_xmit+0x61c/0x1450
   ? ip_finish_output2+0xf4c/0x1130
   ip_finish_output2+0x6b6/0x1130
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __ip_finish_output+0x217/0x380
   ? srso_alias_return_thunk+0x5/0xfbef5
   __ip_finish_output+0x217/0x380
   ip_output+0x99/0x120
   __ip_queue_xmit+0xae4/0xbc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? tcp_options_write.constprop.0+0xcb/0x3e0
   ip_queue_xmit+0x34/0x40
   __tcp_transmit_skb+0x1625/0x1890
   __tcp_send_ack+0x1b8/0x340
   tcp_send_ack+0x23/0x30
   __tcp_ack_snd_check+0xa8/0x530
   ? srso_alias_return_thunk+0x5/0xfbef5
   tcp_rcv_established+0x412/0xd70
   tcp_v4_do_rcv+0x299/0x420
   tcp_v4_rcv+0x1991/0x1e10
   ip_protocol_deliver_rcu+0x50/0x220
   ip_local_deliver_finish+0x158/0x260
   ip_local_deliver+0xc8/0xe0
   ip_rcv+0xe5/0x1d0
   ? __pfx_ip_rcv+0x10/0x10
   __netif_receive_skb_one_core+0xce/0xe0
   ? process_backlog+0x28b/0x9f0
   __netif_receive_skb+0x34/0xd0
   ? process_backlog+0x28b/0x9f0
   process_backlog+0x2cb/0x9f0
   __napi_poll.constprop.0+0x61/0x280
   net_rx_action+0x332/0x670
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? find_held_lock+0x2b/0x80
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   handle_softirqs+0xda/0x480
   ? __dev_queue_xmit+0xa2c/0x1450
   do_softirq+0xa1/0xd0
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0xc8/0xe0
   ? __dev_queue_xmit+0xa2c/0x1450
   __dev_queue_xmit+0xa48/0x1450
   ? ip_finish_output2+0xf4c/0x1130
   ip_finish_output2+0x6b6/0x1130
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __ip_finish_output+0x217/0x380
   ? srso_alias_return_thunk+0x5/0xfbef5
   __ip_finish_output+0x217/0x380
   ip_output+0x99/0x120
   __ip_queue_xmit+0xae4/0xbc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? tcp_options_write.constprop.0+0xcb/0x3e0
   ip_queue_xmit+0x34/0x40
   __tcp_transmit_skb+0x1625/0x1890
   tcp_write_xmit+0x766/0x2fb0
   ? __entry_text_end+0x102ba9/0x102bad
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __might_fault+0x74/0xc0
   ? srso_alias_return_thunk+0x5/0xfbef5
   __tcp_push_pending_frames+0x56/0x190
   tcp_push+0x117/0x310
   tcp_sendmsg_locked+0x14c1/0x1740
   tcp_sendmsg+0x28/0x40
   inet_sendmsg+0x5d/0x90
   sock_write_iter+0x242/0x2b0
   vfs_write+0x68d/0x800
   ? __pfx_sock_write_iter+0x10/0x10
   ksys_write+0xc8/0xf0
   __x64_sys_write+0x3d/0x50
   x64_sys_call+0xfaf/0x1f50
   do_syscall_64+0x6d/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f4d143af992
  Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 01 cc ff ff 41 54 b8 02 00 00 0
  RSP: 002b:00007ffd65032058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f4d143af992
  RDX: 0000000000000025 RSI: 00007f4d143f3bcc RDI: 0000000000000005
  RBP: 00007f4d143f2b28 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4d143f3bcc
  R13: 0000000000000005 R14: 0000000000000000 R15: 00007ffd650323f0
   </TASK>

Fixes: 0b2c597 ("l2tp: close all race conditions in l2tp_tunnel_register()")
Suggested-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+6acef9e0a4d1f46c83d4@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6acef9e0a4d1f46c83d4
CC: gnault@redhat.com
CC: cong.wang@bytedance.com
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Link: https://patch.msgid.link/20240806160626.1248317-1-jchapman@katalix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ojeda pushed a commit that referenced this pull request Aug 18, 2024
We might run into a SIE validity if gisa has been disabled either via using
kernel parameter "kvm.use_gisa=0" or by setting the related sysfs
attribute to N (echo N >/sys/module/kvm/parameters/use_gisa).

The validity is caused by an invalid value in the SIE control block's
gisa designation. That happens because we pass the uninitialized gisa
origin to virt_to_phys() before writing it to the gisa designation.

To fix this we return 0 in kvm_s390_get_gisa_desc() if the origin is 0.
kvm_s390_get_gisa_desc() is used to determine which gisa designation to
set in the SIE control block. A value of 0 in the gisa designation disables
gisa usage.

The issue surfaces in the host kernel with the following kernel message as
soon a new kvm guest start is attemted.

kvm: unhandled validity intercept 0x1011
WARNING: CPU: 0 PID: 781237 at arch/s390/kvm/intercept.c:101 kvm_handle_sie_intercept+0x42e/0x4d0 [kvm]
Modules linked in: vhost_net tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat x_tables nf_nat_tftp nf_conntrack_tftp vfio_pci_core irqbypass vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables sunrpc mlx5_ib ib_uverbs ib_core mlx5_core uvdevice s390_trng eadm_sch vfio_ccw zcrypt_cex4 mdev vfio_iommu_type1 vfio sch_fq_codel drm i2c_core loop drm_panel_orientation_quirks configfs nfnetlink lcs ctcm fsm dm_service_time ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common dm_mirror dm_region_hash dm_log zfcp scsi_transport_fc scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt dm_multipath rng_core autofs4 [last unloaded: vfio_pci]
CPU: 0 PID: 781237 Comm: CPU 0/KVM Not tainted 6.10.0-08682-gcad9f11498ea #6
Hardware name: IBM 3931 A01 701 (LPAR)
Krnl PSW : 0704c00180000000 000003d93deb0122 (kvm_handle_sie_intercept+0x432/0x4d0 [kvm])
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: 000003d900000027 000003d900000023 0000000000000028 000002cd00000000
           000002d063a00900 00000359c6daf708 00000000000bebb5 0000000000001eff
           000002cfd82e9000 000002cfd80bc000 0000000000001011 000003d93deda412
           000003ff8962df98 000003d93de77ce0 000003d93deb011e 00000359c6daf960
Krnl Code: 000003d93deb0112: c020fffe7259	larl	%r2,000003d93de7e5c4
           000003d93deb0118: c0e53fa8beac	brasl	%r14,000003d9bd3c7e70
          #000003d93deb011e: af000000		mc	0,0
          >000003d93deb0122: a728ffea		lhi	%r2,-22
           000003d93deb0126: a7f4fe24		brc	15,000003d93deafd6e
           000003d93deb012a: 9101f0b0		tm	176(%r15),1
           000003d93deb012e: a774fe48		brc	7,000003d93deafdbe
           000003d93deb0132: 40a0f0ae		sth	%r10,174(%r15)
Call Trace:
 [<000003d93deb0122>] kvm_handle_sie_intercept+0x432/0x4d0 [kvm]
([<000003d93deb011e>] kvm_handle_sie_intercept+0x42e/0x4d0 [kvm])
 [<000003d93deacc10>] vcpu_post_run+0x1d0/0x3b0 [kvm]
 [<000003d93deaceda>] __vcpu_run+0xea/0x2d0 [kvm]
 [<000003d93dead9da>] kvm_arch_vcpu_ioctl_run+0x16a/0x430 [kvm]
 [<000003d93de93ee0>] kvm_vcpu_ioctl+0x190/0x7c0 [kvm]
 [<000003d9bd728b4e>] vfs_ioctl+0x2e/0x70
 [<000003d9bd72a092>] __s390x_sys_ioctl+0xc2/0xd0
 [<000003d9be0e9222>] __do_syscall+0x1f2/0x2e0
 [<000003d9be0f9a90>] system_call+0x70/0x98
Last Breaking-Event-Address:
 [<000003d9bd3c7f58>] __warn_printk+0xe8/0xf0

Cc: stable@vger.kernel.org
Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Fixes: fe0ef00 ("KVM: s390: sort out physical vs virtual pointers usage")
Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20240801123109.2782155-1-mimu@linux.ibm.com
Message-ID: <20240801123109.2782155-1-mimu@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
ojeda pushed a commit that referenced this pull request Aug 18, 2024
Lockdep reported a warning in Linux version 6.6:

[  414.344659] ================================
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.345155] WARNING: inconsistent lock state
[  414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[  414.346221] --------------------------------
[  414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[  414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.351204] {IN-SOFTIRQ-W} state was registered at:
[  414.351751]   lock_acquire+0x18d/0x460
[  414.352218]   _raw_spin_lock_irqsave+0x39/0x60
[  414.352769]   __wake_up_common_lock+0x22/0x60
[  414.353289]   sbitmap_queue_wake_up+0x375/0x4f0
[  414.353829]   sbitmap_queue_clear+0xdd/0x270
[  414.354338]   blk_mq_put_tag+0xdf/0x170
[  414.354807]   __blk_mq_free_request+0x381/0x4d0
[  414.355335]   blk_mq_free_request+0x28b/0x3e0
[  414.355847]   __blk_mq_end_request+0x242/0xc30
[  414.356367]   scsi_end_request+0x2c1/0x830
[  414.356863]   scsi_io_completion+0x177/0x1610
[  414.357379]   scsi_complete+0x12f/0x260
[  414.357856]   blk_complete_reqs+0xba/0xf0
[  414.358338]   __do_softirq+0x1b0/0x7a2
[  414.358796]   irq_exit_rcu+0x14b/0x1a0
[  414.359262]   sysvec_call_function_single+0xaf/0xc0
[  414.359828]   asm_sysvec_call_function_single+0x1a/0x20
[  414.360426]   default_idle+0x1e/0x30
[  414.360873]   default_idle_call+0x9b/0x1f0
[  414.361390]   do_idle+0x2d2/0x3e0
[  414.361819]   cpu_startup_entry+0x55/0x60
[  414.362314]   start_secondary+0x235/0x2b0
[  414.362809]   secondary_startup_64_no_verify+0x18f/0x19b
[  414.363413] irq event stamp: 428794
[  414.363825] hardirqs last  enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
[  414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
[  414.365629] softirqs last  enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
[  414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
[  414.367425]
               other info that might help us debug this:
[  414.368194]  Possible unsafe locking scenario:
[  414.368900]        CPU0
[  414.369225]        ----
[  414.369548]   lock(&sbq->ws[i].wait);
[  414.370000]   <Interrupt>
[  414.370342]     lock(&sbq->ws[i].wait);
[  414.370802]
                *** DEADLOCK ***
[  414.371569] 5 locks held by kworker/u10:3/1152:
[  414.372088]  #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
[  414.373180]  #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
[  414.374384]  #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
[  414.375342]  #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[  414.376377]  #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
[  414.378607]
               stack backtrace:
[  414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
[  414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  414.381177] Workqueue: writeback wb_workfn (flush-253:0)
[  414.381805] Call Trace:
[  414.382136]  <TASK>
[  414.382429]  dump_stack_lvl+0x91/0xf0
[  414.382884]  mark_lock_irq+0xb3b/0x1260
[  414.383367]  ? __pfx_mark_lock_irq+0x10/0x10
[  414.383889]  ? stack_trace_save+0x8e/0xc0
[  414.384373]  ? __pfx_stack_trace_save+0x10/0x10
[  414.384903]  ? graph_lock+0xcf/0x410
[  414.385350]  ? save_trace+0x3d/0xc70
[  414.385808]  mark_lock.part.20+0x56d/0xa90
[  414.386317]  mark_held_locks+0xb0/0x110
[  414.386791]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.387320]  lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.387901]  ? _raw_spin_unlock_irq+0x28/0x50
[  414.388422]  trace_hardirqs_on+0x58/0x100
[  414.388917]  _raw_spin_unlock_irq+0x28/0x50
[  414.389422]  __blk_mq_tag_busy+0x1d6/0x2a0
[  414.389920]  __blk_mq_get_driver_tag+0x761/0x9f0
[  414.390899]  blk_mq_dispatch_rq_list+0x1780/0x1ee0
[  414.391473]  ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
[  414.392070]  ? sbitmap_get+0x2b8/0x450
[  414.392533]  ? __blk_mq_get_driver_tag+0x210/0x9f0
[  414.393095]  __blk_mq_sched_dispatch_requests+0xd99/0x1690
[  414.393730]  ? elv_attempt_insert_merge+0x1b1/0x420
[  414.394302]  ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
[  414.394970]  ? lock_acquire+0x18d/0x460
[  414.395456]  ? blk_mq_run_hw_queue+0x637/0xa00
[  414.395986]  ? __pfx_lock_acquire+0x10/0x10
[  414.396499]  blk_mq_sched_dispatch_requests+0x109/0x190
[  414.397100]  blk_mq_run_hw_queue+0x66e/0xa00
[  414.397616]  blk_mq_flush_plug_list.part.17+0x614/0x2030
[  414.398244]  ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
[  414.398897]  ? writeback_sb_inodes+0x241/0xcc0
[  414.399429]  blk_mq_flush_plug_list+0x65/0x80
[  414.399957]  __blk_flush_plug+0x2f1/0x530
[  414.400458]  ? __pfx___blk_flush_plug+0x10/0x10
[  414.400999]  blk_finish_plug+0x59/0xa0
[  414.401467]  wb_writeback+0x7cc/0x920
[  414.401935]  ? __pfx_wb_writeback+0x10/0x10
[  414.402442]  ? mark_held_locks+0xb0/0x110
[  414.402931]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  414.403462]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.404062]  wb_workfn+0x2b3/0xcf0
[  414.404500]  ? __pfx_wb_workfn+0x10/0x10
[  414.404989]  process_scheduled_works+0x432/0x13f0
[  414.405546]  ? __pfx_process_scheduled_works+0x10/0x10
[  414.406139]  ? do_raw_spin_lock+0x101/0x2a0
[  414.406641]  ? assign_work+0x19b/0x240
[  414.407106]  ? lock_is_held_type+0x9d/0x110
[  414.407604]  worker_thread+0x6f2/0x1160
[  414.408075]  ? __kthread_parkme+0x62/0x210
[  414.408572]  ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[  414.409168]  ? __kthread_parkme+0x13c/0x210
[  414.409678]  ? __pfx_worker_thread+0x10/0x10
[  414.410191]  kthread+0x33c/0x440
[  414.410602]  ? __pfx_kthread+0x10/0x10
[  414.411068]  ret_from_fork+0x4d/0x80
[  414.411526]  ? __pfx_kthread+0x10/0x10
[  414.411993]  ret_from_fork_asm+0x1b/0x30
[  414.412489]  </TASK>

When interrupt is turned on while a lock holding by spin_lock_irq it
throws a warning because of potential deadlock.

blk_mq_prep_dispatch_rq
 blk_mq_get_driver_tag
  __blk_mq_get_driver_tag
   __blk_mq_alloc_driver_tag
    blk_mq_tag_busy -> tag is already busy
    // failed to get driver tag
 blk_mq_mark_tag_wait
  spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
  __add_wait_queue(wq, wait) -> wait queue active
  blk_mq_get_driver_tag
  __blk_mq_tag_busy
-> 1) tag must be idle, which means there can't be inflight IO
   spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
   spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally
-> 2) context must be preempt by IO interrupt to trigger deadlock.

As shown above, the deadlock is not possible in theory, but the warning
still need to be fixed.

Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq.

Fixes: 4f1731d ("blk-mq: fix potential io hang by wrong 'wake_batch'")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240815024736.2040971-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ojeda pushed a commit that referenced this pull request Aug 18, 2024
Currently, migrate_pages_batch() can lock multiple locked folios with an
arbitrary order.  Although folio_trylock() is used to avoid deadlock as
commit 2ef7dbb ("migrate_pages: try migrate in batch asynchronously
firstly") mentioned, it seems try_split_folio() is still missing.

It was found by compaction stress test when I explicitly enable EROFS
compressed files to use large folios, which case I cannot reproduce with
the same workload if large folio support is off (current mainline). 
Typically, filesystem reads (with locked file-backed folios) could use
another bdev/meta inode to load some other I/Os (e.g.  inode extent
metadata or caching compressed data), so the locking order will be:

  file-backed folios  (A)
     bdev/meta folios (B)

The following calltrace shows the deadlock:
   Thread 1 takes (B) lock and tries to take folio (A) lock
   Thread 2 takes (A) lock and tries to take folio (B) lock

[Thread 1]
INFO: task stress:1824 blocked for more than 30 seconds.
      Tainted: G           OE      6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress          state:D stack:0     pid:1824  tgid:1824  ppid:1822   flags:0x0000000c
Call trace:
 __switch_to+0xec/0x138
 __schedule+0x43c/0xcb0
 schedule+0x54/0x198
 io_schedule+0x44/0x70
 folio_wait_bit_common+0x184/0x3f8
			<-- folio mapping ffff00036d69cb18 index 996  (**)
 __folio_lock+0x24/0x38
 migrate_pages_batch+0x77c/0xea0	// try_split_folio (mm/migrate.c:1486:2)
					// migrate_pages_batch (mm/migrate.c:1734:16)
		<--- LIST_HEAD(unmap_folios) has
			..
			folio mapping 0xffff0000d184f1d8 index 1711;   (*)
			folio mapping 0xffff0000d184f1d8 index 1712;
			..
 migrate_pages+0xb28/0xe90
 compact_zone+0xa08/0x10f0
 compact_node+0x9c/0x180
 sysctl_compaction_handler+0x8c/0x118
 proc_sys_call_handler+0x1a8/0x280
 proc_sys_write+0x1c/0x30
 vfs_write+0x240/0x380
 ksys_write+0x78/0x118
 __arm64_sys_write+0x24/0x38
 invoke_syscall+0x78/0x108
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x148
 el0t_64_sync_handler+0x100/0x130
 el0t_64_sync+0x190/0x198

[Thread 2]
INFO: task stress:1825 blocked for more than 30 seconds.
      Tainted: G           OE      6.10.0-rc7+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:stress          state:D stack:0     pid:1825  tgid:1825  ppid:1822   flags:0x0000000c
Call trace:
 __switch_to+0xec/0x138
 __schedule+0x43c/0xcb0
 schedule+0x54/0x198
 io_schedule+0x44/0x70
 folio_wait_bit_common+0x184/0x3f8
			<-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*)
 __folio_lock+0x24/0x38
 z_erofs_runqueue+0x384/0x9c0 [erofs]
 z_erofs_readahead+0x21c/0x350 [erofs]       <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**)
 read_pages+0x74/0x328
 page_cache_ra_order+0x26c/0x348
 ondemand_readahead+0x1c0/0x3a0
 page_cache_sync_ra+0x9c/0xc0
 filemap_get_pages+0xc4/0x708
 filemap_read+0x104/0x3a8
 generic_file_read_iter+0x4c/0x150
 vfs_read+0x27c/0x330
 ksys_pread64+0x84/0xd0
 __arm64_sys_pread64+0x28/0x40
 invoke_syscall+0x78/0x108
 el0_svc_common.constprop.0+0x48/0xf0
 do_el0_svc+0x24/0x38
 el0_svc+0x3c/0x148
 el0t_64_sync_handler+0x100/0x130
 el0t_64_sync+0x190/0x198

Link: https://lkml.kernel.org/r/20240729021306.398286-1-hsiangkao@linux.alibaba.com
Fixes: 5dfab10 ("migrate_pages: batch _unmap and _move")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Darksonn pushed a commit to Darksonn/linux that referenced this pull request Sep 23, 2024
attempts to retrofit memory safety onto C are increasingly annoying

------------[ cut here ]------------
memcpy: detected field-spanning write (size 4) of single field "&k.replicas" at fs/bcachefs/replicas.c:454 (size 3)
WARNING: CPU: 5 PID: 6525 at fs/bcachefs/replicas.c:454 bch2_replicas_gc2+0x2cb/0x400 [bcachefs]
bch2_replicas_gc2+0x2cb/0x400:
bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3)
Modules linked in: dm_mod tun nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay msr sctp bcachefs lz4hc_compress lz4_compress libcrc32c xor raid6_pq lz4_decompress pps_ldisc pps_core wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel curve25519_x86_64 libcurve25519_generic libchacha sit tunnel4 ip_tunnel af_packet bridge stp llc ip6table_nat ip6table_filter ip6_tables xt_MASQUERADE xt_conntrack iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables tcp_bbr sch_fq_codel efivarfs nls_iso8859_1 nls_cp437 vfat fat cdc_mbim cdc_wdm cdc_ncm cdc_ether usbnet r8152 input_leds joydev mii amdgpu mousedev hid_generic usbhid hid ath10k_pci amd_atl edac_mce_amd ath10k_core kvm_amd ath kvm mac80211 bfq crc32_pclmul crc32c_intel polyval_clmulni polyval_generic sha512_ssse3 sha256_ssse3 sha1_ssse3 snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg i2c_algo_bit drm_exec snd_hda_codec r8169 drm_suballoc_helper
aesni_intel gf128mul crypto_simd amdxcp realtek mfd_core tpm_crb drm_buddy snd_hwdep mdio_devres libarc4 cryptd tpm_tis wmi_bmof cfg80211 evdev libphy snd_hda_core tpm_tis_core gpu_sched rapl xhci_pci xhci_hcd snd_pcm drm_display_helper snd_timer tpm sp5100_tco rfkill efi_pstore mpt3sas drm_ttm_helper ahci usbcore libaescfb ccp snd ttm 8250 libahci watchdog soundcore raid_class sha1_generic acpi_cpufreq k10temp 8250_base usb_common scsi_transport_sas i2c_piix4 hwmon video serial_mctrl_gpio serial_base ecdh_generic wmi rtc_cmos backlight ecc gpio_amdpt rng_core gpio_generic button
CPU: 5 UID: 0 PID: 6525 Comm: bcachefs Tainted: G        W          6.11.0-rc1-ojab-00058-g224bc118aec9 Rust-for-Linux#6 6d5debde398d2a84851f42ab300dae32c2992027
Tainted: [W]=WARN
RIP: 0010:bch2_replicas_gc2+0x2cb/0x400 [bcachefs]
Code: c7 c2 60 91 d1 c1 48 89 c6 48 c7 c7 98 91 d1 c1 4c 89 14 24 44 89 5c 24 08 48 89 44 24 20 c6 05 fa 68 04 00 01 e8 05 a3 40 e4 <0f> 0b 4c 8b 14 24 44 8b 5c 24 08 48 8b 44 24 20 e9 55 fe ff ff 8b
RSP: 0018:ffffb434c9263d60 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9a8efa79cc00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffb434c9263de0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: ffff9a8efa73c300 R14: ffff9a8d9e880000 R15: ffff9a8d9e8806f8
FS:  0000000000000000(0000) GS:ffff9a9410c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000565423373090 CR3: 0000000164e30000 CR4: 00000000003506f0
Call Trace:
<TASK>
? __warn+0x97/0x150
? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3]
bch2_replicas_gc2+0x2cb/0x400:
bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3)
? report_bug+0x196/0x1c0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x17/0x80
? __wake_up_klogd.part.0+0x4c/0x80
? asm_exc_invalid_op+0x16/0x20
? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3]
bch2_replicas_gc2+0x2cb/0x400:
bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3)
? bch2_dev_usage_read+0xa0/0xa0 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3]
bch2_dev_usage_read+0xa0/0xa0:
discard_in_flight_remove at /home/ojab/src/bcachefs/fs/bcachefs/alloc_background.c:1712

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fbq pushed a commit that referenced this pull request Sep 30, 2024
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
on x86 due to a chain of locks and SRCU synchronizations.  Translating the
below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
fairness of r/w semaphores).

    CPU0                     CPU1                     CPU2
1   lock(&kvm->slots_lock);
2                                                     lock(&vcpu->mutex);
3                                                     lock(&kvm->srcu);
4                            lock(cpu_hotplug_lock);
5                            lock(kvm_lock);
6                            lock(&kvm->slots_lock);
7                                                     lock(cpu_hotplug_lock);
8   sync(&kvm->srcu);

Note, there are likely more potential deadlocks in KVM x86, e.g. the same
pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
__kvmclock_cpufreq_notifier():

  cpuhp_cpufreq_online()
  |
  -> cpufreq_online()
     |
     -> cpufreq_gov_performance_limits()
        |
        -> __cpufreq_driver_target()
           |
           -> __target_index()
              |
              -> cpufreq_freq_transition_begin()
                 |
                 -> cpufreq_notify_transition()
                    |
                    -> ... __kvmclock_cpufreq_notifier()

But, actually triggering such deadlocks is beyond rare due to the
combination of dependencies and timings involved.  E.g. the cpufreq
notifier is only used on older CPUs without a constant TSC, mucking with
the NX hugepage mitigation while VMs are running is very uncommon, and
doing so while also onlining/offlining a CPU (necessary to generate
contention on cpu_hotplug_lock) would be even more unusual.

The most robust solution to the general cpu_hotplug_lock issue is likely
to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
notifier doesn't to take kvm_lock.  For now, settle for fixing the most
blatant deadlock, as switching to an RCU-protected list is a much more
involved change, but add a comment in locking.rst to call out that care
needs to be taken when walking holding kvm_lock and walking vm_list.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
  ------------------------------------------------------
  tee/35048 is trying to acquire lock:
  ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]

  but task is already holding lock:
  ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]

  which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

  -> #3 (kvm_lock){+.+.}-{3:3}:
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         kvm_dev_ioctl+0x4fb/0xe50 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #2 (cpu_hotplug_lock){++++}-{0:0}:
         cpus_read_lock+0x2e/0xb0
         static_key_slow_inc+0x16/0x30
         kvm_lapic_set_base+0x6a/0x1c0 [kvm]
         kvm_set_apic_base+0x8f/0xe0 [kvm]
         kvm_set_msr_common+0x9ae/0xf80 [kvm]
         vmx_set_msr+0xa54/0xbe0 [kvm_intel]
         __kvm_set_msr+0xb6/0x1a0 [kvm]
         kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
         kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #1 (&kvm->srcu){.+.+}-{0:0}:
         __synchronize_srcu+0x44/0x1a0
         synchronize_srcu_expedited+0x21/0x30
         kvm_swap_active_memslots+0x110/0x1c0 [kvm]
         kvm_set_memslot+0x360/0x620 [kvm]
         __kvm_set_memory_region+0x27b/0x300 [kvm]
         kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
         kvm_vm_ioctl+0x295/0x650 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #0 (&kvm->slots_lock){+.+.}-{3:3}:
         __lock_acquire+0x15ef/0x2e30
         lock_acquire+0xe0/0x260
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         set_nx_huge_pages+0x179/0x1e0 [kvm]
         param_attr_store+0x93/0x100
         module_attr_store+0x22/0x40
         sysfs_kf_write+0x81/0xb0
         kernfs_fop_write_iter+0x133/0x1d0
         vfs_write+0x28d/0x380
         ksys_write+0x70/0xe0
         __x64_sys_write+0x1f/0x30
         x64_sys_call+0x281b/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: Chao Gao <chao.gao@intel.com>
Fixes: 0bf5049 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Cc: stable@vger.kernel.org
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
shannmu pushed a commit to shannmu/Rust-For-Linux that referenced this pull request Oct 5, 2024
Ido Schimmel says:

====================
net: fib_rules: Add DSCP selector support

Currently, the kernel rejects IPv4 FIB rules that try to match on the
upper three DSCP bits:

 # ip -4 rule add tos 0x1c table 100
 # ip -4 rule add tos 0x3c table 100
 Error: Invalid tos.

The reason for that is that historically users of the FIB lookup API
only populated the lower three DSCP bits in the TOS field of the IPv4
flow key ('flowi4_tos'), which fits the TOS definition from the initial
IPv4 specification (RFC 791).

This is not very useful nowadays and instead some users want to be able
to match on the six bits DSCP field, which replaced the TOS and IP
precedence fields over 25 years ago (RFC 2474). In addition, the current
behavior differs between IPv4 and IPv6 which does allow users to match
on the entire DSCP field using the TOS selector.

Recent patchsets made sure that callers of the FIB lookup API now
populate the entire DSCP field in the IPv4 flow key. Therefore, it is
now possible to extend FIB rules to match on DSCP.

This is done by adding a new DSCP attribute which is implemented for
both IPv4 and IPv6 to provide user space programs a consistent behavior
between both address families.

The behavior of the old TOS selector is unchanged and IPv4 FIB rules
using it will only match on the lower three DSCP bits. The kernel will
reject rules that try to use both selectors.

Patch Rust-for-Linux#1 adds the new DSCP attribute but rejects its usage.

Patches Rust-for-Linux#2-Rust-for-Linux#3 implement IPv4 and IPv6 support.

Patch Rust-for-Linux#4 allows user space to use the new attribute.

Patches Rust-for-Linux#5-Rust-for-Linux#6 add selftests.
====================

Link: https://patch.msgid.link/20240911093748.3662015-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ojeda pushed a commit that referenced this pull request Oct 21, 2024
On the node of an NFS client, some files saved in the mountpoint of the
NFS server were copied to another location of the same NFS server.
Accidentally, the nfs42_complete_copies() got a NULL-pointer dereference
crash with the following syslog:

[232064.838881] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232064.839360] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232066.588183] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[232066.588586] Mem abort info:
[232066.588701]   ESR = 0x0000000096000007
[232066.588862]   EC = 0x25: DABT (current EL), IL = 32 bits
[232066.589084]   SET = 0, FnV = 0
[232066.589216]   EA = 0, S1PTW = 0
[232066.589340]   FSC = 0x07: level 3 translation fault
[232066.589559] Data abort info:
[232066.589683]   ISV = 0, ISS = 0x00000007
[232066.589842]   CM = 0, WnR = 0
[232066.589967] user pgtable: 64k pages, 48-bit VAs, pgdp=00002000956ff400
[232066.590231] [0000000000000058] pgd=08001100ae100003, p4d=08001100ae100003, pud=08001100ae100003, pmd=08001100b3c00003, pte=0000000000000000
[232066.590757] Internal error: Oops: 96000007 [#1] SMP
[232066.590958] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm vhost_net vhost vhost_iotlb tap tun ipt_rpfilter xt_multiport ip_set_hash_ip ip_set_hash_net xfrm_interface xfrm6_tunnel tunnel4 tunnel6 esp4 ah4 wireguard libcurve25519_generic veth xt_addrtype xt_set nf_conntrack_netlink ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_filter sch_ingress nfnetlink_cttimeout vport_gre ip_gre ip_tunnel gre vport_geneve geneve vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conncount dm_round_robin dm_service_time dm_multipath xt_nat xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink ocfs2 ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif nbd overlay 8021q garp mrp bonding tls rfkill sunrpc ext4 mbcache jbd2
[232066.591052]  vfat fat cas_cache cas_disk ses enclosure scsi_transport_sas sg acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ip_tables vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse xfs libcrc32c ast drm_vram_helper qla2xxx drm_kms_helper syscopyarea crct10dif_ce sysfillrect ghash_ce sysimgblt sha2_ce fb_sys_fops cec sha256_arm64 sha1_ce drm_ttm_helper ttm nvme_fc igb sbsa_gwdt nvme_fabrics drm nvme_core i2c_algo_bit i40e scsi_transport_fc megaraid_sas aes_neon_bs
[232066.596953] CPU: 6 PID: 4124696 Comm: 10.253.166.125- Kdump: loaded Not tainted 5.15.131-9.cl9_ocfs2.aarch64 #1
[232066.597356] Hardware name: Great Wall .\x93\x8e...RF6260 V5/GWMSSE2GL1T, BIOS T656FBE_V3.0.18 2024-01-06
[232066.597721] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[232066.598034] pc : nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.598327] lr : nfs4_reclaim_open_state+0x12c/0x800 [nfsv4]
[232066.598595] sp : ffff8000f568fc70
[232066.598731] x29: ffff8000f568fc70 x28: 0000000000001000 x27: ffff21003db33000
[232066.599030] x26: ffff800005521ae0 x25: ffff0100f98fa3f0 x24: 0000000000000001
[232066.599319] x23: ffff800009920008 x22: ffff21003db33040 x21: ffff21003db33050
[232066.599628] x20: ffff410172fe9e40 x19: ffff410172fe9e00 x18: 0000000000000000
[232066.599914] x17: 0000000000000000 x16: 0000000000000004 x15: 0000000000000000
[232066.600195] x14: 0000000000000000 x13: ffff800008e685a8 x12: 00000000eac0c6e6
[232066.600498] x11: 0000000000000000 x10: 0000000000000008 x9 : ffff8000054e5828
[232066.600784] x8 : 00000000ffffffbf x7 : 0000000000000001 x6 : 000000000a9eb14a
[232066.601062] x5 : 0000000000000000 x4 : ffff70ff8a14a800 x3 : 0000000000000058
[232066.601348] x2 : 0000000000000001 x1 : 54dce46366daa6c6 x0 : 0000000000000000
[232066.601636] Call trace:
[232066.601749]  nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.601998]  nfs4_do_reclaim+0x1b8/0x28c [nfsv4]
[232066.602218]  nfs4_state_manager+0x928/0x10f0 [nfsv4]
[232066.602455]  nfs4_run_state_manager+0x78/0x1b0 [nfsv4]
[232066.602690]  kthread+0x110/0x114
[232066.602830]  ret_from_fork+0x10/0x20
[232066.602985] Code: 1400000d f9403f20 f9402e61 91016003 (f9402c00)
[232066.603284] SMP: stopping secondary CPUs
[232066.606936] Starting crashdump kernel...
[232066.607146] Bye!

Analysing the vmcore, we know that nfs4_copy_state listed by destination
nfs_server->ss_copies was added by the field copies in handle_async_copy(),
and we found a waiting copy process with the stack as:
PID: 3511963  TASK: ffff710028b47e00  CPU: 0   COMMAND: "cp"
 #0 [ffff8001116ef740] __switch_to at ffff8000081b92f4
 #1 [ffff8001116ef760] __schedule at ffff800008dd0650
 #2 [ffff8001116ef7c0] schedule at ffff800008dd0a00
 #3 [ffff8001116ef7e0] schedule_timeout at ffff800008dd6aa0
 #4 [ffff8001116ef860] __wait_for_common at ffff800008dd166c
 #5 [ffff8001116ef8e0] wait_for_completion_interruptible at ffff800008dd1898
 #6 [ffff8001116ef8f0] handle_async_copy at ffff8000055142f4 [nfsv4]
 #7 [ffff8001116ef970] _nfs42_proc_copy at ffff8000055147c8 [nfsv4]
 #8 [ffff8001116efa80] nfs42_proc_copy at ffff800005514cf0 [nfsv4]
 #9 [ffff8001116efc50] __nfs4_copy_file_range.constprop.0 at ffff8000054ed694 [nfsv4]

The NULL-pointer dereference was due to nfs42_complete_copies() listed
the nfs_server->ss_copies by the field ss_copies of nfs4_copy_state.
So the nfs4_copy_state address ffff0100f98fa3f0 was offset by 0x10 and
the data accessed through this pointer was also incorrect. Generally,
the ordered list nfs4_state_owner->so_states indicate open(O_RDWR) or
open(O_WRITE) states are reclaimed firstly by nfs4_reclaim_open_state().
When destination state reclaim is failed with NFS_STATE_RECOVERY_FAILED
and copies are not deleted in nfs_server->ss_copies, the source state
may be passed to the nfs42_complete_copies() process earlier, resulting
in this crash scene finally. To solve this issue, we add a list_head
nfs_server->ss_src_copies for a server-to-server copy specially.

Fixes: 0e65a32 ("NFS: handle source server reboot")
Signed-off-by: Yanjun Zhang <zhangyanjun@cestc.cn>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
ojeda pushed a commit that referenced this pull request Oct 21, 2024
…ation

When testing the XDP_REDIRECT function on the LS1028A platform, we
found a very reproducible issue that the Tx frames can no longer be
sent out even if XDP_REDIRECT is turned off. Specifically, if there
is a lot of traffic on Rx direction, when XDP_REDIRECT is turned on,
the console may display some warnings like "timeout for tx ring #6
clear", and all redirected frames will be dropped, the detailed log
is as follows.

root@ls1028ardb:~# ./xdp-bench redirect eno0 eno2
Redirecting from eno0 (ifindex 3; driver fsl_enetc) to eno2 (ifindex 4; driver fsl_enetc)
[203.849809] fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #5 clear
[204.006051] fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #6 clear
[204.161944] fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #7 clear
eno0->eno2     1420505 rx/s       1420590 err,drop/s      0 xmit/s
  xmit eno0->eno2    0 xmit/s     1420590 drop/s     0 drv_err/s     15.71 bulk-avg
eno0->eno2     1420484 rx/s       1420485 err,drop/s      0 xmit/s
  xmit eno0->eno2    0 xmit/s     1420485 drop/s     0 drv_err/s     15.71 bulk-avg

By analyzing the XDP_REDIRECT implementation of enetc driver, the
driver will reconfigure Tx and Rx BD rings when a bpf program is
installed or uninstalled, but there is no mechanisms to block the
redirected frames when enetc driver reconfigures rings. Similarly,
XDP_TX verdicts on received frames can also lead to frames being
enqueued in the Tx rings. Because XDP ignores the state set by the
netif_tx_wake_queue() API, so introduce the ENETC_TX_DOWN flag to
suppress transmission of XDP frames.

Fixes: c33bfaf ("net: enetc: set up XDP program under enetc_reconfigure()")
Cc: stable@vger.kernel.org
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241010092056.298128-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ojeda pushed a commit that referenced this pull request Oct 21, 2024
The Tx BD rings are disabled first in enetc_stop() and the driver
waits for them to become empty. This operation is not safe while
the ring is actively transmitting frames, and will cause the ring
to not be empty and hardware exception. As described in the NETC
block guide, software should only disable an active Tx ring after
all pending ring entries have been consumed (i.e. when PI = CI).
Disabling a transmit ring that is actively processing BDs risks
a HW-SW race hazard whereby a hardware resource becomes assigned
to work on one or more ring entries only to have those entries be
removed due to the ring becoming disabled.

When testing XDP_REDIRECT feautre, although all frames were blocked
from being put into Tx rings during ring reconfiguration, the similar
warning log was still encountered:

fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #6 clear
fsl_enetc 0000:00:00.2 eno2: timeout for tx ring #7 clear

The reason is that when there are still unsent frames in the Tx ring,
disabling the Tx ring causes the remaining frames to be unable to be
sent out. And the Tx ring cannot be restored, which means that even
if the xdp program is uninstalled, the Tx frames cannot be sent out
anymore. Therefore, correct the operation order in enect_start() and
enect_stop().

Fixes: ff58fda ("net: enetc: prioritize ability to go down over packet processing")
Cc: stable@vger.kernel.org
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241010092056.298128-4-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ojeda pushed a commit that referenced this pull request Oct 21, 2024
Syzkaller reported a lockdep splat:

  ============================================
  WARNING: possible recursive locking detected
  6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
  --------------------------------------------
  syz-executor364/5113 is trying to acquire lock:
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  but task is already holding lock:
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(k-slock-AF_INET);
    lock(k-slock-AF_INET);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  7 locks held by syz-executor364/5113:
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  stack backtrace:
  CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  Call Trace:
   <IRQ>
   __dump_stack lib/dump_stack.c:93 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
   check_deadlock kernel/locking/lockdep.c:3061 [inline]
   validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
   __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:351 [inline]
   sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
   mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
   subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
   tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
   tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
   ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
   __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
   process_backlog+0x662/0x15b0 net/core/dev.c:6108
   __napi_poll+0xcb/0x490 net/core/dev.c:6772
   napi_poll net/core/dev.c:6841 [inline]
   net_rx_action+0x89b/0x1240 net/core/dev.c:6963
   handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
   do_softirq+0x11b/0x1e0 kernel/softirq.c:455
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
   local_bh_enable include/linux/bottom_half.h:33 [inline]
   rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
   __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
   dev_queue_xmit include/linux/netdevice.h:3105 [inline]
   neigh_hh_output include/net/neighbour.h:526 [inline]
   neigh_output include/net/neighbour.h:540 [inline]
   ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
   ip_local_out net/ipv4/ip_output.c:129 [inline]
   __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
   __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
   tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
   tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
   tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
   sk_backlog_rcv include/net/sock.h:1111 [inline]
   __release_sock+0x214/0x350 net/core/sock.c:3004
   release_sock+0x61/0x1f0 net/core/sock.c:3558
   mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
   mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x1a6/0x270 net/socket.c:745
   ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
   ___sys_sendmsg net/socket.c:2651 [inline]
   __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
   __do_sys_sendmmsg net/socket.c:2766 [inline]
   __se_sys_sendmmsg net/socket.c:2763 [inline]
   __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f04fb13a6b9
  Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
  RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
  RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
  R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
   </TASK>

As noted by Cong Wang, the splat is false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port based signal endpoint.

Such connection will be never accepted; many of them can make the
listener queue full and preventing the creation of MPJ subflow via
such listener - its intended role.

Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Cc: stable@vger.kernel.org
Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241014-net-mptcp-mpc-port-endp-v2-1-7faea8e6b6ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants