Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] master from torvalds:master #166

Merged
merged 88 commits into from
Aug 23, 2020
Merged

[pull] master from torvalds:master #166

merged 88 commits into from
Aug 23, 2020

Conversation

pull[bot]
Copy link

@pull pull bot commented Aug 23, 2020

See Commits and Changes for more details.


Created by pull[bot]. Want to support this open source service? Please star it : )

vaibhavk2 and others added 30 commits August 15, 2020 20:24
Linux only has support to read total DDR reads and writes. Here we
add support to enable bandwidth breakdown-GT, IA and IO. Breakdown
of BW is important to debug and optimize memory access. This can also
be used for telemetry and improving the system software.The offsets for
GT, IA and IO are added and these free running counters can be accessed
via MMIO space.

The BW breakdown can be measured using the following cmd:

  perf stat -e uncore_imc/gt_requests/,uncore_imc/ia_requests/,uncore_imc/io_requests/

             30.57 MiB  uncore_imc/gt_requests/
           1346.13 MiB  uncore_imc/ia_requests/
            190.97 MiB  uncore_imc/io_requests/

       5.984572733 seconds time elapsed

     BW/s = <gt,ia,io>_requests/time elapsed

Signed-off-by: Vaibhav Shankar <vaibhav.shankar@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200814022234.23605-1-vaibhav.shankar@intel.com
On POWER10 bit 12 in the PVR indicates if the core is SMT4 or SMT8.
Bit 12 is set for SMT4.

Without this patch, /proc/cpuinfo on a SMT4 DD1 POWER10 looks like
this:
  cpu             : POWER10, altivec supported
  revision        : 17.0 (pvr 0080 1100)

Fixes: a3ea40d ("powerpc: Add POWER10 architected mode")
Cc: stable@vger.kernel.org # v5.8
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200803035600.1820371-1-mikey@neuling.org
Add support for perf extended register capability in powerpc. The
capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to indicate the
PMU which support extended registers. The generic code define the mask
of extended registers as 0 for non supported architectures.

Patch adds extended regs support for power9 platform by exposing
MMCR0, MMCR1 and MMCR2 registers.

REG_RESERVED mask needs update to include extended regs.
PERF_REG_EXTENDED_MASK, contains mask value of the supported
registers, is defined at runtime in the kernel based on platform since
the supported registers may differ from one processor version to
another and hence the MASK value.

With the patch:

  available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
  r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26
  r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe
  trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2

  PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
  ... intr regs: mask 0xffffffffffff ABI 64-bit
  .... r0    0xc00000000012b77c
  .... r1    0xc000003fe5e03930
  .... r2    0xc000000001b0e000
  .... r3    0xc000003fdcddf800
  .... r4    0xc000003fc7880000
  .... r5    0x9c422724be
  .... r6    0xc000003fe5e03908
  .... r7    0xffffff63bddc8706
  .... r8    0x9e4
  .... r9    0x0
  .... r10   0x1
  .... r11   0x0
  .... r12   0xc0000000001299c0
  .... r13   0xc000003ffffc4800
  .... r14   0x0
  .... r15   0x7fffdd8b8b00
  .... r16   0x0
  .... r17   0x7fffdd8be6b8
  .... r18   0x7e7076607730
  .... r19   0x2f
  .... r20   0xc00000001fc26c68
  .... r21   0xc0002041e4227e00
  .... r22   0xc00000002018fb60
  .... r23   0x1
  .... r24   0xc000003ffec4d900
  .... r25   0x80000000
  .... r26   0x0
  .... r27   0x1
  .... r28   0x1
  .... r29   0xc000000001be1260
  .... r30   0x6008010
  .... r31   0xc000003ffebb7218
  .... nip   0xc00000000012b910
  .... msr   0x9000000000009033
  .... orig_r3 0xc00000000012b86c
  .... ctr   0xc0000000001299c0
  .... link  0xc00000000012b77c
  .... xer   0x0
  .... ccr   0x28002222
  .... softe 0x1
  .... trap  0xf00
  .... dar   0x0
  .... dsisr 0x80000000000
  .... sier  0x0
  .... mmcra 0x80000000000
  .... mmcr0 0x82008090
  .... mmcr1 0x1e000000
  .... mmcr2 0x0
   ... thread: perf:4784

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Nageswara R Sastry <nasastry@in.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1596794701-23530-2-git-send-email-atrajeev@linux.vnet.ibm.com
Include capability flag PERF_PMU_CAP_EXTENDED_REGS for power10 and
expose MMCR3, SIER2, SIER3 registers as part of extended regs. Also
introduce PERF_REG_PMU_MASK_31 to define extended mask value at
runtime for power10.

Suggested-by: Ryan Grimm <grimm@linux.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Nageswara R Sastry <nasastry@in.ibm.com>
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1596794701-23530-3-git-send-email-atrajeev@linux.vnet.ibm.com
Add a raw mode cputable entry for POWER10. Copies most of the fields
from commit a3ea40d ("powerpc: Add POWER10 architected mode")
except for oprofile_cpu_type, machine_check_early, pvr_mask and
pvr_mask fields. On bare metal systems we use DT CPU features, which
doesn't need a cputable entry. But in VMs we still rely on the raw
cputable entry to set the correct values for the PMU related fields.

Fixes: a3ea40d ("powerpc: Add POWER10 architected mode")
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
[mpe: Reorder vs cleanup patch and add Fixes tag]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200817005618.3305028-2-maddy@linux.ibm.com
__machine_check_early_realmode_p*() are currently declared as extern
in cputable.c and because of this when compiled with "C=1" (which
enables semantic checker) produces these warnings.

    CHECK   arch/powerpc/kernel/mce_power.c
  arch/powerpc/kernel/mce_power.c:709:6: warning: symbol '__machine_check_early_realmode_p7' was not declared. Should it be static?
  arch/powerpc/kernel/mce_power.c:717:6: warning: symbol '__machine_check_early_realmode_p8' was not declared. Should it be static?
  arch/powerpc/kernel/mce_power.c:722:6: warning: symbol '__machine_check_early_realmode_p9' was not declared. Should it be static?
  arch/powerpc/kernel/mce_power.c:740:6: warning: symbol '__machine_check_early_realmode_p10' was not declared. Should it be static?

Patch here moves the declaration to asm/mce.h and includes the same in
cputable.c

Fixes: ae744f3 ("powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors on power8")
Fixes: 7b9f71f ("powerpc/64s: POWER9 machine check handler")
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200817005618.3305028-1-maddy@linux.ibm.com
IS_ENABLED() instead of #ifdef still requires variable declaration.
In this specific case, default_uamor is declared in asm/pkeys.h which
is only included if PPC_MEM_KEYS is enabled.

arch/powerpc/mm/book3s64/hash_utils.c: In function ‘hash__early_init_mmu_secondary’:
arch/powerpc/mm/book3s64/hash_utils.c:1119:21: error: ‘default_uamor’ undeclared (first use in this function)
 1119 |   mtspr(SPRN_UAMOR, default_uamor);
      |                     ^~~~~~~~~~~~~

Fixes: 6553fb7 ("powerpc/pkeys: Fix boot failures with Nemo board (A-EON AmigaOne X1000)")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200817103301.158836-1-aneesh.kumar@linux.ibm.com
Commit ("03fd42d458fb powerpc/fixmap: Fix FIX_EARLY_DEBUG_BASE when
page size is 256k") reworked the setup of the early debug area and
mistakenly replaced 128 * 1024 by SZ_128.

Change to SZ_128K to restore the original 128 kbytes size of the area.

Fixes: 03fd42d ("powerpc/fixmap: Fix FIX_EARLY_DEBUG_BASE when page size is 256k")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/996184974d674ff984643778cf1cdd7fe58cc065.1597644194.git.christophe.leroy@csgroup.eu
With latest `bpftool prog` command, we observed the following kernel
panic.
    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    PGD dfe894067 P4D dfe894067 PUD deb663067 PMD 0
    Oops: 0010 [#1] SMP
    CPU: 9 PID: 6023 ...
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0000:ffffc900002b8f18 EFLAGS: 00010286
    RAX: ffff8883a405f400 RBX: ffff888e46a6bf00 RCX: 000000008020000c
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8883a405f400
    RBP: ffff888e46a6bf50 R08: 0000000000000000 R09: ffffffff81129600
    R10: ffff8883a405f300 R11: 0000160000000000 R12: 0000000000002710
    R13: 000000e9494b690c R14: 0000000000000202 R15: 0000000000000009
    FS:  00007fd9187fe700(0000) GS:ffff888e46a40000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 0000000de5d33002 CR4: 0000000000360ee0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <IRQ>
     rcu_core+0x1a4/0x440
     __do_softirq+0xd3/0x2c8
     irq_exit+0x9d/0xa0
     smp_apic_timer_interrupt+0x68/0x120
     apic_timer_interrupt+0xf/0x20
     </IRQ>
    RIP: 0033:0x47ce80
    Code: Bad RIP value.
    RSP: 002b:00007fd9187fba40 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
    RAX: 0000000000000002 RBX: 00007fd931789160 RCX: 000000000000010c
    RDX: 00007fd9308cdfb4 RSI: 00007fd9308cdfb4 RDI: 00007ffedd1ea0a8
    RBP: 00007fd9187fbab0 R08: 000000000000000e R09: 000000000000002a
    R10: 0000000000480210 R11: 00007fd9187fc570 R12: 00007fd9316cc400
    R13: 0000000000000118 R14: 00007fd9308cdfb4 R15: 00007fd9317a9380

After further analysis, the bug is triggered by
Commit eaaacd2 ("bpf: Add task and task/file iterator targets")
which introduced task_file bpf iterator, which traverses all open file
descriptors for all tasks in the current namespace.
The latest `bpftool prog` calls a task_file bpf program to traverse
all files in the system in order to associate processes with progs/maps, etc.
When traversing files for a given task, rcu read_lock is taken to
access all files in a file_struct. But it used get_file() to grab
a file, which is not right. It is possible file->f_count is 0 and
get_file() will unconditionally increase it.
Later put_file() may cause all kind of issues with the above
as one of sympotoms.

The failure can be reproduced with the following steps in a few seconds:
    $ cat t.c
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define N 10000
    int fd[N];
    int main() {
      int i;

      for (i = 0; i < N; i++) {
        fd[i] = open("./note.txt", 'r');
        if (fd[i] < 0) {
           fprintf(stderr, "failed\n");
           return -1;
        }
      }
      for (i = 0; i < N; i++)
        close(fd[i]);

      return 0;
    }
    $ gcc -O2 t.c
    $ cat run.sh
    #/bin/bash
    for i in {1..100}
    do
      while true; do ./a.out; done &
    done
    $ ./run.sh
    $ while true; do bpftool prog >& /dev/null; done

This patch used get_file_rcu() which only grabs a file if the
file->f_count is not zero. This is to ensure the file pointer
is always valid. The above reproducer did not fail for more
than 30 minutes.

Fixes: eaaacd2 ("bpf: Add task and task/file iterator targets")
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/bpf/20200817174214.252601-1-yhs@fb.com
On BOOK3S_32, when we have modules and strict kernel RWX, modules
are not in vmalloc space but in a dedicated segment that is
below PAGE_OFFSET.

So KASAN_SHADOW_START must take it into account.

MODULES_VADDR can't be used because it is not defined yet
in kasan.h

Fixes: 6ca0553 ("powerpc/32s: Use dedicated segment for modules with STRICT_KERNEL_RWX")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6eddca2d5611fd57312a88eae31278c87a8fc99d.1596641224.git.christophe.leroy@csgroup.eu
When MODULES_VADDR is defined, is_module_segment() shall check the
address against it instead of checking agains VMALLOC_START.

Fixes: 6ca0553 ("powerpc/32s: Use dedicated segment for modules with STRICT_KERNEL_RWX")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/07884ed033c31e074747b7eb8eaa329d15db07ec.1596641219.git.christophe.leroy@csgroup.eu
For a power9 KVM guest with XIVE enabled, running a test loop
where we hotplug 384 vcpus and then unplug them, the following traces
can be seen (generally within a few loops) either from the unplugged
vcpu:

  cpu 65 (hwid 65) Ready to die...
  Querying DEAD? cpu 66 (66) shows 2
  list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
  ------------[ cut here ]------------
  kernel BUG at lib/list_debug.c:56!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 ...
  CPU: 66 PID: 0 Comm: swapper/66 Kdump: loaded Not tainted 4.18.0-221.el8.ppc64le #1
  NIP:  c0000000007ab50c LR: c0000000007ab508 CTR: 00000000000003ac
  REGS: c0000009e5a17840 TRAP: 0700   Not tainted  (4.18.0-221.el8.ppc64le)
  MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28000842  XER: 20040000
  ...
  NIP __list_del_entry_valid+0xac/0x100
  LR  __list_del_entry_valid+0xa8/0x100
  Call Trace:
    __list_del_entry_valid+0xa8/0x100 (unreliable)
    free_pcppages_bulk+0x1f8/0x940
    free_unref_page+0xd0/0x100
    xive_spapr_cleanup_queue+0x148/0x1b0
    xive_teardown_cpu+0x1bc/0x240
    pseries_mach_cpu_die+0x78/0x2f0
    cpu_die+0x48/0x70
    arch_cpu_idle_dead+0x20/0x40
    do_idle+0x2f4/0x4c0
    cpu_startup_entry+0x38/0x40
    start_secondary+0x7bc/0x8f0
    start_secondary_prolog+0x10/0x14

or on the worker thread handling the unplug:

  pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
  Querying DEAD? cpu 314 (314) shows 2
  BUG: Bad page state in process kworker/u768:3  pfn:95de1
  cpu 314 (hwid 314) Ready to die...
  page:c00a000002577840 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0
  flags: 0x5ffffc00000000()
  raw: 005ffffc00000000 5deadbeef0000100 5deadbeef0000200 0000000000000000
  raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000
  page dumped because: nonzero mapcount
  Modules linked in: kvm xt_CHECKSUM ipt_MASQUERADE xt_conntrack ...
  CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not tainted 4.18.0-224.el8.bz1856588.ppc64le #1
  Workqueue: pseries hotplug workque pseries_hp_work_fn
  Call Trace:
    dump_stack+0xb0/0xf4 (unreliable)
    bad_page+0x12c/0x1b0
    free_pcppages_bulk+0x5bc/0x940
    page_alloc_cpu_dead+0x118/0x120
    cpuhp_invoke_callback.constprop.5+0xb8/0x760
    _cpu_down+0x188/0x340
    cpu_down+0x5c/0xa0
    cpu_subsys_offline+0x24/0x40
    device_offline+0xf0/0x130
    dlpar_offline_cpu+0x1c4/0x2a0
    dlpar_cpu_remove+0xb8/0x190
    dlpar_cpu_remove_by_index+0x12c/0x150
    dlpar_cpu+0x94/0x800
    pseries_hp_work_fn+0x128/0x1e0
    process_one_work+0x304/0x5d0
    worker_thread+0xcc/0x7a0
    kthread+0x1ac/0x1c0
    ret_from_kernel_thread+0x5c/0x80

The latter trace is due to the following sequence:

  page_alloc_cpu_dead
    drain_pages
      drain_pages_zone
        free_pcppages_bulk

where drain_pages() in this case is called under the assumption that
the unplugged cpu is no longer executing. To ensure that is the case,
and early call is made to __cpu_die()->pseries_cpu_die(), which runs a
loop that waits for the cpu to reach a halted state by polling its
status via query-cpu-stopped-state RTAS calls. It only polls for 25
iterations before giving up, however, and in the trace above this
results in the following being printed only .1 seconds after the
hotplug worker thread begins processing the unplug request:

  pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
  Querying DEAD? cpu 314 (314) shows 2

At that point the worker thread assumes the unplugged CPU is in some
unknown/dead state and procedes with the cleanup, causing the race
with the XIVE cleanup code executed by the unplugged CPU.

Fix this by waiting indefinitely, but also making an effort to avoid
spurious lockup messages by allowing for rescheduling after polling
the CPU status and printing a warning if we wait for longer than 120s.

Fixes: eac1e73 ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Tested-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
[mpe: Trim oopses in change log slightly for readability]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200811161544.10513-1-mdroth@linux.vnet.ibm.com
IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
the stack as part of machine check delivery is valid or not.

Various drivers copied a code fragment that uses the RIPV bit to
determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
RIPV is set as "FATAL").

Reverse the tests so that the error is marked fatal when RIPV is not set.

Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com
On ppc64le we get the following warning:

  In file included from btf_dump.c:16:0:
  btf_dump.c: In function ‘btf_dump_emit_struct_def’:
  ../include/linux/kernel.h:20:17: error: comparison of distinct pointer types lacks a cast [-Werror]
    (void) (&_max1 == &_max2);  \
                   ^
  btf_dump.c:882:11: note: in expansion of macro ‘max’
      m_sz = max(0LL, btf__resolve_size(d->btf, m->type));
             ^~~

Fix by explicitly casting to __s64, which is a return type from
btf__resolve_size().

Fixes: 702eddc ("libbpf: Handle GCC built-in types for Arm NEON")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200818164456.1181661-1-andriin@fb.com
Remote source MAC addresses can be set on a 'source mode' macvlan
interface via the IFLA_MACVLAN_MACADDR_DATA attribute. This commit
tightens the validation of these MAC addresses to match the validation
already performed when setting or adding a single MAC address via the
IFLA_MACVLAN_MACADDR attribute.

iproute2 uses IFLA_MACVLAN_MACADDR_DATA for its 'macvlan macaddr set'
command, and IFLA_MACVLAN_MACADDR for its 'macvlan macaddr add' command,
which demonstrates the inconsistent behaviour that this commit
addresses:

 # ip link add link eth0 name macvlan0 type macvlan mode source
 # ip link set link dev macvlan0 type macvlan macaddr add 01:00:00:00:00:00
 RTNETLINK answers: Cannot assign requested address
 # ip link set link dev macvlan0 type macvlan macaddr set 01:00:00:00:00:00
 # ip -d link show macvlan0
 5: macvlan0@eth0: <BROADCAST,MULTICAST,DYNAMIC,UP,LOWER_UP> mtu 1500 ...
     link/ether 2e:ac:fd:2d:69:f8 brd ff:ff:ff:ff:ff:ff promiscuity 0
     macvlan mode source remotes (1) 01:00:00:00:00:00 numtxqueues 1 ...

With this change, the 'set' command will (rightly) fail in the same way
as the 'add' command.

Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Actually hook up the .rx_buf_hash_valid method in EF100's nic_type.

Fixes: 0688854 ("sfc: check hash is valid before using it")
Reported-by: Martin Habets <mhabets@solarflare.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When downing and upping the ef100 filter table, we need to take a write
 lock on efx->filter_sem, not just a read lock, because we may kfree()
 the table pointers.
Without this, resets cause a WARN_ON from efx_rwsem_assert_write_locked().

Fixes: a9dc3d5 ("sfc_ef100: RX filter table management and related gubbins")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an ef100_net_open() fails, ef100_net_stop() may be called without
 channel->rps_flow_id having been written; thus it may hold the address
 freed by a previous ef100_net_stop()'s call to efx_remove_filters().
 This then causes a double-free when efx_remove_filters() is called
 again, leading to a panic.
To prevent this, after freeing it, overwrite it with NULL.

Fixes: a9dc3d5 ("sfc_ef100: RX filter table management and related gubbins")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If efx_nic_init_interrupt fails, or was never run (e.g. due to an earlier
 failure in ef100_net_open), freeing irqs in efx_nic_fini_interrupt is not
 needed and will cause error messages and stack traces.
So instead, only do this if efx_nic_init_interrupt successfully completed,
 as indicated by the new efx->irqs_hooked flag.

Fixes: 965b549 ("sfc_ef100: implement ndo_open/close and EVQ probing")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree says:

====================
sfc: more EF100 fixes

Fix up some bugs in the initial EF100 submission, and re-fix
 the hash_valid fix which was incomplete.

The reset bugs are currently hard to trigger; they were found
 with an in-progress patch adding ethtool support, whereby
 ethtool --reset reliably reproduces them.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Work request used for sending loopback packet needs to add
the firmware work request only once. So, fix by using
correct structure size.

Fixes: 7235ffa ("cxgb4: add loopback ethtool self-test")
Signed-off-by: Ganji Aravind <ganji.aravind@chelsio.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even after Tx queues are marked stopped, there exists a
small window where the current packet in the normal Tx
path is still being sent out and loopback selftest ends
up corrupting the same Tx ring. So, ensure selftest takes
the Tx lock to synchronize access the Tx ring.

Fixes: 7235ffa ("cxgb4: add loopback ethtool self-test")
Signed-off-by: Ganji Aravind <ganji.aravind@chelsio.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ganji Aravind says:

====================
cxgb4: Fix ethtool selftest flits calculation

Patch 1 will fix work request size calculation for loopback selftest.

Patch 2 will fix race between loopback selftest and normal Tx handler.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Every iteration of for_each_available_child_of_node() decrements
reference count of the previous node, however when control
is transferred from the middle of the loop, as in the case of
a return or break or goto, there is no decrement thus ultimately
resulting in a memory leak.

Fix a potential memory leak in gianfar.c by inserting of_node_put()
before the goto statement.

Issue found with Coccinelle.

Signed-off-by: Sumera Priyadarsini <sylphrenadin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_carve_frag_list() may return -ENOMEM in pskb_carve_inside_nonlinear().
we should handle this correctly or we would get wrong sk_buff.

Fixes: 6fa01cc ("skbuff: Add pskb_extract() helper function")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the ARP monitor is used for link detection, ARP replies are
validated for all slaves (arp_validate=3) and fail_over_mac is set to
active, two slaves of an active-backup bond may get stuck in a state
where both of them are active and pass packets that they receive to
the bond. This state makes IPv6 duplicate address detection fail. The
state is reached thus:
1. The current active slave goes down because the ARP target
   is not reachable.
2. The current ARP slave is chosen and made active.
3. A new slave is enslaved. This new slave becomes the current active
   slave and can reach the ARP target.
As a result, the current ARP slave stays active after the enslave
action has finished and the log is littered with "PROBE BAD" messages:
> bond0: PROBE: c_arp ens10 && cas ens11 BAD
The workaround is to remove the slave with "going back" status from
the bond and re-enslave it. This issue was encountered when DPDK PMD
interfaces were being enslaved to an active-backup bond.

I would be possible to fix the issue in bond_enslave() or
bond_change_active_slave() but the ARP monitor was fixed instead to
keep most of the actions changing the current ARP slave in the ARP
monitor code. The current ARP slave is set as inactive and backup
during the commit phase. A new state, BOND_LINK_FAIL, has been
introduced for slaves in the context of the ARP monitor. This allows
administrators to see how slaves are rotated for sending ARP requests
and attempts are made to find a new active slave.

Fixes: b2220ca ("bonding: refactor ARP active-backup monitor")
Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to do 3 things for ipv6_dev_find():

  As David A. noticed,

  - rt6_lookup() is not really needed. Different from __ip_dev_find(),
    ipv6_dev_find() doesn't have a compatibility problem, so remove it.

  As Hideaki suggested,

  - "valid" (non-tentative) check for the address is also needed.
    ipv6_chk_addr() calls ipv6_chk_addr_and_flags(), which will
    traverse the address hash list, but it's heavy to be called
    inside ipv6_dev_find(). This patch is to reuse the code of
    ipv6_chk_addr_and_flags() for ipv6_dev_find().

  - dev parameter is passed into ipv6_dev_find(), as link-local
    addresses from user space has sin6_scope_id set and the dev
    lookup needs it.

Fixes: 81f6cb3 ("ipv6: add ipv6_dev_find()")
Suggested-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, ethtool-netlink calculates new wanted bits as:
(req_wanted & req_mask) | (old_active & ~req_mask)

It completely discards the old wanted bits, so they are forgotten with
the next ethtool command. Sample steps to reproduce:

1. ethtool -k eth0
   tx-tcp-segmentation: on # TSO is on from the beginning
2. ethtool -K eth0 tx off
   tx-tcp-segmentation: off [not requested]
3. ethtool -k eth0
   tx-tcp-segmentation: off [requested on]
4. ethtool -K eth0 rx off # Some change unrelated to TSO
5. ethtool -k eth0
   tx-tcp-segmentation: off # "Wanted on" is forgotten

This commit fixes it by changing the formula to:
(req_wanted & req_mask) | (old_wanted & ~req_mask),
where old_active was replaced by old_wanted to account for the wanted
bits.

The shortcut condition for the case where nothing was changed now
compares wanted bitmasks, instead of wanted to active.

Fixes: 0980bfc ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
ethtool-netlink ignores dev->hw_features and may confuse the drivers by
asking them to enable features not in the hw_features bitmask. For
example:

1. ethtool -k eth0
   tls-hw-tx-offload: off [fixed]
2. ethtool -K eth0 tls-hw-tx-offload on
   tls-hw-tx-offload: on
3. ethtool -k eth0
   tls-hw-tx-offload: on [fixed]

Fitler out dev->hw_features from req_wanted to fix it and to resemble
the legacy ethtool behavior.

Fixes: 0980bfc ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The legacy ethtool userspace tool shows an error when no features could
be changed. It's useful to have a netlink reply to be able to show this
error when __netdev_update_features wasn't called, for example:

1. ethtool -k eth0
   large-receive-offload: off
2. ethtool -K eth0 rx-fcs on
3. ethtool -K eth0 lro on
   Could not change any device features
   rx-lro: off [requested on]
4. ethtool -K eth0 lro on
   # The output should be the same, but without this patch the kernel
   # doesn't send the reply, and ethtool is unable to detect the error.

This commit makes ethtool-netlink always return a reply when requested,
and it still avoids unnecessary calls to __netdev_update_features if the
wanted features haven't changed.

Fixes: 0980bfc ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
geertu and others added 22 commits August 20, 2020 16:30
- Remove pinctrl consumer properties, as they are handled by core
    dt-schema,
  - Document missing properties,
  - Document missing PHY child node,
  - Add "additionalProperties: false".

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The number of output and input streams was never being reduced, eg when
processing received INIT or INIT_ACK chunks.
The effect is that DATA chunks can be sent with invalid stream ids
and then discarded by the remote system.

Fixes: 2075e50 ("sctp: convert to genradix")
Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
…rror flow

tcf_ct_handle_fragments() shouldn't free the skb when ip_defrag() call
fails. Otherwise, we will cause a double-free bug.
In such cases, just return the error to the caller.

Fixes: b57dc7c ("net/sched: Introduce action ct")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b->media->send_msg() requires rcu_read_lock(), as we can see
elsewhere in tipc,  tipc_bearer_xmit, tipc_bearer_xmit_skb
and tipc_bearer_bc_xmit().

Syzbot has reported this issue as:

  net/tipc/bearer.c:466 suspicious rcu_dereference_check() usage!
  Workqueue: cryptd cryptd_queue_worker
  Call Trace:
   tipc_l2_send_msg+0x354/0x420 net/tipc/bearer.c:466
   tipc_aead_encrypt_done+0x204/0x3a0 net/tipc/crypto.c:761
   cryptd_aead_crypt+0xe8/0x1d0 crypto/cryptd.c:739
   cryptd_queue_worker+0x118/0x1b0 crypto/cryptd.c:181
   process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
   worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
   kthread+0x3b5/0x4a0 kernel/kthread.c:291
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293

So fix it by calling rcu_read_lock() in tipc_aead_encrypt_done()
for b->media->send_msg().

Fixes: fc1b6d6 ("tipc: introduce TIPC encryption & authentication")
Reported-by: syzbot+47bbc6b678d317cccbe0@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
…000000

In is_module_segment(), when VMALLOC_END is over 0xf0000000,
ALIGN(VMALLOC_END, SZ_256M) has value 0.

In that case, addr >= ALIGN(VMALLOC_END, SZ_256M) is always
true then is_module_segment() always returns false.

Use (ALIGN(VMALLOC_END, SZ_256M) - 1) which will have
value 0xffffffff and will be suitable for the comparison.

Fixes: c496433 ("powerpc/32s: Only leave NX unset on segments used for modules")
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/09fc73fe9c7423c6b4cf93f93df9bb0ed8eefab5.1597994047.git.christophe.leroy@csgroup.eu
Commit 792f73f ("powerpc/hv-24x7: Add sysfs files inside hv-24x7
device to show cpumask") added cpumask file as part of hv-24x7 driver
inside the interface folder. The cpumask file is supposed to be in the
top folder of the PMU driver in order to make hotplug work.

This patch fixes that issue and creates new group 'cpumask_attr_group'
to add cpumask file and make sure it added in top folder.

  command:# cat /sys/devices/hv_24x7/cpumask
  0

Fixes: 792f73f ("powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask")
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821080610.123997-1-kjain@linux.ibm.com
KVM has an optmization to avoid expensive MRS read/writes on
VMENTER/EXIT. It caches the MSR values and restores them either when
leaving the run loop, on preemption or when going out to user space.

The affected MSRs are not required for kernel context operations. This
changed with the recently introduced mechanism to handle FSGSBASE in the
paranoid entry code which has to retrieve the kernel GSBASE value by
accessing per CPU memory. The mechanism needs to retrieve the CPU number
and uses either LSL or RDPID if the processor supports it.

Unfortunately RDPID uses MSR_TSC_AUX which is in the list of cached and
lazily restored MSRs, which means between the point where the guest value
is written and the point of restore, MSR_TSC_AUX contains a random number.

If an NMI or any other exception which uses the paranoid entry path happens
in such a context, then RDPID returns the random guest MSR_TSC_AUX value.

As a consequence this reads from the wrong memory location to retrieve the
kernel GSBASE value. Kernel GS is used to for all regular this_cpu_*()
operations. If the GSBASE in the exception handler points to the per CPU
memory of a different CPU then this has the obvious consequences of data
corruption and crashes.

As the paranoid entry path is the only place which accesses MSR_TSX_AUX
(via RDPID) and the fallback via LSL is not significantly slower, remove
the RDPID alternative from the entry path and always use LSL.

The alternative would be to write MSR_TSC_AUX on every VMENTER and VMEXIT
which would be inflicting massive overhead on that code path.

[ tglx: Rewrote changelog ]

Fixes: eaad981 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Debugged-by: Tom Lendacky <thomas.lendacky@amd.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200821105229.18938-1-pbonzini@redhat.com
The transcript of the x86 entry code to the generic version failed to
reload the syscall number from ptregs after ptrace and seccomp have run,
which both can modify the syscall number in ptregs. It returns the original
syscall number instead which is obviously not the right thing to do.

Reload the syscall number to fix that.

Fixes: 142781e ("entry: Provide generic syscall entry functionality")
Reported-by: Kyle Huey <me@kylehuey.com> 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Huey <me@kylehuey.com> 
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/87blj6ifo8.fsf@nanos.tec.linutronix.de
clang static analysis reports this problem

b53_common.c:1583:13: warning: The left expression of the compound
  assignment is an uninitialized value. The computed value will
  also be garbage
        ent.port &= ~BIT(port);
        ~~~~~~~~ ^

ent is set by a successful call to b53_arl_read().  Unsuccessful
calls are caught by an switch statement handling specific returns.
b32_arl_read() calls b53_arl_op_wait() which fails with the
unhandled -ETIMEDOUT.

So add -ETIMEDOUT to the switch statement.  Because
b53_arl_op_wait() already prints out a message, do not add another
one.

Fixes: 1da6df8 ("net: dsa: b53: Implement ARL add/del/dump operations")
Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also remove trailing whitespaces in bpf_skb_get_tunnel_key example code.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200821133642.18870-1-tklauser@distanz.ch
Alexei Starovoitov says:

====================
pull-request: bpf 2020-08-21

The following pull-request contains BPF updates for your *net* tree.

We've added 11 non-merge commits during the last 5 day(s) which contain
a total of 12 files changed, 78 insertions(+), 24 deletions(-).

The main changes are:

1) three fixes in BPF task iterator logic, from Yonghong.

2) fix for compressed dwarf sections in vmlinux, from Jiri.

3) fix xdp attach regression, from Andrii.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the nexthop code will use an empty NHA_GROUP attribute, but it
requires at least 1 entry in order to function properly. Otherwise we
end up derefencing null or random pointers all over the place due to not
having any nh_grp_entry members allocated, nexthop code relies on having at
least the first member present. Empty NHA_GROUP doesn't make any sense so
just disallow it.
Also add a WARN_ON for any future users of nexthop_create_group().

 BUG: kernel NULL pointer dereference, address: 0000000000000080
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP
 CPU: 0 PID: 558 Comm: ip Not tainted 5.9.0-rc1+ #93
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
 RIP: 0010:fib_check_nexthop+0x4a/0xaa
 Code: 0f 84 83 00 00 00 48 c7 02 80 03 f7 81 c3 40 80 fe fe 75 12 b8 ea ff ff ff 48 85 d2 74 6b 48 c7 02 40 03 f7 81 c3 48 8b 40 10 <48> 8b 80 80 00 00 00 eb 36 80 78 1a 00 74 12 b8 ea ff ff ff 48 85
 RSP: 0018:ffff88807983ba00 EFLAGS: 00010213
 RAX: 0000000000000000 RBX: ffff88807983bc00 RCX: 0000000000000000
 RDX: ffff88807983bc00 RSI: 0000000000000000 RDI: ffff88807bdd0a80
 RBP: ffff88807983baf8 R08: 0000000000000dc0 R09: 000000000000040a
 R10: 0000000000000000 R11: ffff88807bdd0ae8 R12: 0000000000000000
 R13: 0000000000000000 R14: ffff88807bea3100 R15: 0000000000000001
 FS:  00007f10db393700(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000080 CR3: 000000007bd0f004 CR4: 00000000003706f0
 Call Trace:
  fib_create_info+0x64d/0xaf7
  fib_table_insert+0xf6/0x581
  ? __vma_adjust+0x3b6/0x4d4
  inet_rtm_newroute+0x56/0x70
  rtnetlink_rcv_msg+0x1e3/0x20d
  ? rtnl_calcit.isra.0+0xb8/0xb8
  netlink_rcv_skb+0x5b/0xac
  netlink_unicast+0xfa/0x17b
  netlink_sendmsg+0x334/0x353
  sock_sendmsg_nosec+0xf/0x3f
  ____sys_sendmsg+0x1a0/0x1fc
  ? copy_msghdr_from_user+0x4c/0x61
  ___sys_sendmsg+0x63/0x84
  ? handle_mm_fault+0xa39/0x11b5
  ? sockfd_lookup_light+0x72/0x9a
  __sys_sendmsg+0x50/0x6e
  do_syscall_64+0x54/0xbe
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f10dacc0bb7
 Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 8b 05 9a 4b 2b 00 85 c0 75 2e 48 63 ff 48 63 d2 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 f2 2a 00 f7 d8 64 89 02 48
 RSP: 002b:00007ffcbe628bf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffcbe628f80 RCX: 00007f10dacc0bb7
 RDX: 0000000000000000 RSI: 00007ffcbe628c60 RDI: 0000000000000003
 RBP: 000000005f41099c R08: 0000000000000001 R09: 0000000000000008
 R10: 00000000000005e9 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000000 R14: 00007ffcbe628d70 R15: 0000563a86c6e440
 Modules linked in:
 CR2: 0000000000000080

CC: David Ahern <dsahern@gmail.com>
Fixes: 430a049 ("nexthop: Add support for nexthop groups")
Reported-by: syzbot+a61aa19b0c14c8770bd9@syzkaller.appspotmail.com
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding a new fd to an epoll, and that this new fd is an
epoll fd itself, we recursively scan the fds attached to it
to detect cycles, and add non-epool files to a "check list"
that gets subsequently parsed.

However, this check list isn't completely safe when deletions
can happen concurrently. To sidestep the issue, make sure that
a struct file placed on the check list sees its f_count increased,
ensuring that a concurrent deletion won't result in the file
disapearing from under our feet.

Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
…l/git/viro/vfs

Pull epoll fixes from Al Viro:
 "Fix reference counting and clean up exit paths"

* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do_epoll_ctl(): clean the failure exits up a bit
  epoll: Keep a reference on files added to the check list
Pull networking fixes from David Miller:
 "Nothing earth shattering here, lots of small fixes (f.e. missing RCU
  protection, bad ref counting, missing memset(), etc.) all over the
  place:

   1) Use get_file_rcu() in task_file iterator, from Yonghong Song.

   2) There are two ways to set remote source MAC addresses in macvlan
      driver, but only one of which validates things properly. Fix this.
      From Alvin Šipraga.

   3) Missing of_node_put() in gianfar probing, from Sumera
      Priyadarsini.

   4) Preserve device wanted feature bits across multiple netlink
      ethtool requests, from Maxim Mikityanskiy.

   5) Fix rcu_sched stall in task and task_file bpf iterators, from
      Yonghong Song.

   6) Avoid reset after device destroy in ena driver, from Shay
      Agroskin.

   7) Missing memset() in netlink policy export reallocation path, from
      Johannes Berg.

   8) Fix info leak in __smc_diag_dump(), from Peilin Ye.

   9) Decapsulate ECN properly for ipv6 in ipv4 tunnels, from Mark
      Tomlinson.

  10) Fix number of data stream negotiation in SCTP, from David Laight.

  11) Fix double free in connection tracker action module, from Alaa
      Hleihel.

  12) Don't allow empty NHA_GROUP attributes, from Nikolay Aleksandrov"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits)
  net: nexthop: don't allow empty NHA_GROUP
  bpf: Fix two typos in uapi/linux/bpf.h
  net: dsa: b53: check for timeout
  tipc: call rcu_read_lock() in tipc_aead_encrypt_done()
  net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow
  net: sctp: Fix negotiation of the number of data streams.
  dt-bindings: net: renesas, ether: Improve schema validation
  gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY
  hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit()
  hv_netvsc: Remove "unlikely" from netvsc_select_queue
  bpf: selftests: global_funcs: Check err_str before strstr
  bpf: xdp: Fix XDP mode when no mode flags specified
  selftests/bpf: Remove test_align leftovers
  tools/resolve_btfids: Fix sections with wrong alignment
  net/smc: Prevent kernel-infoleak in __smc_diag_dump()
  sfc: fix build warnings on 32-bit
  net: phy: mscc: Fix a couple of spelling mistakes "spcified" -> "specified"
  libbpf: Fix map index used in error message
  net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe()
  net: atlantic: Use readx_poll_timeout() for large timeout
  ...
…linux/kernel/git/ras/ras

Pull EDAC fix from Borislav Petkov:
 "A single fix correcting a reversed error severity determination check
  which lead to a recoverable error getting marked as fatal, by Tony
  Luck"

* tag 'edac_urgent_for_v5.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/{i7core,sb,pnd2,skx}: Fix error event severity
…nux/kernel/git/tip/tip

Pull entry fix from Thomas Gleixner:
 "A single bug fix for the common entry code.

  The transcription of the x86 version messed up the reload of the
  syscall number from pt_regs after ptrace and seccomp which breaks
  syscall number rewriting"

* tag 'core-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  core/entry: Respect syscall number rewrites
…ux/kernel/git/tip/tip

Pull EFI fixes from Thomas Gleixner:

 - Enforce NX on RO data in mixed EFI mode

 - Destroy workqueue in an error handling path to prevent UAF

 - Stop argument parser at '--' which is the delimiter for init

 - Treat a NULL command line pointer as empty instead of dereferncing it
   unconditionally.

 - Handle an unterminated command line correctly

 - Cleanup the 32bit code leftovers and remove obsolete documentation

* tag 'efi-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation: efi: remove description of efi=old_map
  efi/x86: Move 32-bit code into efi_32.c
  efi/libstub: Handle unterminated cmdline
  efi/libstub: Handle NULL cmdline
  efi/libstub: Stop parsing arguments at "--"
  efi: add missed destroy_workqueue when efisubsys_init fails
  efi/x86: Mark kernel rodata non-executable for mixed mode
…nux/kernel/git/tip/tip

Pull x86 perf fix from Thomas Gleixner:
 "A single update for perf on x86 which has support for the broken down
  bandwith counters"

* tag 'perf-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdown
…ux/kernel/git/tip/tip

Pull x86 fix from Thomas Gleixner:
 "A single fix for x86 which removes the RDPID usage from the paranoid
  entry path and unconditionally uses LSL to retrieve the CPU number.

  RDPID depends on MSR_TSX_AUX. KVM has an optmization to avoid
  expensive MRS read/writes on VMENTER/EXIT. It caches the MSR values
  and restores them either when leaving the run loop, on preemption or
  when going out to user space. MSR_TSX_AUX is part of that lazy MSR
  set, so after writing the guest value and before the lazy restore any
  exception using the paranoid entry will read the guest value and use
  it as CPU number to retrieve the GSBASE value for the current CPU when
  FSGSBASE is enabled. As RDPID is only used in that particular entry
  path, there is no reason to burden VMENTER/EXIT with two extra MSR
  writes. Remove the RDPID optimization, which is not even backed by
  numbers from the paranoid entry path instead"

* tag 'x86-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM
…l/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Add perf support for emitting extended registers for power10.

 - A fix for CPU hotplug on pseries, where on large/loaded systems we
   may not wait long enough for the CPU to be offlined, leading to
   crashes.

 - Addition of a raw cputable entry for Power10, which is not required
   to boot, but is required to make our PMU setup work correctly in
   guests.

 - Three fixes for the recent changes on 32-bit Book3S to move modules
   into their own segment for strict RWX.

 - A fix for a recent change in our powernv PCI code that could lead to
   crashes.

 - A change to our perf interrupt accounting to avoid soft lockups when
   using some events, found by syzkaller.

 - A change in the way we handle power loss events from the hypervisor
   on pseries. We no longer immediately shut down if we're told we're
   running on a UPS.

 - A few other minor fixes.

Thanks to Alexey Kardashevskiy, Andreas Schwab, Aneesh Kumar K.V, Anju T
Sudhakar, Athira Rajeev, Christophe Leroy, Frederic Barrat, Greg Kurz,
Kajol Jain, Madhavan Srinivasan, Michael Neuling, Michael Roth,
Nageswara R Sastry, Oliver O'Halloran, Thiago Jung Bauermann,
Vaidyanathan Srinivasan, Vasant Hegde.

* tag 'powerpc-5.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/perf/hv-24x7: Move cpumask file to top folder of hv-24x7 driver
  powerpc/32s: Fix module loading failure when VMALLOC_END is over 0xf0000000
  powerpc/pseries: Do not initiate shutdown when system is running on UPS
  powerpc/perf: Fix soft lockups due to missed interrupt accounting
  powerpc/powernv/pci: Fix possible crash when releasing DMA resources
  powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death
  powerpc/32s: Fix is_module_segment() when MODULES_VADDR is defined
  powerpc/kasan: Fix KASAN_SHADOW_START on BOOK3S_32
  powerpc/fixmap: Fix the size of the early debug area
  powerpc/pkeys: Fix build error with PPC_MEM_KEYS disabled
  powerpc/kernel: Cleanup machine check function declarations
  powerpc: Add POWER10 raw mode cputable entry
  powerpc/perf: Add extended regs support for power10 platform
  powerpc/perf: Add support for outputting extended regs in perf intr_regs
  powerpc: Fix P10 PVR revision in /proc/cpuinfo for SMT4 cores
@pull pull bot added the ⤵️ pull label Aug 23, 2020
@pull pull bot merged commit cb95712 into bergwolf:master Aug 23, 2020
pull bot pushed a commit that referenced this pull request Apr 30, 2022
The IPA BCM resource ("IP0") on sc7180 was moved to the clk-rpmh driver
in commit bcd63d2 ("clk: qcom: rpmh: Add IPA clock for SC7180") and
modeled as a clk, but this interconnect driver still had it modeled as
an interconnect. This was mostly OK because nobody used the interconnect
definition, until the interconnect framework started dropping bandwidth
requests on interconnects that aren't used via the sync_state callback
in commit 7d3b0b0 ("interconnect: qcom: Use icc_sync_state"). Once
that patch was applied the IP0 resource was going to be controlled from
two places, the clk framework and the interconnect framework.

Even then, things were probably going to be OK, because commit
b95b668 ("interconnect: qcom: icc-rpmh: Add BCMs to commit list in
pre_aggregate") was needed to actually drop bandwidth requests on unused
interconnects, of which the IPA was one of the interconnect that wasn't
getting dropped to zero. Combining the three commits together leads to
bad behavior where the interconnect framework is disabling the IP0
resource because it has no users while the clk framework thinks the IP0
resource is on because the only user, the IPA driver, has turned it on
via clk_prepare_enable(). Depending on when sync_state is called, we can
get into a situation like below:

  IPA driver probes
  IPA driver gets notified modem started
   runtime PM get()
    IPA clk enabled -> IP0 resource is ON
  sync_state runs
   interconnect zeroes out the IP0 resource -> IP0 resource is off
  IPA driver tries to access a register and blows up

The crash is an unclocked access that manifest as an SError.

 SError Interrupt on CPU0, code 0xbe000011 -- SError
 CPU: 0 PID: 3595 Comm: mmdata_mgr Not tainted 5.17.1+ #166
 Hardware name: Google Lazor (rev1 - 2) with LTE (DT)
 pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : mutex_lock+0x4c/0x80
 lr : mutex_lock+0x30/0x80
 sp : ffffffc00da9b9c0
 x29: ffffffc00da9b9c0 x28: 0000000000000000 x27: 0000000000000000
 x26: ffffffc00da9bc90 x25: ffffff80c2024010 x24: ffffff80c2024000
 x23: ffffff8083100000 x22: ffffff80831000d0 x21: ffffff80831000a8
 x20: ffffff80831000a8 x19: ffffff8083100070 x18: 00000000ffff0a00
 x17: 000000002f7254f1 x16: 0000000000000100 x15: 0000000000000000
 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
 x11: 000000000001f0b8 x10: ffffffc00931f0b8 x9 : 0000000000000000
 x8 : 0000000000000000 x7 : fefefefefeff2f60 x6 : 0000808080808080
 x5 : 0000000000000000 x4 : 8080808080800000 x3 : ffffff80d2d4ee28
 x2 : ffffff808c1d6e40 x1 : 0000000000000000 x0 : ffffff8083100070
 Kernel panic - not syncing: Asynchronous SError Interrupt
 CPU: 0 PID: 3595 Comm: mmdata_mgr Not tainted 5.17.1+ #166
 Hardware name: Google Lazor (rev1 - 2) with LTE (DT)
 Call trace:
  dump_backtrace+0xf4/0x114
  show_stack+0x24/0x30
  dump_stack_lvl+0x64/0x7c
  dump_stack+0x18/0x38
  panic+0x150/0x38c
  nmi_panic+0x88/0xa0
  arm64_serror_panic+0x74/0x80
  do_serror+0x0/0x80
  do_serror+0x58/0x80
  el1h_64_error_handler+0x34/0x4c
  el1h_64_error+0x78/0x7c
  mutex_lock+0x4c/0x80
  __gsi_channel_start+0x50/0x17c
  gsi_channel_start+0x54/0x90
  ipa_endpoint_enable_one+0x34/0xc0
  ipa_open+0x4c/0x120

Remove all IP0 resource management from the interconnect driver so that
clk-rpmh is the sole owner. This fixes the issue by preventing the
interconnect driver from overwriting the IP0 resource data that the
clk-rpmh driver wrote.

Cc: Alex Elder <elder@linaro.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Taniya Das <quic_tdas@quicinc.com>
Cc: Mike Tipton <quic_mdtipton@quicinc.com>
Fixes: b95b668 ("interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate")
Fixes: bcd63d2 ("clk: qcom: rpmh: Add IPA clock for SC7180")
Fixes: 7d3b0b0 ("interconnect: qcom: Use icc_sync_state")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Tested-by: Alex Elder <elder@linaro.org>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Link: https://lore.kernel.org/r/20220412220033.1273607-2-swboyd@chromium.org
Signed-off-by: Georgi Djakov <djakov@kernel.org>
pull bot pushed a commit that referenced this pull request May 29, 2022
gpio_keys module can either accept gpios or interrupts. The module
initializes delayed work in case of gpios only and is only used if
debounce timer is not used, so make sure cancel_delayed_work_sync()
is called only when its gpio-backed and debounce_use_hrtimer is false.

This fixes the issue seen below when the gpio_keys module is unloaded and
an interrupt pin is used instead of GPIO:

[  360.297569] ------------[ cut here ]------------
[  360.302303] WARNING: CPU: 0 PID: 237 at kernel/workqueue.c:3066 __flush_work+0x414/0x470
[  360.310531] Modules linked in: gpio_keys(-)
[  360.314797] CPU: 0 PID: 237 Comm: rmmod Not tainted 5.18.0-rc5-arm64-renesas-00116-g73636105874d-dirty #166
[  360.324662] Hardware name: Renesas SMARC EVK based on r9a07g054l2 (DT)
[  360.331270] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  360.338318] pc : __flush_work+0x414/0x470
[  360.342385] lr : __cancel_work_timer+0x140/0x1b0
[  360.347065] sp : ffff80000a7fba00
[  360.350423] x29: ffff80000a7fba00 x28: ffff000012b9c5c0 x27: 0000000000000000
[  360.357664] x26: ffff80000a7fbb80 x25: ffff80000954d0a8 x24: 0000000000000001
[  360.364904] x23: ffff800009757000 x22: 0000000000000000 x21: ffff80000919b000
[  360.372143] x20: ffff00000f5974e0 x19: ffff00000f5974e0 x18: ffff8000097fcf48
[  360.379382] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000053f40
[  360.386622] x14: ffff800009850e88 x13: 0000000000000002 x12: 000000000000a60c
[  360.393861] x11: 000000000000a610 x10: 0000000000000000 x9 : 0000000000000008
[  360.401100] x8 : 0101010101010101 x7 : 00000000a473c394 x6 : 0080808080808080
[  360.408339] x5 : 0000000000000001 x4 : 0000000000000000 x3 : ffff80000919b458
[  360.415578] x2 : ffff8000097577f0 x1 : 0000000000000001 x0 : 0000000000000000
[  360.422818] Call trace:
[  360.425299]  __flush_work+0x414/0x470
[  360.429012]  __cancel_work_timer+0x140/0x1b0
[  360.433340]  cancel_delayed_work_sync+0x10/0x18
[  360.437931]  gpio_keys_quiesce_key+0x28/0x58 [gpio_keys]
[  360.443327]  devm_action_release+0x10/0x18
[  360.447481]  release_nodes+0x8c/0x1a0
[  360.451194]  devres_release_all+0x90/0x100
[  360.455346]  device_unbind_cleanup+0x14/0x60
[  360.459677]  device_release_driver_internal+0xe8/0x168
[  360.464883]  driver_detach+0x4c/0x90
[  360.468509]  bus_remove_driver+0x54/0xb0
[  360.472485]  driver_unregister+0x2c/0x58
[  360.476462]  platform_driver_unregister+0x10/0x18
[  360.481230]  gpio_keys_exit+0x14/0x828 [gpio_keys]
[  360.486088]  __arm64_sys_delete_module+0x1e0/0x270
[  360.490945]  invoke_syscall+0x40/0xf8
[  360.494661]  el0_svc_common.constprop.3+0xf0/0x110
[  360.499515]  do_el0_svc+0x20/0x78
[  360.502877]  el0_svc+0x48/0xf8
[  360.505977]  el0t_64_sync_handler+0x88/0xb0
[  360.510216]  el0t_64_sync+0x148/0x14c
[  360.513930] irq event stamp: 4306
[  360.517288] hardirqs last  enabled at (4305): [<ffff8000080b0300>] __cancel_work_timer+0x130/0x1b0
[  360.526359] hardirqs last disabled at (4306): [<ffff800008d194fc>] el1_dbg+0x24/0x88
[  360.534204] softirqs last  enabled at (4278): [<ffff8000080104a0>] _stext+0x4a0/0x5e0
[  360.542133] softirqs last disabled at (4267): [<ffff8000080932ac>] irq_exit_rcu+0x18c/0x1b0
[  360.550591] ---[ end trace 0000000000000000 ]---

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Link: https://lore.kernel.org/r/20220524135822.14764-1-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
pull bot pushed a commit that referenced this pull request Jul 17, 2022
When running with return thunks enabled under 32-bit EFI, the system
crashes with:

  kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
  BUG: unable to handle page fault for address: 000000005bc02900
  #PF: supervisor instruction fetch in kernel mode
  #PF: error_code(0x0011) - permissions violation
  PGD 18f7063 P4D 18f7063 PUD 18ff063 PMD 190e063 PTE 800000005bc02063
  Oops: 0011 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc6+ #166
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:0x5bc02900
  Code: Unable to access opcode bytes at RIP 0x5bc028d6.
  RSP: 0018:ffffffffb3203e10 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000048
  RDX: 000000000190dfac RSI: 0000000000001710 RDI: 000000007eae823b
  RBP: ffffffffb3203e70 R08: 0000000001970000 R09: ffffffffb3203e28
  R10: 747563657865206c R11: 6c6977203a696665 R12: 0000000000001710
  R13: 0000000000000030 R14: 0000000001970000 R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff8e013ca00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0018 ES: 0018 CR0: 0000000080050033
  CR2: 000000005bc02900 CR3: 0000000001930000 CR4: 00000000000006f0
  Call Trace:
   ? efi_set_virtual_address_map+0x9c/0x175
   efi_enter_virtual_mode+0x4a6/0x53e
   start_kernel+0x67c/0x71e
   x86_64_start_reservations+0x24/0x2a
   x86_64_start_kernel+0xe9/0xf4
   secondary_startup_64_no_verify+0xe5/0xeb

That's because it cannot jump to the return thunk from the 32-bit code.

Using a naked RET and marking it as safe allows the system to proceed
booting.

Fixes: aa3d480 ("x86: Use return-thunk in asm code")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: <stable@vger.kernel.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pull bot pushed a commit that referenced this pull request Feb 5, 2023
If a relocatable kernel is loaded at an address that is not 2MB aligned
and told not to relocate to zero, the kernel can crash due to
mark_rodata_ro() incorrectly changing some read-write data to read-only.

Scenarios where the misalignment can occur are when the kernel is
loaded by kdump or using the RELOCATABLE_TEST config option.

Example crash with the kernel loaded at 5MB:

  Run /sbin/init as init process
  BUG: Unable to handle kernel data access on write at 0xc000000000452000
  Faulting instruction address: 0xc0000000005b6730
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
  CPU: 1 PID: 1 Comm: init Not tainted 6.2.0-rc1-00011-g349188be4841 #166
  Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,git-5b4c5a hv:linux,kvm pSeries
  NIP:  c0000000005b6730 LR: c000000000ae9ab8 CTR: 0000000000000380
  REGS: c000000004503250 TRAP: 0300   Not tainted  (6.2.0-rc1-00011-g349188be4841)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 44288480  XER: 00000000
  CFAR: c0000000005b66ec DAR: c000000000452000 DSISR: 0a000000 IRQMASK: 0
  ...
  NIP memset+0x68/0x104
  LR  zero_user_segments.constprop.0+0xa8/0xf0
  Call Trace:
    ext4_mpage_readpages+0x7f8/0x830
    ext4_readahead+0x48/0x60
    read_pages+0xb8/0x380
    page_cache_ra_unbounded+0x19c/0x250
    filemap_fault+0x58c/0xae0
    __do_fault+0x60/0x100
    __handle_mm_fault+0x1230/0x1a40
    handle_mm_fault+0x120/0x300
    ___do_page_fault+0x20c/0xa80
    do_page_fault+0x30/0xc0
    data_access_common_virt+0x210/0x220

This happens because mark_rodata_ro() tries to change permissions on the
range _stext..__end_rodata, but _stext sits in the middle of the 2MB
page from 4MB to 6MB:

  radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec)
  radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages
  radix-mmu: Mapped 0x0000000000400000-0x0000000002400000 with 2.00 MiB pages (exec)

The logic that changes the permissions assumes the linear mapping was
split correctly at boot, so it marks the entire 2MB page read-only. That
leads to the write fault above.

To fix it, the boot time mapping logic needs to consider that if the
kernel is running at a non-zero address then _stext is a boundary where
it must split the mapping.

That leads to the mapping being split correctly, allowing the rodata
permission change to take happen correctly, with no spillover:

  radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec)
  radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages
  radix-mmu: Mapped 0x0000000000400000-0x0000000000500000 with 64.0 KiB pages
  radix-mmu: Mapped 0x0000000000500000-0x0000000000600000 with 64.0 KiB pages (exec)
  radix-mmu: Mapped 0x0000000000600000-0x0000000002400000 with 2.00 MiB pages (exec)

If the kernel is loaded at a 2MB aligned address, the mapping continues
to use 2MB pages as before:

  radix-mmu: Mapped 0x0000000000000000-0x0000000000200000 with 2.00 MiB pages (exec)
  radix-mmu: Mapped 0x0000000000200000-0x0000000000400000 with 2.00 MiB pages
  radix-mmu: Mapped 0x0000000000400000-0x0000000002c00000 with 2.00 MiB pages (exec)
  radix-mmu: Mapped 0x0000000002c00000-0x0000000100000000 with 2.00 MiB pages

Fixes: c55d7b5 ("powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230110124753.1325426-1-mpe@ellerman.id.au
pull bot pushed a commit that referenced this pull request Jan 10, 2024
…el.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.8-mergeA

xfs: log intent item recovery should reconstruct defer work state

Long Li reported a KASAN report from a UAF when intent recovery fails:

 ==================================================================
 BUG: KASAN: slab-use-after-free in xfs_cui_release+0xb7/0xc0
 Read of size 4 at addr ffff888012575e60 by task kworker/u8:3/103
 CPU: 3 PID: 103 Comm: kworker/u8:3 Not tainted 6.4.0-rc7-next-20230619-00003-g94543a53f9a4-dirty #166
 Workqueue: xfs-cil/sda xlog_cil_push_work
 Call Trace:
  <TASK>
  dump_stack_lvl+0x50/0x70
  print_report+0xc2/0x600
  kasan_report+0xb6/0xe0
  xfs_cui_release+0xb7/0xc0
  xfs_cud_item_release+0x3c/0x90
  xfs_trans_committed_bulk+0x2d5/0x7f0
  xlog_cil_committed+0xaba/0xf20
  xlog_cil_push_work+0x1a60/0x2360
  process_one_work+0x78e/0x1140
  worker_thread+0x58b/0xf60
  kthread+0x2cd/0x3c0
  ret_from_fork+0x1f/0x30
  </TASK>

 Allocated by task 531:
  kasan_save_stack+0x22/0x40
  kasan_set_track+0x25/0x30
  __kasan_slab_alloc+0x55/0x60
  kmem_cache_alloc+0x195/0x5f0
  xfs_cui_init+0x198/0x1d0
  xlog_recover_cui_commit_pass2+0x133/0x5f0
  xlog_recover_items_pass2+0x107/0x230
  xlog_recover_commit_trans+0x3e7/0x9c0
  xlog_recovery_process_trans+0x140/0x1d0
  xlog_recover_process_ophdr+0x1a0/0x3d0
  xlog_recover_process_data+0x108/0x2d0
  xlog_recover_process+0x1f6/0x280
  xlog_do_recovery_pass+0x609/0xdb0
  xlog_do_log_recovery+0x84/0xe0
  xlog_do_recover+0x7d/0x470
  xlog_recover+0x25f/0x490
  xfs_log_mount+0x2dd/0x6f0
  xfs_mountfs+0x11ce/0x1e70
  xfs_fs_fill_super+0x10ec/0x1b20
  get_tree_bdev+0x3c8/0x730
  vfs_get_tree+0x89/0x2c0
  path_mount+0xecf/0x1800
  do_mount+0xf3/0x110
  __x64_sys_mount+0x154/0x1f0
  do_syscall_64+0x39/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd

 Freed by task 531:
  kasan_save_stack+0x22/0x40
  kasan_set_track+0x25/0x30
  kasan_save_free_info+0x2b/0x40
  __kasan_slab_free+0x114/0x1b0
  kmem_cache_free+0xf8/0x510
  xfs_cui_item_free+0x95/0xb0
  xfs_cui_release+0x86/0xc0
  xlog_recover_cancel_intents.isra.0+0xf8/0x210
  xlog_recover_finish+0x7e7/0x980
  xfs_log_mount_finish+0x2bb/0x4a0
  xfs_mountfs+0x14bf/0x1e70
  xfs_fs_fill_super+0x10ec/0x1b20
  get_tree_bdev+0x3c8/0x730
  vfs_get_tree+0x89/0x2c0
  path_mount+0xecf/0x1800
  do_mount+0xf3/0x110
  __x64_sys_mount+0x154/0x1f0
  do_syscall_64+0x39/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd

 The buggy address belongs to the object at ffff888012575dc8
  which belongs to the cache xfs_cui_item of size 432
 The buggy address is located 152 bytes inside of
  freed 432-byte region [ffff888012575dc8, ffff888012575f78)

 The buggy address belongs to the physical page:
 page:ffffea0000495d00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888012576208 pfn:0x12574
 head:ffffea0000495d00 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
 flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
 page_type: 0xffffffff()
 raw: 001fffff80010200 ffff888012092f40 ffff888014570150 ffff888014570150
 raw: ffff888012576208 00000000001e0010 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff888012575d00: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
  ffff888012575d80: fc fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
 >ffff888012575e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                        ^
  ffff888012575e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888012575f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
 ==================================================================

"If process intents fails, intent items left in AIL will be delete
from AIL and freed in error handling, even intent items that have been
recovered and created done items. After this, uaf will be triggered when
done item committed, because at this point the released intent item will
be accessed.

xlog_recover_finish                     xlog_cil_push_work
----------------------------            ---------------------------
xlog_recover_process_intents
  xfs_cui_item_recover//cui_refcount == 1
    xfs_trans_get_cud
    xfs_trans_commit
      <add cud item to cil>
  xfs_cui_item_recover
    <error occurred and return>
xlog_recover_cancel_intents
  xfs_cui_release     //cui_refcount == 0
    xfs_cui_item_free //free cui
  <release other intent items>
xlog_force_shutdown   //shutdown
                               <...>
                                        <push items in cil>
                                        xlog_cil_committed
                                          xfs_cud_item_release
                                            xfs_cui_release // UAF

"Intent log items are created with a reference count of 2, one for the
creator, and one for the intent done object. Log recovery explicitly
drops the creator reference after it is inserted into the AIL, but it
then processes the log item as if it also owns the intent-done reference.

"The code in ->iop_recovery should assume that it passes the reference
to the done intent, we can remove the intent item from the AIL after
creating the done-intent, but if that code fails before creating the
done-intent then it needs to release the intent reference by log recovery
itself.

"That way when we go to cancel the intent, the only intents we find in
the AIL are the ones we know have not been processed yet and hence we
can safely drop both the creator and the intent done reference from
xlog_recover_cancel_intents().

"Hence if we remove the intent from the list of intents that need to
be recovered after we have done the initial recovery, we acheive two
things:

"1. the tail of the log can be moved forward with the commit of the
done intent or new intent to continue the operation, and

"2. We avoid the problem of trying to determine how many reference
counts we need to drop from intent recovery cancelling because we
never come across intents we've actually attempted recovery on."

Restated: The cause of the UAF is that xlog_recover_cancel_intents
thinks that it owns the refcount on any intent item in the AIL, and that
it's always safe to release these intent items.  This is not true after
the recovery function creates an log intent done item and points it at
the log intent item because releasing the done item always releases the
intent item.

The runtime defer ops code avoids all this by tracking both the log
intent and the intent done items, and releasing only the intent done
item if both have been created.  Long Li proposed fixing this by adding
state flags, but I have a more comprehensive fix.

First, observe that the latter half of the intent _recover functions are
nearly open-coded versions of the corresponding _finish_one function
that uses an onstack deferred work item to single-step through the item.

Second, notice that the recover function is not an exact match because
of the odd behavior that unfinished recovered work items are relogged
with separate log intent items instead of a single new log intent item,
which is what the defer ops machinery does.

Dave and I have long suspected that recovery should be reconstructing
the defer work state from what's in the recovered intent item.  Now we
finally have an excuse to refactor the code to do that.

This series starts by fixing a resource leak in LARP recovery.  We fix
the bug that Long Li reported by switching the intent recovery code to
construct chains of xfs_defer_pending objects and then using the defer
pending objects to track the intent/done item ownership.  Finally, we
clean up the code to reconstruct the exact incore state, which means we
can remove all the opencoded _recover code, which makes maintaining log
items much easier.

v2: minor changes per review comments
v3: pick up more rvb tags, fix build errors

This has been lightly tested with fstests.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'reconstruct-defer-work-6.8_2023-12-06' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: move ->iop_recover to xfs_defer_op_type
  xfs: use xfs_defer_finish_one to finish recovered work items
  xfs: dump the recovered xattri log item if corruption happens
  xfs: recreate work items when recovering intent items
  xfs: transfer recovered intent item ownership in ->iop_recover
  xfs: pass the xfs_defer_pending object to iop_recover
  xfs: use xfs_defer_pending objects to recover intent items
  xfs: don't leak recovered attri intent items
pull bot pushed a commit that referenced this pull request May 17, 2024
Recent additions in BPF like cpu v4 instructions, test_bpf module
exhibits the following failures:

  test_bpf: #82 ALU_MOVSX | BPF_B jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times)
  test_bpf: #83 ALU_MOVSX | BPF_H jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times)
  test_bpf: #84 ALU64_MOVSX | BPF_B jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times)
  test_bpf: #85 ALU64_MOVSX | BPF_H jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times)
  test_bpf: #86 ALU64_MOVSX | BPF_W jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times)

  test_bpf: #165 ALU_SDIV_X: -6 / 2 = -3 jited:1 ret 2147483645 != -3 (0x7ffffffd != 0xfffffffd)FAIL (1 times)
  test_bpf: #166 ALU_SDIV_K: -6 / 2 = -3 jited:1 ret 2147483645 != -3 (0x7ffffffd != 0xfffffffd)FAIL (1 times)

  test_bpf: #169 ALU_SMOD_X: -7 % 2 = -1 jited:1 ret 1 != -1 (0x1 != 0xffffffff)FAIL (1 times)
  test_bpf: #170 ALU_SMOD_K: -7 % 2 = -1 jited:1 ret 1 != -1 (0x1 != 0xffffffff)FAIL (1 times)

  test_bpf: #172 ALU64_SMOD_K: -7 % 2 = -1 jited:1 ret 1 != -1 (0x1 != 0xffffffff)FAIL (1 times)

  test_bpf: #313 BSWAP 16: 0x0123456789abcdef -> 0xefcd
  eBPF filter opcode 00d7 (@2) unsupported
  jited:0 301 PASS
  test_bpf: #314 BSWAP 32: 0x0123456789abcdef -> 0xefcdab89
  eBPF filter opcode 00d7 (@2) unsupported
  jited:0 555 PASS
  test_bpf: #315 BSWAP 64: 0x0123456789abcdef -> 0x67452301
  eBPF filter opcode 00d7 (@2) unsupported
  jited:0 268 PASS
  test_bpf: #316 BSWAP 64: 0x0123456789abcdef >> 32 -> 0xefcdab89
  eBPF filter opcode 00d7 (@2) unsupported
  jited:0 269 PASS
  test_bpf: #317 BSWAP 16: 0xfedcba9876543210 -> 0x1032
  eBPF filter opcode 00d7 (@2) unsupported
  jited:0 460 PASS
  test_bpf: #318 BSWAP 32: 0xfedcba9876543210 -> 0x10325476
  eBPF filter opcode 00d7 (@2) unsupported
  jited:0 320 PASS
  test_bpf: #319 BSWAP 64: 0xfedcba9876543210 -> 0x98badcfe
  eBPF filter opcode 00d7 (@2) unsupported
  jited:0 222 PASS
  test_bpf: #320 BSWAP 64: 0xfedcba9876543210 >> 32 -> 0x10325476
  eBPF filter opcode 00d7 (@2) unsupported
  jited:0 273 PASS

  test_bpf: #344 BPF_LDX_MEMSX | BPF_B
  eBPF filter opcode 0091 (@5) unsupported
  jited:0 432 PASS
  test_bpf: #345 BPF_LDX_MEMSX | BPF_H
  eBPF filter opcode 0089 (@5) unsupported
  jited:0 381 PASS
  test_bpf: #346 BPF_LDX_MEMSX | BPF_W
  eBPF filter opcode 0081 (@5) unsupported
  jited:0 505 PASS

  test_bpf: #490 JMP32_JA: Unconditional jump: if (true) return 1
  eBPF filter opcode 0006 (@1) unsupported
  jited:0 261 PASS

  test_bpf: Summary: 1040 PASSED, 10 FAILED, [924/1038 JIT'ed]

Fix them by adding missing processing.

Fixes: daabb2b ("bpf/tests: add tests for cpuv4 instructions")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/91de862dda99d170697eb79ffb478678af7e0b27.1709652689.git.christophe.leroy@csgroup.eu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.