
Conversation

RudraSwat

I've replaced Linux kernel with mOS in the README and have removed references to kernel.org (which is for the Linux kernel).

bvanassche and others added 30 commits January 29, 2020 16:45
commit 04060db upstream.

iscsit_close_connection() calls isert_wait_conn(). Due to commit
e9d3009 both functions call target_wait_for_sess_cmds() although that
last function should be called only once. Fix this by removing the
target_wait_for_sess_cmds() call from isert_wait_conn() and by only calling
isert_wait_conn() after target_wait_for_sess_cmds().

Fixes: e9d3009 ("scsi: target: iscsi: Wait for all commands to finish before freeing a session").
Link: https://lore.kernel.org/r/20200116044737.19507-1-bvanassche@acm.org
Reported-by: Rahul Kundu <rahul.kundu@chelsio.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d0695e2 upstream.

Just as commit 0566e40 ("tracing: initcall: Ordered comparison of
function pointers"), this patch fixes another remaining one in xen.h
found by clang-9.

In file included from arch/x86/xen/trace.c:21:
In file included from ./include/trace/events/xen.h:475:
In file included from ./include/trace/define_trace.h:102:
In file included from ./include/trace/trace_events.h:473:
./include/trace/events/xen.h:69:7: warning: ordered comparison of function \
pointers ('xen_mc_callback_fn_t' (aka 'void (*)(void *)') and 'xen_mc_callback_fn_t') [-Wordered-compare-function-pointers]
                    __field(xen_mc_callback_fn_t, fn)
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/trace/trace_events.h:421:29: note: expanded from macro '__field'
                                ^
./include/trace/trace_events.h:407:6: note: expanded from macro '__field_ext'
                                 is_signed_type(type), filter_type);    \
                                 ^
./include/linux/trace_events.h:554:44: note: expanded from macro 'is_signed_type'
                                              ^
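
As a rough user-space reproduction of the warning (assuming is_signed_type() is still defined via an ordered comparison, as it was in the kernel headers of that era):

```c
#include <stdio.h>

/* Same shape as the kernel's is_signed_type() of the time: it relies on an
 * ordered (<) comparison, which clang flags when 'type' is a
 * pointer-to-function type. */
#define is_signed_type(type) (((type)(-1)) < (type)1)

typedef void (*callback_fn_t)(void *);

int main(void)
{
    printf("int is signed: %d\n", is_signed_type(int));
    /* clang: warning: ordered comparison of function pointers */
    printf("callback_fn_t 'signed': %d\n", is_signed_type(callback_fn_t));
    return 0;
}
```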

Fixes: c796f21 ("xen/trace: add multicall tracing")
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b9f726c upstream.

It used to be the case that if we got here, we wouldn't warn
but instead allocate the queue (DQA). With using the mac80211
TXQs model this changed, and we really have nothing to do with
the frame here anymore, hence the warning now.

However, clearly we missed in coding & review that this is now
a pure error path and leaks the SKB if we return 0 instead of
an indication that the SKB needs to be freed. Fix this.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fixes: cfbc6c4 ("iwlwifi: mvm: support mac80211 TXQs model")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit df2378a upstream.

When we transmit after TXQ dequeue, we aren't paying attention to
the return value of the transmit functions, leading to a potential
SKB leak.

Refactor the code a bit (and rename ..._tx to ..._tx_sta) to check
for this happening.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fixes: cfbc6c4 ("iwlwifi: mvm: support mac80211 TXQs model")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ecc4d2a upstream.

If we create a rather large userptr object(e.g 1ULL << 32) we might
shift past the type-width of num_pages: (int)num_pages << PAGE_SHIFT,
resulting in a totally bogus sg_table, which fortunately will eventually
manifest as:

gen8_ppgtt_insert_huge:463 GEM_BUG_ON(iter->sg->length < page_size)
kernel BUG at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:463!

v2: more unsigned long
    prefer I915_GTT_PAGE_SIZE
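
As a rough user-space illustration of the overflow (not the driver code; the object size and PAGE_SHIFT are just example values for a 64-bit build):

```c
#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
    unsigned long size = 1UL << 32;                 /* e.g. a 4 GiB userptr object */
    unsigned long num_pages = size >> PAGE_SHIFT;   /* 1 << 20, fits in an int */

    /* buggy: the shift is evaluated in 'int' and overflows 32 bits
     * (undefined int arithmetic; in practice the upper bits are lost),
     * so anything sized from it is bogus */
    unsigned long bad = (unsigned long)((int)num_pages << PAGE_SHIFT);

    /* fixed: keep the arithmetic in unsigned long throughout */
    unsigned long good = num_pages << PAGE_SHIFT;

    printf("bad=%#lx good=%#lx\n", bad, good);
    return 0;
}
```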

Fixes: 5cc9ed4 ("drm/i915: Introduce mapping of user pages into video memory (userptr) ioctl")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20200117132413.1170563-2-matthew.auld@intel.com
(cherry picked from commit 8e78871)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4e4362d upstream.

Commit 9b42c1f ("xfrm: Extend the output_mark") added output_mark
support but missed ESP offload support.

xfrm_smark_get() is not called within xfrm_input() for packets coming
from esp4_gro_receive() or esp6_gro_receive(). Therefore call
xfrm_smark_get() directly within these functions.

Fixes: 9b42c1f ("xfrm: Extend the output_mark to support input direction and masking.")
Signed-off-by: Ulrich Weber <ulrich.weber@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 58c8db9 upstream.

As John Fastabend reports [0], psock state tear-down can happen on receive
path *after* unlocking the socket, if the only other psock user, that is
sockmap or sockhash, releases its psock reference before tcp_bpf_recvmsg
does so:

 tcp_bpf_recvmsg()
  psock = sk_psock_get(sk)                         <- refcnt 2
  lock_sock(sk);
  ...
                                  sock_map_free()  <- refcnt 1
  release_sock(sk)
  sk_psock_put()                                   <- refcnt 0

Remove the lockdep check for socket lock in psock tear-down that got
introduced in 7e81a35 ("bpf: Sockmap, ensure sock lock held during
tear down").

[0] https://lore.kernel.org/netdev/5e25dc995d7d_74082aaee6e465b441@john-XPS-13-9370.notmuch/

Fixes: 7e81a35 ("bpf: Sockmap, ensure sock lock held during tear down")
Reported-by: syzbot+d73682fcf7fee6982fe3@syzkaller.appspotmail.com
Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d0cb501 upstream.

may_create_in_sticky() call is done when we already have dropped the
reference to dir.

Fixes: 30aba66 (namei: allow restricted O_CREAT of FIFOs and regular files)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2c6b7bc upstream.

Commit 8a23eb8 ("Make filldir[64]() verify the directory entry
filename is valid") added some minimal validity checks on the directory
entries passed to filldir[64]().  But they really were pretty minimal.

This fleshes out at least the name length check: we used to disallow
zero-length names, but really, negative lengths or over-long names
aren't ok either.  Both could happen if there is some filesystem
corruption going on.

Now, most filesystems tend to use just an "unsigned char" or similar for
the length of a directory entry name, so even with a corrupt filesystem
you should never see anything odd like that.  But since we then use the
name length to create the directory entry record length, let's make sure
it actually is half-way sensible.

Note how POSIX states that the size of a path component is limited by
NAME_MAX, but we actually use PATH_MAX for the check here.  That's
because while NAME_MAX is generally the correct maximum name length
(it's 255, for the same old "name length is usually just a byte on
disk"), there's nothing in the VFS layer that really cares.

So the real limitation at a VFS layer is the total pathname length you
can pass as a filename: PATH_MAX.
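
A hedged sketch of the kind of check being described (illustrative, not necessarily the exact fs/readdir.c code):

```c
#include <errno.h>
#include <limits.h>
#include <string.h>

/* Reject corrupt directory entries before the name length is used to size
 * the dirent record: non-positive lengths, lengths at or beyond PATH_MAX,
 * and embedded '/' characters are all treated as filesystem corruption. */
int verify_dirent_name(const char *name, int len)
{
    if (len <= 0 || len >= PATH_MAX)
        return -EIO;
    if (memchr(name, '/', len))
        return -EIO;
    return 0;
}
```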

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 865ad2f upstream.

The netif_stop_queue() call in sonic_send_packet() races with the
netif_wake_queue() call in sonic_interrupt(). This causes issues
like "NETDEV WATCHDOG: eth0 (macsonic): transmit queue 0 timed out".
Fix this by disabling interrupts when accessing tx_skb[] and next_tx.
Update a comment to clarify the synchronization properties.

Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5fedabf upstream.

The chip can change a packet's descriptor status flags at any time.
However, an active interrupt flag gets cleared rather late. This
allows a race condition that could theoretically lose an interrupt.
Fix this by clearing asserted interrupt flags immediately.

Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e3885f5 upstream.

The driver accesses descriptor memory which is simultaneously accessed by
the chip, so the compiler must not be allowed to re-order CPU accesses.
sonic_buf_get() used 'volatile' to prevent that. sonic_buf_put() should
have done so too but was overlooked.

Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 427db97 upstream.

The tx_aborted_errors statistic should count packets flagged with EXD,
EXC, FU, or BCM bits because those bits denote an aborted transmission.
That corresponds to the bitmask 0x0446, not 0x0642. Use macros for these
constants to avoid mistakes. Better to leave out FIFO Underruns (FU) as
there's a separate counter for that purpose.

Don't lump all these errors in with the general tx_errors counter as
that's used for tx timeout events.

On the rx side, don't count RDE and RBAE interrupts as dropped packets.
These interrupts don't indicate a lost packet, just a lack of resources.
When a lack of resources results in a lost packet, this gets reported
in the rx_missed_errors counter (along with RFO events).

Don't double-count rx_frame_errors and rx_crc_errors.

Don't use the general rx_errors counter for events that already have
special counters.

Fixes: 1da177e ("Linux-2.6.12-rc2")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9e31182 upstream.

The SONIC can sometimes advance its rx buffer pointer (RRP register)
without advancing its rx descriptor pointer (CRDA register). As a result
the index of the current rx descriptor may not equal that of the current
rx buffer. The driver mistakenly assumes that they are always equal.
This assumption leads to incorrect packet lengths and possible packet
duplication. Avoid this by calling a new function to locate the buffer
corresponding to a given descriptor.

Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit eaabfd1 upstream.

The while loop in sonic_rx() traverses the rx descriptor ring. It stops
when it reaches a descriptor that the SONIC has not used. Each iteration
advances the EOL flag so the SONIC can keep using more descriptors.
Therefore, the while loop has no definite termination condition.

The algorithm described in the National Semiconductor literature is quite
different. It consumes descriptors up to the one with its EOL flag set
(which will also have its "in use" flag set). All freed descriptors are
then returned to the ring at once, by adjusting the EOL flags (and link
pointers).

Adopt the algorithm from the datasheet as it's simpler, terminates quickly
and avoids a lot of pointless descriptor EOL flag changes.

Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 94b1663 upstream.

After sonic_tx_timeout() calls sonic_init(), it can happen that
sonic_rx() will subsequently encounter a receive descriptor with no
flags set. Remove the comment that says that this can't happen.

When giving a receive descriptor to the SONIC, clear the descriptor
status field. That way, any rx descriptor with flags set can only be
a newly received packet.

Don't process a descriptor without the LPKT bit set. The buffer is
still in use by the SONIC.

Fixes: 1da177e ("Linux-2.6.12-rc2")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 89ba879 upstream.

As soon as the driver is finished with a receive buffer it allocs a new
one and overwrites the corresponding RRA entry with a new buffer pointer.

Problem is, the buffer pointer is split across two word-sized registers.
It can't be updated in one atomic store. So this operation races with the
chip while it stores received packets and advances its RRP register.
This could result in memory corruption by a DMA write.

Avoid this problem by adding buffers only at the location given by the
RWP register, in accordance with the National Semiconductor datasheet.

Re-factor this code into separate functions to calculate a RRA pointer
and to update the RWP.

Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3f4b7e6 upstream.

Make sure the SONIC's DMA engine is idle before altering the transmit
and receive descriptors. Add a helper for this as it will be needed
again.

Fixes: 1da177e ("Linux-2.6.12-rc2")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 27e0c31 upstream.

There are several issues relating to command register usage during
chip initialization.

Firstly, the SONIC sometimes comes out of software reset with the
Start Timer bit set. This gets logged as,

    macsonic macsonic eth0: sonic_init: status=24, i=101

Avoid this by giving the Stop Timer command earlier rather than later.

Secondly, the loop that waits for the Read RRA command to complete has
its break condition inverted. That's why the for loop always runs to its
iteration limit. Call the helper for this instead.

Finally, give the Receiver Enable command after clearing interrupts,
not before, to avoid the possibility of losing an interrupt.

Fixes: 1da177e ("Linux-2.6.12-rc2")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 772f664 upstream.

Section 4.3.1 of the datasheet says,

    This bit [TXP] must not be set if a Load CAM operation is in
    progress (LCAM is set). The SONIC will lock up if both bits are
    set simultaneously.

Testing has shown that the driver sometimes attempts to set LCAM
while TXP is set. Avoid this by waiting for command completion
before and after giving the LCAM command.

After issuing the Load CAM command, poll for !SONIC_CR_LCAM rather than
SONIC_INT_LCD, because the SONIC_CR_TXP bit can't be used until
!SONIC_CR_LCAM.

When in reset mode, take the opportunity to reset the CAM Enable
register.

Fixes: 1da177e ("Linux-2.6.12-rc2")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 686f85d upstream.

Section 5.5.3.2 of the datasheet says,

    If FIFO Underrun, Byte Count Mismatch, Excessive Collision, or
    Excessive Deferral (if enabled) errors occur, transmission ceases.

In this situation, the chip asserts a TXER interrupt rather than TXDN.
But the handler for the TXDN is the only way that the transmit queue
gets restarted. Hence, an aborted transmission can result in a watchdog
timeout.

This problem can be reproduced on a congested link, as that can result in
excessive transmitter collisions. Another way to reproduce this is with
a FIFO Underrun, which may be caused by DMA latency.

In the event of a TXER interrupt, prevent a watchdog timeout by restarting
transmission.

Fixes: 1da177e ("Linux-2.6.12-rc2")
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e5e884b upstream.

add_ie_rates() copies rates from the bss descriptor received from a
remote AP without checking the length. When the victim connects to a
remote attacker, this may trigger a buffer overflow.
lbs_ibss_join_existing() copies rates from the bss descriptor received
from a remote IBSS node without checking the length. When the victim
connects to a remote attacker, this may trigger a buffer overflow.
Fix them by putting the length check before performing the copy.

This fix addresses CVE-2019-14896 and CVE-2019-14897.
This also fixes a build warning about mixed declarations and code.
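
A minimal sketch of the pattern (clamping an attacker-controlled length to the destination size before copying); names here are illustrative, not the libertas code:

```c
#include <string.h>

/* Copy at most dst_len rate bytes, regardless of how long the remote
 * AP/IBSS node claims its rates element is. */
size_t copy_rates(unsigned char *dst, size_t dst_len,
                  const unsigned char *src, size_t src_len)
{
    size_t n = src_len < dst_len ? src_len : dst_len;

    memcpy(dst, src, n);
    return n;
}
```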

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Wen Huang <huangwenabc@gmail.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ee8951e upstream.

v4l2_vbi_format, v4l2_sliced_vbi_format and v4l2_sdr_format
have a reserved array at the end that should be zeroed by drivers
as per the V4L2 spec. Older drivers often do not do this, so just
handle this in the core.

Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 32c7216 upstream.

The bitmap allocation did not use full unsigned long sizes
when calculating the required size, which KASAN flagged as
slab-out-of-bounds reads in several places. The patch fixes all
of them.
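
Roughly, the issue is the classic "size the bitmap in whole unsigned longs" rule; a small user-space illustration (not the ipset code):

```c
#include <stdio.h>

#define BITS_PER_LONG    (8 * sizeof(unsigned long))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

int main(void)
{
    unsigned int nbits = 70;

    /* too small: 9 bytes, but word-wide bitmap accesses touch 16 */
    size_t wrong = (nbits + 7) / 8;

    /* correct: round up to whole longs (2 longs = 16 bytes on LP64) */
    size_t right = BITS_TO_LONGS(nbits) * sizeof(unsigned long);

    printf("wrong=%zu right=%zu\n", wrong, right);
    return 0;
}
```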

Reported-by: syzbot+fabca5cbf5e54f3fe2de@syzkaller.appspotmail.com
Reported-by: syzbot+827ced406c9a1d9570ed@syzkaller.appspotmail.com
Reported-by: syzbot+190d63957b22ef673ea5@syzkaller.appspotmail.com
Reported-by: syzbot+dfccdb2bdb4a12ad425e@syzkaller.appspotmail.com
Reported-by: syzbot+df0d0f5895ef1f41a65b@syzkaller.appspotmail.com
Reported-by: syzbot+b08bd19bb37513357fd4@syzkaller.appspotmail.com
Reported-by: syzbot+53cdd0ec0bbabd53370a@syzkaller.appspotmail.com
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8260354 upstream.

This new helper function validates that unknown family and chain type
coming from userspace do not trigger an out-of-bound array access. Bail
out in case __nft_chain_type_get() returns NULL from
nft_chain_parse_hook().

Fixes: 9370761 ("netfilter: nf_tables: convert built-in tables/chains to chain types")
Reported-by: syzbot+156a04714799b1d480bc@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit eb014de upstream.

This patch introduces a list of pending module requests. This new module
list is composed of nft_module_request objects that contain the module
name and one status field that tells if the module has been already
loaded (the 'done' field).

In the first pass, from the preparation phase, the netlink command finds
that a module is missing on this list. Then, a module request is
allocated and added to this list and nft_request_module() returns
-EAGAIN. This triggers the abort path with the autoload parameter set on
from nfnetlink, request_module() is called and the module request enters
the 'done' state. Since the mutex is released when loading modules from
the abort phase, the module list is zapped so this is iteration occurs
over a local list. Therefore, the request_module() calls happen when
object lists are in consistent state (after fulling aborting the
transaction) and the commit list is empty.

On the second pass, the netlink command will find that it already tried
to load the module, so it does not request it again and
nft_request_module() returns 0. Then, there is a look up to find the
object that the command was missing. If the module was successfully
loaded, the command proceeds normally since it finds the missing object
in place, otherwise -ENOENT is reported to userspace.

This patch also updates nfnetlink to include the reason to enter the
abort phase, which is required for this new autoload module rationale.

Fixes: ec7470b ("netfilter: nf_tables: store transaction list locally while requesting module")
Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e21dba7 upstream.

This patch fixes 2 issues in x25_connect():

1. It makes absolutely no sense to reset the neighbour and the
connection state after a (successful) nonblocking call of x25_connect.
This prevents any connection from being established, since the response
(call accept) cannot be processed.

2. Any further calls to x25_connect() while a call is pending should
simply return, instead of creating new Call Request (on different
logical channels).

This patch should also fix the "KASAN: null-ptr-deref Write in
x25_connect" and "BUG: unable to handle kernel NULL pointer dereference
in x25_connect" bugs reported by syzbot.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Reported-by: syzbot+429c200ffc8772bfe070@syzkaller.appspotmail.com
Reported-by: syzbot+eec0c87f31a7c3b66f7b@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 22cc6b7 upstream.

USB completion handlers are called in atomic context and must
specifically not allocate memory using GFP_KERNEL.
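
A hedged kernel-style fragment of the rule being applied (illustrative, not the btusb source):

```c
#include <linux/skbuff.h>
#include <linux/usb.h>

/* A URB completion handler runs in atomic context, so any allocation
 * must use GFP_ATOMIC (or be deferred); GFP_KERNEL may sleep and is
 * therefore not allowed here. */
static void example_rx_complete(struct urb *urb)
{
    struct sk_buff *skb;

    skb = alloc_skb(urb->actual_length, GFP_ATOMIC);
    if (!skb)
        return;     /* drop; we cannot sleep waiting for memory */
    /* ... copy urb->transfer_buffer into skb and queue it ... */
}
```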

Fixes: a1c49c4 ("Bluetooth: btusb: Add protocol support for MediaTek MT7668U USB devices")
Cc: stable <stable@vger.kernel.org>     # 5.3
Cc: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b73e05a upstream.

Make sure to use the current alternate setting when verifying the
interface descriptors to avoid binding to an invalid interface.

Failing to do so could cause the driver to misbehave or trigger a WARN()
in usb_submit_urb() that kernels with panic_on_warn set would choke on.

Fixes: 9afac70 ("orinoco: add orinoco_usb driver")
Cc: stable <stable@vger.kernel.org>     # 2.6.35
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
yosh1k104 and others added 22 commits February 5, 2020 21:22
[ Upstream commit 59fb9b6 ]

This patch applies new flag (FLOW_DISSECTOR_KEY_PORTS_RANGE) and
field (tp_range) to BPF flow dissector to generate appropriate flow
keys when classified by specified port ranges.

Fixes: 8ffb055 ("cls_flower: Fix the behavior using port ranges with hw-offload")
Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200117070533.402240-2-komachi.yoshiki@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a4a8d28 ]

dm-thin uses struct pool to hold the state of the pool. There may be
multiple pool_c's pointing to a given pool, each pool_c represents a
loaded target. pool_c's may be created and destroyed arbitrarily and the
pool contains a reference count of pool_c's pointing to it.

Since commit 694cfe7 ("dm thin: Flush data device before
committing metadata") a pointer to pool_c is passed to
dm_pool_register_pre_commit_callback and this function stores it in
pmd->pre_commit_context. If this pool_c is freed, but pool is not
(because there is another pool_c referencing it), we end up in a
situation where pmd->pre_commit_context structure points to freed
pool_c. It causes a crash in metadata_pre_commit_callback.

Fix this by moving the dm_pool_register_pre_commit_callback() from
pool_ctr() to pool_preresume(). This way the in-core thin-pool metadata
is only ever armed with callback data whose lifetime matches the
active thin-pool target.

It should be noted that this fix preserves the ability to load a
thin-pool table that uses a different data block device (that contains
the same data) -- though it is unclear if that capability is still
useful and/or needed.

Fixes: 694cfe7 ("dm thin: Flush data device before committing metadata")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c3314a7 ]

Commit 800d3f5 ("perf report: Add warning when libunwind not
compiled in") breaks the s390 platform. S390 uses libdw-dwarf-unwind for
call chain unwinding and had no support for libunwind.

So the warning "Please install libunwind development packages during the
perf build." caused the confusion even if the call-graph is displayed
correctly.

This patch adds checking for HAVE_DWARF_SUPPORT, which is set when
libdw-dwarf-unwind is compiled in.

Fixes: 800d3f5 ("perf report: Add warning when libunwind not compiled in")
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Reviewed-by: Thomas Richter <tmricht@linux.ibm.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20200107191745.18415-1-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dfe9aa2 ]

If we get here after successfully adding the page to the list, err would be 1 to
indicate the page is queued in the list.

Current code has two problems:

  * on success, 0 is not returned
  * on error, if add_page_for_migration() returns 1, and the following err1
    from do_move_pages_to_node() is set, err1 is not returned since err
    is 1

And these behaviors break the user interface.

Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
Fixes: e0153fc ("mm: move_pages: return valid node id in status if the page is already on the target node").
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
…tion order

[ Upstream commit 8ce1cbd ]

The code which checks the return value for snd_soc_add_dai_link() call
in soc_tplg_fe_link_create() moved the snd_soc_add_dai_link() call before
link->dobj members initialization.

While it does not affect the latest kernels, the old soc-core.c code
in the stable kernels is affected. The snd_soc_add_dai_link() function uses
the link->dobj.type member to check, if the link structure is valid.

Reorder the link->dobj initialization to make things work again.
It's harmless for the recent code (and the structure should be properly
initialized before other calls anyway).

The problem is in stable linux-5.4.y since version 5.4.11 when the
upstream commit 76d2703 was applied.

Fixes: 76d2703 ("ASoC: topology: Check return value for snd_soc_add_dai_link()")
Cc: Dragos Tarcatu <dragos_tarcatu@mentor.com>
Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Cc: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jaroslav Kysela <perex@perex.cz>
Link: https://lore.kernel.org/r/20200122190752.3081016-1-perex@perex.cz
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c5dcf8f ]

This reverts commit f170d44.

USB core will never call a USB-driver probe function with a NULL
device-id pointer.

Reverting before removing the existing checks in order to document this
and prevent the offending commit from being "autoselected" for stable.

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b61387c ]

Commit 99c9a92 ("tracing/uprobe: Fix double perf_event
linking on multiprobe uprobe") moved trace_uprobe_filter on
trace_probe_event. However, since it introduced a flexible
data structure with char array and type casting, the
alignment of trace_uprobe_filter can be broken.

This changes the type of the array to the trace_uprobe_filter
data structure to fix it.
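
A small illustration of the alignment hazard being described (not the tracing code itself):

```c
#include <stdalign.h>
#include <stdio.h>

/* A char[] member only guarantees byte alignment, so casting it to a
 * structure pointer can yield a misaligned object; declaring the member
 * with its real type restores the natural alignment. */
struct filter {
    long owner;
    long nr_systemwide;
};

struct event_bad {
    int refcnt;
    char data[sizeof(struct filter)];   /* alignment of 1 */
};

struct event_good {
    int refcnt;
    struct filter data;                 /* alignment of long */
};

int main(void)
{
    printf("char[] member alignment: %zu\n", alignof(char));
    printf("typed member alignment:  %zu\n", alignof(struct filter));
    printf("sizeof bad=%zu good=%zu\n",
           sizeof(struct event_bad), sizeof(struct event_good));
    return 0;
}
```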

Link: http://lore.kernel.org/r/20200120124022.GA14897@hirez.programming.kicks-ass.net
Link: http://lkml.kernel.org/r/157966340499.5107.10978352478952144902.stgit@devnote2

Fixes: 99c9a92 ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Change-Id: Ib206bb953b7ac42ca1dae0691bdafc33f0c8415c
Signed-off-by: Tom Musta <tom.musta@intel.com>
Signed-off-by: Rolf Riesen <rolf.riesen@intel.com>
Signed-off-by: Andrew Tauferner <andrew.t.tauferner@intel.com>
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Signed-off-by: Evan Powers <evan.powers@intel.com>
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
master: Inject Arbitrary Events

Rather than hardwire the mOS test event, the injection interface is
updated to support injecting an arbitrary event.  The first token
of the string written is interpreted as a message ID.

master: Update to version 0.8

master: Burst of errors in RAS injection

The return code from writing to the injection file is zero in
the nominal case but must actually be the number of bytes written.
This results in a burst of duplicated console messages for any
clients that attempt to retry, like 'echo'.

master: unit-test: Add coverage for RAS injection

master: Adding license to .gitignore

master: Bad Pointer in RAS Injection Path

There is a subtle bug in the sysfs write handler for the
RAS injection path (/sys/kernel/mOS/ras/inject).  The
written string is duplicated, but then the pointer is
altered by strsep().  Consequently, the pointer passed
to kfree() is not what was obtained from kstrdup().

master: LWK partition precise memory designation

This commit, combined with associated commits in the mos-core
and lwkmem branches will enable the ability to fail a partition
create if the requested memory designation cannot be honored.
Prior to this change, the value provided for the requested memory
designation was always treated as an upper limit. If that amount
of memory could not be designated for the LWK partition, whatever
amount of memory available would be designated and the command
would succeed with a return code of 0 and no indication of a
problem other than three RAS messages. There was no immediate
way for the caller to know if all the requested memory was
designated. A new option has been added on the lwkctl command:
'--precise <yes/no>'. If the value specified for this option is
'yes', and if the requested memory designation cannot be
satisfied, the command will write an error message to stderr,
return a non-zero return code, and generate 'failure' RAS. The
default behavior when creating a partition using the lwkctl
command remains unchanged at this time ('--precise no' behavior).
The RAS messages were modified to generate 'warning' level RAS
if the requested memory designation was not completely
satisfied when '--precise no' was requested. The RAS messages
were modified to generate 'failure' level messages with the
control action of setting 'node in error' if the requested
memory was not available when '--precise yes' was requested.

master: Return memory if error configuring CPUs

When an LWK partition is created using the 'lwkctl -c'
command and an error is encountered during the configuring
of the LWK CPUs, return memory to Linux that has been taken
by mOS for use in the LWK.

master: Undo IRQ affinity save and restore patch

Until kernel 4.13 there was no method to restore the
affinities of managed IRQs when a CPU is onlined from
an offlined state. As a result, those managed IRQs
would end up on other online CPUs after being forcefully
migrated away from a CPU being offlined, and were never
re-affinitized back to the CPU where they originally
belonged.

Due to this kernel limitation mOS needed to have a
mechanism to save and restore affinities of IRQs on
CPUs which it used as LWKCPUs during LWK partition
creation and later gave back to Linux upon deletion
of the LWK partition.

This problem was fixed in 4.13 kernel where Linux
introduced a new CPU hotplug state that restores the
affinity of managed IRQs. In mOS we skip this CPU
hotplug step while booting a CPU as LWKCPU. As a
result those IRQs are never re-affinitized when a CPU
was booted as an LWKCPU. Later when the CPU is handed
over to Linux it restores the IRQ affinity on that CPU
using the new CPU hotplug state.

With the introduction of this new CPU hotplug state,
mOS no longer needs the explicit IRQ save and restore
mechanism. This patch undoes the changes that implemented
that mechanism.

master: Deny attempts to affinitize IRQs to only LWKCPUs

If an attempt is made to affinitize an IRQ to only
LWKCPUs then return EINVAL without changing the
current affinity mask of the IRQ.

master: Convert RAS message to warning for lwkmem_static

If lwkmem_static is set and if a user specifies an lwkmem=
specification during partition creation then the kernel
currently prints a RAS error message. This message was
originally intended to be a warning and not an error, since
static lwkmem is a debug option and not a standard one.

We do not want the control system to kill the compute node
when this condition occurs. So this patch converts the message
to just warning instead of error.

master: unit-test: Clean out RAS sysfs Upon Test Completion

Security policy file needed for SDL

As part of the Security Development Lifecycle (SDL) we need to set
and publish a policy on how security vulnerabilities in mOS can be
reported and how we announce these vulnerabilities and fixes for them.

The content of this file will show up at

    https://github.com/intel/mOS/security/policy

master: Fix kernel stack overflow for large MAX_NUMNODES

When MAX_NUMNODES is set to a very large number, for example
by setting CONFIG_NODES_SHIFT to 10, it results in a kernel
stack overflow. This patch provides a fix for such a
configuration.

master: Fix returning un-initialized NUMA array

Change-Id: I9d00f64ff9ce441a6af56482074897344ad7e452
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
Signed-off-by: Tom Musta <tom.musta@intel.com>
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Change-Id: Ia3c4a3f6c3d2ddf77022627fb23c65ca377a9ab9
Signed-off-by: Tom Musta <tom.musta@intel.com>
Signed-off-by: Rolf Riesen <rolf.riesen@intel.com>
Signed-off-by: Evan Powers <evan.powers@intel.com>
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
Add support for the mOS memory management subsystem.

Change-Id: I59e858eb261ae9958d81d6c4c76dffa4edab05d9
Signed-off-by: Rolf Riesen <rolf.riesen@intel.com>
Signed-off-by: Tom Musta <tom.musta@intel.com>
Signed-off-by: Andrew Tauferner <andrew.t.tauferner@intel.com>
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
Signed-off-by: Evan Powers <evan.powers@intel.com>
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Change-Id: Ibd453b771e3779a9154a792f545d89502e8e8345
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Signed-off-by: Tom Musta <tom.musta@intel.com>
Signed-off-by: Evan Powers <evan.powers@intel.com>
Signed-off-by: Rolf Riesen <rolf.riesen@intel.com>
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
lwksched: Syscall list now indicates remote

Previously the exception list indicated which syscalls
we want to remain local. We are reversing the default behavior
so that syscalls remain local unless they show up on the
exception list. The exception list now indicates which
syscalls we want to ship. Initially this will be set to an
empty list. More analysis will be necessary to determine
if there are syscalls that would be beneficial to ship.

lwksched: Round robin behavior changes

When round robin scheduling is enabled and there is one thread
or fewer on that CPU, do not enable the timeslice scheduler
tick. Also make round robin scheduling the default behavior
since, when there is no overcommitment, this policy does not
introduce any additional overhead or noise over the previous
default of FIFO scheduling. By changing the default, we will
allow a wider range of applications that do overcommit CPUs to
run without hanging and without needing to specify special YOD
options. The default timeslice will be 100ms and can be
adjusted using the yod option 'lwksched-enable-rr=<ms>'. The
minimum value supported is 10ms. If an application that is
overcommitting CPUs wishes to not allow time-based preemption
(expected to be very rare), the existing yod option to control
round robin behavior can be used to turn off time-slicing on an
overcommitted CPU by specifying the yod option
'--lwksched-enable-rr=0'.

lwksched: Unit test changes for round-robin change

The unit tests were updated to support the change in
default scheduler behavior with regard to FIFO versus
round-robin scheduling.

Change-Id: I48c1f69f27cb5da68866404a0e9141b79749ca4c
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Introduce support for a new resource reservation syntax and
mechanism:

	yod -R file:<file> ...

When specified, the file may be used to map resources for a
specific MPI rank.  The file contains a list of
lines like this:

      <local-rank> <resource-arg>[ ...]

where <local-rank> is the MPI rank (or wildcard) and <resource-arg>
is a yod resource argument (CPUs, cores, memory or resource).

Here is an example:

     # Map rank 0 to CPU 5 and use 1GiB of memory:
     0 -c 5 -M 1G
     # Map rank 1 to CPU 20 and use 1/4 of designated memory:
     1 --cpus 20 -M 1/4
     # An optional fall-thru wildcard (should be last)
     * -C 1 -M 1G

The motivation for doing this is for MPI primitive measurements
using OSU benchmarks and MPICH.  There is a requirement to
pin ranks to specific CPUs in order to measure intra-node
latencies (intra-socket and inter-socket) as well as inter-node
latencies.  Certain CPUs have been deemed to be interesting
based on their proximity to NICs.

For now, only a subset of yod arguments is supported.

mos-core: unit-test: Add unit tests for Rank-to-Resource Mapping File

mos-core: Balance memory with lwkmem=auto

The lwkctl command supports auto configuration of
memory and CPUs. This is done using the keyword=<value>
pairs: lwkcpus=auto lwkmem=auto. This automatic
configuration functionality is meant to create an LWK
partition with a reasonable designation of resources
for the running of a typical HPC application in the LWK,
based on the physical topology of the system node. A
typical HPC application will contain multiple ranks
(processes) executing on each system node. For best
performance, it is advantageous to have each rank contain
memory isolated to a single numa domain. If a fractional resource
value is specified to YOD at process launch time (typical), YOD
can accomplish this reservation isolation if the memory resources
in the LWK partition are evenly balanced across the numa
domains and the divisor is a multiple of the number of numa domains.
If the designated numa domain sizes are not balanced within the
LWK partition, YOD's division of the available memory resources will
result in one or more processes containing memory from multiple
domains which can introduce performance degradation.

When creating a partition using the lwkctl command and
specifying lwkmem=auto, this change provides a balanced memory
designation across like-sized numa domains. For example
on KNL with HBM and DDR running in SNC-4, each of the 4
DDR NUMA domains will have the same LWK memory designation
and each of the 4 HBM memory domains will have the same
LWK memory designation. The value chosen will be based on
the like-sized domain that has the least amount of memory
available to be moved into the LWK partition. An option is
provided to revert to the previous behavior and give the
maximum possible memory to the LWK partition without regards
to balancing.

To provide balanced allocations:
lwkctl -c "lwkcpus=auto lwkmem=auto"

To give maximum memory possible to LWK:
lwkctl -c "lwkcpus=auto lwkmem=auto:max"

If you set the verbosity level to 4 on the lwkctl command, there will be
debug output provided related to the balancing actions. For example
on SKL-10, with balancing on, you may see:

[lwkctl:42680] Begin Numa domain balancing.
[lwkctl:42680] Numa domain balancing: Node 1 has 86G available but
limited to 85G by node 0.
[lwkctl:42680] End Numa domain balancing. Lwk memory: 170G. Sacrificed
1G of potential LWK memory.

Note that the default "auto" behavior is to do balancing because
this is considered to produce a partition optimized for the majority
of HPC situations. If a maximum possible memory designation is desired
without regards to having balanced numa domain designations, you can
specify "auto:max". This logic could be reversed, i.e. the code can be
modified to have "auto" designate the maximum memory and then have
"auto:balance" if 'max' is considered to be a better default behavior.

mos-core: lwkctl precise memory designation

This commit, combined with associated commits in the master
and lwkmem branches will enable the ability to fail a partition
create if the requested memory designation cannot be honored.
Prior to this change, the value provided for the requested memory
designation was always treated as an upper limit. If that amount
of memory could not be designated for the LWK partition, whatever
amount of memory available would be designated and the command
would succeed with a return code of 0 and no indication of a
problem other than three RAS messages. There was no immediate
way for the caller to know if all the requested memory was
designated. A new option has been added on the lwkctl command:
'--precise <yes/no>'. If the value specified for this option is
'yes', and if the requested memory designation cannot be
satisfied, the command will write an error message to stderr,
return a non-zero return code, and generate 'failure' RAS. The
default behavior when creating a partition using the lwkctl
command remains unchanged at this time ('--precise no' behavior).
The RAS messages were modified to generate 'warning' level RAS
if the requested memory designation was not completely
satisfied when '--precise no' was requested. The RAS messages
were modified to generate 'failure' level messages with the
control action of setting 'node in error' if the requested
memory was not available when '--precise yes' was requested.

mos-core: unit tests for lwkctl precise option

mos-core: Add Serialization Lock to lwkctl

Add a serialization lock to lwkctl, inhibiting concurrent execution
of commands that modify the partition (create and delete).  The
lock is implemented via an advisory lock on /sys/kernel/mOS.

The lwkctl utility will block waiting to acquire the lock.  The
maximum wait time defaults to 5 minutes but can be overridden via
the --timeout option.  A value of zero will block forever.
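
A rough sketch of how such an advisory lock can be taken from user space (illustrative only; the actual lwkctl code and its timeout handling may differ):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/kernel/mOS", O_RDONLY);

    if (fd < 0) {
        perror("open /sys/kernel/mOS");
        return 1;
    }
    /* LOCK_EX blocks until any other lwkctl instance releases the lock;
     * a timeout could be layered on top with LOCK_NB plus retries. */
    if (flock(fd, LOCK_EX) < 0) {
        perror("flock");
        close(fd);
        return 1;
    }
    /* ... create or delete the LWK partition while holding the lock ... */
    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}
```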

Improve Debug Data for Insufficient Resources Error

When the NUMA fit algorithm in yod cannot fulfill a request
for resources, the result is a rather unhelpful error message:
"Insufficient LWK Resources".  This can occur for a variety
of reasons, including the case where a compute node is inadvertently
double booked.

In order to help diagnose the situation, a more complete dump of
the node's LWK state is prepended to the message.  This includes designated,
reserved and requested CPUs & memory, as well as the active LWK
processes.

In support of this, the show_state() routine is cleaned up and
improved.

Additionally, the logging level YOD_QUIET is renamed to YOD_CRIT
which seems more descriptive.

mos-core: Specify non-uniform no.of utility threads per rank

This feature enables the user to specify a non-uniform number of
utility threads across the ranks of a job. It extends the -R file:map_file
argument of yod to allow specifying the -u option per rank in the map_file.

The -u value specified through this map file option overrides
the number of utility threads specified through the -u yod argument.

This does not break the existing usage of -u and -R file: options,
i.e. if one specifies utility threads through -u argument and
-R file: argument does not specify utility threads in the file
then the number of utility threads specified through -u argument
is respected.

mos-core: Update packaging

mos-core: stop and start irqbalance daemon in lwkctl

The irqbalance daemon reads /sys/../online and /proc/stat
to determine the number of CPUs. If CPUs are being
dynamically hotplugged (as in lwkctl), then irqbalance
could see an inconsistent number of online CPUs between
the two reads of sysfs and procfs. To avoid this
inconsistent view of online CPUs, lwkctl needs to stop
irqbalance while a partition creation or deletion is in
progress.

irqbalance also sets the affinity of user-managed IRQs.
But when an LWK partition is being created or is present,
irqbalance should not consider LWKCPUs for balancing
the IRQs. In order to achieve this,
  a. we stop the irqbalance daemon while an LWK
     partition is being created or deleted, and restart
     the daemon after the LWK partition creation/deletion
     is complete.
  b. when an LWK partition is created, we set the
     irqbalance daemon's environment variable
     IRQBALANCE_BANNED_CPUS before starting it. This
     ensures that irqbalance ignores LWKCPUs when
     balancing IRQs.

mos-core: Modified lwkctl tests to be topology aware

The lwkctl unit tests now discover the topology of the
hardware being tested and generate test LWK partitions
that are more realistic in usage.

mos-core: test_precise_yes_exceed should consider lwkmem_static

This test would succeed with an LWK partition created when
lwkmem_static is set on the kernel command line. When LWK
memory is static the kernel ignores the lwkmem= specification
and prints a RAS warning that the lwkmem partition is static.

This patch amends the test case to check for lwkmem_static
before flagging the test as failed.

mos-core: lwkctl: Block Partition Creation/Deletion If Jobs Are Active

If a job is deemed to be active (as seen by the RAS subsystem), then
inhibit the deletion or creation of a partition.  This behavior
may be overridden via a command line option.

Ref: JIRA mOS-1488

mos-core: unit-test: Test Partitioning with Busy Job State

Change-Id: I9150f8e4435d8d18e89e53b9112251ebf256ab93
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
Signed-off-by: Tom Musta <tom.musta@intel.com>
Signed-off-by: Andrew Tauferner <andrew.t.tauferner@intel.com>
lwkmem_mutex is meant to be used only for serializing
access to global lwkmem resources. There are no such
accesses in next_lwkmem_address(), so it doesn't need
to grab global lwkmem_mutex. There is no per process
serialization necessary at this point either since
mm->mmap_sem is already acquired before the caller
calls this function.

lwkmem: Make all LWKMEM regions PROT_EXEC by default

With this patch all LWKMEM regions are rwx by default.
Additionally a yod option '-o lwkmem-prot-exec-disable'
is provided which when specified doesn't make LWKMEM
regions executable by default.

lwkmem : Fix GUP fast path

A defect in __get_user_pages_fast and get_user_pages_fast was
potentially causing a kernel panic and/or hang.  These functions
have been updated to properly retrieve user pages for non-LWK
processes.  __get_user_pages_fast will not retrieve LWK pages
at this time.

lwkmem: Improve RAS in the mremap Failure Scenario

Improve the RAS message content for events along the failure
path of mremap.

  1) The common path error inside of build_lwkvm is updated with the
     (hopefully) more descriptive and useful message:

          build_lwkvma: Could not insert LWK VMA at [2aaaaaac0000,2aaaaac00000) length=1310720 rc=-12

  2) The mremap RAS message now clearly identifies the address *and*
     old and new lengths:

         lwk_sys_mremap: remap failed: address=0x2aaaaaab0000 old_size=65536 new_size=2097152

     This information is useful if we need to override default
     yod behavior (--aligned-mmap).

Ref: JIRA MOS-1393

lwkmem: add lwkctl precise memory designation

This commit, combined with associated commits in the master
and mos-core branches will enable the ability to fail a partition
create if the requested memory designation cannot be honored.
Prior to this change, the value provided for the requested memory
designation was always treated as an upper limit. If that amount
of memory could not be designated for the LWK partition, whatever
amount of memory available would be designated and the command
would succeed with a return code of 0 and no indication of a
problem other than three RAS messages. There was no immediate
way for the caller to know if all the requested memory was
designated. A new option has been added on the lwkctl command:
'--precise <yes/no>'. If the value specified for this option is
'yes', and if the requested memory designation cannot be
satisfied, the command will write an error message to stderr,
return a non-zero return code, and generate 'failure' RAS. The
default behavior when creating a partition using the lwkctl
command remains unchanged at this time ('--precise no' behavior).
The RAS messages were modified to generate 'warning' level RAS
if the requested memory designation was not completely
satisfied when '--precise no' was requested. The RAS messages
were modified to generate 'failure' level messages with the
control action of setting 'node in error' if the requested
memory was not available when '--precise yes' was requested.

lwkmem: Adapt kernelcore/movablecore/movable node patch to 5.3

In 5.3 kernel Linux supports specifying kernelcore, movablecore
as percentages of total memory in addition to the previously
supported absolute byte format. This patch migrates mOS
changes in that area to adapt to the new kernel feature.

lwkmem: Adapt to new mmap flag MAP_FIXED_NOREPLACE

Linux v5.3 has MAP_FIXED_NOREPLACE, which behaves like
MAP_FIXED except that if a previous mapping exists in the
requested address range then the mmap fails with EEXIST
instead of unmapping the old mapping and creating a new one
in that range. This patch adjusts the lwkmem code to
accommodate this new flag.
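
For reference, the user-visible semantics being accommodated (a small user-space demonstration, unrelated to the LWK internals):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 4096;
    void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *b;

    if (a == MAP_FAILED)
        return 1;
    /* second fixed mapping over an already-mapped range: with
     * MAP_FIXED_NOREPLACE this fails with EEXIST instead of silently
     * replacing the first mapping */
    b = mmap(a, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (b == MAP_FAILED)
        printf("second map failed as expected: %s\n", strerror(errno));
    else
        printf("unexpectedly mapped at %p\n", b);
    return 0;
}
```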

lwkmem: Add mOS view to new entry in hugetlb

Linux v5.3 adds a new entry to meminfo from hugetlb.
This patch adds mOS view to that new entry.

lwkmem: Adapt TLB flush to 5.3

In 5.3 TLB flush functionality exposes a stride that
can be specified along with the range. This patch
re-works the 5.3 rebase to use proper stride.

lwkmem: call xpmem fault handler directly

The Linux core page fault handler allocates all higher levels
of page table hierarchy (pgd, p4d, pud, pmd) before it invokes
XPMEM fault handler. This is fine for allocating a base page in
the XPMEM fault handler, but for allocating large pages such as
a 1G page, the pmd level is not needed and shouldn't be allocated
by the Linux page fault handler before it invokes the XPMEM
page fault handler.

This patch modifies the Linux core page fault handler to
invoke registered page fault handler of the XPMEM driver
directly before allocating pmd level page table if the
faulting address is an XPMEM VMA.

Change-Id: Iec3041d9e377002bd6f2ed8527a05c0180869e70
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Signed-off-by: Tom Musta <tom.musta@intel.com>
Change-Id: I4a064ca2811951fbcd6cc02e921aabefa05cab70
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Change-Id: Iada3b62f9da65ec8a29ae714b3320c6ad139a5e0
Signed-off-by: Andrew Tauferner <andrew.t.tauferner@intel.com>
When CPUs are overcommitted in an LWK partition, the
additional threads were being assigned to the CPU of the thread that
created the new threads. For example, if 10 CPUs were in the
reservation for the process and 100 threads were created by the
main thread of the process, there would be 1 thread running on each
of the upper 9 CPUs and 91 threads all trying to all run on
the first CPU of the reservation. This mOS problem was
introduced during a recent rebase in which Linux changed the clone
system call code flow.

Change-Id: I068dd29ebdb8dc195798a68ef268f5719247e59d
Signed-off-by: John Attinella <john.e.attinella@intel.com>
The mOS scheduler has a spin lock in its CPU-scoped run-queue
object. This spin lock is obtained when we are committing and
un-committing a thread to a specific CPU. All calls to the spin lock must
occur when interrupts are disabled. However, in the thread exit path
interrupts are enabled. In this exit path we are un-committing the
exiting thread. While we held the spin lock, a scheduler timer tick
fired. This fired because we have over-committed threads on this CPU (we
disable the timer tick if we are not over-committed). The timer tick
processing drove us through the mOS code to wake up and dispatch another
thread. This flow attempts to obtain the spin lock to commit the thread
to this run-queue, resulting in a deadlock. The fix is to use a more
robust spin-lock interface to lock and unlock which guarantees that
interrupts are always disabled while the spin-lock is held.
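
A hedged kernel-style fragment of the kind of locking being described (names are illustrative, not the mOS scheduler source):

```c
#include <linux/spinlock.h>

struct example_rq {
    spinlock_t lock;
    /* ... per-CPU run-queue state ... */
};

/* The irqsave variant disables local interrupts for the critical section,
 * so a scheduler timer tick cannot fire and try to re-acquire the same
 * run-queue lock underneath us. */
static void uncommit_thread(struct example_rq *rq)
{
    unsigned long flags;

    spin_lock_irqsave(&rq->lock, flags);
    /* ... remove the exiting thread from this run queue ... */
    spin_unlock_irqrestore(&rq->lock, flags);
}
```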

Change-Id: I6b296abc4a78c2b143973a27ae1e0e12b996904f
Signed-off-by: John Attinella <john.e.attinella@intel.com>

@rolfriesen left a comment

Hello, @RudraSwat thanks for the suggested change. This is the original README that comes with the Linux kernel. We did not change it because the vast majority of code here is still Linux and this README refers to that.
OTOH it is a little confusing that it talks about Linux on the main Code page for mOS ;-)
If we decide to change it, maybe we should rename the current one to README.linux and bring in the mOS README from https://github.com/intel/mOS/wiki/mOS-for-HPC-v0.8-Readme.
What do you think?

@RudraSwat (Author)

@rolfriesen Sorry for the late reply. Yes, I guess we can rename the current README and move https://github.com/intel/mOS/wiki/mOS-for-HPC-v0.8-Readme to the README file (and make it plaintext-friendly).

@rolfriesen

Hello @RudraSwat we just pushed a new release of mOS and updated the top-level README to be about mOS instead of Linux. Thanks for that suggestion.

atauferner pushed a commit to atauferner/mOS that referenced this pull request Jul 26, 2023
[ Upstream commit 99d4850 ]

Found by leak sanitizer:
```
==1632594==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 21 byte(s) in 1 object(s) allocated from:
    #0 0x7f2953a7077b in __interceptor_strdup ../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
    #1 0x556701d6fbbf in perf_env__read_cpuid util/env.c:369
    #2 0x556701d70589 in perf_env__cpuid util/env.c:465
    #3 0x55670204bba2 in x86__is_amd_cpu arch/x86/util/env.c:14
    #4 0x5567020487a2 in arch__post_evsel_config arch/x86/util/evsel.c:83
    #5 0x556701d8f78b in evsel__config util/evsel.c:1366
    #6 0x556701ef5872 in evlist__config util/record.c:108
    #7 0x556701cd6bcd in test__PERF_RECORD tests/perf-record.c:112
    #8 0x556701cacd07 in run_test tests/builtin-test.c:236
    #9 0x556701cacfac in test_and_print tests/builtin-test.c:265
    #10 0x556701cadddb in __cmd_test tests/builtin-test.c:402
    #11 0x556701caf2aa in cmd_test tests/builtin-test.c:559
    #12 0x556701d3b557 in run_builtin tools/perf/perf.c:323
    #13 0x556701d3bac8 in handle_internal_command tools/perf/perf.c:377
    #14 0x556701d3be90 in run_argv tools/perf/perf.c:421
    #15 0x556701d3c3f8 in main tools/perf/perf.c:537
    #16 0x7f2952a46189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 21 byte(s) leaked in 1 allocation(s).
```
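
The leak is a strdup()'d cpuid string with no matching free(). A small
stand-alone sketch of the ownership pattern (not the actual perf code;
env_sketch and the function names here are illustrative):

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative long-lived environment object holding a strdup'd string. */
struct env_sketch {
	char *cpuid;
};

static int env_read_cpuid(struct env_sketch *env, const char *cpuid)
{
	free(env->cpuid);		/* drop any previously read value */
	env->cpuid = strdup(cpuid);	/* this allocation must be released */
	return env->cpuid ? 0 : -1;
}

static void env_exit(struct env_sketch *env)
{
	free(env->cpuid);		/* a missing release like this is the 21-byte leak */
	env->cpuid = NULL;
}
```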

Fixes: f7b58cb ("perf mem/c2c: Add load store event mappings for AMD")
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20230613235416.1650755-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
atauferner pushed a commit to atauferner/mOS that referenced this pull request Jul 26, 2023
[ Upstream commit b684c09 ]

ppc_save_regs() skips one stack frame while saving the CPU register states.
Instead of saving current R1, it pulls the previous stack frame pointer.

When vmcores caused by a direct panic call (such as `echo c >
/proc/sysrq-trigger`) are debugged with gdb, gdb fails to show the
backtrace correctly. On further analysis, it was found that this is
because of a mismatch between r1 and the NIP.

GDB uses the NIP to get the current function symbol and uses that
function's debug info to unwind previous frames, but because of the
mismatching r1 and NIP, the unwinding does not work: it fails to
unwind to the 2nd frame and hence does not show the backtrace.

GDB backtrace with vmcore of kernel without this patch:

---------
(gdb) bt
 #0  0xc0000000002a53e8 in crash_setup_regs (oldregs=<optimized out>,
    newregs=0xc000000004f8f8d8) at ./arch/powerpc/include/asm/kexec.h:69
 #1  __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:974
 #2  0x0000000000000063 in ?? ()
 #3  0xc000000003579320 in ?? ()
---------

Further analysis revealed that the mismatch occurred because
"ppc_save_regs" was saving the previous stack's SP instead of the current
r1. This patch fixes this by storing current r1 in the saved pt_regs.

GDB backtrace with vmcore of patched kernel:

--------
(gdb) bt
 #0  0xc0000000002a53e8 in crash_setup_regs (oldregs=0x0, newregs=0xc00000000670b8d8)
    at ./arch/powerpc/include/asm/kexec.h:69
 #1  __crash_kexec (regs=regs@entry=0x0) at kernel/kexec_core.c:974
 #2  0xc000000000168918 in panic (fmt=fmt@entry=0xc000000001654a60 "sysrq triggered crash\n")
    at kernel/panic.c:358
 #3  0xc000000000b735f8 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:155
 #4  0xc000000000b742cc in __handle_sysrq (key=key@entry=99, check_mask=check_mask@entry=false)
    at drivers/tty/sysrq.c:602
 #5  0xc000000000b7506c in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>,
    count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1163
 #6  0xc00000000069a7bc in pde_write (ppos=<optimized out>, count=<optimized out>,
    buf=<optimized out>, file=<optimized out>, pde=0xc00000000362cb40) at fs/proc/inode.c:340
 #7  proc_reg_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>,
    ppos=<optimized out>) at fs/proc/inode.c:352
 #8  0xc0000000005b3bbc in vfs_write (file=file@entry=0xc000000006aa6b00,
    buf=buf@entry=0x61f498b4f60 <error: Cannot access memory at address 0x61f498b4f60>,
    count=count@entry=2, pos=pos@entry=0xc00000000670bda0) at fs/read_write.c:582
 #9  0xc0000000005b4264 in ksys_write (fd=<optimized out>,
    buf=0x61f498b4f60 <error: Cannot access memory at address 0x61f498b4f60>, count=2)
    at fs/read_write.c:637
 #10 0xc00000000002ea2c in system_call_exception (regs=0xc00000000670be80, r0=<optimized out>)
    at arch/powerpc/kernel/syscall.c:171
 #11 0xc00000000000c270 in system_call_vectored_common ()
    at arch/powerpc/kernel/interrupt_64.S:192
--------

Nick adds:
  So this now saves regs as though it was an interrupt taken in the
  caller, at the instruction after the call to ppc_save_regs, whereas
  previously the NIP was there, but R1 came from the caller's caller and
  that mismatch is what causes gdb's dwarf unwinder to go haywire.
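
The real fix is in powerpc assembly; purely as an illustration of the
difference (not the actual code), in C-like terms the change is from
saving the back-chain word found at 0(r1) to saving r1 itself:

```c
#include <stdint.h>

/* Conceptual illustration only; ppc_save_regs() is assembly, and these
 * names are placeholders for the relevant pt_regs fields. */
struct regs_sketch {
	uint64_t gpr1;	/* stack pointer (r1) slot */
	uint64_t nip;	/* next instruction pointer */
};

static void save_regs_sketch(struct regs_sketch *regs,
			     uint64_t r1, uint64_t return_address)
{
	/*
	 * Before the fix the saved value was effectively the back-chain
	 * word at 0(r1), i.e. the previous frame's stack pointer, which
	 * does not match the NIP recorded at the same point.
	 */
	regs->gpr1 = r1;		/* after the fix: the current r1 */
	regs->nip  = return_address;	/* instruction after the call */
}
```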

Signed-off-by: Aditya Gupta <adityag@linux.ibm.com>
Fixes: d16a58f ("powerpc: Improve ppc_save_regs()")
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230615091047.90433-1-adityag@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
atauferner pushed a commit to atauferner/mOS that referenced this pull request Aug 11, 2023
[ Upstream commit 93a3319 ]

The cited commit holds encap tbl lock unconditionally when setting
up dests. But it may cause the following deadlock:

 PID: 1063722  TASK: ffffa062ca5d0000  CPU: 13   COMMAND: "handler8"
  #0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91
  #1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb
  #2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528
  #3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb
  #4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb
  #5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core]
  #6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core]
  #7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core]
  #8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core]
  #9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core]
 #10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core]
 #11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core]
 #12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core]
 #13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8
 #14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower]
 #15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower]
 #16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0

[1031218.028143]  wait_for_completion+0x24/0x30
[1031218.028589]  mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core]
[1031218.029256]  mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core]
[1031218.029885]  process_one_work+0x24e/0x510

Actually there is no need to hold the encap tbl lock if there is no
encap action. Fix it by checking whether an encap action exists
before taking the encap tbl lock.
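
A minimal sketch of the check-before-locking pattern described above
(encap_tbl_lock, do_setup_dests() and the flag are placeholders, not
the actual mlx5 symbols):

```c
#include <linux/mutex.h>

static DEFINE_MUTEX(encap_tbl_lock);	/* stand-in for the real table lock */

static int do_setup_dests(void)
{
	/* placeholder for the actual dest setup work */
	return 0;
}

static int setup_dests_sketch(bool has_encap_action)
{
	int err;

	/*
	 * Flows without an encap action no longer serialize on the encap
	 * table, so they cannot deadlock against the FIB-event work that
	 * also needs this lock.
	 */
	if (!has_encap_action)
		return do_setup_dests();

	mutex_lock(&encap_tbl_lock);
	err = do_setup_dests();
	mutex_unlock(&encap_tbl_lock);

	return err;
}
```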

Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>