Update README #6

RudraSwat · 2020-08-12T08:19:48Z

I've replaced Linux kernel with mOS in the README and have removed references to kernel.org (which is for the Linux kernel).

commit 04060db upstream. iscsit_close_connection() calls isert_wait_conn(). Due to commit e9d3009 both functions call target_wait_for_sess_cmds() although that last function should be called only once. Fix this by removing the target_wait_for_sess_cmds() call from isert_wait_conn() and by only calling isert_wait_conn() after target_wait_for_sess_cmds(). Fixes: e9d3009 ("scsi: target: iscsi: Wait for all commands to finish before freeing a session"). Link: https://lore.kernel.org/r/20200116044737.19507-1-bvanassche@acm.org Reported-by: Rahul Kundu <rahul.kundu@chelsio.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit d0695e2 upstream. Just as commit 0566e40 ("tracing: initcall: Ordered comparison of function pointers"), this patch fixes another remaining one in xen.h found by clang-9.
  In file included from arch/x86/xen/trace.c:21:
  In file included from ./include/trace/events/xen.h:475:
  In file included from ./include/trace/define_trace.h:102:
  In file included from ./include/trace/trace_events.h:473:
  ./include/trace/events/xen.h:69:7: warning: ordered comparison of function \
  pointers ('xen_mc_callback_fn_t' (aka 'void (*)(void *)') and 'xen_mc_callback_fn_t') [-Wordered-compare-function-pointers]
      __field(xen_mc_callback_fn_t, fn)
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ./include/trace/trace_events.h:421:29: note: expanded from macro '__field'
      ^
  ./include/trace/trace_events.h:407:6: note: expanded from macro '__field_ext'
      is_signed_type(type), filter_type); \
      ^
  ./include/linux/trace_events.h:554:44: note: expanded from macro 'is_signed_type'
      ^
Fixes: c796f21 ("xen/trace: add multicall tracing") Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit b9f726c upstream. It used to be the case that if we got here, we wouldn't warn but instead allocate the queue (DQA). With using the mac80211 TXQs model this changed, and we really have nothing to do with the frame here anymore, hence the warning now. However, clearly we missed in coding & review that this is now a pure error path and leaks the SKB if we return 0 instead of an indication that the SKB needs to be freed. Fix this. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Fixes: cfbc6c4 ("iwlwifi: mvm: support mac80211 TXQs model") Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit df2378a upstream. When we transmit after TXQ dequeue, we aren't paying attention to the return value of the transmit functions, leading to a potential SKB leak. Refactor the code a bit (and rename ..._tx to ..._tx_sta) to check for this happening. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Fixes: cfbc6c4 ("iwlwifi: mvm: support mac80211 TXQs model") Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit ecc4d2a upstream. If we create a rather large userptr object(e.g 1ULL << 32) we might shift past the type-width of num_pages: (int)num_pages << PAGE_SHIFT, resulting in a totally bogus sg_table, which fortunately will eventually manifest as: gen8_ppgtt_insert_huge:463 GEM_BUG_ON(iter->sg->length < page_size) kernel BUG at drivers/gpu/drm/i915/gt/gen8_ppgtt.c:463! v2: more unsigned long prefer I915_GTT_PAGE_SIZE Fixes: 5cc9ed4 ("drm/i915: Introduce mapping of user pages into video memory (userptr) ioctl") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20200117132413.1170563-2-matthew.auld@intel.com (cherry picked from commit 8e78871) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
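The overflow described in the entry above can be reproduced in userspace. The following is a minimal sketch, not the i915 code: the 4 KiB page size and the helper names `bytes_narrow` and `bytes_wide` are assumptions made for illustration, and unsigned 32-bit arithmetic stands in for the `(int)` cast so that the wrap-around is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical helper: the page count is narrowed to 32 bits before the
 * shift, so the byte size wraps modulo 2^32. */
static uint64_t bytes_narrow(uint64_t num_pages)
{
    uint32_t narrowed = (uint32_t)num_pages;

    return (uint32_t)(narrowed << PAGE_SHIFT);
}

/* Hypothetical helper: the shift is done in 64-bit arithmetic. */
static uint64_t bytes_wide(uint64_t num_pages)
{
    /* The wide shift must not overflow either. */
    assert(num_pages <= (UINT64_MAX >> PAGE_SHIFT));
    return num_pages << PAGE_SHIFT;
}
```

For a 1ULL << 32 byte object (2^20 pages), `bytes_narrow` yields 0 while `bytes_wide` yields the intended size, which is the kind of bogus length the commit message warns about.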

commit 4e4362d upstream. Commit 9b42c1f ("xfrm: Extend the output_mark") added output_mark support but missed ESP offload support. xfrm_smark_get() is not called within xfrm_input() for packets coming from esp4_gro_receive() or esp6_gro_receive(). Therefore call xfrm_smark_get() directly within these functions. Fixes: 9b42c1f ("xfrm: Extend the output_mark to support input direction and masking.") Signed-off-by: Ulrich Weber <ulrich.weber@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 58c8db9 upstream. As John Fastabend reports [0], psock state tear-down can happen on receive path *after* unlocking the socket, if the only other psock user, that is sockmap or sockhash, releases its psock reference before tcp_bpf_recvmsg does so:
  tcp_bpf_recvmsg()
    psock = sk_psock_get(sk)                     <- refcnt 2
    lock_sock(sk);
    ...
                              sock_map_free()    <- refcnt 1
    release_sock(sk)
    sk_psock_put()                               <- refcnt 0
Remove the lockdep check for socket lock in psock tear-down that got introduced in 7e81a35 ("bpf: Sockmap, ensure sock lock held during tear down"). [0] https://lore.kernel.org/netdev/5e25dc995d7d_74082aaee6e465b441@john-XPS-13-9370.notmuch/ Fixes: 7e81a35 ("bpf: Sockmap, ensure sock lock held during tear down") Reported-by: syzbot+d73682fcf7fee6982fe3@syzkaller.appspotmail.com Suggested-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit d0cb501 upstream. may_create_in_sticky() call is done when we already have dropped the reference to dir. Fixes: 30aba66 (namei: allow restricted O_CREAT of FIFOs and regular files) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 2c6b7bc upstream. Commit 8a23eb8 ("Make filldir[64]() verify the directory entry filename is valid") added some minimal validity checks on the directory entries passed to filldir[64](). But they really were pretty minimal. This fleshes out at least the name length check: we used to disallow zero-length names, but really, negative lengths or over-long names aren't ok either. Both could happen if there is some filesystem corruption going on. Now, most filesystems tend to use just an "unsigned char" or similar for the length of a directory entry name, so even with a corrupt filesystem you should never see anything odd like that. But since we then use the name length to create the directory entry record length, let's make sure it actually is half-way sensible. Note how POSIX states that the size of a path component is limited by NAME_MAX, but we actually use PATH_MAX for the check here. That's because while NAME_MAX is generally the correct maximum name length (it's 255, for the same old "name length is usually just a byte on disk"), there's nothing in the VFS layer that really cares. So the real limitation at a VFS layer is the total pathname length you can pass as a filename: PATH_MAX. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
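The length check described in the entry above could look roughly like the sketch below. It is a userspace illustration, not the kernel's filldir code: the function name `verify_name_len`, the constant `SKETCH_PATH_MAX`, and the choice of `-EIO` as the error value are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_PATH_MAX 4096 /* assumption: stands in for PATH_MAX */

/* Hypothetical check: reject zero, negative, and over-long name lengths
 * before the length is used to size a directory entry record. */
static int verify_name_len(const char *name, int len)
{
    assert(name != NULL);

    if (len <= 0 || len >= SKETCH_PATH_MAX)
        return -EIO; /* assumption: error code chosen for illustration */
    return 0;
}
```

A length of 255 (NAME_MAX) passes, while 0, -1, and 4096 are rejected; the bound is the path limit rather than the component limit, matching the reasoning in the commit message.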

commit 865ad2f upstream. The netif_stop_queue() call in sonic_send_packet() races with the netif_wake_queue() call in sonic_interrupt(). This causes issues like "NETDEV WATCHDOG: eth0 (macsonic): transmit queue 0 timed out". Fix this by disabling interrupts when accessing tx_skb[] and next_tx. Update a comment to clarify the synchronization properties. Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 5fedabf upstream. The chip can change a packet's descriptor status flags at any time. However, an active interrupt flag gets cleared rather late. This allows a race condition that could theoretically lose an interrupt. Fix this by clearing asserted interrupt flags immediately. Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e3885f5 upstream. The driver accesses descriptor memory which is simultaneously accessed by the chip, so the compiler must not be allowed to re-order CPU accesses. sonic_buf_get() used 'volatile' to prevent that. sonic_buf_put() should have done so too but was overlooked. Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 427db97 upstream. The tx_aborted_errors statistic should count packets flagged with EXD, EXC, FU, or BCM bits because those bits denote an aborted transmission. That corresponds to the bitmask 0x0446, not 0x0642. Use macros for these constants to avoid mistakes. Better to leave out FIFO Underruns (FU) as there's a separate counter for that purpose. Don't lump all these errors in with the general tx_errors counter as that's used for tx timeout events. On the rx side, don't count RDE and RBAE interrupts as dropped packets. These interrupts don't indicate a lost packet, just a lack of resources. When a lack of resources results in a lost packet, this gets reported in the rx_missed_errors counter (along with RFO events). Don't double-count rx_frame_errors and rx_crc_errors. Don't use the general rx_errors counter for events that already have special counters. Fixes: 1da177e ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
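The point about using macros instead of a literal mask can be sketched as follows. The individual bit values are an assumption chosen to be consistent with the 0x0446 mask quoted above; the macro and function names are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed transmit status bits; their OR equals the 0x0446 mask quoted
 * in the commit message. */
#define TCR_EXD 0x0400 /* excessive deferral */
#define TCR_EXC 0x0040 /* excessive collisions */
#define TCR_FU  0x0004 /* FIFO underrun */
#define TCR_BCM 0x0002 /* byte count mismatch */

#define TCR_ABORT_BITS (TCR_EXD | TCR_EXC | TCR_FU | TCR_BCM)

/* Per the message, FIFO underruns have their own counter, so they are
 * left out of the aborted-transmission test. */
#define TCR_TX_ABORTED (TCR_EXD | TCR_EXC | TCR_BCM)

/* Named bits make a transposed literal such as 0x0642 a compile error. */
static_assert(TCR_ABORT_BITS == 0x0446, "abort mask mismatch");

static int tx_was_aborted(uint16_t status)
{
    return (status & TCR_TX_ABORTED) != 0;
}
```

Building the mask from named bits means a reviewer checks four short definitions rather than decoding a hexadecimal literal by hand.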

commit 9e31182 upstream. The SONIC can sometimes advance its rx buffer pointer (RRP register) without advancing its rx descriptor pointer (CRDA register). As a result the index of the current rx descriptor may not equal that of the current rx buffer. The driver mistakenly assumes that they are always equal. This assumption leads to incorrect packet lengths and possible packet duplication. Avoid this by calling a new function to locate the buffer corresponding to a given descriptor. Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit eaabfd1 upstream. The while loop in sonic_rx() traverses the rx descriptor ring. It stops when it reaches a descriptor that the SONIC has not used. Each iteration advances the EOL flag so the SONIC can keep using more descriptors. Therefore, the while loop has no definite termination condition. The algorithm described in the National Semiconductor literature is quite different. It consumes descriptors up to the one with its EOL flag set (which will also have its "in use" flag set). All freed descriptors are then returned to the ring at once, by adjusting the EOL flags (and link pointers). Adopt the algorithm from datasheet as it's simpler, terminates quickly and avoids a lot of pointless descriptor EOL flag changes. Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 94b1663 upstream. After sonic_tx_timeout() calls sonic_init(), it can happen that sonic_rx() will subsequently encounter a receive descriptor with no flags set. Remove the comment that says that this can't happen. When giving a receive descriptor to the SONIC, clear the descriptor status field. That way, any rx descriptor with flags set can only be a newly received packet. Don't process a descriptor without the LPKT bit set. The buffer is still in use by the SONIC. Fixes: 1da177e ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 89ba879 upstream. As soon as the driver is finished with a receive buffer it allocs a new one and overwrites the corresponding RRA entry with a new buffer pointer. Problem is, the buffer pointer is split across two word-sized registers. It can't be updated in one atomic store. So this operation races with the chip while it stores received packets and advances its RRP register. This could result in memory corruption by a DMA write. Avoid this problem by adding buffers only at the location given by the RWP register, in accordance with the National Semiconductor datasheet. Re-factor this code into separate functions to calculate a RRA pointer and to update the RWP. Fixes: efcce83 ("[PATCH] macsonic/jazzsonic network drivers update") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 3f4b7e6 upstream. Make sure the SONIC's DMA engine is idle before altering the transmit and receive descriptors. Add a helper for this as it will be needed again. Fixes: 1da177e ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 27e0c31 upstream. There are several issues relating to command register usage during chip initialization. Firstly, the SONIC sometimes comes out of software reset with the Start Timer bit set. This gets logged as, macsonic macsonic eth0: sonic_init: status=24, i=101 Avoid this by giving the Stop Timer command earlier than later. Secondly, the loop that waits for the Read RRA command to complete has the break condition inverted. That's why the for loop iterates until its termination condition. Call the helper for this instead. Finally, give the Receiver Enable command after clearing interrupts, not before, to avoid the possibility of losing an interrupt. Fixes: 1da177e ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 772f664 upstream. Section 4.3.1 of the datasheet says, This bit [TXP] must not be set if a Load CAM operation is in progress (LCAM is set). The SONIC will lock up if both bits are set simultaneously. Testing has shown that the driver sometimes attempts to set LCAM while TXP is set. Avoid this by waiting for command completion before and after giving the LCAM command. After issuing the Load CAM command, poll for !SONIC_CR_LCAM rather than SONIC_INT_LCD, because the SONIC_CR_TXP bit can't be used until !SONIC_CR_LCAM. When in reset mode, take the opportunity to reset the CAM Enable register. Fixes: 1da177e ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 686f85d upstream. Section 5.5.3.2 of the datasheet says, If FIFO Underrun, Byte Count Mismatch, Excessive Collision, or Excessive Deferral (if enabled) errors occur, transmission ceases. In this situation, the chip asserts a TXER interrupt rather than TXDN. But the handler for the TXDN is the only way that the transmit queue gets restarted. Hence, an aborted transmission can result in a watchdog timeout. This problem can be reproduced on congested link, as that can result in excessive transmitter collisions. Another way to reproduce this is with a FIFO Underrun, which may be caused by DMA latency. In event of a TXER interrupt, prevent a watchdog timeout by restarting transmission. Fixes: 1da177e ("Linux-2.6.12-rc2") Tested-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e5e884b upstream. add_ie_rates() copies rates without checking the length in bss descriptor from remote AP. When a victim connects to a remote attacker, this may trigger a buffer overflow. lbs_ibss_join_existing() copies rates without checking the length in bss descriptor from remote IBSS node. When a victim connects to a remote attacker, this may trigger a buffer overflow. Fix them by putting the length check before performing the copy. This fix addresses CVE-2019-14896 and CVE-2019-14897. This also fixes a build warning about mixed declarations and code. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Wen Huang <huangwenabc@gmail.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
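The pattern of checking a remotely supplied length before copying can be sketched in userspace as below. This is not the libertas code: `MAX_RATES`, `copy_rates`, and the -1 error return are hypothetical names and values used only to show the ordering of the check and the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_RATES 14 /* hypothetical capacity of the destination buffer */

/* Hypothetical helper: src_len comes from an untrusted peer, so it is
 * validated against the destination capacity before memcpy() runs. */
static int copy_rates(uint8_t *dst, const uint8_t *src, size_t src_len)
{
    assert(dst != NULL && src != NULL);

    if (src_len > MAX_RATES)
        return -1; /* reject instead of overflowing dst */

    memcpy(dst, src, src_len);
    return (int)src_len;
}
```

Rejecting the oversized element outright is one option; truncating to the capacity is another, and which is appropriate depends on the protocol.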

commit ee8951e upstream. v4l2_vbi_format, v4l2_sliced_vbi_format and v4l2_sdr_format have a reserved array at the end that should be zeroed by drivers as per the V4L2 spec. Older drivers often do not do this, so just handle this in the core. Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 32c7216 upstream. The bitmap allocation did not use full unsigned long sizes when calculating the required size and that was triggered by KASAN as slab-out-of-bounds read in several places. The patch fixes all of them. Reported-by: syzbot+fabca5cbf5e54f3fe2de@syzkaller.appspotmail.com Reported-by: syzbot+827ced406c9a1d9570ed@syzkaller.appspotmail.com Reported-by: syzbot+190d63957b22ef673ea5@syzkaller.appspotmail.com Reported-by: syzbot+dfccdb2bdb4a12ad425e@syzkaller.appspotmail.com Reported-by: syzbot+df0d0f5895ef1f41a65b@syzkaller.appspotmail.com Reported-by: syzbot+b08bd19bb37513357fd4@syzkaller.appspotmail.com Reported-by: syzbot+53cdd0ec0bbabd53370a@syzkaller.appspotmail.com Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
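To illustrate why sizing a bitmap in whole unsigned longs matters, here is a small userspace sketch. The commit message does not show the fix itself, so the helper names and the exact rounding are assumptions; the idea is only that word-sized bit operations read a full unsigned long, which a byte-granular allocation may not cover.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical: size rounded up to whole bytes only. */
static size_t bitmap_bytes_exact(size_t nbits)
{
    return (nbits + CHAR_BIT - 1) / CHAR_BIT;
}

/* Hypothetical: size rounded up to whole unsigned longs, so that any
 * word-sized access to the bitmap stays inside the allocation. */
static size_t bitmap_bytes_rounded(size_t nbits)
{
    /* Guard the rounding addition against wrap-around. */
    assert(nbits <= SIZE_MAX - (BITS_PER_ULONG - 1));

    return (nbits + BITS_PER_ULONG - 1) / BITS_PER_ULONG
           * sizeof(unsigned long);
}
```

For a one-bit map, the byte-granular size is 1 while the rounded size is `sizeof(unsigned long)`; a word-wide read of the former touches memory past the end of the allocation.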

commit 8260354 upstream. This new helper function validates that unknown family and chain type coming from userspace do not trigger an out-of-bound array access. Bail out in case __nft_chain_type_get() returns NULL from nft_chain_parse_hook(). Fixes: 9370761 ("netfilter: nf_tables: convert built-in tables/chains to chain types") Reported-by: syzbot+156a04714799b1d480bc@syzkaller.appspotmail.com Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
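A lookup helper that validates userspace-supplied indices before touching a table might be sketched like this. The table dimensions, the struct, and both function names are hypothetical; this is not the nf_tables implementation, only the bounds-check-then-NULL pattern the entry above describes.

```c
#include <assert.h>
#include <stddef.h>

#define NFAMILIES 4 /* hypothetical table dimensions */
#define NTYPES    3

struct chain_type {
    const char *name;
};

static const struct chain_type *chain_types[NFAMILIES][NTYPES];

/* Registration comes from trusted code, so a bad index is a bug. */
static void chain_type_register(unsigned int family, unsigned int type,
                                const struct chain_type *ct)
{
    assert(family < NFAMILIES && type < NTYPES);
    chain_types[family][type] = ct;
}

/* Lookup indices come from untrusted input, so out-of-range values
 * yield NULL instead of an out-of-bounds array read. */
static const struct chain_type *chain_type_get(unsigned int family,
                                               unsigned int type)
{
    if (family >= NFAMILIES || type >= NTYPES)
        return NULL;
    return chain_types[family][type];
}
```

Callers then treat NULL as "unknown family or type" and bail out, as the commit message describes for the parse-hook path.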

commit eb014de upstream. This patch introduces a list of pending module requests. This new module list is composed of nft_module_request objects that contain the module name and one status field that tells if the module has been already loaded (the 'done' field). In the first pass, from the preparation phase, the netlink command finds that a module is missing on this list. Then, a module request is allocated and added to this list and nft_request_module() returns -EAGAIN. This triggers the abort path with the autoload parameter set on from nfnetlink, request_module() is called and the module request enters the 'done' state. Since the mutex is released when loading modules from the abort phase, the module list is zapped so this iteration occurs over a local list. Therefore, the request_module() calls happen when object lists are in consistent state (after fully aborting the transaction) and the commit list is empty. On the second pass, the netlink command will find that it already tried to load the module, so it does not request it again and nft_request_module() returns 0. Then, there is a look up to find the object that the command was missing. If the module was successfully loaded, the command proceeds normally since it finds the missing object in place, otherwise -ENOENT is reported to userspace. This patch also updates nfnetlink to include the reason to enter the abort phase, which is required for this new autoload module rationale. Fixes: ec7470b ("netfilter: nf_tables: store transaction list locally while requesting module") Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit e21dba7 upstream. This patch fixes 2 issues in x25_connect(): 1. It makes absolutely no sense to reset the neighbour and the connection state after a (successful) nonblocking call of x25_connect. This prevents any connection from being established, since the response (call accept) cannot be processed. 2. Any further calls to x25_connect() while a call is pending should simply return, instead of creating new Call Request (on different logical channels). This patch should also fix the "KASAN: null-ptr-deref Write in x25_connect" and "BUG: unable to handle kernel NULL pointer dereference in x25_connect" bugs reported by syzbot. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Reported-by: syzbot+429c200ffc8772bfe070@syzkaller.appspotmail.com Reported-by: syzbot+eec0c87f31a7c3b66f7b@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit 22cc6b7 upstream. USB completion handlers are called in atomic context and must specifically not allocate memory using GFP_KERNEL. Fixes: a1c49c4 ("Bluetooth: btusb: Add protocol support for MediaTek MT7668U USB devices") Cc: stable <stable@vger.kernel.org> # 5.3 Cc: Sean Wang <sean.wang@mediatek.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

commit b73e05a upstream. Make sure to use the current alternate setting when verifying the interface descriptors to avoid binding to an invalid interface. Failing to do so could cause the driver to misbehave or trigger a WARN() in usb_submit_urb() that kernels with panic_on_warn set would choke on. Fixes: 9afac70 ("orinoco: add orinoco_usb driver") Cc: stable <stable@vger.kernel.org> # 2.6.35 Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

[ Upstream commit 59fb9b6 ] This patch applies new flag (FLOW_DISSECTOR_KEY_PORTS_RANGE) and field (tp_range) to BPF flow dissector to generate appropriate flow keys when classified by specified port ranges. Fixes: 8ffb055 ("cls_flower: Fix the behavior using port ranges with hw-offload") Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Petar Penkov <ppenkov@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200117070533.402240-2-komachi.yoshiki@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit a4a8d28 ] dm-thin uses struct pool to hold the state of the pool. There may be multiple pool_c's pointing to a given pool, each pool_c represents a loaded target. pool_c's may be created and destroyed arbitrarily and the pool contains a reference count of pool_c's pointing to it. Since commit 694cfe7 ("dm thin: Flush data device before committing metadata") a pointer to pool_c is passed to dm_pool_register_pre_commit_callback and this function stores it in pmd->pre_commit_context. If this pool_c is freed, but pool is not (because there is another pool_c referencing it), we end up in a situation where pmd->pre_commit_context structure points to freed pool_c. It causes a crash in metadata_pre_commit_callback. Fix this by moving the dm_pool_register_pre_commit_callback() from pool_ctr() to pool_preresume(). This way the in-core thin-pool metadata is only ever armed with callback data whose lifetime matches the active thin-pool target. It should be noted that this fix preserves the ability to load a thin-pool table that uses a different data block device (that contains the same data) -- though it is unclear if that capability is still useful and/or needed. Fixes: 694cfe7 ("dm thin: Flush data device before committing metadata") Cc: stable@vger.kernel.org Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit c3314a7 ] Commit 800d3f5 ("perf report: Add warning when libunwind not compiled in") breaks the s390 platform. S390 uses libdw-dwarf-unwind for call chain unwinding and had no support for libunwind. So the warning "Please install libunwind development packages during the perf build." caused the confusion even if the call-graph is displayed correctly. This patch adds checking for HAVE_DWARF_SUPPORT, which is set when libdw-dwarf-unwind is compiled in. Fixes: 800d3f5 ("perf report: Add warning when libunwind not compiled in") Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Reviewed-by: Thomas Richter <tmricht@linux.ibm.com> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20200107191745.18415-1-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit dfe9aa2 ] If we get here after successfully adding page to list, err would be 1 to indicate the page is queued in the list. Current code has two problems:
* on success, 0 is not returned
* on error, if add_page_for_migration() returns 1, and the following err1 from do_move_pages_to_node() is set, the err1 is not returned since err is 1
And these behaviors break the user interface. Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com Fixes: e0153fc ("mm: move_pages: return valid node id in status if the page is already on the target node"). Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>

…tion order [ Upstream commit 8ce1cbd ] The code which checks the return value for snd_soc_add_dai_link() call in soc_tplg_fe_link_create() moved the snd_soc_add_dai_link() call before link->dobj members initialization. While it does not affect the latest kernels, the old soc-core.c code in the stable kernels is affected. The snd_soc_add_dai_link() function uses the link->dobj.type member to check, if the link structure is valid. Reorder the link->dobj initialization to make things work again. It's harmless for the recent code (and the structure should be properly initialized before other calls anyway). The problem is in stable linux-5.4.y since version 5.4.11 when the upstream commit 76d2703 was applied. Fixes: 76d2703 ("ASoC: topology: Check return value for snd_soc_add_dai_link()") Cc: Dragos Tarcatu <dragos_tarcatu@mentor.com> Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Cc: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Cc: Mark Brown <broonie@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Jaroslav Kysela <perex@perex.cz> Link: https://lore.kernel.org/r/20200122190752.3081016-1-perex@perex.cz Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit c5dcf8f ] This reverts commit f170d44. USB core will never call a USB-driver probe function with a NULL device-id pointer. Reverting before removing the existing checks in order to document this and prevent the offending commit from being "autoselected" for stable. Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit b61387c ] Commit 99c9a92 ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe") moved trace_uprobe_filter on trace_probe_event. However, since it introduced a flexible data structure with char array and type casting, the alignment of trace_uprobe_filter can be broken. This changes the type of the array to trace_uprobe_filter data structure to fix it. Link: http://lore.kernel.org/r/20200120124022.GA14897@hirez.programming.kicks-ass.net Link: http://lkml.kernel.org/r/157966340499.5107.10978352478952144902.stgit@devnote2 Fixes: 99c9a92 ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>

Change-Id: Ib206bb953b7ac42ca1dae0691bdafc33f0c8415c Signed-off-by: Tom Musta <tom.musta@intel.com> Signed-off-by: Rolf Riesen <rolf.riesen@intel.com> Signed-off-by: Andrew Tauferner <andrew.t.tauferner@intel.com> Signed-off-by: John Attinella <john.e.attinella@intel.com> Signed-off-by: Evan Powers <evan.powers@intel.com> Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>

master: Inject Arbitrary Events Rather than hardwire the mOS test event, the injection interface is updated to support injecting an arbitrary event. The first token of the string written is interpreted as a message ID. master: Update to version 0.8 master: Burst of errors in RAS injection The return code from writing to the injection file is zero in the nominal case but must actually be the number of bytes written. This results in a burst of duplicated console messages for any clients that attempt to retry, like 'echo'. master: unit-test: Add coverage for RAS injection master: Adding license to .gitignore master: Bad Pointer in RAS Injection Path There is a subtle bug in the sysfs write handler for the RAS injection path (/sys/kernel/mOS/ras/inject). The written string is duplicated, but then the pointer is altered by strsep(). Consequently, the pointer passed to kfree() is not what was obtained from kstrdup(). master: LWK partition precise memory designation This commit, combined with associated commits in the mos-core and lwkmem branches will enable the ability to fail a partition create if the requested memory designation cannot be honored. Prior to this change, the value provided for the requested memory designation was always treated as an upper limit. If that amount of memory could not be designated for the LWK partition, whatever amount of memory available would be designated and the command would succeed with a return code of 0 and no indication of a problem other than three RAS messages. There was no immediate way for the caller to know if all the requested memory was designated. A new option has been added on the lwkctl command: '--precise <yes/no>'. If the value specified for this option is 'yes', and if the requested memory designation cannot be satisfied, the command will write an error message to stderr, return a non-zero return code, and generate 'failure' RAS. 
The default behavior when creating a partition using the lwkctl command remains unchanged at this time ('--precise no' behavior). The RAS messages were modified to generate 'warning' level RAS if the requested memory designation was not completely satisfied when '--precise no' was requested. The RAS messages were modified to generate 'failure' level messages with the control action of setting 'node in error' if the requested memory was not available when '--precise yes' was requested.

master: Return memory if error configuring CPUs
When an LWK partition is created using the 'lwkctl -c' command and an error is encountered while configuring the LWK CPUs, return to Linux the memory that mOS has taken for use in the LWK.

master: Undo IRQ affinity save and restore patch
Before 4.13, the kernel had no method to restore the IRQ affinities of managed IRQs when a CPU was onlined from an offlined state. As a result, managed IRQs that were forcefully migrated away from a CPU being offlined landed on other online CPUs and were never re-affinitized to the CPU they were originally affinitized to. Because of this kernel limitation, mOS needed a mechanism to save and restore the affinities of IRQs on CPUs that it used as LWKCPUs during LWK partition creation and later gave back to Linux upon deletion of the LWK partition. This problem was fixed in the 4.13 kernel, where Linux introduced a new CPU hotplug state that restores the affinity of managed IRQs. In mOS we skip this CPU hotplug step while booting a CPU as an LWKCPU, so those IRQs are never re-affinitized while the CPU is booted as an LWKCPU. Later, when the CPU is handed back to Linux, Linux restores the IRQ affinity on that CPU using the new CPU hotplug state. With this new CPU hotplug state, the explicit IRQ save and restore mechanism in mOS is no longer needed. This patch undoes the changes made for that mechanism.
master: Deny attempts to affinitize IRQs to only LWKCPUs
If an attempt is made to affinitize an IRQ to only LWKCPUs, return EINVAL without changing the current affinity mask of the IRQ.

master: Convert RAS message to warning for lwkmem_static
If lwkmem_static is set and a user specifies an lwkmem= specification during partition creation, the kernel currently prints a RAS error message. This message was originally intended to be a warning, not an error, since static lwkmem is a debug option and not a standard one. We do not want the control system to kill the compute node when this condition occurs, so this patch converts the message from an error to a warning.

master: unit-test: Clean out RAS sysfs Upon Test Completion

Security policy file needed for SDL
As part of the Security Development Lifecycle (SDL) we need to set and publish a policy on how security vulnerabilities in mOS can be reported and how we announce these vulnerabilities and fixes for them. The content of this file will show up at https://github.com/intel/mOS/security/policy

master: Fix kernel stack overflow for large MAX_NUMNODES
When MAX_NUMNODES is set to a very large number, for example by setting CONFIG_NODES_SHIFT to 10, the result is a kernel stack overflow. This patch provides a fix for such a configuration.

master: Fix returning un-initialized NUMA array

Change-Id: I9d00f64ff9ce441a6af56482074897344ad7e452
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
Signed-off-by: Tom Musta <tom.musta@intel.com>
Signed-off-by: John Attinella <john.e.attinella@intel.com>
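The "Burst of errors in RAS injection" entry in this commit series turns on the convention that a write handler reports how many bytes it consumed. The following is a minimal user-space model in Python, not the kernel code: `inject_buggy`, `inject_fixed`, `write_all`, and the sample message are hypothetical names chosen for illustration. It shows why a handler that returns zero makes a retrying writer deliver the same event repeatedly.

```python
def inject_buggy(buf: bytes, log: list) -> int:
    """Model of the faulty handler: records the event but reports 0 bytes consumed."""
    log.append(buf.decode().strip())
    return 0


def inject_fixed(buf: bytes, log: list) -> int:
    """Model of the corrected handler: records the event and reports len(buf)."""
    log.append(buf.decode().strip())
    return len(buf)


def write_all(handler, data: bytes, log: list, max_attempts: int = 5) -> int:
    """Retrying writer (hypothetical): resend the unconsumed remainder until
    everything is accepted or max_attempts is reached. Returns the attempt count."""
    offset = 0
    attempts = 0
    while offset < len(data) and attempts < max_attempts:
        offset += handler(data[offset:], log)
        attempts += 1
    return attempts


if __name__ == "__main__":
    message = b"example-id hello\n"  # first token stands in for a message ID
    for handler in (inject_buggy, inject_fixed):
        events = []
        tries = write_all(handler, message, events)
        print(handler.__name__, "attempts:", tries, "events logged:", len(events))
```

With the zero return value the writer never makes progress, so each retry logs a duplicate event; returning the byte count ends the loop after one call.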

Change-Id: Ia3c4a3f6c3d2ddf77022627fb23c65ca377a9ab9 Signed-off-by: Tom Musta <tom.musta@intel.com> Signed-off-by: Rolf Riesen <rolf.riesen@intel.com> Signed-off-by: Evan Powers <evan.powers@intel.com> Signed-off-by: John Attinella <john.e.attinella@intel.com> Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>

Add support for the mOS memory management subsystem. Change-Id: I59e858eb261ae9958d81d6c4c76dffa4edab05d9 Signed-off-by: Rolf Riesen <rolf.riesen@intel.com> Signed-off-by: Tom Musta <tom.musta@intel.com> Signed-off-by: Andrew Tauferner <andrew.t.tauferner@intel.com> Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com> Signed-off-by: Evan Powers <evan.powers@intel.com> Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com> Signed-off-by: John Attinella <john.e.attinella@intel.com>

Change-Id: Ibd453b771e3779a9154a792f545d89502e8e8345 Signed-off-by: John Attinella <john.e.attinella@intel.com> Signed-off-by: Tom Musta <tom.musta@intel.com> Signed-off-by: Evan Powers <evan.powers@intel.com> Signed-off-by: Rolf Riesen <rolf.riesen@intel.com> Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>

lwksched: Syscall list now indicates remote
Previously the exception list indicated which syscalls we want to remain local. We are reversing the default behavior so that syscalls remain local unless they show up on the exception list. The exception list now indicates which syscalls we want to ship. Initially this will be set to an empty list. More analysis will be necessary to determine if there are syscalls that would be beneficial to ship.

lwksched: Round robin behavior changes
When round robin scheduling is enabled and there is one or fewer threads on that CPU, do not enable the timeslice scheduler tick. Also make round robin scheduling the default behavior since, when there is no overcommitment, this policy does not introduce any additional overhead or noise over the previous default of FIFO scheduling. By changing the default, we allow a wider range of applications that do overcommit CPUs to run without hanging and without needing to specify special YOD options. The default timeslice is 100ms and can be adjusted using the yod option 'lwksched-enable-rr=<ms>'. The minimum value supported is 10ms. If an application that is overcommitting CPUs wishes to not allow time-based preemption (expected to be very rare), the existing yod option to control round robin behavior can be used to turn off time-slicing on an overcommitted CPU by specifying '--lwksched-enable-rr=0'.

lwksched: Unit test changes for round-robin change
The unit tests were updated to support the change in default scheduler behavior with regard to FIFO versus round-robin scheduling.

Change-Id: I48c1f69f27cb5da68866404a0e9141b79749ca4c
Signed-off-by: John Attinella <john.e.attinella@intel.com>
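The round robin rules in the commit message above (100ms default, 10ms minimum, 0 to disable, no tick unless a CPU is overcommitted) can be summarized in a short Python sketch. This is an illustration only, not the mOS scheduler or yod code: the function names are hypothetical, and rejecting values below the minimum with an error is an assumption, since the message does not say how such values are handled.

```python
DEFAULT_TIMESLICE_MS = 100  # default stated in the commit message
MIN_TIMESLICE_MS = 10       # minimum stated in the commit message


def parse_timeslice(value=None):
    """Return the timeslice in ms for a given option value.

    None means the option was not given (use the default); 0 disables
    time slicing. Raising on values below the minimum is an assumption.
    """
    if value is None:
        return DEFAULT_TIMESLICE_MS
    ms = int(value)
    if ms == 0:
        return 0
    if ms < MIN_TIMESLICE_MS:
        raise ValueError(f"timeslice must be 0 or at least {MIN_TIMESLICE_MS} ms")
    return ms


def tick_needed(timeslice_ms: int, threads_on_cpu: int) -> bool:
    """The timeslice tick is only useful when round robin is enabled and
    more than one thread shares the CPU."""
    return timeslice_ms > 0 and threads_on_cpu > 1


if __name__ == "__main__":
    print(parse_timeslice())        # option absent: default timeslice
    print(tick_needed(100, 1))      # one thread on the CPU: no tick
    print(tick_needed(100, 3))      # overcommitted CPU: tick enabled
```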

Introduce support for a new resource reservation syntax and mechanism:

    yod -R file:<file> ...

When specified, the file may be used to map resources for a specific MPI rank. The contents of the file are a list of lines like this:

    <local-rank> <resource-arg>[ ...]

where <local-rank> is the MPI rank (or wildcard) and <resource-arg> is a yod resource argument (CPUs, cores, memory or resource). Here is an example:

    # Map rank 0 to CPU 5 and use 1GiB of memory:
    0 -c 5 -M 1G
    # Map rank 1 to CPU 20 and use 1/4 of designated memory:
    1 --cpus 20 -M 1/4
    # An optional fall-thru wildcard (should be last)
    * -C 1 -M 1G

The motivation for doing this is for MPI primitive measurements using OSU benchmarks and MPICH. There is a requirement to pin ranks to specific CPUs in order to measure intra-node latencies (intra-socket and inter-socket) as well as inter-node latencies. Certain CPUs have been deemed to be interesting based on their proximity to NICs. For now, only a subset of yod arguments is supported.

mos-core: unit-test: Add unit tests for Rank-to-Resource Mapping File

mos-core: Balance memory with lwkmem=auto
The lwkctl command supports auto configuration of memory and CPUs. This is done using the keyword=<value> pairs: lwkcpus=auto lwkmem=auto. This automatic configuration functionality is meant to create an LWK partition with a reasonable designation of resources for the running of a typical HPC application in the LWK, based on the physical topology of the system node. A typical HPC application will contain multiple ranks (processes) executing on each system node. For best performance, it is advantageous to have each rank contain memory isolated to a single numa domain. If a fractional resource value is specified to YOD at process launch time (typical), YOD can accomplish this reservation isolation if the memory resources in the LWK partition are evenly balanced across the numa domains and the divisor is a multiple of the number of numa domains.
If the designated numa domain sizes are not balanced within the LWK partition, YOD's division of the available memory resources will result in one or more processes containing memory from multiple domains which can introduce performance degradation. When creating a partition using the lwkctl command and specifying lwkmem=auto, this change provides a balanced memory designation across like-sized numa domains. For example on KNL with HBM and DDR running in SNC-4, each of the 4 DDR NUMA domains will have the same LWK memory designation and each of the 4 HBM memory domains will have the same LWK memory designation. The value chosen will be based on the like-sized domain that has the least amount of memory available to be moved into the LWK partition. An option is provided to revert to the previous behavior and give the maximum possible memory to the LWK partition without regards to balancing. To provide balanced allocations: lwkctl -c "lwkcpus=auto lwkmem=auto" To give maximum memory possible to LWK: lwkctl -c "lwkcpus=auto lwkmem=auto:max" If you set the verbosity level to 4 on the lwkctl command, there will be debug output provided related to the balancing actions. For example on SKL-10, with balancing on, you may see: [lwkctl:42680] Begin Numa domain balancing. [lwkctl:42680] Numa domain balancing: Node 1 has 86G available but limited to 85G by node 0. [lwkctl:42680] End Numa domain balancing. Lwk memory: 170G. Sacrificed 1G of potential LWK memory. Note that the default "auto" behavior is to do balancing because this is considered to produce a partition optimized for the majority of HPC situations. If a maximum possible memory designation is desired without regards to having balanced numa domain designations, you can specify "auto:max". This logic could be reversed, i.e. the code can be modified to have "auto" designate the maximum memory and then have "auto:balance" if 'max' is considered to be a better default behavior. 
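The balancing rule described for lwkmem=auto (each group of like-sized domains is limited by its least-available member) can be sketched in a few lines of Python. This is a hedged illustration, not the lwkctl implementation: `balance_designation` is a hypothetical name, and passing an explicit group label per node is an assumption, since the message does not say how like-sized domains are detected.

```python
def balance_designation(available_gib: dict, group_of: dict):
    """Limit every node to the smallest availability within its group.

    available_gib: node id -> GiB that could be moved into the LWK partition.
    group_of:      node id -> label of its like-sized group (e.g. "DDR", "HBM").
    Returns (designated per node, total GiB given up by balancing).
    """
    floor = {}
    for node, avail in available_gib.items():
        group = group_of[node]
        floor[group] = min(floor.get(group, avail), avail)
    designated = {node: floor[group_of[node]] for node in available_gib}
    sacrificed = sum(available_gib.values()) - sum(designated.values())
    return designated, sacrificed


if __name__ == "__main__":
    # Figures taken from the debug output quoted in the commit message:
    # node 0 has 85G available, node 1 has 86G available.
    designated, sacrificed = balance_designation({0: 85, 1: 86}, {0: "DDR", 1: "DDR"})
    print(designated, sum(designated.values()), sacrificed)
```

For the two-node figures quoted above, both nodes are limited to 85G, for 170G in total with 1G given up, which agrees with the quoted lwkctl debug output.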
mos-core: lwkctl precise memory designation
This commit, combined with associated commits in the master and lwkmem branches, will enable the ability to fail a partition create if the requested memory designation cannot be honored. Prior to this change, the value provided for the requested memory designation was always treated as an upper limit. If that amount of memory could not be designated for the LWK partition, whatever amount of memory available would be designated and the command would succeed with a return code of 0 and no indication of a problem other than three RAS messages. There was no immediate way for the caller to know if all the requested memory was designated. A new option has been added on the lwkctl command: '--precise <yes/no>'. If the value specified for this option is 'yes', and if the requested memory designation cannot be satisfied, the command will write an error message to stderr, return a non-zero return code, and generate 'failure' RAS. The default behavior when creating a partition using the lwkctl command remains unchanged at this time ('--precise no' behavior). The RAS messages were modified to generate 'warning' level RAS if the requested memory designation was not completely satisfied when '--precise no' was requested. The RAS messages were modified to generate 'failure' level messages with the control action of setting 'node in error' if the requested memory was not available when '--precise yes' was requested.

mos-core: unit tests for lwkctl precise option

mos-core: Add Serialization Lock to lwkctl
Add a serialization lock to lwkctl, inhibiting concurrent execution of commands that modify the partition (create and delete). The lock is implemented via an advisory lock on /sys/kernel/mOS. The lwkctl utility will block waiting to acquire the lock. The maximum wait time defaults to 5 minutes but can be overridden via the --timeout option. A value of zero will block forever.
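The serialization lock described above (advisory lock, bounded wait, zero meaning wait forever) could be modeled along the following lines. This is a sketch under stated assumptions, not lwkctl's code: it assumes a `flock(2)`-style advisory lock polled in non-blocking mode, and `acquire_lock`, its parameters, and the polling interval are hypothetical. The message only states that the lock is advisory and is taken on /sys/kernel/mOS.

```python
import errno
import fcntl
import os
import time


def acquire_lock(path: str, timeout_s: float = 300.0, poll_s: float = 0.1) -> int:
    """Take an exclusive advisory lock on path and return the open fd.

    timeout_s == 0 waits forever; otherwise TimeoutError is raised once the
    deadline passes. Closing the returned fd releases the lock.
    """
    fd = os.open(path, os.O_RDONLY)
    deadline = None if timeout_s == 0 else time.monotonic() + timeout_s
    try:
        while True:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return fd
            except OSError as exc:
                # EAGAIN/EACCES mean another holder has the lock; anything
                # else is a real error and is propagated.
                if exc.errno not in (errno.EAGAIN, errno.EACCES):
                    raise
            if deadline is not None and time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock {path} within {timeout_s}s")
            time.sleep(poll_s)
    except BaseException:
        os.close(fd)
        raise


if __name__ == "__main__":
    # Demonstration on a scratch file rather than /sys/kernel/mOS.
    demo = f"/tmp/lock_demo_{os.getpid()}"
    os.close(os.open(demo, os.O_CREAT | os.O_RDONLY, 0o600))
    holder = acquire_lock(demo, timeout_s=1.0)
    print("lock held on", demo)
    os.close(holder)
    os.unlink(demo)
```

Polling with `LOCK_NB` is one way to bound the wait; a blocking `flock` combined with an alarm signal would be another.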
Improve Debug Data for Insufficient Resources Error
When the NUMA fit algorithm in yod cannot fulfill a request for resources, the result is a rather unhelpful error message: "Insufficient LWK Resources". This can occur for a variety of reasons, including the case where a compute node is inadvertently double booked. In order to help diagnose the situation, a more complete dump of the node's LWK state is prepended to the message. This includes designated, reserved and requested CPUs & memory, as well as the active LWK processes. In support of this, the show_state() routine is cleaned up and improved. Additionally, the logging level YOD_QUIET is renamed to YOD_CRIT, which seems more descriptive.

mos-core: Specify non-uniform no. of utility threads per rank
This feature enables the user to specify a non-uniform number of utility threads across the ranks of a job. It extends the -R file:map_file argument of yod so that a -u option can be specified per rank in the map_file. The -u value specified through the map file overrides the number of utility threads specified through the -u yod argument. This does not break the existing usage of the -u and -R file: options, i.e. if one specifies utility threads through the -u argument and the -R file: argument does not specify utility threads in the file, then the number of utility threads specified through the -u argument is respected.

mos-core: Update packaging

mos-core: stop and start irqbalance daemon in lwkctl
The irqbalance daemon interprets the /sys/../online and /proc/stat files to determine the number of CPUs. If CPUs are being dynamically hotplugged (as in lwkctl), irqbalance could see an inconsistent number of online CPUs between its two reads of sysfs and procfs. To avoid this inconsistent view of online CPUs, lwkctl needs to stop irqbalance while a partition creation or deletion is ongoing. irqbalance also sets the affinity of user managed IRQs, but while an LWK partition is being created or is present, irqbalance should not consider LWKCPUs for balancing the IRQs. In order to achieve this:
a. we stop the irqbalance daemon while an LWK partition is being created or deleted, and restart the daemon after the LWK partition creation/deletion is complete.
b. when an LWK partition is created, we set the irqbalance daemon's environment variable IRQBALANCE_BANNED_CPUS before starting it. This ensures that irqbalance ignores LWKCPUs when balancing IRQs.

mos-core: Modified lwkctl tests to be topology aware
The lwkctl unit tests now discover the topology of the hardware being tested and generate test LWK partitions that are more realistic in usage.

mos-core: test_precise_yes_exceed should consider lwkmem_static
This test would succeed with an LWK partition created when lwkmem_static is set on the kernel command line. When LWK memory is static, the kernel ignores the lwkmem= specification and prints a RAS warning that the lwkmem partition is static. This patch amends the test case to check for lwkmem_static before flagging the test as failed.

mos-core: lwkctl: Block Partition Creation/Deletion If Jobs Are Active
If a job is deemed to be active (as seen by the RAS subsystem), then inhibit the deletion or creation of a partition. This behavior may be overridden via a command line option.
Ref: JIRA mOS-1488

mos-core: unit-test: Test Partitioning with Busy Job State

Change-Id: I9150f8e4435d8d18e89e53b9112251ebf256ab93
Signed-off-by: John Attinella <john.e.attinella@intel.com>
Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com>
Signed-off-by: Tom Musta <tom.musta@intel.com>
Signed-off-by: Andrew Tauferner <andrew.t.tauferner@intel.com>
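The rank-to-resource mapping file introduced with `yod -R file:<file>` earlier in this commit message has a simple line format: a local rank or `*`, followed by yod resource arguments, with `#` starting a comment line. A minimal Python reader for that format might look like the sketch below. It is not yod's parser: `parse_rank_map` and `args_for_rank` are hypothetical names, and resolving a rank by first match in file order is an assumption suggested by the note that the wildcard "should be last".

```python
def parse_rank_map(text: str):
    """Parse map-file text into a list of (rank_key, args) in file order.

    rank_key is a decimal string or "*"; args is the list of remaining tokens.
    Blank lines and lines starting with '#' are skipped.
    """
    entries = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        key, args = fields[0], fields[1:]
        if key != "*" and not key.isdigit():
            raise ValueError(f"line {lineno}: bad rank {key!r}")
        if not args:
            raise ValueError(f"line {lineno}: no resource arguments")
        entries.append((key, args))
    return entries


def args_for_rank(entries, rank: int):
    """Return the arguments of the first entry matching rank, else None."""
    for key, args in entries:
        if key == "*" or int(key) == rank:
            return args
    return None


EXAMPLE = """\
# Map rank 0 to CPU 5 and use 1GiB of memory:
0 -c 5 -M 1G
# Map rank 1 to CPU 20 and use 1/4 of designated memory:
1 --cpus 20 -M 1/4
# An optional fall-thru wildcard (should be last)
* -C 1 -M 1G
"""

if __name__ == "__main__":
    table = parse_rank_map(EXAMPLE)
    for rank in (0, 1, 7):
        print(rank, args_for_rank(table, rank))
```

The example text is the one given in the commit message; rank 7 has no explicit line and therefore falls through to the wildcard entry.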

lwkmem_mutex is meant to be used only for serializing access to global lwkmem resources. There are no such accesses in next_lwkmem_address(), so it doesn't need to grab the global lwkmem_mutex. There is no per-process serialization necessary at this point either, since mm->mmap_sem is already acquired before the caller calls this function.

lwkmem: Make all LWKMEM regions PROT_EXEC by default
With this patch all LWKMEM regions are rwx by default. Additionally, a yod option '-o lwkmem-prot-exec-disable' is provided which, when specified, keeps LWKMEM regions from being made executable by default.

lwkmem: Fix GUP fast path
A defect in __get_user_pages_fast and get_user_pages_fast was potentially causing a kernel panic and/or hang. These functions have been updated to properly retrieve user pages for non-LWK processes. __get_user_pages_fast will not retrieve LWK pages at this time.

lwkmem: Improve RAS in the mremap Failure Scenario
Improve the RAS message content for events along the failure path of mremap.
1) The common path error inside of build_lwkvm is updated with the (hopefully) more descriptive and useful message:
    build_lwkvma: Could not insert LWK VMA at [2aaaaaac0000,2aaaaac00000) length=1310720 rc=-12
2) The mremap RAS message now clearly identifies the address *and* old and new lengths:
    lwk_sys_mremap: remap failed: address=0x2aaaaaab0000 old_size=65536 new_size=2097152
This information is useful if we need to override default yod behavior (--aligned-mmap).
Ref: JIRA MOS-1393

lwkmem: add lwkctl precise memory designation
This commit, combined with associated commits in the master and mos-core branches, will enable the ability to fail a partition create if the requested memory designation cannot be honored. Prior to this change, the value provided for the requested memory designation was always treated as an upper limit.
If that amount of memory could not be designated for the LWK partition, whatever amount of memory available would be designated and the command would succeed with a return code of 0 and no indication of a problem other than three RAS messages. There was no immediate way for the caller to know if all the requested memory was designated. A new option has been added on the lwkctl command: '--precise <yes/no>'. If the value specified for this option is 'yes', and if the requested memory designation cannot be satisfied, the command will write an error message to stderr, return a non-zero return code, and generate 'failure' RAS. The default behavior when creating a partition using the lwkctl command remains unchanged at this time ('--precise no' behavior). The RAS messages were modified to generate 'warning' level RAS if the requested memory designation was not completely satisfied when '--precise no' was requested. The RAS messages were modified to generate 'failure' level messages with the control action of setting 'node in error' if the requested memory was not available when '--precise yes' was requested.

lwkmem: Adapt kernelcore/movablecore/movable node patch to 5.3
In the 5.3 kernel, Linux supports specifying kernelcore and movablecore as percentages of total memory in addition to the previously supported absolute byte format. This patch migrates the mOS changes in that area to adapt to the new kernel feature.

lwkmem: Adapt to new mmap flag MAP_FIXED_NOREPLACE
Linux v5.3 has MAP_FIXED_NOREPLACE, which behaves like MAP_FIXED except that if a previous mapping exists in the requested address range, the mmap fails with EEXIST instead of unmapping the old mapping and creating a new one in that range. This patch adjusts the lwkmem code to accommodate this new flag.

lwkmem: Add mOS view to new entry in hugetlb
Linux v5.3 adds a new entry to meminfo from hugetlb. This patch adds the mOS view to that new entry.
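The difference between MAP_FIXED and MAP_FIXED_NOREPLACE described in the commit message can be illustrated with a toy address-space model in Python. This is a conceptual sketch only: `AddressSpace` and its methods are hypothetical, ranges are half-open `[start, end)` integer intervals, and nothing here reflects how the kernel or lwkmem actually tracks mappings.

```python
import errno


class AddressSpace:
    """Toy model: a sorted list of half-open (start, end) mapped ranges."""

    def __init__(self):
        self.maps = []

    def map_fixed(self, start: int, length: int) -> int:
        """MAP_FIXED semantics: anything overlapping the range is unmapped
        (trimmed or removed) and replaced by the new mapping."""
        end = start + length
        kept = []
        for s, e in self.maps:
            if s < end and start < e:      # overlaps the requested range
                if s < start:
                    kept.append((s, start))  # piece below the new mapping
                if end < e:
                    kept.append((end, e))    # piece above the new mapping
            else:
                kept.append((s, e))
        kept.append((start, end))
        self.maps = sorted(kept)
        return start

    def map_fixed_noreplace(self, start: int, length: int) -> int:
        """MAP_FIXED_NOREPLACE semantics: fail with EEXIST on any overlap."""
        end = start + length
        if any(s < end and start < e for s, e in self.maps):
            raise OSError(errno.EEXIST, "address range already mapped")
        self.maps.append((start, end))
        self.maps.sort()
        return start


if __name__ == "__main__":
    space = AddressSpace()
    space.map_fixed_noreplace(0x1000, 0x2000)
    try:
        space.map_fixed_noreplace(0x2000, 0x1000)
    except OSError as exc:
        print("noreplace refused:", errno.errorcode[exc.errno])
    space.map_fixed(0x2000, 0x2000)
    print([(hex(s), hex(e)) for s, e in space.maps])
```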
lwkmem: Adapt TLB flush to 5.3 In 5.3 TLB flush functionality exposes a stride that can be specified along with the range. This patch re-works the 5.3 rebase to use proper stride. lwkmem: call xpmem fault handler directly The Linux core page fault handler allocates all higher levels of page table hierarchy (pgd, p4d, pud, pmd) before it invokes XPMEM fault handler. This is ok for allocating a base page in XPMEM fault handler, but for allocating large pages such as 1g page pmd level is not needed and shouldn't be allocated by the Linux page fault handler before it invokes the XPMEM page fault handler. This patch modifies the Linux core page fault handler to invoke registered page fault handler of the XPMEM driver directly before allocating pmd level page table if the faulting address is an XPMEM VMA. Change-Id: Iec3041d9e377002bd6f2ed8527a05c0180869e70 Signed-off-by: Sharath Kumar Bhat <sharath.k.bhat@intel.com> Signed-off-by: John Attinella <john.e.attinella@intel.com> Signed-off-by: Tom Musta <tom.musta@intel.com>

Change-Id: I4a064ca2811951fbcd6cc02e921aabefa05cab70 Signed-off-by: John Attinella <john.e.attinella@intel.com>

Change-Id: Iada3b62f9da65ec8a29ae714b3320c6ad139a5e0 Signed-off-by: Andrew Tauferner <andrew.t.tauferner@intel.com>

When CPUs are overcommitted in an LWK partition, the additional threads were being assigned to the CPU of the thread that created the new threads. For example, if 10 CPUs were in the reservation for the process and 100 threads were created by the main thread of the process, there would be 1 thread running on each of the upper 9 CPUs and 91 threads all trying to run on the first CPU of the reservation. This mOS problem was introduced during a recent rebase in which Linux changed the clone system call code flow.

Change-Id: I068dd29ebdb8dc195798a68ef268f5719247e59d
Signed-off-by: John Attinella <john.e.attinella@intel.com>
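To make the numbers in the message above concrete, the sketch below spreads new threads over a reservation by always choosing the CPU with the fewest threads. This is an illustrative policy only; the commit message does not describe the actual mOS placement algorithm, and `place_threads` is a hypothetical name.

```python
def place_threads(cpus, n_threads):
    """Assign n_threads to cpus, each time picking the least-loaded CPU
    (ties go to the CPU listed first). Returns (placement list, load per CPU)."""
    load = {cpu: 0 for cpu in cpus}
    placement = []
    for _ in range(n_threads):
        cpu = min(cpus, key=lambda c: load[c])
        load[cpu] += 1
        placement.append(cpu)
    return placement, load


if __name__ == "__main__":
    reservation = list(range(10))          # 10 reserved CPUs, as in the example
    _, load = place_threads(reservation, 100)
    print(load)                            # threads per CPU under this policy
    # The faulty behavior described above would instead look like:
    faulty = {cpu: (91 if cpu == 0 else 1) for cpu in reservation}
    print(faulty)
```

Under this policy 100 threads on 10 CPUs end up as 10 threads per CPU, in contrast to the 91/1/1/... split that the regression produced.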

The mOS scheduler has a spin lock in its CPU-scoped run-queue object. This spin lock is obtained when we are committing and un-committing a thread to a specific CPU. All calls to the spin lock must occur when interrupts are disabled. However, in the thread exit path interrupts are enabled. In this exit path we are un-committing the exiting thread. While we held the spin lock, a scheduler timer tick fired. This fired because we have over-committed threads on this CPU (we disable the timer tick if we are not over-committed). The timer tick processing drove us through the mOS code to wake up and dispatch another thread. This flow attempts to obtain the spin lock to commit the thread to this run-queue, resulting in a deadlock. The fix is to use a more robust spin-lock interface to lock and unlock which guarantees that interrupts are always disabled while the spin-lock is held.

Change-Id: I6b296abc4a78c2b143973a27ae1e0e12b996904f
Signed-off-by: John Attinella <john.e.attinella@intel.com>

rolfriesen

Hello, @RudraSwat thanks for the suggested change. This is the original README that comes with the Linux kernel. We did not change it because the vast majority of code here is still Linux and this README refers to that.
OTOH it is a little confusing that it talks about Linux on the main Code page for mOS ;-)
If we decide to change it, maybe we should rename the current one to README.linux and bring in the mOS README from https://github.com/intel/mOS/wiki/mOS-for-HPC-v0.8-Readme.
What do you think?

RudraSwat · 2021-07-15T08:30:45Z

@rolfriesen Sorry for the late reply. Yes, I guess we can rename the current README and move https://github.com/intel/mOS/wiki/mOS-for-HPC-v0.8-Readme to the README file (and make it plaintext-friendly).

rolfriesen · 2021-08-12T21:55:59Z

Hello @RudraSwat we just pushed a new release of mOS and updated the top-level README to be about mOS instead of Linux. Thanks for that suggestion.

[ Upstream commit 99d4850 ]

Found by leak sanitizer:
```
==1632594==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 21 byte(s) in 1 object(s) allocated from:
    #0 0x7f2953a7077b in __interceptor_strdup ../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
    #1 0x556701d6fbbf in perf_env__read_cpuid util/env.c:369
    #2 0x556701d70589 in perf_env__cpuid util/env.c:465
    #3 0x55670204bba2 in x86__is_amd_cpu arch/x86/util/env.c:14
    #4 0x5567020487a2 in arch__post_evsel_config arch/x86/util/evsel.c:83
    #5 0x556701d8f78b in evsel__config util/evsel.c:1366
    #6 0x556701ef5872 in evlist__config util/record.c:108
    #7 0x556701cd6bcd in test__PERF_RECORD tests/perf-record.c:112
    #8 0x556701cacd07 in run_test tests/builtin-test.c:236
    #9 0x556701cacfac in test_and_print tests/builtin-test.c:265
    #10 0x556701cadddb in __cmd_test tests/builtin-test.c:402
    #11 0x556701caf2aa in cmd_test tests/builtin-test.c:559
    #12 0x556701d3b557 in run_builtin tools/perf/perf.c:323
    #13 0x556701d3bac8 in handle_internal_command tools/perf/perf.c:377
    #14 0x556701d3be90 in run_argv tools/perf/perf.c:421
    #15 0x556701d3c3f8 in main tools/perf/perf.c:537
    #16 0x7f2952a46189 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 21 byte(s) leaked in 1 allocation(s).
```

Fixes: f7b58cb ("perf mem/c2c: Add load store event mappings for AMD")
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20230613235416.1650755-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit b684c09 ]

ppc_save_regs() skips one stack frame while saving the CPU register states. Instead of saving current R1, it pulls the previous stack frame pointer. When vmcores caused by direct panic call (such as `echo c > /proc/sysrq-trigger`), are debugged with gdb, gdb fails to show the backtrace correctly. On further analysis, it was found that it was because of mismatch between r1 and NIP. GDB uses NIP to get current function symbol and uses corresponding debug info of that function to unwind previous frames, but due to the mismatching r1 and NIP, the unwinding does not work, and it fails to unwind to the 2nd frame and hence does not show the backtrace.

GDB backtrace with vmcore of kernel without this patch:
---------
(gdb) bt
#0 0xc0000000002a53e8 in crash_setup_regs (oldregs=<optimized out>, newregs=0xc000000004f8f8d8) at ./arch/powerpc/include/asm/kexec.h:69
#1 __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:974
#2 0x0000000000000063 in ?? ()
#3 0xc000000003579320 in ?? ()
---------

Further analysis revealed that the mismatch occurred because "ppc_save_regs" was saving the previous stack's SP instead of the current r1. This patch fixes this by storing current r1 in the saved pt_regs.

GDB backtrace with vmcore of patched kernel:
--------
(gdb) bt
#0 0xc0000000002a53e8 in crash_setup_regs (oldregs=0x0, newregs=0xc00000000670b8d8) at ./arch/powerpc/include/asm/kexec.h:69
#1 __crash_kexec (regs=regs@entry=0x0) at kernel/kexec_core.c:974
#2 0xc000000000168918 in panic (fmt=fmt@entry=0xc000000001654a60 "sysrq triggered crash\n") at kernel/panic.c:358
#3 0xc000000000b735f8 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:155
#4 0xc000000000b742cc in __handle_sysrq (key=key@entry=99, check_mask=check_mask@entry=false) at drivers/tty/sysrq.c:602
#5 0xc000000000b7506c in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>, count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1163
#6 0xc00000000069a7bc in pde_write (ppos=<optimized out>, count=<optimized out>, buf=<optimized out>, file=<optimized out>, pde=0xc00000000362cb40) at fs/proc/inode.c:340
#7 proc_reg_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:352
#8 0xc0000000005b3bbc in vfs_write (file=file@entry=0xc000000006aa6b00, buf=buf@entry=0x61f498b4f60 <error: Cannot access memory at address 0x61f498b4f60>, count=count@entry=2, pos=pos@entry=0xc00000000670bda0) at fs/read_write.c:582
#9 0xc0000000005b4264 in ksys_write (fd=<optimized out>, buf=0x61f498b4f60 <error: Cannot access memory at address 0x61f498b4f60>, count=2) at fs/read_write.c:637
#10 0xc00000000002ea2c in system_call_exception (regs=0xc00000000670be80, r0=<optimized out>) at arch/powerpc/kernel/syscall.c:171
#11 0xc00000000000c270 in system_call_vectored_common () at arch/powerpc/kernel/interrupt_64.S:192
--------

Nick adds: So this now saves regs as though it was an interrupt taken in the caller, at the instruction after the call to ppc_save_regs, whereas previously the NIP was there, but R1 came from the caller's caller and that mismatch is what causes gdb's dwarf unwinder to go haywire.
Signed-off-by: Aditya Gupta <adityag@linux.ibm.com> Fixes: d16a58f ("powerpc: Improve ppc_save_regs()") Reivewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230615091047.90433-1-adityag@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>

[ Upstream commit 93a3319 ]

The cited commit holds encap tbl lock unconditionally when setting up dests. But it may cause the following deadlock:

PID: 1063722 TASK: ffffa062ca5d0000 CPU: 13 COMMAND: "handler8"
#0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91
#1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb
#2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528
#3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb
#4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb
#5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core]
#6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core]
#7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core]
#8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core]
#9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core]
#10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core]
#11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core]
#12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core]
#13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8
#14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower]
#15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower]
#16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0

[1031218.028143] wait_for_completion+0x24/0x30
[1031218.028589] mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core]
[1031218.029256] mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core]
[1031218.029885] process_one_work+0x24e/0x510

Actually no need to hold encap tbl lock if there is no encap action. Fix it by checking if encap action exists or not before holding encap tbl lock.
Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

bvanassche and others added 30 commits January 29, 2020 16:45

Linux 5.4.16

60b6aa2

yosh1k104 and others added 22 commits February 5, 2020 21:22

Linux 5.4.18

58c7205

Merge branches 'origin/mos-core', b1be3a9, 992123f

d644370

Adding Configuration File (config.mos)

8b8e8da

Change-Id: I4a064ca2811951fbcd6cc02e921aabefa05cab70 Signed-off-by: John Attinella <john.e.attinella@intel.com>

mOS v0.8

1ede282

Change-Id: Iada3b62f9da65ec8a29ae714b3320c6ad139a5e0 Signed-off-by: Andrew Tauferner <andrew.t.tauferner@intel.com>

Update README

3bdccf3

rolfriesen reviewed Jun 29, 2021

jattine force-pushed the master branch from 890456f to 15e6b59 Compare August 12, 2021 21:00

rolfriesen closed this Aug 12, 2021
