Speculative page fault #359

In Ubuntu 16.10, gcc's defaults have been set to build Position Independent Executables (PIE) on amd64 and ppc64le (gcc was configured this way for s390x in Ubuntu 16.04 LTS). This breaks the kernel build on amd64. The following patch disables PIE for x86 builds (though it has not yet been verified to work with gcc configured to build PIE by default on i386 -- we're not planning to enable it for that architecture).

The intent is for this patch to go upstream after expanding it to additional architectures where needed, but I wanted to ensure that we could build 16.10 kernels first. I've successfully built kernels and booted them with this patch applied using the 16.10 compiler.

Patch is against yakkety.git, but also applies with minor movement (no fuzz) against current linus.git.

Signed-off-by: Steve Beattie <steve.beattie@canonical.com>
[apw@canonical.com: shifted up so works in arch/<arch>/Makefile.]
BugLink: http://bugs.launchpad.net/bugs/1574982
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>

One of the side effects of speculating on faults (without holding mmap_sem) is that we can race with free_pgtables() and therefore we cannot assume the page-tables will stick around.

Remove the reliance on the pte pointer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

When speculating faults (without holding mmap_sem) we need to validate that the vma against which we loaded pages is still valid when we're ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the PTL with pte_map_lock() which can fail in case we find the VMA changed since we started the fault.

Instead of passing around the endless list of function arguments, replace the lot with a single structure so we can change context without endless function signature changes.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[port to 4.8 kernel]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
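To make the two ideas above concrete, here is a minimal single-threaded sketch, not the code from this series. All names ending in _sketch are hypothetical, the page-table lock is modelled as a plain flag, and the "did the VMA change" test is assumed to be a comparison against a sequence number sampled when the fault started.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a VMA: only the change counter matters here. */
struct vma_sketch {
	unsigned int sequence;	/* bumped whenever the VMA is modified */
};

#define FAULT_FLAG_SPECULATIVE_SKETCH 0x1u	/* hypothetical flag value */

/* One structure instead of a long argument list (hypothetical layout). */
struct fault_env_sketch {
	struct vma_sketch *vma;
	unsigned long address;
	unsigned int flags;
	unsigned int sequence;	/* vma->sequence sampled at fault start */
	bool ptl_held;		/* models holding the page-table lock */
};

/*
 * Unlike a plain "map and lock", this may fail: on a speculative fault
 * it drops the lock again and returns false if the VMA changed since
 * the fault started. Non-speculative faults always succeed.
 */
static bool pte_map_lock_sketch(struct fault_env_sketch *fe)
{
	fe->ptl_held = true;	/* take the (modelled) PTL */

	if ((fe->flags & FAULT_FLAG_SPECULATIVE_SKETCH) &&
	    fe->vma->sequence != fe->sequence) {
		fe->ptl_held = false;	/* back out, caller must retry */
		return false;
	}
	return true;
}
```

Callers only ever pass the one structure around, so adding a field such as the sampled sequence does not touch any function signature.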

This is needed because in handle_pte_fault() pte_offset_map() is called, and then fe->ptl is fetched and spin-locked. This was previously embedded in the call to pte_offset_map_lock().

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence counts such that we can easily test if a VMA has changed.

The unmap_page_range() one allows us to make assumptions about page-tables; when we find the seqcount hasn't changed we can assume page-tables are still valid.

The flip side is that we cannot distinguish between a vma_adjust() and the unmap_page_range() -- where with the former we could have re-checked the vma bounds against the address.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
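A rough userspace sketch of the sequence-count pattern, for illustration only. The struct and function names are hypothetical, and it uses C11 atomics with their default sequentially consistent ordering rather than the kernel's seqcount_t and explicit barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA with an embedded sequence count. */
struct vma_sketch {
	atomic_uint sequence;	/* odd while a modification is in flight */
	atomic_ulong start;
	atomic_ulong end;
};

static void vma_write_begin(struct vma_sketch *vma)
{
	atomic_fetch_add(&vma->sequence, 1);	/* count becomes odd */
}

static void vma_write_end(struct vma_sketch *vma)
{
	atomic_fetch_add(&vma->sequence, 1);	/* count becomes even again */
}

/* Writer: a bounds change, wrapped like a vma_adjust()-style path. */
static void vma_adjust_sketch(struct vma_sketch *vma,
			      unsigned long start, unsigned long end)
{
	vma_write_begin(vma);
	atomic_store(&vma->start, start);
	atomic_store(&vma->end, end);
	vma_write_end(vma);
}

/* Writer: an unmap-style path bumps the same counter, bounds untouched. */
static void unmap_page_range_sketch(struct vma_sketch *vma)
{
	vma_write_begin(vma);
	/* page-table teardown would happen here */
	vma_write_end(vma);
}

/* Reader: wait out any in-flight writer, then return the snapshot. */
static unsigned int vma_read_begin(struct vma_sketch *vma)
{
	unsigned int seq;

	while ((seq = atomic_load(&vma->sequence)) & 1)
		;	/* writer active, spin */
	return seq;
}

/* Reader: true if anything changed since the snapshot was taken. */
static bool vma_read_retry(struct vma_sketch *vma, unsigned int seq)
{
	return atomic_load(&vma->sequence) != seq;
}
```

Because both writers bump the same counter, a reader that sees vma_read_retry() return true cannot tell which of the two ran, which is the flip side noted above.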

Manage the VMAs with SRCU such that we can do a lockless VMA lookup.

We put the fput(vma->vm_file) in the SRCU callback; this keeps files valid during speculative faults. This is possible due to the delayed fput work by Al Viro -- do we need srcu_barrier() in unmount someplace?

We guard the mm_rb tree with a seqlock (XXX could be a seqcount but we'd have to disable preemption around the write side in order to make the retry loop in __read_seqcount_begin() work) such that we can know if the rb tree walk was correct. We cannot trust the result of a lockless tree walk in the face of concurrent tree rotations, although we can rely on the termination of such walks -- tree rotations guarantee the end result is a tree again after all.

Furthermore, we rely on the WMB implied by the write_seqlock/count_begin() to separate the VMA initialization and the publishing stores, analogous to the RELEASE in rcu_assign_pointer(). We also rely on the RMB from read_seqretry() to separate the vma load from further loads like the smp_read_barrier_depends() in regular RCU.

We must not touch the vmacache while doing SRCU lookups as that is not properly serialized against changes.

We update gap information after publishing the VMA, but A) we don't use that and B) the seqlock read side would fix that anyhow.

We clear vma->vm_rb for nodes removed from the vma tree such that we can easily detect such 'dead' nodes; we rely on the WMB from write_sequnlock() to separate the tree removal and clearing the node.

Provide find_vma_srcu() which wraps the required magic.

XXX: mmap()/munmap() heavy workloads might suffer from the global lock in call_srcu() -- this is fixable with a 'better' SRCU implementation.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
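The control flow of such a lookup could look roughly like the sketch below. This is a single-threaded model under several assumptions: the names are hypothetical, a plain binary search tree stands in for the rb tree, the walk returns only a node containing the address, a boolean stands in for the cleared vma->vm_rb, and neither SRCU nor the memory barriers are modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tree node covering [start, end). */
struct vma_node {
	unsigned long start, end;
	struct vma_node *left, *right;
	bool dead;	/* stand-in for a cleared vma->vm_rb */
};

struct mm_sketch {
	atomic_uint seq;	/* stand-in for the seqlock guarding the tree */
	struct vma_node *root;
};

/* Plain walk; always terminates because the structure is always a tree. */
static struct vma_node *walk(struct vma_node *n, unsigned long addr)
{
	while (n) {
		if (addr < n->start)
			n = n->left;
		else if (addr >= n->end)
			n = n->right;
		else
			return n;
	}
	return NULL;
}

/* Repeat the walk until no writer was active during it. */
static struct vma_node *find_vma_sketch(struct mm_sketch *mm,
					unsigned long addr)
{
	struct vma_node *vma;
	unsigned int seq;

	do {
		seq = atomic_load(&mm->seq);
		vma = walk(mm->root, addr);
	} while ((seq & 1) || atomic_load(&mm->seq) != seq);

	/* A node removed from the tree must not be used. */
	if (vma && vma->dead)
		return NULL;
	return vma;
}
```

The result of a single walk is never trusted on its own; only a walk bracketed by two equal, even sequence values is returned to the caller.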

Provide infrastructure to do a speculative fault (not holding mmap_sem).

Not holding mmap_sem means we can race against VMA change/removal and page-table destruction. We use the SRCU VMA freeing to keep the VMA around. We use the VMA seqcount to detect change (including unmapping / page-table deletion) and we use gup_fast() style page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we check whether the state we started the fault with is still valid. If not, we fail the fault with VM_FAULT_RETRY; otherwise we update the PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Try a speculative fault before acquiring mmap_sem. If it returns with VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the traditional fault.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
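The try-then-fall-back flow can be sketched as follows. This is a single-threaded model with hypothetical names, not the arch fault handler: the retry flag value is made up, mmap_sem is a counter, and a callback simulates a concurrent VMA change in the window between sampling the sequence and installing the PTE.

```c
#include <stddef.h>

#define VM_FAULT_RETRY_SKETCH 0x1	/* hypothetical value for the sketch */

struct mm_sketch {
	unsigned int vma_sequence;	/* bumped by 2 per VMA modification */
	unsigned long pte;		/* the one "PTE" being installed */
	int mmap_sem_readers;		/* models down_read()/up_read() */
	int slow_path_taken;		/* counts traditional faults */
};

/* Hook that simulates a concurrent change during the speculative window. */
typedef void (*race_hook_t)(struct mm_sketch *mm);

static void simulate_unmap(struct mm_sketch *mm)
{
	mm->vma_sequence += 2;
}

/* Speculative path: no mmap_sem; validate before installing the PTE. */
static int handle_speculative_fault_sketch(struct mm_sketch *mm,
					   unsigned long new_pte,
					   race_hook_t race)
{
	unsigned int seq = mm->vma_sequence;

	if (seq & 1)
		return VM_FAULT_RETRY_SKETCH;	/* modification in flight */
	if (race)
		race(mm);			/* page is being prepared */
	if (mm->vma_sequence != seq)
		return VM_FAULT_RETRY_SKETCH;	/* state changed under us */
	mm->pte = new_pte;
	return 0;
}

/* Traditional path: only ever called with mmap_sem held. */
static int handle_mm_fault_sketch(struct mm_sketch *mm, unsigned long new_pte)
{
	mm->slow_path_taken++;
	mm->pte = new_pte;
	return 0;
}

static int do_page_fault_sketch(struct mm_sketch *mm, unsigned long new_pte,
				race_hook_t race)
{
	int ret = handle_speculative_fault_sketch(mm, new_pte, race);

	if (!(ret & VM_FAULT_RETRY_SKETCH))
		return ret;		/* speculation succeeded */

	mm->mmap_sem_readers++;		/* down_read(&mm->mmap_sem) */
	ret = handle_mm_fault_sketch(mm, new_pte);
	mm->mmap_sem_readers--;		/* up_read(&mm->mmap_sem) */
	return ret;
}
```

Calling do_page_fault_sketch() with a NULL hook completes without touching the semaphore counter; passing simulate_unmap forces the fallback to the traditional path.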

Commits on Oct 25, 2016

Commits on Nov 17, 2016