
Commit 6c28760

davidhildenbrand authored and akpm00 committed
mm: remember exclusively mapped anonymous pages with PG_anon_exclusive
Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as exclusive, and use that information to make GUP pins reliable and stay consistent with the page mapped into the page table even if the page table entry gets write-protected. With that information at hand, we can extend our COW logic to always reuse anonymous pages that are exclusive. For anonymous pages that might be shared, the existing logic applies.

As already documented, PG_anon_exclusive is usually only expressive in combination with a page table entry. Especially PTE vs. PMD-mapped anonymous pages require more thought, some examples: due to mremap() we can easily have a single compound page PTE-mapped into multiple page tables exclusively in a single process -- multiple page table locks apply. Further, due to MADV_WIPEONFORK we might not necessarily write-protect all PTEs, and only some subpages might be pinned. Long story short: once PTE-mapped, we have to track information about exclusivity per sub-page, but until then, we can just track it for the compound page in the head page and don't have to update a whole bunch of subpages all of the time for a simple PMD mapping of a THP.

For simplicity, this commit mostly talks about "anonymous pages", while for THP it actually means "the part of an anonymous folio referenced via a page table entry".

To not spill PG_anon_exclusive code all over the mm code-base, we let the anon rmap code handle all PG_anon_exclusive logic it can easily handle.

If a writable, present page table entry points at an anonymous (sub)page, that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page referenced via a present page table entry, it must only pin if PG_anon_exclusive is set for the mapped (sub)page. This commit doesn't adjust GUP, so this is only implicitly handled for FOLL_WRITE; follow-up commits will teach GUP to also respect it for FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully reliable.

Whenever an anonymous page is to be shared (fork(), KSM), or when temporarily unmapping an anonymous page (swap, migration), the relevant PG_anon_exclusive bit has to be cleared to mark the anonymous page possibly shared. Clearing will fail if there are GUP pins on the page:

* For fork(), this means having to copy the page and not being able to share it. fork() protects against concurrent GUP using the PT lock and the src_mm->write_protect_seq.

* For KSM, this means sharing will fail. For swap, this means unmapping will fail. For migration, this means migration will fail early. All three cases protect against concurrent GUP using the PT lock and a proper clear/invalidate+flush of the relevant page table entry.

This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a pinned page gets mapped R/O and the successive write fault ends up replacing the page instead of reusing it. It improves the situation for O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN, if fork() is *not* involved; however, swapout and fork() are still problematic. Properly using FOLL_PIN instead of FOLL_GET for these GUP users will fix the issue for them.

I. Details about basic handling

I.1. Fresh anonymous pages

page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the given page exclusive via __page_set_anon_rmap(exclusive=1). As that is the mechanism by which fresh anonymous pages come into life (besides the migration code, where we copy the page->mapping), all fresh anonymous pages will start out as exclusive.

I.2. COW reuse handling of anonymous pages

When a COW handler stumbles over a (sub)page that's marked exclusive, it simply reuses it. Otherwise, the handler tries harder under page lock to detect if the (sub)page is exclusive and can be reused. If exclusive, page_move_anon_rmap() will mark the given (sub)page exclusive.

Note that hugetlb code does not yet check for PageAnonExclusive(), as it still uses the old COW logic that is prone to the COW security issue, because hugetlb code cannot really tolerate unnecessary/wrong COW as huge pages are a scarce resource.
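For illustration, here is a condensed sketch of that PMD-level reuse decision, modeled on the do_huge_pmd_wp_page() hunks in the diff below. The helper name thp_wp_reuse_sketch() and the simplified fallback handling are illustrative assumptions, not code from this commit; the real code additionally drains the LRU pagevecs and, if it cannot trylock the page, drops the PT lock, takes the page lock, and retries.

#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/pagemap.h>

/* Illustrative sketch only: the COW reuse rules mapped onto one decision path. */
static vm_fault_t thp_wp_reuse_sketch(struct vm_fault *vmf, struct page *page)
{
	/* Early check when only holding the PT lock. */
	if (PageAnonExclusive(page))
		goto reuse;

	/* Try harder under the page lock. */
	if (!trylock_page(page))
		return VM_FAULT_FALLBACK;	/* simplified: the real code retries */

	/* Recheck: the page might have been marked exclusive concurrently. */
	if (PageAnonExclusive(page)) {
		unlock_page(page);
		goto reuse;
	}

	/* Only a single reference left: mark the page exclusive and reuse it. */
	if (page_count(page) == 1) {
		page_move_anon_rmap(page, vmf->vma);
		unlock_page(page);
		goto reuse;
	}

	unlock_page(page);
	return VM_FAULT_FALLBACK;	/* possibly shared or pinned: copy instead */
reuse:
	return VM_FAULT_WRITE;		/* caller marks the PMD young/dirty/writable */
}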
I.3. Migration handling

try_to_migrate() has to try marking an exclusive anonymous page shared via page_try_share_anon_rmap(). If it fails because there are GUP pins on the page, unmap fails. migrate_vma_collect_pmd() and __split_huge_pmd_locked() are handled similarly.

Writable migration entries implicitly point at shared anonymous pages. For readable migration entries, that information is stored via a new "readable-exclusive" migration entry, specific to anonymous pages.

When restoring a migration entry in remove_migration_pte(), information about exclusivity is detected via the migration entry type, and RMAP_EXCLUSIVE is set accordingly for page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.

I.4. Swapout handling

try_to_unmap() has to try marking the mapped page possibly shared via page_try_share_anon_rmap(). If it fails because there are GUP pins on the page, unmap fails. For now, information about exclusivity is lost. In the future, we might want to remember that information in the swap entry in some cases; however, that requires more thought, care, and a way to store the information in swap entries.

I.5. Swapin handling

do_swap_page() will never stumble over exclusive anonymous pages in the swap cache, as try_to_migrate() prohibits that. do_swap_page() always has to detect manually if an anonymous page is exclusive and has to set RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.

I.6. THP handling

__split_huge_pmd_locked() has to move the information about exclusivity from the PMD to the PTEs:

a) In case we have a readable-exclusive PMD migration entry, simply insert readable-exclusive PTE migration entries.

b) In case we have a present PMD entry and we don't want to freeze ("convert to migration entries"), simply forward PG_anon_exclusive to all sub-pages; no need to temporarily clear the bit.

c) In case we have a present PMD entry and want to freeze, handle it similar to try_to_migrate(): try marking the page shared first. In case we fail, we ignore the "freeze" instruction and simply split ordinarily. try_to_migrate() will properly fail because the THP is still mapped via PTEs.

When splitting a compound anonymous folio (THP), the information about exclusivity is implicitly handled via the migration entries: no need to replicate PG_anon_exclusive manually.
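As a small illustration of how the exclusivity information flows into the entry type when a mapped anonymous (sub)page is replaced by a migration entry, here is a hypothetical helper modeled on the set_pmd_migration_entry() and __split_huge_pmd_locked() hunks in the diff below; the helper name is an assumption and not part of the commit.

#include <linux/mm.h>
#include <linux/swapops.h>

/*
 * Illustrative only: pick the migration entry type for a mapped (sub)page.
 * Writable mappings imply "exclusive"; for read-only mappings of anonymous
 * pages, the exclusivity information is preserved in the entry type.
 * @anon_exclusive must be sampled *before* page_try_share_anon_rmap() clears
 * PG_anon_exclusive, exactly as the real callers do.
 */
static swp_entry_t migration_entry_for_page_sketch(struct page *page,
						   bool writable,
						   bool anon_exclusive)
{
	if (writable)
		return make_writable_migration_entry(page_to_pfn(page));
	if (anon_exclusive)
		return make_readable_exclusive_migration_entry(page_to_pfn(page));
	return make_readable_migration_entry(page_to_pfn(page));
}

Note that before such an entry may actually be installed, the real code first has to succeed in page_try_share_anon_rmap() (see I.3 and the include/linux/rmap.h hunk below); the readable-exclusive entry type merely remembers, for restore time, that the page used to be exclusive.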
I.7. fork() handling

fork() handling is relatively easy, because PG_anon_exclusive is only expressive for some page table entry types:

a) Present anonymous pages

page_try_dup_anon_rmap() will mark the given subpage shared -- which will fail if the page is pinned. If it failed, we have to copy (or PTE-map a PMD to handle it on the PTE level).

Note that device exclusive entries are just a pointer at a PageAnon() page. fork() will first convert a device exclusive entry to a present page table entry and handle it just like present anonymous pages.

b) Device private entries

Device private entries point at PageAnon() pages that cannot be mapped directly and, therefore, cannot get pinned. page_try_dup_anon_rmap() will mark the given subpage shared, which cannot fail because the pages cannot get pinned.

c) HW poison entries

PG_anon_exclusive will remain untouched and is stale -- the page table entry is just a placeholder after all.

d) Migration entries

Writable and readable-exclusive entries are converted to readable entries: possibly shared.

I.8. mprotect() handling

mprotect() only has to properly handle the new readable-exclusive migration entry: when write-protecting a migration entry that points at an anonymous page, remember the information about exclusivity via the "readable-exclusive" migration entry type.

II. Migration and GUP-fast

Whenever replacing a present page table entry that maps an exclusive anonymous page by a migration entry, we have to mark the page possibly shared and synchronize against GUP-fast by a proper clear/invalidate+flush to make the following scenario impossible:

1. try_to_migrate() places a migration entry after checking for GUP pins and marks the page possibly shared.
2. GUP-fast pins the page due to lack of synchronization.
3. fork() converts the "writable/readable-exclusive" migration entry into a readable migration entry.
4. Migration fails due to the GUP pin (failing to freeze the refcount).
5. Migration entries are restored. PG_anon_exclusive is lost.

-> We have a pinned page that is not marked exclusive anymore.

Note that we move information about exclusivity from the page to the migration entry as it otherwise highly overcomplicates fork() and PTE-mapping a THP.

III. Swapout and GUP-fast

Whenever replacing a present page table entry that maps an exclusive anonymous page by a swap entry, we have to mark the page possibly shared and synchronize against GUP-fast by a proper clear/invalidate+flush to make the following scenario impossible:

1. try_to_unmap() places a swap entry after checking for GUP pins and clears exclusivity information on the page.
2. GUP-fast pins the page due to lack of synchronization.

-> We have a pinned page that is not marked exclusive anymore.

If we'd ever store information about exclusivity in the swap entry, similar to migration handling, the same considerations as in II would apply. This is future work.

Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1 parent 78fbe90 commit 6c28760

11 files changed (+289, -34 lines)


include/linux/rmap.h (+40)

@@ -228,6 +228,13 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
 {
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 
+	/*
+	 * No need to check+clear for already shared pages, including KSM
+	 * pages.
+	 */
+	if (!PageAnonExclusive(page))
+		goto dup;
+
 	/*
 	 * If this page may have been pinned by the parent process,
 	 * don't allow to duplicate the mapping but instead require to e.g.,
@@ -239,14 +246,47 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
 	    unlikely(page_needs_cow_for_dma(vma, page))))
 		return -EBUSY;
 
+	ClearPageAnonExclusive(page);
 	/*
 	 * It's okay to share the anon page between both processes, mapping
 	 * the page R/O into both processes.
 	 */
+dup:
 	__page_dup_rmap(page, compound);
 	return 0;
 }
 
+/**
+ * page_try_share_anon_rmap - try marking an exclusive anonymous page possibly
+ *			      shared to prepare for KSM or temporary unmapping
+ * @page: the exclusive anonymous page to try marking possibly shared
+ *
+ * The caller needs to hold the PT lock and has to have the page table entry
+ * cleared/invalidated+flushed, to properly sync against GUP-fast.
+ *
+ * This is similar to page_try_dup_anon_rmap(), however, not used during fork()
+ * to duplicate a mapping, but instead to prepare for KSM or temporarily
+ * unmapping a page (swap, migration) via page_remove_rmap().
+ *
+ * Marking the page shared can only fail if the page may be pinned; device
+ * private pages cannot get pinned and consequently this function cannot fail.
+ *
+ * Returns 0 if marking the page possibly shared succeeded. Returns -EBUSY
+ * otherwise.
+ */
+static inline int page_try_share_anon_rmap(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
+
+	/* See page_try_dup_anon_rmap(). */
+	if (likely(!is_device_private_page(page) &&
+	    unlikely(page_maybe_dma_pinned(page))))
+		return -EBUSY;
+
+	ClearPageAnonExclusive(page);
+	return 0;
+}
+
 /*
  * Called from mm/vmscan.c to handle paging out
  */
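The ordering requirement in the kerneldoc above (PT lock held, page table entry cleared/invalidated+flushed before trying to mark the page shared) is easiest to see in a caller. Below is a hypothetical PTE-level sketch of that pattern; the same sequence appears at the PMD level in the set_pmd_migration_entry() hunk further down. The helper name and the simplified error handling are illustrative assumptions, not code from this commit.

#include <linux/mm.h>
#include <linux/rmap.h>
#include <linux/pgtable.h>

/*
 * Hypothetical caller of page_try_share_anon_rmap(): the PTE is cleared and
 * flushed *before* the attempt to mark the page possibly shared, so GUP-fast
 * can no longer grab a new pin in between. On failure, the original entry is
 * restored and the operation (migration/swapout) is aborted for this page.
 */
static bool try_unmap_anon_pte_sketch(struct vm_area_struct *vma,
				      struct page *page,
				      unsigned long address, pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t pteval;

	/* Assumes the caller holds the PT lock. */
	pteval = ptep_clear_flush(vma, address, ptep);

	if (PageAnon(page) && PageAnonExclusive(page) &&
	    page_try_share_anon_rmap(page)) {
		/* The page may be pinned: restore the entry and give up. */
		set_pte_at(mm, address, ptep, pteval);
		return false;
	}

	/* Safe to install a migration/swap entry for @page now. */
	return true;
}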

include/linux/swap.h (+11, -4)

@@ -78,12 +78,19 @@ static inline int current_is_kswapd(void)
 #endif
 
 /*
- * NUMA node memory migration support
+ * Page migration support.
+ *
+ * SWP_MIGRATION_READ_EXCLUSIVE is only applicable to anonymous pages and
+ * indicates that the referenced (part of) an anonymous page is exclusive to
+ * a single process. For SWP_MIGRATION_WRITE, that information is implicit:
+ * (part of) an anonymous page that are mapped writable are exclusive to a
+ * single process.
  */
 #ifdef CONFIG_MIGRATION
-#define SWP_MIGRATION_NUM 2
-#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
+#define SWP_MIGRATION_NUM 3
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_READ_EXCLUSIVE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 2)
 #else
 #define SWP_MIGRATION_NUM 0
 #endif

include/linux/swapops.h (+25)

@@ -194,6 +194,7 @@ static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
 static inline int is_migration_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
+			swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE ||
 			swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
@@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
 	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
+static inline int is_readable_migration_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_MIGRATION_READ);
+}
+
+static inline int is_readable_exclusive_migration_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE);
+}
+
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
 	return swp_entry(SWP_MIGRATION_READ, offset);
 }
 
+static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_MIGRATION_READ_EXCLUSIVE, offset);
+}
+
 static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 {
 	return swp_entry(SWP_MIGRATION_WRITE, offset);
@@ -224,6 +240,11 @@ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 	return swp_entry(0, 0);
 }
 
+static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
 static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 {
 	return swp_entry(0, 0);
@@ -244,6 +265,10 @@ static inline int is_writable_migration_entry(swp_entry_t entry)
 {
 	return 0;
 }
+static inline int is_readable_migration_entry(swp_entry_t entry)
+{
+	return 0;
+}
 
 #endif

mm/huge_memory.c (+71, -7)

@@ -1054,7 +1054,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
 
 		VM_BUG_ON(!is_pmd_migration_entry(pmd));
-		if (is_writable_migration_entry(entry)) {
+		if (!is_readable_migration_entry(entry)) {
 			entry = make_readable_migration_entry(
 							swp_offset(entry));
 			pmd = swp_entry_to_pmd(entry);
@@ -1292,6 +1292,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	/* Early check when only holding the PT lock. */
+	if (PageAnonExclusive(page))
+		goto reuse;
+
 	if (!trylock_page(page)) {
 		get_page(page);
 		spin_unlock(vmf->ptl);
@@ -1306,6 +1310,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		put_page(page);
 	}
 
+	/* Recheck after temporarily dropping the PT lock. */
+	if (PageAnonExclusive(page)) {
+		unlock_page(page);
+		goto reuse;
+	}
+
 	/*
 	 * See do_wp_page(): we can only map the page writable if there are
 	 * no additional references. Note that we always drain the LRU
@@ -1319,11 +1329,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		pmd_t entry;
 
 		page_move_anon_rmap(page, vma);
+		unlock_page(page);
+reuse:
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
-		unlock_page(page);
 		spin_unlock(vmf->ptl);
 		return VM_FAULT_WRITE;
 	}
@@ -1708,6 +1719,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	if (is_swap_pmd(*pmd)) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
+		struct page *page = pfn_swap_entry_to_page(entry);
 
 		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
 		if (is_writable_migration_entry(entry)) {
@@ -1716,8 +1728,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			 * A protection check is difficult so
 			 * just be safe and disable write
 			 */
-			entry = make_readable_migration_entry(
-							swp_offset(entry));
+			if (PageAnon(page))
+				entry = make_readable_exclusive_migration_entry(swp_offset(entry));
+			else
+				entry = make_readable_migration_entry(swp_offset(entry));
 			newpmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*pmd))
 				newpmd = pmd_swp_mksoft_dirty(newpmd);
@@ -1937,6 +1951,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
+	bool anon_exclusive = false;
 	unsigned long addr;
 	int i;
 
@@ -2018,6 +2033,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_swap_entry_to_page(entry);
 		write = is_writable_migration_entry(entry);
+		if (PageAnon(page))
+			anon_exclusive = is_readable_exclusive_migration_entry(entry);
 		young = false;
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
@@ -2029,8 +2046,26 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		young = pmd_young(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
 		uffd_wp = pmd_uffd_wp(old_pmd);
+
 		VM_BUG_ON_PAGE(!page_count(page), page);
 		page_ref_add(page, HPAGE_PMD_NR - 1);
+
+		/*
+		 * Without "freeze", we'll simply split the PMD, propagating the
+		 * PageAnonExclusive() flag for each PTE by setting it for
+		 * each subpage -- no need to (temporarily) clear.
+		 *
+		 * With "freeze" we want to replace mapped pages by
+		 * migration entries right away. This is only possible if we
+		 * managed to clear PageAnonExclusive() -- see
+		 * set_pmd_migration_entry().
+		 *
+		 * In case we cannot clear PageAnonExclusive(), split the PMD
+		 * only and let try_to_migrate_one() fail later.
+		 */
+		anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
+		if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
+			freeze = false;
 	}
 
 	/*
@@ -2052,6 +2087,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			if (write)
 				swp_entry = make_writable_migration_entry(
 							page_to_pfn(page + i));
+			else if (anon_exclusive)
+				swp_entry = make_readable_exclusive_migration_entry(
+							page_to_pfn(page + i));
 			else
 				swp_entry = make_readable_migration_entry(
 							page_to_pfn(page + i));
@@ -2063,6 +2101,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		} else {
 			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
 			entry = maybe_mkwrite(entry, vma);
+			if (anon_exclusive)
+				SetPageAnonExclusive(page + i);
 			if (!write)
 				entry = pte_wrprotect(entry);
 			if (!young)
@@ -2294,6 +2334,13 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 *
 	 * After successful get_page_unless_zero() might follow flags change,
 	 * for example lock_page() which set PG_waiters.
+	 *
+	 * Note that for mapped sub-pages of an anonymous THP,
+	 * PG_anon_exclusive has been cleared in unmap_page() and is stored in
+	 * the migration entry instead from where remap_page() will restore it.
+	 * We can still have PG_anon_exclusive set on effectively unmapped and
+	 * unreferenced sub-pages of an anonymous THP: we can simply drop
+	 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
 	 */
 	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	page_tail->flags |= (head->flags &
@@ -3025,6 +3072,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	struct vm_area_struct *vma = pvmw->vma;
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address = pvmw->address;
+	bool anon_exclusive;
 	pmd_t pmdval;
 	swp_entry_t entry;
 	pmd_t pmdswp;
@@ -3034,10 +3082,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 
 	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
 	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+
+	anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
+	if (anon_exclusive && page_try_share_anon_rmap(page)) {
+		set_pmd_at(mm, address, pvmw->pmd, pmdval);
+		return;
+	}
+
 	if (pmd_dirty(pmdval))
 		set_page_dirty(page);
 	if (pmd_write(pmdval))
 		entry = make_writable_migration_entry(page_to_pfn(page));
+	else if (anon_exclusive)
+		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
 	else
 		entry = make_readable_migration_entry(page_to_pfn(page));
 	pmdswp = swp_entry_to_pmd(entry);
@@ -3071,10 +3128,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
 
-	if (PageAnon(new))
-		page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND);
-	else
+	if (PageAnon(new)) {
+		rmap_t rmap_flags = RMAP_COMPOUND;
+
+		if (!is_readable_migration_entry(entry))
+			rmap_flags |= RMAP_EXCLUSIVE;
+
+		page_add_anon_rmap(new, vma, mmun_start, rmap_flags);
+	} else {
 		page_add_file_rmap(new, vma, true);
+	}
+	VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new));
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
 
 	/* No need to invalidate - it was non-present before */

mm/hugetlb.c (+11, -4)

@@ -4790,7 +4790,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				    is_hugetlb_entry_hwpoisoned(entry))) {
 			swp_entry_t swp_entry = pte_to_swp_entry(entry);
 
-			if (is_writable_migration_entry(swp_entry) && cow) {
+			if (!is_readable_migration_entry(swp_entry) && cow) {
 				/*
 				 * COW mappings require pages in both
 				 * parent and child to be set to read.
@@ -5190,6 +5190,8 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		set_huge_ptep_writable(vma, haddr, ptep);
 		return 0;
 	}
+	VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page),
+		       old_page);
 
 	/*
 	 * If the process that created a MAP_PRIVATE mapping is about to
@@ -6187,12 +6189,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 		}
 		if (unlikely(is_hugetlb_entry_migration(pte))) {
 			swp_entry_t entry = pte_to_swp_entry(pte);
+			struct page *page = pfn_swap_entry_to_page(entry);
 
-			if (is_writable_migration_entry(entry)) {
+			if (!is_readable_migration_entry(entry)) {
 				pte_t newpte;
 
-				entry = make_readable_migration_entry(
-							swp_offset(entry));
+				if (PageAnon(page))
+					entry = make_readable_exclusive_migration_entry(
+								swp_offset(entry));
+				else
+					entry = make_readable_migration_entry(
+								swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
 				set_huge_swap_pte_at(mm, address, ptep,
 						     newpte, huge_page_size(h));
