
Commit c98d2ec

Author: Alexander Gordeev <agordeev@linux.ibm.com> (committed)
s390/mm: Uncouple physical vs virtual address spaces
The uncoupling of physical vs virtual address spaces brings the following benefits to s390:

- virtual memory layout flexibility;
- closes the address gap between kernel and modules, which caused s390-only problems in the past (e.g. 'perf' bugs);
- allows getting rid of trampolines used for module calls into the kernel;
- allows simplifying the BPF trampoline;
- minor performance improvement in branch prediction;
- kernel randomization entropy is orders of magnitude bigger, as it is derived from the amount of available virtual, not physical, memory.

The whole change can be described with the two pictures below: before and after the change. Some aspects of the virtual memory layout setup are not clarified (number of page levels, alignment, DMA memory), since these are not a part of this change or are secondary to how the uncoupling itself is implemented. The focus of the pictures is to explain why the __va() and __pa() macros are implemented the way they are.

Memory layout in V==R mode:

  |    Physical      |    Virtual       |
  +- 0 --------------+- 0 --------------+ identity mapping start
  |                  | S390_lowcore     | Low-address memory
  |                  +- 8 KB -----------+
  |                  |                  |
  | identity         | phys == virt     |
  | mapping          | virt == phys     |
  |                  |                  |
  +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
  |.amode31 text/data|.amode31 text/data|
  +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end
  |                  |                  |
  |                  |                  |
  +- __kaslr_offset, __kaslr_offset_phys| kernel rand. phys/virt start
  |                  |                  |
  | kernel text/data | kernel text/data | phys == kvirt
  |                  |                  |
  +------------------+------------------+ kernel phys/virt end
  |                  |                  |
  |                  |                  |
  |                  |                  |
  |                  |                  |
  +- ident_map_size -+- ident_map_size -+ identity mapping end
                     |                  |
                     | ... unused gap   |
                     |                  |
                     +---- vmemmap -----+ 'struct page' array start
                     |                  |
                     | virtually mapped |
                     | memory map       |
                     |                  |
                     +- __abs_lowcore --+
                     |                  |
                     | Absolute Lowcore |
                     |                  |
                     +- __memcpy_real_area
                     |                  |
                     | Real Memory Copy |
                     |                  |
                     +- VMALLOC_START --+ vmalloc area start
                     |                  |
                     | vmalloc area     |
                     |                  |
                     +- MODULES_VADDR --+ modules area start
                     |                  |
                     | modules area     |
                     |                  |
                     +------------------+ UltraVisor Secure Storage limit
                     |                  |
                     | ... unused gap   |
                     |                  |
                     +KASAN_SHADOW_START+ KASAN shadow memory start
                     |                  |
                     | KASAN shadow     |
                     |                  |
                     +------------------+ ASCE limit

Memory layout in V!=R mode:

  |    Physical      |    Virtual       |
  +- 0 --------------+- 0 --------------+
  |                  | S390_lowcore     | Low-address memory
  |                  +- 8 KB -----------+
  |                  |                  |
  |                  |                  |
  |                  | ... unused gap   |
  |                  |                  |
  +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
  |.amode31 text/data|.amode31 text/data|
  +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB)
  |                  |                  |
  |                  |                  |
  +- __kaslr_offset_phys                | kernel rand. phys start
  |                  |                  |
  | kernel text/data |                  |
  |                  |                  |
  +------------------+                  | kernel phys end
  |                  |                  |
  |                  |                  |
  |                  |                  |
  |                  |                  |
  +- ident_map_size -+                  |
                     |                  |
                     | ... unused gap   |
                     |                  |
                     +- __identity_base + identity mapping start (>= 2GB)
                     |                  |
                     | identity         | phys == virt - __identity_base
                     | mapping          | virt == phys + __identity_base
                     |                  |
                     |                  |
                     |                  |
                     |                  |
                     |                  |
                     |                  |
                     |                  |
                     |                  |
                     +---- vmemmap -----+ 'struct page' array start
                     |                  |
                     | virtually mapped |
                     | memory map       |
                     |                  |
                     +- __abs_lowcore --+
                     |                  |
                     | Absolute Lowcore |
                     |                  |
                     +- __memcpy_real_area
                     |                  |
                     | Real Memory Copy |
                     |                  |
                     +- VMALLOC_START --+ vmalloc area start
                     |                  |
                     | vmalloc area     |
                     |                  |
                     +- MODULES_VADDR --+ modules area start
                     |                  |
                     | modules area     |
                     |                  |
                     +- __kaslr_offset -+ kernel rand. virt start
                     |                  |
                     | kernel text/data | phys == (kvirt - __kaslr_offset) +
                     |                  |         __kaslr_offset_phys
                     +- kernel .bss end + kernel rand. virt end
                     |                  |
                     | ... unused gap   |
                     |                  |
                     +------------------+ UltraVisor Secure Storage limit
                     |                  |
                     | ... unused gap   |
                     |                  |
                     +KASAN_SHADOW_START+ KASAN shadow memory start
                     |                  |
                     | KASAN shadow     |
                     |                  |
                     +------------------+ ASCE limit

Unused gaps in the virtual memory layout could be present or not - depending on how a particular system is configured. No page tables are created for the unused gaps.

The relative order of vmalloc, modules and kernel image in virtual memory is defined by the following considerations:

- the start of the modules area and the end of the kernel should reside within 4GB to accommodate relative 32-bit jumps. The best way to achieve that is to place the kernel next to the modules;

- the vmalloc and module areas should be located next to each other to prevent failures and extra rework in user-level tools (makedumpfile, crash, etc.) which treat vmalloc and module addresses similarly;

- the kernel needs to be the last area in the virtual memory layout to easily distinguish between kernel and non-kernel virtual addresses. That is needed to (again) simplify the handling of addresses in user-level tools and to make the __pa() macro faster (see below).

Concluding from the above, the relative order of the considered virtual areas in memory is: vmalloc - modules - kernel. Therefore, the only change to the current memory layout is moving the kernel to the end of the virtual address space.

With that approach the implementation of the __pa() macro is straightforward - all linear virtual addresses less than the kernel base are considered identity mapping:

	phys == virt - __identity_base

All addresses greater than the kernel base are kernel ones:

	phys == (kvirt - __kaslr_offset) + __kaslr_offset_phys

By contrast, the __va() macro deals only with identity mapping addresses:

	virt == phys + __identity_base

The .amode31 section is mapped separately and is not covered by the __pa() macro. In fact, it could have been handled easily by checking whether a virtual address is within the section or not, but there is no need for that. Thus, let the __pa() code spend as few machine cycles as possible.

The KASAN shadow memory is located at the very end of the virtual memory layout, at addresses higher than the kernel. However, that is not a linear mapping and no code other than KASAN instrumentation or API is expected to access it.

When KASLR mode is enabled the kernel base address is randomized within a memory window that spans the whole unused virtual address space. The size of that window depends on the amount of physical memory available to the system, the limit imposed by UltraVisor (if present) and the vmalloc area size as provided by the vmalloc= kernel command line parameter. In case the virtual memory is exhausted the minimum size of the randomization window is forcefully set to 2GB, which amounts to 15 bits of entropy if KASAN is enabled or 17 bits of entropy in the default configuration.

The default kernel offset 0x100000 is used as a magic value both in the decompressor code and the vmlinux linker script, but it will be removed with a follow-up change.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
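To make the translation rules above concrete, here is a minimal C sketch of the two conversions in V!=R mode. It follows the commit text rather than the literal arch/s390 macros; the function names are illustrative only:

    /* Sketch of the V!=R translation rules; illustration, not kernel code */
    extern unsigned long __identity_base;       /* identity mapping virt start */
    extern unsigned long __kaslr_offset;        /* kernel rand. virt start */
    extern unsigned long __kaslr_offset_phys;   /* kernel rand. phys start */

    static unsigned long example_pa(unsigned long vaddr)
    {
            /* the kernel is the last virtual area, so one compare suffices */
            if (vaddr >= __kaslr_offset)
                    return vaddr - __kaslr_offset + __kaslr_offset_phys;
            /* everything below the kernel base is identity mapping */
            return vaddr - __identity_base;
    }

    static void *example_va(unsigned long paddr)
    {
            /* __va() deals only with identity mapping addresses */
            return (void *)(paddr + __identity_base);
    }

The entropy figures follow from the slot granularity: assuming the kernel base is placed with THREAD_SIZE granularity of 16 KB by default and 64 KB with KASAN (an assumption consistent with the numbers above), a 2GB window gives 2^31 / 2^14 = 2^17 positions, i.e. 17 bits, or 2^31 / 2^16 = 2^15 positions, i.e. 15 bits, respectively.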
1 parent f4cac27 commit c98d2ec

File tree

9 files changed: +241 -68 lines changed

Documentation/arch/s390/index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ s390 Architecture
    cds
    3270
    driver-model
+   mm
    monreader
    qeth
    s390dbf

Documentation/arch/s390/mm.rst

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Memory Management
+=================
+
+Virtual memory layout
+=====================
+
+.. note::
+
+ - Some aspects of the virtual memory layout setup are not
+   clarified (number of page levels, alignment, DMA memory).
+
+ - Unused gaps in the virtual memory layout could be present
+   or not - depending on how a particular system is configured.
+   No page tables are created for the unused gaps.
+
+ - The virtual memory regions are tracked or untracked by KASAN
+   instrumentation, as well as the KASAN shadow memory itself is
+   created only when CONFIG_KASAN configuration option is enabled.
+
+::
+
+  =============================================================================
+  |    Physical      |    Virtual       | VM area description
+  =============================================================================
+  +- 0 --------------+- 0 --------------+
+  |                  | S390_lowcore     | Low-address memory
+  |                  +- 8 KB -----------+
+  |                  |                  |
+  |                  |                  |
+  |                  | ... unused gap   | KASAN untracked
+  |                  |                  |
+  +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
+  |.amode31 text/data|.amode31 text/data| KASAN untracked
+  +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB)
+  |                  |                  |
+  |                  |                  |
+  +- __kaslr_offset_phys                | kernel rand. phys start
+  |                  |                  |
+  | kernel text/data |                  |
+  |                  |                  |
+  +------------------+                  | kernel phys end
+  |                  |                  |
+  |                  |                  |
+  |                  |                  |
+  |                  |                  |
+  +- ident_map_size -+                  |
+                     |                  |
+                     | ... unused gap   | KASAN untracked
+                     |                  |
+                     +- __identity_base + identity mapping start (>= 2GB)
+                     |                  |
+                     | identity         | phys == virt - __identity_base
+                     | mapping          | virt == phys + __identity_base
+                     |                  |
+                     |                  | KASAN tracked
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     |                  |
+                     +---- vmemmap -----+ 'struct page' array start
+                     |                  |
+                     | virtually mapped |
+                     | memory map       | KASAN untracked
+                     |                  |
+                     +- __abs_lowcore --+
+                     |                  |
+                     | Absolute Lowcore | KASAN untracked
+                     |                  |
+                     +- __memcpy_real_area
+                     |                  |
+                     | Real Memory Copy | KASAN untracked
+                     |                  |
+                     +- VMALLOC_START --+ vmalloc area start
+                     |                  | KASAN untracked or
+                     | vmalloc area     | KASAN shallowly populated in case
+                     |                  | CONFIG_KASAN_VMALLOC=y
+                     +- MODULES_VADDR --+ modules area start
+                     |                  | KASAN allocated per module or
+                     | modules area     | KASAN shallowly populated in case
+                     |                  | CONFIG_KASAN_VMALLOC=y
+                     +- __kaslr_offset -+ kernel rand. virt start
+                     |                  | KASAN tracked
+                     | kernel text/data | phys == (kvirt - __kaslr_offset) +
+                     |                  |         __kaslr_offset_phys
+                     +- kernel .bss end + kernel rand. virt end
+                     |                  |
+                     | ... unused gap   | KASAN untracked
+                     |                  |
+                     +------------------+ UltraVisor Secure Storage limit
+                     |                  |
+                     | ... unused gap   | KASAN untracked
+                     |                  |
+                     +KASAN_SHADOW_START+ KASAN shadow memory start
+                     |                  |
+                     | KASAN shadow     | KASAN untracked
+                     |                  |
+                     +------------------+ ASCE limit

arch/s390/boot/boot.h

Lines changed: 6 additions & 1 deletion
@@ -74,10 +74,11 @@ void sclp_early_setup_buffer(void);
 void print_pgm_check_info(void);
 unsigned long randomize_within_range(unsigned long size, unsigned long align,
 				     unsigned long min, unsigned long max);
-void setup_vmem(unsigned long asce_limit);
+void setup_vmem(unsigned long kernel_start, unsigned long kernel_end, unsigned long asce_limit);
 void __printf(1, 2) decompressor_printk(const char *fmt, ...);
 void print_stacktrace(unsigned long sp);
 void error(char *m);
+int get_random(unsigned long limit, unsigned long *value);

 extern struct machine_info machine;

@@ -98,6 +99,10 @@ extern struct vmlinux_info _vmlinux_info;
 #define vmlinux _vmlinux_info

 #define __abs_lowcore_pa(x)	(((unsigned long)(x) - __abs_lowcore) % sizeof(struct lowcore))
+#define __kernel_va(x)		((void *)((unsigned long)(x) - __kaslr_offset_phys + __kaslr_offset))
+#define __kernel_pa(x)		((unsigned long)(x) - __kaslr_offset + __kaslr_offset_phys)
+#define __identity_va(x)	((void *)((unsigned long)(x) + __identity_base))
+#define __identity_pa(x)	((unsigned long)(x) - __identity_base)

 static inline bool intersects(unsigned long addr0, unsigned long size0,
 			      unsigned long addr1, unsigned long size1)
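A hedged usage sketch of the helpers added above; the addresses are made up, only the helper names and translation rules come from this hunk:

    unsigned long phys = 0x12345000UL;  /* hypothetical physical address */
    void *virt;

    virt = __identity_va(phys);         /* virt == phys + __identity_base */
    phys = __identity_pa(virt);         /* and back again */

    /* addresses inside the decompressed kernel image use the kernel pair */
    void *kvirt = __kernel_va(__kaslr_offset_phys); /* == (void *)__kaslr_offset */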

arch/s390/boot/kaslr.c

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ static int check_prng(void)
 	return PRNG_MODE_TDES;
 }

-static int get_random(unsigned long limit, unsigned long *value)
+int get_random(unsigned long limit, unsigned long *value)
 {
 	struct prng_parm prng = {
 		/* initial parameter block for tdes mode, copied from libica */
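get_random() loses its static qualifier so that the memory layout code in startup.c can draw randomness for the virtual kernel base; its call pattern there (taken from the startup.c hunk below) is:

    unsigned long pos;

    if (get_random(slots, &pos))        /* non-zero return: no usable entropy */
            pos = 0;                    /* fall back to the topmost position */
    kernel_end -= pos * THREAD_SIZE;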

arch/s390/boot/startup.c

Lines changed: 43 additions & 12 deletions
@@ -203,7 +203,7 @@ static void kaslr_adjust_relocs(unsigned long min_addr, unsigned long max_addr,

 	/* Adjust R_390_64 relocations */
 	for (reloc = vmlinux_relocs_64_start; reloc < vmlinux_relocs_64_end; reloc++) {
-		loc = (long)*reloc + offset;
+		loc = (long)*reloc + phys_offset;
 		if (loc < min_addr || loc > max_addr)
 			error("64-bit relocation outside of kernel!\n");
 		*(u64 *)loc += offset;
@@ -263,8 +263,25 @@ static void setup_ident_map_size(unsigned long max_physmem_end)
 #endif
 }

-static unsigned long setup_kernel_memory_layout(void)
+#define FIXMAP_SIZE	round_up(MEMCPY_REAL_SIZE + ABS_LOWCORE_MAP_SIZE, sizeof(struct lowcore))
+
+static unsigned long get_vmem_size(unsigned long identity_size,
+				   unsigned long vmemmap_size,
+				   unsigned long vmalloc_size,
+				   unsigned long rte_size)
+{
+	unsigned long max_mappable, vsize;
+
+	max_mappable = max(identity_size, MAX_DCSS_ADDR);
+	vsize = round_up(SZ_2G + max_mappable, rte_size) +
+		round_up(vmemmap_size, rte_size) +
+		FIXMAP_SIZE + MODULES_LEN + KASLR_LEN;
+	return size_add(vsize, vmalloc_size);
+}
+
+static unsigned long setup_kernel_memory_layout(unsigned long kernel_size)
 {
+	unsigned long kernel_start, kernel_end;
 	unsigned long vmemmap_start;
 	unsigned long asce_limit;
 	unsigned long rte_size;
@@ -277,12 +294,11 @@ static unsigned long setup_kernel_memory_layout(void)
 	vmemmap_size = SECTION_ALIGN_UP(pages) * sizeof(struct page);

 	/* choose kernel address space layout: 4 or 3 levels. */
-	vsize = round_up(ident_map_size, _REGION3_SIZE) + vmemmap_size +
-		MODULES_LEN + MEMCPY_REAL_SIZE + ABS_LOWCORE_MAP_SIZE;
-	vsize = size_add(vsize, vmalloc_size);
+	vsize = get_vmem_size(ident_map_size, vmemmap_size, vmalloc_size, _REGION3_SIZE);
 	if (IS_ENABLED(CONFIG_KASAN) || (vsize > _REGION2_SIZE)) {
 		asce_limit = _REGION1_SIZE;
 		rte_size = _REGION2_SIZE;
+		vsize = get_vmem_size(ident_map_size, vmemmap_size, vmalloc_size, _REGION2_SIZE);
 	} else {
 		asce_limit = _REGION2_SIZE;
 		rte_size = _REGION3_SIZE;
@@ -298,12 +314,26 @@ static unsigned long setup_kernel_memory_layout(void)
 	/* force vmalloc and modules below kasan shadow */
 	vmax = min(vmax, KASAN_SHADOW_START);
 #endif
-	MODULES_END = round_down(vmax, _SEGMENT_SIZE);
+	kernel_end = vmax;
+	if (kaslr_enabled()) {
+		unsigned long kaslr_len, slots, pos;
+
+		vsize = min(vsize, vmax);
+		kaslr_len = max(KASLR_LEN, vmax - vsize);
+		slots = DIV_ROUND_UP(kaslr_len - kernel_size, THREAD_SIZE);
+		if (get_random(slots, &pos))
+			pos = 0;
+		kernel_end -= pos * THREAD_SIZE;
+	}
+	kernel_start = round_down(kernel_end - kernel_size, THREAD_SIZE);
+	__kaslr_offset = kernel_start;
+
+	MODULES_END = round_down(kernel_start, _SEGMENT_SIZE);
 	MODULES_VADDR = MODULES_END - MODULES_LEN;
 	VMALLOC_END = MODULES_VADDR;

 	/* allow vmalloc area to occupy up to about 1/2 of the rest virtual space left */
-	vsize = (VMALLOC_END - (MEMCPY_REAL_SIZE + ABS_LOWCORE_MAP_SIZE)) / 2;
+	vsize = (VMALLOC_END - FIXMAP_SIZE) / 2;
 	vsize = round_down(vsize, _SEGMENT_SIZE);
 	vmalloc_size = min(vmalloc_size, vsize);
 	VMALLOC_START = VMALLOC_END - vmalloc_size;
@@ -330,6 +360,7 @@ static unsigned long setup_kernel_memory_layout(void)
 	BUILD_BUG_ON(MAX_DCSS_ADDR > (1UL << MAX_PHYSMEM_BITS));
 	max_mappable = max(ident_map_size, MAX_DCSS_ADDR);
 	max_mappable = min(max_mappable, vmemmap_start);
+	__identity_base = round_down(vmemmap_start - max_mappable, rte_size);

 	return asce_limit;
 }
@@ -358,7 +389,6 @@ static void setup_vmalloc_size(void)

 static void kaslr_adjust_vmlinux_info(unsigned long offset)
 {
-	*(unsigned long *)(&vmlinux.entry) += offset;
 	vmlinux.bootdata_off += offset;
 	vmlinux.bootdata_preserved_off += offset;
 #ifdef CONFIG_PIE_BUILD
@@ -386,6 +416,7 @@ void startup_kernel(void)
 	unsigned long max_physmem_end;
 	unsigned long vmlinux_lma = 0;
 	unsigned long amode31_lma = 0;
+	unsigned long kernel_size;
 	unsigned long asce_limit;
 	unsigned long safe_addr;
 	void *img;
@@ -417,7 +448,8 @@ void startup_kernel(void)
 	max_physmem_end = detect_max_physmem_end();
 	setup_ident_map_size(max_physmem_end);
 	setup_vmalloc_size();
-	asce_limit = setup_kernel_memory_layout();
+	kernel_size = vmlinux.default_lma + vmlinux.image_size + vmlinux.bss_size;
+	asce_limit = setup_kernel_memory_layout(kernel_size);
 	/* got final ident_map_size, physmem allocations could be performed now */
 	physmem_set_usable_limit(ident_map_size);
 	detect_physmem_online_ranges(max_physmem_end);
@@ -432,7 +464,6 @@ void startup_kernel(void)
 		if (vmlinux_lma) {
 			__kaslr_offset_phys = vmlinux_lma - vmlinux.default_lma;
 			kaslr_adjust_vmlinux_info(__kaslr_offset_phys);
-			__kaslr_offset = __kaslr_offset_phys;
 		}
 	}
 	vmlinux_lma = vmlinux_lma ?: vmlinux.default_lma;
@@ -472,7 +503,7 @@ void startup_kernel(void)
 		    __kaslr_offset, __kaslr_offset_phys);
 	kaslr_adjust_got(__kaslr_offset);
 	free_relocs();
-	setup_vmem(asce_limit);
+	setup_vmem(__kaslr_offset, __kaslr_offset + kernel_size, asce_limit);
 	copy_bootdata();

 	/*
@@ -484,7 +515,7 @@ void startup_kernel(void)
 	/*
 	 * Jump to the decompressed kernel entry point and switch DAT mode on.
 	 */
-	psw.addr = vmlinux.entry;
+	psw.addr = __kaslr_offset + vmlinux.entry;
 	psw.mask = PSW_KERNEL_BITS;
 	__load_psw(psw);
 }
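For a feel of the numbers in the new randomization path, a worked example with assumed values - a 2GB window, a 64MB kernel and 16KB THREAD_SIZE, all three illustrative:

    unsigned long kaslr_len   = 0x80000000UL;   /* 2 GB randomization window */
    unsigned long kernel_size = 0x04000000UL;   /* 64 MB kernel image */
    unsigned long thread_size = 0x4000UL;       /* 16 KB THREAD_SIZE */

    /* slots = DIV_ROUND_UP(kaslr_len - kernel_size, THREAD_SIZE) */
    unsigned long slots = (kaslr_len - kernel_size + thread_size - 1) / thread_size;
    /* slots == 126976, roughly 17 bits of entropy for the kernel base */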
