Investigate boot freeze on AMD Zen #815

Closed
Tracked by #651
serban300 opened this issue Jan 7, 2019 · 9 comments

@serban300
Contributor

Here is the full boot log:

[    0.000000] Linux version 4.14.91+ (serban@serban) (gcc version 8.2.0 (Ubuntu 8.2.0-7ubuntu1)) #8 SMP Sun Jan 6 12:52:12 EET 2019
[    0.000000] Command line: console=ttyS0 debug ignore_loglevel rescue
[    0.000000] [Firmware Bug]: TSC doesn't count with P0 frequency!
[    0.000000] unchecked MSR access error: RDMSR from 0x10a at rIP: 0xffffffff81037b26 (native_read_msr+0x6/0x20)
[    0.000000] Call Trace:
[    0.000000]  early_cpu_init+0x118/0x1eb
[    0.000000]  setup_arch+0xc6/0xac9
[    0.000000]  ? printk+0x53/0x6a
[    0.000000]  start_kernel+0x68/0x48d
[    0.000000]  x86_64_start_reservations+0x29/0x2b
[    0.000000]  x86_64_start_kernel+0x71/0x74
[    0.000000]  secondary_startup_64+0xa5/0xb0
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000007ffffff] usable
[    0.000000] debug: ignoring loglevel setting.
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] Hypervisor detected: KVM
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] e820: last_pfn = 0x8000 max_arch_pfn = 0x400000000
[    0.000000] MTRR default type: uncachable
[    0.000000] MTRR fixed ranges disabled:
[    0.000000]   00000-FFFFF uncachable
[    0.000000] MTRR variable ranges disabled:
[    0.000000]   0 disabled
[    0.000000]   1 disabled
[    0.000000]   2 disabled
[    0.000000]   3 disabled
[    0.000000]   4 disabled
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000] MTRR: Disabled
[    0.000000] x86/PAT: MTRRs disabled, skipping PAT initialization too.
[    0.000000] CPU MTRRs all blank - virtualized system.
[    0.000000] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC  
[    0.000000] found SMP MP-table at [mem 0x0009fc00-0x0009fc0f] mapped at [ffffffffff200c00]
[    0.000000] Scanning 1 areas for low memory corruption
[    0.000000] Base memory trampoline at [ffff888000099000] 99000 size 24576
[    0.000000] Using GB pages for direct mapping
[    0.000000] BRK [0x020af000, 0x020affff] PGTABLE
[    0.000000] BRK [0x020b0000, 0x020b0fff] PGTABLE
[    0.000000] BRK [0x020b1000, 0x020b1fff] PGTABLE
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x0000000007ffffff]
[    0.000000] NODE_DATA(0) allocated [mem 0x07fde000-0x07ffffff]
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000000] kvm-clock: cpu 0, msr 0:7fdc001, primary cpu clock
[    0.000000] kvm-clock: using sched offset of 29487892130 cycles
[    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   DMA32    [mem 0x0000000001000000-0x0000000007ffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x0000000007ffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x0000000007ffffff]
[    0.000000] On node 0 totalpages: 32670
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3998 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 448 pages used for memmap
[    0.000000]   DMA32 zone: 28672 pages, LIFO batch:7
[    0.000000] Intel MultiProcessor Specification v1.4
[    0.000000] MPTABLE: OEM ID: FC      
[    0.000000] MPTABLE: Product ID: 000000000000
[    0.000000] MPTABLE: APIC at: 0xFEE00000
[    0.000000] Processor #0 (Bootup-CPU)
[    0.000000] IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
[    0.000000] Processors: 1
[    0.000000] smpboot: Allowing 1 CPUs, 0 hotplug CPUs
[    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x000fffff]
[    0.000000] e820: [mem 0x08000000-0xffffffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on KVM
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.000000] random: get_random_bytes called from start_kernel+0x94/0x48d with crng_init=0
[    0.000000] setup_percpu: NR_CPUS:128 nr_cpumask_bits:128 nr_cpu_ids:1 nr_node_ids:1
[    0.000000] percpu: Embedded 42 pages/cpu @ffff888007c00000 s132760 r8192 d31080 u2097152
[    0.000000] pcpu-alloc: s132760 r8192 d31080 u2097152 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 0
[    0.000000] KVM setup async PF for cpu 0
[    0.000000] kvm-stealtime: cpu 0, msr 7c15040
[    0.000000] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes)
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 32137
[    0.000000] Policy zone: DMA32
[    0.000000] Kernel command line: console=ttyS0 debug ignore_loglevel rescue
[    0.000000] PID hash table entries: 512 (order: 0, 4096 bytes)
[    0.000000] Memory: 111068K/130680K available (8204K kernel code, 642K rwdata, 1576K rodata, 1280K init, 2784K bss, 19612K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.004000] Hierarchical RCU implementation.
[    0.004000]     RCU restricting CPUs from NR_CPUS=128 to nr_cpu_ids=1.
[    0.004000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[    0.004000] NR_IRQS: 4352, nr_irqs: 48, preallocated irqs: 16
[    0.004000] Console: colour dummy device 80x25
[    0.004000] console [ttyS0] enabled
[    0.004000] tsc: Detected 3393.624 MHz processor
[    0.004000] Calibrating delay loop (skipped) preset value.. 6787.24 BogoMIPS (lpj=13574496)
[    0.004000] pid_max: default: 32768 minimum: 301
[    0.004000] Security Framework initialized
[    0.004000] SELinux:  Initializing.
[    0.004000] SELinux:  Starting in permissive mode
[    0.004000] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
[    0.004060] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.004829] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.005571] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
[    1.024004] random: fast init done

[   24.832026] random: crng init done

After "random: crng init done" is printed, the instance freezes without throwing any exception.

@rn
Contributor

rn commented Jan 7, 2019

Does the AMD system have rdrand support? You can check on the host in /proc/cpuinfo in the flags section.

Recent Linux kernels refuse to boot if they don't get enough entropy.
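
The flag check can also be scripted. This is a minimal sketch in Rust that scans /proc/cpuinfo-style text for a flag; the function name `has_cpu_flag` is hypothetical and not part of any tool mentioned in this thread.

```rust
/// Returns true if any `flags` line in /proc/cpuinfo-style text lists `flag`
/// as a whole word. `has_cpu_flag` is an illustrative helper name.
pub fn has_cpu_flag(cpuinfo: &str, flag: &str) -> bool {
    cpuinfo
        .lines()
        // Each line looks like "key<tabs>: value"; skip lines without a colon.
        .filter_map(|line| line.split_once(':'))
        // Only consider the "flags" key, not "bugs" or "power management".
        .filter(|(key, _)| key.trim() == "flags")
        // Flags are whitespace-separated; require an exact match.
        .any(|(_, value)| value.split_whitespace().any(|f| f == flag))
}
```

On the host, the input would typically come from `std::fs::read_to_string("/proc/cpuinfo")`, followed by a call such as `has_cpu_flag(&text, "rdrand")`.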

@serban300
Contributor Author

Yes, it has rdrand:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD Ryzen 7 1700X Eight-Core Processor
stepping	: 1
microcode	: 0x8001137
cpu MHz		: 1837.717
cache size	: 512 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bugs		: sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 6786.19
TLB size	: 2560 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]

@serban300
Contributor Author

Here is where it freezes: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/kernel/cpu/intel_cacheinfo.c?h=v4.14.91#n631 (lines 631-636). It gets stuck in an infinite loop.

@serban300
Contributor Author

The kernel expects to find a subleaf where eax[4:0] == 0 (a null cache type) in order to stop iterating through the loop. The problem is that, no matter the value of the counter, cpuid_count(op, i, &eax, &ebx, &ecx, &edx); returns the same thing:

0x8000001d 0x00: eax=0x00004121 ebx=0x01c0003f ecx=0x0000003f edx=0x00000000

I tried to bypass this problem and the guest booted successfully.
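
To illustrate why the loop never ends, here is a minimal Rust model of the enumeration. It is a sketch, not the kernel code: `count_cache_leaves` and the `max_subleaves` cap are hypothetical, and the closure stands in for reading EAX of the cache-properties leaf at a given subleaf.

```rust
/// Cache type value that terminates the enumeration (EAX[4:0] == 0).
const CTYPE_NULL: u32 = 0;

/// Counts cache subleaves by probing increasing indices until the cache type
/// field is null. Unlike the guest kernel loop described above, this model
/// gives up after `max_subleaves` probes and returns `None`.
///
/// `cpuid_eax` is a stand-in for "EAX of the cache leaf at subleaf i".
pub fn count_cache_leaves(cpuid_eax: impl Fn(u32) -> u32, max_subleaves: u32) -> Option<u32> {
    // The index of the first null entry equals the number of valid leaves.
    (0..max_subleaves).find(|&i| (cpuid_eax(i) & 0x1f) == CTYPE_NULL)
}
```

With a source that returns 0x00004121 for every subleaf, as in the dump above, the type field is always 1 and the search never finds a terminator, so the model returns `None` where the real loop would spin forever.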

@serban300
Contributor Author

I just realized that if the host runs kernel 4.14 the issue doesn't occur. Here is why:

torvalds/linux@806793f

KVM wasn't enabling TOPOEXT by default at that point.

@serban300
Contributor Author

Another thing that seems to work is to set the KVM_CPUID_FLAG_SIGNIFCANT_INDEX flag for the 0x8000001d entries. This way the counter won't be ignored anymore.

I noticed that this is what KVM does natively for Intel CPUs: https://github.com/torvalds/linux/blob/4064e47c82810586975b4304b105056389beaa06/arch/x86/kvm/cpuid.c#L461
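
The effect of the flag can be shown with a simplified model of how a guest CPUID query is matched against the entry table. This is a sketch under assumptions: the `CpuidEntry` struct and `lookup` function are illustrative rather than KVM's actual code, and only the flag name and its bit value are taken from the KVM headers.

```rust
/// Bit 0 of the entry flags; the spelling follows the KVM header.
pub const KVM_CPUID_FLAG_SIGNIFCANT_INDEX: u32 = 1;

/// Illustrative subset of a CPUID table entry.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct CpuidEntry {
    pub function: u32,
    pub index: u32,
    pub flags: u32,
    pub eax: u32,
}

/// Returns the first entry that matches the query. The index (subleaf) is
/// compared only when the entry marks it as significant; otherwise any index
/// matches and the first entry for the function wins.
pub fn lookup(entries: &[CpuidEntry], function: u32, index: u32) -> Option<&CpuidEntry> {
    entries.iter().find(|e| {
        e.function == function
            && ((e.flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) == 0 || e.index == index)
    })
}
```

In this model, a table with two 0x8000001d entries (subleaf 0 with eax=0x4121, subleaf 1 with eax=0) answers a subleaf 1 query with subleaf 0's data unless the flag is set on the entries.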

@serban300
Contributor Author

To summarize, here are all the possible fixes that I found so far:

  1. Disabling TOPOEXT for the guest in Firecracker

  2. Setting the KVM_CPUID_FLAG_SIGNIFCANT_INDEX flag in Firecracker

  3. Submitting a KVM patch in order to treat the AMD cache property entries (0x8000001d) just like the Intel ones (0x4)
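
As a rough illustration of the first option, the sketch below clears the TOPOEXT feature bit before the CPUID table is handed to the guest. It assumes TOPOEXT is reported in ECX bit 22 of leaf 0x80000001 (as documented for AMD processors); the function name `mask_topoext` is hypothetical and not Firecracker's API.

```rust
/// Extended feature leaf that carries the TOPOEXT bit on AMD.
const EXT_FEATURES_LEAF: u32 = 0x8000_0001;
/// Position of TOPOEXT within ECX of that leaf (assumption stated above).
const TOPOEXT_BIT: u32 = 22;

/// Returns the ECX value to expose to the guest for the given leaf, with
/// TOPOEXT cleared on the extended feature leaf and untouched elsewhere.
pub fn mask_topoext(function: u32, ecx: u32) -> u32 {
    if function == EXT_FEATURES_LEAF {
        ecx & !(1 << TOPOEXT_BIT)
    } else {
        ecx
    }
}
```

With the bit hidden, a guest kernel that gates the 0x8000001d enumeration on TOPOEXT would not enter the loop at all, at the cost of losing the extended topology information.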

@bkleiner

FYI: I ran into the same problem and worked around it by overriding CPUID leaf 0x0 to report an Intel processor, as most of the Firecracker CPUID handling is Intel-only anyway (multicore enumeration via x2APIC, HT handling, brand string, all CPUID leaves above 0x80000002).

0x0 => {
  // The vendor string is read from EBX, EDX, ECX (in that order), little-endian.
  entry.ebx = 0x756e6547; // "Genu"
  entry.edx = 0x49656e69; // "ineI"
  entry.ecx = 0x6c65746e; // "ntel"
}

As a user, that's probably what I'd expect to see anyway when I select an Intel-based CPU template.
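
For reference, CPUID leaf 0x0 packs the 12-byte vendor string into EBX, EDX and ECX, in that order, four little-endian bytes per register. A minimal sketch of that packing follows; `vendor_regs` is a hypothetical helper, not something from Firecracker.

```rust
/// Packs a 12-byte vendor string into the (ebx, edx, ecx) register values
/// reported by CPUID leaf 0x0. Note the register order: EBX, then EDX, then ECX.
pub fn vendor_regs(vendor: &[u8; 12]) -> (u32, u32, u32) {
    // Each register holds four consecutive characters, lowest byte first.
    let word = |i: usize| {
        u32::from_le_bytes([vendor[i], vendor[i + 1], vendor[i + 2], vendor[i + 3]])
    };
    (word(0), word(4), word(8))
}
```

For example, `vendor_regs(b"GenuineIntel")` yields `(0x756e6547, 0x49656e69, 0x6c65746e)`.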

@serban300
Contributor Author

Thank you for the information! It's interesting that this works. While this is OK as a workaround, I'm not sure what side effects it might have. As a long-term fix, we will emulate the extended cache topology from userspace.
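
To give an idea of what emulating the cache topology from userspace involves, here is a hedged sketch that encodes one cache-properties subleaf. The field layout follows the leaf 0x4 / 0x8000001d format as I understand it; the `CacheParams` struct, `encode_cache_leaf` and `NULL_LEAF` are hypothetical names, and setting the self-initializing bit unconditionally is an assumption made to match the dump earlier in this thread.

```rust
/// Illustrative description of one cache. All counts must be at least 1.
pub struct CacheParams {
    pub cache_type: u32,      // 1 = data, 2 = instruction, 3 = unified
    pub level: u32,           // 1 for L1, 2 for L2, ...
    pub threads_sharing: u32, // logical CPUs sharing this cache
    pub line_size: u32,       // bytes
    pub partitions: u32,
    pub ways: u32,
    pub sets: u32,
}

/// Terminating subleaf: a null cache type ends the guest's enumeration.
pub const NULL_LEAF: (u32, u32, u32) = (0, 0, 0);

/// Encodes (eax, ebx, ecx) for one cache subleaf.
pub fn encode_cache_leaf(p: &CacheParams) -> (u32, u32, u32) {
    let eax = (p.cache_type & 0x1f)
        | ((p.level & 0x7) << 5)
        | (1 << 8) // self-initializing cache (assumption, see above)
        | (((p.threads_sharing - 1) & 0xfff) << 14);
    // Line size, partitions and ways are all stored as "value minus one".
    let ebx = ((p.line_size - 1) & 0xfff)
        | (((p.partitions - 1) & 0x3ff) << 12)
        | (((p.ways - 1) & 0x3ff) << 22);
    let ecx = p.sets - 1;
    (eax, ebx, ecx)
}
```

Encoding a 32 KB, 8-way L1 data cache with 64-byte lines, 64 sets and two sharing threads reproduces the registers from the dump earlier in the thread (eax=0x00004121, ebx=0x01c0003f, ecx=0x0000003f). A userspace table would list one such subleaf per cache and end with `NULL_LEAF`.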

serban300 added a commit to serban300/firecracker that referenced this issue Jan 14, 2019
fixes firecracker-microvm#815

Signed-off-by: Serban Iorga <seriorga@amazon.com>
serban300 added a commit to serban300/firecracker that referenced this issue Jan 14, 2019
fixes firecracker-microvm#815

Signed-off-by: Serban Iorga <seriorga@amazon.com>
serban300 added a commit to serban300/firecracker that referenced this issue Jan 15, 2019
fixes firecracker-microvm#815

Signed-off-by: Serban Iorga <seriorga@amazon.com>
serban300 added a commit to serban300/firecracker that referenced this issue Jan 22, 2019
fixes firecracker-microvm#815

Signed-off-by: Serban Iorga <seriorga@amazon.com>
serban300 pushed a commit to serban300/firecracker that referenced this issue Jan 23, 2019
serban300 pushed a commit to serban300/firecracker that referenced this issue Jan 23, 2019
serban300 pushed a commit to serban300/firecracker that referenced this issue Jan 24, 2019
serban300 pushed a commit to serban300/firecracker that referenced this issue Jan 31, 2019
serban300 pushed a commit to serban300/firecracker that referenced this issue Feb 1, 2019
The helper methods are needed for adding AMD support

firecracker-microvm#815

Signed-off-by: Serban Iorga <seriorga@amazon.com>
serban300 pushed a commit to serban300/firecracker that referenced this issue Feb 12, 2019
The helper methods are needed for adding AMD support

firecracker-microvm#815

Signed-off-by: Serban Iorga <seriorga@amazon.com>
@raduweiss added this to the AMD Support milestone Feb 15, 2019
@alexandruag added the Priority: Medium label Feb 15, 2019
@serban300 added the Priority: High label and removed the Priority: Medium label Feb 18, 2019
serban300 pushed a commit to serban300/firecracker that referenced this issue Feb 18, 2019
The helper methods are needed for adding AMD support

firecracker-microvm#815

Signed-off-by: Serban Iorga <seriorga@amazon.com>