Kernel oops on amdgpu load (Polaris / Vega) #44

Closed
madscientist159 opened this issue Feb 4, 2018 · 5 comments

Comments

@madscientist159
Contributor

madscientist159 commented Feb 4, 2018

On a ppc64el system running this kernel with a WX 7100 (Polaris) card, loading the amdgpu module triggers a kernel oops. The upstream Linux 4.15 amdgpu module works and brings up a full graphical environment, so the oops is specific to the AMD 4.13 kernel. The oops follows:

[   89.848698] checking generic (600c280010000 500000) vs hw (6000000000000 10000000)
[   89.848800] amdgpu 0000:01:00.0: enabling device (0140 -> 0142)
[   89.915446] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67C4 0x1002:0x0B0D 0x00).
[   89.965406] [drm] register mmio base: 0x00000000
[   89.965458] [drm] register mmio size: 262144
[   89.965502] [drm] PCI I/O BAR is not found.
[   89.965540] [drm] probing gen 2 caps for device 1014:4c1 = 300104/180001e
[   89.965584] [drm] probing mlw for device 1014:4c1 = 300104
[   89.965631] [drm] UVD is enabled in VM mode
[   89.965658] [drm] VCE enabled in VM mode
[   90.299090] [drm] PCI I/O BAR is not found. Using MMIO to access ATOM BIOS
[   90.299092] ATOM BIOS: 113-C9540101-100
[   90.299103] [drm] GPU post is not needed
[   90.299130] [drm] vm size is 64 GB, block size is 13-bit, fragment size is 9-bit
[   90.299147] amdgpu: No suitable DMA available
[   92.836890] amdgpu 0000:01:00.0: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[   92.836969] amdgpu 0000:01:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[   92.837021] [drm] Detected VRAM RAM=8192M, BAR=256M
[   92.837056] [drm] RAM width 256bits GDDR5
[   92.837183] [TTM] Zone  kernel: Available graphics memory: 7471346 kiB
[   92.837227] [TTM] Initializing pool allocator
[   92.837289] [drm] amdgpu: 8192M of VRAM memory ready
[   92.837325] [drm] amdgpu: 8192M of GTT memory ready.
[   92.837383] [drm] GART: num cpu pages 65536, num gpu pages 65536
[   92.837555] [drm] PCIE GART of 256M enabled (table at 0x000000F400040000).
[   92.837607] amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
[   92.837651] amdgpu 0000:01:00.0: (-12) create WB bo failed
[   92.837829] [drm:amdgpu_device_init [amdgpu]] *ERROR* amdgpu_wb_init failed -12
[   92.837912] amdgpu 0000:01:00.0: amdgpu_init failed
[   92.838002] Unable to handle kernel paging request for data at address 0xc00c000085a80000
[   92.838066] Faulting instruction address: 0xc008000005a2f1cc
[   92.838122] Oops: Kernel access of bad area, sig: 11 [#1]
[   92.838166] SMP NR_CPUS=2048
[   92.838168] NUMA
[   92.838200] PowerNV
[   92.838257] Modules linked in: amdgpu(+) mfd_core ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt fb_sys_fops i2c_algo_bit i2c_dev ghash_generic gf128mul ecb snd_hda_codec_hdmi snd_hda_intel xts snd_hda_codec joydev ofpart ctr evdev ipmi_powernv powernv_flash ipmi_devintf cbc snd_hda_core vmx_crypto mtd snd_hwdep ipmi_msghandler at24 opal_prd binfmt_misc snd_aloop snd_pcm snd_timer snd soundcore parport_pc lp parport ip_tables x_tables autofs4 nfsv3 nfs_acl nfs lockd grace sunrpc fscache hid_generic usbhid hid xhci_pci xhci_hcd usbcore tg3 ptp pps_core libphy
[   92.838719] CPU: 0 PID: 971 Comm: kworker/0:1 Not tainted 4.13.0+ #1
[   92.838778] Workqueue: events work_for_cpu_fn
[   92.838823] task: c0000001d35c4700 task.stack: c0000001d35c8000
[   92.838876] NIP: c008000005a2f1cc LR: c0080000059a036c CTR: c008000005a2f178
[   92.838940] REGS: c0000001d35cb4c0 TRAP: 0300   Not tainted  (4.13.0+)
[   92.838993] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[   92.839003]   CR: 28002288  XER: 20040000
[   92.839079] CFAR: c008000005a2f1ac DAR: c00c000085a80000 DSISR: 42000000 SOFTE: 1
               GPR00: c0080000059a036c c0000001d35cb740 c008000005c5bde0 c000000009be0000
               GPR04: c00c000085a80000 0000000000000000 0000000000080000 0000000000000000
               GPR08: 0000000000000001 c008000005a2f178 0000000000000001 c008000004a1e5d8
               GPR12: c008000005a2f178 c00000000fb80000 c000000000128568 c000000009be2f20
               GPR16: c000000009be2f28 c000000009be2f18 c000000009be2f38 c000000009be2f40
               GPR20: c000000009be2f30 0000000000008000 0000000000000400 c000000009be2f38
               GPR24: c000000009be2f40 c000000009be2f30 c000000009be2f18 0000000000000000
               GPR28: 0000000000000000 0000000000000000 c00c000085a80000 0000000000080000
[   92.839738] NIP [c008000005a2f1cc] gmc_v8_0_gart_set_pte_pde+0x54/0x90 [amdgpu]
[   92.839914] LR [c0080000059a036c] amdgpu_gart_unbind+0xa4/0x130 [amdgpu]
[   92.839968] Call Trace:
[   92.839992] [c0000001d35cb740] [c000000009be2720] 0xc000000009be2720 (unreliable)
[   92.840138] [c0000001d35cb780] [c0080000059a036c] amdgpu_gart_unbind+0xa4/0x130 [amdgpu]
[   92.840290] [c0000001d35cb800] [c0080000059a06e8] amdgpu_gart_fini+0x40/0x70 [amdgpu]
[   92.840447] [c0000001d35cb830] [c008000005a30b98] gmc_v8_0_sw_fini+0x50/0x90 [amdgpu]
[   92.840593] [c0000001d35cb860] [c00800000597f1d0] amdgpu_fini+0x208/0x560 [amdgpu]
[   92.840741] [c0000001d35cb910] [c008000005985b5c] amdgpu_device_init+0xcc4/0x1590 [amdgpu]
[   92.840889] [c0000001d35cba30] [c0080000059880fc] amdgpu_driver_load_kms+0xb4/0x2d0 [amdgpu]
[   92.840976] [c0000001d35cbab0] [c0080000044cab7c] drm_dev_register+0x1d4/0x290 [drm]
[   92.841121] [c0000001d35cbb50] [c00800000597d880] amdgpu_pci_probe+0x128/0x1f0 [amdgpu]
[   92.841228] [c0000001d35cbbd0] [c0000000005d851c] local_pci_probe+0x6c/0x140
[   92.841296] [c0000001d35cbc60] [c0000000001199d8] work_for_cpu_fn+0x38/0x60
[   92.843968] [c0000001d35cbc90] [c00000000011ead8] process_one_work+0x248/0x520
[   92.848119] [c0000001d35cbd30] [c00000000011f030] worker_thread+0x280/0x5d0
[   92.851012] [c0000001d35cbdc0] [c00000000012870c] kthread+0x1ac/0x1c0
[   92.851102] [c0000001d35cbe30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c
[   92.851209] Instruction dump:
[   92.851231] 7cdf3378 7c9e2378 7cbd2b78 7cfc3b78 48000008 e8410018 7be6c6c4 7bbd1828
[   92.852725] 78c64602 7fdeea14 7cc6e378 7c0004ac <f8de0000> 39200001 38600000 992d019c
[   92.852815] ---[ end trace 2915333da62340c0 ]---
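
For anyone comparing this trace against other reports, here is a minimal sketch that pulls the symbol names out of a ppc64 call trace like the one above. The regular expression and the `parse_frames` helper are my own illustration (not part of any kernel tooling), and the sample lines are copied from the trace above.

```python
import re

# Matches frames such as:
#   [c0000001d35cb780] [c0080000059a036c] amdgpu_gart_unbind+0xa4/0x130 [amdgpu]
# The trailing "[module]" is optional because core-kernel frames have none.
FRAME_RE = re.compile(
    r"\[(?P<sp>[0-9a-f]{16})\]\s+\[(?P<ip>[0-9a-f]{16})\]\s+"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<mod>\w+)\])?"
)


def parse_frames(log: str):
    """Return (symbol, module) pairs, innermost frame first (hypothetical helper)."""
    return [(m["sym"], m["mod"]) for m in FRAME_RE.finditer(log)]


SAMPLE = """\
[   92.840138] [c0000001d35cb780] [c0080000059a036c] amdgpu_gart_unbind+0xa4/0x130 [amdgpu]
[   92.840290] [c0000001d35cb800] [c0080000059a06e8] amdgpu_gart_fini+0x40/0x70 [amdgpu]
[   92.840593] [c0000001d35cb860] [c00800000597f1d0] amdgpu_fini+0x208/0x560 [amdgpu]
[   92.840741] [c0000001d35cb910] [c008000005985b5c] amdgpu_device_init+0xcc4/0x1590 [amdgpu]
[   92.841228] [c0000001d35cbbd0] [c0000000005d851c] local_pci_probe+0x6c/0x140
"""

for sym, mod in parse_frames(SAMPLE):
    print(f"{sym} ({mod or 'kernel'})")
```

Read from the bottom up, the frames show amdgpu_device_init calling amdgpu_fini, so the oops is raised in the cleanup path that runs after amdgpu_wb_init has already failed with -12.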

EDIT: lspci output for the AMD card:

        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon Pro WX 7100]
        Flags: fast devsel, IRQ 24, NUMA node 0
        Memory at 6000000000000 (64-bit, prefetchable) [size=256M]
        Memory at 6000010000000 (64-bit, prefetchable) [size=2M]
        I/O ports at <unassigned> [disabled]
        Memory at 600c000000000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 600c000040000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [58] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [200] #15
        Capabilities: [270] #19
        Capabilities: [2b0] Address Translation Service (ATS)
        Capabilities: [2c0] Page Request Interface (PRI)
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [320] Latency Tolerance Reporting
        Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [370] L1 PM Substates
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
@madscientist159
Contributor Author

@johnbridgman Looping you in, in case you know whether this is something we can help fix or whether we should just wait for further upstreaming effort.

@madscientist159
Contributor Author

This also happens on Vega:

[  114.147878] [drm] amdgpu kernel modesetting enabled.
[  114.225774] amdgpu 0030:03:00.0: enabling device (0140 -> 0142)
[  114.289686] [drm] initializing kernel modesetting (VEGA10 0x1002:0x687F 0x1002:0x6B76 0xC3).
[  114.353506] [drm] register mmio base: 0x00000000
[  114.414689] [drm] register mmio size: 524288
[  114.477494] [drm] PCI I/O BAR is not found.
[  114.602973] [drm] probing gen 2 caps for device 1022:1471 = 700d03/e
[  114.632727] [drm] probing mlw for device 1022:1471 = 700d03
[  114.633532] [drm] UVD is enabled in VM mode
[  114.633573] [drm] UVD ENC is enabled in VM mode
[  114.633617] [drm] VCE enabled in VM mode
[  114.957770] [drm] PCI I/O BAR is not found. Using MMIO to access ATOM BIOS
[  114.957881] ATOM BIOS: 113-D0500300-101
[  114.957992] [drm] GPU posting now...
[  115.083314] [drm] vm size is 262144 GB, block size is 9-bit,fragment size is 9-bit
[  115.083453] amdgpu: No suitable DMA available.
[  115.083503] amdgpu 0030:03:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[  115.083568] amdgpu 0030:03:00.0: GTT: 256M 0x000000F5FF000000 - 0x000000F60EFFFFFF
[  115.083622] [drm] Detected VRAM RAM=8176M, BAR=256M
[  115.083659] [drm] RAM width 2048bits HBM
[  115.085835] [TTM] Zone  kernel: Available graphics memory: 61727928 kiB
[  115.085886] [TTM] Initializing pool allocator
[  115.085984] [drm] amdgpu: 8176M of VRAM memory ready
[  115.086024] [drm] amdgpu: 64299M of GTT memory ready.
[  115.086108] [drm] GART: num cpu pages 65536, num gpu pages 65536
[  115.086309] [drm] PCIE GART of 256M enabled (table at 0x000000F400800000).
[  115.086372] amdgpu 0030:03:00.0: (-12) failed to allocate kernel bo
[  115.086420] amdgpu 0030:03:00.0: (-12) create WB bo failed
[  115.086532] [drm:amdgpu_device_init [amdgpu]] *ERROR* amdgpu_wb_init failed -12
[  115.086587] amdgpu 0030:03:00.0: amdgpu_init failed
[  115.086652] Unable to handle kernel paging request for data at address 0xc00c000081a00000
[  115.086706] Faulting instruction address: 0xc00800001a810ccc
[  115.086809] Oops: Kernel access of bad area, sig: 11 [#1]
[  115.086846] SMP NR_CPUS=2048
[  115.086848] NUMA
[  115.086876] PowerNV
[  115.086917] Modules linked in: amdgpu(+) mfd_core ttm drm_kms_helper drm syscopyarea sysfillrect sysimgblt fb_sys_fops i2c_algo_bit i2c_dev snd_hda_codec_hdmi ipmi_powernv snd_hda_intel ipmi_devintf snd_hda_codec snd_hda_core ghash_generic gf128mul ecb xts ctr evdev ofpart joydev cbc powernv_flash nvme vmx_crypto opal_prd snd_hwdep mtd at24 nvme_core ipmi_msghandler snd_aloop snd_pcm snd_timer binfmt_misc snd soundcore parport_pc lp parport ip_tables x_tables autofs4 nfsv3 nfs_acl nfs lockd grace hid_generic usbhid sunrpc hid fscache xhci_pci xhci_hcd usbcore tg3 ptp pps_core libphy
[  115.087313] CPU: 16 PID: 235 Comm: kworker/16:1 Not tainted 4.13.0+ #1
[  115.087369] Workqueue: events work_for_cpu_fn
[  115.087408] task: c000200707904380 task.stack: c000200707908000
[  115.087454] NIP: c00800001a810ccc LR: c00800001a77d36c CTR: c00800001a810c78
[  115.087509] REGS: c00020070790b4c0 TRAP: 0300   Not tainted  (4.13.0+)
[  115.087554] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[  115.087565]   CR: 28002288  XER: 20040000
[  115.087639] CFAR: c00800001a810cac DAR: c00c000081a00000 DSISR: 42000000 SOFTE: 1
[  115.087639] GPR00: c00800001a77d36c c00020070790b740 c00800001aa38de0 c0002006fcd50000
[  115.087639] GPR04: c00c000081a00000 0000000000000000 0000000000080000 0000000000000000
[  115.087639] GPR08: 0000000000000001 c00800001a810c78 0000000000000001 c008000012a515d8
[  115.087639] GPR12: c00800001a810c78 c00000000fb89000 c000000000128568 c0002006fcd52f20
[  115.087639] GPR16: c0002006fcd52f28 c0002006fcd52f18 c0002006fcd52f38 c0002006fcd52f40
[  115.087639] GPR20: c0002006fcd52f30 0000000000008000 0000000000000400 c0002006fcd52f38
[  115.087639] GPR24: c0002006fcd52f40 c0002006fcd52f30 c0002006fcd52f18 0000000000000000
[  115.087639] GPR28: 0000000000000000 0000000000000000 c00c000081a00000 0000000000080000
[  115.088218] NIP [c00800001a810ccc] gmc_v9_0_gart_set_pte_pde+0x54/0x90 [amdgpu]
[  115.088331] LR [c00800001a77d36c] amdgpu_gart_unbind+0xa4/0x130 [amdgpu]
[  115.088378] Call Trace:
[  115.088402] [c00020070790b740] [c0002006fcd52720] 0xc0002006fcd52720 (unreliable)
[  115.088515] [c00020070790b780] [c00800001a77d36c] amdgpu_gart_unbind+0xa4/0x130 [amdgpu]
[  115.088628] [c00020070790b800] [c00800001a77d6e8] amdgpu_gart_fini+0x40/0x70 [amdgpu]
[  115.088745] [c00020070790b830] [c00800001a8114bc] gmc_v9_0_sw_fini+0x44/0x80 [amdgpu]
[  115.088856] [c00020070790b860] [c00800001a75c1d0] amdgpu_fini+0x208/0x560 [amdgpu]
[  115.088968] [c00020070790b910] [c00800001a762b5c] amdgpu_device_init+0xcc4/0x1590 [amdgpu]
[  115.089079] [c00020070790ba30] [c00800001a7650fc] amdgpu_driver_load_kms+0xb4/0x2d0 [amdgpu]
[  115.089154] [c00020070790bab0] [c008000011ffeb7c] drm_dev_register+0x1d4/0x290 [drm]
[  115.089264] [c00020070790bb50] [c00800001a75a880] amdgpu_pci_probe+0x128/0x1f0 [amdgpu]
[  115.089569] [c00020070790bbd0] [c0000000005d851c] local_pci_probe+0x6c/0x140
[  115.090996] [c00020070790bc60] [c0000000001199d8] work_for_cpu_fn+0x38/0x60
[  115.093781] [c00020070790bc90] [c00000000011ead8] process_one_work+0x248/0x520
[  115.097969] [c00020070790bd30] [c00000000011f030] worker_thread+0x280/0x5d0
[  115.103504] [c00020070790bdc0] [c00000000012870c] kthread+0x1ac/0x1c0
[  115.104970] [c00020070790be30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c
[  115.105127] Instruction dump:
[  115.105178] 7cdf3378 7c9e2378 7cbd2b78 7cfc3b78 48000008 e8410018 7be680e4 7bbd1828
[  115.105380] 78c68402 7fdeea14 7cc6e378 7c0004ac <f8de0000> 39200001 38600000 992d019c
[  115.105532] ---[ end trace b62a8741e75295fa ]---
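
As a sanity check on the memory layout the driver reports, the GTT aperture size can be recomputed from the address range in the Vega log above and compared with the page count on the GART line. This is only a sketch: the `range_size` helper is hypothetical, and the numbers are copied from the log.

```python
def range_size(start: int, end: int) -> int:
    """Size in bytes of an inclusive address range (hypothetical helper)."""
    return end - start + 1


# "GTT: 256M 0x000000F5FF000000 - 0x000000F60EFFFFFF"
gtt_bytes = range_size(0x000000F5FF000000, 0x000000F60EFFFFFF)

# "GART: num cpu pages 65536, num gpu pages 65536"
num_pages = 65536

print(gtt_bytes // (1024 * 1024), "MiB aperture")
print(gtt_bytes // num_pages, "bytes per GART page")
```

The range works out to 256 MiB, which agrees with the "PCIE GART of 256M" line, and dividing by the reported page count gives 4096 bytes per page. This only confirms that the log is self-consistent; it does not by itself identify why the buffer allocation fails.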

@madscientist159 madscientist159 changed the title Kernel oops on amdgpu load (Polaris) Kernel oops on amdgpu load (Polaris / Vega) Feb 21, 2018
@madscientist159
Contributor Author

madscientist159 commented Feb 21, 2018

A bit more information: the failure is in amdgpu_bo_create_reserved(), while trying to allocate 32 MB of RAM. No failure occurs on a stock 4.15 kernel with the upstream amdgpu driver.
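
For reference, the "(-12)" in the "failed to allocate kernel bo" messages is a negated errno value. A quick way to decode it, assuming the script runs on a Linux host where the userspace errno numbers match the kernel's (the `describe_kernel_error` helper is just an illustration):

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel return code to its errno name (hypothetical helper)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{code}: {name} ({os.strerror(number)})"


print(describe_kernel_error(-12))
```

On Linux, errno 12 is ENOMEM, which is consistent with the buffer object allocation itself failing rather than a later step.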

EDIT: Full trace of the failing buffer allocation:

[   49.673446] [c00020070794f580] [c00000000093bcdc] dump_stack+0xb0/0xf4 (unreliable)
[   49.674349] [c00020070794f5c0] [c00800001a0c94f8] amdgpu_bo_do_create+0x5a0/0x630 [amdgpu]
[   49.675196] [c00020070794f6d0] [c00800001a0c9634] amdgpu_bo_create+0xac/0x300 [amdgpu]
[   49.676022] [c00020070794f7b0] [c00800001a0c9b0c] amdgpu_bo_create_reserved+0x284/0x310 [amdgpu]
[   49.676776] [c00020070794f880] [c00800001a0c9c10] amdgpu_bo_create_kernel+0x78/0x130 [amdgpu]
[   49.677578] [c00020070794f920] [c00800001a0aff4c] amdgpu_device_init+0x10b4/0x1590 [amdgpu]
[   49.678418] [c00020070794fa30] [c00800001a0b20fc] amdgpu_driver_load_kms+0xb4/0x2d0 [amdgpu]
[   49.678567] [c00020070794fab0] [c008000011955b7c] drm_dev_register+0x1d4/0x290 [drm]
[   49.679343] [c00020070794fb50] [c00800001a0a7880] amdgpu_pci_probe+0x128/0x1f0 [amdgpu]
[   49.679422] [c00020070794fbd0] [c0000000005d851c] local_pci_probe+0x6c/0x140
[   49.679500] [c00020070794fc60] [c0000000001199d8] work_for_cpu_fn+0x38/0x60
[   49.679573] [c00020070794fc90] [c00000000011ead8] process_one_work+0x248/0x520
[   49.679597] [c00020070794fd30] [c00000000011f030] worker_thread+0x280/0x5d0
[   49.679663] [c00020070794fdc0] [c00000000012870c] kthread+0x1ac/0x1c0
[   49.679747] [c00020070794fe30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c

@kentrussell
Contributor

kentrussell commented Sep 12, 2018

We should have this addressed in 1.9. Can you let us know if it's still happening?

@jlgreathouse
Contributor

Closing since we've gone ~6 months without hearing more. I presume this is fixed.


3 participants