Support for Linux kernel >= 6.8.0-44 #171
base: master
Switch the default mes to uni mes for gfx v12. V2: remove uni_mes set for gfx v11. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Jack Xiao <Jack.Xiao@amd.com>
Enable mmhub and athub cg on gc 12.0.0 Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Enable GFXOFF for GC v12.0.0. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
add pp_dpm_dcefclk for smu 14.0.2/3 Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Jack Gui <Jack.Gui@amd.com>
use mc address for wptr in add queue packet Signed-off-by: Frank Min <Frank.Min@amd.com> Reviewed-by: Jack Xiao <Jack.Xiao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>
gfx12 query video mem channel/type/width from umc_info of atom list, so fix it accordingly. Signed-off-by: Frank Min <Frank.Min@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>
disable gpo temporarily since it is not ready in fw Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Jack Gui <Jack.Gui@amd.com>
create a new helper function to avoid compiler 'side-effect' check about RAS_EVENT_LOG() macro. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
gpu_id needs to be unique for user space to identify GPUs via KFD interface. In the current implementation there is a very small probability of having non unique gpu_ids. v2: Add check to confirm if gpu_id is unique. If not unique, find one Changed commit header to reflect the above v3: Use crc16 as suggested-by: Lijo Lazar <lijo.lazar@amd.com> Ensure that gpu_id != 0 Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
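The approach described in this commit (hash with crc16, reject 0, probe for a free value on collision) can be sketched in plain C as follows. This is an illustrative userspace sketch, not the KFD implementation: `crc16_update`, `id_in_use`, and `pick_gpu_id` are hypothetical names, the input bytes stand in for whatever device properties the driver hashes, and the linear probe is an assumed collision strategy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 with the reflected polynomial 0xA001 (CRC-16/ARC when crc starts at 0). */
static uint16_t crc16_update(uint16_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001) : (uint16_t)(crc >> 1);
    }
    return crc;
}

/* Hypothetical helper: true if id is already assigned to another GPU. */
static bool id_in_use(const uint16_t *used, size_t n_used, uint16_t id)
{
    for (size_t i = 0; i < n_used; i++)
        if (used[i] == id)
            return true;
    return false;
}

/*
 * Hypothetical helper: derive an id from the property bytes, then probe
 * upward until the id is non-zero and unused. Returns 0 only if all
 * 65535 non-zero ids are taken.
 */
static uint16_t pick_gpu_id(const uint8_t *props, size_t len,
                            const uint16_t *used, size_t n_used)
{
    uint16_t id = crc16_update(0, props, len);

    for (uint32_t tries = 0; tries <= 0xFFFF; tries++) {
        if (id != 0 && !id_in_use(used, n_used, id))
            return id;
        id++; /* wraps from 0xFFFF to 0, which is skipped above */
    }
    return 0;
}
```

With an empty `used` list the result is simply the CRC of the property bytes (or 1 if that CRC happens to be 0); when that value is already taken, the next free non-zero value is returned instead.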
Fix up parameter descriptions. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To catch GPU mapping of system memory, TTM_PL_TT and AMDGPU_PL_PREEMPT must be checked. Fixes: 7c06cc729edc ("drm/amdkfd: mark GFX12 system and peer GPU memory mappings as MTYPE_NC") Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Fix up parameter descriptions. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
GFX1201 was missed in the commit below. Adding it in. Fixes: 7c06cc729edc ("drm/amdkfd: mark GFX12 system and peer GPU memory mappings as MTYPE_NC") Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
add module parameter for jpeg. this is a temporary workaround for jpeg unit test fail on vcn 5.0 now. will be removed later. Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
support pp_dpm_pcie on smu v14.0.2/3 Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Jack Gui <Jack.Gui@amd.com>
When the user sets an interval shorter than the driver can handle, a soft lockup arises. Clear this soft lockup by adding a schedule before triggering a new host trap.
[ 2896.405488] watchdog: BUG: soft lockup - CPU#22 stuck for 26s! [pcs_130:38057]
[ 2896.405676] Supported: No, Unsupported modules are loaded
[ 2896.405678] CPU: 22 PID: 38057 Comm: pcs_130 Kdump: loaded Tainted: G OE X N 5.14.21-150500.55.59-default #1 SLE15-SP5 3a8569df5696e57cdcb648c7e890af33bdc23f85
[ 2896.405683] Hardware name: Dell Inc. PowerEdge R7525/0590KW, BIOS 2.6.6 01/13/2022
[ 2896.405684] RIP: 0010:amdgpu_device_rreg.part.42+0x57/0x1d0 [amdgpu]
[ 2896.405978] Code: 6f 4c 9c 00 4c 8b 83 b8 08 00 00 4d 01 e0 85 c9 74 15 65 48 8b 04 25 00 1c 02 00 3b 88 b8 09 00 00 0f 85 52 01 00 00 41 8b 28 <8b> 05 43 4c 9c 00 85 c0 74 56 65 48 8b 14 25 00 1c 02 00 39 82 b8
[ 2896.405981] RSP: 0018:ffffb7a6ecc33e30 EFLAGS: 00000246
[ 2896.405984] RAX: ffff949389f18000 RBX: ffff94d3d1100000 RCX: 00000000000094a9
[ 2896.405985] RDX: 0000000000000000 RSI: 0000000000002376 RDI: ffff94d3d1100000
[ 2896.405987] RBP: 0000000000000000 R08: ffffb7a6e2b88dd8 R09: ffff94d30e3b1f14
[ 2896.405989] R10: ffffb7a6c0427d88 R11: ffffb7a6ecc33c80 R12: 0000000000008dd8
[ 2896.405990] R13: 0000000000002376 R14: ffff94d30e3b1f14 R15: ffff94d3d1100000
[ 2896.405992] FS: 0000000000000000(0000) GS:ffff9512ff580000(0000) knlGS:0000000000000000
[ 2896.405994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2896.405996] CR2: 00007f5b2b732000 CR3: 0000006a1ee10003 CR4: 0000000000770ee0
[ 2896.405998] PKRU: 55555554
[ 2896.405999] Call Trace:
[ 2896.406004] <TASK>
[ 2896.406007] kgd_gfx_v9_trigger_pc_sample_trap+0x1d6/0x4f0 [amdgpu 75bb93fc913928fc00917a1c71d5c2dca258175d]
Signed-off-by: James Zhu <James.Zhu@amd.com> Tested-by: Vladimir Indic <Vladimir.Indic@amd.com> Reviewed-by: Vladimir Indic <Vladimir.Indic@amd.com>
When host trap pc sampling is activated, the command bus from SPI/SQG to SQ may conflict with SQ internal clock gating, so a large number of host trap commands will trigger a qcm fence timeout. Signed-off-by: James Zhu <James.Zhu@amd.com> Tested-by: Vladimir Indic <Vladimir.Indic@amd.com> Reviewed-by: Vladimir Indic <Vladimir.Indic@amd.com>
Signed-off-by: Asher Song <Asher.Song@amd.com>
Signed-off-by: Asher Song <Asher.Song@amd.com>
The is_smca_umc_v2 function never appears in the upstream kernel, so the macro HAVE_SMCA_UMC_V2 is always undefined, which causes MCE notifications not to be handled on the MI200 A+A platform. So we drop the macro HAVE_SMCA_UMC_V2. On the other hand, on CentOS 7.9, SMCA_UMC_V2 is not defined in arch/x86/include/asm/mce.h; we don't care about umc_v2 error notifications on CentOS 7.9. Signed-off-by: Asher Song <Asher.Song@amd.com> Reviewed-by: Flora Cui <flora.cui@amd.com> Reviewed-by: Bob Zhou <bob.zhou@amd.com>
When redefining HAVE_SMCA_UMC_V2, the fake function smca_get_bank_type is called by amdgpu_bad_page_notifier. However, the original fake function cannot be referenced in an in-tree build because it is defined in the amdkcl modules. So we make a macro for the fake function in backport/kcl_mce.h. Signed-off-by: Asher Song <Asher.Song@amd.com> Reviewed-by: Flora Cui <flora.cui@amd.com> Reviewed-by: Bob Zhou <bob.zhou@amd.com>
There is a typo in patch drm/amdkcl: fake smca_get_bank_type, fix it Signed-off-by: Asher Song <Asher.Song@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
The parameters segment_width and last_segment_width are used to control the configuration of the Output Plane Processor (OPP), specifically the width of each segment that the display is divided into and the width of the last segment.
Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'segment_width' not described in 'optc35_set_odm_combine'
drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'last_segment_width' not described in 'optc35_set_odm_combine'
drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Excess function parameter 'timing' description in 'optc35_set_odm_combine'
Cc: Tom Chung <chiahsuan.chung@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Roman Li <roman.li@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Tom Chung <chiahsuan.chung@amd.com>
Align with new port same as smu 13.x. Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Jack Gui <Jack.Gui@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Update the capabilities for supporting 8k encoding. Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Ruijing Dong <ruijing.dong@amd.com>
The following commit updated gmc->noretry from 0 to 1 for GC HW IP 9.3.0: commit 5f3854f ("drm/amdgpu: add more cases to noretry=1") This causes the device to hang when a page fault occurs, until the device is rebooted. Instead, revert back to gmc->noretry=0 so the device is still responsive. Fixes: 5f3854f ("drm/amdgpu: add more cases to noretry=1") Signed-off-by: Tim Van Patten <timvp@google.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
./drivers/gpu/drm/amd/amdgpu/amdgpu.h: amdgpu_umsch_mm.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9063 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
…ing_set_wptr
This commit removes a duplicate check for *is_queue_unmap in the sdma_v7_0_ring_set_wptr function. The check at line 171 was considered dead code because at this point in the code, we already know that *is_queue_unmap is false due to the check at line 161. Removing this unnecessary check improves the readability of the code.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c:171 sdma_v7_0_ring_set_wptr() warn: duplicate check '*is_queue_unmap' (previous on line 161)
drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
    140 static void sdma_v7_0_ring_set_wptr(struct amdgpu_ring *ring)
    141 {
    142         struct amdgpu_device *adev = ring->adev;
    143         uint32_t *wptr_saved;
    144         uint32_t *is_queue_unmap;
    145         uint64_t aggregated_db_index;
    146         uint32_t mqd_size = adev->mqds[AMDGPU_HW_IP_DMA].mqd_size;
    147
    148         DRM_DEBUG("Setting write pointer\n");
    149
    150         if (ring->is_mes_queue) {
    151                 wptr_saved = (uint32_t *)(ring->mqd_ptr + mqd_size);
    152                 is_queue_unmap = (uint32_t *)(ring->mqd_ptr + mqd_size +
                        ^^^^^^^^^^^^^^^^ Set here
    153                                               sizeof(uint32_t));
    154                 aggregated_db_index =
    155                         amdgpu_mes_get_aggregated_doorbell_index(adev,
    156                                                          ring->hw_prio);
    157
    158                 atomic64_set((atomic64_t *)ring->wptr_cpu_addr,
    159                              ring->wptr << 2);
    160                 *wptr_saved = ring->wptr << 2;
    161                 if (*is_queue_unmap) {
                            ^^^^^^^^^^^^^^^ Checked here
    162                         WDOORBELL64(aggregated_db_index, ring->wptr << 2);
    163                         DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n",
    164                                         ring->doorbell_index, ring->wptr << 2);
    165                         WDOORBELL64(ring->doorbell_index, ring->wptr << 2);
    166                 } else {
    167                         DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n",
    168                                         ring->doorbell_index, ring->wptr << 2);
    169                         WDOORBELL64(ring->doorbell_index, ring->wptr << 2);
    170
--> 171                         if (*is_queue_unmap)
                                    ^^^^^^^^^^^^^^^ This is dead code. We know it's false.
    172                                 WDOORBELL64(aggregated_db_index,
    173                                             ring->wptr << 2);
    174                 }
    175         } else {
    176                 if (ring->use_doorbell) {
    177                         DRM_DEBUG("Using doorbell -- "
    178                                   "wptr_offs == 0x%08x "
Fixes: 6d9c711786e6 ("drm/amdgpu: Add sdma v7_0 ip block support (v7)") Cc: Likun Gao <Likun.Gao@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com>
modify the lock type to 'spinlock' to avoid schedule issue in interrupt context. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Add support to set/get information about different DPM policies. The support is only available on SOCs which use swsmu architecture. A DPM policy type may be defined with different levels. For example, a policy may be defined to select Pstate preference and then later a pstate preference may be chosen. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Per firmware's requirement, replace mode2 with mode1. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
GFX v9.4.3 uses mode1 reset, other ASICs choose mode2. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Since it is not stable on stress test. Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Vladimir Indic <Vladimir.Indic@amd.com> Tested-by: Vladimir Indic <Vladimir.Indic@amd.com>
…rarily not for upstream. -v2: fix typo -v3: rename kfd_ioctl_pc_sample_args "reserved" to "version" Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Vladimir Indic <Vladimir.Indic@amd.com> Tested-by: Vladimir Indic <Vladimir.Indic@amd.com>
This reverts commit 6ac6a32. The fixed issue has disappeared, so revert the workaround. Signed-off-by: Bob Zhou <bob.zhou@amd.com> Reviewed-by: Jingwen Chen <Jingwen.Chen2@amd.com>
This reverts commit 4ff45ec. The fixed issue has disappeared, so revert the workaround. Signed-off-by: Bob Zhou <bob.zhou@amd.com> Reviewed-by: Jingwen Chen <Jingwen.Chen2@amd.com>
We send back the ready to reset message before we stop anything. This is wrong. Move it to when we are actually ready for the FLR to happen. In the current state since we take tens of seconds to stop everything, it is very likely that host would give up waiting and reset the GPU before we send ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also let us tell how slow ready to reset actually is from the host. The ready to reset speed can be improved later. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com>
…dapter Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com> Reviewed-by: Zhigang Luo <Zhigang.Luo@amd.com>
For RAS error scenario, VF guest driver will check mailbox and set fed flag to avoid unnecessary HW accesses. additionally, poll for reset completion message first to avoid accidentally spamming multiple reset requests to host. v2: add another mailbox check for handling case where kfd detects timeout first v3: set host_flr bit and use wait_for_reset Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com> Reviewed-by: Zhigang Luo <Zhigang.Luo@amd.com>
Flag "mes.ring.sched.ready" is set to true after mes hw init and to false at mes hw fini to avoid duplicate initialization. But hw fini is not called on function level reset, so mes hw init is skipped during FLR, which causes mapping of the legacy queue to fail. Setting this flag to false in post reset fixes this issue. Signed-off-by: Lin.Cao <lincao12@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>
Accessing registers via host is missing the check for skip_hw_access and the lockdep check that comes with it. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com>
is_hws_hang and is_resetting serve pretty much the same purpose, and both duplicate the work of the reset_domain lock; just check that directly instead. This also eliminates a few bugs listed below and gets rid of dqm->ops.pre_reset. kfd_hws_hang did not need to avoid scheduling another reset. If the ongoing reset decided to skip GPU reset we have a bad time; otherwise the extra reset will get cancelled anyway. remove_queue_mes forgot to check the is_resetting flag, unlike the pre-MES path unmap_queue_cpsch, so it did not block hw access during reset correctly. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
At this point the gart is not set up, there's no point to invalidate tlb here and it could even be harmful. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com>
When the amdgpu_gart_invalidate_tlb helper was introduced, this part was left out of the conversion. Avoid the code duplication here. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com>
Which method is used to flush the tlb does not depend on whether a reset is in progress or not. We should skip the flush altogether if the GPU will get reset. So put both paths under the reset_domain read lock. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> CC: stable@vger.kernel.org
We need to take the reset domain lock before flush hdp. We can't put the lock inside amdgpu_device_flush_hdp itself because it is used during reset where we already take the write side lock. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com>
We need to take the reset domain lock before talking to MES. While in this case we can take the lock inside the mes helper. We can't do so for most other mes helpers since they are used during reset. So for consistency sake we add the lock here. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Here we are in reset and already hold the reset_domain write side lock, so we can't use the flush tlb helper, which tries to take the read side. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com>
This reverts commit d409c20. The commit is a partial revert that left things broken, also this was never ported back to drm-next. This revert is needed by patch series https://gerrit-git.amd.com/c/brahma/ec/linux/+/1068977
Add support to tune phase detect parameters. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Add debugfs nodes for enabling/disabling and tuning parameters used in phase detect. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Add support for enabling phase detect and tuning params for SMUv13.0.6 SOCs. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Phase detect controls are only available for SMUv13.0.6 dGPUs. Create control object only on those. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
…or Linux kernel >= 6.8.0-44
I wonder if this is an Ubuntu-specific problem: they chose the 6.8 kernel (sadly not an LTS one), picked some patches in -44, and thus broke their partner's code! The bug is also reported on Ubuntu, as it is their change that caused the bug in 6.8.0 (but this will need to be fixed for future kernels): https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2080823
Thanks for submitting this, though I wonder if this will still work with distributions outside of Ubuntu. I admit I don't know much about kernel development.
torvalds/linux@5a507b7 is where the change happened. It is additionally a security issue, see CVE-2024-39498, which makes me think this will be backported to older kernels. The real question is what value of
We've got a KCL-based solution coming in the next release. I'll leave this open for now in case people want it as a workaround
Related ROCm/ROCm#3701 and thanks to @kswit for the reference and @alain-bkr for the solution.
I have guarded the code so it doesn't break older kernel versions.
Checking https://packages.ubuntu.com/noble/all/linux-headers-6.8.0-41/download, particularly the Makefile, it is not clear what the version is. I can add a runtime uname usage if you like. On my 6.8.0-41 for example, /usr/include/linux/version.h contains #define LINUX_VERSION_CODE 395276. In Alpine in Docker with 6.6-r0 of linux-headers apk on that same base, I get #define LINUX_VERSION_CODE 394752. Pretty sure this version restriction in the PR is sufficient, but keep this comment in mind; happy to change the version.
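For reference, the two LINUX_VERSION_CODE values quoted above can be decoded with a few lines of C. This is a minimal sketch: `PACK_VERSION`, `struct kver`, and `decode_version_code` are hypothetical names, and the packing mirrors the kernel's KERNEL_VERSION() layout for sublevels up to 255.

```c
#include <stdint.h>

/* Same layout as the kernel's KERNEL_VERSION() for sublevels <= 255:
 * major in bits 16 and up, minor in bits 8-15, sublevel in bits 0-7. */
#define PACK_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

struct kver {
    unsigned int major;
    unsigned int minor;
    unsigned int sublevel;
};

/* Split a packed version code back into its three components. */
static struct kver decode_version_code(uint32_t code)
{
    struct kver v;

    v.major = code >> 16;
    v.minor = (code >> 8) & 0xFF;
    v.sublevel = code & 0xFF;
    return v;
}
```

Decoding 395276 gives 6.8.12 and 394752 gives 6.6.0, so the Ubuntu package suffix (-41, -44) does not appear to be encoded in LINUX_VERSION_CODE. A compile-time guard on this value can therefore separate 6.8 from older series, but it likely cannot tell 6.8.0-41 from 6.8.0-44 on its own.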