Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DKMS include path parsing #117

Open
wants to merge 3,190 commits into
base: master
Choose a base branch
from
Open

Fix DKMS include path parsing #117

wants to merge 3,190 commits into from

Conversation

tsipa
Copy link

@tsipa tsipa commented Oct 28, 2021

filter-out first argument is the space separated list of patterns.
Current implementation will remove any individual "-include" positional
arguments which will make cmdline invalid.

more details can be found in ROCm/ROCm#1601

jlgreathouse and others added 30 commits August 7, 2021 11:02
Add the MODE register into the per-wave debug information.
This register holds state such as FP rounding and denorm
modes, which exceptions are enabled, and active clamping
modes.

Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
If the platform uses BOCO, don't use BACO in runtime suspend.
We could end up executing the BACO path if the platform supports
both.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1669
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
…ate() and amdgpu_ttm_tt_unpopulate()

The varialbe gtt in the function amdgpu_ttm_tt_populate() and
amdgpu_ttm_tt_unpopulate() is guaranteed to be not NULL in the context.
Thus the null-pointer checks are redundant and can be dropped.

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Enable pp_num_states, pp_cur_state, pp_force_state, pp_table sysfs under
SRIOV 1-VF scenario.

Signed-off-by: Jiawei Gu <Jiawei.Gu@amd.com>
Add u32 gfx_target_version field to kfd_node_properties and
kfd_device_info. Populate <asic>_device_info structs accordingly and
expose to sysfs.

This allows eliminating device-ID-based lookup tables in user mode for
future ASICs.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
…ables_init()'

'watermarks_table' must be freed instead 'clocks_table', because
'clocks_table' is known to be NULL at this point and 'watermarks_table' is
never freed if the last kzalloc fails.

Fixes: c98ee89 ("drm/amd/pm: add the fine grain tuning function for vangogh")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In some systems only MACO is supported. This is to fix the problem
that runtime pm is enabled but BACO is not supported. MACO will be
handled seperately.

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Jack Gui <Jack.Gui@amd.com>
The variable eng_id is being initialized with a value that is never
read, it is being re-assigned on the next statment. The assignment
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- Extend wait time and add retry, currently 6s * 2times
- Change timing algorithm

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com>
Reviewed-by: Victor Zhao <Victor.Zhao@amd.com>
When init failed in early init stage, amdgpu_object has
not been initialized, so hasn't the ttm delayed queue functions.

Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Reviewed-by: Emily.Deng <Emily.Deng@amd.com>
Add device ids.

Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Rui Teng <rui.teng@amd.com>
There may be multiple instances and only one is harvested.

v2: fix typo in commit message

Fixes: 83a0b86 ("drm/amdgpu: add judgement when add ip blocks (v2)")
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1673
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Dmytro Laktyushkin <Dmytro.Laktyushkin@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Eric Bernstein <eric.bernstein@amd.com>
Cc: stable@vger.kernel.org
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
[Why]
If the plane has been removed, the writeback disablement logic
doesn't run

[How]
fix the logic order

Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Roy Chan <roy.chan@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
…cking logic

Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Roy Chan <roy.chan@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Roy Chan <roy.chan@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
[How]
the programming sequeune was for old asic.
the correct programming sequeunce should be similar to the one
used in mpc. the fix is copied from the mpc programming sequeunce.

Reviewed-by: Anthony Koo <Anthony.Koo@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Roy Chan <roy.chan@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Roy Chan <roy.chan@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
[Why]
Developers can find it useful if the driver can produce
AUX traces without special equipment.

[How]
Add AUX tracing.

Reviewed-by: Zhan Liu <zhan.liu@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Ashley Thomas <Ashley.Thomas2@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
[why]
DM needs to be notified when hdcp module has completed
authentication attempt.

Reviewed-by: Bhawanpreet Lakha <bhawanpreet.lakha@amd.com>
Reviewed-by: George Shen <George.Shen@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Wenjing Liu <wenjing.liu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Anthony Koo <Anthony.Koo@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
This version brings along following fixes:
- Fix memory allocation in dm IRQ context to use GFP_ATOMIC
- Increase timeout threshold for DMCUB reset
- Clear GPINT after DMCUB has reset
- Add AUX I2C tracing
- Fix code commenting style
- Some refactoring
- Remove invalid assert for ODM + MPC case

Reviewed-by: Wyatt Wood <Wyatt.Wood@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Anthony Koo <Anthony.Koo@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
[Why]
Otherwise we can end up processing whatever was left in the register
if the DMCUB was previously reset.

If DMCUB gets force reset too early from another client then we might
not have even acked the disable yet - causing DMCUB instantly shutdown
if the command was 10020000.

[How]
Move the GPINT clear outside of the reset loop and do it unconditionally
after the DMCUB has been properly reset.

Reviewed-by: Roy Chan <roy.chan@amd.com>
Reviewed-by: Eric Yang <Eric.Yang2@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
[Why]
If we're backdoor loading the DMCUB performs more work than just
the PHY reset so we can end up resetting before the cleanup has fully
finished.

[How]
Increase timeout, add udelay between spins to guarantee a minimum.

Reviewed-by: Roy Chan <roy.chan@amd.com>
Reviewed-by: Eric Yang <Eric.Yang2@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Replace GFP_KERNEL with GFP_ATOMIC as amdgpu_dm_irq_schedule_work
can't sleep.

BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 253, name: kworker/6:1H
CPU: 6 PID: 253 Comm: kworker/6:1H Tainted: G        W  OE     5.11.0-promotion_2021_06_07-18_36_28_prelim_revert_retrain ROCm#8
Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 3405 02/01/2021
Workqueue: events_highpri dm_irq_work_func [amdgpu]
Call Trace:
 <IRQ>
 dump_stack+0x5e/0x74
 ___might_sleep.cold+0x87/0x98
 __might_sleep+0x4b/0x80
 kmem_cache_alloc_trace+0x390/0x4f0
 amdgpu_dm_irq_handler+0x171/0x230 [amdgpu]
 amdgpu_irq_dispatch+0xc0/0x1e0 [amdgpu]
 amdgpu_ih_process+0x81/0x100 [amdgpu]
 amdgpu_irq_handler+0x26/0xa0 [amdgpu]
 __handle_irq_event_percpu+0x49/0x190
 ? __hrtimer_get_next_event+0x4d/0x80
 handle_irq_event_percpu+0x33/0x80
 handle_irq_event+0x33/0x60
 handle_edge_irq+0x82/0x190
 asm_call_irq_on_stack+0x12/0x20
 </IRQ>
 common_interrupt+0xbb/0x140
 asm_common_interrupt+0x1e/0x40
RIP: 0010:amdgpu_device_rreg.part.0+0x44/0xf0 [amdgpu]
Code: 53 48 89 fb 4c 3b af c8 08 00 00 73 6d 83 e2 02 75 0d f6 87 40 62 01 00 10 0f 85 83 00 00 00 4c 03 ab d0 08 00 00 45 8b 6d 00 <8b> 05 3e b6 52 00 85 c0 7e 62 48 8b 43 08 0f b7 70 3e 65 8b 05 e3
RSP: 0018:ffffae7740fff9e8 EFLAGS: 00000286
RAX: ffffffffc05ee610 RBX: ffff8aaf8f620000 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000005430 RDI: ffff8aaf8f620000
RBP: ffffae7740fffa08 R08: 0000000000000001 R09: 000000000000000a
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000005430
R13: 0000000071000000 R14: 0000000000000001 R15: 0000000000005430
 ? amdgpu_cgs_write_register+0x20/0x20 [amdgpu]
 amdgpu_device_rreg+0x17/0x20 [amdgpu]
 amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
 dm_read_reg_func+0x38/0xb0 [amdgpu]
 generic_reg_wait+0x80/0x160 [amdgpu]
 dce_aux_transfer_raw+0x324/0x7c0 [amdgpu]
 dc_link_aux_transfer_raw+0x43/0x50 [amdgpu]
 dm_dp_aux_transfer+0x87/0x110 [amdgpu]
 drm_dp_dpcd_access+0x72/0x110 [drm_kms_helper]
 drm_dp_dpcd_read+0xb7/0xf0 [drm_kms_helper]
 drm_dp_get_one_sb_msg+0x349/0x480 [drm_kms_helper]
 drm_dp_mst_hpd_irq+0xc5/0xe40 [drm_kms_helper]
 ? drm_dp_mst_hpd_irq+0xc5/0xe40 [drm_kms_helper]
 dm_handle_hpd_rx_irq+0x184/0x1a0 [amdgpu]
 ? dm_handle_hpd_rx_irq+0x184/0x1a0 [amdgpu]
 handle_hpd_rx_irq+0x195/0x240 [amdgpu]
 ? __switch_to_asm+0x42/0x70
 ? __switch_to+0x131/0x450
 dm_irq_work_func+0x19/0x20 [amdgpu]
 process_one_work+0x209/0x400
 worker_thread+0x4d/0x3e0
 ? cancel_delayed_work+0xa0/0xa0
 kthread+0x124/0x160
 ? kthread_park+0x90/0x90
 ret_from_fork+0x22/0x30

Reviewed-by: Aurabindo Jayamohanan Pillai <Aurabindo.Pillai@amd.com>
Acked-by: Anson Jacob <Anson.Jacob@amd.com>
Signed-off-by: Anson Jacob <Anson.Jacob@amd.com>
Cc: stable@vger.kernel.org
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Building with W=1 complains about an empty 'else' statement, so use the
usual do-nothing-while-0 loop to quieten this warning.

../drivers/gpu/drm/amd/amdgpu/../display/dc/dce/dmub_psr.c:113:53: warning: suggest braces around empty body in an 'else' statement [-Wempty-body]
  113 |                                 *state, retry_count);

Fixes: b30eda8 ("drm/amd/display: Add ETW log to dmub_psr_get_state")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Wyatt Wood <wyatt.wood@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Harry Wentland <harry.wentland@amd.com>
Cc: Leo Li <sunpeng.li@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Don't use "begin kernel-doc notation" (/**) for comments that are
not kernel-doc. This eliminates warnings reported by the 0day bot.

drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c:89: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * This shader is used to clear VGPRS and LDS, and also write the input
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c:209: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * The below shaders are used to clear SGPRS, and also write the input
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c:301: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * This shader is used to clear the uninitiated sgprs after the above

Fixes: 0e0036c ("drm/amdgpu: fix no full coverage issue for gprs initialization")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Dennis Li <Dennis.Li@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
 initial modification of files
  smu_cmn.c
  navi10_ppt.c

=== Test ===
AMDGPU_PCI_ADDR=`lspci -nn | grep "VGA\|Display" | cut -d " " -f 1`
AMDGPU_HWMON=`ls -la /sys/class/hwmon | grep $AMDGPU_PCI_ADDR | awk '{print $9}'`
HWMON_DIR=/sys/class/hwmon/${AMDGPU_HWMON}
LOGFILE=pp_printf.test.log

lspci -nn | grep "VGA\|Display"  > $LOGFILE
FILES="pp_dpm_sclk
pp_sclk_od
pp_mclk_od
pp_dpm_pcie
pp_od_clk_voltage
pp_power_profile_mode "

for f in $FILES
do
  echo === $f === >> $LOGFILE
  cat $HWMON_DIR/device/$f >> $LOGFILE
done
cat $LOGFILE

Signed-off-by: Darren Powell <darren.powell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
 modification of smu11 files
  arcturus_ppt.c
  sienna_cichlid_ppt.c
  vangogh_ppt.c

=== Test ===
AMDGPU_PCI_ADDR=`lspci -nn | grep "VGA\|Display" | cut -d " " -f 1`
AMDGPU_HWMON=`ls -la /sys/class/hwmon | grep $AMDGPU_PCI_ADDR | awk '{print $9}'`
HWMON_DIR=/sys/class/hwmon/${AMDGPU_HWMON}
LOGFILE=pp_printf.test.log

lspci -nn | grep "VGA\|Display"  > $LOGFILE
FILES="pp_dpm_sclk
pp_power_profile_mode "

for f in $FILES
do
  echo === $f === >> $LOGFILE
  cat $HWMON_DIR/device/$f >> $LOGFILE
done
cat $LOGFILE

Signed-off-by: Darren Powell <darren.powell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
 initial modification of files
  renoir_ppt.c
  aldebaran_ppt.c
  yellow_carp_ppt.c

=== Test ===
AMDGPU_PCI_ADDR=`lspci -nn | grep "VGA\|Display" | cut -d " " -f 1`
AMDGPU_HWMON=`ls -la /sys/class/hwmon | grep $AMDGPU_PCI_ADDR | awk '{print $9}'`
HWMON_DIR=/sys/class/hwmon/${AMDGPU_HWMON}
LOGFILE=pp_printf.test.log

lspci -nn | grep "VGA\|Display"  > $LOGFILE
FILES="pp_dpm_sclk
pp_power_profile_mode "

for f in $FILES
do
  echo === $f === >> $LOGFILE
  cat $HWMON_DIR/device/$f >> $LOGFILE
done
cat $LOGFILE

Signed-off-by: Darren Powell <darren.powell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
alexsierrag and others added 14 commits October 12, 2021 16:48
This fix the deadlock with the BO reservations during SVM_BO evictions
while allocations in VRAM are concurrently performed. More specific,
while the ttm waits for the fence to be signaled (ttm_bo_wait), it
already has the BO reserved. In parallel, the restore worker might be
running, prefetching memory to VRAM. This also requires to reserve the
BO, but blocks the mmap semaphore first. The deadlock happens when the
SVM_BO eviction worker kicks in and waits for the mmap semaphore held
in restore worker. Preventing signal the fence back, causing the
deadlock until the ttm times out.

We don't need to hold the BO reservation anymore during validation
and mapping. Now the physical addresses are taken from hmm_range_fault.
We also take migrate_mutex to prevent range migration while
validate_and_map update GPU page table.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
(cherry picked from commit e934786f57dd4039c3dfacfef2910f3ad29ad8b5)

Change-Id: I75ec8b036366d0f8571990ed7c14a13ef21ba82a
sysfs_emit and sysfs_emit_at requrie a page boundary
aligned buf address. Make them happy!

v2: use an inline function.

Warning Log:
[  492.545174] invalid sysfs_emit_at: buf:00000000f19bdfde at:0
[  492.546416] WARNING: CPU: 7 PID: 1304 at fs/sysfs/file.c:765 sysfs_emit_at+0x4a/0xa0
[  492.654805] Call Trace:
[  492.655353]  ? smu_cmn_get_metrics_table+0x40/0x50 [amdgpu]
[  492.656780]  vangogh_print_clk_levels+0x369/0x410 [amdgpu]
[  492.658245]  vangogh_common_print_clk_levels+0x77/0x80 [amdgpu]
[  492.659733]  ? preempt_schedule_common+0x18/0x30
[  492.660713]  smu_print_ppclk_levels+0x65/0x90 [amdgpu]
[  492.662107]  amdgpu_get_pp_od_clk_voltage+0x13d/0x190 [amdgpu]
[  492.663620]  dev_attr_show+0x1d/0x40

Signed-off-by: Lang Yu <lang.yu@amd.com>
Acked-by: Huang Rui <ray.huang@amd.com>
Current RUNPM mechanism relies on PMFW to master the timing for BACO
in/exit. And that needs cooperation from sound driver for dstate
change notification for function 1(audio). Otherwise(on sound driver
missing), BACO cannot be kicked in correctly and hang will be observed
on RUNPM exit.

By switching back to legacy message way on sound driver missing,
we are able to fix the runpm hang observed for the scenario below:
amdgpu driver loaded -> runpm suspend kicked -> sound driver loaded

Change-Id: I0e44fef11349b5e45e6102913eb46c8c7d279c65
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reported-and-tested-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
SVM range may includes multiple VMAs with different vm_flags, if prange
page index is the last page of the VMA offset + npages, update GPU
mapping to create GPU page table with same VMA access permission.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
No function change, use pr_debug_ratelimited to avoid per page debug
message overflowing dmesg buf and console log.

use dev_err to show error message from unexpected situation, to provide
clue to help debug without enabling dynamic debug log. Define dev_fmt to
output function name in error message.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
migrate_vma_setup may return cpages 0, means 0 page can be migrated,
treat this as error case to skip the rest of vma migration steps.

Change svm_migrate_vma_to_vram and svm_migrate_vma_to_ram to return the
number of pages migrated successfully or error code. The caller add up
all the successful migration pages and update prange->actual_loc only if
the total migrated pages is not 0.

This also removes the warning message "VRAM BO missing during
validation" if migration cpages is 0.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Enable GFX RAS under generic RAS enablement path

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Change-Id: I41ccc83bee77c4f0c0f50379581cbaf0902a6afd
(cherry picked from commit a0020492cbeff6952d49b00148ff8b5fd47f4422)
(cherry picked from commit 1abe53dc78f9c2a495256ca8c1b08c08c38648a2)
When creating unregistered new svm range to recover retry fault, avoid
new svm range to overlap with ranges or userptr ranges managed by TTM,
otherwise svm migration will trigger TTM or userptr eviction, to evict
user queues unexpectedly.

Change helper amdgpu_ttm_tt_affect_userptr to return userptr which is
inside the range. Add helper svm_range_check_vm_userptr to scan all
userptr of the vm, and return overlap userptr bo start, last.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
In older kernel version 5.9, migrate.cpages reset to 0 by
migrate_vma_finalize, so use local variable to save migrate.cpages. This
change work for older kernels and upstream kernel.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
The userptr can be unmapped by application and still registered to
driver, restore userptr work return user pages will get -EFAULT bad
address error. Pretend this error as succeed. GPU access this userptr
will have VM fault later, it is better than application soft hangs with
stalled user mode queues.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
On arcturus, not all platforms use PMFW based fan control. On such
ASICs fan control by PMFW will be disabled in PPTable. Disable hwmon
knobs for fan control also as it is not possible to report or control
fan speed on such platforms through driver.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
filter-out first argument is the space separated list of patterns.
Current implementation will remove any individual "-include" positional
arguments which will make cmdline invalid.

Signed-off-by: Sergei Iudin <tsipa740@gmail.com>
@fxkamd
Copy link
Contributor

fxkamd commented Oct 28, 2021

This was also just found and fixed internally. I don't think the fix made it into ROCm 4.5, which was just released last night. It should make it into the next release after that. Our internal patch looks different, though:

--- a/drivers/gpu/drm/amd/dkms/Makefile
+++ b/drivers/gpu/drm/amd/dkms/Makefile
@@ -125,7 +125,7 @@ endif
 export OS_NAME OS_VERSION
 
 LINUX_SRCTREE_INCLUDE := \
-       $(filter-out -I%/uapi -include %/kconfig.h,$(LINUXINCLUDE))
+       $(filter-out -I%/uapi "-include %/kconfig.h",$(LINUXINCLUDE))
 USER_INCLUDE := $(filter-out $(LINUX_SRCTREE_INCLUDE), $(LINUXINCLUDE))
 
 LINUXINCLUDE := \

There was some discussion about a more reliable way to fix this. I'm pointing them at your patch for reference. Thank you for reporting the issue and working out a fix.

@tsipa
Copy link
Author

tsipa commented Oct 28, 2021

"-include %/kconfig.h" works too. Feel free to close PR if you think it's irrelevant or let me know if you want me to change it from sed to "filter-out" stanza.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.