Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drm/amdkfd: fix the incorrect exception handling logic in function amd_acquire() #176

Open
wants to merge 2,469 commits into
base: master
Choose a base branch
from

Conversation

zxpdemonio
Copy link

@zxpdemonio zxpdemonio commented Oct 31, 2024

In function amd_acquire(), kfd_get_process() is call to get process. When judge whether we get an exception pointer, we shouldn't judge whether it's a null pointer, because kfd_get_process will return ERR_PTR(-EINVAL) instead of null pointer if error.

Because of this wrong logic, the kernel will panic then once kfd_get_process() returns ERR_PTR(-EINVAL).

So, the correct logic should be:
if (IS_ERR(p)) {

Fixes: commit 779b4d0("drm/amdkfd: Add RDMA and PeerDirect support")
fixes: #175

srishanm and others added 30 commits May 20, 2024 09:56
The parameters segment_width and last_segment_width are used to control
the configuration of the Output Plane Processor (OPP), specifically the
width of each segment that the display is divided into and the width of
the last segment

Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'segment_width' not described in 'optc35_set_odm_combine'
drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'last_segment_width' not described in 'optc35_set_odm_combine'
drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Excess function parameter 'timing' description in 'optc35_set_odm_combine'

Cc: Tom Chung <chiahsuan.chung@amd.com>
Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Cc: Roman Li <roman.li@amd.com>
Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tom Chung <chiahsuan.chung@amd.com>
Align with new port same as smu 13.x.

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Jack Gui <Jack.Gui@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Update the capabilities for supporting 8k encoding.

Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Ruijing Dong <ruijing.dong@amd.com>
The following commit updated gmc->noretry from 0 to 1 for GC HW IP
9.3.0:

    commit 5f3854f ("drm/amdgpu: add more cases to noretry=1")

This causes the device to hang when a page fault occurs, until the
device is rebooted. Instead, revert back to gmc->noretry=0 so the device
is still responsive.

Fixes: 5f3854f ("drm/amdgpu: add more cases to noretry=1")
Signed-off-by: Tim Van Patten <timvp@google.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
./drivers/gpu/drm/amd/amdgpu/amdgpu.h: amdgpu_umsch_mm.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9063
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
…ing_set_wptr

This commit removes a duplicate check for *is_queue_unmap in the
sdma_v7_0_ring_set_wptr function. The check at line 171 was considered
dead code because at this point in the code, we already know that
*is_queue_unmap is false due to the check at line 161.

By removing this unnecessary check, improves the readability of the
code

Fixes the below:
	drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c:171 sdma_v7_0_ring_set_wptr()
	warn: duplicate check '*is_queue_unmap' (previous on line 161)

drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
    140 static void sdma_v7_0_ring_set_wptr(struct amdgpu_ring *ring)
    141 {
    142         struct amdgpu_device *adev = ring->adev;
    143         uint32_t *wptr_saved;
    144         uint32_t *is_queue_unmap;
    145         uint64_t aggregated_db_index;
    146         uint32_t mqd_size = adev->mqds[AMDGPU_HW_IP_DMA].mqd_size;
    147
    148         DRM_DEBUG("Setting write pointer\n");
    149
    150         if (ring->is_mes_queue) {
    151                 wptr_saved = (uint32_t *)(ring->mqd_ptr + mqd_size);
    152                 is_queue_unmap = (uint32_t *)(ring->mqd_ptr + mqd_size +
                        ^^^^^^^^^^^^^^^^ Set here

    153                                               sizeof(uint32_t));
    154                 aggregated_db_index =
    155                         amdgpu_mes_get_aggregated_doorbell_index(adev,
    156                                                          ring->hw_prio);
    157
    158                 atomic64_set((atomic64_t *)ring->wptr_cpu_addr,
    159                              ring->wptr << 2);
    160                 *wptr_saved = ring->wptr << 2;
    161                 if (*is_queue_unmap) {
                            ^^^^^^^^^^^^^^^ Checked here

    162                         WDOORBELL64(aggregated_db_index, ring->wptr << 2);
    163                         DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n",
    164                                         ring->doorbell_index, ring->wptr << 2);
    165                         WDOORBELL64(ring->doorbell_index, ring->wptr << 2);
    166                 } else {
    167                         DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n",
    168                                         ring->doorbell_index, ring->wptr << 2);
    169                         WDOORBELL64(ring->doorbell_index, ring->wptr << 2);
    170
--> 171                         if (*is_queue_unmap)
                                    ^^^^^^^^^^^^^^^ This is dead code.  We know it's false.

    172                                 WDOORBELL64(aggregated_db_index,
    173                                             ring->wptr << 2);
    174                 }
    175         } else {
    176                 if (ring->use_doorbell) {
    177                         DRM_DEBUG("Using doorbell -- "
    178                                   "wptr_offs == 0x%08x "

Fixes: 6d9c711786e6 ("drm/amdgpu: Add sdma v7_0 ip block support (v7)")
Cc: Likun Gao <Likun.Gao@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
modify the lock type to 'spinlock' to avoid schedule issue
in interrupt context.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Add support to set/get information about different DPM policies. The
support is only available on SOCs which use swsmu architecture.

A DPM policy type may be defined with different levels. For example, a
policy may be defined to select Pstate preference and then later a
pstate preference may be chosen.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
It's caused by 7535d371a27cc788d8f41394a22e3fd492ee5d2c
"drm/amd/pm: Add support for DPM policies"

Signed-off-by: Bob Zhou <bob.zhou@amd.com>
Reviewed-by: Asher Song <Asher.Song@amd.com>
Add PMF message to select a Pstate policy in SOCs with SMU v13.0.6.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Add support to select pstate policy in SOCs with SMUv13.0.6

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Add support to set XGMI PLPD policy levels through 'pm_policy/xgmi_plpd'
sysfs node.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
On SOCs with SMU v13.0.6, allow changing xgmi plpd policy through
'pm_policy/xgmi_plpd' sysfs interface.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
On aldebaran, allow changing xgmi plpd policy through
'pm_policy/xgmi_plpd' sysfs interface.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
modify the lock type to 'spinlock' to avoid schedule issue
in interrupt context.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
On arcturus, allow changing xgmi plpd policy through
'pm_policy/xgmi_plpd' sysfs interface.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Replace the legacy interface with amdgpu_dpm_set_pm_policy to set XGMI
PLPD mode. Also, xgmi_plpd_policy sysfs node is not used by any client.
Remove that as well.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
fix ACA no query result after gpu reset.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Move dsc functions from dc.c to dc_dsc.c.

Co-Developed-by: George Shen <george.shen@amd.com>
Signed-off-by: Wenjing Liu <wenjing.liu@amd.com>
Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
The function that commits planes calls the same set of functions twice,
and in the case of the FAMs utilization, it is not desired to call the
dmub, hwss_build and hwss_execute. This commit just removes the
unnecessary calls to those functions.

Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
…or dcn401

The functions are missing. These two functions are required to support
MST.

Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Wenjing Liu <wenjing.liu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
The comparisons intend to be DCN401 inclusive, and fix it by adding
equal signs.

Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
[Why]
DVI is TMDS signal like HDMI but without audio. Current signal check
does not correctly reflect DVI clock programming.

[How]
Define a new signal check for TMDS that includes DVI to HDMI TMDS
programming.

Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Chris Park <chris.park@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
This fixes indentations and adjust spaces for better readability and
code styles.

Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
[Why]
Enable adaptive scaler support for DCN401

[How]
- Enable build flag for SPL
- Set prefer_easf flag to true
- Apply light linear scaling policy based on transfer function and pixel
  format.  Choose between linear or non-linear scaling
- Set matrix_mode based on pixel format
- Disable ring estimator
- Add missing EASF register defines, masks, and writes
- Disable EASF if scale ratio or number of taps is unsupported and when
  bypassing the scaler
- Add debug flags and registry keys for debugging SPL and EASF
- Add support for Visual Confirm with EASF

Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Samson Tam <Samson.Tam@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
[why]
Cleaning up the code refactor requires hubbub to be in its own
component.

[how]
Move all DCN401 files under newly created hubbub folder and fixing the
makefiles.

Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Harikrishna Revalla <Harikrishna.Revalla@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Need to select DTBCLK and DPREFCLK as DTBCLK_p source according to
hardware guidance.

Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Dillon Varone <dillon.varone@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Acked-by: Roman Li <roman.li@amd.com>
Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
update driver-if interface for smu 14.0.2/3

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why] Coverity reports NULL_RETURN warning.

[How] Add pointer NULL check.

Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Hersen Wu <hersenxs.wu@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Lijo Lazar and others added 26 commits July 12, 2024 14:37
Add debugfs nodes for enabling/disabling and tuning parameters used in
phase detect.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Add support for enabling phase detect and tuning params for SMUv13.0.6
SOCs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Phase detect controls are only available for SMUv13.0.6 dGPUs. Create
control object only on those.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
v3.x changed the how vram width was encoded.  The previous
implementation actually worked correctly for most boards.
Fix the implementation to work correctly everywhere.

This fixes the vram width reported in the kernel log on
some boards.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Limiting the allocation of higher order pages to the closest NUMA node
and enabling direct memory reclaim provides not only failsafe against
situations when memory becomes too much fragmented and the allocator is
not able to satisfy the request from the local node but falls back to
remote pages (HUGEPAGE) but also offers performance improvement.
Accessing remote pages suffers due to bandwidth limitations and could be
avoided if memory becomes defragmented and in most cases without using
manual compaction. (/proc/sys/vm/compact_memory)

Note: On certain distros such as RHEL, the proactive compaction is
disabled. (https://tinyurl.com/4f32f7rs)

v2 (chk): drop __GFP_RECLAIM since that is already set by GFP_USER

Cc: Dave Airlie <airlied@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240708160636.1147308-1-rajneesh.bhardwaj@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
(cherry picked from commit 73534de04176394f72d0ca0e7990bf03addcf5ea)
Add support to get phase detect residency through debugfs

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Add support to get phase detect residency information on SMUv13.0.6
SOCs.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Sysfs node disable query error count during gpu reset.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Cache the PCI state before bus master is disabled. The saved state is
later used for other cases like restoring config space after mode-2
reset.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Fixes: 5c03e58 ("drm/amdgpu:add smu mode1/2 support for aldebaran")
To align with firmware, hbm id field 0x1 refers to
hbm stack 0, 0x2 refers to hbm statck 1.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Data abort exception and unknown errors are supported.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
In the convenience of calling it globally.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Return RMA status without message print.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Instead of printing GPU reset failed.

v2: add check for reset_context->src.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
KIQ timeouts no longer seen.

This reverts commit b4d12cc00ad69e8a0dea2ece7202bacfd8b894fb.

Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
VFs do not perform HW fini/suspend in FLR, so the dpm_enabled
is incorrectly kept enabled. Add interface to disable it in
virt_pre_reset call.

v2: Made implementation generic for all asics
v3: Re-order conditionals so PP_MP1_STATE_FLR is only evaluated on VF

Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Register access from userspace should be blocked until
reset is complete.

Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Stop waiting for the KIQ to return back when there is a reset pending.
It's quite likely that the KIQ will never response.

Signed-off-by: Koenig Christian <Christian.Koenig@amd.com>
Suggested-by: Lazar Lijo <Lijo.Lazar@amd.com>
Tested-by: Victor Skvortsov <victor.skvortsov@amd.com>
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
[background] when unloading amdgpu driver right after running a
workload, drain_workqueue is causing "Fence fallback timer
expired on ring sdma0.0". Under sriov, this issue will cause sriov
full access timeout and a reset happening.

move drain_workqueue before shutdown is set to allow ih process and
before enter full access under sriov to avoid full access time cost.

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Driver mode-2 is only supported by relative new
smc firmware.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
This was only used as workaround for recovering the page tables after
VRAM was lost and is no longer necessary after the function
amdgpu_vm_bo_reset_state_machine() started to do the same.

Compute never used shadows either, so the only proplematic case left is
SVM and that is most likely not recoverable in any way when VRAM is
lost.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Query pmfw version to determine if aux sos fw needs to be loaded
in psp v13.0.

v2: refine callback to check if aux_fw loading is needed instead of
    getting pmfw version barely
v3: return the comparison directly

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
To be compatible with legacy IFWI, driver needs to carry legacy tOS and
query pmfw version to load them accordingly.

Add psp_firmware_header_v2_1 to handle the combined sos binary.

Double the sos count limit for the case of aux sos fw packed.

v2: pass the correct fw_bin_desc to parse_sos_bin_descriptor

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
…ments

It's caused by v6.9-rc6-1554-g8a0a7b98d4b6
drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2

Signed-off-by: Bob Zhou <bob.zhou@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
In multiple GPUs case, after a GPU has started
resetting all GPUs on hive, other GPUs do not
need to trigger GPU reset again.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
…d_acquire()

In function amd_acquire(), kfd_get_process() is call to get process.
When judge whether we get an exception pointer, we shouldn't judge
whether it's a null pointer, because kfd_get_process will return
ERR_PTR(-EINVAL) instead of null pointer if error.

Because of this wrong logic, the kernel will panic then once
kfd_get_process() returns ERR_PTR(-EINVAL).

So, the correct logic should be:
	if (IS_ERR(p)) {

Fixes: commit 779b4d0("drm/amdkfd: Add RDMA and PeerDirect support")

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
@hkasivis
Copy link
Contributor

Looks good. We will merge it soon.

@whchung
Copy link

whchung commented Nov 19, 2024

@kentrussell
Copy link
Contributor

PeerDirect isn't upstreamable, so it cannot go to amd-staging-drm-next (which is what the amd-gfx mailing list is for). You'll notice that kfd_peerdirect.c isn't even present in the amd-staging-drm-next branch. However, I can confirm that this patch was picked internally, and will be in the upcoming ROCm release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Issue]: Incorrect exception handling logic in function amd_acquire()