Fix paddle.linalg.cholesky decomposition of large Tensors #74377
Merged
PR Category
Execute Infrastructure
PR Types
Bug fixes
Description
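1. Background
Cholesky decomposition is a common method for factoring a positive-definite square matrix A into the product of a lower-triangular matrix L and its transpose L' (A = LL').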
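To make the factorization A = LL' concrete, here is a minimal pure-Python sketch of the textbook column-by-column Cholesky algorithm. It is only an illustration: `cholesky_lower` is a hypothetical helper name, not part of Paddle or cuSOLVER, and the GPU kernel discussed in this PR does not work this way internally. The error raised for a non-positive-definite input names the order of the failing leading minor, which is the kind of condition a positive `info` value from a potrf-style routine reports.

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L such that a = L * L^T.

    `a` is a symmetric positive-definite matrix given as a list of rows.
    Raises ValueError naming the order of the first leading minor that is
    not positive definite.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: a[j][j] minus the squares already placed in row j.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError(
                f"leading minor of order {j + 1} is not positive definite"
            )
        l[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l


# Small symmetric positive-definite example.
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky_lower(A)
for row in L:
    print(row)
```

Multiplying the resulting L by its transpose reproduces A, and passing a matrix such as [[1, 2], [2, 1]] raises the ValueError at order 2.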
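2. Problems
The previous cholesky_kernel implementation had two main problems:
In the cusolverDnSpotrf API, the workspace size parameter Lwork is of type int. When the total number of matrix elements exceeds the maximum value of int, the value overflows and the computation fails.
Whenever info[i] was nonzero, the kernel reported a generic "leading minor is not positive definite" error. As a result, when a large matrix failed because the API call itself failed, the user still saw an error about positive definiteness, which was misleading.
3. Solution
To address these problems, this commit makes the following changes:
Upgrade the cuSOLVER API
The kernel now uses cusolverDnXpotrf(), which supports a larger workspace. Corresponding compatibility changes were also made in cholesky_grad, so that larger matrices can be handled.
Improve the error-reporting logic
The kernel now gives more accurate and specific error messages based on the different return values of info.
4. Experimental results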