Error encountered when training a model on Ascend #48

Open
LianShuaiLong opened this issue Sep 3, 2024 · 1 comment

Comments

@LianShuaiLong

Training data size is 1.2M; training runs on 16 cards.
The following has already been set:
export HCCL_EXEC_TIMEOUT=17340

Error message:
W NPUStream.cpp:409] Warning: NPU warning, error code is 507046[Error]:
[Error]: In the specified timeout waiting event, all tasks in the specified stream are not completed.
Rectify the fault based on the error information in the ascend log.
EE1002: 2024-09-03-16:50:00.041.231 Stream synchronize timeout. rtDeviceSynchronize execute failed, reason=[stream sync timeout]
Possible Cause: 1. The timeout interval may be improperly set.
Solution: 1. Check whether the timeout interval is properly set. 2. Check whether the network is normal.
TraceBack (most recent call last):
wait for compute device to finish failed, runtime result = 507046.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(function npuSynchronizeUsedDevices)

@yunyiyun

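Please provide the detailed plog logs.
Default plog path: /root/ascend/log
You can set the log level to INFO with export ASCEND_GLOBAL_LOG_LEVEL=1 (0 is DEBUG, 2 is WARN, and 3 is the default ERROR level).
For distributed communication (hccl) problems, event logging usually needs to be enabled as well: export ASCEND_GLOBAL_EVENT_ENABLE=1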
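
The steps above could be combined roughly as follows. This is only a sketch: it assumes the logs are in the default location mentioned above, and `PLOG_DIR` and `ARCHIVE` are hypothetical variable names introduced for this example. The training launch command depends on your setup and is not shown.

```shell
# Sketch only: raise Ascend log verbosity, then archive the plog directory.
# Log level: 0=DEBUG, 1=INFO, 2=WARN, 3=ERROR (default).
export ASCEND_GLOBAL_LOG_LEVEL=1
# Event logs are usually needed for hccl (distributed communication) issues.
export ASCEND_GLOBAL_EVENT_ENABLE=1

# Assumption: default plog location; override PLOG_DIR if yours differs.
PLOG_DIR="${PLOG_DIR:-/root/ascend/log}"
# Hypothetical archive name, timestamped so repeated runs do not collide.
ARCHIVE="${ARCHIVE:-plog_$(date +%Y%m%d_%H%M%S).tar.gz}"

# Re-run the failing training job at this point so the new log settings
# take effect (launch command omitted; it is specific to your environment).

# Bundle the logs so they can be attached to the issue.
if [ -d "$PLOG_DIR" ]; then
    tar -czf "$ARCHIVE" -C "$(dirname "$PLOG_DIR")" "$(basename "$PLOG_DIR")"
    echo "Logs archived to $ARCHIVE"
else
    echo "Log directory not found: $PLOG_DIR" >&2
fi
```

On a multi-node job, the same variables would presumably need to be exported on every node before launching, and the logs collected from each node separately.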
