Reduce long duration for the exit -6 re-run process. #9400
Thanks for your contribution!
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9400      +/-   ##
===========================================
- Coverage    53.01%   52.97%   -0.04%
===========================================
  Files          678      676       -2
  Lines       108787   107838     -949
===========================================
- Hits         57668    57124     -544
+ Misses       51119    50714     -405
@@ -117,6 +117,8 @@ function llm_qwen_case_list_auto() {

function llama_dygraph_auto_bs8_fp32_DP2() {
    echo "=========== $FUNCNAME run begin ==========="
    export_env
    cd ${llama_case_path}
Why does every function call export_env and cd ${llama_case_path} again? I suggest running this once as a hook in run_ci.sh instead.
llama_align_dy2st_fthenb_and_vpp_auto_bs2_fp32_DP1-MP1-PP4
llama_align_dygraph_dy2st_pir_auto_pp_bs2_bf16_DP1-MP1-PP4
)
restore_func $fun_list
I suggest merging restore_llama_case_list_auto_func and llama_case_list_auto into one, so that only a single list has to be maintained.
LGTM
PR types
Others
PR changes
Others
Description
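The online auto-parallel CI job PaddleNLP-CI-gpt-3 randomly hangs with exit -6 after the tests have finished executing. The previous handling was to end the CI run and ask the user to re-run it. Because the random hang occurs fairly often, this puts a heavy burden on users. This PR addresses the problem with an automatic re-run:

Each test_case uses a child process to write its set of test functions into function.txt. The parent process reads the function names from function.txt and invokes each one in a child process. The parent process can therefore capture the ExitCode of every test.

ExitCode = 250 indicates that the random exit -6 hang occurred, and that test is automatically re-run once. If the problem reproduces, the log only reports that the test hit the random hang more than once; the CI is no longer interrupted and the remaining tests continue to run.

After a test_case finishes, the following counts are reported for the current case: total tests, passed tests, tests that failed to run, tests that failed the accuracy check, and tests that hit the random hang.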
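
The flow described above can be sketched in a few lines of bash. This is an illustration only, not the script from this PR: the function names (case_ok, case_fail, case_flaky, case_always_hang, run_case_list) and the STATE_DIR variable are hypothetical, and the sketch does not count accuracy-check failures separately. Only function.txt and the exit code 250 come from the description.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical names, not the PR's actual CI script.

RANDOM_HANG_CODE=250   # per the description, 250 marks the random "exit -6" hang

# Hypothetical test functions standing in for real CI cases.
case_ok()          { return 0; }
case_fail()        { return 1; }
case_always_hang() { return "$RANDOM_HANG_CODE"; }
case_flaky() {
    # Simulates a hang on the first attempt only; the re-run passes.
    if [ ! -e "$STATE_DIR/flaky_seen" ]; then
        touch "$STATE_DIR/flaky_seen"
        return "$RANDOM_HANG_CODE"
    fi
    return 0
}

# Parent loop: read function names, run each in a subshell, capture the exit code.
run_case_list() {
    local list_file=$1 fn rc
    total=0; passed=0; failed=0; hung=0
    while IFS= read -r fn; do
        [ -n "$fn" ] || continue
        total=$((total + 1))
        rc=0; ( "$fn" ) || rc=$?
        if [ "$rc" -eq "$RANDOM_HANG_CODE" ]; then
            echo "[retry] $fn hit the random hang, re-running once"
            rc=0; ( "$fn" ) || rc=$?
        fi
        if [ "$rc" -eq 0 ]; then
            passed=$((passed + 1))
        elif [ "$rc" -eq "$RANDOM_HANG_CODE" ]; then
            hung=$((hung + 1))
            echo "[warn] $fn hit the random hang again, continuing with remaining tests"
        else
            failed=$((failed + 1))
        fi
    done < "$list_file"
}

STATE_DIR=$(mktemp -d)
# A child process writes the function names; the parent reads them back.
( printf '%s\n' case_ok case_fail case_flaky case_always_hang > "$STATE_DIR/function.txt" )
run_case_list "$STATE_DIR/function.txt"
echo "total=$total passed=$passed failed=$failed hung=$hung"
rm -rf "$STATE_DIR"
```

With the four stand-in cases, the summary line reads total=4 passed=2 failed=1 hung=1: the flaky case recovers on its single re-run, while the case that always hangs is only reported and does not stop the loop.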