Reduce long duration for the exit -6 re-run process. #9400

Merged
4 commits merged on Nov 19, 2024

@waliwali777 (Contributor) commented Nov 11, 2024

PR types

Others

PR changes

Others

Description

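The online auto-parallel PaddleNLP-CI-gpt-3 CI randomly fails with exit -6 after test execution has finished. Previously, the CI was terminated and users were asked to fix it by re-running manually. Because the random failure rate is fairly high, this placed a heavy burden on users. This PR therefore handles the problem with an automatic re-run:

  1. To run a test_case, a child process writes the set of test functions into function.txt; the parent process reads the function names from function.txt and invokes each one in a child process. The parent can therefore capture the ExitCode of every test.
  2. A child process returning ExitCode = 250 indicates the random exit -6 crash. That test is automatically re-run once. If the problem reproduces, the log only notes that the test hit the random crash multiple times; the CI is not interrupted and continues with the remaining tests.
  3. After the test_case finishes, the script reports the current case's total number of tests, plus the number of passed tests, tests that failed at runtime, tests that failed accuracy verification, and tests that hit the random crash.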
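
The three steps above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual script: only `function.txt` and exit code 250 come from the description, while every function and variable name is hypothetical. The demo writes `function.txt` directly rather than from a child process, uses a subshell as the child process, and lumps accuracy-check failures into the generic failure count because the description does not say how they are signalled.

```shell
#!/usr/bin/env bash
# Sketch only: all names are hypothetical except function.txt and exit code 250.

RANDOM_CRASH_CODE=250      # exit code the PR description maps to the random "exit -6" crash
work_dir=$(mktemp -d)      # scratch directory for this demo

total=0; passed=0; failed=0; random_crash=0

# Stand-in tests covering each outcome.
test_ok()           { return 0; }
test_fail()         { return 1; }
test_always_crash() { return 250; }
test_crash_once() {        # crashes on the first attempt, passes on the re-run
    if [ ! -e "$work_dir/crash_once.seen" ]; then
        : > "$work_dir/crash_once.seen"
        return 250
    fi
    return 0
}

# Run one test in a subshell (a forked child); the parent reads its exit status.
run_in_child() {
    ( "$1" ) < /dev/null
}

# Run a test, re-run it once on the random-crash code, then update the counters.
run_with_rerun() {
    local name code
    name=$1
    code=0
    total=$((total + 1))
    run_in_child "$name" || code=$?
    if [ "$code" -eq "$RANDOM_CRASH_CODE" ]; then
        echo "[$name] exit $code: random crash suspected, re-running once"
        code=0
        run_in_child "$name" || code=$?
    fi
    case "$code" in
        0)  passed=$((passed + 1)) ;;
        "$RANDOM_CRASH_CODE")
            random_crash=$((random_crash + 1))
            echo "[$name] random crash reproduced on re-run; continuing" ;;
        *)  failed=$((failed + 1))
            echo "[$name] failed with exit $code" ;;
    esac
}

# The parent reads the test names from function.txt and runs them one by one.
printf '%s\n' test_ok test_fail test_crash_once test_always_crash > "$work_dir/function.txt"
while IFS= read -r name; do
    run_with_rerun "$name"
done < "$work_dir/function.txt"

echo "total=$total passed=$passed failed=$failed random_crash=$random_crash"
rm -rf "$work_dir"
```

With the four stand-in tests, the summary line reads `total=4 passed=2 failed=1 random_crash=1`: the test that crashes once is counted as passed after its re-run, and the one that crashes twice is reported without stopping the loop.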

paddle-bot bot commented Nov 11, 2024

Thanks for your contribution!

codecov bot commented Nov 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.97%. Comparing base (b5e3f0c) to head (77d8675).
Report is 14 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9400      +/-   ##
===========================================
- Coverage    53.01%   52.97%   -0.04%     
===========================================
  Files          678      676       -2     
  Lines       108787   107838     -949     
===========================================
- Hits         57668    57124     -544     
+ Misses       51119    50714     -405     


@@ -117,6 +117,8 @@ function llm_qwen_case_list_auto() {

function llama_dygraph_auto_bs8_fp32_DP2() {
echo "=========== $FUNCNAME run begin ==========="
export_env
cd ${llama_case_path}
Contributor

Why does every function need to call export_env and cd ${llama_case_path} again? I suggest running them once as a hook in run_ci.sh.
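
One way to read this suggestion is sketched below, assuming the runner performs the setup once and each case then runs in a subshell that inherits the environment and working directory. `setup_hook`, `case_a`, `case_b`, and the body of `export_env` are hypothetical stand-ins; the real `export_env` and `run_ci.sh` may be structured differently.

```shell
#!/usr/bin/env bash
# Sketch only: setup_hook, case_a and case_b are hypothetical names.

llama_case_path=$(mktemp -d)   # stand-in for the real case directory
env_calls=0
cases_ok=0

export_env() {                 # stand-in body; the real export_env lives in the CI scripts
    export DEMO_CI_FLAG=1
    env_calls=$((env_calls + 1))
}

# One-time setup performed by the runner instead of by every case function.
setup_hook() {
    export_env
    cd "$llama_case_path" || return 1
    case_dir=$(pwd)
}

# Cases no longer call export_env or cd to the case directory themselves.
case_a() { [ "$DEMO_CI_FLAG" = 1 ] && [ "$(pwd)" = "$case_dir" ] && cd /; }
case_b() { [ "$DEMO_CI_FLAG" = 1 ] && [ "$(pwd)" = "$case_dir" ]; }

setup_hook || exit 1
for name in case_a case_b; do
    # The subshell inherits the environment and working directory, and a cd
    # inside one case cannot leak into the next.
    if ( "$name" ); then
        cases_ok=$((cases_ok + 1))
    fi
done

echo "export_env calls: $env_calls, cases ok: $cases_ok"
cd / && rmdir "$llama_case_path"
```

Running each case in a subshell keeps the single setup valid even when a case changes directory, which may be why per-function setup was considered necessary in the first place.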

llama_align_dy2st_fthenb_and_vpp_auto_bs2_fp32_DP1-MP1-PP4
llama_align_dygraph_dy2st_pir_auto_pp_bs2_bf16_DP1-MP1-PP4
)
restore_func $fun_list
Contributor

I suggest merging restore_llama_case_list_auto_func and llama_case_list_auto into one, so that only a single list is maintained.
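
A single-list layout could look like the sketch below, in which one function prints the case names and both consumers read from it. All names are hypothetical, and the sketch does not reproduce what `restore_func` actually does.

```shell
#!/usr/bin/env bash
# Sketch only: all names are hypothetical stand-ins for the real case functions.

ran=0
demo_case_dp2() { ran=$((ran + 1)); }
demo_case_mp2() { ran=$((ran + 1)); }

# The single source of truth: one case name per line.
demo_case_names() {
    cat <<'EOF'
demo_case_dp2
demo_case_mp2
EOF
}

# Consumer 1: dump the names to a file for another process to read.
write_case_list() { demo_case_names > "$1"; }

# Consumer 2: run every case in order.
run_all_cases() {
    local name
    for name in $(demo_case_names); do
        "$name"
    done
}

list_file=$(mktemp)
write_case_list "$list_file"
run_all_cases
listed=$(grep -c . "$list_file")   # number of non-empty lines written
rm -f "$list_file"
echo "listed $listed cases, ran $ran"
```

Adding or removing a case then means editing one list, so the listing and the execution order cannot drift apart.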

@wawltor (Collaborator) left a comment

LGTM

@wawltor wawltor merged commit 3fe6aba into PaddlePaddle:develop Nov 19, 2024
10 of 12 checks passed