increase parallel tests in Linux #34908
Thanks for your contribution!
f812f1f
to
52c38a3
Compare
Has this been rerun multiple times? If it was only run once, the risk afterwards may be higher, because timeouts and failures are fairly random and have a large effect on both the speedup and stability. I suggest monitoring it closely after merging.
paddle/scripts/paddle_build.sh
Outdated
card_test "$single_card_tests_high_parallel" 1 24 # run cases the most each time with single GPU
card_test "$single_card_tests_secondary_high_parallel" 1 12
card_test "$single_card_tests_third_high_parallel" 1 15
card_test "$single_card_tests_tetrad_parallel" 1 7 # run cases 2 job each time with single GPU
The name tetrad means four at a time. The value has changed a lot now, so please rename the variable as well to avoid misunderstanding.
done
paddle/scripts/paddle_build.sh
Outdated
single_ut_startTime_s=`date +%s`
card_test "$single_card_tests_high_parallel" 1 24 # run cases the most each time with single GPU
card_test "$single_card_tests_secondary_high_parallel" 1 12
card_test "$single_card_tests_third_high_parallel" 1 15
The third level is 15. Is that higher than the second level?
The name is indeed a problem. I'll change it.
done
@@ -1030,6 +1030,7 @@ function get_quickly_disable_ut() {

 function card_test() {
 set -m
+CTEST_PARALLEL_LEVEL=2
This is already set in CI; both Coverage and py3 set it. Hard-coding the variable here means CI can no longer configure it.
py3 sets CTEST_PARALLEL_LEVEL=4 in CI, while coverage uses CTEST_PARALLEL_LEVEL=2. If the value is not hard-coded here, this level of parallelism will not work on py3.
Regarding the reply above (py3 sets CTEST_PARALLEL_LEVEL=4 in CI while coverage uses CTEST_PARALLEL_LEVEL=2): then please change the py3 CI configuration.
That should be changed only right before this goes live; otherwise py3 parallelism would drop for the other PRs currently in flight. Once this PR is confirmed to be fine, I'll merge it first, then change the CI configuration, and then open another PR to delete this line.
OK.
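For the follow-up cleanup discussed in this thread, one option besides deleting the line outright is to treat the value as a default rather than an unconditional assignment, so that a CI job which already exports CTEST_PARALLEL_LEVEL keeps its own setting. The following is a minimal sketch under that assumption, not the change made in this PR; `resolve_parallel_level` is a hypothetical helper, and the fallback of 2 only mirrors the value in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of paddle_build.sh): prints the caller's
# CTEST_PARALLEL_LEVEL if it is set and non-empty, otherwise falls back to 2,
# the value hard-coded in the diff above.
resolve_parallel_level() {
    echo "${CTEST_PARALLEL_LEVEL:-2}"
}

# Inside card_test, the hard-coded assignment could then become:
#   CTEST_PARALLEL_LEVEL=$(resolve_parallel_level)
```

Note that this only helps once the py3 CI configuration no longer exports 4; until then the override would reintroduce the problem the author describes.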
paddle/scripts/paddle_build.sh
Outdated
single_ut_endTime_s=`date +%s`

multi_ut_startTime_s=`date +%s`
card_test "$multiple_card_tests_two_parallel" 2 4 # run cases 2 job each time with two GPUs
Many of the comments here are now out of date. Please update them, or delete them, to avoid misunderstanding.
done
52c38a3
to
78994b5
Compare
This code has been rerun many times, and it was also tested in #34570 before this PR. I will keep monitoring it after merging.
@@ -1318,7 +1360,7 @@ set +x
 done

 if [[ "$one_card_retry" != "" ]]; then
-card_test "$one_card_retry" 1
+card_test "$one_card_retry" 1 4
Shouldn't this also check the execution count?
On the first retry (threshold 80) the parallelism here is 4. For the three later retries, should it go back to the default parallelism?
I don't think that's necessary. The number of failures here is small (single digits), so however high the parallelism is set, there aren't enough cases to actually run in parallel.
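The argument in this reply is that the number of jobs running at once can never exceed the number of cases in the list. A minimal sketch of that bound follows; `effective_parallel` is a hypothetical helper, and it assumes a whitespace-separated list of test names purely for illustration (the real retry list in paddle_build.sh may use a different format).

```shell
#!/usr/bin/env bash
# Hypothetical helper: the achievable parallelism is the smaller of the
# requested level and the number of test cases in the list.
# $1: whitespace-separated test names (assumed format), $2: requested level.
effective_parallel() {
    local -a names
    read -r -a names <<< "$1"      # split the list into an array of names
    local count=${#names[@]}
    if [ "$count" -lt "$2" ]; then
        echo "$count"
    else
        echo "$2"
    fi
}

effective_parallel "test_a test_b test_c" 4                        # prints 3
effective_parallel "test_a test_b test_c test_d test_e test_f" 4   # prints 4
```

With only a handful of failed cases, raising the requested level beyond the case count changes nothing.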
 set +e
 retry_unittests_record="$retry_unittests_record$failed_test_lists"
 failed_test_lists_ult=`echo "${failed_test_lists}" |grep -Po '[^ ].*$'`
 set -e
-if [[ "${exec_times}" == "1" ]];then
+if [[ "${exec_times}" == "1" ]] || [[ "${exec_times}" == "3" ]];then
If the first and second retries fail and the third succeeds, the fourth retry is not executed.
Suppose a fourth retry is executed here and succeeds: the final CI result would still be a failure. In that case we need to decide whether the final result counts as a success or a failure.
See
Paddle/paddle/scripts/paddle_build.sh
Line 1393 in 4981894
retry_unittests_record_judge=$(echo ${retry_unittests_ut_name}| tr ' ' '\n' | sort | uniq -c | awk '{if ($1 >=3) {print $2}}')
Will fix in the next PR.
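To make the reviewer's concern concrete, the judge pipeline from line 1393 can be run on sample input. The sketch below wraps the same `tr | sort | uniq -c | awk` chain in a hypothetical function, `tests_failed_at_least_three_times`; the test names are made up. A test that fails in three rounds is reported no matter what a later round does, because a passing round adds nothing to the record.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the pipeline quoted above.
# $1: space-separated names of failed tests, one entry per failed round.
tests_failed_at_least_three_times() {
    # One name per line, count duplicates, keep names seen 3 or more times.
    echo $1 | tr ' ' '\n' | sort | uniq -c | awk '{if ($1 >= 3) {print $2}}'
}

# test_a failed in three rounds, test_b in one (made-up names).
tests_failed_at_least_three_times "test_a test_b test_a test_a"   # prints test_a
```

So if a fourth round were added and test_a passed in it, the recorded count would still be 3 and the test would still be judged a failure, which is the case the reviewer asks to be decided.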
LGTM
PR types
Others
PR changes
Others
Describe
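a. Update the mapping between unit tests and memory usage
b. Depending on memory usage, single-card unit tests run with parallelism 48, 14, or 2; multi-card unit tests with 4 or 2; exclusive unit tests with 8, 4, or 2
There are currently 1500+ unit tests. If the number of unit tests that fail on the first run is within 80, they are rerun directly (1600 * 5%; this figure needs to be monitored after launch). Tests that fail on the first run are executed once more at lower parallelism to see whether they pass. If they pass, the QA team's existing rerun logic is skipped; if this rerun fails, they go through the QA team's existing rerun logic (3 attempts with a 50% success rate).
Note: this change does not touch Windows; the Windows changes will go into the next PR.
Supplementary test data:
Test PR: #34570
Coverage test results (commit: d3fed5):
Coverage run time for this PR at 52c38a3: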
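The first-failure rule in the description can be sketched as follows. This is an illustration only: `first_failure_action` and its return labels are hypothetical, "within 80" is read here as inclusive, and the description does not say what happens above the threshold, so that branch is only labeled.

```shell
#!/usr/bin/env bash
# Threshold from the description: 5% of roughly 1600 unit tests.
RERUN_THRESHOLD=$((1600 * 5 / 100))   # 80

# Hypothetical helper. $1: number of unit tests that failed on the first run.
# Prints a label for the path the description implies for that count.
first_failure_action() {
    local failed=$1
    if [ "$failed" -eq 0 ]; then
        echo "pass"
    elif [ "$failed" -le "$RERUN_THRESHOLD" ]; then
        # Rerun the failed tests once at lower parallelism; only if that
        # rerun fails do they enter the existing retry logic.
        echo "rerun_low_parallel"
    else
        # Not specified in the description; labeled only.
        echo "above_threshold"
    fi
}

first_failure_action 5     # prints rerun_low_parallel
first_failure_action 200   # prints above_threshold
```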