increase parallel tests in Linux #34908

Merged: 1 commit, Aug 16, 2021

@lelelelelez (Contributor) commented Aug 15, 2021

PR types

Others

PR changes

Others

Describe

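  1. Increase unit test concurrency:
    a. Update the mapping between unit tests and memory usage
    b. Depending on memory usage, single-card unit tests run at a concurrency of 48, 14, or 2; multi-card unit tests at 4 or 2; exclusive unit tests at 8, 4, or 2
  2. Update the rerun logic:
    There are currently 1500+ unit tests. If the number of unit tests that fail on the first pass is within 80, they can be rerun directly (1600 * 5%; this figure needs to be monitored after rollout). The tests that failed on the first pass are run once more at lower concurrency to see whether they pass. If they pass, there is no need to enter the rerun logic previously written by the QA team; if this rerun fails, they enter that earlier rerun logic (3 retries, requiring a 50% success rate).

Note: this change does not touch Windows; the Windows-related changes will go into the next PR.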
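
The rerun gate described in item 2 can be sketched roughly as below. This is only an illustration, not the code in the build script: the function name `rerun_decision`, the variable `RERUN_THRESHOLD`, the result labels, and the assumption that a run with 80 or more first-pass failures fails without any rerun are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the rerun gate; names and labels are illustrative.
RERUN_THRESHOLD=80   # from the description: roughly 5% of ~1600 unit tests

# Arguments:
#   $1  number of unit tests that failed on the first pass
#   $2  outcome of the single low-concurrency rerun: "pass" or "fail"
# Prints the path the run would take.
rerun_decision() {
    local failed_count=$1
    local low_parallel_result=$2
    if (( failed_count == 0 )); then
        echo "success"
    elif (( failed_count >= RERUN_THRESHOLD )); then
        # Assumption: too many failures to be flaky, so no rerun at all.
        echo "fail-without-rerun"
    elif [[ "$low_parallel_result" == "pass" ]]; then
        echo "success"
    else
        # Fall through to the earlier retry logic written by the QA team.
        echo "legacy-retry"
    fi
}

rerun_decision 5 pass
rerun_decision 5 fail
```

Keeping the threshold in one variable makes it easy to adjust once real failure counts have been observed after rollout.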

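Additional test data:
Test PR: #34570
Coverage test results (commit: d3fed5):

| Run time | Duration | Test purpose |
| --- | --- | --- |
| Aug 15, 12:24 | - | Rerun logic test points: 1. retry happens only when fewer than 80 unit tests fail on the first pass; 2. if the first retry succeeds, the result is set to success immediately with no further retries; 3. if the first retry fails, 3 retries start, and at least 2 of the 3 must succeed for the result to be set to success |
| Aug 15, 13:25 | Total: 2818s (including unit test analysis); single-card: 1072s; 2-card: 113s; exclusive: 1276s; rerun: 230s | Collect timings for the concurrent run |
| Aug 15, 15:49 | Total: 2887s (including unit test analysis); single-card: 1101s; 2-card: 111s; exclusive: 1315s; rerun: 228s | Collect timings for the concurrent run |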

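Coverage run times for this PR (52c38a3):

  • Run at Aug 15, 19:17, total: 3206s (including unit test analysis); single-card: 1238s; 2-card: 128s; exclusive: 1474s; rerun: 239s
  • Run at Aug 15, 20:28, total: 2802s (including unit test analysis); single-card: 1067s; 2-card: 106s; exclusive: 1282s; rerun: 222s
  • Run at Aug 16, 10:10, total: 2819s (including unit test analysis); single-card: 1076s; 2-card: 108s; exclusive: 1279s; rerun: 230s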
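
As a quick arithmetic check on the five timed runs listed above, the sketch below subtracts the four stage times from each total. The remainder is presumably the unit test analysis step plus other overhead; the source only states that the total includes analysis, so that reading is an assumption. The function name `overhead_per_run` is hypothetical.

```shell
#!/usr/bin/env bash
# One row per run: total single-card two-card exclusive rerun (seconds),
# copied from the runs reported above.
runs="2818 1072 113 1276 230
2887 1101 111 1315 228
3206 1238 128 1474 239
2802 1067 106 1282 222
2819 1076 108 1279 230"

# Prints, for each run, the time not accounted for by the four stages.
overhead_per_run() {
    echo "$runs" | awk '{ print $1 - ($2 + $3 + $4 + $5) }'
}

overhead_per_run
```

The remainders are 127, 132, 127, 125, and 126 seconds, so the part of each run outside the four test stages is nearly constant across runs.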

@paddle-bot-old
Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@zhwesky2010 (Contributor) left a comment

Has this been rerun multiple times? If it only ran once, the risk going forward may be higher, because TIMEOUTs and failures are fairly random and have a large effect on both the gains and the stability. I suggest keeping a close watch after merging.

card_test "$single_card_tests_high_parallel" 1 24 # run cases the most each time with single GPU
card_test "$single_card_tests_secondary_high_parallel" 1 12
card_test "$single_card_tests_third_high_parallel" 1 15
card_test "$single_card_tests_tetrad_parallel" 1 7 # run cases 2 job each time with single GPU

@zhwesky2010 (Contributor), Aug 16, 2021

The name tetrad means four at a time. The values have changed a lot now, so please rename the variable as well to avoid misunderstanding.

@lelelelelez (Contributor, Author)

Done.

single_ut_startTime_s=`date +%s`
card_test "$single_card_tests_high_parallel" 1 24 # run cases the most each time with single GPU
card_test "$single_card_tests_secondary_high_parallel" 1 12
card_test "$single_card_tests_third_high_parallel" 1 15

Contributor

The third tier is 15; is that higher than the second tier?

@lelelelelez (Contributor, Author)

The name is indeed misleading; I'll change it.

@lelelelelez (Contributor, Author)

Done.

@@ -1030,6 +1030,7 @@ function get_quickly_disable_ut() {

function card_test() {
set -m
CTEST_PARALLEL_LEVEL=2

@zhwesky2010 (Contributor), Aug 16, 2021

This is already set in CI; both Coverage and py3 set it. Hard-coding the variable here means CI can no longer set it.

@lelelelelez (Contributor, Author)

That is because py3 sets CTEST_PARALLEL_LEVEL=4 in CI, while coverage uses CTEST_PARALLEL_LEVEL=2. Without hard-coding it here, concurrency this high would not work on py3.

Contributor

> That is because py3 sets CTEST_PARALLEL_LEVEL=4 in CI, while coverage uses CTEST_PARALLEL_LEVEL=2. Without hard-coding it here, concurrency this high would not work on py3.

Then please change the CI configuration for Py3.

@lelelelelez (Contributor, Author)

That should only be changed shortly before this goes live; otherwise py3 concurrency would drop for the other PRs currently in flight. Once this PR is confirmed to be fine, I'll merge it first, then change the CI configuration, and then open another PR to remove this line.

Contributor

OK.
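
The trade-off in this thread can be illustrated with a small sketch. Assigning `CTEST_PARALLEL_LEVEL=2` inside `card_test` overrides whatever the CI job exported, while a default expansion keeps the CI value and falls back to 2 only when nothing is set. The helper name `resolve_parallel_level` is hypothetical, and this is not what the PR does: per the discussion above, the PR hard-codes 2 for now because the py3 job still exports 4.

```shell
#!/usr/bin/env bash
# Hypothetical alternative to a hard-coded assignment: honor the value
# exported by the CI job, and use 2 only when the variable is unset or empty.
resolve_parallel_level() {
    echo "${CTEST_PARALLEL_LEVEL:-2}"
}

echo "effective CTEST_PARALLEL_LEVEL: $(resolve_parallel_level)"
```

With the values quoted in the thread, this form would leave py3 at 4 and coverage at 2, which is why it only becomes viable after the py3 CI configuration is changed.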

single_ut_endTime_s=`date +%s`

multi_ut_startTime_s=`date +%s`
card_test "$multiple_card_tests_two_parallel" 2 4 # run cases 2 job each time with two GPUs

@zhwesky2010 (Contributor), Aug 16, 2021

Many of the comments here are now out of date. Please update them, or delete them, to avoid misunderstanding.

@lelelelelez (Contributor, Author)

Done.

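@lelelelelez (Contributor, Author) commented:

> Has this been rerun multiple times? If it only ran once, the risk going forward may be higher, because TIMEOUTs and failures are fairly random and have a large effect on both the gains and the stability. I suggest keeping a close watch after merging.

This code has already been rerun many times, and before this PR it was also tested in #34570. I'll keep a close watch after it is merged.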

@@ -1318,7 +1360,7 @@ set +x
done

if [[ "$one_card_retry" != "" ]]; then
- card_test "$one_card_retry" 1
+ card_test "$one_card_retry" 1 4

Contributor

Shouldn't this also check the execution count?
The first retry (threshold 80) uses a concurrency of 4 here. For the three later retries, should it switch back to the default concurrency?

@lelelelelez (Contributor, Author)

I don't think that's necessary. The number of failures at this point is very small (single digits), so however high the concurrency is set, there aren't enough cases to actually run in parallel.

set +e
retry_unittests_record="$retry_unittests_record$failed_test_lists"
failed_test_lists_ult=`echo "${failed_test_lists}" |grep -Po '[^ ].*$'`
set -e
- if [[ "${exec_times}" == "1" ]];then
+ if [[ "${exec_times}" == "1" ]] || [[ "${exec_times}" == "3" ]];then

Contributor

If the first and second retries fail and the third retry succeeds, the fourth retry is not executed.

Suppose the fourth retry were executed here and succeeded: the final CI result would still be a failure. For this case, we need to decide whether the final result should be judged a success or a failure.

retry_unittests_record_judge=$(echo ${retry_unittests_ut_name}| tr ' ' '\n' | sort | uniq -c | awk '{if ($1 >=3) {print $2}}')

@lelelelelez (Contributor, Author)

I'll change this in the next PR.
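
To make the quoted `retry_unittests_record_judge` line concrete, here is a small sketch of the same pipeline run on made-up data. It assumes the variable holds one space-separated entry per failed attempt, which the excerpt does not show; the test names and the wrapper function `failed_three_times` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical record: test_a failed 3 times, test_b twice, test_c once.
retry_unittests_ut_name="test_a test_b test_a test_c test_a test_b"

# Same pipeline shape as the quoted line: put one name per line, sort so
# duplicates are adjacent, count them, and keep names seen at least 3 times.
failed_three_times() {
    echo ${retry_unittests_ut_name} | tr ' ' '\n' | sort | uniq -c | awk '{if ($1 >= 3) {print $2}}'
}

failed_three_times
```

With this input only `test_a` is printed. Because the filter counts failures rather than successes, a test that has already failed three times stays on the list even if a later attempt passes, which is the case the reviewer raises.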

@XieYunshen (Contributor) left a comment

LGTM

@lelelelelez merged commit ed6624a into PaddlePaddle:develop on Aug 16, 2021
3 participants