
Spawn multiple pytest workers to speed up CI #758

Merged: 10 commits into NVIDIA:branch-24.10 on Oct 18, 2024

Conversation

@lijinf2 (Collaborator) commented Oct 15, 2024

With pytest -n 4 on a V100 32GB GPU:
- run_tests.sh passed in 20 minutes.
- run_tests.sh --runslow passed in 40 minutes.
- Peak CPU memory usage was around 2 GB.

Signed-off-by: Jinfeng <jinfengl@nvidia.com>
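
A minimal sketch of the kind of invocation described above, assuming pytest-xdist is installed (it provides the -n/--numprocesses option); the test path is illustrative and not taken from the actual scripts:

```bash
# Hypothetical sketch; run_tests.sh is assumed to wrap an invocation along these lines.
pip install pytest-xdist                 # supplies the -n / --numprocesses option

# 4 workers fit on a 32 GB V100; smaller GPUs may need fewer workers.
pytest -n 4 python/tests
pytest -n 4 --runslow python/tests       # --runslow is the project's flag for slow tests
```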
@lijinf2 (Collaborator Author) commented Oct 15, 2024

build

@lijinf2 (Collaborator Author) commented Oct 15, 2024

Pre-merge CI took about 50 minutes.

lijinf2 marked this pull request as ready for review on October 15, 2024 at 18:50.
@eordentlich (Collaborator) commented:

Wasn't it taking 50 min before?

@eordentlich (Collaborator) left a review comment:

@pxLi @YanxuanLiu Any potential issues with this?
Seems to give 2x speedup in premerge.

@lijinf2 (Collaborator Author) commented Oct 16, 2024

build

@lijinf2 (Collaborator Author) commented Oct 16, 2024

build

@pxLi (Collaborator) commented Oct 17, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.

It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

@lijinf2 (Collaborator Author) commented Oct 17, 2024

build

@lijinf2 (Collaborator Author) commented Oct 17, 2024

build

Comment on lines 24 to 25
# Number of CPU cores on the host
CPU_CORES=`nproc`
# Available host memory (kB) divided by an 8 GB-per-worker budget
HOST_MEM_PARALLEL=`cat /proc/meminfo | grep MemAvailable | awk '{print int($2 / (8 * 1024 * 1024))}'`
Collaborator:

Do these reflect container limits or VM limits? And where did the denominators come from?

lijinf2 (Collaborator Author):

The denominator is inherited from the script that Peixin shared. That script expects up to 8 GB of peak host memory usage during the whole test run. I observed ours was around 2.5 GB, well under 8 GB, so the same value works for us.

@pxLi Do you have an idea on whether /proc/meminfo reflects container limits or VM limits?

lijinf2 (Collaborator Author):

Reverted to -n 3 instances.

@pxLi (Collaborator) commented Oct 18, 2024:

> Do you have an idea on whether /proc/meminfo reflects container limits or VM limits?

The info is from the local machine; it is not the correct value inside a Docker container.

Please give it a hard upper bound if possible, since the GPU memory size is always reported correctly. For example: if the dynamic parallelism exceeds the upper bound, go with the upper bound. The link I gave above also covers developer local runs, and Spark uses a hard limit in the same script: https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L174-L186
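
To make the suggestion concrete, here is a hedged sketch (not the actual run_test.sh logic) of clamping a dynamically computed parallelism to a hard upper bound, and of preferring the container's cgroup v2 limit over /proc/meminfo when one is set; the 8 GB-per-worker budget and the MAX_PARALLEL value are illustrative assumptions:

```bash
# Hedged sketch, not the actual CI script.
MAX_PARALLEL=3                      # illustrative hard upper bound

# /proc/meminfo reports the host; inside a container the cgroup v2 limit
# (when one is set) is the more accurate bound.
if [[ -r /sys/fs/cgroup/memory.max && "$(cat /sys/fs/cgroup/memory.max)" != "max" ]]; then
    MEM_KB=$(( $(cat /sys/fs/cgroup/memory.max) / 1024 ))
else
    MEM_KB=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
fi

HOST_MEM_PARALLEL=$(( MEM_KB / (8 * 1024 * 1024) ))       # assume ~8 GB of host memory per worker
PARALLEL=$(( HOST_MEM_PARALLEL > MAX_PARALLEL ? MAX_PARALLEL : HOST_MEM_PARALLEL ))
PARALLEL=$(( PARALLEL < 1 ? 1 : PARALLEL ))               # always run at least one worker
```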

Commits pushed:
- handle corner case minimal 1 pytest worker
- get CUDA_VISIBLE_DEVICES working
@lijinf2 (Collaborator Author) commented Oct 17, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

> It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.

> It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

Thanks for sharing the link.
Added the dynamic parallelism with a few modifications (see the sketch below):
(1) calculate GPU_PARALLEL from the minimum device memory across the visible GPUs;
(2) support CUDA_VISIBLE_DEVICES.
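
A hedged sketch of the approach just described; detecting GPU memory via nvidia-smi, the 5 GB per-worker budget, the variable names, and the test path are illustrative assumptions rather than the exact contents of python/run_test.sh:

```bash
# Hedged sketch: derive pytest parallelism from the smallest visible GPU.

# Respect CUDA_VISIBLE_DEVICES when set; otherwise query all GPUs.
if [[ -n "${CUDA_VISIBLE_DEVICES:-}" ]]; then
    SMI_ARGS=(--id="${CUDA_VISIBLE_DEVICES}")
else
    SMI_ARGS=()
fi

# Minimum memory.total (MiB) across the visible devices, so the smallest GPU
# in a heterogeneous CI pool bounds the parallelism.
MIN_GPU_MEM_MB=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits "${SMI_ARGS[@]}" \
                 | sort -n | head -n 1)

GPU_MEM_PER_WORKER_MB=5120     # assumed per-worker budget, not the project's exact number
MAX_PARALLEL=3                 # hard upper bound; the value 3 is settled later in this thread

NUM_WORKERS=$(( MIN_GPU_MEM_MB / GPU_MEM_PER_WORKER_MB ))
NUM_WORKERS=$(( NUM_WORKERS > MAX_PARALLEL ? MAX_PARALLEL : NUM_WORKERS ))
NUM_WORKERS=$(( NUM_WORKERS < 1 ? 1 : NUM_WORKERS ))      # corner case: at least 1 pytest worker

pytest -n "${NUM_WORKERS}" python/tests   # illustrative test path
```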

@lijinf2 (Collaborator Author) commented Oct 17, 2024

build

@lijinf2 (Collaborator Author) commented Oct 18, 2024

build

@lijinf2 (Collaborator Author) commented Oct 18, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

> It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.

> It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

The script gave 9 on CI. After discussing with Erik, it seems we only need 3 at this stage. No OOM was observed when I ran the pre-merge and nightly tests locally, so it should be safe.

@pxLi (Collaborator) commented Oct 18, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

> It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.
> It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

> The script gave 9 on CI. After discussing with Erik, it seems we only need 3 at this stage. No OOM was observed when I ran the pre-merge and nightly tests locally, so it should be safe.

I replied above: the important part is the GPU memory. Feel free to remove the host-memory detection part; just keep the GPU-memory calculation (e.g. the Spark tests require ~1.5 GB of GPU memory per instance) and put 3 as an upper bound on the parallelism if possible, thanks.

If you can also confirm that hardcoding 3 would not consume more than 15 GB (the minimal GPU memory size in the CI pool), we should be good too.

@lijinf2 (Collaborator Author) commented Oct 18, 2024

build

@lijinf2 (Collaborator Author) commented Oct 18, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

> It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.
> It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

> The script gave 9 on CI. After discussing with Erik, it seems we only need 3 at this stage. No OOM was observed when I ran the pre-merge and nightly tests locally, so it should be safe.

> I replied above: the important part is the GPU memory. Feel free to remove the host-memory detection part; just keep the GPU-memory calculation (e.g. the Spark tests require ~1.5 GB of GPU memory per instance) and put 3 as an upper bound on the parallelism if possible, thanks.

> If you can also confirm that hardcoding 3 would not consume more than 15 GB (the minimal GPU memory size in the CI pool), we should be good too.

@pxLi Got it. Just added back the dynamic parallelism, with the host-memory constraint removed and a maximum parallelism of 3 added. Our pre-merge and nightly tests took < 4 GB of device memory when tested on my workstation, so hardcoding 3 should consume no more than 12 GB.

python/run_test.sh (outdated review comment, resolved)
@lijinf2 (Collaborator Author) commented Oct 18, 2024

build

@eordentlich (Collaborator) left a review comment:

👍

lijinf2 merged commit 668b20e into NVIDIA:branch-24.10 on Oct 18, 2024. 2 checks passed.