
Spawn multiple pytest workers to speed up CI #758

Merged: 10 commits into NVIDIA:branch-24.10 on Oct 18, 2024

Conversation

@lijinf2 (Collaborator) commented Oct 15, 2024

With pytest -n 4 on a V100 32GB GPU:
- run_tests.sh passed in 20 minutes.
- run_tests.sh --runslow passed in 40 minutes.
- Peak CPU memory usage was around 2 GB.

Signed-off-by: Jinfeng <jinfengl@nvidia.com>
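
A minimal sketch of the kind of invocation described above, assuming pytest-xdist is installed (it provides the -n/--numprocesses option); the test path is illustrative and not taken from the actual scripts:

```bash
# Hypothetical sketch; run_tests.sh is assumed to wrap an invocation along these lines.
pip install pytest-xdist                 # supplies the -n / --numprocesses option

# 4 workers fit on a 32 GB V100; smaller GPUs may need fewer workers.
pytest -n 4 python/tests
pytest -n 4 --runslow python/tests       # --runslow is the project's flag for slow tests
```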
@lijinf2 (Collaborator Author) commented Oct 15, 2024

build

@lijinf2 (Collaborator Author) commented Oct 15, 2024

Pre-merge CI took about 50 minutes.

lijinf2 marked this pull request as ready for review on October 15, 2024 at 18:50.
@eordentlich (Collaborator) commented:

Wasn't it taking 50 min before?

@eordentlich (Collaborator) left a review comment:

@pxLi @YanxuanLiu Any potential issues with this?
Seems to give 2x speedup in premerge.

@lijinf2 (Collaborator Author) commented Oct 16, 2024

build

@lijinf2 (Collaborator Author) commented Oct 16, 2024

build

@pxLi (Collaborator) commented Oct 17, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.

It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

@lijinf2 (Collaborator Author) commented Oct 17, 2024

build

@lijinf2 (Collaborator Author) commented Oct 17, 2024

build

Comment on lines 24 to 25
# Number of CPU cores on the host
CPU_CORES=`nproc`
# Available host memory (kB) divided by an 8 GB-per-worker budget
HOST_MEM_PARALLEL=`cat /proc/meminfo | grep MemAvailable | awk '{print int($2 / (8 * 1024 * 1024))}'`
Collaborator:

Do these reflect container limits or VM limits? And where did the denominators come from?

lijinf2 (Collaborator Author):

The denominator is inherited from the script that Peixin shared. That script expects up to 8 GB of peak host memory usage during the whole test run. I observed ours was around 2.5 GB, well under 8 GB, so the same value works for us.

@pxLi Do you have an idea on whether /proc/meminfo reflects container limits or VM limits?

lijinf2 (Collaborator Author):

Reverted to -n 3 instances.

@pxLi (Collaborator) commented Oct 18, 2024:

> Do you have an idea on whether /proc/meminfo reflects container limits or VM limits?

The info is from the local machine; it is not the correct value inside a Docker container.

Please give it a hard upper bound if possible, since the GPU memory size is always reported correctly. For example: if the dynamic parallelism exceeds the upper bound, go with the upper bound. The link I gave above also covers developer local runs, and Spark uses a hard limit in the same script: https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L174-L186
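
To make the suggestion concrete, here is a hedged sketch (not the actual run_test.sh logic) of clamping a dynamically computed parallelism to a hard upper bound, and of preferring the container's cgroup v2 limit over /proc/meminfo when one is set; the 8 GB-per-worker budget and the MAX_PARALLEL value are illustrative assumptions:

```bash
# Hedged sketch, not the actual CI script.
MAX_PARALLEL=3                      # illustrative hard upper bound

# /proc/meminfo reports the host; inside a container the cgroup v2 limit
# (when one is set) is the more accurate bound.
if [[ -r /sys/fs/cgroup/memory.max && "$(cat /sys/fs/cgroup/memory.max)" != "max" ]]; then
    MEM_KB=$(( $(cat /sys/fs/cgroup/memory.max) / 1024 ))
else
    MEM_KB=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
fi

HOST_MEM_PARALLEL=$(( MEM_KB / (8 * 1024 * 1024) ))       # assume ~8 GB of host memory per worker
PARALLEL=$(( HOST_MEM_PARALLEL > MAX_PARALLEL ? MAX_PARALLEL : HOST_MEM_PARALLEL ))
PARALLEL=$(( PARALLEL < 1 ? 1 : PARALLEL ))               # always run at least one worker
```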

Commits pushed:
- handle corner case minimal 1 pytest worker
- get CUDA_VISIBLE_DEVICES working
@lijinf2 (Collaborator Author) commented Oct 17, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

> It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.

> It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

Thanks for sharing the link.
Added the dynamic parallelism with a few modifications (see the sketch below):
(1) calculate GPU_PARALLEL from the minimum device memory across the visible GPUs;
(2) support CUDA_VISIBLE_DEVICES.
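
A hedged sketch of the approach just described; detecting GPU memory via nvidia-smi, the 5 GB per-worker budget, the variable names, and the test path are illustrative assumptions rather than the exact contents of python/run_test.sh:

```bash
# Hedged sketch: derive pytest parallelism from the smallest visible GPU.

# Respect CUDA_VISIBLE_DEVICES when set; otherwise query all GPUs.
if [[ -n "${CUDA_VISIBLE_DEVICES:-}" ]]; then
    SMI_ARGS=(--id="${CUDA_VISIBLE_DEVICES}")
else
    SMI_ARGS=()
fi

# Minimum memory.total (MiB) across the visible devices, so the smallest GPU
# in a heterogeneous CI pool bounds the parallelism.
MIN_GPU_MEM_MB=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits "${SMI_ARGS[@]}" \
                 | sort -n | head -n 1)

GPU_MEM_PER_WORKER_MB=5120     # assumed per-worker budget, not the project's exact number
MAX_PARALLEL=3                 # hard upper bound; the value 3 is settled later in this thread

NUM_WORKERS=$(( MIN_GPU_MEM_MB / GPU_MEM_PER_WORKER_MB ))
NUM_WORKERS=$(( NUM_WORKERS > MAX_PARALLEL ? MAX_PARALLEL : NUM_WORKERS ))
NUM_WORKERS=$(( NUM_WORKERS < 1 ? 1 : NUM_WORKERS ))      # corner case: at least 1 pytest worker

pytest -n "${NUM_WORKERS}" python/tests   # illustrative test path
```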

@lijinf2 (Collaborator Author) commented Oct 17, 2024

build

@lijinf2 (Collaborator Author) commented Oct 18, 2024

build

@lijinf2 (Collaborator Author) commented Oct 18, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

> It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.

> It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

The script gave 9 on CI. After discussing with Erik, it seems we only need 3 at this stage. No OOM was observed when I ran the pre-merge and nightly tests locally, so it should be safe.

@pxLi (Collaborator) commented Oct 18, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

> It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.
> It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

> The script gave 9 on CI. After discussing with Erik, it seems we only need 3 at this stage. No OOM was observed when I ran the pre-merge and nightly tests locally, so it should be safe.

I replied above: the important part is the GPU memory. Feel free to remove the host-memory detection part; just keep the GPU-memory calculation (e.g. the Spark tests require ~1.5 GB of GPU memory per instance) and put 3 as an upper bound on the parallelism if possible, thanks.

If you can also confirm that hardcoding 3 would not consume more than 15 GB (the minimal GPU memory size in the CI pool), we should be good too.

@lijinf2 (Collaborator Author) commented Oct 18, 2024

build

@lijinf2 (Collaborator Author) commented Oct 18, 2024

> @pxLi @YanxuanLiu Any potential issues with this? Seems to give 2x speedup in premerge.

> It may cause GPU OOM or host-memory OOM, since some machines in the CI pool have only 16/24 GB of GPU memory.
> It would be nice if, for the ML tests, the test script could detect the GPU type (memory size) and then dynamically set the parallelism, e.g. as in https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks.

> The script gave 9 on CI. After discussing with Erik, it seems we only need 3 at this stage. No OOM was observed when I ran the pre-merge and nightly tests locally, so it should be safe.

> I replied above: the important part is the GPU memory. Feel free to remove the host-memory detection part; just keep the GPU-memory calculation (e.g. the Spark tests require ~1.5 GB of GPU memory per instance) and put 3 as an upper bound on the parallelism if possible, thanks.

> If you can also confirm that hardcoding 3 would not consume more than 15 GB (the minimal GPU memory size in the CI pool), we should be good too.

@pxLi Got it. Just added back the dynamic parallelism, with the host-memory constraint removed and a maximum parallelism of 3 added. Our pre-merge and nightly tests took < 4 GB of device memory when tested on my workstation, so hardcoding 3 should consume no more than 12 GB.

python/run_test.sh (outdated review comment, resolved)
@lijinf2 (Collaborator Author) commented Oct 18, 2024

build

@eordentlich (Collaborator) left a review comment:

👍

lijinf2 merged commit 668b20e into NVIDIA:branch-24.10 on Oct 18, 2024. 2 checks passed.