Spawn multiple pytest workers to speedup CI #758
Conversation
Signed-off-by: Jinfeng <jinfengl@nvidia.com>
build
pre-merge CI took about 50 mins
Wasn't it taking 50 min before?
@pxLi @YanxuanLiu Any potential issues with this?
Seems to give 2x speedup in premerge.
build
build
It may cause GPU OOM or host memory OOM: in the CI pool, some machines have only 16/24 GB of GPU memory. It would be nice if the ML test script could detect the GPU type (memory size) and then set the parallelism dynamically, e.g. like https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L113-L134, instead of hardcoding the numbers. Thanks!
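For illustration, a minimal sketch of what such GPU-memory-based sizing could look like in shell (this is an assumption, not the actual spark-rapids script; PER_WORKER_MB is an illustrative budget, not a value from this thread):

    # Query total memory of the first GPU, in MiB
    GPU_MEM_MB=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
    # Assume each pytest worker needs roughly 4 GB of device memory (illustrative budget)
    PER_WORKER_MB=4096
    PARALLELISM=$(( GPU_MEM_MB / PER_WORKER_MB ))
    echo "GPU has ${GPU_MEM_MB} MiB -> pytest -n ${PARALLELISM}"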
build
build
python/run_test.sh (Outdated)
CPU_CORES=`nproc`
HOST_MEM_PARALLEL=`cat /proc/meminfo | grep MemAvailable | awk '{print int($2 / (8 * 1024 * 1024))}'`
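As a rough sketch of how these two values might be combined into a pytest worker count (the min/`pytest -n` wiring below is an assumption, not necessarily what this PR ends up doing):

    # MemAvailable in /proc/meminfo is reported in kB, so HOST_MEM_PARALLEL above
    # counts how many ~8 GB chunks of host memory are available.
    # Take the smaller of the two limits: one worker per core, one per 8 GB chunk.
    PARALLELISM=$(( CPU_CORES < HOST_MEM_PARALLEL ? CPU_CORES : HOST_MEM_PARALLEL ))
    # ...then hand it to pytest-xdist, e.g.: pytest -n "$PARALLELISM" <tests>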
Do these reflect container limits or VM limits? And where did the denominator come from?
The denominator is inherited from the script that Peixin shared. That script expects about 8 GB of peak memory usage over the whole test run. Ours peaked at about 2.5 GB from what I observed, so the 8 GB budget is much larger than that and works for us as well.
@pxLi Do you have an idea whether /proc/meminfo reflects container limits or VM limits?
Reverted to -n 3 instances.
Do you have an idea whether /proc/meminfo reflects container limits or VM limits?
The info is for the local machine, but it is not the correct value inside Docker.
Please give it a hard upper bound if possible, since the GPU memory size is always reported correctly~
Something like: if the dynamic parallelism > upper_bound, then go with the upper bound. The link I gave above also covers the developer local run, and Spark has a hard limit in the same script: https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/integration_tests/run_pyspark_from_build.sh#L174-L186
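In shell, that clamp could look roughly like this (a sketch; UPPER_BOUND and PARALLELISM are assumed variable names, not code from this PR):

    UPPER_BOUND=3   # hard cap, since host-reported values can be wrong inside Docker
    if [ "$PARALLELISM" -gt "$UPPER_BOUND" ]; then
        PARALLELISM=$UPPER_BOUND
    fi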
Handle corner case: use a minimum of 1 pytest worker; get CUDA_VISIBLE_DEVICES working
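A possible shape for that corner-case handling (a sketch; pinning all workers to GPU 0 via CUDA_VISIBLE_DEVICES is an assumption about the intent, not the commit's actual code):

    # Never spawn fewer than one worker, even if the computed value rounds down to 0
    if [ "$PARALLELISM" -lt 1 ]; then
        PARALLELISM=1
    fi
    # Make every pytest worker see (and share) the same GPU
    export CUDA_VISIBLE_DEVICES=0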
Thanks for sharing the link.
build
build
The script gave 9 on CI. After discussing with Erik, it seems we only need 3 at this stage. No OOM was observed when I ran the pre-merge and nightly tests locally, so it should be safe.
I replied above: the important part is the GPU memory. Feel free to remove the host memory detection part and just do the GPU memory calculation (e.g. the Spark tests required ~1.5 GB of GPU memory per instance). If you can also confirm that a hardcoded 3 would not consume more than 15 GB (the minimum GPU memory size in the CI pool), we should be good too~
build
@pxLi Got it. I just added back the dynamic parallelism, with the host memory constraint removed and a maximum parallelism of 3 added. Our pre-merge and nightly tests took < 4 GB of device memory when tested on my workstation, so a hardcoded 3 should consume no more than 12 GB.
build
👍
With pytest -n 4 on a V100 32GB GPU:
run_tests.sh passed in 20 mins.
run_tests.sh --runslow passed in 40 mins.
Peak CPU memory usage was around 2 GB.
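For reference, that check could be reproduced locally along these lines (a sketch; whether run_tests.sh forwards extra pytest arguments such as -n is an assumption not confirmed in this thread):

    ./run_tests.sh -n 4             # fast suite: ~20 min on a V100 32GB
    ./run_tests.sh --runslow -n 4   # including slow tests: ~40 min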