test: Enable GB200 torch compile multi gpu tests #6145
Conversation
Walkthrough

The updates modify SLURM job environment variable handling in a Jenkins Groovy script, simplify test parameterization and skip logic in a PyTorch LLM API integration test file, adjust a test configuration parameter, and remove a skip entry from a test waiver list. No changes to public APIs or control flow were made.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Jenkins
    participant SLURM
    participant Docker
    Jenkins->>SLURM: Submit job with NVIDIA_IMEX_CHANNELS=0 (env or --container-env)
    SLURM->>Docker: Launch container with NVIDIA_IMEX_CHANNELS=0
    Docker-->>SLURM: Run tests
    SLURM-->>Jenkins: Return results
```
Estimated code review effort: 2 (10–30 minutes)
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" |
Actionable comments posted: 0
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
351-359: Pass the value inline to --container-env to avoid relying on a prior export.

--container-env accepts both NAME (copy from parent env) and NAME=value. You're exporting the variable later in the launch script, so this will work, but making the value explicit here is slightly safer and self-contained (e.g. if the export line ever gets removed).

- "--container-env=NVIDIA_IMEX_CHANNELS"
+ "--container-env=NVIDIA_IMEX_CHANNELS=0"

This also removes one implicit dependency and makes the generated SLURM command clearer when debugging.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- jenkins/L0_Test.groovy (2 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (12 hunks)
🔇 Additional comments (6)
jenkins/L0_Test.groovy (1)
378-381: Env var correctly exported before srun.

The explicit export NVIDIA_IMEX_CHANNELS=0 ensures the variable is available to SLURM and propagated into the container runtime. Good catch.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
87-87: LGTM: Torch compile parameterization enabled.

The removal of device-specific skip conditions for the torch_compile parameter is consistent with the PR objective to enable GB200 torch compile multi-GPU tests.
103-103: LGTM: Multi-GPU torch compile testing enabled.

The parameterization change enables torch compile testing on 4-GPU configurations, which directly supports the PR objective of enabling GB200 torch compile multi-GPU tests.
131-131: LGTM: FP8 torch compile testing enabled.

The removal of device-specific skip conditions for both single-GPU and multi-GPU FP8 tests enables comprehensive torch compile testing across different quantization scenarios.

Also applies to: 154-154
637-637: LGTM: Comprehensive torch compile testing enabled for DeepSeekV3Lite.

The consistent removal of device-specific skip conditions across all test methods (bfloat16, fp8_block_scales, and nvfp4) enables comprehensive torch compile testing for both single-GPU and multi-GPU configurations. The existing skip logic for unsupported combinations (e.g., PP with torch.compile) is properly maintained.

Also applies to: 671-671, 714-714, 859-859, 1055-1055, 1097-1097
1333-1336: LGTM: Explicit torch compile configuration for multi-GPU testing.

The explicit configuration of TorchCompileConfig with enable_fullgraph=True enables torch compile in full graph mode for this multi-GPU test. This approach differs from the parameterized approach in other tests but is appropriate for ensuring consistent torch compile behavior in this specific test scenario.
PR_Github #12206 [ run ] triggered by Bot |
PR_Github #12206 [ run ] completed with state |
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" |
PR_Github #12213 [ run ] triggered by Bot |
PR_Github #12213 [ run ] completed with state |
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" |
PR_Github #12221 [ run ] triggered by Bot |
PR_Github #12221 [ run ] completed with state |
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" |
PR_Github #12225 [ run ] triggered by Bot |
PR_Github #12225 [ run ] completed with state |
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" |
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (51)
- cpp/CMakeLists.txt (2 hunks)
- cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/buffers.h (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (0 hunks)
- cpp/tensorrt_llm/nanobind/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/common/bindTypes.h (0 hunks)
- cpp/tensorrt_llm/nanobind/common/customCasters.h (0 hunks)
- cpp/tensorrt_llm/nanobind/executor/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/executor/bindings.h (0 hunks)
- cpp/tensorrt_llm/nanobind/executor/executor.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/executor/executor.h (0 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.h (0 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.h (0 hunks)
- cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/runtime/bindings.h (0 hunks)
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (0 hunks)
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (0 hunks)
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (0 hunks)
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (0 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
- examples/models/core/llama/summarize_long.py (1 hunks)
- examples/models/core/qwen2audio/run.py (1 hunks)
- examples/models/core/qwenvl/run.py (1 hunks)
- jenkins/Build.groovy (0 hunks)
- jenkins/L0_Test.groovy (2 hunks)
- tensorrt_llm/builder.py (1 hunks)
- tensorrt_llm/commands/build.py (1 hunks)
- tensorrt_llm/runtime/model_runner.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (12 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (0 hunks)
- tests/integration/test_lists/waives.txt (0 hunks)
- tests/unittest/bindings/test_bindings_ut.py (0 hunks)
- tests/unittest/bindings/test_executor_bindings.py (1 hunks)
💤 Files with no reviewable changes (34)
- tests/integration/test_lists/waives.txt
- cpp/tensorrt_llm/nanobind/executor/request.h
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.h
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.h
- cpp/tensorrt_llm/nanobind/runtime/bindings.h
- cpp/tensorrt_llm/nanobind/batch_manager/buffers.h
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h
- tests/unittest/bindings/test_bindings_ut.py
- cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp
- cpp/tensorrt_llm/nanobind/executor/bindings.h
- tests/integration/test_lists/test-db/l0_a10.yml
- jenkins/Build.groovy
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/nanobind/runtime/bindings.cpp
- cpp/tensorrt_llm/nanobind/executor/executorConfig.h
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp
- cpp/tensorrt_llm/nanobind/executor/executor.cpp
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/nanobind/common/bindTypes.h
- cpp/tensorrt_llm/nanobind/executor/bindings.cpp
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
- cpp/tensorrt_llm/nanobind/common/customCasters.h
- cpp/tensorrt_llm/nanobind/executor/request.cpp
- cpp/tensorrt_llm/nanobind/executor/executor.h
✅ Files skipped from review due to trivial changes (1)
- tests/unittest/bindings/test_executor_bindings.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- jenkins/L0_Test.groovy
🧰 Additional context used
🧬 Code Graph Analysis (2)
examples/models/core/llama/summarize_long.py (1)
- tensorrt_llm/runtime/generation.py (1): kv_cache_type (1169-1170)

examples/models/core/qwenvl/run.py (1)
- tensorrt_llm/runtime/generation.py (1): kv_cache_type (1169-1170)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (18)
cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (1)
171-171: Good improvement: Switched to exclusive ownership model.

The change from std::shared_ptr to std::unique_ptr for transformerBuffers is appropriate and beneficial. This aligns with the nanobind removal and provides:
- More explicit ownership semantics (exclusive vs shared)
- Better performance (no reference counting overhead)
- Clearer intent that this buffer is exclusively owned by the RuntimeBuffers instance
Since the class already has deleted copy operations, shared ownership was unnecessary.
cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp (1)
87-88: Consistent implementation change for unique ownership.

The change from std::make_shared to std::make_unique correctly implements the ownership model change from the header. The constructor parameters and initialization logic remain unchanged, ensuring functional consistency while improving ownership semantics.

cpp/CMakeLists.txt (2)
201-204: LGTM: Simplified pybind11 inclusion logic

The removal of the OR BUILD_DEEP_EP condition correctly aligns with the nanobind removal effort. The logic now cleanly restricts pybind11 inclusion to only when BINDING_TYPE is explicitly set to "pybind".

220-222: LGTM: Consistent pybind11 include directory logic

The change maintains consistency with the subdirectory inclusion logic above, ensuring pybind11 includes are only added when BINDING_TYPE equals "pybind".

tensorrt_llm/runtime/model_runner.py (1)
89-89: No action needed: KVCacheType initializes from strings via its pybind constructor

The KVCacheType enum in Python is bound with

.def(py::init(&tr::ModelConfig::KVCacheTypeFromString))

so calling KVCacheType(builder_config.get('kv_cache_type')) correctly parses the string into the enum. This matches existing patterns in builder.py and across the codebase.

examples/models/core/qwenvl/run.py (1)
121-121: Consistent KVCacheType instantiation pattern

The change from KVCacheType.from_string() to KVCacheType() is consistent with the pattern seen in other files. This maintains uniformity across the codebase.

examples/models/core/llama/summarize_long.py (1)

100-100: Consistent KVCacheType constructor usage

The change follows the same pattern as other files in the codebase, replacing the factory method with direct constructor usage. This maintains consistency across the entire codebase.
cpp/tensorrt_llm/pybind/bindings.cpp (1)
173-173: API improvement: Constructor-based enum instantiation

The change from a from_string method to a constructor-based approach using py::init is a good improvement. This makes the API more Pythonic and consistent with standard enum usage patterns.

# Before: KVCacheType.from_string("paged")
# After:  KVCacheType("paged")

This change aligns well with the corresponding Python-side updates across the codebase.
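The constructor-accepts-string behavior can be mimicked with a plain Python enum. MockKVCacheType below is an illustrative stand-in, not the pybind-generated KVCacheType; the member names are assumptions, and the case-insensitive fallback via _missing_ is one possible way to get KVCacheType("paged")-style construction in pure Python.

```python
from enum import Enum

class MockKVCacheType(Enum):
    CONTINUOUS = "continuous"
    PAGED = "paged"
    DISABLED = "disabled"

    @classmethod
    def _missing_(cls, value):
        # Called when value-based lookup fails; map a case-insensitive
        # string spelling onto the matching member so that both
        # MockKVCacheType("paged") and MockKVCacheType("PAGED") work.
        if isinstance(value, str):
            for member in cls:
                if member.value == value.lower():
                    return member
        return None

print(MockKVCacheType("paged"))  # MockKVCacheType.PAGED
```

This is why the Python-side call sites could drop from_string: constructing the enum from its string spelling is the ordinary constructor path.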
examples/models/core/qwen2audio/run.py (1)
125-125: Consistent API update for KVCacheType instantiation

The change from KVCacheType.from_string() to the direct constructor call KVCacheType() is consistent with the new constructor-based API. The functionality remains identical while following the updated enum instantiation pattern.

tensorrt_llm/commands/build.py (1)
134-134: Simplified argument parsing with direct enum usage

The change from using a custom enum_type wrapper to directly using KVCacheType as the argument type is a good simplification. This leverages the enum's built-in string parsing capabilities through its constructor, reducing complexity while maintaining the same functionality.
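Why this works can be sketched with a stand-in enum (CacheKind and its members are hypothetical, not the real KVCacheType): because the enum constructor parses strings, argparse can use it directly as the type= callable, with no wrapper in between.

```python
import argparse
from enum import Enum

# Stand-in enum whose constructor accepts the string value of a member,
# as KVCacheType's does after this change.
class CacheKind(Enum):
    PAGED = "paged"
    CONTINUOUS = "continuous"

# The enum class itself is the type= callable: argparse calls
# CacheKind("continuous") on the raw CLI string, and a bad value raises
# ValueError, which argparse reports as a usage error.
parser = argparse.ArgumentParser()
parser.add_argument("--kv-cache-type", type=CacheKind, default=CacheKind.PAGED)

args = parser.parse_args(["--kv-cache-type", "continuous"])
print(args.kv_cache_type)  # CacheKind.CONTINUOUS
```

The parsed attribute is already an enum member, so downstream code never handles raw strings.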
596-596: Approved: KVCacheType String Conversion via Constructor

The change to use KVCacheType(config.pop('kv_cache_type')) is consistent with existing patterns elsewhere in the codebase (e.g., runtime/model_runner.py and various example scripts), demonstrating that the constructor already handles string inputs correctly. No further updates are needed.

cpp/tensorrt_llm/nanobind/CMakeLists.txt (4)
6-6: Source file list has been drastically reduced

The source files have been reduced to only ipcNvlsMemory.cu and bindings.cpp. Given that the nanobind module is being simplified/removed, ensure that these are the only required sources and that no critical functionality is being lost.

19-21: Simplified library linking

The removal of conditional NVSHMEM libraries and CUDA_NVML_LIB aligns with the reduced functionality. The retained libraries (SHARED_TARGET, Python3_LIBRARIES, TORCH_LIBRARIES, torch_python) appear sufficient for the remaining functionality.

24-24: Updated compile definition to nanobind-specific macro

The change from PYBIND11_DETAILED_ERROR_MESSAGES=1 to NB_DETAILED_ERROR_MESSAGES=1 correctly reflects the use of nanobind instead of pybind11 for this module.

31-31: Simplified runtime path configuration

The removal of the CUDA stubs directory from the rpath is consistent with the reduced dependencies. The remaining paths for local libs and NCCL should be sufficient.
cpp/tensorrt_llm/pybind/executor/bindings.cpp (1)
247-247: Simplified binding leverages pybind11's native type support

The direct exposure of getLatestEvents with an optional timeout is cleaner and more maintainable than the previous lambda wrapper approach. This change correctly utilizes pybind11's built-in support for std::optional and chrono types.

cpp/tensorrt_llm/nanobind/bindings.cpp (2)
2-2: Copyright year updated to 2024

The copyright year has been changed from 2025 to 2024, which appears to be a correction since 2025 is in the future.

24-28: Minimal nanobind module retained

The module has been stripped of all functionality but still defines a basic nanobind module with documentation. Consider whether this placeholder module is necessary or if it should be completely removed as part of the nanobind deprecation.
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" |
PR_Github #12232 [ run ] triggered by Bot |
PR_Github #12232 [ run ] completed with state |
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" --disable-fail-fast |
PR_Github #12247 [ run ] triggered by Bot |
Force-pushed from 06338d4 to 33dfd64 (compare)
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" --disable-fail-fast |
PR_Github #12265 [ run ] triggered by Bot |
PR_Github #12247 [ run ] completed with state |
PR_Github #12265 [ run ] completed with state |
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" --disable-fail-fast |
PR_Github #12384 [ run ] triggered by Bot |
PR_Github #12378 [ run ] completed with state |
PR_Github #12384 [ run ] completed with state |
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" --disable-fail-fast |
PR_Github #12396 [ run ] triggered by Bot |
PR_Github #12396 [ run ] completed with state |
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" --disable-fail-fast |
PR_Github #12432 [ run ] triggered by Bot |
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Update test_llm_api_pytorch.py Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
…y_moe_trtllm] Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Force-pushed from 20ca107 to 3778211 (compare)
/bot run --stage-list "GB200-4_GPUs-PyTorch-1,GB200-4_GPUs-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-4" --disable-fail-fast |
PR_Github #12443 [ run ] triggered by Bot |
PR_Github #12432 [ run ] completed with state |
PR_Github #12443 [ run ] completed with state |
/bot skip --comment "Related tests are passed" |
PR_Github #12445 [ skip ] triggered by Bot |
PR_Github #12445 [ skip ] completed with state |
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Summary by CodeRabbit

Bug Fixes

- The NVIDIA_IMEX_CHANNELS variable is properly set for both single-node and multi-node runs.

Tests

Description

Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.