-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: bump TF to 2.18, PT to 2.5 #4228
Conversation
Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
📝 Walkthrough📝 WalkthroughWalkthroughThe pull request updates the workflow configurations for testing CUDA and Python in the Changes
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
.github/workflows/test_cuda.yml (3)
50-50
: LGTM: TensorFlow and PyTorch versions updatedThe update to TensorFlow 2.18.0rc2 and PyTorch 2.5.0 aligns with the PR objectives. The use of
~=
for version specification is a good practice.Consider using more specific version ranges to avoid potential issues with future releases:
- - run: source/install/uv_with_retry.sh pip install --system "tensorflow~=2.18.0rc2" "torch~=2.5.0" + - run: source/install/uv_with_retry.sh pip install --system "tensorflow~=2.18.0,<2.19.0" "torch~=2.5.0,<2.6.0"This ensures compatibility with patch releases while preventing automatic updates to major versions.
Line range hint
41-47
: Consider removing the commented-out CUDA installation stepThe CUDA installation step has been correctly disabled as it's no longer needed with the NVIDIA Docker image. This is a good change.
For code cleanliness, consider removing this entire block instead of keeping it commented out. If you want to preserve this information for future reference, consider moving it to a separate documentation file or adding it as a comment in the workflow file's header.
You can remove these lines:
- - run: | - wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb \ - && sudo dpkg -i cuda-keyring_1.0-1_all.deb \ - && sudo apt-get update \ - && sudo apt-get -y install cuda-12-3 libcudnn8=8.9.5.*-1+cuda12.3 - if: false # skip as we use nvidia image🧰 Tools
🪛 actionlint
51-51: shellcheck reported issue in this script: SC2155:warning:1:8: Declare and assign separately to avoid masking return values
(shellcheck)
51-51: shellcheck reported issue in this script: SC2155:warning:2:8: Declare and assign separately to avoid masking return values
(shellcheck)
51-51: shellcheck reported issue in this script: SC2102:info:3:61: Ranges can only match single chars (mentioned due to duplicates)
(shellcheck)
Line range hint
1-105
: Summary of CUDA workflow updateThe changes in this PR successfully update the CUDA testing workflow to use CUDA 12.6, TensorFlow 2.18, and PyTorch 2.5, aligning with the PR objectives. Here's a summary of the main points:
- The Docker image has been updated to use CUDA 12.6 and Ubuntu 24.04.
- TensorFlow and PyTorch versions have been updated as intended.
- The libtorch download link has been updated, but there's a potential CUDA version mismatch to verify.
- The redundant CUDA installation step has been disabled.
Please address the following points:
- Verify compatibility with Ubuntu 24.04 for all dependencies and scripts.
- Consider using more specific version ranges for TensorFlow and PyTorch.
- Check and update the libtorch download link to ensure CUDA 12.6 compatibility.
- Remove the commented-out CUDA installation step for code cleanliness.
Once these points are addressed, the PR will be ready for merge.
🧰 Tools
🪛 actionlint
19-19: label "nvidia" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-14.0", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-13.0", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "macos-12.0", "macos-11", "macos-11.0", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- .github/workflows/test_cuda.yml (3 hunks)
🧰 Additional context used
🔇 Additional comments (2)
.github/workflows/test_cuda.yml (2)
22-22
: LGTM: Docker image updated to CUDA 12.6The update to
nvidia/cuda:12.6-devel-ubuntu24.04
aligns with the PR objective of bumping CUDA to 12.6. This change also includes an upgrade to Ubuntu 24.04.Please ensure that all dependencies and scripts are compatible with Ubuntu 24.04. Run the following command to check the Ubuntu version in the workflow:
66-66
: Verify CUDA compatibility for libtorchThe libtorch download link has been updated to version 2.5.0, which is consistent with the PyTorch version installed earlier. However, there's a potential version mismatch:
- The Docker image uses CUDA 12.6
- The libtorch download is for CUDA 12.4
Please verify if libtorch 2.5.0 with CUDA 12.4 is compatible with CUDA 12.6, or if there's a more appropriate version available. You can check the PyTorch website for the latest compatible versions:
If a CUDA 12.6 compatible version is available, consider updating the download link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
.github/workflows/test_python.yml (1)
29-29
: Simplified torch installation looks good, but consider version pinning.The change to simplify the torch installation and ensure the CPU version is installed is a good improvement. It aligns with the PR objective of updating dependencies.
However, consider the following suggestions:
- To ensure reproducibility across different environments and times, it might be beneficial to pin the torch version. This prevents potential issues from unexpected changes in newer versions.
- Verify that this version of torch is compatible with the updated TensorFlow version (2.18 as per PR objectives) to avoid any integration issues.
Consider modifying the line to include a version pin:
- source/install/uv_with_retry.sh pip install --system torch -i https://download.pytorch.org/whl/cpu + source/install/uv_with_retry.sh pip install --system torch==2.5.* -i https://download.pytorch.org/whl/cpuThis ensures you're using the latest patch version of PyTorch 2.5, which aligns with the PR objective of bumping to PT 2.5.
pyproject.toml (1)
279-280
: Overall assessment of version updatesThe changes to
TENSORFLOW_VERSION
andPYTORCH_VERSION
align with the PR objectives to bump CUDA, TensorFlow, and PyTorch versions. These updates are significant and may have far-reaching effects on the project.To ensure a smooth transition to these new versions:
- Conduct thorough testing across the entire codebase, paying special attention to TensorFlow and PyTorch specific functionalities.
- Update any documentation or README files to reflect the new version requirements.
- Consider creating a migration guide for users of the project, highlighting any breaking changes or new features that may affect their usage.
- Monitor the CI/CD pipeline closely after these changes to catch any unforeseen issues early.
- Plan for a potential rollback strategy in case critical issues are discovered post-merge.
These version bumps represent a significant change. While they bring new features and improvements, they also introduce the risk of compatibility issues. Careful testing and monitoring will be crucial for a successful integration.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- .github/workflows/test_python.yml (1 hunks)
- pyproject.toml (1 hunks)
🧰 Additional context used
🔇 Additional comments (3)
.github/workflows/test_python.yml (1)
Line range hint
1-95
: Overall workflow improvements look good.The changes to this workflow file align well with the PR objectives and introduce several improvements:
- The torch installation has been simplified and ensures the CPU version is used.
- The workflow has been updated to ignore certain branches and include merge group events, which can help with CI efficiency.
- Concurrency settings have been refined to better manage parallel runs.
- New jobs for updating test durations have been added, which can help with better test distribution and performance tracking.
These changes should lead to a more efficient and maintainable CI process. Good job on the improvements!
🧰 Tools
🪛 actionlint
27-27: shellcheck reported issue in this script: SC2155:warning:3:8: Declare and assign separately to avoid masking return values
(shellcheck)
27-27: shellcheck reported issue in this script: SC2102:info:4:80: Ranges can only match single chars (mentioned due to duplicates)
(shellcheck)
27-27: shellcheck reported issue in this script: SC2102:info:4:102: Ranges can only match single chars (mentioned due to duplicates)
(shellcheck)
pyproject.toml (2)
279-279
: Verify compatibility with TensorFlow 2.18.0rc2The addition of
TENSORFLOW_VERSION = "2.18.0rc2"
aligns with the PR objective to bump TensorFlow to version 2.18. However, using a release candidate version in a CI/CD pipeline could potentially lead to instability or unexpected behavior.Please ensure that:
- The codebase is compatible with TensorFlow 2.18.0rc2.
- Any breaking changes introduced in this version are addressed.
- The CI pipeline successfully runs with this version.
Consider adding a TODO comment to update to the stable 2.18.0 release when it becomes available.
280-280
: Verify compatibility with PyTorch 2.5.0The update of
PYTORCH_VERSION
from "2.3.1" to "2.5.0" is a significant version bump, which aligns with the PR objective. This update may introduce new features and potentially breaking changes.Please ensure that:
- The codebase is compatible with PyTorch 2.5.0.
- Any breaking changes or deprecations between versions 2.3.1 and 2.5.0 are addressed.
- The CI pipeline successfully runs with this version.
- Performance improvements or new features in PyTorch 2.5.0 are leveraged where applicable.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## devel #4228 +/- ##
==========================================
- Coverage 84.22% 84.20% -0.02%
==========================================
Files 547 547
Lines 51334 51334
Branches 3051 3051
==========================================
- Hits 43235 43226 -9
Misses 7147 7147
- Partials 952 961 +9 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
pyproject.toml (1)
Line range hint
1-580
: Summary of changes and potential impactThe main changes in this PR involve updating the versions of key dependencies:
- TensorFlow updated to 2.18.0rc2 (release candidate)
- PyTorch updated to 2.5.0
- CUDNN constraint removed, allowing for newer versions
These updates align with the PR objectives and may bring performance improvements and new features. However, they also introduce potential risks:
- Using a TensorFlow release candidate in production
- Possible breaking changes with the significant PyTorch version jump
- Potential compatibility issues with unconstrained CUDNN versions
To mitigate these risks and ensure a smooth transition:
- Implement comprehensive integration tests covering critical functionality with these new versions.
- Consider a phased rollout or maintaining a fallback option to quickly revert if issues arise.
- Monitor performance metrics closely after deployment to identify any regressions or improvements.
- Update documentation to reflect any changes in API or behavior resulting from these version updates.
These steps will help maintain the stability and reliability of the project while benefiting from the latest improvements in the underlying libraries.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- pyproject.toml (2 hunks)
🧰 Additional context used
🔇 Additional comments (2)
pyproject.toml (2)
135-135
: Consider the implications of removing CUDNN version constraintThe removal of the version constraint for
nvidia-cudnn-cu12
(from<9
to no constraint) allows for the use of newer CUDNN versions. This change aligns with the CUDA update mentioned in the PR title. However, consider the following:
- Benefit: This change enables the use of the latest CUDNN optimizations and features.
- Risk: Removing the upper bound may lead to compatibility issues with future, untested CUDNN versions.
To ensure compatibility, run the following verification script:
#!/bin/bash # Verify CUDNN version and compatibility python3 << END import torch print(f"CUDNN version: {torch.backends.cudnn.version()}") # Add any CUDNN-specific functionality tests here ENDThis script will help confirm that the CUDNN version is correctly detected and basic functionality is maintained.
279-280
: Verify compatibility with updated TensorFlow and PyTorch versionsThe changes update the TensorFlow version to a release candidate (2.18.0rc2) and PyTorch to a newer stable version (2.5.0). While these updates align with the PR objectives, consider the following:
- Using a TensorFlow release candidate (2.18.0rc2) in a production environment may introduce instability. Ensure thorough testing is performed, especially for critical functionality.
- The PyTorch update to 2.5.0 is a significant version jump. While it likely brings performance improvements and new features, it may also introduce breaking changes.
To ensure compatibility, run the following verification script:
This script will help confirm that the new versions are correctly installed and basic functionality is maintained.
Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (4)
backend/find_pytorch.py (2)
111-111
: Add validation for CUDA_VERSION formatThe CUDA version is retrieved without format validation. Consider adding version format validation to prevent issues with malformed version strings.
- cuda_version = os.environ.get("CUDA_VERSION", "12.2") + cuda_version = os.environ.get("CUDA_VERSION", "12.2") + # Validate CUDA version format + try: + Version(cuda_version) + except ValueError: + raise RuntimeError(f"Invalid CUDA version format: {cuda_version}")
115-117
: Document CUDA-PyTorch compatibility matrixThe version mapping between CUDA and PyTorch versions should be documented to make maintenance easier.
Add a docstring section explaining the compatibility matrix:
def get_pt_requirement(pt_version: str = "") -> dict: """Get PyTorch requirement when PT is not installed. ... CUDA-PyTorch Compatibility Matrix: - CUDA 12.x (up to 12.6) -> PyTorch 2.5.0 - CUDA 11.x -> PyTorch 2.3.1 """backend/find_tensorflow.py (1)
Line range hint
95-101
: LGTM! Consider documenting version compatibility.The version constraints for CUDA 11.8 are well-defined. The explicit version 2.14.1 provides a stable fallback.
Consider adding a comment explaining why TensorFlow 2.14.1 was chosen as the specific version for CUDA 11.8 compatibility.
requires.extend( [ "tensorflow-cpu>=2.5.0rc0,<2.15; platform_machine=='x86_64' and platform_system == 'Linux'", ] ) + # TF 2.14.1 is the last known stable version for CUDA 11.8 tf_version = "2.14.1"
pyproject.toml (1)
135-135
: LGTM! Consider documenting tested versions.The removal of the version constraint for
nvidia-cudnn-cu12
aligns with the PR objective to bump CUDA support. While allowing any version provides flexibility, consider documenting the tested version range in the project documentation to help users troubleshoot compatibility issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- backend/find_pytorch.py (2 hunks)
- backend/find_tensorflow.py (1 hunks)
- pyproject.toml (1 hunks)
🔇 Additional comments (3)
backend/find_pytorch.py (2)
25-27
: LGTM: Clean import additionThe addition of
SpecifierSet
import is appropriate for version range comparisons.
110-119
: Verify CUDA version compatibilityLet's verify that these CUDA version mappings are consistent with PyTorch's official compatibility matrix.
✅ Verification successful
Let me try to search for any configuration or build files that might contain PyTorch version and CUDA compatibility information.
Let me check the specific file where the CUDA version mapping is defined to understand its context better.
After analyzing the code and context, I can now provide a final response:
CUDA version mappings are correctly aligned with PyTorch releases
The CUDA version mappings in the code are accurate according to PyTorch's official releases:
- PyTorch 2.5.0 is the correct version for CUDA 12.x (specifically 12.2)
- PyTorch 2.3.1 is the correct version for CUDA 11.x (specifically 11.8)
This is a build-time configuration used only during wheel building (CIBUILDWHEEL=1) to ensure correct PyTorch versions are matched with CUDA versions. The code properly handles the version selection and includes appropriate error handling for unsupported CUDA versions.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for any PyTorch compatibility documentation in the repository rg -i "cuda.*compatibility|pytorch.*cuda" README.md docs/Length of output: 105
Script:
#!/bin/bash # Search for PyTorch and CUDA version references in configuration and build files fd -e toml -e txt -e ini -e cfg -e yaml -e yml | xargs rg -i "pytorch|cuda" # Also search in setup files fd "setup\.py|requirements\.txt" | xargs rg -i "pytorch|cuda"Length of output: 8715
Script:
#!/bin/bash # Look at the entire find_pytorch.py file to understand the context cat backend/find_pytorch.py # Also search for any test files related to PyTorch version detection fd -e py | xargs rg -l "test.*find_pytorch|test.*cuda.*version"Length of output: 4683
backend/find_tensorflow.py (1)
88-91
: LGTM! Verify TensorFlow 2.18.0rc0 compatibility with CUDA 12.2.The version bump aligns with the PR objectives. However, let's verify that TensorFlow 2.18.0rc0 is officially supported with CUDA 12.2.
Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
This is prepared for the upcoming TF 2.18, which needs CUDNN 9. In the future, I may move all pinnings into pyproject.toml...
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores