fix: fix average training time for restart #4212
Conversation
Fix deepmodeling#4208. Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
📝 Walkthrough

The changes in this pull request focus on improving the training processes.
Assessment against linked issues
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
deepmd/tf/train/trainer.py (1)
`602-611`: **Enhanced average training time calculation**

The calculation of average training time has been improved to provide more accurate results. It now uses the `elapsed_batch` variable and excludes the first display interval from the calculation when there are enough batches. However, there is a small improvement that can be made to enhance readability.

Consider extracting the complex condition and calculation into a separate function for better readability. For example:

```python
def calculate_average_training_time(total_time, elapsed_batch, disp_freq):
    if elapsed_batch >= 2 * disp_freq:
        return total_time / (elapsed_batch - disp_freq)
    else:
        return total_time / elapsed_batch

# Then in the logging statement:
avg_time = calculate_average_training_time(
    total_train_time, elapsed_batch, self.disp_freq
)
log.info(
    "average training time: %.4f s/batch (exclude first %d batches)",
    avg_time,
    self.disp_freq if elapsed_batch >= 2 * self.disp_freq else 0,
)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- deepmd/pt/train/training.py (3 hunks)
- deepmd/tf/train/trainer.py (3 hunks)
🧰 Additional context used
🔇 Additional comments (5)
deepmd/tf/train/trainer.py (4)
`419-420`: **New variables introduced to track training progress**

The addition of `start_batch` and `elapsed_batch` variables improves the accuracy of tracking the training progress. This change allows for a more precise calculation of the elapsed batches, especially in scenarios where training is resumed from a checkpoint.
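The per-session bookkeeping described in this comment could look roughly like the following sketch; the class and attribute names here are hypothetical illustrations, not deepmd-kit's actual API:

```python
# Hedged sketch of batch counting across a restart; names are illustrative.
class TimingTracker:
    def __init__(self, start_batch: int, num_batch: int):
        # step restored from the checkpoint (0 for a fresh run)
        self.start_batch = start_batch
        # batches this session is expected to run, not the absolute step count
        self.elapsed_batch = num_batch - start_batch

    def batches_done(self, cur_batch: int) -> int:
        # batches actually executed in this session
        return cur_batch - self.start_batch


tracker = TimingTracker(start_batch=5000, num_batch=10000)
print(tracker.elapsed_batch)       # 5000 batches planned for this session
print(tracker.batches_done(5300))  # 300 batches run so far
```

Counting relative to `start_batch` is what keeps a restarted run from dividing this session's wall time by the absolute step number.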
`557-560`: **Improved logic for calculating total training time**

The updated condition for calculating the total training time is more accurate. It now excludes the first training interval (which might be inaccurate due to initialization overhead) when the training has progressed beyond the first display interval, or includes it if the total elapsed batches are less than twice the display frequency.
`617-618`: **Simplified average training time calculation for short training sessions**

The calculation for average training time when the total elapsed batches are less than twice the display frequency has been simplified. This provides a more straightforward representation of the average time for shorter training sessions.
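A toy calculation (illustrative numbers only, assuming the two-branch logic described in the comments above) shows why the short-session branch exists: with fewer than two display intervals there is no warm-up interval to spare, so every batch stays in the average.

```python
def average_training_time(total_time, elapsed_batch, disp_freq):
    # Sketch of the reviewed logic: drop the first display interval from
    # the denominator only when at least two intervals have elapsed.
    if elapsed_batch >= 2 * disp_freq:
        return total_time / (elapsed_batch - disp_freq)
    return total_time / elapsed_batch


# Long run: timing starts after the first (slow) interval, so total_time
# covers 900 steady-state batches at 0.1 s each.
print(average_training_time(90.0, 1000, 100))  # 0.1 s/batch
# Short run (150 < 2 * 100): every batch is kept in the average.
print(average_training_time(30.0, 150, 100))   # 0.2 s/batch
```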
Line range hint `419-618`: **Summary of improvements to training time calculation and progress tracking**

The changes in this file significantly improve the accuracy of training time calculations and progress tracking, addressing issue #4208. Key improvements include:

- Introduction of `start_batch` and `elapsed_batch` variables for better batch counting.
- Updated logic for calculating total training time, excluding potential inaccuracies from the first training interval.
- Enhanced average training time calculation, providing more accurate results for both long and short training sessions.

These changes should resolve the reported issue of incorrect average training times, especially in scenarios involving training restarts. The implementation now provides a more reliable representation of the training progress and performance.
deepmd/pt/train/training.py (1)
`892-894`: **Verify the logic for updating `total_train_time`.**

The condition used to update `self.total_train_time` has been modified. Please ensure that this logic correctly accounts for cases when training is restarted, and that it accurately reflects the intended batches for timing calculations.
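One way to read the guarded accumulation being asked about, sketched under the assumption that the PyTorch trainer mirrors the TF-side logic (function and parameter names here are illustrative, not the actual code):

```python
def accumulate_timing(batch_idx, start_batch, disp_freq, elapsed_batch):
    # Assumption: wall time is added to total_train_time either once the
    # session has passed its first display interval (warm-up excluded),
    # or unconditionally when the session is too short to spare one.
    past_first_interval = (batch_idx - start_batch) >= disp_freq
    short_session = elapsed_batch < 2 * disp_freq
    return past_first_interval or short_session


print(accumulate_timing(50, 0, 100, 1000))   # False: still warming up
print(accumulate_timing(250, 0, 100, 1000))  # True: past first interval
print(accumulate_timing(50, 0, 100, 150))    # True: short session
```

Measuring `batch_idx` against `start_batch` rather than zero is the part that matters for restarts: a resumed run at step 5000 still gets its own warm-up interval excluded.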
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##            devel    #4212      +/-   ##
==========================================
- Coverage   83.50%   83.49%   -0.01%
==========================================
  Files         541      541
  Lines       52459    52463       +4
  Branches     3047     3047
==========================================
+ Hits        43804    43805       +1
  Misses       7710     7710
- Partials      945      948       +3
```

☔ View full report in Codecov by Sentry.
Fix #4208.
Summary by CodeRabbit

- **New Features**
- **Bug Fixes**
- **Chores**