Skip to content

Conversation

@tyler-griggs
Copy link
Member

Summary

  • Add SKYRL_LOG_LEVEL environment variable to env_vars.py for centralized log level control
  • Suppress verbose Ray worker/raylet logs from stdout by default using log_to_driver=False
  • Suppress Ray C++ metrics exporter errors using RAY_BACKEND_LOG_LEVEL=fatal
  • Enable verbose logging when SKYRL_LOG_LEVEL=DEBUG

This improves the signal-to-noise ratio of training logs by hiding infrastructure logs and only showing training progress.

Test plan

  • Run training with default settings - verbose Ray logs should be suppressed
  • Run with SKYRL_LOG_LEVEL=DEBUG - verbose Ray logs should appear

🤖 Generated with Claude Code

- Add SKYRL_LOG_LEVEL to env_vars.py for centralized log level control
- Update configure_ray_worker_logging to use SKYRL_LOG_LEVEL
- Add log_to_driver=False to suppress worker/raylet log forwarding
- Set RAY_BACKEND_LOG_LEVEL=fatal to suppress C++ metrics errors
- Enable verbose logging when SKYRL_LOG_LEVEL=DEBUG

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new environment variable SKYRL_LOG_LEVEL to centralize control over logging levels and suppresses verbose Ray logs by default. This is a good change that improves the signal-to-noise ratio of training logs. The implementation is clean and follows the existing coding patterns. I've identified one potential issue regarding the lack of validation for the new environment variable, which could cause the application to crash if an invalid value is provided. I've included a suggestion to make the implementation more robust.

# Logging
# ─────────────────────────────────────────────────────────────────────────────

SKYRL_LOG_LEVEL = os.environ.get("SKYRL_LOG_LEVEL", "INFO").upper()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The SKYRL_LOG_LEVEL is read from the environment but not validated. If a user provides an invalid log level (e.g., SKYRL_LOG_LEVEL=TYPO), the application will crash later when loguru tries to use it in configure_ray_worker_logging. It's better to validate the input against standard logging levels and fall back to a default value with a warning to prevent the application from crashing.

Suggested change
SKYRL_LOG_LEVEL = os.environ.get("SKYRL_LOG_LEVEL", "INFO").upper()
import logging
_log_level_input = os.environ.get("SKYRL_LOG_LEVEL", "INFO").upper()
if _log_level_input not in logging._nameToLevel:
# Using print because logger is not yet configured.
print(
f"WARNING: Invalid SKYRL_LOG_LEVEL '{_log_level_input}'. "
f"Valid options are {list(logging._nameToLevel.keys())}. Defaulting to 'INFO'."
)
_log_level_input = "INFO"
SKYRL_LOG_LEVEL = _log_level_input

- Add SKYRL_LOG_DIR env var for log file location (default: /tmp/skyrl-logs)
- Create new logging module (skyrl_train/utils/logging.py) with:
  - setup_logging(): Routes all logs to file, training progress to stdout
  - configure_worker_logging(): Simplified worker logging setup
- Training progress loggers (trainer, fully_async_trainer, tracking, evaluate)
  go to stdout by default
- Infrastructure loggers (vllm, ray, workers, inference_engines, etc.)
  only go to log file by default
- Set SKYRL_LOG_LEVEL=DEBUG to show all logs on stdout
- Update main_base.py to call setup_logging at startup

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants