[WIP] [tx] Make it easy to run on a multi-node Ray cluster #955
Conversation
Signed-off-by: Future-Outlier <eric901201@gmail.com>
@Future-Outlier is attempting to deploy a commit to the Tyler's projects Team on Vercel. A member of the Team first needs to authorize it.
Code Review
This pull request introduces support for running on a multi-node Ray cluster by adding a RayProcessManager and integrating it into the JAX backend and the Tinker engine. The review identified several issues: a critical resource leak where Ray workers are never shut down gracefully, and a high-severity synchronization mechanism that is brittle and can cause startup failures. A security audit also found two medium-severity vulnerabilities: database credentials leaking into application logs (the API server logs the full command line, including the database URL and any password it contains) and an insecure Ray dashboard configuration (bound to 0.0.0.0, exposing it to the network). All of these issues should be addressed.
self._ray_process_manager = None
if hasattr(backend_config, "enable_ray") and backend_config.enable_ray:
    logger.info("Starting Ray worker processes for multi-node support...")
    self._ray_process_manager, coordinator_address = start_ray_workers(
        backend_config
    )
The RayProcessManager is created and stored in self._ray_process_manager, but its shutdown() method is never called. When the engine process is terminated (e.g., by the API server's lifespan manager), the Ray worker actors will be orphaned, leading to a significant resource leak on the cluster.
To fix this, a graceful shutdown mechanism should be implemented. I recommend the following (sketched below):
- Adding a shutdown() method to the TinkerEngine class that calls self._ray_process_manager.shutdown() if it exists.
- In engine.py's main() function, adding a signal handler for SIGTERM and SIGINT that calls the new engine.shutdown() method before exiting.
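A minimal sketch of what this could look like, reusing the _ray_process_manager attribute from this PR; the install_signal_handlers helper and the exact wiring in main() are illustrative assumptions, not the PR's actual code.

import signal
import sys

def shutdown(self) -> None:
    # Proposed TinkerEngine.shutdown(): release Ray worker actors so they are
    # not orphaned when the engine process exits.
    manager = getattr(self, "_ray_process_manager", None)
    if manager is not None:
        manager.shutdown()
        self._ray_process_manager = None

def install_signal_handlers(engine) -> None:
    # Call from engine.py's main() after the engine is constructed, so SIGTERM
    # and SIGINT trigger a clean Ray teardown before the process exits.
    def _handle_exit(signum, frame):
        engine.shutdown()
        sys.exit(0)

    signal.signal(signal.SIGTERM, _handle_exit)
    signal.signal(signal.SIGINT, _handle_exit)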
import time

time.sleep(2)

return manager, coordinator_address
The use of time.sleep(2) to wait for Ray workers to initialize is brittle and can lead to race conditions, especially on a loaded system where actor startup might take longer. This can cause non-deterministic startup failures if the workers are not ready when the coordinator tries to connect.
A more robust approach would be to actively poll the status of the worker actors until they all report as running, with a reasonable timeout. This ensures that the system waits just as long as necessary and provides a more reliable startup sequence.
import time

# Wait for all workers to report they are running
logger.info("Waiting for all Ray workers to start...")
start_time = time.time()
timeout_seconds = 60  # Consider making this configurable
while time.time() - start_time < timeout_seconds:
    try:
        statuses = ray.get([h.get_status.remote() for h in manager.worker_handles])
        if all(s.get("is_running") for s in statuses):
            logger.info("All Ray workers are running.")
            return manager, coordinator_address
    except Exception as e:
        logger.warning(f"Error checking worker status, will retry: {e}")
    time.sleep(1)

# If the loop finishes without returning, it's a timeout
manager.shutdown()
raise RuntimeError(f"Ray workers did not start within {timeout_seconds} seconds.")

logger.info(
    f"Started background engine with PID {background_engine.pid}: {' '.join(cmd)}"
)
The API server logs the full command line used to start the background engine. This command line includes all configuration parameters, including the database_url. If the database URL contains credentials (e.g., a password for a PostgreSQL or MySQL database), these credentials will be written to the application logs in plain text.
Remediation: Sanitize the command line arguments before logging them. Specifically, the database_url should be masked or excluded from the log message.
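One hedged sketch of such sanitization, assuming cmd is the argument list already built for the background engine and that logger and background_engine are the existing objects from the quoted code; the redact_cmd helper and its regex are illustrative:

import re

def redact_cmd(cmd: list[str]) -> str:
    # Mask user:password@ credentials embedded in URL-like arguments,
    # e.g. postgresql://user:secret@host/db -> postgresql://***@host/db
    return " ".join(re.sub(r"://[^/@\s]+@", "://***@", arg) for arg in cmd)

logger.info(
    f"Started background engine with PID {background_engine.pid}: {redact_cmd(cmd)}"
)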
ray.init(
    include_dashboard=True,
    dashboard_host="0.0.0.0",
    dashboard_port=8265,
)
When Ray support is enabled, the application initializes Ray with the dashboard bound to 0.0.0.0. This makes the Ray dashboard accessible from any machine on the network. The Ray dashboard can expose sensitive information about the cluster and, depending on the version and configuration, may allow for unauthorized task submission or code execution.
Remediation: Change the default dashboard_host to 127.0.0.1 to ensure it is only accessible locally. If remote access is required, it should be made configurable and the user should be warned about the security implications.
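A sketch of the safer default, wrapping the ray.init call shown above; the init_ray helper and how its arguments are surfaced in the backend config are illustrative assumptions, while include_dashboard, dashboard_host, and dashboard_port are real ray.init keyword arguments:

import logging
import ray

logger = logging.getLogger(__name__)

def init_ray(dashboard_host: str = "127.0.0.1", dashboard_port: int = 8265) -> None:
    # Default to loopback so the dashboard is only reachable locally; warn loudly
    # if the caller opts into a wider binding.
    if dashboard_host != "127.0.0.1":
        logger.warning(
            "Ray dashboard bound to %s; it can expose cluster details and may "
            "allow job submission to anyone who can reach the port.",
            dashboard_host,
        )
    ray.init(
        include_dashboard=True,
        dashboard_host=dashboard_host,
        dashboard_port=dashboard_port,
    )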
WIP: #935