Conversation

@han-steve
Contributor

We just had a production outage at Roblox due to a large number of headless services created for Ray overwhelming the service mesh sidecars and causing ingress gateways to fail.

Another thing we observed is that mTLS slows down some jobs significantly. So we ended up using Istio with interception mode "none" to proxy only port 8265, exposing the head node securely while leaving the head-to-worker gRPC connections unencrypted. But I wasn't sure it's a common enough issue to mention in the docs.
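
Roughly, that setup looks like the following. This is only a sketch, assuming the standard Istio `Sidecar` API and the KubeRay `ray.io/node-type: head` label rather than our exact manifests: iptables interception is disabled on the head pod, and a single ingress listener makes the sidecar proxy only the dashboard port.

```yaml
# Sketch: proxy only the Ray dashboard port (8265) through the sidecar and
# let head <-> worker gRPC traffic bypass Istio (unencrypted).
# Assumes the dashboard is bound to localhost (e.g. --dashboard-host=127.0.0.1)
# so Envoy can own the pod-IP listener on 8265.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: ray-head-dashboard-only
  namespace: ray                       # placeholder namespace
spec:
  workloadSelector:
    labels:
      ray.io/node-type: head
  ingress:
  - port:
      number: 8265
      name: http-dashboard
      protocol: HTTP
    captureMode: NONE                  # no iptables redirect for this listener
    defaultEndpoint: 127.0.0.1:8265    # forward to the dashboard on loopback
```

The head pod template additionally carries the `sidecar.istio.io/interceptionMode: NONE` annotation so the proxy intercepts nothing by default.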

Signed-off-by: Steve Han <36038610+han-steve@users.noreply.github.com>
@han-steve han-steve requested review from a team, kevin85421 and pcmoritz as code owners May 28, 2025 23:09

:::{warning}
The default Ray worker port range, from 10002 to 19999, is too large to specify in the service manifest and can cause memory issues in Kubernetes. Set a smaller `max-worker-port` to work with Istio. Note that by default these ports will be cached in every sidecar in the service mesh, which could lead to sidecar OOMs if too many headless services are created.
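
For example, a `RayCluster` manifest might pin the range to a couple hundred ports, sized to the maximum number of Ray workers expected per pod. The following is only a sketch; field names follow the KubeRay `rayStartParams` convention, and the exact range is illustrative.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-istio
spec:
  headGroupSpec:
    rayStartParams:
      # 200 ports instead of the default ~10,000, so the headless
      # service (and every sidecar that caches it) stays small.
      min-worker-port: "10002"
      max-worker-port: "10201"
    # ... pod template omitted
  workerGroupSpecs:
  - groupName: worker
    replicas: 2
    rayStartParams:
      min-worker-port: "10002"
      max-worker-port: "10201"
    # ... pod template omitted
```
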
Member

> We just had a production outage at Roblox due to a large number of headless services created for Ray overwhelming the service mesh sidecars and causing ingress gateways to fail.

Just out of curiosity, how many ports did you set?

Contributor Author

We set 200 ports to still allow 200 workers per node in the Ray cluster. However, people queued up hundreds of jobs and caused every infra pod with a sidecar to OOM.

Member

Oh, does that mean the memory usage is positively correlated with 200 times the total number of worker pods in the Kubernetes cluster?

Contributor Author

Yeah, for each headless svc, Istio adds 200 entries to ALL sidecars in the cluster, unless there's a Sidecar CRD to restrict the caching. However, the Sidecar CRD doesn't apply to the ingress gateway, which might still OOM.
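
Back-of-the-envelope: with ~200 ports per headless service and N Ray clusters queued up, every proxy in the mesh holds on the order of 200 × N extra endpoint entries, so a hundred queued clusters already means roughly 20,000 entries per sidecar. For reference, a namespace-wide Istio `Sidecar` resource is the usual way to restrict that caching; below is a sketch with placeholder namespace names, and as noted it doesn't help the ingress gateways.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: ray-jobs            # placeholder: namespace running the Ray clusters
spec:
  egress:
  - hosts:
    - "./*"                      # only cache services from the same namespace
    - "istio-system/*"           # plus the control plane / shared gateways
```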

Member

@kevin85421 kevin85421 left a comment

I will ask the Ray doc team to review this PR. Would you mind installing Vale?

@han-steve
Contributor Author

Yeah, these are the Vale outputs:

168:233 error Avoid using 'will'. Google.Will
168:238 suggestion In general, use active voice instead of passive voice ('be cached'). Google.Passive
168:314 error Did you really mean 'OOMs'? Vale.Spelling
168:349 suggestion In general, use active voice instead of passive voice ('are created'). Google.Passive

Not sure if these are actionable. The doc has a lot more errors/suggestions beyond this line.

@kevin85421
Member

> Not sure if these are actionable. The doc has a lot more errors/suggestions beyond this line.

Our current policy is that newly added lines should follow the Vale rules. You don’t need to fix existing ones.

Contributor

@angelinalg angelinalg left a comment

Small nit to clarify the sentence.

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Steve Han <36038610+han-steve@users.noreply.github.com>
@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Jun 3, 2025
@kevin85421
Member

@han-steve I have enabled premerge. Please ping me when all CI tests pass! Thanks!

@han-steve
Contributor Author

yeah looks like everything passed!

@kevin85421
Member

It looks like CI is still running. Btw, it's not necessary to sync with the master branch if there is no conflict.

@kevin85421
Member

The doc build CI was stuck forever. I updated this branch to trigger the CI again.

@kevin85421
Member

Please ping me when all CI tests pass.

@han-steve
Contributor Author

Looks like the CI is stuck again 🥲

@kevin85421
Member

cc @jjyao @edoakes, is it OK to merge this directly? I think this change is safe to merge, and the CI issue is unrelated to the change.

@edoakes
Collaborator

edoakes commented Jun 11, 2025

The docs build didn't finish; re-triggered it to be sure.

@edoakes edoakes enabled auto-merge (squash) June 11, 2025 15:20
@edoakes edoakes merged commit 94a466d into ray-project:master Jun 11, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025