Conversation

@han-steve
Contributor

We just had a production outage at Roblox due to a large number of headless services created for Ray overwhelming the service mesh sidecars and causing ingress gateways to fail.

Another thing we observed is that mTLS slows down some jobs significantly. So we ended up using Istio with interception mode "none" to proxy only port 8265, exposing the head node securely while leaving the head-to-worker gRPC connections unencrypted. But I wasn't sure it's a common enough issue to mention in the docs.
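
Roughly, that setup looks like the following. This is only a sketch, assuming the standard Istio `Sidecar` API and the KubeRay `ray.io/node-type: head` label rather than our exact manifests: iptables interception is disabled on the head pod, and a single ingress listener makes the sidecar proxy only the dashboard port.

```yaml
# Sketch: proxy only the Ray dashboard port (8265) through the sidecar and
# let head <-> worker gRPC traffic bypass Istio (unencrypted).
# Assumes the dashboard is bound to localhost (e.g. --dashboard-host=127.0.0.1)
# so Envoy can own the pod-IP listener on 8265.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: ray-head-dashboard-only
  namespace: ray                       # placeholder namespace
spec:
  workloadSelector:
    labels:
      ray.io/node-type: head
  ingress:
  - port:
      number: 8265
      name: http-dashboard
      protocol: HTTP
    captureMode: NONE                  # no iptables redirect for this listener
    defaultEndpoint: 127.0.0.1:8265    # forward to the dashboard on loopback
```

The head pod template additionally carries the `sidecar.istio.io/interceptionMode: NONE` annotation so the proxy intercepts nothing by default.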

Signed-off-by: Steve Han <36038610+han-steve@users.noreply.github.com>
@han-steve han-steve requested review from a team, kevin85421 and pcmoritz as code owners May 28, 2025 23:09

:::{warning}
The default Ray worker port range, from 10002 to 19999, is too large to specify in the service manifest and can cause memory issues in Kubernetes. Set a smaller `max-worker-port` to work with Istio. Note that by default these ports will be cached in every sidecar in the service mesh, which could lead to sidecar OOMs if too many headless services are created.
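
For example, a `RayCluster` manifest might pin the range to a couple hundred ports, sized to the maximum number of Ray workers expected per pod. The following is only a sketch; field names follow the KubeRay `rayStartParams` convention, and the exact range is illustrative.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-istio
spec:
  headGroupSpec:
    rayStartParams:
      # 200 ports instead of the default ~10,000, so the headless
      # service (and every sidecar that caches it) stays small.
      min-worker-port: "10002"
      max-worker-port: "10201"
    # ... pod template omitted
  workerGroupSpecs:
  - groupName: worker
    replicas: 2
    rayStartParams:
      min-worker-port: "10002"
      max-worker-port: "10201"
    # ... pod template omitted
```
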
Member

> We just had a production outage at Roblox due to a large number of headless services created for Ray overwhelming the service mesh sidecars and causing ingress gateways to fail.

Just out of curiosity, how many ports did you set?

Contributor Author

We set 200 ports to still allow 200 workers per node in the Ray cluster. However, people queued up hundreds of jobs and caused every infra pod with a sidecar to OOM.

Member

Oh, does that mean the memory usage is positively correlated with 200 times the total number of worker pods in the Kubernetes cluster?

Contributor Author

Yeah, for each headless svc, Istio adds 200 entries to ALL sidecars in the cluster, unless there's a Sidecar CRD to restrict the caching. However, the Sidecar CRD doesn't apply to the ingress gateway, which might still OOM.
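
Back-of-the-envelope: with ~200 ports per headless service and N Ray clusters queued up, every proxy in the mesh holds on the order of 200 × N extra endpoint entries, so a hundred queued clusters already means roughly 20,000 entries per sidecar. For reference, a namespace-wide Istio `Sidecar` resource is the usual way to restrict that caching; below is a sketch with placeholder namespace names, and as noted it doesn't help the ingress gateways.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: ray-jobs            # placeholder: namespace running the Ray clusters
spec:
  egress:
  - hosts:
    - "./*"                      # only cache services from the same namespace
    - "istio-system/*"           # plus the control plane / shared gateways
```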

Member

@kevin85421 kevin85421 left a comment

I will ask the Ray doc team to review this PR. Would you mind installing Vale?

@han-steve
Contributor Author

Yeah, these are the Vale outputs:

168:233 error Avoid using 'will'. Google.Will
168:238 suggestion In general, use active voice instead of passive voice ('be cached'). Google.Passive
168:314 error Did you really mean 'OOMs'? Vale.Spelling
168:349 suggestion In general, use active voice instead of passive voice ('are created'). Google.Passive

Not sure if these are actionable. The doc has a lot more errors/suggestions beyond this line.

@kevin85421
Member

> Not sure if these are actionable. The doc has a lot more errors/suggestions beyond this line.

Our current policy is that newly added lines should follow the Vale rules. You don’t need to fix existing ones.

Contributor

@angelinalg angelinalg left a comment

Small nit to clarify the sentence.

Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Steve Han <36038610+han-steve@users.noreply.github.com>
@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Jun 3, 2025
@kevin85421
Member

@han-steve I have enabled premerge. Please ping me when all CI tests pass! Thanks!

@han-steve
Contributor Author

yeah looks like everything passed!

@kevin85421
Member

It looks like CI is still running. Btw, it's not necessary to sync with the master branch if there is no conflict.

@kevin85421
Member

The doc build CI was stuck forever. I updated this branch to trigger the CI again.

@kevin85421
Member

Please ping me when all CI tests pass.

@han-steve
Contributor Author

Looks like the CI is stuck again 🥲

@kevin85421
Member

cc @jjyao @edoakes, is it OK to merge this directly? I think this change is safe to merge, and the CI issue is unrelated to the change.

@edoakes
Collaborator

edoakes commented Jun 11, 2025

The docs build didn't finish; re-triggered it to be sure.

@edoakes edoakes enabled auto-merge (squash) June 11, 2025 15:20
@edoakes edoakes merged commit 94a466d into ray-project:master Jun 11, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025