[Docs] [istio mtls] Add warning on sidecar OOM for mTLS #53385
Conversation
We just had a production outage at Roblox due to a large number of headless services created for Ray overwhelming the service mesh sidecars and causing ingress gateways to fail. Another thing we observed is that mTLS slows down some jobs significantly. So we ended up using Istio with interception mode "none" to only proxy port 8265 to expose the head node securely, and leave the head-worker gRPC connections unencrypted. But I wasn't sure it's a common enough issue to mention in the docs.

Signed-off-by: Steve Han <36038610+han-steve@users.noreply.github.com>
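A rough sketch of that workaround, assuming a standard Istio install and KubeRay's `ray.io/node-type` label; the namespace, resource names, and exact ingress wiring are illustrative and assume the head binds the dashboard to localhost:

```yaml
# Pod annotation on the Ray head template: skip iptables capture entirely,
# so head-worker gRPC traffic bypasses the sidecar (and stays unencrypted).
metadata:
  annotations:
    sidecar.istio.io/interceptionMode: NONE
---
# Istio Sidecar resource: make Envoy explicitly listen on the dashboard
# port (8265) and forward to the app on localhost, so only that port
# goes through the mesh with mTLS.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: ray-head                  # illustrative name
  namespace: ray-workloads        # illustrative namespace
spec:
  workloadSelector:
    labels:
      ray.io/node-type: head
  ingress:
    - port:
        number: 8265
        name: http-dashboard
        protocol: HTTP
      # Assumes the dashboard listens on 127.0.0.1 inside the pod.
      defaultEndpoint: 127.0.0.1:8265
      captureMode: NONE
```

With `interceptionMode: NONE` there is no iptables redirection, so only the explicitly configured listener (8265 here) passes through Envoy; everything else, including head-worker gRPC, bypasses the mesh.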
The suggested change appends a sentence to the `:::{warning}` block in the Istio guide:

> The default Ray worker port range, from 10002 to 19999, is too large to specify in the service manifest and can cause memory issues in Kubernetes. Set a smaller `max-worker-port` to work with Istio. Note that by default these ports will be cached in every sidecar in the service mesh, which could lead to sidecar OOMs if too many headless services are created.
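As context for the warning, here's a minimal sketch of capping the worker port range through KubeRay's `rayStartParams` (the cluster name, image tag, and the exact 200-port range are illustrative, not from this PR):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-istio            # illustrative name
spec:
  headGroupSpec:
    rayStartParams:
      # Shrink the worker port range from the default 10002-19999
      # (~10,000 ports) down to 200 ports: 10002-10201 inclusive.
      # Apply the same range in each workerGroupSpec's rayStartParams.
      min-worker-port: "10002"
      max-worker-port: "10201"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.41.0   # illustrative image tag
```

Every port in the range has to be enumerated in the headless service for Istio to match listeners against it, which is why the ~10,000-port default is too large for the service manifest.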
> We just had a production outage at Roblox due to a large number of headless services created for Ray overwhelming the service mesh sidecars and causing ingress gateways to fail.
Just out of curiosity, how many ports did you set?
Have you installed https://docs.ray.io/en/latest/ray-contribute/docs.html#how-do-you-run-vale?
We set 200 ports to still allow 200 workers per node in the Ray cluster. However, people queued up hundreds of jobs and caused every infra pod with a sidecar to OOM.
Oh, does that mean the memory usage is positively correlated with 200 times the total number of worker pods in the Kubernetes cluster?
Yeah, for each headless service, we add 200 entries to ALL sidecars in the cluster, unless there's a Sidecar CRD to restrict the caching. However, the Sidecar CRD doesn't apply to the ingress gateway, which might still OOM.
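For reference, a hedged sketch of the kind of Istio `Sidecar` resource alluded to here, which limits the config Istiod pushes to sidecars in one namespace (the namespace and host list are illustrative); as noted, it doesn't protect the ingress gateway:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: ray-workloads        # illustrative namespace
spec:
  egress:
    - hosts:
        # Only push config for services in this namespace plus
        # istio-system, instead of caching every headless-service
        # port mesh-wide.
        - "./*"
        - "istio-system/*"
```

Scoping egress to `./*` means a sidecar only receives config for its own namespace, so new headless services elsewhere no longer fan out to every proxy in the mesh.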
kevin85421 left a comment:
I will ask the Ray doc team to review this PR. Would you mind installing Vale?
Yeah, these are the Vale outputs:

Not sure if these are actionable. The doc has a lot more errors/suggestions beyond this line.
Our current policy is that newly added lines should follow the Vale rules. You don't need to fix existing ones.
angelinalg left a comment:
Small nit to clarify the sentence.
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Steve Han <36038610+han-steve@users.noreply.github.com>
@han-steve I have enabled

Yeah, looks like everything passed!

It looks like CI is still running. Btw, it's not necessary to sync with the master branch if there is no conflict.

The doc build CI was stuck forever. I updated this branch to trigger the CI again.

Please ping me when all CI tests pass.

Looks like the CI is stuck again 🥲

The docs build didn't finish; re-triggered it to be sure.