Can't deploy on multi-node cluster #5
As far as I understand, shared storage is required because one pod downloads the models and another runs them. RWX storage is commonly NFS, which is slow and buggy. A quick and easy solution might be to make the model storage a DaemonSet.
Same problem here. Either the use of RWO should be prohibited, or the documentation should state that it only works within a single node. Using a shared RWO volume is a mistake from the start, since a Kubernetes cluster usually consists of several nodes. One solution could be to fetch the model from storage over the network (without a shared disk). Another is to store the model in the Model workload itself, without deploying a separate repository.
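The access-mode distinction these comments hinge on can be sketched as Kubernetes manifests (written here as Python dicts so the sketch is self-contained; all names, sizes, and the storage class are hypothetical examples, not values from this project):

```python
# ReadWriteOnce: the volume can be mounted read-write by a single node only.
# If the pod that downloads the model and the pod that serves it land on
# different nodes, the second mount fails with a multi-attach error.
rwo_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "model-store"},  # hypothetical name
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "50Gi"}},  # hypothetical size
    },
}

# ReadWriteMany: the volume can be mounted read-write by many nodes, so a
# downloader pod and a serving pod can share it across nodes. This requires
# a storage class whose backend supports RWX (e.g. NFS-backed, or EFS on AWS).
rwx_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "model-store-shared"},  # hypothetical name
    "spec": {
        "accessModes": ["ReadWriteMany"],
        "storageClassName": "nfs-client",  # hypothetical RWX-capable class
        "resources": {"requests": {"storage": "50Gi"}},
    },
}
```

The trade-off both commenters describe follows from this: RWO avoids the NFS dependency but pins all consumers to one node, while RWX works across nodes but depends on a shared filesystem backend.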
Actually, we have deployed to two clusters, one with 5 worker nodes (my cloud) and one with 3 worker nodes (my company team), and we never encountered this issue in our scenarios. TBH, I apologize for the delayed reply; I was in a stretch where I didn't have any opportunity to deploy to a multi-node environment such as AWS with AWS EBS. If any of you can reach a cluster with multiple nodes and a more advanced filesystem and storage class, perhaps we can work together to test this out.
A DaemonSet was an idea I considered before. I remember there were some problems with that approach, though I can't recall exactly what; I will reply in this thread if it comes back to me.
The problem is that models are always big, so we cannot treat them like images, where each node uses its own node storage to store and manage them. In most cases, users (or tenants, admins, operators, orgs) expect a universal storage cluster that holds all the models together to reduce cost. That said, Kubernetes workloads are namespaced (ns level) while storage is expected to be treated as clustered (cluster level); from a fundamental perspective, these two concepts are mutually exclusive.
Tweaking models, and use
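The namespace-vs-cluster scoping mismatch described above is visible directly in the API objects: PersistentVolumeClaims are namespaced, while PersistentVolumes and StorageClasses are cluster-scoped. A minimal sketch (all names and the provisioner are hypothetical):

```python
# A PVC lives inside a namespace, so each tenant requests storage separately.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "models", "namespace": "team-a"},  # namespaced object
}

# A StorageClass has no namespace field: it is a cluster-level resource that
# all namespaces draw from -- the "universal storage cluster" in the comment.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "shared-models"},  # cluster-scoped: no namespace
    "provisioner": "example.com/nfs",       # hypothetical provisioner
}

print("namespace" in pvc["metadata"], "namespace" in storage_class["metadata"])
# → True False
```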
After several months of experimenting with model serving, production server deployment, and architectural thinking, I want to propose a new parameter to specify the storage mode, so you folks can try it and find out which way suits you best:
Hello, I am having the same issue. I get a multi-attach error, which is expected because the PVC is created with ReadWriteOnce. If two different pods have to access a PVC (in this case the StatefulSet and the actual model deployment), the PVC access mode should be ReadWriteMany. But even when I try to deploy the model with that access mode, the StatefulSet still deploys a ReadWriteOnce PVC.
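The behavior this comment describes, where the StatefulSet deploys an RWO PVC regardless of what the model requests, is consistent with how StatefulSets provision storage: each replica's PVC comes from `volumeClaimTemplates`, so the access mode hard-coded there wins. A sketch of where that field lives (names and sizes are hypothetical; this is not this project's actual manifest):

```python
statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "model-store"},  # hypothetical name
    "spec": {
        "serviceName": "model-store",
        "replicas": 1,
        "selector": {"matchLabels": {"app": "model-store"}},
        "template": {},  # pod template elided
        "volumeClaimTemplates": [
            {
                "metadata": {"name": "models"},
                "spec": {
                    # This is the field that would need to become
                    # ["ReadWriteMany"] (plus an RWX-capable storageClassName)
                    # for cross-node sharing to work; a user-supplied access
                    # mode elsewhere cannot override it.
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": "50Gi"}},
                },
            }
        ],
    },
}
```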
When deploying on a multi-node cluster (EKS in my case but I guess it could be any other), there's a PVC clash between the model store and the model pod.
The model pod gets this error and so it cannot start: