Can't deploy on multi-node cluster #5

Open
mausch opened this issue Apr 26, 2024 · 7 comments
Labels: bug (Something isn't working)

Comments

mausch commented Apr 26, 2024

When deploying on a multi-node cluster (EKS in my case, but I guess it could be any other), there's a PVC clash between the model store and the model pod.
The model pod gets this error, so it cannot start:

Multi-Attach error for volume "pvc-63d894e9-1945-4ec7-988f-0fc6a08adc1a" Volume is already used by pod(s) ollama-models-store-0-x-ollama-operator-system-x-vcl-4497a69570
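
For context, this is the usual symptom of a ReadWriteOnce claim being mounted from two different nodes at once; roughly speaking, the situation looks like this (a hypothetical claim for illustration, not the one the operator actually creates):

```yaml
# Illustration only: a ReadWriteOnce claim can be attached to one node at a time,
# so a pod scheduled onto another node hits the Multi-Attach error above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models            # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce              # would need ReadWriteMany (and a storage class that
  resources:                     # supports it, e.g. EFS/CephFS/NFS) to be shared across nodes
    requests:
      storage: 100Gi
```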
@nekomeowww nekomeowww self-assigned this Apr 26, 2024
@nekomeowww nekomeowww added the bug Something isn't working label Apr 26, 2024

aep commented Jun 28, 2024

As far as I understand, the shared storage is required because one pod downloads the models and the other runs them.

RWX storage is commonly NFS, which is slow and buggy.

A quick and easy solution might be to make the model storage a DaemonSet and have the model pod contact the node-local one.
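
A rough sketch of what that could look like, purely as an illustration (the names, image, and hostPort wiring here are assumptions, not something the operator ships):

```yaml
# Sketch: one node-local model store per node, backed by hostPath instead of a shared PVC.
# Model pods on the same node would then reach it over the node's address (assumption).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ollama-models-store          # hypothetical name
spec:
  selector:
    matchLabels:
      app: ollama-models-store
  template:
    metadata:
      labels:
        app: ollama-models-store
    spec:
      containers:
        - name: ollama
          image: ollama/ollama
          ports:
            - containerPort: 11434
              hostPort: 11434        # exposes the store on each node
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          hostPath:
            path: /var/lib/ollama-models   # node-local storage, no RWX needed
            type: DirectoryOrCreate
```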

@ilyapaff

Same problem here.

Either the use of RWO should be prohibited here, or the documentation should state that this only works within a single node.

Using a shared RWO volume is a mistake to begin with, since a Kubernetes cluster usually consists of several nodes.

One solution may be to fetch the model from the storage over the network (without a shared disk).

Another solution is to store the model in the Model workload itself, without deploying a separate repository.
Each Model would have its own PVC, download the model into it on first launch, and keep the PVC after the Model CR is deleted (or not keep it; it's unclear why we need the cached model if we deleted it).
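
A minimal sketch of that second variant, assuming each Model is backed by its own StatefulSet (all names here are hypothetical):

```yaml
# Sketch: per-Model storage via a volumeClaimTemplate, so no volume is shared across nodes.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: model-phi                    # hypothetical per-Model workload
spec:
  serviceName: model-phi
  replicas: 1
  selector:
    matchLabels:
      app: model-phi
  template:
    metadata:
      labels:
        app: model-phi
    spec:
      containers:
        - name: ollama
          image: ollama/ollama
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: models
              mountPath: /root/.ollama   # model is pulled here on first launch
  volumeClaimTemplates:
    - metadata:
        name: models
      spec:
        accessModes:
          - ReadWriteOnce              # fine here: only this pod ever mounts the claim
        resources:
          requests:
            storage: 50Gi
```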

@nekomeowww (Owner)

Actually, we have deployed to two clusters, one with 5 worker nodes (my cloud) and one with 3 worker nodes (my company team), and we never encountered this issue in our scenarios.

TBH, I apologize for the delayed reply; I haven't had an opportunity to deploy to any multi-node environment such as AWS with AWS EBS.

If any of you have access to a cluster with multiple nodes and a more advanced filesystem and storage class, perhaps we can work together to test this out.

@nekomeowww (Owner)

A quick and easy solution might be to make the model storage a DaemonSet and have the model pod contact the node-local one.

A DaemonSet was an idea I considered before. I remember there were some problems with that approach, but I can't recall exactly why; I will reply in this thread if it comes back to me.

@nekomeowww (Owner)

One solution may be to fetch the model from the storage over the network (without a shared disk).

The problem is that models are always big, so we cannot treat them like images, where each node uses its own storage to store and manage them. In most cases, users (or tenants, admins, operators, orgs) expect a universal storage cluster that holds all the models together to reduce cost. That said, Kubernetes workloads are namespaced (namespace level), while storage is expected to be treated as clustered (cluster level); from a fundamental perspective, these two concepts are at odds.

Each Model would have its own PVC, download the model into it on first launch, and keep the PVC after the Model CR is deleted (or not keep it; it's unclear why we need the cached model if we deleted it).

Tweaking models, using ollama build, and experimenting with different prompts and default configuration parameters are the most common use cases. One use case on my side is running multiple test and eval instances together, so they can get the shared cached models from the StatefulSet instead of downloading them all over again. If we go with the per-Model PVC approach, then building will multiply the storage cost across hundreds of layers.

@nekomeowww (Owner)

After several months of experimenting with model serving and production server deployments, and a lot of architectural thinking:

I want to propose a new parameter to specify the storage mode, so you folks can try it out and find which way suits you best (a hypothetical sketch of the resulting spec follows the list):

  • A new parameter called cache will accept different enum values: Node, Namespace, None (and maybe Cluster if other filesystems and methods get integrated, but that requires advanced setup), where:
    • Node will create a DaemonSet at the node level, so each node gets its own PVC to cache and store the needed data; for Ollama Operator, I will try to calculate the needed routes so that user-namespace Pods can load the models from them.
    • Namespace is the current behavior: a StatefulSet with its own PVC, shared within a namespace.
    • None will not create any cache workloads; each time a model deploys, it downloads on its own, with a sidecar container as the server.
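
A hypothetical sketch of what a Model spec could look like with this parameter; the field name, enum values, and even the apiVersion here are assumptions for discussion, not a shipped API:

```yaml
# Hypothetical only: illustrating the proposed cache parameter on a Model resource.
apiVersion: ollama.ayaka.io/v1       # assumed CRD group/version
kind: Model
metadata:
  name: phi
spec:
  image: phi
  cache: Node                        # proposed enum: Node | Namespace | None (maybe Cluster later)
```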

@ezequielfalcon

Hello. I am having the same issue: I get a multi-attach error, which is expected because the PVC is being created with ReadWriteOnce. If two different pods have to access a PVC (in this case the StatefulSet and the actual model deployment), the PVC access mode should be ReadWriteMany. But even if I try to deploy the model with that access mode, the StatefulSet still creates a ReadWriteOnce PVC.
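
For anyone who wants to experiment on EKS, a ReadWriteMany-capable storage class would look roughly like the sketch below (assuming the AWS EFS CSI driver is installed and fs-xxxxxxxx is replaced with a real filesystem ID); the remaining gap is that the operator's StatefulSet would also have to request ReadWriteMany for this to help:

```yaml
# Sketch: an RWX-capable StorageClass on EKS via the AWS EFS CSI driver (assumes the driver is installed).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-rwx
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-xxxxxxxx          # replace with a real EFS filesystem ID
  directoryPerms: "700"
---
# A claim against it can then legitimately request ReadWriteMany.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-shared         # hypothetical name
spec:
  storageClassName: efs-rwx
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```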
