
Using plasma on Kubernetes #1315

Closed
remram44 opened this issue Dec 12, 2017 · 39 comments

@remram44

I am trying to use plasma to store datasets in memory, and share them between pods.

I find that this does not work well; in particular, plasma.get/plasma.put tends to hang with no specific error message.

I am sure other people have tried this setup; I would love to hear about their experience.

The setup is:

  • Running plasma_store and clients in different pods on the same node
  • The socket is placed in one R/W volume mounted in all pods
  • Tried plasma_store with both shm and hugepages (-h). With shm, get() hangs on objects submitted by another client; with hugepages, the store complains about missing huge pages when mmapping

Note that I could get this running using Docker containers just fine. I understand that some of those issues are due to Kubernetes more than plasma, but I would love some pointers.

cc @mitar

@pcmoritz
Contributor

Hey @remram44, thanks for bringing this up! Do you have Kubernetes scripts and instructions for setting this up on EC2 so we can reproduce the issue? Any pointers are welcome.

@robertnishihara
Collaborator

@remram44 how did you get it working between Docker containers? Did you have to do anything special?

@remram44
Author

On Docker, I didn't have to do anything; I ran with native Docker on macOS. However, trying this again, it seems to work only if I don't pass an explicit ObjectID to put(); otherwise get() hangs.

Server:

docker run -ti --rm --name plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow plasma_store -s /mnt/socket/plasma -m 10000000

Sender:

docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.put("hello, world").binary())'
b'\x10\x85\x1b\xc6\xe3\xc6\x9f\x8d\x13\x1e\xa7\xda\xf3\xd9\xf0\x0cZ\xf1\xd7/'

Getter:

docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.get(plasma.ObjectID(b"\x10\x85\x1b\xc6\xe3\xc6\x9f\x8d\x13\x1e\xa7\xda\xf3\xd9\xf0\x0cZ\xf1\xd7/")))'
hello, world

Explicit sender:

docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); client.put("hello, world", plasma.ObjectID(b"testidhere"))'

Getter:

docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.get(plasma.ObjectID(b"testidhere")))'
<hangs>

@remram44
Author

I ran this on Kubernetes on Google Cloud with this configuration:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasmaserver
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasmaserver
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'plasma_store -s /mnt/socket/plasma -m 10000000']
# or
#        command: ['/bin/sh', '-c', 'plasma_store -s /mnt/socket/plasma -m 10000000 -d /mnt/hugepages -h']
        volumeMounts:
        - mountPath: /mnt/socket
          name: socket
        - mountPath: /mnt/hugepages
          name: hugepages
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
      - name: hugepages
        persistentVolumeClaim:
          claimName: hugepagesvc
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasma1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasma1
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'while true; do sleep 30; done']
        volumeMounts:
        - mountPath: /mnt
          name: socket
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasma2
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasma2
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'while true; do sleep 30; done']
        volumeMounts:
        - mountPath: /mnt
          name: socket
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: plasmasocketv
  labels:
    thing: plasmasocket
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/var/plasma-rr4"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: plasmasocketvc
spec:
  storageClassName: ""
  selector:
    matchLabels:
      thing: plasmasocket
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: hugepagesv
  labels:
    thing: hugepages
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/var/hugepages"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: hugepagesvc
spec:
  storageClassName: ""
  selector:
    matchLabels:
      thing: hugepages
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi

Then I ran commands on plasma1 and plasma2 using kubectl exec.

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

@remram44 Thanks! The hanging you are seeing is unrelated to using docker. It hangs because ObjectIDs need to be exactly 20 bytes long. So even without docker, this hangs:

In [5]: client.put("hello", plasma.ObjectID(b"hi"))
Out[5]: ObjectID(68690000537f0000300000000000000091010000)

In [6]: client.get(plasma.ObjectID(b"hi"))

Whereas this works:

In [3]: client.put("hello", plasma.ObjectID(20 * b"h"))
Out[3]: ObjectID(6868686868686868686868686868686868686868)

In [4]: client.get(plasma.ObjectID(20*b"h"))
Out[4]: 'hello'

Can you check whether this also fixes the problem on Kubernetes?
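(As a sketch, one way to always get a valid 20-byte ID is to derive it from a hash; sha1 digests happen to be exactly 20 bytes. The object_id_for helper below is just illustrative:)

import hashlib
import pyarrow.plasma as plasma

def object_id_for(name):
    # sha1 digests are exactly 20 bytes, which is what plasma.ObjectID expects
    return plasma.ObjectID(hashlib.sha1(name.encode()).digest())

client = plasma.connect("/mnt/socket/plasma", "", 0)
client.put("hello, world", object_id_for("my-dataset"))
print(client.get(object_id_for("my-dataset")))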

@remram44
Author

This probably should be raising a ValueError 😅 but I agree that it's a separate problem. I'll try again with valid IDs.

@mitar
Member

mitar commented Dec 13, 2017

I am surprised that your Docker example even works. The Plasma store uses /dev/shm by default to share objects on Linux, but /dev/shm is not shared between Docker containers, so your server and client do not have the same /dev/shm. I am not sure how the communication works here.

@mitar
Member

mitar commented Dec 13, 2017

Hanging on an invalid ObjectID is really surprising. :-)

(It is interesting that GitHub colors the invalid ObjectID with a red background?)

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

I don't know why it is red :)

I agree it is not good behaviour and should give an error. I submitted a JIRA ticket here and will fix it ASAP: https://issues.apache.org/jira/browse/ARROW-1919

Thanks for finding the problem!

@mitar
Member

mitar commented Dec 13, 2017

@pcmoritz: Do you understand why sharing works between containers even if /dev/shm is not shared?

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

I do not understand it and have not tried it, but it seems to be possible to share memory between docker containers in general; see https://stackoverflow.com/questions/29173193/shared-memory-with-docker-containers-docker-version-1-4-1

@mitar
Member

mitar commented Dec 13, 2017

It seems we would have to use the --ipc argument, but the example above does not. This is why I am confused. @remram44, which Docker version are you using? If you go into two docker containers and create a file in /dev/shm in one, does it appear in the other container?

Also, @pcmoritz, is /dev/shm being used by the Plasma store, or is memory sharing done in some other way?

@pcmoritz
Contributor

By default it uses /dev/shm on Linux and /tmp/ on macOS, and it can be configured to use another location with the -d flag.

@mitar
Member

mitar commented Dec 13, 2017

What does it store there? Does it store whole objects and then mmap them? Because /tmp is not in memory on macOS.

@pcmoritz
Contributor

We had the same suspicion and did performance experiments; it behaves very much like it is in memory. We actually unlink the file before writing anything, so maybe that prevents flushing to disk. This is the same strategy Google Chrome uses for its shared memory.

@mitar
Member

mitar commented Dec 13, 2017

Do both containers have to have access to the same /dev/shm, or do you send a file descriptor over the socket? Does /dev/shm have to be large (larger than the -m parameter)?

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

The file descriptor is sent over the socket. That's a good point; that's probably what makes it work. And yes, /dev/shm needs to be larger than the -m parameter, otherwise an error is raised; see https://github.com/apache/arrow/blob/master/cpp/src/plasma/store.cc#L810.
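For the curious, a minimal self-contained sketch of file descriptor passing over a Unix socket (just an illustration of the mechanism, not the actual plasma code):

import array
import os
import socket

# a connected pair of Unix sockets standing in for the store/client connection
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

fd = os.open("/dev/shm/fd-demo", os.O_CREAT | os.O_RDWR, 0o600)
# send the descriptor as SCM_RIGHTS ancillary data (at least one byte of payload is required)
sender.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))])

msg, ancdata, flags, addr = receiver.recvmsg(1, socket.CMSG_SPACE(array.array("i").itemsize))
level, ctype, data = ancdata[0]
fds = array.array("i")
fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
print(fds[0])  # a new descriptor in the receiver referring to the same open file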

@mitar
Member

mitar commented Dec 13, 2017

OK, the value above is 10000000, which is around 10 MB, less than Docker's default 64 MB /dev/shm.

Yes, I would also suspect so. So I would assume the object is stored in the /dev/shm of the Docker container that created it, and the others just access it through the file descriptor. I think we should test what happens if the container that created the object is stopped, and who is responsible for cleaning up the file descriptors.

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

The beauty here is that the OS does refcounting on the file descriptors and releases the resources when the last reference goes away. That's why we went through the pain of making the file descriptor sending work and unlinking the original file; the combination of these makes sure there is no garbage left behind.

Not sure what happens in the docker container case, however: does the host OS do the refcounting in that case and everything magically works? I don't know, but I suspect so. Let me know if you plan to look into this!
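To make the unlinking point concrete, a tiny standalone sketch (the file name is arbitrary):

import mmap
import os

fd = os.open("/dev/shm/plasma-unlink-demo", os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, 4096)
os.unlink("/dev/shm/plasma-unlink-demo")  # the name is gone from the filesystem...
buf = mmap.mmap(fd, 4096)
buf[:5] = b"hello"                        # ...but the memory stays usable through the fd/mapping
# the kernel releases the memory once the last descriptor/mapping referring to it is closed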

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

Disregarding the details, I'm extremely happy to learn that it works in the case of multiple docker containers. That's really great :)

So in the future we could use docker to get isolation between workers! And if the things stored in the object store are not pickled and use arrow data instead (pickle could be deactivated), it might even be possible to get some level of security from this if you trust docker's isolation.

@remram44
Author

An issue right now is that Kubernetes doesn't have an equivalent to --shm-size just yet (kubernetes-28272).

@remram44
Author

remram44 commented Dec 13, 2017

Ok, so again running on GKE, I could get plasma to run just fine with shm (staying under the default Docker size of 64MB, and using 20-byte object IDs), but no luck with hugepages. Support seems to be upcoming (alpha in 1.8; see here).

Can I use the -d option without -h to specify an alternate location instead of /dev/shm, so that I can provide a bigger shm as a volume?

@remram44
Author

Mounting a bigger shm from the host, either as /dev/shm in the container or somewhere else with -d pointing to it, allows me to use a bigger -m value than 64MB (as per this openshift workaround).

So I guess plasma is usable on Docker and Kubernetes after all, just not with hugepages?

Allowing the Plasma store to use up to 0.01GB of memory.
Starting object store with directory /mnt/hugepages and huge page support enabled
mmap failed with error: Cannot allocate memory
  (this probably means you have to increase /proc/sys/vm/nr_hugepages)
mmap failed with error: Cannot allocate memory
  (this probably means you have to increase /proc/sys/vm/nr_hugepages)
...
mmap failed with error: Cannot allocate memory
  (this probably means you have to increase /proc/sys/vm/nr_hugepages)
There is not enough space to create this object, so evicting 0 objects to free up 0 bytes.
Disconnecting client on fd 5

@atumanov
Contributor

@remram44, in the log above, your plasma store is starting with 0.01GB, i.e. 10MB. Hugepages in the plasma store only start working with a minimum memory allocation of 1GB:
https://github.com/apache/arrow/blob/master/cpp/src/plasma/store.cc#L820

@atumanov
Contributor

@remram44, if you are sure you are dealing with 2MB hugepages, you could try overriding that 1GB default with, say, 10MB instead, to fit your memory configuration.
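As a rough sketch of the arithmetic (assuming 2MB pages; in practice you would reserve a few extra pages):

import math

plasma_memory = 10 * 1024 * 1024   # the -m value you want, in bytes
hugepage_size = 2 * 1024 * 1024    # 2MB pages; use the 1GB page size if that is what the node provides
print(math.ceil(plasma_memory / hugepage_size))  # a candidate value for /proc/sys/vm/nr_hugepages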

@remram44
Author

You mean that plasma doesn't work with hugepages if the -m value is below 1GB?

@atumanov
Contributor

Yes, I believe that's correct, but it's a one-line change. I think we could log an error message on startup if the specified -m value is < 1GB when -h is also specified. We might have decided against it because 1GB is not fundamental; it's a safe default that works for both 2MB and 1GB pages. With 2MB hugepages being the more popular/widespread option, that default can be changed. We felt that 1GB would be a more robust out-of-the-box default when the hugepage size on the target platform is unknown.

@remram44
Author

Same error when running with -m 2000000000 -h, unfortunately.

@atumanov
Contributor

@remram44, did you set up the mount point inside the docker containers to be backed by hugetlbfs? In case you haven't gone through the process of setting up the mount point, here's the link:
http://ray.readthedocs.io/en/latest/plasma-object-store.html
Things to check:

  • Is the directory specified with -d visible inside the container and backed by hugetlbfs? You should be able to touch files in there.
  • What's the gid of the plasma store process? Does it match cat /proc/sys/vm/hugetlb_shm_group?
  • What's the number of huge pages allocated? What's the output of cat /proc/sys/vm/nr_hugepages?

All of this -- inside the docker container running the plasma store. I haven't tried it in a docker container, so it's not officially supported, but let's see if we can make it work together :)
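A quick sketch of those checks as a Python script run inside the store's container (using the /mnt/hugepages path from the deployment above):

import os

with open("/proc/sys/vm/nr_hugepages") as f:
    print("nr_hugepages:", f.read().strip())
with open("/proc/sys/vm/hugetlb_shm_group") as f:
    print("hugetlb_shm_group:", f.read().strip())
print("my gid/groups:", os.getgid(), os.getgroups())                  # compare against hugetlb_shm_group
print("-d dir is a mount point:", os.path.ismount("/mnt/hugepages"))  # should be a hugetlbfs mount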

@mitar
Member

mitar commented Dec 13, 2017

An issue right now is that Kubernetes doesn't have an equivalent to --shm-size just yet

This is just one more reason why we should use huge pages instead of /dev/shm.

@mitar
Member

mitar commented Dec 13, 2017

Using emptyDir with medium = Memory seems reasonable. But how do you configure the size of the volume? Or is it just unlimited (all memory) unless specified? How large does it appear to be if you check its size manually?

Can you use emptyDir across pods? Or is that not necessary because file descriptor sharing works?

@robertnishihara
Collaborator

robertnishihara commented Jan 11, 2018

@remram44 @mitar, this was a while ago, but how did you end up resolving this? Were you able to get something working with shared memory between pods?

@robertnishihara
Collaborator

Please reopen if there are more questions/updates.

@metasyn

metasyn commented May 10, 2019

My team is interested in the possibility of using plasma as a way of transferring data between pods - @remram44 @mitar just checking to see if you ever got this working?

@mitar
Member

mitar commented May 10, 2019

Yes, it works well. We just have a host-local directory that we mount into all pods and use for the plasma socket between pods.
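Concretely, a sketch of what the two pods run (with /plasma as a hypothetical shared mount point and the connect arguments from the earlier examples in this thread):

import pyarrow.plasma as plasma

# pod A: write a dataset into the store reachable through the shared socket
client_a = plasma.connect("/plasma/socket", "", 0)
object_id = client_a.put("some dataset")

# pod B: receives the 20-byte ID out of band (e.g. over HTTP or a queue) and reads it back
client_b = plasma.connect("/plasma/socket", "", 0)
print(client_b.get(object_id))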

@mitar
Member

mitar commented May 10, 2019

I have not yet found a good solution for configuring this host-local directory in a scalable way, though, if you want your pods to run on multiple nodes. Some notes I wrote about this:

There seem to be two ways to achieve this:

  • Using inter-pod affinity:
    • Pros: It can work across namespaces. So if we end up running each pod in its own namespace (Remove unnecessary files. #4), we can still schedule pods together; they can then simply use a shared host directory.
    • Cons: We would have to modify the provided pod configurations for each pair to add this affinity configuration to their specs. This should not be too tricky, though, and can probably be a simple YAML transformation.
    • Questions: Do we have to modify the pod configuration, or can we attach the affinity configuration to pods in some other way (maybe through some other Kubernetes objects which then depend on pods by performers)?
  • Using local persistent storage. It allows one to expose a local directory as a persistent volume. Each pod can then reference the same persistent volume claim, which makes both pods get scheduled on the same node.
    • Pros: We can expose those claims through a pod preset, so pods can simply use them like any other volumes.
    • Cons: It seems the same claim cannot be used across namespaces.
    • Questions: It is unclear to me how exactly one identifies which claim name to use. Do you create a claim per node? It seems this works if you have two pods, possibly with multiple copies of each, which are then paired together pairwise. But you cannot use the same claim for different pairs and expect things to just work. So it seems each pair needs its own local volume and a related claim, which makes this very similar to manually scheduling pairs to nodes. In that case it is probably easier to just use node affinity.

@metasyn

metasyn commented May 10, 2019

@mitar thank you for the detailed response :)

@Lorry1123

Yes. It works well. We just have a host-local directory we mount to all pods which we use for plasma socket between pods.

@mitar Hi! I'm also interested in how you mount the host-local directory. Does it work like the following steps?

  1. Create a tmpfs/hugepage directory (for memory sharing) and an empty directory (for the Unix domain socket) on every node
  2. Run plasma_store as a DaemonSet
  3. Mount those two directories into every pod with a hostPath volume (since the memory is not declared in the node or pod spec, I'm afraid it may cause some Kubernetes scheduling issues?)

@mitar
Member

mitar commented Apr 26, 2022

In the end I haven't done it in a way that supports automatic scheduling, so I cannot help you much here.
