
Using plasma on Kubernetes #1315

Closed
remram44 opened this issue Dec 12, 2017 · 39 comments

@remram44

I am trying to use plasma to store datasets in memory, and share them between pods.

I find that this does not work well; in particular, plasma.get/plasma.put tends to hang with no specific error message.

I am sure other people have tried this setup; I would love to hear about their experience.

The setup is:

  • Running plasma_store and clients in different pods on the same node
  • The socket is placed in one R/W volume mounted in all pods
  • Tried plasma_store with both shm and hugepages (-h). With shm, get() hangs on objects submitted by another client; with hugepages, the store complains about missing huge pages when mmapping

Note that I could get this running using Docker containers just fine. I understand that some of those issues are due to Kubernetes more than plasma, but I would love some pointers.

cc @mitar

@pcmoritz
Contributor

Hey @remram44, thanks for bringing this up! Do you have Kubernetes scripts and instructions for setting this up on EC2 so we can reproduce the issue? Any pointers are welcome.

@robertnishihara
Collaborator

@remram44 how did you get it working between Docker containers? Did you have to do anything special?

@remram44
Author

On Docker, I didn't have to do anything; I ran with native Docker on macOS. However, trying this again, it seems to work only if I don't pass an explicit ObjectID to put(); otherwise get() hangs.

Server:

docker run -ti --rm --name plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow plasma_store -s /mnt/socket/plasma -m 10000000

Sender:

docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.put("hello, world").binary())'
b'\x10\x85\x1b\xc6\xe3\xc6\x9f\x8d\x13\x1e\xa7\xda\xf3\xd9\xf0\x0cZ\xf1\xd7/'

Getter:

docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.get(plasma.ObjectID(b"\x10\x85\x1b\xc6\xe3\xc6\x9f\x8d\x13\x1e\xa7\xda\xf3\xd9\xf0\x0cZ\xf1\xd7/")))'
hello, world

Explicit sender:

docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); client.put("hello, world", plasma.ObjectID(b"testidhere"))'

Getter:

docker run -ti --rm --link plasmaserver -v plasmasocket:/mnt/socket remram/python3-pyarrow python -c 'import pyarrow.plasma as plasma; client = plasma.connect("/mnt/socket/plasma", "", 0); print(client.get(plasma.ObjectID(b"testidhere")))'
<hangs>

@remram44
Author

I ran this on Kubernetes on Google Cloud with this configuration:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasmaserver
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasmaserver
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'plasma_store -s /mnt/socket/plasma -m 10000000']
# or
#        command: ['/bin/sh', '-c', 'plasma_store -s /mnt/socket/plasma -m 10000000 -d /mnt/hugepages -h']
        volumeMounts:
        - mountPath: /mnt/socket
          name: socket
        - mountPath: /mnt/hugepages
          name: hugepages
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
      - name: hugepages
        persistentVolumeClaim:
          claimName: hugepagesvc
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasma1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasma1
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'while true; do sleep 30; done']
        volumeMounts:
        - mountPath: /mnt
          name: socket
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: plasma2
spec:
  replicas: 1
  template:
    metadata:
      labels:
        thing: plasma2
    spec:
      containers:
      - name: main
        image: remram/python3-pyarrow
        command: ['/bin/sh', '-c', 'while true; do sleep 30; done']
        volumeMounts:
        - mountPath: /mnt
          name: socket
      volumes:
      - name: socket
        persistentVolumeClaim:
          claimName: plasmasocketvc
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: plasmasocketv
  labels:
    thing: plasmasocket
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/var/plasma-rr4"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: plasmasocketvc
spec:
  storageClassName: ""
  selector:
    matchLabels:
      thing: plasmasocket
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: hugepagesv
  labels:
    thing: hugepages
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/var/hugepages"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: hugepagesvc
spec:
  storageClassName: ""
  selector:
    matchLabels:
      thing: hugepages
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi

Then I ran commands on plasma1 and plasma2 using kubectl exec.

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

@remram44 Thanks! The hanging you are seeing is unrelated to using docker. It hangs because ObjectIDs need to be exactly 20 bytes long. So even without docker, this hangs:

In [5]: client.put("hello", plasma.ObjectID(b"hi"))
Out[5]: ObjectID(68690000537f0000300000000000000091010000)

In [6]: client.get(plasma.ObjectID(b"hi"))

Whereas this works:

In [3]: client.put("hello", plasma.ObjectID(20 * b"h"))
Out[3]: ObjectID(6868686868686868686868686868686868686868)

In [4]: client.get(plasma.ObjectID(20*b"h"))
Out[4]: 'hello'

Can you check whether this also fixes the problem on Kubernetes?
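(As a sketch, one way to always get a valid 20-byte ID is to derive it from a hash; sha1 digests happen to be exactly 20 bytes. The object_id_for helper below is just illustrative:)

import hashlib
import pyarrow.plasma as plasma

def object_id_for(name):
    # sha1 digests are exactly 20 bytes, which is what plasma.ObjectID expects
    return plasma.ObjectID(hashlib.sha1(name.encode()).digest())

client = plasma.connect("/mnt/socket/plasma", "", 0)
client.put("hello, world", object_id_for("my-dataset"))
print(client.get(object_id_for("my-dataset")))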

@remram44
Author

This probably should be raising a ValueError 😅 but I agree that it's a separate problem. I'll try again with valid IDs.

@mitar
Member

mitar commented Dec 13, 2017

I am surprised that your Docker example even works. The Plasma store uses /dev/shm by default to share objects on Linux, but /dev/shm is not shared between Docker containers, so your server and client do not have the same /dev/shm. I am not sure how the communication works here.

@mitar
Member

mitar commented Dec 13, 2017

Hanging on an invalid ObjectID is really surprising. :-)

(It is interesting that GitHub colors the invalid ObjectID with a red background?)

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

I don't know why it is red :)

I agree it is not good behaviour and should give an error. I submitted a JIRA ticket here and will fix it ASAP: https://issues.apache.org/jira/browse/ARROW-1919

Thanks for finding the problem!

@mitar
Member

mitar commented Dec 13, 2017

@pcmoritz: Do you understand why sharing works between containers even if /dev/shm is not shared?

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

I do not understand it and have not tried it, but it seems to be possible to share memory between docker containers in general; see https://stackoverflow.com/questions/29173193/shared-memory-with-docker-containers-docker-version-1-4-1

@mitar
Member

mitar commented Dec 13, 2017

It seems we would have to use the --ipc argument, but the example above does not. This is why I am confused. @remram44, which Docker version are you using? If you go into two docker containers and create a file in /dev/shm in one, does it appear in the other container?

Also, @pcmoritz, is /dev/shm being used by the Plasma store, or is memory sharing done in some other way?

@pcmoritz
Contributor

By default it uses /dev/shm on Linux and /tmp/ on macOS, and it can be configured to use another location with the -d flag.

@mitar
Member

mitar commented Dec 13, 2017

What does it store there? Does it store whole objects and then mmap them? Because /tmp is not in memory on macOS.

@pcmoritz
Contributor

We had the same suspicion and did performance experiments; it behaves very much like it is in memory. We actually unlink the file before writing anything, so maybe that prevents flushing to disk. This is the same strategy Google Chrome uses for its shared memory.

@mitar
Member

mitar commented Dec 13, 2017

Do both containers have to have access to the same /dev/shm, or do you send a file descriptor over the socket? Does /dev/shm have to be large (larger than the -m parameter)?

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

The file descriptor is sent over the socket. That's a good point; that's probably what makes it work. And yes, /dev/shm needs to be larger than the -m parameter, otherwise an error is raised; see https://github.com/apache/arrow/blob/master/cpp/src/plasma/store.cc#L810.
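For the curious, a minimal self-contained sketch of file descriptor passing over a Unix socket (just an illustration of the mechanism, not the actual plasma code):

import array
import os
import socket

# a connected pair of Unix sockets standing in for the store/client connection
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

fd = os.open("/dev/shm/fd-demo", os.O_CREAT | os.O_RDWR, 0o600)
# send the descriptor as SCM_RIGHTS ancillary data (at least one byte of payload is required)
sender.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))])

msg, ancdata, flags, addr = receiver.recvmsg(1, socket.CMSG_SPACE(array.array("i").itemsize))
level, ctype, data = ancdata[0]
fds = array.array("i")
fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
print(fds[0])  # a new descriptor in the receiver referring to the same open file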

@mitar
Member

mitar commented Dec 13, 2017

OK, the value above is 10000000, which is around 10 MB, less than Docker's default 64 MB /dev/shm.

Yes, I would also suspect so. So I would assume the object is stored in the /dev/shm of the Docker container that created it, and the others just access it through the file descriptor. I think we should test what happens if the container that created the object is stopped, and who is responsible for cleaning up the file descriptors.

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

The beauty here is that the OS does refcounting on the file descriptors and releases the resources when the last reference goes away. That's why we went through the pain of making the file descriptor sending work and unlinking the original file; the combination of these makes sure there is no garbage left behind.

Not sure what happens in the docker container case, however: does the host OS do the refcounting in that case and everything magically works? I don't know, but I suspect so. Let me know if you plan to look into this!
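To make the unlinking point concrete, a tiny standalone sketch (the file name is arbitrary):

import mmap
import os

fd = os.open("/dev/shm/plasma-unlink-demo", os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, 4096)
os.unlink("/dev/shm/plasma-unlink-demo")  # the name is gone from the filesystem...
buf = mmap.mmap(fd, 4096)
buf[:5] = b"hello"                        # ...but the memory stays usable through the fd/mapping
# the kernel releases the memory once the last descriptor/mapping referring to it is closed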

@pcmoritz
Contributor

pcmoritz commented Dec 13, 2017

Disregarding the details, I'm extremely happy to learn that it works in the case of multiple docker containers. That's really great :)

So in the future we could use docker to get isolation between workers! And if the things stored in the object store are not pickled and use arrow data instead (pickle could be deactivated), it might even be possible to get some level of security from this if you trust docker's isolation.

@remram44
Author

An issue right now is that Kubernetes doesn't have an equivalent to --shm-size just yet (kubernetes-28272).

@remram44
Author

remram44 commented Dec 13, 2017

Ok, so again running on GKE, I could get plasma to run just fine with shm (staying under the default Docker size of 64MB, and using 20-byte object IDs), but no luck with hugepages. Support seems to be upcoming (alpha in 1.8; see here).

Can I use the -d option without -h to specify an alternate location instead of /dev/shm, so that I can provide a bigger shm as a volume?

@remram44
Author

Mounting a bigger shm from the host, either as /dev/shm in the container or somewhere else with -d pointing to it, allows me to use a bigger -m value than 64MB (as per this openshift workaround).

So I guess plasma is usable on Docker and Kubernetes after all, just not with hugepages?

Allowing the Plasma store to use up to 0.01GB of memory.
Starting object store with directory /mnt/hugepages and huge page support enabled
mmap failed with error: Cannot allocate memory
  (this probably means you have to increase /proc/sys/vm/nr_hugepages)
mmap failed with error: Cannot allocate memory
  (this probably means you have to increase /proc/sys/vm/nr_hugepages)
...
mmap failed with error: Cannot allocate memory
  (this probably means you have to increase /proc/sys/vm/nr_hugepages)
There is not enough space to create this object, so evicting 0 objects to free up 0 bytes.
Disconnecting client on fd 5

@atumanov
Contributor

@remram44, in the log above, your plasma store is starting with 0.01GB, i.e. 10MB. Hugepages in the plasma store only start working with a minimum memory allocation of 1GB:
https://github.com/apache/arrow/blob/master/cpp/src/plasma/store.cc#L820

@atumanov
Contributor

@remram44, if you are sure you are dealing with 2MB hugepages, you could try overriding that 1GB default with, say, 10MB instead, to fit your memory configuration.
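As a rough sketch of the arithmetic (assuming 2MB pages; in practice you would reserve a few extra pages):

import math

plasma_memory = 10 * 1024 * 1024   # the -m value you want, in bytes
hugepage_size = 2 * 1024 * 1024    # 2MB pages; use the 1GB page size if that is what the node provides
print(math.ceil(plasma_memory / hugepage_size))  # a candidate value for /proc/sys/vm/nr_hugepages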

@remram44
Author

You mean that plasma doesn't work with hugepages if the -m value is below 1GB?

@atumanov
Contributor

Yes, I believe that's correct, but it's a one-line change. I think we could log an error message on startup if the specified -m value is < 1GB when -h is also specified. We might have decided against it because 1GB is not fundamental; it's a safe default that works for both 2MB and 1GB pages. With 2MB hugepages being the more popular/widespread option, that default can be changed. We felt that 1GB would be a more robust out-of-the-box default when the hugepage size on the target platform is unknown.

@remram44
Author

Same error when running with -m 2000000000 -h, unfortunately.

@atumanov
Contributor

@remram44, did you set up the mount point inside the docker containers to be backed by hugetlbfs? In case you haven't gone through the process of setting up the mount point, here's the link:
http://ray.readthedocs.io/en/latest/plasma-object-store.html
Things to check:

  • Is the directory specified with -d visible inside the container and backed by hugetlbfs? You should be able to touch files in there.
  • What's the gid of the plasma store process? Does it match cat /proc/sys/vm/hugetlb_shm_group?
  • What's the number of huge pages allocated? What's the output of cat /proc/sys/vm/nr_hugepages?

All of this -- inside the docker container running the plasma store. I haven't tried it in a docker container, so it's not officially supported, but let's see if we can make it work together :)
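A quick sketch of those checks as a Python script run inside the store's container (using the /mnt/hugepages path from the deployment above):

import os

with open("/proc/sys/vm/nr_hugepages") as f:
    print("nr_hugepages:", f.read().strip())
with open("/proc/sys/vm/hugetlb_shm_group") as f:
    print("hugetlb_shm_group:", f.read().strip())
print("my gid/groups:", os.getgid(), os.getgroups())                  # compare against hugetlb_shm_group
print("-d dir is a mount point:", os.path.ismount("/mnt/hugepages"))  # should be a hugetlbfs mount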

@mitar
Member

mitar commented Dec 13, 2017

An issue right now is that Kubernetes doesn't have an equivalent to --shm-size just yet

This is just one more reason why we should use huge pages instead of /dev/shm.

@mitar
Member

mitar commented Dec 13, 2017

Using emptyDir with medium = Memory seems reasonable. But how do you configure the size of the volume? Or is it just unlimited (all memory) unless specified? How large does it appear to be if you check its size manually?

Can you use emptyDir across pods? Or is that not necessary because file descriptor sharing works?

@robertnishihara
Collaborator

robertnishihara commented Jan 11, 2018

@remram44 @mitar, this was a while ago, but how did you end up resolving this? Were you able to get something working with shared memory between pods?

@robertnishihara
Collaborator

Please reopen if there are more questions/updates.

@metasyn

metasyn commented May 10, 2019

My team is interested in the possibility of using plasma as a way of transferring data between pods - @remram44 @mitar just checking to see if you ever got this working?

@mitar
Member

mitar commented May 10, 2019

Yes, it works well. We just have a host-local directory that we mount into all pods and use for the plasma socket between pods.
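Concretely, a sketch of what the two pods run (with /plasma as a hypothetical shared mount point and the connect arguments from the earlier examples in this thread):

import pyarrow.plasma as plasma

# pod A: write a dataset into the store reachable through the shared socket
client_a = plasma.connect("/plasma/socket", "", 0)
object_id = client_a.put("some dataset")

# pod B: receives the 20-byte ID out of band (e.g. over HTTP or a queue) and reads it back
client_b = plasma.connect("/plasma/socket", "", 0)
print(client_b.get(object_id))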

@mitar
Member

mitar commented May 10, 2019

I have not yet found a good solution for configuring this host-local directory in a scalable way, though, if you want your pods to run on multiple nodes. Some notes I wrote about this:

There seem to be two ways to achieve this:

  • Using inter-pod affinity:
    • Pros: It can work across namespaces. So if we end up running each pod in its own namespace (Remove unnecessary files. #4), we can still schedule pods together; they can then simply use a shared host directory.
    • Cons: We would have to modify the provided pod configurations for each pair to add this affinity configuration to their specs. This should not be too tricky, though, and can probably be a simple YAML transformation.
    • Questions: Do we have to modify the pod configuration, or can we attach the affinity configuration to pods in some other way (maybe through some other Kubernetes objects which then depend on pods by performers)?
  • Using local persistent storage. It allows one to expose a local directory as a persistent volume. Each pod can then reference the same persistent volume claim, which makes both pods get scheduled on the same node.
    • Pros: We can expose those claims through a pod preset, so pods can simply use them like any other volumes.
    • Cons: It seems the same claim cannot be used across namespaces.
    • Questions: It is unclear to me how exactly one identifies which claim name to use. Do you create a claim per node? It seems this works if you have two pods, possibly with multiple copies of each, which are then paired together pairwise. But you cannot use the same claim for different pairs and expect things to just work. So it seems each pair needs its own local volume and a related claim, which makes this very similar to manually scheduling pairs to nodes. In that case it is probably easier to just use node affinity.

@metasyn

metasyn commented May 10, 2019

@mitar thank you for the detailed response :)

@Lorry1123

Yes. It works well. We just have a host-local directory we mount to all pods which we use for plasma socket between pods.

@mitar Hi! I'm also interested in how you mount the host-local directory. Does it work like the following steps?

  1. Create a tmpfs/hugepage directory (for memory sharing) and an empty directory (for the Unix domain socket) on every node
  2. Run plasma_store as a DaemonSet
  3. Mount those two directories into every pod with a hostPath volume (since the memory is not declared in the node or pod spec, I'm afraid it may cause some Kubernetes scheduling issues?)

@mitar
Member

mitar commented Apr 26, 2022

In the end I haven't done it in a way that supports automatic scheduling, so I cannot help you much here.
