
Integration tests: spill/unspill #136

Closed · Tracked by #138 · Fixed by #229
fjetter opened this issue May 25, 2022 · 20 comments

Assignees: hendrikmakait
Labels: stability (work related to stability), test idea

Comments

@fjetter (Member) commented May 25, 2022

Persist a lot of data on disk

What happens when we load 10x more data than we have RAM?

  • Does Coiled have enough storage by default? How do we modify this?
  • How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?

Pseudocode

import dask.array as da
from distributed import wait

x = da.random.random(...).persist()  # load the data
wait(x)
x.sum().compute()  # force us to read the data again

[EDIT by @crusaderky ]
In addition to the trivial use case above, we would also like to have
https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085

scaled up to production size. This will stress the important use case of spilled tasks that are taken out of the spill file and back into memory not to be computed, but to be transferred to another worker.
This stress test should find a sizing that is spilling/unspilling heavily but is still completing successfully.
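
A rough sketch of such a workload (the sizes, the chunking, and the local Client() are illustrative stand-ins, not the actual test):

import dask.array as da
from distributed import Client, wait

client = Client()  # in the real test this would be a Coiled cluster

# Persist more data than fits in cluster RAM so that workers spill to disk.
a = da.random.random((50_000, 50_000), chunks=(5_000, 5_000))  # ~20 GB of float64
a = a.persist()
wait(a)

# A tensordot-style reduction needs chunks from many workers, so spilled chunks
# are read back from disk in order to be transferred, not recomputed.
(a @ a.T).sum().compute()
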
Related:

@fjetter added the stability label on May 25, 2022
@mrocklin (Member) commented May 25, 2022 via email

@mrocklin (Member) commented May 25, 2022 via email

@ntabris (Member) commented May 25, 2022

Does Coiled have enough storage by default? How do we modify this?

Currently the default is 100GiB EBS, but if there's a local NVMe we'll attach it to /scratch and set dask to use that for temp storage (so, spill).

There isn't currently a way to adjust the size of the EBS volume; I'd be interested to know if there's desire/need for that. This means that currently the best way to get a large disk (and a fast disk for disk-intensive workloads) is to use an instance with local NVMe, for example something from the i3 family.
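
A hedged sketch of what requesting such an instance could look like; worker_vm_types is assumed to be the relevant coiled.Cluster keyword here, so check the Coiled docs for the current API:

import coiled

cluster = coiled.Cluster(
    n_workers=10,
    worker_vm_types=["i3.2xlarge"],  # i3 family ships with local NVMe, mounted at /scratch
)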

@jrbourbeau (Member) commented

This was mostly done by Naty and Hendrik

@ncclementi where did this work end up?

@ncclementi (Contributor) commented

We tried to persist a lot of data with @hendrikmakait and we were able to do so; things didn't crash, and in the process we discovered something that led to dask/distributed#6280

We experimented with this, but we did not write a test. We were not quite sure what the test would look like, as the EBS kept expanding.

@ntabris (Member) commented May 25, 2022

the EBS kept expanding

As in, the amount written to disk was expanding? Or disk size was expanding? (That would make me very puzzled.)

@ncclementi (Contributor) commented

@ntabris I think I made a bad choice of words; we just saw that it kept spilling and it seemed to never end, but I don't recall how close we got to 100GB. @hendrikmakait, do you remember?

@fjetter (Member, Author) commented May 31, 2022

This was mostly done by Naty and Hendrik. Things were fine.

I would actually like to have a test as part of the coiled runtime to confirm not only that this is not an issue right now, but also that it never becomes one.

we were able to do so, things didn't crash, and in the process we discover something that lead to this

Is the code that led to this still available? Just because it didn't crash doesn't mean it isn't valuable

@ncclementi (Contributor) commented

I would actually like to have a test as part of the coiled runtime to confirm not only that this is not an issue right now, but also that it never becomes one.

I think this sounds reasonable; the part I am struggling with is how to design a test for this. What is the expected behavior, and what counts as an issue? At the moment we did something like
da.random.random((1_000_000, 100_000), chunks=(10_000, 1_000)), which is an array of approximately 750 GB, and tried to persist it on a default cluster. But there was no proper test design around it.
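
A rough sketch of what a test around this experiment might look like; the client fixture and the assertion are illustrative, only the array shape and chunks come from the experiment above:

import dask.array as da
from distributed import wait


def test_persist_larger_than_memory(client):  # `client` comes from a test fixture
    x = da.random.random((1_000_000, 100_000), chunks=(10_000, 1_000))  # ~750 GB
    x = x.persist()
    wait(x)  # every chunk is now either in memory or spilled to disk
    # Unspilling should also succeed, without killed workers or errors.
    assert x.sum().compute() > 0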

Is the code that led to this still available? Just because it didn't crash doesn't mean it isn't valuable

I do not have the exact code that we ran at the moment; we were experimenting in an IPython session. But Guido was able to reproduce this and created a test for it, which is in the PR: https://github.com/dask/distributed/pull/6280/files#diff-96777781dd54f26ed9441afb42909cf6f5393d6ef0b2b2a2e7e8dc46f074df93

@mrocklin (Member) commented May 31, 2022 via email

@crusaderky changed the title from "Integration tests: Persist data" to "Integration tests: spill/unspill" on Jul 7, 2022
@hendrikmakait self-assigned this on Jul 21, 2022
@hendrikmakait (Member) commented

While working on this, I identified dask/distributed#6783, which makes it hard to gauge disk space usage on workers.

@hendrikmakait (Member) commented

XREF: dask/distributed/pull/6835 makes it easier to evaluate disk I/O

@hendrikmakait (Member) commented Aug 5, 2022

What happens when we load 10x more data than we have RAM?

It depends, as we will see below.

Does Coiled have enough storage by default?

Since https://docs.coiled.io/user_guide/cloud_changelog.html#june-2022, the answer to this question is mostly NO:

Previously we attached a 100GB disk to every instance, now the default size will be between 30GB and 100GB and depends on how much memory (RAM) the instance has.

For example, the default t3.medium instance, which has 4 GB of RAM, runs out of disk space at around 26.5 GB of spillage. In the worker logs, we then find output like this:

Aug  5 10:46:50 ip-10-0-15-123 rsyslogd: action 'action-3-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2001.0 try https://www.rsyslog.com/e/2027 ]
Aug  5 10:46:50 ip-10-0-15-123 rsyslogd: file '/var/log/syslog'[7] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2001.0 try https://www.rsyslog.com/e/2027 ]
Aug  5 10:46:52 ip-10-0-15-123 dockerd[887]: time="2022-08-05T10:46:52.205685231Z" level=error msg="Failed to log msg \"\" for logger json-file: error writing log entry: write /var/lib/docker/cAug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]:     self.d[key] = pickled
Aug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/zict/file.py", line 101, in __setitem__
Aug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]:     fh.writelines(value)
Aug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]: OSError: [Errno 28] No space left on device

For the user, this means that the task eventually fails enough times and they receive a bunch of messages like the following:

2022-08-05 12:38:04,691 - distributed.protocol.pickle - INFO - Failed to deserialize b"\x80\x04\x95\xad\x19\x00\x00\x00\x00\x00\x00\x8c\x15distributed.scheduler\x94\x8c\x0cKilledWorker\x94\x93\x94\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 6, 23)\x94h\x00\x8c\x0bWorkerState\x94\x93\x94)\x81\x94N}\x94(\x8c\x07address\x94\x8c\[x17tls://10.0.15.123:40069\x94\x8c\x03pid\x94K7\x8c\x04name\x94\x8c\x1fhendrik-debug-worker-78663614bd\x94\x8c\x08nthreads\x94K\x02\x8c\x0cmemory_limit\x94\x8a\x05\x00\xc0\x9e\xf1\x00\x8c\x0flocal_directory\x94\x8c*/scratch/dask-worker-space/worker-r_h553hi\x94\x8c\x08services\x94}\x94\x8c\tdashboard\x94M\xd3\x99s\x8c\x08versions\x94}\x94\x8c\x05nanny\x94\x8c\x17tls://10.0.15.123:46585\x94\x8c\x06status\x94\x8c\x10distributed.core\x94\x8c\x06Status\x94\x93\x94\x8c\x06closed\x94\x85\x94R\x94\x8c\x05_hash\x94\x8a\x08\x1a\x82\xf0&E\x92(\xa2\x8c\x06nbytes\x94K\x00\x8c\toccupancy\x94K\x00\x8c\x15_memory_unmanaged_old\x94K\x00\x8c\x19_memory_unmanaged_history\x94\x8c\x0bcollections\x94\x8c\x05deque\x94\x93\x94)R\x94\x8c\x07metrics\x94}\x94\x8c\tlast_seen\x94K\x00\x8c\ntime_delay\x94K\x00\x8c\tbandwidth\x94J\x00\xe1\xf5\x05\x8c\x06actors\x94\x8f\x94\x8c\t_has_what\x94}\x94\x8c\nprocessing\x94}\x94(h\x03G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18](x17tls://10.0.15.123:40069\x94\x8c\x03pid\x94K7\x8c\x04name\x94\x8c\x1fhendrik-debug-worker-78663614bd\x94\x8c\x08nthreads\x94K\x02\x8c\x0cmemory_limit\x94\x8a\x05\x00\xc0\x9e\xf1\x00\x8c\x0flocal_directory\x94\x8c*/scratch/dask-worker-space/worker-r_h553hi\x94\x8c\x08services\x94}\x94\x8c\tdashboard\x94M\xd3\x99s\x8c\x08versions\x94}\x94\x8c\x05nanny\x94\x8c\x17tls://10.0.15.123:46585\x94\x8c\x06status\x94\x8c\x10distributed.core\x94\x8c\x06Status\x94\x93\x94\x8c\x06closed\x94\x85\x94R\x94\x8c\x05_hash\x94\x8a\x08\x1a\x82\xf0&E\x92(\xa2\x8c\x06nbytes\x94K\x00\x8c\toccupancy\x94K\x00\x8c\x15_memory_unmanaged_old\x94K\x00\x8c\x19_memory_unmanaged_history\x94\x8c\x0bcollections\x94\x8c\x05deque\x94\x93\x94)R\x94\x8c\x07metrics\x94}\x94\x8c\tlast_seen\x94K\x00\x8c\ntime_delay\x94K\x00\x8c\tbandwidth\x94J\x00\xe1\xf5\x05\x8c\x06actors\x94\x8f\x94\x8c\t_has_what\x94}\x94\x8c\nprocessing\x94}\x94(h\x03G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9(%27random_sample-950bfce4388763a088b491ac651d6b18)', 6, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 
13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 
3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8u\x8c\x0clong_running\x94\x8f\x94\x8c\texecuting\x94}\x94\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 6, 22)\x94G?\xd8hh\x00\x00\x00\x00s\x8c\tresources\x94}\x94\x8c\x0eused_resources\x94}\x94\x8c\x05extra\x94}\x94\x8c\tserver_id\x94\x8c+Worker-e1d62c65-e3d3-41ff-bcd1-97f8dafeeeea\x94u\x86\x94b\x86\x94R\x94."
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 66, in loads
    return pickle.loads(x)
AttributeError: 'WorkerState' object has no attribute 'server_id'

Note: This is not helpful at all and AttributeError: 'WorkerState' object has no attribute 'server_id' looks like a bug on its own.

How do we modify this?

We can use worker_disk_size when creating a cluster to allocate sufficient disk space. If we choose a sufficiently large size, everything works as expected.
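
An illustrative call using the worker_disk_size parameter mentioned above; the other arguments and the unit (GiB) are assumptions:

import coiled

cluster = coiled.Cluster(
    n_workers=10,
    worker_disk_size=64,  # disk per worker, sized well above 10x the worker's RAM
)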

How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?

I'd like dask/distributed#6835 to be merged before diving into this, since we cannot copy values from the WorkerTable.

@hendrikmakait (Member) commented

How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?

In terms of hard numbers, we reach ~125 MiB/s on both exclusive reads and writes. It looks like performance numbers for EBS on EC2 instances are hard to come by (at the very least for the average throughput). To check against theoretical performance, we need to perform some experiments on raw EC2 instances.

One caveat I found is that if we keep referencing the large array and then calculate a sum on it, we see ~62 MiB/s of read and write at the same time. It looks like we are unspilling a chunk to calculate a sum on it, then spilling it back to disk because we need that memory for another chunk. Given the immutability of the chunks, we may want to consider a lazier policy that keeps spilled data on disk until it actually needs to be removed, i.e. when the worker wants to get rid of it both in memory and on disk.
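
A sketch of how the read/write rates above could be measured, by sampling psutil's disk I/O counters on every worker and diffing two samples; the interval and the local Client() are illustrative:

import time

import psutil
from distributed import Client


def sample_io():
    c = psutil.disk_io_counters()
    return time.time(), c.read_bytes, c.write_bytes


client = Client()  # stand-in for the real cluster
before = client.run(sample_io)
time.sleep(30)  # let the workload spill/unspill for a while
after = client.run(sample_io)

for addr in before:
    (t0, r0, w0), (t1, r1, w1) = before[addr], after[addr]
    dt = t1 - t0
    print(f"{addr}: read ~{(r1 - r0) / dt / 2**20:.0f} MiB/s, "
          f"write ~{(w1 - w0) / dt / 2**20:.0f} MiB/s")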

@hendrikmakait (Member) commented Aug 5, 2022

In addition to the trivial use case above, we would also like to have
https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085 scaled up to production size.

After taking the first shot at this, it looks like scaling tensordot_stress up is more of a memory than a disk problem:

Aug  5 14:30:29 ip-10-0-5-24 kernel: [  708.236580] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=c391e262f16fa83e76cb194623ffd8e4a9599f544effa9e34a78135b799ea07d,mems_allowed=0,global_oom,task_memcg=/docker/c391e262f16fa83e76cb194623ffd8e4a9599f544effa9e34a78135b799ea07d,task=python,pid=1914,uid=0
Aug  5 14:30:29 ip-10-0-5-24 kernel: [  708.236609] Out of memory: Killed process 1914 (python) total-vm:4358956kB, anon-rss:3259096kB, file-rss:0kB, shmem-rss:8kB, UID:0 pgtables:6660kB oom_score_adj:0

@hendrikmakait (Member) commented

Does Coiled have enough storage by default?

Since https://docs.coiled.io/user_guide/cloud_changelog.html#june-2022, the answer to this question is mostly NO[...]

To clarify: I mean that we do not have enough storage by default to store 10x more data than we have in RAM, which is the initial question of this issue. The discussion around default disk sizes can be found here: https://github.com/coiled/oss-engineering/issues/123. Following the discussion on that issue, and given how easy it is to adjust disk size with worker_disk_size, the default seems reasonable to me.

@fjetter (Member, Author) commented Aug 10, 2022

To clarify: I mean that we do not have enough storage by default to store 10x more data than we have in RAM, which is the initial question of this issue

That's fine. I think any multiplier >1 is fine, assuming we can configure this on the Coiled side. The idea of this issue is to apply a workload that requires more memory than is available on the cluster but can finish successfully if data is stored to disk. Whether this is 1.5x, 2x, or 10x is not that important.
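
A sketch of how such a workload could be sized from the cluster's actual memory, so that any multiplier > 1 forces spilling; the names, the multiplier, and the local Client() are illustrative:

import dask.array as da
from distributed import Client, wait

client = Client()  # stand-in for the Coiled cluster
workers = client.scheduler_info()["workers"]
cluster_memory = sum(w["memory_limit"] for w in workers.values())

multiplier = 2  # > 1 is what matters; the exact value less so
n = int((cluster_memory * multiplier / 8) ** 0.5)  # square float64 array of ~multiplier x cluster RAM

x = da.random.random((n, n), chunks="auto").persist()
wait(x)
x.sum().compute()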

After taking the first shot at this, it looks like scaling tensordot_stress up is more of a memory than a disk problem:

Also an interesting find. If the chunks are small enough, this should always be able to finish. There is an edge case where disk is full, memory is full, and the entire cluster pauses. Beyond this edge case, the computation should always finish and we should definitely never see an OOM exception, iff the chunks are small enough.

@hendrikmakait (Member) commented

@fjetter: I'm currently taking a look at what's going on in the tensordot_stress test; I'll keep you posted.

@hendrikmakait (Member) commented Aug 10, 2022

For the user, this means that the task eventually fails enough times and they receive a bunch of messages like the following:

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 66, in loads
    return pickle.loads(x)
AttributeError: 'WorkerState' object has no attribute 'server_id'

Note: This is not helpful at all and AttributeError: 'WorkerState' object has no attribute 'server_id' looks like a bug on its own.

This was caused by a version mismatch with my custom software environment, which used the latest distributed from main. After fixing that, it takes a lot of time for all tasks to either end up in memory or erred, but when they do, the erred ones come back with a KilledWorker exception:

distributed.scheduler.KilledWorker("('random_sample-98bbea72b3b5abb3f8bf909c540c55f0', 0, 428)",
                                   <WorkerState 'tls://10.0.3.246:45495', name: hendrik-debug-worker-bb94fa4efc, status: closed, memory: 0, processing: 222>

One reason why it takes a lot of time is that workers keep straggling when they reach the disk size limit, and we wait a while before deciding to reassign those tasks to other workers (and remove said workers).
(Cluster ID for inspection: https://cloud.coiled.io/dask-engineering/clusters/48344/details)

@hendrikmakait (Member) commented

One thing that might be helpful for the failing case is monitoring disk usage for each worker. We currently monitor how much data is spilled, but we do not track/display disk used/free anywhere.
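
A stopgap sketch for per-worker disk monitoring, checking the filesystem that holds each worker's local (spill) directory; the local Client() is illustrative:

import shutil

from distributed import Client


def disk_usage(dask_worker):
    # distributed injects the Worker instance for a `dask_worker` argument
    total, used, free = shutil.disk_usage(dask_worker.local_directory)
    return {"used": used, "free": free, "total": total}


client = Client()
print(client.run(disk_usage))  # {worker_address: {"used": ..., "free": ..., "total": ...}, ...}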
