
Integration tests: spill/unspill #136

Closed · Tracked by #138 · Fixed by #229
fjetter opened this issue May 25, 2022 · 20 comments

Assignees: hendrikmakait
Labels: stability (work related to stability), test idea

Comments

@fjetter (Member) commented May 25, 2022

Persist a lot of data on disk

What happens when we load 10x more data than we have RAM?

  • Does Coiled have enough storage by default? How do we modify this?
  • How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?

Pseudocode

import dask.array as da
from distributed import wait

x = da.random.random(...).persist()  # load the data
wait(x)
x.sum().compute()  # force us to read the data again

[EDIT by @crusaderky ]
In addition to the trivial use case above, we would also like to have
https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085

scaled up to production size. This will stress the important use case of spilled tasks that are taken out of the spill file and back into memory not to be computed, but to be transferred to another worker.
This stress test should find a sizing that is spilling/unspilling heavily but is still completing successfully.
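
A rough sketch of such a workload (the sizes, the chunking, and the local Client() are illustrative stand-ins, not the actual test):

import dask.array as da
from distributed import Client, wait

client = Client()  # in the real test this would be a Coiled cluster

# Persist more data than fits in cluster RAM so that workers spill to disk.
a = da.random.random((50_000, 50_000), chunks=(5_000, 5_000))  # ~20 GB of float64
a = a.persist()
wait(a)

# A tensordot-style reduction needs chunks from many workers, so spilled chunks
# are read back from disk in order to be transferred, not recomputed.
(a @ a.T).sum().compute()
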
Related:

@fjetter added the stability label on May 25, 2022
@mrocklin (Member) commented May 25, 2022 via email

@mrocklin (Member) commented May 25, 2022 via email

@ntabris (Member) commented May 25, 2022

Does Coiled have enough storage by default? How do we modify this?

Currently the default is 100GiB EBS, but if there's a local NVMe we'll attach it to /scratch and set dask to use that for temp storage (so, spill).

There isn't currently a way to adjust the size of the EBS volume; I'd be interested to know if there's desire/need for that. This means that currently the best way to get a large disk (and a fast disk for disk-intensive workloads) is to use an instance with local NVMe, for example something from the i3 family.
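
A hedged sketch of what requesting such an instance could look like; worker_vm_types is assumed to be the relevant coiled.Cluster keyword here, so check the Coiled docs for the current API:

import coiled

cluster = coiled.Cluster(
    n_workers=10,
    worker_vm_types=["i3.2xlarge"],  # i3 family ships with local NVMe, mounted at /scratch
)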

@jrbourbeau (Member) commented

This was mostly done by Naty and Hendrik

@ncclementi where did this work end up?

@ncclementi (Contributor) commented

We tried to persist a lot of data with @hendrikmakait and we were able to do so; things didn't crash, and in the process we discovered something that led to dask/distributed#6280

We experimented with this, but we did not write a test. We were not quite sure what the test would look like, as the EBS kept expanding.

@ntabris (Member) commented May 25, 2022

the EBS kept expanding

As in, the amount written to disk was expanding? Or disk size was expanding? (That would make me very puzzled.)

@ncclementi (Contributor) commented

@ntabris I think I made a bad choice of words; we just saw that it kept spilling and it seemed to never end, but I don't recall how close we got to 100GB. @hendrikmakait, do you remember?

@fjetter (Member, Author) commented May 31, 2022

This was mostly done by Naty and Hendrik. Things were fine.

I would actually like to have a test as part of the coiled runtime to confirm not only that this is not an issue right now, but also that it never becomes one.

we were able to do so, things didn't crash, and in the process we discover something that lead to this

Is the code that led to this still available? Just because it didn't crash doesn't mean it isn't valuable

@ncclementi (Contributor) commented

I would actually like to have a test as part of the coiled runtime to confirm not only that this is not an issue right now, but also that it never becomes one.

I think this sounds reasonable; the part I am struggling with is how to design a test for this. What is the expected behavior, and what counts as an issue? At the moment we did something like
da.random.random((1_000_000, 100_000), chunks=(10_000, 1_000)), which is an array of approximately 750 GB, and tried to persist it on a default cluster. But there was no proper test design around it.
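
A rough sketch of what a test around this experiment might look like; the client fixture and the assertion are illustrative, only the array shape and chunks come from the experiment above:

import dask.array as da
from distributed import wait


def test_persist_larger_than_memory(client):  # `client` comes from a test fixture
    x = da.random.random((1_000_000, 100_000), chunks=(10_000, 1_000))  # ~750 GB
    x = x.persist()
    wait(x)  # every chunk is now either in memory or spilled to disk
    # Unspilling should also succeed, without killed workers or errors.
    assert x.sum().compute() > 0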

Is the code that led to this still available? Just because it didn't crash doesn't mean it isn't valuable

I do not have the exact code that we ran at the moment; we were experimenting in an IPython session. But Guido was able to reproduce this and created a test for it, which is in the PR: https://github.com/dask/distributed/pull/6280/files#diff-96777781dd54f26ed9441afb42909cf6f5393d6ef0b2b2a2e7e8dc46f074df93

@mrocklin (Member) commented May 31, 2022 via email

@crusaderky changed the title from "Integration tests: Persist data" to "Integration tests: spill/unspill" on Jul 7, 2022
@hendrikmakait self-assigned this on Jul 21, 2022
@hendrikmakait (Member) commented

While working on this, I identified dask/distributed#6783, which makes it hard to gauge disk space usage on workers.

@hendrikmakait (Member) commented

XREF: dask/distributed/pull/6835 makes it easier to evaluate disk I/O

@hendrikmakait (Member) commented Aug 5, 2022

What happens when we load 10x more data than we have RAM?

It depends, as we will see below.

Does Coiled have enough storage by default?

Since https://docs.coiled.io/user_guide/cloud_changelog.html#june-2022, the answer to this question is mostly NO:

Previously we attached a 100GB disk to every instance, now the default size will be between 30GB and 100GB and depends on how much memory (RAM) the instance has.

For example, the default t3.medium instance, which has 4 GB of RAM, runs out of disk space at around 26.5 GB of spillage. In the worker logs, we then find output like this:

Aug  5 10:46:50 ip-10-0-15-123 rsyslogd: action 'action-3-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2001.0 try https://www.rsyslog.com/e/2027 ]
Aug  5 10:46:50 ip-10-0-15-123 rsyslogd: file '/var/log/syslog'[7] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2001.0 try https://www.rsyslog.com/e/2027 ]
Aug  5 10:46:52 ip-10-0-15-123 dockerd[887]: time="2022-08-05T10:46:52.205685231Z" level=error msg="Failed to log msg \"\" for logger json-file: error writing log entry: write /var/lib/docker/cAug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]:     self.d[key] = pickled
Aug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/zict/file.py", line 101, in __setitem__
Aug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]:     fh.writelines(value)
Aug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]: OSError: [Errno 28] No space left on device

For the user, this means that the task eventually fails enough times and they receive a bunch of messages like the following:

2022-08-05 12:38:04,691 - distributed.protocol.pickle - INFO - Failed to deserialize b"\x80\x04\x95\xad\x19\x00\x00\x00\x00\x00\x00\x8c\x15distributed.scheduler\x94\x8c\x0cKilledWorker\x94\x93\x94\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 6, 23)\x94h\x00\x8c\x0bWorkerState\x94\x93\x94)\x81\x94N}\x94(\x8c\x07address\x94\x8c\[x17tls://10.0.15.123:40069\x94\x8c\x03pid\x94K7\x8c\x04name\x94\x8c\x1fhendrik-debug-worker-78663614bd\x94\x8c\x08nthreads\x94K\x02\x8c\x0cmemory_limit\x94\x8a\x05\x00\xc0\x9e\xf1\x00\x8c\x0flocal_directory\x94\x8c*/scratch/dask-worker-space/worker-r_h553hi\x94\x8c\x08services\x94}\x94\x8c\tdashboard\x94M\xd3\x99s\x8c\x08versions\x94}\x94\x8c\x05nanny\x94\x8c\x17tls://10.0.15.123:46585\x94\x8c\x06status\x94\x8c\x10distributed.core\x94\x8c\x06Status\x94\x93\x94\x8c\x06closed\x94\x85\x94R\x94\x8c\x05_hash\x94\x8a\x08\x1a\x82\xf0&E\x92(\xa2\x8c\x06nbytes\x94K\x00\x8c\toccupancy\x94K\x00\x8c\x15_memory_unmanaged_old\x94K\x00\x8c\x19_memory_unmanaged_history\x94\x8c\x0bcollections\x94\x8c\x05deque\x94\x93\x94)R\x94\x8c\x07metrics\x94}\x94\x8c\tlast_seen\x94K\x00\x8c\ntime_delay\x94K\x00\x8c\tbandwidth\x94J\x00\xe1\xf5\x05\x8c\x06actors\x94\x8f\x94\x8c\t_has_what\x94}\x94\x8c\nprocessing\x94}\x94(h\x03G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18](x17tls://10.0.15.123:40069\x94\x8c\x03pid\x94K7\x8c\x04name\x94\x8c\x1fhendrik-debug-worker-78663614bd\x94\x8c\x08nthreads\x94K\x02\x8c\x0cmemory_limit\x94\x8a\x05\x00\xc0\x9e\xf1\x00\x8c\x0flocal_directory\x94\x8c*/scratch/dask-worker-space/worker-r_h553hi\x94\x8c\x08services\x94}\x94\x8c\tdashboard\x94M\xd3\x99s\x8c\x08versions\x94}\x94\x8c\x05nanny\x94\x8c\x17tls://10.0.15.123:46585\x94\x8c\x06status\x94\x8c\x10distributed.core\x94\x8c\x06Status\x94\x93\x94\x8c\x06closed\x94\x85\x94R\x94\x8c\x05_hash\x94\x8a\x08\x1a\x82\xf0&E\x92(\xa2\x8c\x06nbytes\x94K\x00\x8c\toccupancy\x94K\x00\x8c\x15_memory_unmanaged_old\x94K\x00\x8c\x19_memory_unmanaged_history\x94\x8c\x0bcollections\x94\x8c\x05deque\x94\x93\x94)R\x94\x8c\x07metrics\x94}\x94\x8c\tlast_seen\x94K\x00\x8c\ntime_delay\x94K\x00\x8c\tbandwidth\x94J\x00\xe1\xf5\x05\x8c\x06actors\x94\x8f\x94\x8c\t_has_what\x94}\x94\x8c\nprocessing\x94}\x94(h\x03G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9(%27random_sample-950bfce4388763a088b491ac651d6b18)', 6, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 
13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 
3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8u\x8c\x0clong_running\x94\x8f\x94\x8c\texecuting\x94}\x94\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 6, 22)\x94G?\xd8hh\x00\x00\x00\x00s\x8c\tresources\x94}\x94\x8c\x0eused_resources\x94}\x94\x8c\x05extra\x94}\x94\x8c\tserver_id\x94\x8c+Worker-e1d62c65-e3d3-41ff-bcd1-97f8dafeeeea\x94u\x86\x94b\x86\x94R\x94."
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 66, in loads
    return pickle.loads(x)
AttributeError: 'WorkerState' object has no attribute 'server_id'

Note: This is not helpful at all and AttributeError: 'WorkerState' object has no attribute 'server_id' looks like a bug on its own.

How do we modify this?

We can use worker_disk_size when creating a cluster to allocate sufficient disk space. If we choose a sufficiently large size, everything works as expected.
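
An illustrative call using the worker_disk_size parameter mentioned above; the other arguments and the unit (GiB) are assumptions:

import coiled

cluster = coiled.Cluster(
    n_workers=10,
    worker_disk_size=64,  # disk per worker, sized well above 10x the worker's RAM
)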

How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?

I'd like dask/distributed#6835 to be merged before diving into this, since we cannot copy values from the WorkerTable.

@hendrikmakait (Member) commented

How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?

In terms of hard numbers, we reach ~125 MiB/s on both exclusive reads and writes. It looks like performance numbers for EBS on EC2 instances are hard to come by (at the very least for the average throughput). To check against theoretical performance, we need to perform some experiments on raw EC2 instances.

One caveat I found is that if we keep referencing the large array and then calculate a sum on it, we see ~62 MiB/s of read and write at the same time. It looks like we are unspilling a chunk to calculate a sum on it, then spilling it back to disk because we need that memory for another chunk. Given the immutability of the chunks, we may want to consider a lazier policy that keeps spilled data on disk until it actually needs to be removed, i.e. when the worker wants to get rid of it both in memory and on disk.
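
A sketch of how the read/write rates above could be measured, by sampling psutil's disk I/O counters on every worker and diffing two samples; the interval and the local Client() are illustrative:

import time

import psutil
from distributed import Client


def sample_io():
    c = psutil.disk_io_counters()
    return time.time(), c.read_bytes, c.write_bytes


client = Client()  # stand-in for the real cluster
before = client.run(sample_io)
time.sleep(30)  # let the workload spill/unspill for a while
after = client.run(sample_io)

for addr in before:
    (t0, r0, w0), (t1, r1, w1) = before[addr], after[addr]
    dt = t1 - t0
    print(f"{addr}: read ~{(r1 - r0) / dt / 2**20:.0f} MiB/s, "
          f"write ~{(w1 - w0) / dt / 2**20:.0f} MiB/s")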

@hendrikmakait (Member) commented Aug 5, 2022

In addition to the trivial use case above, we would also like to have
https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085 scaled up to production size.

After taking the first shot at this, it looks like scaling tensordot_stress up is more of a memory than a disk problem:

Aug  5 14:30:29 ip-10-0-5-24 kernel: [  708.236580] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=c391e262f16fa83e76cb194623ffd8e4a9599f544effa9e34a78135b799ea07d,mems_allowed=0,global_oom,task_memcg=/docker/c391e262f16fa83e76cb194623ffd8e4a9599f544effa9e34a78135b799ea07d,task=python,pid=1914,uid=0
Aug  5 14:30:29 ip-10-0-5-24 kernel: [  708.236609] Out of memory: Killed process 1914 (python) total-vm:4358956kB, anon-rss:3259096kB, file-rss:0kB, shmem-rss:8kB, UID:0 pgtables:6660kB oom_score_adj:0

@hendrikmakait (Member) commented

Does Coiled have enough storage by default?

Since https://docs.coiled.io/user_guide/cloud_changelog.html#june-2022, the answer to this question is mostly NO[...]

To clarify: I mean that we do not have enough storage by default to store 10x more data than we have in RAM, which is the initial question of this issue. The discussion around default disk sizes can be found here: https://github.com/coiled/oss-engineering/issues/123. Following the discussion on that issue, and given how easy it is to adjust disk size with worker_disk_size, the default seems reasonable to me.

@fjetter (Member, Author) commented Aug 10, 2022

To clarify: I mean that we do not have enough storage by default to store 10x more data than we have in RAM, which is the initial question of this issue

That's fine. I think any multiplier >1 is fine, assuming we can configure this on the Coiled side. The idea of this issue is to apply a workload that requires more memory than is available on the cluster but can finish successfully if data is stored to disk. Whether this is 1.5x, 2x, or 10x is not that important.
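
A sketch of how such a workload could be sized from the cluster's actual memory, so that any multiplier > 1 forces spilling; the names, the multiplier, and the local Client() are illustrative:

import dask.array as da
from distributed import Client, wait

client = Client()  # stand-in for the Coiled cluster
workers = client.scheduler_info()["workers"]
cluster_memory = sum(w["memory_limit"] for w in workers.values())

multiplier = 2  # > 1 is what matters; the exact value less so
n = int((cluster_memory * multiplier / 8) ** 0.5)  # square float64 array of ~multiplier x cluster RAM

x = da.random.random((n, n), chunks="auto").persist()
wait(x)
x.sum().compute()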

After taking the first shot at this, it looks like scaling tensordot_stress up is more of a memory than a disk problem:

Also an interesting find. If the chunks are small enough, this should always be able to finish. There is an edge case where disk is full, memory is full, and the entire cluster pauses. Beyond this edge case, the computation should always finish and we should definitely never see an OOM exception, iff the chunks are small enough.

@hendrikmakait (Member) commented

@fjetter: I'm currently taking a look at what's going on in the tensordot_stress test; I'll keep you posted.

@hendrikmakait (Member) commented Aug 10, 2022

For the user, this means that the task eventually fails enough times and they receive a bunch of messages like the following:

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 66, in loads
    return pickle.loads(x)
AttributeError: 'WorkerState' object has no attribute 'server_id'

Note: This is not helpful at all and AttributeError: 'WorkerState' object has no attribute 'server_id' looks like a bug on its own.

This was caused by a version mismatch with my custom software environment, which used the latest distributed from main. After fixing that, it takes a lot of time for all tasks to either end up in memory or erred, but when they do, the erred ones come back with a KilledWorker exception:

distributed.scheduler.KilledWorker("('random_sample-98bbea72b3b5abb3f8bf909c540c55f0', 0, 428)",
                                   <WorkerState 'tls://10.0.3.246:45495', name: hendrik-debug-worker-bb94fa4efc, status: closed, memory: 0, processing: 222>

One reason why it takes a lot of time is that workers keep straggling when they reach the disk size limit, and we wait a while before deciding to reassign those tasks to other workers (and remove said workers).
(Cluster ID for inspection: https://cloud.coiled.io/dask-engineering/clusters/48344/details)

@hendrikmakait (Member) commented

One thing that might be helpful for the failing case is monitoring disk usage for each worker. We currently monitor how much data is spilled, but we do not track/display disk used/free anywhere.
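
A stopgap sketch for per-worker disk monitoring, checking the filesystem that holds each worker's local (spill) directory; the local Client() is illustrative:

import shutil

from distributed import Client


def disk_usage(dask_worker):
    # distributed injects the Worker instance for a `dask_worker` argument
    total, used, free = shutil.disk_usage(dask_worker.local_directory)
    return {"used": used, "free": free, "total": total}


client = Client()
print(client.run(disk_usage))  # {worker_address: {"used": ..., "free": ..., "total": ...}, ...}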
