SpillBuffer.spilled_total appears to return incorrect results #6783

Closed
hendrikmakait opened this issue Jul 25, 2022 · 5 comments · Fixed by #6789
Labels: bug (Something is broken)

Comments

@hendrikmakait (Member)

When spilling data to disk, the SpillBuffer appears to return the incorrect size of the data on disk. For example, when spilling an ~8 MB random matrix onto disk, the data file created by the SpillBuffer is also ~8 MB in size, yet the SpillBuffer only returns ~1 MB.

Reproducer:

from __future__ import annotations

import os
import tempfile

import numpy as np

from distributed.spill import SpillBuffer


def test_spill_size():
    tmpdir = tempfile.mkdtemp()
    buf = SpillBuffer(spill_directory=tmpdir, target=0, max_spill=False)
    data = np.random.random((1024, 1024))
    buf["data"] = data
    spill_size = os.stat(os.path.join(tmpdir, "data")).st_size
    assert buf.spilled_total.disk == spill_size

fails with

>       assert buf.spilled_total.disk == spill_size
E       assert 1048808 == 8388840
E        +  where 1048808 = SpilledSize(memory=8388608, disk=1048808).disk
E        +    where SpilledSize(memory=8388608, disk=1048808) = Buffer<<LRU: 0/0 on dict>, <zict.cache.Cache object at 0x11f2413d0>>.spilled_total
@ncclementi (Member)

A couple of comments here: it looks a bit weird to me that the on-disk size calculated as spill_size is very close to the value we are calculating as buf.spilled_total.memory; I would expect spill_size to be bigger than the reported disk value, not eight times larger. However, I wonder if this is related to how we calculate spilled_total. See:

def __add__(self, other: SpilledSize) -> SpilledSize:  # type: ignore
    return SpilledSize(self.memory + other.memory, self.disk + other.disk)
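For context, the arithmetic in that `__add__` looks sound on its own. Here is a minimal standalone sketch, assuming `SpilledSize` is a `NamedTuple` with `memory` and `disk` fields, as the repr in the failure output above suggests:

```python
from typing import NamedTuple

class SpilledSize(NamedTuple):
    """Sketch of a (memory, disk) byte-count pair, mirroring the repr above."""
    memory: int
    disk: int

    def __add__(self, other: "SpilledSize") -> "SpilledSize":  # type: ignore[override]
        # Pairwise addition: both components accumulate independently.
        return SpilledSize(self.memory + other.memory, self.disk + other.disk)

total = SpilledSize(memory=100, disk=10) + SpilledSize(memory=200, disk=30)
assert total == SpilledSize(memory=300, disk=40)
```

So if the reported disk value is wrong, the error is more likely in the sizes being summed than in the summation itself.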

@hendrikmakait (Member, Author)

The size of spill_size looks fine to me. Looking at the way we serialize things, the serialization overhead for this array appears to be 232 bytes (sys.getsizeof(serialize(data)[0])), which matches the difference exactly. There might be a clue in the fact that spill_size is consistently almost 8x the size of buf.spilled_total.disk, regardless of the size of data.
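For what it's worth, the numbers in the failing assertion are consistent with exactly that breakdown. This is a hypothetical reconstruction, assuming one 232-byte header frame (whose length is counted correctly) plus one float64 payload frame whose length is counted in elements rather than bytes:

```python
header = 232               # overhead measured via sys.getsizeof(serialize(data)[0]) above
payload = 1024 * 1024 * 8  # 1024x1024 float64 array = 8 MiB of raw data

assert header + payload == 8388840       # the on-disk spill_size from the traceback
assert header + payload // 8 == 1048808  # the reported spilled_total.disk,
                                         # i.e. the payload counted per-element
```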

@hendrikmakait (Member, Author) commented Jul 25, 2022

It looks like the calculation of the memory size is wrong:

pickled_size = sum(len(frame) for frame in pickled)

uses len, but we should probably use nbytes on memoryview objects (https://docs.python.org/3/library/stdtypes.html#memoryview.nbytes). I'll file a PR for this.
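The discrepancy is easy to reproduce in isolation: for a memoryview whose items are wider than one byte, len() counts items along the first dimension while nbytes counts bytes. A minimal stdlib-only illustration:

```python
import array

buf = array.array("d", range(1024))    # 1024 float64 ("double") values
mv = memoryview(buf)

assert len(mv) == 1024                 # len() counts elements...
assert mv.nbytes == 1024 * 8           # ...nbytes counts bytes (8 per float64)
assert len(mv.cast("B")) == mv.nbytes  # casting to unsigned bytes makes the two agree
```

That factor of 8 per float64 element matches the ~8x gap between spill_size and buf.spilled_total.disk observed above.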

@ncclementi (Member)

That line was suggested by Guido when we were working on this, so I'm sure he'll have an idea of what could be happening here. I know he is on PTO, but he would be a good person to review this. He will be back next week.

@crusaderky (Collaborator)

I was not aware that len(memoryview) != memoryview.nbytes. Good catch! No need to wait for my return.
