ENH: reading speed optimisation for AMRVAC frontend #3508

Merged 4 commits on Sep 16, 2021
13 changes: 4 additions & 9 deletions yt/frontends/amrvac/datfile_utils.py
@@ -151,17 +151,12 @@ def get_single_block_data(istream, byte_offset, block_shape):

def get_single_block_field_data(istream, byte_offset, block_shape, field_idx):
    """retrieve a specific block (ONE field) from a datfile"""
-    istream.seek(byte_offset)
-
    # compute byte size of a single field
    field_shape = block_shape[:-1]
    fmt = ALIGN + np.prod(field_shape) * "d"
    byte_size_field = struct.calcsize(fmt)

-    # Read actual data
-    istream.seek(byte_size_field * field_idx, 1)  # seek forward
-    d = struct.unpack(fmt, istream.read(struct.calcsize(fmt)))
-
-    # Fortran ordering
-    block_field_data = np.reshape(d, field_shape, order="F")
-    return block_field_data
+    istream.seek(byte_offset + byte_size_field * field_idx)
Member:
Reducing the number of seeks is really helpful. It might be possible to sort the calls to get_single_block_field_data by offset and field_idx (even sorting by that tuple should work) to make sure we're reading them in the right order, too.

Member Author (@neutrinoceros, Sep 13, 2021):
Sounds like a good idea, I'll try that!

Member Author:
So I was able to sort by field, but it may not have a huge impact because of how AMRVAC dump files represent data (fields are on the inner loop of the writing routine, so individual fields are never contiguous). I've tried sorting grids too and got no significant gain. The dataset I'm using to benchmark this also happens to use a single "data chunk", so there's no way to measure whether sorting chunks would help. In conclusion, I don't think there's anything more I can do here.
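For reference, a minimal sketch of what the suggested sorting could look like. Everything here is hypothetical glue: `grids`, `fields`, `block_offsets`, and `w_names` stand in for the frontend's actual structures.

```python
# Hypothetical sketch of the sorting idea discussed above: order pending
# reads by (byte offset, field index) so the file is traversed forward only.
requests = [(grid, field) for grid in grids for field in fields]
requests.sort(key=lambda gf: (block_offsets[gf[0].id], w_names.index(gf[1])))
for grid, field in requests:
    ...  # read each block field at a monotonically increasing offset
```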

+    data = np.fromfile(istream, "=f8", count=np.prod(field_shape))
Member:
Are you sure you want = here? Any chance endianness will be different on generating/accepting systems?

Member Author:
This module was adapted from a Python script that was meant to be manually edited by users, so the endianness is actually hardcoded globally as ALIGN = "=" (see the top lines of this module). Apparently no one has ever complained about it since the frontend was released two years ago, but I agree there's a chance of failure here. If you have any advice on how to make this more robust, I'll gladly hear it.

Member:
Good point. ALIGN = "=" has been hardcoded in literally every script that has read AMRVAC datfile data over the last few years, and nobody has ever complained about it, so I don't foresee any issues with this :)
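For completeness, one way the hardcoded byte order could be made more robust, sketched under the assumption that the datfile header begins with a small version integer; this probe is not part of the PR.

```python
import numpy as np

def detect_byte_order(istream):
    # Try both byte orders on an assumed small header integer and keep the
    # one that yields a plausible value; fall back to native order.
    for order in ("<", ">"):
        istream.seek(0)
        version = int(np.fromfile(istream, order + "i4", count=1)[0])
        if 1 <= version < 100:  # plausible datfile version number
            return order
    return "="
```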

+    data.shape = field_shape[::-1]
Member:
One trick we've used for HDF5 files has been to create a destination array and read directly into that. To be honest, this did make me stop and think about what I was seeing.

Member Author:
I'm also not super happy with how this looks. [::-1] may be a common(ish) idiom, but it doesn't feel right to me. I don't think numpy.fromfile has a "dest" (or similar) argument; where should I look for the technique you're mentioning?

Member:
I also couldn't find a good one. All I could come up with was doing some kind of destination_buffer[:] = np.fromfile(...), but that's not really very helpful. Let me keep looking. I think one possibility would be to take a view on the result of np.fromfile.
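A minimal sketch of the preallocated-destination idea being discussed, assuming `istream` is an open binary file positioned at the start of the field; `read_field_into` is a hypothetical helper, not part of the PR.

```python
import numpy as np

def read_field_into(istream, field_shape):
    # Fill a C-ordered buffer in place, then hand back a transposed
    # (Fortran-ordered) view of the same memory; no reshape copy needed.
    dest = np.empty(field_shape[::-1], dtype="=f8")
    istream.readinto(memoryview(dest).cast("B"))
    return dest.T
```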

+    return data.T
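As a sanity check on the reshape-and-transpose that replaces the old order="F" reshape, a quick illustrative equivalence test (toy shape, not from the PR):

```python
import numpy as np

# The deleted code did np.reshape(d, field_shape, order="F"); the new code
# reshapes the reversed shape in C order and transposes. Same array.
raw = np.arange(24, dtype="=f8")
old_way = np.reshape(raw, (2, 3, 4), order="F")
new_way = raw.reshape((4, 3, 2)).T
assert np.array_equal(old_way, new_way)
```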
45 changes: 26 additions & 19 deletions yt/frontends/amrvac/io.py
@@ -13,8 +13,6 @@
from yt.utilities.io_handler import BaseIOHandler
from yt.utilities.on_demand_imports import _f90nml as f90nml

-from .datfile_utils import get_single_block_field_data
-

def read_amrvac_namelist(parfiles):
"""Read one or more parfiles, and return a unified f90nml.Namelist object.
@@ -79,11 +77,13 @@ def _read_particle_fields(self, chunks, ptf, selector):
        # you need to do your masking here.
        raise NotImplementedError

-    def _read_data(self, grid, field):
+    def _read_data(self, fid, grid, field):
        """Retrieve field data from a grid.

        Parameters
        ----------
+        fid: file descriptor (open binary file with read access)
+
        grid : yt.frontends.amrvac.data_structures.AMRVACGrid
            The grid from which data is to be read.
        field : str
@@ -98,11 +98,15 @@ def _read_data(self, grid, field):
        ileaf = grid.id
        offset = grid._index.block_offsets[ileaf]
        field_idx = self.ds.parameters["w_names"].index(field)
-        with open(self.datfile, "rb") as istream:
-            data = get_single_block_field_data(
-                istream, offset, self.block_shape, field_idx
-            )

+        field_shape = self.block_shape[:-1]
+        count = np.prod(field_shape)
+        byte_size_field = count * 8  # size of a double
+
+        fid.seek(offset + byte_size_field * field_idx)
+        data = np.fromfile(fid, "=f8", count=count)
+        data.shape = field_shape[::-1]
+        data = data.T
        # Always convert data to 3D, as grid.ActiveDimensions is always 3D
        while len(data.shape) < 3:
            data = data[..., np.newaxis]
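A worked example of the seek arithmetic in the new `_read_data`; the block shape here is assumed purely for illustration.

```python
import numpy as np

block_shape = (16, 16, 8, 5)        # nx, ny, nz, n_fields -- hypothetical
field_shape = block_shape[:-1]      # (16, 16, 8) cells per field
count = np.prod(field_shape)        # 2048 doubles per field
byte_size_field = count * 8         # 16384 bytes per field (float64)
field_idx = 2                       # third entry of w_names
print(byte_size_field * field_idx)  # 32768 bytes skipped inside the block
```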
@@ -142,12 +146,15 @@ def _read_fluid_selection(self, chunks, selector, fields, size):

        chunks = list(chunks)
        data_dict = {}  # <- return variable
+
        if isinstance(selector, GridSelector):
            if not len(chunks) == len(chunks[0].objs) == 1:
                raise RuntimeError
            grid = chunks[0].objs[0]
-            for ftype, fname in fields:
-                data_dict[ftype, fname] = self._read_data(grid, fname)
+
+            with open(self.datfile, "rb") as fh:
+                for ftype, fname in fields:
+                    data_dict[ftype, fname] = self._read_data(fh, grid, fname)
        else:
            if size is None:
                size = sum(g.count(selector) for chunk in chunks for g in chunk.objs)
@@ -156,16 +163,16 @@ def _read_fluid_selection(self, chunks, selector, fields, size):
                data_dict[field] = np.empty(size, dtype="float64")

            # nb_grids = sum(len(chunk.objs) for chunk in chunks)
-
-            ind = 0
-            for chunk in chunks:
-                for grid in chunk.objs:
-                    nd = 0
-                    for field in fields:
-                        ftype, fname = field
-                        data = self._read_data(grid, fname)
-                        nd = grid.select(selector, data, data_dict[field], ind)
-                        ind += nd
+            with open(self.datfile, "rb") as fh:
+                ind = 0
+                for chunk in chunks:
+                    for grid in chunk.objs:
+                        nd = 0
+                        for field in fields:
+                            ftype, fname = field
+                            data = self._read_data(fh, grid, fname)
+                            nd = grid.select(selector, data, data_dict[field], ind)
+                            ind += nd

        return data_dict
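As a usage-level illustration of what this optimisation affects, a hedged timing sketch (the file name and field are hypothetical):

```python
import time
import yt

ds = yt.load("output0000.dat")  # hypothetical AMRVAC datfile
t0 = time.perf_counter()
# Reading a field over a selection triggers _read_fluid_selection, which
# now serves every (grid, field) pair through a single open file handle.
rho = ds.all_data()["amrvac", "rho"]
print(f"selection read in {time.perf_counter() - t0:.3f} s")
```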
