What makes the NAME loader faster than the NetCDF loader? #5053

Closed

trexfeathers opened this issue Nov 7, 2022 · 3 comments

Comments

@trexfeathers
Contributor

📰 Custom Issue

@bsherratt has provided an interesting case. A NAME III .txt file describing:

  • 200 x 200 grid
  • ~200 phenomena

The file takes ~20s to load. If Iris is used to convert the file to NetCDF, that NetCDF file takes ~105s to load. Both loaders are unreasonably slow in this example, since the grid is shared between all phenomena, yet the Coord instances are created on a Cube-by-Cube basis (see #3172 for discussion on sharing). But why is NetCDF even slower?
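
For reference, a minimal sketch of that comparison, with hypothetical filenames standing in for the real file:

# Minimal timing sketch (hypothetical filenames): load the NAME III .txt file,
# convert it to NetCDF with Iris, then time loading the NetCDF copy.
import time

import iris

name_path = "name_output.txt"  # hypothetical NAME III file
nc_path = "name_output.nc"     # hypothetical converted copy

start = time.perf_counter()
cubes = iris.load(name_path)
print(f"NAME load:   {time.perf_counter() - start:.1f}s")  # ~20s reported

iris.save(cubes, nc_path)

start = time.perf_counter()
iris.load(nc_path)
print(f"NetCDF load: {time.perf_counter() - start:.1f}s")  # ~105s reported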

File available on request.

@vsherratt
Contributor

vsherratt commented Nov 8, 2022

Just to update...

We had a theory that "breaking" the coordinate metadata could speed this up, and it really did.

Pre-processing: rename each variable's coordinates attribute to something that Iris does not recognise. Iris will therefore not attempt to create any aux coords on load (so they show up as cubes instead).

netCDF4 example

import netCDF4 as nc

with nc.Dataset(filename, "a") as file:
    for var in file.variables.values():
        if "coordinates" in var.ncattrs():
            var.renameAttribute("coordinates", "fake_coordinates")
    file.sync()

Post-processing: identify the "cubes" that were supposed to be aux coords, and add them to the relevant cubes. This can be done by overriding the usual netcdf loaders in iris.fileformats.FORMAT_AGENT with a simple wrapper around iris.fileformats.netcdf.load_cubes:

Incomplete example

from iris.coords import AuxCoord
from iris.fileformats import netcdf


def load_cubes(filenames, callback=None, constraints=None):
    # Identify coords: anything with a fake_coordinates attribute is a real
    # cube, and the names it lists are the variables that should become coords.
    cubes = []
    unknown = []
    coord_names = set()
    for cube in netcdf.load_cubes(filenames, callback, constraints):
        fake_coords = cube.attributes.get("fake_coordinates", "").split()
        if fake_coords:
            # Definitely a cube
            cubes.append(cube)
            coord_names.update(fake_coords)
        else:
            # Might be a cube or a coord
            unknown.append(cube)

    # Create the coords
    coords = {}
    for cube in unknown:
        if cube.name() in coord_names:
            coord = AuxCoord(cube.data)
            coord.metadata = cube.metadata
            coords[coord.var_name] = coord
        else:
            cubes.append(cube)

    # Add coords to cubes
    for cube in cubes:
        fake_coords = cube.attributes.pop("fake_coordinates", "").split()
        for name in fake_coords:
            # TODO: Assumes scalar...
            cube.add_aux_coord(coords[name])

        yield cube

This doesn't cover everything, such as bounds or non-scalar aux coords, but it did bring the load time down to ~3 seconds. Re-adding the coordinates is a tiny proportion of that, so I don't imagine a full solution would take much longer either.
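
For completeness, hooking the wrapper into loading via iris.fileformats.FORMAT_AGENT might look roughly like the sketch below. This is not from the issue: the spec name, the priority value and the classic-NetCDF magic number are assumptions modelled on how Iris registers its built-in NetCDF handler.

# Hypothetical registration sketch (not from the issue): add a spec with a
# higher priority than the built-in NetCDF one, so iris.load() routes classic
# NetCDF files ("CDF\x01" magic number) through the wrapper above.
import iris.fileformats
from iris.io.format_picker import FormatSpecification, MagicNumber

iris.fileformats.FORMAT_AGENT.add_spec(
    FormatSpecification(
        "NetCDF (fake_coordinates workaround)",  # hypothetical spec name
        MagicNumber(4),
        0x43444601,
        load_cubes,  # the wrapper defined above
        priority=6,  # assumed to outrank the built-in NetCDF spec
        constraint_aware_handler=True,
    )
)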

However... in all .nc cases the cube data is lazy, while the NAME loader does not create lazy data. Realising the data takes about 50s, so netCDF is still slower overall.
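
To make that concrete, a small sketch (hypothetical filename) of checking for lazy data and timing its realisation:

# Minimal sketch (hypothetical filename): NetCDF-loaded cubes are lazy, so the
# remaining cost only appears when the data is realised.
import time

import iris

cubes = iris.load("name_output.nc")
print(all(cube.has_lazy_data() for cube in cubes))  # True for the NetCDF loader

start = time.perf_counter()
for cube in cubes:
    _ = cube.data  # realise the lazy array
print(f"Realisation: {time.perf_counter() - start:.1f}s")  # ~50s reported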

@pp-mo
Member

pp-mo commented Nov 15, 2022

#5069 shows a really basic test where we bypass lazy array creation in _get_cf_var_data if the array is "small".
This really speeds up the test case for NetCDF load, from ~150s to ~4s.

Like this (an excerpt; the else branch is the existing lazy-loading code, unchanged):

_NONLAZY_SMALL_AUXCOORDS = True


def _get_cf_var_data(cf_var, filename):
    if _NONLAZY_SMALL_AUXCOORDS and np.prod(cf_var.shape) < 20:
        # Small variable: read the real data eagerly instead of building a lazy array.
        result = cf_var[:]
    else:
        # Get lazy chunked data out of a cf variable.
        ...

Of course, #3172 would be a solution, but sadly that still looks a long way off.

This sounds like a possible win as-is? It doesn't seem like a breaking change.

How do we choose a size threshold?
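
One hypothetical way to reason about a threshold (not from the issue): the memory cost of reading a variable eagerly is bounded by its element count times the item size, so even a threshold well above 20 elements is tiny per variable.

# Hypothetical back-of-envelope helper (not from the issue): an upper bound on
# the bytes realised eagerly for one variable of a given shape.
import numpy as np


def eager_cost_bytes(shape, itemsize=8):
    """Bytes read eagerly if a variable of `shape` is loaded non-lazily."""
    return int(np.prod(shape)) * itemsize


print(eager_cost_bytes((20,)))       # 160 bytes at the sketch's threshold of 20
print(eager_cost_bytes((200, 200)))  # 320000 bytes for the full 200 x 200 grid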

@trexfeathers
Contributor Author

Closed by #5229
